Training Compute & FLOP Estimator
Estimate the total floating-point operations (FLOPs) required to train a neural network, based on model parameters, dataset size, and training configuration.
Model Architecture
- Parameters (N): total trainable parameters (e.g. 7B = 7,000,000,000)
Dataset & Training
- Tokens: total tokens seen during training (e.g. 2T = 2,000,000,000,000)
- Epochs: how many times the dataset is iterated (typically 1 for LLMs)
Hardware Configuration
- Peak FLOP/s: peak theoretical throughput per accelerator
- MFU: model FLOP utilization; typical range 30–50% for large-scale training
Formulas Used
Total Training FLOPs (Kaplan et al. / Chinchilla):
C = 6 × N × D × G
- C — Total compute in FLOPs
- N — Number of model parameters
- D — Total tokens processed (tokens × epochs)
- G — Gradient checkpointing multiplier (1.0 or ~1.33)
- 6 — Factor accounting for: 2 FLOPs per multiply-accumulate × 3 (1× for the forward pass + 2× for the backward pass, which costs roughly twice the forward)
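The total-compute formula above can be sketched as a small helper; the 7B/2T example values match those in the input descriptions:

```python
def training_flops(n_params: float, tokens: float, epochs: int = 1,
                   checkpoint_mult: float = 1.0) -> float:
    """Approximate total training compute: C = 6 * N * D * G,
    where D = tokens * epochs and G is the gradient-checkpointing
    multiplier (1.0 without checkpointing, ~1.33 with it)."""
    return 6 * n_params * tokens * epochs * checkpoint_mult

# 7B-parameter model trained on 2T tokens, no gradient checkpointing:
c = training_flops(7e9, 2e12)
print(f"{c:.2e} FLOPs")  # 8.40e+22
```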
Wall-Clock Time:
T = C / (FLOP/s_peak × num_accelerators × MFU)
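A minimal sketch of the wall-clock formula. The cluster size, 989 TFLOP/s BF16 peak, and 40% MFU below are assumed illustration values, not recommendations:

```python
def wall_clock_seconds(total_flops: float, peak_flops_per_s: float,
                       num_accelerators: int, mfu: float) -> float:
    """T = C / (peak FLOP/s per accelerator * num_accelerators * MFU)."""
    return total_flops / (peak_flops_per_s * num_accelerators * mfu)

# Training a 7B model on 2T tokens (C = 6 * 7e9 * 2e12 = 8.4e22 FLOPs)
# on a hypothetical 1024-accelerator cluster at 989e12 FLOP/s peak, 40% MFU:
days = wall_clock_seconds(8.4e22, 989e12, 1024, 0.40) / 86_400
print(f"{days:.1f} days")
```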
Chinchilla-Optimal Tokens:
D_optimal ≈ 20 × N
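The Chinchilla rule of thumb is a one-liner; for the 7B example it suggests roughly 140B tokens, far fewer than the 2T tokens common in practice (over-training past the compute-optimal point is a deliberate trade-off for cheaper inference):

```python
def chinchilla_optimal_tokens(n_params: float) -> float:
    """Compute-optimal token count D ≈ 20 * N (Hoffmann et al., 2022)."""
    return 20 * n_params

print(f"{chinchilla_optimal_tokens(7e9):.1e} tokens")  # 1.4e+11 (~140B)
```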
Memory (lower bound):
Mem = weights (N × bytes) + optimizer (2 × N × 4B for Adam) + gradients (N × bytes)
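The memory lower bound can be sketched as follows, assuming BF16 weights and gradients (2 bytes each) plus FP32 Adam moments (2 × 4 bytes); as noted below, activations and buffers add substantial overhead on top of this:

```python
def min_training_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Lower-bound training memory: weights + Adam optimizer state + gradients."""
    weights = n_params * bytes_per_param      # model weights
    optimizer = 2 * n_params * 4              # FP32 first and second moments
    gradients = n_params * bytes_per_param    # one gradient per weight
    return (weights + optimizer + gradients) / 1e9

# 7B model in BF16: (2 + 8 + 2) bytes/param = 12 bytes/param
print(f"{min_training_memory_gb(7e9):.0f} GB")  # 84 GB
```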
Assumptions & References
- The factor of 6 per parameter per token is derived from Kaplan et al. (2020), "Scaling Laws for Neural Language Models" and validated in Hoffmann et al. (2022), "Training Compute-Optimal Large Language Models" (Chinchilla).
- Gradient checkpointing recomputes activations during the backward pass, adding ~1 extra forward pass (~33% overhead) to reduce memory usage.
- Model FLOP Utilization (MFU) of 30–50% is typical for large-scale distributed training; see Chowdhery et al. (2022), PaLM reporting ~46% MFU on TPUs.
- Memory estimates are a lower bound — activations, KV cache, and communication buffers add significant overhead in practice.
- Adam optimizer stores first and second moment estimates in FP32, adding 2 × N × 4 bytes regardless of training precision.
- Hardware peak FLOP/s figures are for the specified tensor/matrix precision (BF16/FP16). FP32 throughput is typically 2–4× lower.
- This calculator does not account for pipeline bubbles, data loading overhead, or network communication latency in multi-node training.
- Reference: Epoch AI "Compute Trends" (2023) and OpenAI "AI and Compute" (2018) for historical context on training compute scaling.