Training Compute & FLOP Estimator

Estimate the total floating-point operations (FLOPs) required to train a neural network, based on model parameters, dataset size, and training configuration.

Model Architecture

Total trainable parameters (e.g. 7B = 7,000,000,000)

Dataset & Training

Total tokens seen during training (e.g. 2T = 2,000,000,000,000)

Number of epochs: how many times the dataset is iterated (typically 1 for LLMs)

Hardware Configuration

Peak theoretical throughput in FLOP/s

Model FLOP Utilization (MFU): typical range 30–50% for large-scale training

Formulas Used

Total Training FLOPs (Kaplan et al. / Chinchilla):

C = 6 × N × D × G
  • C — Total compute in FLOPs
  • N — Number of model parameters
  • D — Total tokens processed (tokens × epochs)
  • G — Gradient checkpointing multiplier (1.0 or ~1.33)
  • 6 — 2 FLOPs per multiply-accumulate × 3 passes (1 forward pass, plus a backward pass costing roughly 2× the forward)
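The formula above can be sketched in Python (function and argument names are illustrative, not part of the calculator):

```python
def training_flops(n_params, tokens, epochs=1, grad_checkpointing=False):
    """Total training compute: C = 6 * N * D * G."""
    d = tokens * epochs                      # D: total tokens processed
    g = 4 / 3 if grad_checkpointing else 1.0  # G: ~1 extra forward pass out of 3
    return 6 * n_params * d * g

# 7B parameters trained on 2T tokens (no checkpointing):
c = training_flops(7e9, 2e12)
# c = 8.4e22 FLOPs
```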

Wall-Clock Time:

T = C / (FLOP/s_peak × num_accelerators × MFU)
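A minimal sketch of the wall-clock estimate, assuming an A100-class accelerator at 312 TFLOP/s peak BF16 throughput (the cluster size and MFU below are example values):

```python
def wall_clock_seconds(total_flops, peak_flops_per_s, n_accelerators, mfu):
    """T = C / (peak FLOP/s * num_accelerators * MFU)."""
    return total_flops / (peak_flops_per_s * n_accelerators * mfu)

# 8.4e22 FLOPs on 1024 accelerators, 312 TFLOP/s peak each, 40% MFU:
t = wall_clock_seconds(8.4e22, 312e12, 1024, 0.40)
days = t / 86400  # roughly a week of training
```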

Chinchilla-Optimal Tokens:

D_optimal ≈ 20 × N
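The Chinchilla rule of thumb as code (a sketch; the 20× ratio is an approximation from Hoffmann et al., 2022):

```python
def chinchilla_optimal_tokens(n_params):
    """Compute-optimal training tokens: D_opt ≈ 20 * N."""
    return 20 * n_params

# A 7B-parameter model is compute-optimal at ~140B tokens:
d_opt = chinchilla_optimal_tokens(7e9)
# d_opt = 1.4e11
```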

Memory (minimum lower bound):

Mem = weights (N × bytes) + optimizer (2 × N × 4B for Adam) + gradients (N × bytes)
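A sketch of the memory lower bound, assuming BF16 weights and gradients (2 bytes each) and FP32 Adam moments; byte sizes are parameters so other precisions can be plugged in:

```python
def min_memory_bytes(n_params, weight_bytes=2, grad_bytes=2):
    """Lower bound: weights + Adam optimizer state + gradients.
    Excludes activations, KV cache, and communication buffers."""
    weights = n_params * weight_bytes
    optimizer = 2 * n_params * 4   # Adam: first + second moments, FP32
    grads = n_params * grad_bytes
    return weights + optimizer + grads

# 7B model in BF16: 7e9 * (2 + 8 + 2) bytes = 84 GB, before activations
mem_gb = min_memory_bytes(7e9) / 1e9
```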

Assumptions & References

  • The factor of 6 per parameter per token is derived from Kaplan et al. (2020), "Scaling Laws for Neural Language Models" and validated in Hoffmann et al. (2022), "Training Compute-Optimal Large Language Models" (Chinchilla).
  • Gradient checkpointing recomputes activations during the backward pass, adding ~1 extra forward pass (~33% overhead) to reduce memory usage.
  • Model FLOP Utilization (MFU) of 30–50% is typical for large-scale distributed training; see Chowdhery et al. (2022), where PaLM reports ~46% MFU on TPUs.
  • Memory estimates are a lower bound — activations, KV cache, and communication buffers add significant overhead in practice.
  • Adam optimizer stores first and second moment estimates in FP32, adding 2 × N × 4 bytes regardless of training precision.
  • Hardware peak FLOP/s figures are for the specified tensor/matrix precision (BF16/FP16). FP32 throughput is typically 2–4× lower.
  • This calculator does not account for pipeline bubbles, data loading overhead, or network communication latency in multi-node training.
  • Reference: Epoch AI "Compute Trends" (2023) and OpenAI "AI and Compute" (2018) for historical context on training compute scaling.