Training Compute & FLOP Estimator
Estimate the total floating-point operations (FLOPs) required to train a neural network, based on model parameters, dataset size, and training configuration.
Model Architecture
- Parameters (N): total trainable parameters (e.g. 7B = 7,000,000,000)
Dataset & Training
- Tokens: total tokens seen during training (e.g. 2T = 2,000,000,000,000)
- Epochs: how many times the dataset is iterated (typically 1 for LLMs)
Hardware Configuration
- Peak FLOP/s: peak theoretical throughput per accelerator
- MFU: model FLOP utilization; typical range 30–50% for large-scale training
Formulas Used
Total Training FLOPs (Kaplan et al. / Chinchilla):
C = 6 × N × D × G
- C — Total compute in FLOPs
- N — Number of model parameters
- D — Total tokens processed (tokens × epochs)
- G — Gradient checkpointing multiplier (1.0 or ~1.33)
- 6 — Factor accounting for: 2 FLOPs per multiply-accumulate × 3 (1× for the forward pass + 2× for the backward pass, which costs roughly twice the forward)
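The total-compute formula above can be sketched as a small helper; the 7B/2T example values match those in the input descriptions:

```python
def training_flops(n_params: float, tokens: float, epochs: int = 1,
                   checkpoint_mult: float = 1.0) -> float:
    """Approximate total training compute: C = 6 * N * D * G,
    where D = tokens * epochs and G is the gradient-checkpointing
    multiplier (1.0 without checkpointing, ~1.33 with it)."""
    return 6 * n_params * tokens * epochs * checkpoint_mult

# 7B-parameter model trained on 2T tokens, no gradient checkpointing:
c = training_flops(7e9, 2e12)
print(f"{c:.2e} FLOPs")  # 8.40e+22
```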
Wall-Clock Time:
T = C / (FLOP/s_peak × num_accelerators × MFU)
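A minimal sketch of the wall-clock formula. The cluster size, 989 TFLOP/s BF16 peak, and 40% MFU below are assumed illustration values, not recommendations:

```python
def wall_clock_seconds(total_flops: float, peak_flops_per_s: float,
                       num_accelerators: int, mfu: float) -> float:
    """T = C / (peak FLOP/s per accelerator * num_accelerators * MFU)."""
    return total_flops / (peak_flops_per_s * num_accelerators * mfu)

# Training a 7B model on 2T tokens (C = 6 * 7e9 * 2e12 = 8.4e22 FLOPs)
# on a hypothetical 1024-accelerator cluster at 989e12 FLOP/s peak, 40% MFU:
days = wall_clock_seconds(8.4e22, 989e12, 1024, 0.40) / 86_400
print(f"{days:.1f} days")
```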
Chinchilla-Optimal Tokens:
D_optimal ≈ 20 × N
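The Chinchilla rule of thumb is a one-liner; for the 7B example it suggests roughly 140B tokens, far fewer than the 2T tokens common in practice (over-training past the compute-optimal point is a deliberate trade-off for cheaper inference):

```python
def chinchilla_optimal_tokens(n_params: float) -> float:
    """Compute-optimal token count D ≈ 20 * N (Hoffmann et al., 2022)."""
    return 20 * n_params

print(f"{chinchilla_optimal_tokens(7e9):.1e} tokens")  # 1.4e+11 (~140B)
```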
Memory (lower bound):
Mem = weights (N × bytes) + optimizer (2 × N × 4B for Adam) + gradients (N × bytes)
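The memory lower bound can be sketched as follows, assuming BF16 weights and gradients (2 bytes each) plus FP32 Adam moments (2 × 4 bytes); as noted below, activations and buffers add substantial overhead on top of this:

```python
def min_training_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Lower-bound training memory: weights + Adam optimizer state + gradients."""
    weights = n_params * bytes_per_param      # model weights
    optimizer = 2 * n_params * 4              # FP32 first and second moments
    gradients = n_params * bytes_per_param    # one gradient per weight
    return (weights + optimizer + gradients) / 1e9

# 7B model in BF16: (2 + 8 + 2) bytes/param = 12 bytes/param
print(f"{min_training_memory_gb(7e9):.0f} GB")  # 84 GB
```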
Assumptions & References
- The factor of 6 per parameter per token is derived from Kaplan et al. (2020), "Scaling Laws for Neural Language Models" and validated in Hoffmann et al. (2022), "Training Compute-Optimal Large Language Models" (Chinchilla).
- Gradient checkpointing recomputes activations during the backward pass, adding ~1 extra forward pass (~33% overhead) to reduce memory usage.
- Model FLOP Utilization (MFU) of 30–50% is typical for large-scale distributed training; see Chowdhery et al. (2022), PaLM reporting ~46% MFU on TPUs.
- Memory estimates are a lower bound — activations, KV cache, and communication buffers add significant overhead in practice.
- Adam optimizer stores first and second moment estimates in FP32, adding 2 × N × 4 bytes regardless of training precision.
- Hardware peak FLOP/s figures are for the specified tensor/matrix precision (BF16/FP16). FP32 throughput is typically 2–4× lower.
- This calculator does not account for pipeline bubbles, data loading overhead, or network communication latency in multi-node training.
- Reference: Epoch AI "Compute Trends" (2023) and OpenAI "AI and Compute" (2018) for historical context on training compute scaling.