AI Training Data Size Estimator

Estimate the minimum recommended training dataset size for a machine learning model based on model parameters, input features, task type, and desired confidence level.


Formulas Used

1. PAC / VC-Dimension Bound (Blumer et al., 1989; Vapnik, 1998):

N_PAC = (1/ε) × (d_VC × ln(2/ε) + ln(2/δ))

where:
  ε     = 1 − (desired_accuracy / 100)   [acceptable error rate]
  δ     = 0.05                            [5% failure prob. → 95% confidence]
  d_VC  ≈ number of model parameters      [VC-dimension proxy]
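As a sketch, the PAC bound above can be computed directly. The function name `pac_bound` is illustrative, not part of the tool:

```python
import math

def pac_bound(num_params: int, desired_accuracy: float, delta: float = 0.05) -> int:
    """PAC/VC sample bound: N_PAC = (1/eps) * (d_VC * ln(2/eps) + ln(2/delta))."""
    eps = 1.0 - desired_accuracy / 100.0   # acceptable error rate
    d_vc = num_params                      # parameter count as VC-dimension proxy
    n = (1.0 / eps) * (d_vc * math.log(2.0 / eps) + math.log(2.0 / delta))
    return math.ceil(n)
```

For example, a 1,000-parameter model at 90% desired accuracy (eps = 0.1) needs roughly 30,000 samples under this bound.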

2. Rule-of-Thumb Estimates (empirical, per architecture):

Linear / Logistic  : N = 10 × features × classes
Shallow NN         : N = 50 × features × log₂(classes)
Deep NN            : N = 10 × √params × classes
CNN                : N = 1000 × classes
Transformer        : N = max(100 × classes, 100 × √params)
Random Forest      : N = 100 × features × log₂(classes + 1)
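The table above can be expressed as a simple lookup; the architecture keys below are assumed names, not an API the tool exposes:

```python
import math

def rule_of_thumb(arch: str, params: int, features: int, classes: int) -> int:
    """Empirical minimum-sample rules, one per architecture (from the table above)."""
    rules = {
        "linear":        lambda: 10 * features * classes,
        "shallow_nn":    lambda: 50 * features * math.log2(classes),
        "deep_nn":       lambda: 10 * math.sqrt(params) * classes,
        "cnn":           lambda: 1000 * classes,
        "transformer":   lambda: max(100 * classes, 100 * math.sqrt(params)),
        "random_forest": lambda: 100 * features * math.log2(classes + 1),
    }
    return math.ceil(rules[arch]())
```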

3. Noise Correction:

N_noisy = N_clean / (1 − noise_rate)²

4. Total Dataset Size:

N_total = N_train / (train_split / 100)
Final   = max(N_PAC, N_rule_of_thumb) → noise correction → split adjustment
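Steps 3 and 4 combine as sketched below, assuming noise and train split are given as percentages (as in the form fields):

```python
import math

def total_dataset_size(n_pac: float, n_rule: float,
                       noise_pct: float, train_split_pct: float) -> int:
    """max of both bounds -> noise correction -> split adjustment."""
    n_train = max(n_pac, n_rule)
    n_train /= (1.0 - noise_pct / 100.0) ** 2       # N_noisy = N_clean / (1 - noise)^2
    return math.ceil(n_train / (train_split_pct / 100.0))
```

With no label noise and an 80% train split, a 1,000-sample training requirement becomes a 1,250-sample total dataset.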

5. Storage Estimate:

bytes_per_sample = num_features × 4  (float32)
total_storage    = N_total × bytes_per_sample
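The storage estimate is a one-liner under the same float32 assumption:

```python
def storage_bytes(n_total: int, num_features: int) -> int:
    """Raw storage: 4 bytes (float32) per feature per sample."""
    return n_total * num_features * 4
```

For instance, 1,000 samples of 20 features occupy 80,000 bytes (~78 KiB) before any compression.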

Assumptions & References

  • VC-dimension proxy: The number of free parameters is used as a proxy for VC-dimension. This is an approximation; true VC-dimension depends on architecture geometry.
  • PAC confidence: Fixed at 95% (δ = 0.05), a standard choice in learning theory.
  • Rule-of-thumb multipliers are derived from empirical community guidelines: Harrell's "10 events per variable" for linear models; ImageNet's ~1000 images/class for CNNs; Goodfellow et al. informal guidance for deep networks.
  • Noise model: Label noise inflates sample requirements quadratically (Natarajan et al., 2013 — learning with noisy labels).
  • Storage assumes raw float32 feature vectors; real datasets (images, text, audio) may differ significantly due to compression and tokenisation.
  • These are minimum estimates. Production systems typically require 5–100× more data for robustness, fairness, and distribution coverage.
  • References: Vapnik (1998) Statistical Learning Theory; Blumer et al. (1989) Learnability and the Vapnik–Chervonenkis Dimension, JACM; Goodfellow et al. (2016) Deep Learning, MIT Press; Natarajan et al. (2013) Learning with Noisy Labels, NeurIPS.
