AI Training Data Size Estimator
Estimate the minimum recommended training dataset size for a machine learning model based on model parameters, input features, task type, and desired confidence level.
[Inputs: model parameters, input features, number of classes, desired accuracy (%), label-noise rate (%), train split (%)]
Formulas Used
1. PAC / VC-Dimension Bound (Blumer et al., 1989; Vapnik, 1998):
N_PAC = (1/ε) × (d_VC × ln(2/ε) + ln(2/δ))

where:
  ε    = 1 − (desired_accuracy / 100)    [acceptable error rate]
  δ    = 0.05                            [5% failure prob. → 95% confidence]
  d_VC ≈ number of model parameters      [VC-dimension proxy]
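The PAC bound above can be sketched directly in Python; the function and parameter names here are illustrative, not part of the tool itself:

```python
import math

def pac_sample_size(num_params: int, desired_accuracy: float,
                    delta: float = 0.05) -> int:
    """Approximate PAC / VC-dimension sample-size bound, as defined above."""
    eps = 1.0 - desired_accuracy / 100.0   # acceptable error rate
    d_vc = num_params                      # VC-dimension proxy
    n = (1.0 / eps) * (d_vc * math.log(2.0 / eps) + math.log(2.0 / delta))
    return math.ceil(n)
```

For example, a 100-parameter model at 90% desired accuracy (ε = 0.1, δ = 0.05) needs on the order of 3,000 samples under this bound.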
2. Rule-of-Thumb Estimates (empirical, per architecture):
Linear / Logistic : N = 10 × features × classes
Shallow NN        : N = 50 × features × log₂(classes)
Deep NN           : N = 10 × √params × classes
CNN               : N = 1000 × classes
Transformer       : N = max(100 × classes, 100 × √params)
Random Forest     : N = 100 × features × log₂(classes + 1)
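The per-architecture rules above can be expressed as a small dispatch table; the architecture keys and function name below are assumptions for illustration:

```python
import math

def rule_of_thumb(arch: str, params: int, features: int, classes: int) -> int:
    """Empirical minimum-sample rules, one per architecture (see table above)."""
    rules = {
        "linear":        lambda: 10 * features * classes,
        "shallow_nn":    lambda: 50 * features * math.log2(classes),
        "deep_nn":       lambda: 10 * math.sqrt(params) * classes,
        "cnn":           lambda: 1000 * classes,
        "transformer":   lambda: max(100 * classes, 100 * math.sqrt(params)),
        "random_forest": lambda: 100 * features * math.log2(classes + 1),
    }
    return math.ceil(rules[arch]())
```

Note the log₂(classes) rules assume at least two classes; a binary task gives log₂(2) = 1.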
3. Noise Correction:
N_noisy = N_clean / (1 − noise_rate)²
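As a sketch of the correction above (function name is illustrative):

```python
import math

def noise_corrected(n_clean: int, noise_rate_pct: float) -> int:
    """Inflate the clean-sample requirement quadratically for label noise."""
    p = noise_rate_pct / 100.0
    return math.ceil(n_clean / (1.0 - p) ** 2)
```

At 10% label noise, a 1,000-sample requirement grows to 1,000 / 0.81 ≈ 1,235 samples.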
4. Total Dataset Size:
N_total = N_train / (train_split / 100)

Final = max(N_PAC, N_rule_of_thumb) → noise correction → split adjustment
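The full pipeline can be sketched as one function combining the steps above; names are illustrative:

```python
import math

def total_dataset_size(n_pac: int, n_rule: int,
                       noise_rate_pct: float, train_split_pct: float) -> int:
    """max of the two bounds → noise correction → train-split adjustment."""
    n_train = max(n_pac, n_rule)
    n_train = n_train / (1.0 - noise_rate_pct / 100.0) ** 2  # noise correction
    return math.ceil(n_train / (train_split_pct / 100.0))    # split adjustment
```

For instance, with N_PAC = 800, N_rule = 1,000, no noise, and an 80% train split, the total dataset size is 1,000 / 0.8 = 1,250 samples.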
5. Storage Estimate:
bytes_per_sample = num_features × 4    (float32)
total_storage    = N_total × bytes_per_sample
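A minimal sketch of the storage estimate, assuming raw float32 vectors as stated below:

```python
def storage_bytes(n_total: int, num_features: int) -> int:
    """Raw float32 feature-vector storage: 4 bytes per feature per sample."""
    return n_total * num_features * 4
```

A 10,000-sample dataset with 128 features would occupy 10,000 × 128 × 4 ≈ 5.1 MB of raw features.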
Assumptions & References
- VC-dimension proxy: The number of free parameters is used as a proxy for VC-dimension. This is an approximation; true VC-dimension depends on architecture geometry.
- PAC confidence: Fixed at 95% (δ = 0.05), a standard choice in learning theory.
- Rule-of-thumb multipliers are derived from empirical community guidelines: Harrell's "10 events per variable" for linear models; ImageNet's ~1000 images/class for CNNs; Goodfellow et al. informal guidance for deep networks.
- Noise model: Label noise inflates sample requirements quadratically (Natarajan et al., 2013 — learning with noisy labels).
- Storage assumes raw float32 feature vectors; real datasets (images, text, audio) may differ significantly due to compression and tokenisation.
- These are minimum estimates. Production systems typically require 5–100× more data for robustness, fairness, and distribution coverage.
- References: Vapnik (1998) Statistical Learning Theory; Blumer et al. (1989) Learnability and the Vapnik–Chervonenkis Dimension, JACM; Goodfellow et al. (2016) Deep Learning, MIT Press; Natarajan et al. (2013) Learning with Noisy Labels, NeurIPS.