AI Training Data Size Estimator

Estimate the minimum recommended training dataset size for a machine learning model based on model parameters, input features, task type, and desired confidence level.


Formulas Used

1. PAC / VC-Dimension Bound (Blumer et al., 1989; Vapnik, 1998):

N_PAC = (1/ε) × (d_VC × ln(2/ε) + ln(2/δ))

where:
  ε     = 1 − (desired_accuracy / 100)   [acceptable error rate]
  δ     = 0.05                            [5% failure prob. → 95% confidence]
  d_VC  ≈ number of model parameters      [VC-dimension proxy]
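As a sketch, the PAC bound above can be computed directly. The function name `pac_bound` is illustrative, not part of the tool:

```python
import math

def pac_bound(num_params: int, desired_accuracy: float, delta: float = 0.05) -> int:
    """PAC/VC sample bound: N_PAC = (1/eps) * (d_VC * ln(2/eps) + ln(2/delta))."""
    eps = 1.0 - desired_accuracy / 100.0   # acceptable error rate
    d_vc = num_params                      # parameter count as VC-dimension proxy
    n = (1.0 / eps) * (d_vc * math.log(2.0 / eps) + math.log(2.0 / delta))
    return math.ceil(n)
```

For example, a 1,000-parameter model at 90% desired accuracy (eps = 0.1) needs roughly 30,000 samples under this bound.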

2. Rule-of-Thumb Estimates (empirical, per architecture):

Linear / Logistic  : N = 10 × features × classes
Shallow NN         : N = 50 × features × log₂(classes)
Deep NN            : N = 10 × √params × classes
CNN                : N = 1000 × classes
Transformer        : N = max(100 × classes, 100 × √params)
Random Forest      : N = 100 × features × log₂(classes + 1)
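The table above can be expressed as a simple lookup; the architecture keys below are assumed names, not an API the tool exposes:

```python
import math

def rule_of_thumb(arch: str, params: int, features: int, classes: int) -> int:
    """Empirical minimum-sample rules, one per architecture (from the table above)."""
    rules = {
        "linear":        lambda: 10 * features * classes,
        "shallow_nn":    lambda: 50 * features * math.log2(classes),
        "deep_nn":       lambda: 10 * math.sqrt(params) * classes,
        "cnn":           lambda: 1000 * classes,
        "transformer":   lambda: max(100 * classes, 100 * math.sqrt(params)),
        "random_forest": lambda: 100 * features * math.log2(classes + 1),
    }
    return math.ceil(rules[arch]())
```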

3. Noise Correction:

N_noisy = N_clean / (1 − noise_rate)²

4. Total Dataset Size:

N_total = N_train / (train_split / 100)
Final   = max(N_PAC, N_rule_of_thumb) → noise correction → split adjustment
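Steps 3 and 4 combine as sketched below, assuming noise and train split are given as percentages (as in the form fields):

```python
import math

def total_dataset_size(n_pac: float, n_rule: float,
                       noise_pct: float, train_split_pct: float) -> int:
    """max of both bounds -> noise correction -> split adjustment."""
    n_train = max(n_pac, n_rule)
    n_train /= (1.0 - noise_pct / 100.0) ** 2       # N_noisy = N_clean / (1 - noise)^2
    return math.ceil(n_train / (train_split_pct / 100.0))
```

With no label noise and an 80% train split, a 1,000-sample training requirement becomes a 1,250-sample total dataset.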

5. Storage Estimate:

bytes_per_sample = num_features × 4  (float32)
total_storage    = N_total × bytes_per_sample
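The storage estimate is a one-liner under the same float32 assumption:

```python
def storage_bytes(n_total: int, num_features: int) -> int:
    """Raw storage: 4 bytes (float32) per feature per sample."""
    return n_total * num_features * 4
```

For instance, 1,000 samples of 20 features occupy 80,000 bytes (~78 KiB) before any compression.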

Assumptions & References

  • VC-dimension proxy: The number of free parameters is used as a proxy for VC-dimension. This is an approximation; true VC-dimension depends on architecture geometry.
  • PAC confidence: Fixed at 95% (δ = 0.05), a standard choice in learning theory.
  • Rule-of-thumb multipliers are derived from empirical community guidelines: Harrell's "10 events per variable" for linear models; ImageNet's ~1000 images/class for CNNs; Goodfellow et al. informal guidance for deep networks.
  • Noise model: Label noise inflates sample requirements quadratically (Natarajan et al., 2013 — learning with noisy labels).
  • Storage assumes raw float32 feature vectors; real datasets (images, text, audio) may differ significantly due to compression and tokenisation.
  • These are minimum estimates. Production systems typically require 5–100× more data for robustness, fairness, and distribution coverage.
  • References: Vapnik (1998) Statistical Learning Theory; Blumer et al. (1989) Learnability and the Vapnik–Chervonenkis Dimension, JACM; Goodfellow et al. (2016) Deep Learning, MIT Press; Natarajan et al. (2013) Learning with Noisy Labels, NeurIPS.
