- How much does it cost to train an LLM from scratch?
- It depends on model size and training data. Rough estimates: A 7B parameter model on 1T tokens costs $25,000-$45,000 on H100 GPUs. A 70B model on 2T tokens costs $500,000-$1,000,000. Frontier models (GPT-4 class) cost $50-100M+. The main cost drivers are model parameters, training tokens, GPU type, utilization rate, and cloud provider pricing. Use the calculator for an estimate tailored to your specific configuration.
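These estimates follow from the standard compute approximation: total training FLOPs ≈ 6 × parameters × tokens. A minimal sketch; the peak-TFLOPS, MFU, and hourly-rate defaults are illustrative assumptions, not quotes from any provider:

```python
def training_cost_usd(params, tokens, peak_tflops=989, mfu=0.45,
                      price_per_gpu_hour=1.50):
    """Back-of-envelope pre-training cost.

    peak_tflops: dense BF16 peak per GPU (989 ~ H100 SXM, assumed).
    mfu: model FLOPs utilization actually achieved (30-50% is typical).
    price_per_gpu_hour: assumed GPU-specialised-cloud H100 rate.
    """
    total_flops = 6 * params * tokens            # forward + backward passes
    gpu_seconds = total_flops / (peak_tflops * 1e12 * mfu)
    return gpu_seconds / 3600 * price_per_gpu_hour

# 7B parameters on 1T tokens lands in the $25k-$45k ballpark:
print(f"${training_cost_usd(7e9, 1e12):,.0f}")
```

Note that the answer is independent of GPU count: more GPUs finish sooner but burn the same GPU-hours (ignoring communication overhead).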
- What GPU should I use for LLM training?
- In 2026, the H100 is the standard choice for serious LLM training (7B-70B models). The H200 offers 30% more performance for frontier work. A100 80GB is a cost-effective alternative for medium models. L40S is good for fine-tuning. For hobby/research projects under 1B parameters, A10G or even T4 can work. The key factor is VRAM — you need enough memory to hold model weights, gradients, optimizer states, and activations.
- What is GPU utilization (MFU) and why does it matter?
- Model FLOPs Utilization (MFU) measures what percentage of the GPU's theoretical compute you actually use during training. Typical MFU is 30-50% for LLM training. The gap comes from: memory bandwidth bottlenecks, data loading, inter-GPU communication, gradient synchronization, and framework overhead. MFU is the single biggest cost multiplier — improving from 30% to 50% MFU reduces training time and cost by 40%.
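MFU can be computed directly from observed training throughput. A sketch; the 989 TFLOPS peak (H100 BF16 dense) and the throughput figure are illustrative assumptions:

```python
def mfu(params, tokens_per_second, num_gpus, peak_tflops_per_gpu=989):
    """Model FLOPs Utilization = achieved FLOPs / theoretical peak FLOPs.

    Uses the standard ~6 FLOPs per parameter per token (forward + backward).
    """
    achieved = 6 * params * tokens_per_second
    theoretical = num_gpus * peak_tflops_per_gpu * 1e12
    return achieved / theoretical

# A 7B model training at 75k tokens/s across 8 GPUs:
print(f"{mfu(7e9, 75_000, 8):.0%}")   # ~40%, i.e. in the typical range
```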
- What is the Chinchilla scaling law?
- The Chinchilla scaling law (Hoffmann et al., 2022) found that for compute-optimal training, models should be trained on approximately 20× more tokens than parameters. A 7B model should see ~140B tokens, a 70B model should see ~1.4T tokens. Training on fewer tokens wastes GPU compute (model is undertrained); training on more tokens wastes data processing (diminishing returns). However, recent models like Llama 3 use far more tokens (15T for 8B params), suggesting 'over-training' smaller models can be cost-effective for inference.
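The 20× rule also pins down the compute-optimal split: substituting D = 20N into C = 6ND gives C = 120N², so N = √(C/120). A small sketch:

```python
import math

def chinchilla_tokens(params, ratio=20):
    """Compute-optimal training tokens for a given parameter count."""
    return ratio * params

def chinchilla_params(flop_budget):
    """Compute-optimal parameter count N for budget C = 6*N*D with D = 20*N,
    i.e. N = sqrt(C / 120)."""
    return math.sqrt(flop_budget / 120)

print(f"{chinchilla_tokens(7e9):.3g}")   # 7B params -> 1.4e+11 (140B) tokens

# Sanity check against the paper's own model (70B params, 1.4T tokens):
budget = 6 * 7e10 * 1.4e12               # ~5.9e23 FLOPs
print(f"{chinchilla_params(budget):.3g}")  # -> 7e+10 (70B)
```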
- How do I estimate VRAM requirements for training?
- Rough VRAM estimate: Model params × bytes per param × 4 (for weights + gradients + optimizer states + activations). In BF16: a 7B model needs ~56GB minimum (7B × 2 bytes × 4). In FP32: 7B needs ~112GB. This is why 7B models need at least an A100 80GB, and 70B models need multi-GPU setups with tensor/pipeline parallelism. Techniques like gradient checkpointing and ZeRO can reduce memory requirements by 2-4×.
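The rule of thumb above translates directly to code. This mirrors the ×4 heuristic, not a precise memory profile; real footprints depend on optimizer choice, sequence length, batch size, and parallelism strategy:

```python
def training_vram_gb(params, bytes_per_param, overhead_multiplier=4):
    """Rough training VRAM: weights + gradients + optimizer states + activations,
    approximated as 4x the raw weight memory (the document's heuristic)."""
    return params * bytes_per_param * overhead_multiplier / 1e9

print(training_vram_gb(7e9, 2))   # BF16: 56.0 GB
print(training_vram_gb(7e9, 4))   # FP32: 112.0 GB
```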
- Is fine-tuning much cheaper than pre-training?
- Yes, dramatically. Full fine-tuning of a 7B model on 100K-1M examples costs $50-$500 (vs $25,000+ for pre-training). Parameter-efficient fine-tuning (LoRA/QLoRA) is even cheaper: $5-$50 for the same model, because it trains only 0.1-1% of the parameters. For most applications, fine-tuning an existing foundation model (Llama 3, Mistral) is 100-1000× more cost-effective than training from scratch.
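The 0.1-1% figure follows from how LoRA is constructed: each adapted d×k weight matrix gets two trainable low-rank factors totalling r(d+k) parameters. A sketch with hypothetical (Llama-7B-like) dimensions; the layer count and target modules are illustrative assumptions:

```python
def lora_params(d, k, rank):
    """Trainable parameters LoRA adds to one d x k weight matrix:
    factor A is d x rank, factor B is rank x k."""
    return rank * (d + k)

# Hypothetical 7B-class config: 32 layers, rank-16 adapters on the
# 4096x4096 query and value projections only
total = 32 * 2 * lora_params(4096, 4096, 16)
print(f"{total:,} trainable params = {total / 7e9:.2%} of the base model")
# -> 8,388,608 trainable params = 0.12% of the base model
```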
- Why are Lambda Labs and CoreWeave cheaper than AWS?
- GPU-specialised clouds (Lambda, CoreWeave) offer lower prices because: (1) They focus exclusively on GPU workloads with optimised infrastructure; (2) No enterprise overhead (compliance, managed services, support tiers); (3) Direct NVIDIA partnerships for bulk GPU procurement; (4) Minimal egress fees and simpler pricing. The trade-off: fewer enterprise features, less geographic coverage, and potentially less availability during high demand. For pure ML training, the savings are substantial (40-50% off).
- How long does it take to train a 7B parameter model?
- On 8× H100 GPUs at 40% MFU, training a 7B model on 1T tokens takes roughly 150 days (total compute is 6 × 7B × 1T ≈ 4.2×10²² FLOPs). On 64× H100s: ~19 days. On 1× H100: over 3 years (impractical). On 8× A100 80GB: roughly 16 months. The key relationship: training time scales linearly with tokens and inversely with (GPUs × TFLOPS × MFU). Doubling GPUs roughly halves training time (with some communication overhead).
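The scaling relationship is the 6ND compute rule divided by effective cluster throughput. A sketch, assuming a 989 TFLOPS dense BF16 peak per H100:

```python
def training_days(params, tokens, num_gpus, peak_tflops=989, mfu=0.40):
    """Wall-clock days: total training FLOPs / cluster FLOPs per second."""
    total_flops = 6 * params * tokens
    cluster_flops_per_sec = num_gpus * peak_tflops * 1e12 * mfu
    return total_flops / cluster_flops_per_sec / 86_400

# Doubling the GPU count halves the estimate (before communication overhead):
print(round(training_days(7e9, 1e12, 8)))    # 8 GPUs  -> 154
print(round(training_days(7e9, 1e12, 16)))   # 16 GPUs -> 77
```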
- What's the difference between pre-training and post-training costs?
- Pre-training (learning language from raw text) is the most expensive phase — typically 70-90% of total cost. Post-training includes: Supervised Fine-Tuning (SFT) on instructions (~5-10% of pre-training cost), RLHF/DPO alignment (~5-15% of pre-training cost), and evaluation/red-teaming (~1-5%). For a $1M pre-training run, total post-training adds $100K-$300K. Our calculator estimates pre-training costs only.
- Can I train an LLM on consumer GPUs?
- Technically yes, but it's impractical for anything useful. An RTX 4090 (24GB VRAM, ~82 TFLOPS BF16) can fine-tune models up to ~7B with QLoRA. For pre-training: a 1B model on 20B tokens would take roughly 6-8 weeks on a single 4090 at realistic MFU. A 7B model on 140B tokens would take several years. Multi-4090 setups lack NVLink bandwidth, adding 30-50% communication overhead. For serious training, cloud H100s are far more cost-effective per FLOP.
- What is the cost per token for LLM training?
- Cost per token varies by model size and setup. Rough estimates at 2026 H100 prices: training costs roughly $0.00000003 per token for a 7B model (about $25-$45 per billion tokens), rising toward ~$0.00001 per token for frontier models. Inference on a given model is cheaper per token (a forward pass is roughly a third of the training FLOPs per token); hosted inference typically runs $0.0000001-$0.000001 per token depending on model size and provider. The key insight: training is a one-time cost amortised over all future inference. A $1M training cost spread over 1 trillion inference tokens = $0.000001/token.
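Both halves of this answer reduce to one-line arithmetic. A sketch using the same assumed figures as elsewhere on this page (989 TFLOPS peak, 45% MFU, $1.50/GPU-hour are assumptions):

```python
def training_cost_per_token(params, peak_tflops=989, mfu=0.45,
                            price_per_gpu_hour=1.50):
    """Dollars of training compute per token: 6*params FLOPs at the effective rate."""
    seconds_per_token = 6 * params / (peak_tflops * 1e12 * mfu)
    return seconds_per_token / 3600 * price_per_gpu_hour

print(f"${training_cost_per_token(7e9):.2g} per training token")   # ~$39 per 1B tokens

# Amortising a one-time training cost over lifetime inference:
print(f"${1_000_000 / 1e12} per inference token")   # $1M over 1T tokens -> $1e-06
```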
- How accurate is this calculator?
- This calculator provides order-of-magnitude estimates (±30-50%) suitable for budgeting and comparison. Real-world costs vary due to: actual MFU achieved (hardware/software dependent), data loading efficiency, checkpoint frequency, failure recovery, hyperparameter search runs, and post-training costs not included here. For precise budgeting, run a small-scale experiment and extrapolate using measured MFU. The relative comparisons between GPUs and providers are more accurate than absolute dollar figures.
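The suggested pilot-then-extrapolate approach is linear in tokens for a fixed model and cluster. A sketch; the pilot figures below are hypothetical:

```python
def extrapolated_cost(pilot_tokens, pilot_gpu_hours, target_tokens,
                      price_per_gpu_hour=2.00):
    """Scale measured pilot GPU-hours linearly to the full token budget.
    Valid only for the same model, hardware, and parallelism setup."""
    gpu_hours = pilot_gpu_hours * target_tokens / pilot_tokens
    return gpu_hours * price_per_gpu_hour

# Hypothetical pilot: 2B tokens consumed 60 GPU-hours; full run is 1T tokens
print(f"${extrapolated_cost(2e9, 60, 1e12):,.0f}")   # -> $60,000
```

Because the pilot bakes in your actual MFU, data loading, and checkpointing overheads, this is usually more trustworthy than any theoretical estimate.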
- What about TPUs vs GPUs for training?
- Google's TPU v4/v5 chips are competitive with NVIDIA H100s and often offer better price-performance for large-scale training through Google Cloud. TPU v5e offers ~$1.20/chip-hour with strong BF16 performance. The trade-off: TPUs require JAX/XLA framework (not PyTorch native), have a smaller ecosystem, and are only available on Google Cloud. Most of the ML community uses NVIDIA GPUs with PyTorch, making GPUs the default choice despite TPU cost advantages.
- How do spot instances affect training costs?
- Spot/preemptible instances offer 50-70% discounts but can be interrupted. Strategy for LLM training: use aggressive checkpointing (every 15-30 min), implement automatic restart on interruption, and accept that you'll lose some compute to restarts. Net savings after accounting for lost compute: typically 30-50% off on-demand pricing. Works best for training runs that can tolerate interruptions. Not recommended for time-critical production training runs.
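The "net savings after lost compute" claim can be estimated: on average an interruption loses about half a checkpoint interval of work, plus restart overhead. All inputs here are illustrative assumptions:

```python
def net_spot_savings(spot_discount, checkpoint_interval_min,
                     interruptions_per_day, restart_min=30):
    """Fraction saved vs on-demand after paying for lost/recomputed work.

    Each interruption loses ~half a checkpoint interval plus restart time."""
    lost_min = interruptions_per_day * (checkpoint_interval_min / 2 + restart_min)
    useful_fraction = 1 - lost_min / (24 * 60)
    effective_cost = (1 - spot_discount) / useful_fraction
    return 1 - effective_cost

# 50% spot discount, 30-min checkpoints, 4 interruptions/day, 30-min restarts:
print(f"{net_spot_savings(0.50, 30, 4):.0%}")   # -> 43%
```

Tightening the checkpoint interval trades steady checkpoint-write overhead (not modelled here) against less work lost per interruption.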