Scaling laws — Kaplan, Chinchilla, and after

10 atomic recall cards. Export to Anki and let spaced repetition do its slow work.

In Anki: File → Import, choose this TSV, set field separator to Tab, deck = Tokens to Embodied Minds · Ch 09, note type = Basic.

Front	Back
State Kaplan et al.'s headline scaling result for loss as a function of parameter count.	L(N) ~ N^(-0.076) when data and compute are not the bottleneck: loss follows a power law in (non-embedding) parameter count with exponent -0.076.
What is an IsoFLOP experiment?	Train multiple model sizes at the same total compute budget (same C = 6ND FLOPs), varying N and D inversely. Plot final loss vs N to find the compute-optimal model size for that C.
What is the Chinchilla compute-optimal training recipe?	N_opt ~ C^0.5 and D_opt ~ C^0.5, with about 20 tokens per parameter (D ≈ 20N). With C ≈ 6ND this gives N_opt ≈ sqrt(C/120). Parameters and tokens should scale roughly equally with compute. At C ≈ 5.9e23 FLOPs, this gives ~70B params and ~1.4T tokens.
Why does the compute-optimal model differ from the deployment-optimal model?	Compute-optimal minimizes pretraining loss for a fixed training budget. Deployment-optimal minimizes total cost (training + serving). At high query volume, a smaller model trained on more tokens is cheaper to serve and often competitive in quality.
What is inference-time scaling and which models first demonstrated it?	Allocating more compute at test time — extended chain-of-thought, verification, backtracking. o1 (OpenAI, September 2024) and DeepSeek-R1 (January 2025) demonstrated that reasoning quality scales with test-time compute on math and code tasks.
Why do scaling laws predict pretraining loss but not always downstream task performance?	The mapping from pretraining loss to task accuracy is nonlinear and task-specific. A 10% perplexity reduction may yield a 20% improvement on coding and 3% on summarization. Downstream evals must be run independently.
What is the difference between Kaplan's and Chinchilla's view on data vs model size?	Kaplan (2020): parameters matter more than data (N_opt grows roughly as C^0.73). Chinchilla (2022): parameters and tokens should scale equally. Kaplan's bias is usually attributed to a learning-rate schedule that was not matched to each run's token count, and to fitting on small models.
For what type of tasks does inference-time scaling NOT help much?	Tasks without verifiable correct answers: factual recall, open-ended generation, multi-modal tasks without a checking mechanism. Chain-of-thought helps most when there is a clear scoring signal (math, code, formal proofs).
Approximately how much did training compute increase from GPT-3 to Llama 3 70B?	GPT-3: ~3.14e23 FLOPs. Llama 3 70B on 15T tokens: 6ND = 6 * 70B * 15T ≈ 6.3e24 FLOPs. Roughly 20x more compute, primarily invested in data, not parameters.
How would you use scaling law analysis to budget a custom VLA fine-tuning run?	Run IsoFLOP ablations at small scale (3-5 model sizes per budget, at two or more small budgets starting around C=1e17 FLOPs) on your demonstration split. Fit the compute-optimal curve across those budgets. Extrapolate to your actual compute budget to predict the optimal model size and token count before committing to a full training run.