From Tokens to Embodied Minds · Drill cards · Chapter 09
Drills
Scaling laws — Kaplan, Chinchilla, and after
10 atomic recall cards. Export to Anki and let spaced repetition do its slow work.
In Anki: File → Import, choose this TSV, set field separator to Tab, deck = Tokens to Embodied Minds · Ch 09, note type = Basic.
| Front | Back |
|---|---|
| State Kaplan et al.'s headline scaling result for loss as a function of parameter count. | L(N) ∝ N^(-0.076) when data is not the bottleneck: loss follows a power law in (non-embedding) parameter count with exponent of about -0.076. |
| What is an IsoFLOP experiment? | Train multiple model sizes at the same total compute budget (same C = 6ND FLOPs), varying N and D inversely. Plot final loss vs N to find the compute-optimal model size for that C. |
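| What is the Chinchilla compute-optimal training recipe? | Scale parameters and tokens about equally with compute: N_opt ∝ C^0.5 and D_opt ∝ C^0.5, which works out to roughly 20 tokens per parameter. With C ≈ 6ND, that is N_opt ≈ (C/120)^0.5 and D_opt ≈ 20 * N_opt. At C ≈ 5.9e23 FLOPs, this gives ~70B params and ~1.4T tokens (Chinchilla itself). |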
| Why does the compute-optimal model differ from the deployment-optimal model? | Compute-optimal minimizes pretraining loss for a fixed training budget. Deployment-optimal minimizes total cost (training + serving). At high query volume, a smaller model trained on more tokens is cheaper to serve and often competitive in quality. |
| What is inference-time scaling and which models first demonstrated it? | Allocating more compute at test time — extended chain-of-thought, verification, backtracking. o1 (OpenAI, September 2024) and DeepSeek-R1 (January 2025) demonstrated that reasoning quality scales with test-time compute on math and code tasks. |
| Why do scaling laws predict pretraining loss but not always downstream task performance? | The mapping from pretraining loss to task accuracy is nonlinear and task-specific. A 10% perplexity reduction may yield a 20% improvement on coding and 3% on summarization. Downstream evals must be run independently. |
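| What is the difference between Kaplan's and Chinchilla's view on data vs model size? | Kaplan et al. (2020): as compute grows, scale model size much faster than data (N_opt ∝ C^0.73). Chinchilla (Hoffmann et al., 2022): scale parameters and tokens about equally (both ∝ C^0.5). The Chinchilla authors attribute Kaplan's bias largely to using a fixed token count and learning-rate schedule across runs instead of matching the schedule to each run's length. |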
| For what type of tasks does inference-time scaling NOT help much? | Tasks without verifiable correct answers: factual recall, open-ended generation, multi-modal tasks without a checking mechanism. Chain-of-thought helps most when there is a clear scoring signal (math, code, formal proofs). |
| Approximately how much did training compute increase from GPT-3 to Llama 3 70B? | GPT-3: ~3.14e23 FLOPs. Llama 3 70B on 15T tokens: ~6*N*D = 6*70B*15T ≈ 6.3e24 FLOPs. Roughly 20x more compute, all of it spent on data: about 50x more tokens (15T vs ~300B) at 0.4x the parameters (70B vs 175B). |
| How would you use scaling law analysis to budget a custom VLA fine-tuning run? | Run IsoFLOP ablations at small scale (3-5 model sizes, C=1e17 FLOPs each) on your demonstration split. Fit the compute-optimal curve. Extrapolate to your actual compute budget to predict the optimal model size and token count before committing to a full training run. |
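
The arithmetic behind the Chinchilla and GPT-3 vs Llama 3 cards can be checked with a few lines of Python. This is a minimal sketch, not part of the deck: the function names `train_flops` and `chinchilla_split` are made up for illustration, C ≈ 6ND is the usual dense-transformer approximation, the 20 tokens-per-parameter ratio is a rule of thumb rather than an exact constant, and GPT-3's 175B parameters and 300B tokens are taken from its paper, not from the cards.

```python
# Rule-of-thumb constants; both are approximations, not exact values.
FLOPS_PER_PARAM_TOKEN = 6   # C ≈ 6 * N * D for dense transformers
TOKENS_PER_PARAM = 20       # approximate Chinchilla ratio D / N


def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute C ≈ 6 * N * D, in FLOPs."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens


def chinchilla_split(compute: float, tokens_per_param: float = TOKENS_PER_PARAM):
    """Split a FLOP budget into (N, D), assuming D = tokens_per_param * N."""
    # Substituting D = r * N into C = 6 * N * D gives N = sqrt(C / (6 * r)).
    n_params = (compute / (FLOPS_PER_PARAM_TOKEN * tokens_per_param)) ** 0.5
    return n_params, tokens_per_param * n_params


if __name__ == "__main__":
    gpt3 = train_flops(175e9, 300e9)        # GPT-3: 175B params, 300B tokens
    llama3_70b = train_flops(70e9, 15e12)   # Llama 3 70B: 70B params, 15T tokens
    print(f"GPT-3       ~{gpt3:.2e} FLOPs")
    print(f"Llama 3 70B ~{llama3_70b:.2e} FLOPs ({llama3_70b / gpt3:.0f}x GPT-3)")

    # Recover Chinchilla's own configuration from its compute budget.
    n, d = chinchilla_split(train_flops(70e9, 1.4e12))
    print(f"Chinchilla budget -> N ~{n:.2e} params, D ~{d:.2e} tokens")
```

Changing `tokens_per_param` is a quick way to explore the deployment-optimal regime, where a smaller model is deliberately trained on more tokens than the compute-optimal ratio suggests.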