From Tokens to Embodied Minds · Drill cards · Chapter 07
Drills
Pretraining at scale
10 atomic recall cards. Export to Anki and let spaced repetition do its slow work.
In Anki: File → Import, choose this TSV, set field separator to Tab, deck = Tokens to Embodied Minds · Ch 07, note type = Basic.
| Front | Back |
|---|---|
| Name the four stages of a pretraining data pipeline. | Collect (web crawl, code, books), Filter (language ID, quality heuristics, LLM scorer), Deduplicate (MinHash LSH), Curriculum (mixing ratios and late-stage replay). |
| What is the Chinchilla compute-optimal training recipe? | Scale parameters and tokens roughly equally with compute: both N and D grow as C^0.5, at about 20 tokens per parameter. With C ≈ 6·N·D this gives N ≈ (C/120)^0.5 and D ≈ 20·N. Chinchilla itself is 70B parameters on 1.4T tokens, about 5.9e23 FLOPs. |
| Why do deployment-optimal models train on far more tokens than Chinchilla-optimal? | Smaller models trained on more tokens are cheaper to serve at inference scale. Serving cost dominates training cost at large query volumes, making it worthwhile to overtrain a smaller model beyond the compute-optimal point. |
| What is MinHash deduplication and why is it used instead of exact-match dedup? | MinHash estimates the Jaccard similarity between documents' shingle sets from short signatures; locality-sensitive hashing (LSH) then buckets the signatures so that only likely matches are compared. Exact-match dedup misses near-duplicate pages with minor edits. MinHash over 5-gram shingles with a J=0.8 threshold removes near-duplicates that exact-match would keep. |
| What distinguishes FineWeb-Edu from DCLM-Baseline in terms of data quality theory? | FineWeb-Edu filters with a classifier trained on LLM educational-quality annotations (Llama 3-70B), which biases it strongly toward expository, educational prose. DCLM-Baseline filters with a fastText classifier trained on curated sources, which gives broader domain coverage. FineWeb-Edu wins on knowledge benchmarks; DCLM is more balanced for technical domains. |
| What is a late-stage high-quality curriculum and why does it matter? | In the final training phase (e.g., last 30% of tokens), upweight high-quality sources (code, math, textbooks). The distribution the model sees last strongly influences its final parameter values, so late-stage replay on quality data recovers benchmark performance lost during the broad early-training phase. |
| Why does Llama 3 use 15T training tokens for an 8B parameter model? | 15T tokens over 8B parameters is ~1875 tokens/param, far beyond Chinchilla-optimal (~20 tokens/param). Serving cost scales with parameter count, not training token count, so overtraining a small model buys extra quality at no extra inference cost: the result is more capable than a compute-optimal model of the same size while staying cheap to serve. |
| What failure mode does the filter stage in data preprocessing prevent? | Model memorization of low-quality boilerplate, spam, and machine-translated garbage. Quality filtering ensures each token in the training corpus contributes meaningful signal rather than noise that biases generation toward low-quality patterns. |
| How does the reasoning era (o1, DeepSeek-R1) change the scaling picture? | Post-training RL on verifiable tasks and inference-time compute (chain-of-thought) can partly substitute for pretraining tokens on reasoning tasks. The Chinchilla curves describe pretraining loss; on their own they do not predict reasoning performance, which also scales with test-time compute. |
| Why is bad demonstration data in robot training analogous to bad web data in LLM pretraining? | Both contaminate the learned distribution. Bad demos (failed episodes, clumsy motions) teach the policy to reproduce failure modes. Bad text teaches the model to generate low-quality content. The fix in both cases is the same: filter aggressively and deduplicate before training. |
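
Worked arithmetic for the Chinchilla and Llama 3 cards, as a minimal Python sketch. It assumes the standard approximation C ≈ 6·N·D for training FLOPs and the ~20 tokens/param rule of thumb from the cards; the function names are illustrative, not taken from any library.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6  # assumption: C ≈ 6 * N * D
TOKENS_PER_PARAM = 20      # Chinchilla rule of thumb quoted on the card


def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Return (parameters, tokens) for a budget under C = 6*N*D and D = 20*N."""
    # Substituting D = 20*N into C = 6*N*D gives C = 120 * N**2.
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


def tokens_per_param(n_tokens: float, n_params: float) -> float:
    """Training tokens seen per model parameter."""
    return n_tokens / n_params


if __name__ == "__main__":
    n, d = chinchilla_optimal(5.88e23)  # roughly Chinchilla's own budget
    print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
    print(f"Llama 3 8B: {tokens_per_param(15e12, 8e9):.0f} tokens/param")
```

At a budget of about 5.9e23 FLOPs this lands near 70B parameters and 1.4T tokens, while 15T tokens over 8B parameters works out to 1875 tokens/param.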
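
The MinHash card can be made concrete with a small pure-Python sketch. It assumes word-level 5-gram shingles (the card does not say whether shingles are words or characters), a 128-hash signature, and salted BLAKE2b as the hash family. All of these are illustrative choices. The LSH banding step that production pipelines use to avoid all-pairs comparison is omitted here.

```python
import hashlib

NUM_HASHES = 128   # assumption: signature length; more hashes lower the variance
SHINGLE_SIZE = 5   # 5-gram shingles, as on the card (word-level is an assumption)
THRESHOLD = 0.8    # Jaccard threshold from the card


def shingles(text: str, k: int = SHINGLE_SIZE) -> set[str]:
    """Set of overlapping word k-grams; short texts become a single shingle."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed: int, shingle: str) -> int:
    """One member of the hash family: BLAKE2b salted with the seed."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingle_set: set[str], num_hashes: int = NUM_HASHES) -> list[int]:
    """For each hash function, keep the minimum hash over all shingles."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a: str, text_b: str, threshold: float = THRESHOLD) -> bool:
    sig_a = minhash_signature(shingles(text_a))
    sig_b = minhash_signature(shingles(text_b))
    return estimated_jaccard(sig_a, sig_b) >= threshold
```

Two pages that differ only by a trailing word share almost all of their shingles, so their signatures agree in most slots and the pair is flagged. An exact-match hash of the full text would treat them as distinct documents.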