Pretraining at scale

10 atomic recall cards. Export to Anki and let spaced repetition do its slow work.

In Anki: File → Import, choose this TSV, set field separator to Tab, deck = Tokens to Embodied Minds · Ch 07, note type = Basic.

Front	Back
Name the four stages of a pretraining data pipeline.	Collect (web crawl, code, books), Filter (language ID, quality heuristics, LLM scorer), Deduplicate (MinHash LSH), Curriculum (mixing ratios and late-stage replay).
What is the Chinchilla compute-optimal training recipe?	Scale parameters and tokens in roughly equal proportion, each growing as about C^0.5 with training compute C. With C ~ 6*N*D and about 20 tokens per parameter, N ~ (C/120)^0.5 parameters and D ~ 20*N tokens. At about 5.9e23 FLOPs, this gives approximately 70B parameters and 1.4T tokens.
Why do deployment-optimal models train on far more tokens than Chinchilla-optimal?	Smaller models trained on more tokens are cheaper to serve at inference scale. Serving cost dominates training cost at large query volumes, making it worthwhile to overtrain a smaller model beyond the compute-optimal point.
What is MinHash deduplication and why is it used instead of exact-match dedup?	MinHash estimates Jaccard similarity between documents from compact hash signatures; locality-sensitive hashing (LSH) then buckets similar signatures so candidate pairs are found without comparing every pair of documents. Exact-match dedup misses near-duplicate pages with minor edits. MinHash on 5-gram shingles with a J=0.8 threshold removes lightly edited or templated near-duplicates that exact-match would keep.
What distinguishes FineWeb-Edu from DCLM-Baseline in terms of data quality theory?	FineWeb-Edu filters with a classifier trained on LLM educational-quality annotations (Llama 3-70B), which biases it strongly toward expository, educational prose. DCLM-Baseline filters with a fastText classifier trained on curated sources, giving broader domain coverage. FineWeb-Edu wins on knowledge benchmarks; DCLM is more balanced for technical domains.
What is a late-stage high-quality curriculum and why does it matter?	In the final training phase (e.g., last 30% of tokens), upweight high-quality sources (code, math, textbooks). The model's last-seen distribution strongly influences final parameter values. Late-stage replay on quality data recovers benchmark performance lost during the broad early-training phase.
Why does Llama 3 use 15T training tokens for an 8B parameter model?	15T tokens is ~1875 tokens/param, far beyond Chinchilla-optimal (~20 tokens/param). This overtrained-small-model strategy yields an 8B model that is much stronger than a compute-optimal 8B model while staying cheap to serve, because serving cost scales with parameter count, not training token count.
What failure mode does the filter stage in data preprocessing prevent?	Model memorization of low-quality boilerplate, spam, and machine-translated garbage. Quality filtering ensures each token in the training corpus contributes meaningful signal rather than noise that biases generation toward low-quality patterns.
How does the reasoning era (o1, DeepSeek-R1) change the scaling picture?	Post-training RL on verifiable tasks and inference-time compute (long chain-of-thought) can partly substitute for pretraining tokens on reasoning tasks. The Chinchilla curves describe pretraining loss; on their own they no longer predict reasoning performance, which also scales with RL and test-time compute.
Why is bad demonstration data in robot training analogous to bad web data in LLM pretraining?	Both contaminate the learned distribution. Bad demos (failed episodes, clumsy motions) teach the policy to reproduce failure modes. Bad text teaches the model to generate low-quality content. The fix in both cases is the same: filter aggressively and deduplicate before training.