Chapter 07 · Pretraining at scale — From Tokens to Embodied Minds

Pretraining is mostly data engineering. At a fixed scale, reasonable choices of architecture and hyperparameters move performance far less than data quality, mixing ratios, and deduplication discipline do. The field learned this the hard way when Chinchilla (Hoffmann et al., arXiv:2203.15556, March 2022) showed that GPT-3 was significantly undertrained: it had too many parameters for its token budget, and under the rule of roughly 20 tokens per parameter it would have reached a better loss at the same compute with about a third of the parameters and three to four times the tokens. The follow-on lesson, not in the Chinchilla paper, came from deployment: a smaller model trained on more data is cheaper to serve, so the 2025 target is 100-200 tokens per parameter or more, not the compute-optimal 20. FineWeb-Edu (Penedo et al., arXiv:2406.17557, June 2024) and DCLM-Baseline (Li et al., arXiv:2406.11794, June 2024) are two open datasets that represent divergent theories of what makes training data good. FineWeb-Edu filters with an educational quality classifier trained on LLM annotations. DCLM-Baseline filters with a fastText quality classifier trained on curated positive examples. Both outperform The Pile on downstream benchmarks. Neither is correct for all domains.

The four-stage data pipeline

Stage 1: Collect. Common Crawl WARC files are the starting point for almost every open pretraining dataset. A single monthly dump is roughly 80 TB of compressed WARC data. Extracting text from the HTML is non-trivial; trafilatura and resiliparse are the standard tools. Code (GitHub), books (Project Gutenberg, Internet Archive), and academic papers (arXiv, Semantic Scholar) are mixed in. The mixing ratios are a research-level decision with large downstream effects.

Stage 2: Filter. A minimum quality filter removes pages that fall below thresholds on heuristics (length, punctuation density, duplicate line ratio), and a language classifier (fastText lid.176.bin, trained on Wikipedia, Tatoeba, and SETimes) keeps only the target languages. The model-based quality step used in FineWeb-Edu has an LLM annotate a sample of documents for educational value, trains a small classifier on those annotations, scores every document with that classifier, and keeps only the top-scoring documents (roughly 8% of FineWeb). This step is expensive but produces measurably better downstream benchmark scores at matched compute.
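The heuristic part of the filter stage can be sketched in a few lines. This is a minimal illustration, not the filter of any published dataset: the function names, the three signals as implemented here, and every threshold value are assumptions chosen for readability. "Punctuation density" is read as the share of lines ending in terminal punctuation, which is only one possible interpretation, and the language-identification step is omitted.

```python
from collections import Counter


def quality_signals(text: str) -> dict:
    """Compute three cheap per-document heuristics (illustrative only)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    n_words = len(text.split())
    # Share of non-empty lines that repeat an earlier line in the same page.
    counts = Counter(lines)
    repeated = sum(c - 1 for c in counts.values())
    dup_line_ratio = repeated / len(lines) if lines else 1.0
    # Share of lines that end like a sentence (assumed proxy for punctuation density).
    terminal = sum(ln.endswith((".", "!", "?", '"')) for ln in lines)
    terminal_punct_ratio = terminal / len(lines) if lines else 0.0
    return {
        "n_words": n_words,
        "dup_line_ratio": dup_line_ratio,
        "terminal_punct_ratio": terminal_punct_ratio,
    }


def passes_filter(
    text: str,
    min_words: int = 50,                    # hypothetical threshold
    max_dup_line_ratio: float = 0.3,        # hypothetical threshold
    min_terminal_punct_ratio: float = 0.5,  # hypothetical threshold
) -> bool:
    """Return True if the document clears all three heuristic thresholds."""
    s = quality_signals(text)
    return (
        s["n_words"] >= min_words
        and s["dup_line_ratio"] <= max_dup_line_ratio
        and s["terminal_punct_ratio"] >= min_terminal_punct_ratio
    )


if __name__ == "__main__":
    article = "\n".join(
        f"Sentence number {i} explains one distinct idea about data filtering in detail."
        for i in range(10)
    )
    menu_spam = "\n".join(["Click here to subscribe"] * 20)
    print(passes_filter(article), passes_filter(menu_spam))
```

In a real pipeline the thresholds would be tuned against a held-out sample that a person has inspected, since each heuristic silently discards some good pages.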

Stage 3: Deduplicate. Removing exact duplicates and near-duplicates, the latter with MinHash (Broder, 1997), is the single highest-leverage data quality step. Deduplicated corpora consistently outperform un-deduplicated ones at the same token count. A practical implementation uses datasketch's MinHashLSH with 5-gram shingles and a Jaccard threshold of 0.8. Without deduplication, Common Crawl contains roughly 30% near-duplicate content, and training on it pushes the model to memorize repeated text rather than generalize.
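To make the mechanics concrete, here is a small standard-library sketch of MinHash over 5-gram word shingles with a 0.8 Jaccard threshold. It is a stand-in for illustration, not datasketch's implementation: the salted BLAKE2b hashing, the signature length of 128, and the quadratic `dedupe` loop are all assumptions made to keep the example self-contained. A production pipeline would use locality-sensitive hashing to avoid comparing every pair.

```python
import hashlib
import re

NUM_PERM = 128  # signature length; an assumed value


def shingles(text: str, n: int = 5) -> set:
    """Return the set of n-word shingles for a document."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(shingle: str, seed: int) -> int:
    # One salted 64-bit hash per seed stands in for one random permutation.
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(shingle_set: set, num_perm: int = NUM_PERM) -> list:
    """Keep the minimum hash value under each of num_perm salted hash functions."""
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(num_perm)]


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)


def dedupe(docs: list, threshold: float = 0.8) -> list:
    """Greedy keep-first dedup. O(n^2) comparisons; LSH would replace this loop."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(shingles(doc))
        if all(estimated_jaccard(sig, k) < threshold for k in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept


if __name__ == "__main__":
    original = " ".join(f"w{i}" for i in range(100))
    near_copy = " ".join([f"w{i}" for i in range(99)] + ["changed"])
    unrelated = " ".join(f"x{i}" for i in range(100))
    print(len(dedupe([original, near_copy, unrelated])))
```

The estimate converges on the exact Jaccard similarity as the signature gets longer, which is why the threshold can be applied to signatures instead of full shingle sets.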

Stage 4: Curriculum. This stage sets the data mixing ratios over the course of training and replays high-quality data late in the run. Llama 3's recipe upsamples high-quality sources, including code and math, during a short annealing phase at the end of training. The curriculum matters because the distribution the model sees last strongly influences its final parameter values: ending on a high-quality mix recovers performance on benchmarks that were underweighted in early stages.
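A curriculum of this kind reduces to a schedule of sampling weights over training progress. The sketch below is hypothetical throughout: the source names, both mixes, and the point where the late-stage phase begins (`anneal_start`) are placeholder values, not the recipe of any released model.

```python
import random

# Hypothetical sampling weights per data source; each mix sums to 1.
BASE_MIX = {"web": 0.70, "code": 0.15, "math": 0.05, "books": 0.10}
LATE_MIX = {"web": 0.40, "code": 0.30, "math": 0.20, "books": 0.10}


def mix_at(progress: float, anneal_start: float = 0.9) -> dict:
    """Return source weights at a training progress value in [0, 1].

    Before anneal_start the base mix is used unchanged. After it, weights
    are interpolated linearly toward the late-stage mix.
    """
    if progress < anneal_start:
        return dict(BASE_MIX)
    t = (progress - anneal_start) / (1.0 - anneal_start)
    return {k: (1.0 - t) * BASE_MIX[k] + t * LATE_MIX[k] for k in BASE_MIX}


def sample_source(progress: float, rng: random.Random) -> str:
    """Pick the source of the next training document according to the schedule."""
    weights = mix_at(progress)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    for progress in (0.0, 0.95, 1.0):
        print(progress, mix_at(progress), sample_source(progress, rng))
```

Interpolating rather than switching abruptly is a design choice made here for smoothness; a hard switch is equally easy to express by returning `LATE_MIX` directly.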

Compute-optimal vs deployment-optimal

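The Chinchilla result (Hoffmann et al., arXiv:2203.15556) is about optimal pretraining loss for a fixed compute budget C. With the standard approximation C ≈ 6 * N * D (N parameters, D tokens) and the rule of thumb D ≈ 20 * N, this gives roughly N_optimal ≈ (C / 120)^0.5 ≈ 0.09 * C^0.5 parameters and D_optimal ≈ 20 * N_optimal ≈ 1.8 * C^0.5 tokens. GPT-3 (175B params, 300B tokens) saw about 1.7 tokens per parameter; at its compute budget the rule implies roughly 51B parameters and 1T tokens, so GPT-3 was about 3.4× parameter-heavy for its budget. The corrected recipe gave Chinchilla 70B, trained on 1.4T tokens, better results than Gopher 280B at the same training compute and one-quarter the parameter count, and better results than GPT-3 175B at 40% of the parameter count.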
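The compute-optimal split can be worked through numerically. The sketch below assumes two common approximations that are not exact results from the paper: training compute C ≈ 6 * N * D FLOPs, and a compute-optimal ratio of about 20 tokens per parameter. Function and parameter names are illustrative.

```python
import math


def compute_optimal(
    c_flops: float,
    tokens_per_param: float = 20.0,       # rule-of-thumb ratio (assumption)
    flops_per_param_token: float = 6.0,   # C ~ 6 * N * D (assumption)
) -> tuple:
    """Split a training budget into (parameters, tokens).

    Solves C = 6 * N * D together with D = 20 * N, which gives
    N = sqrt(C / 120) and D = 20 * N under the default assumptions.
    """
    n_params = math.sqrt(c_flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    # GPT-3's approximate training budget: 175B parameters, 300B tokens.
    gpt3_params, gpt3_tokens = 175e9, 300e9
    budget = 6.0 * gpt3_params * gpt3_tokens
    n_opt, d_opt = compute_optimal(budget)
    print(f"budget: {budget:.3g} FLOPs")
    print(f"compute-optimal params: {n_opt:.3g}")
    print(f"compute-optimal tokens: {d_opt:.3g}")
    print(f"GPT-3 params / optimal params: {gpt3_params / n_opt:.2f}")
```

Under these assumptions the budget of about 3.15e23 FLOPs maps to roughly 51B parameters and 1T tokens, so the parameter ratio comes out near 3.4. Changing `tokens_per_param` shows how sensitive the conclusion is to the assumed ratio.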

But Chinchilla's framing counts only training compute and ignores inference. In practice, a model served at 10M queries per day can accumulate more inference compute over its deployment lifetime than it cost to train. This changes the optimal recipe: train a smaller model on more tokens than Chinchilla-optimal to get a cheaper-to-serve model with the same benchmark performance. Llama 3 8B (trained on 15T tokens, ~1,875 tokens/param) far exceeds the Chinchilla compute-optimal token count for its size, and Mistral 7B is generally assumed to as well, although its training token count was not published. Both outperform much larger, older models in inference-constrained settings.
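The trade-off can be made concrete by adding serving cost to training cost. The sketch below uses rough approximations (about 6 * N FLOPs per training token and 2 * N FLOPs per generated or processed token at inference) and compares two hypothetical configurations. It assumes, purely for illustration, that the two models reach the same quality; that is not a measured result, and the 1,000 tokens per query figure is also an assumption.

```python
def lifetime_flops(n_params: float, train_tokens: float, served_tokens: float) -> float:
    """Training FLOPs (~6*N*D) plus inference FLOPs (~2*N per served token)."""
    return 6.0 * n_params * train_tokens + 2.0 * n_params * served_tokens


def crossover_served_tokens(big: tuple, small: tuple) -> float:
    """Served-token volume beyond which the smaller model is cheaper overall.

    big and small are (n_params, train_tokens) pairs. Returns 0.0 if the
    smaller model is already cheaper to train.
    """
    (n_big, d_big), (n_small, d_small) = big, small
    if n_small >= n_big:
        raise ValueError("small must have fewer parameters than big")
    extra_training = 6.0 * (n_small * d_small - n_big * d_big)
    saving_per_token = 2.0 * (n_big - n_small)
    return max(0.0, extra_training / saving_per_token)


if __name__ == "__main__":
    big = (70e9, 1.4e12)   # compute-optimal style: 20 tokens per parameter
    small = (8e9, 15e12)   # heavily overtrained small model
    tokens = crossover_served_tokens(big, small)
    served_per_day = 10e6 * 1_000  # 10M queries/day, 1,000 tokens each (assumed)
    print(f"crossover after {tokens:.3g} served tokens")
    print(f"that is about {tokens / served_per_day:.0f} days at this traffic")
```

Past the crossover point every additional served token widens the gap in favor of the smaller model, which is the deployment-optimal argument in arithmetic form.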

FineWeb-Edu vs DCLM — two theories of quality

FineWeb-Edu (Hugging Face, arXiv:2406.17557, June 2024) applies an educational quality classifier trained on Llama-3-70B-Instruct annotations. Each document receives a score from 0 to 5 for educational value; only documents scoring 3 or higher are kept (roughly 8% of FineWeb, about 1.3T of 15T tokens). The resulting dataset is heavily biased toward Wikipedia-style expository prose, textbooks, and educational websites. Models trained on it tend to show strong factual recall and reasoning on knowledge benchmarks but are weaker on creative writing and code.

DCLM-Baseline (Li et al., arXiv:2406.11794, June 2024) uses a fastText classifier as its quality signal. The classifier is trained to separate curated, instruction-style text (OpenHermes 2.5 and high-scoring r/ExplainLikeImFive posts) from ordinary web pages, and only the top-scoring fraction of documents (about 10%) is kept. In the paper's comparisons, this classifier outperformed perplexity-based filtering. The result is a broader quality distribution that is more balanced across domains. For a domain-specific fine-tuning baseline, DCLM is often a better starting point because it has stronger coverage across technical domains.

For the JHU humanoid capstone, this chapter's data discipline applies directly to your demonstration data pipeline. Bad demonstrations contaminate vision-language-action (VLA) fine-tuning in the same way bad text contaminates pretraining: a single expert demonstrator performing 30 high-quality pick-and-place trials can outperform three novice demonstrators doing 100 trials each. The filtering and curriculum logic from pretraining carries over almost unchanged.

Data discipline for robot demonstrations

The same four-stage discipline applies to the project data for DealLens and for the humanoid capstone. For DealLens: collect (raw deal memos, term sheets, cap tables), filter (remove boilerplate, non-deal documents, corrupt OCR), dedupe (remove duplicate memos from the same funding round), curriculum (weight recent-year data more heavily in the fine-tuning mix). The stages map one-to-one.

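For the humanoid capstone: NVIDIA's Isaac GR00T synthetic data blueprint (announced with GR00T N1, March 2025) generated about 780,000 synthetic trajectories, equivalent to 6,500 hours of demonstration data, in 11 hours, and the synthetic data was then combined with real-world data for training. GR00T-Dreams (NVIDIA, May 2025) extends the idea with trajectories generated by a world model. As with web text, synthetic trajectories need filtering for quality and diversity before they are mixed with real-world data. This is pretraining-at-scale data discipline applied to robot learning: the concepts are identical, the substrate is different.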
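For demonstration data, the filter and dedupe stages might look like the sketch below. Everything in it is hypothetical: the `Episode` fields, the duration limit, and the idea of capping the number of episodes per grid cell of start and goal positions are assumptions made for illustration, not part of any NVIDIA or capstone pipeline.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """One recorded pick-and-place demonstration (hypothetical schema)."""
    episode_id: str
    success: bool
    duration_s: float
    start_xy: tuple  # object start position in metres
    goal_xy: tuple   # object goal position in metres


def filter_episodes(
    episodes: list,
    max_duration_s: float = 30.0,  # hypothetical quality threshold
    cell_m: float = 0.05,          # grid size used to detect near-identical setups
    max_per_cell: int = 2,         # cap on near-identical episodes
) -> list:
    """Keep successful, reasonably fast episodes and cap near-duplicates.

    Quality filter: drop failures and overly long (hesitant) trials.
    Diversity filter: bucket episodes by quantized start and goal position
    and keep at most max_per_cell episodes per bucket, in input order.
    """
    kept, per_cell = [], {}
    for ep in episodes:
        if not ep.success or ep.duration_s > max_duration_s:
            continue
        cell = tuple(math.floor(v / cell_m) for v in (*ep.start_xy, *ep.goal_xy))
        if per_cell.get(cell, 0) >= max_per_cell:
            continue
        per_cell[cell] = per_cell.get(cell, 0) + 1
        kept.append(ep)
    return kept


if __name__ == "__main__":
    demos = [
        Episode("e1", True, 10.0, (0.12, 0.12), (0.31, 0.31)),
        Episode("e2", True, 11.0, (0.13, 0.13), (0.32, 0.32)),
        Episode("e3", True, 9.0, (0.12, 0.13), (0.31, 0.32)),
        Episode("e4", False, 8.0, (0.40, 0.12), (0.31, 0.31)),
        Episode("e5", True, 45.0, (0.52, 0.12), (0.31, 0.31)),
        Episode("e6", True, 12.0, (0.62, 0.22), (0.31, 0.31)),
    ]
    print([ep.episode_id for ep in filter_episodes(demos)])
```

The curriculum stage would then decide how heavily to weight the surviving episodes, for example by demonstrator or by task difficulty.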

The reasoning era broke the clean curves

o1 and DeepSeek-R1 showed that post-training compute (RL on verifiable tasks) and inference-time compute (longer chains of thought) can partly substitute for pretraining tokens on reasoning tasks. The clean Chinchilla curves no longer govern the full picture: test-time compute is a new scaling axis.

Figure 7.1. The four-stage pretraining data pipeline (collect → filter → dedupe → curriculum) and the divergence between Chinchilla compute-optimal and 2025 deployment-optimal token-to-parameter ratios.

Retrieve before you continue

Three questions on what you just read

Q1 (Factual): What is the compute-optimal token-to-parameter ratio from Chinchilla, and what is the 2025 deployment-optimal ratio?

Q2 (Conceptual): Why does MinHash deduplication improve model generalization more than just reducing dataset size?

Q3 (Synthetic): How does the four-stage data pipeline from pretraining apply to fine-tuning SmolVLA on JHU humanoid demonstrations?