Chapter 06 · Tokenization is a design decision — From Tokens to Embodied Minds

The tokenizer is the first transformation your model's input goes through and the last one engineers audit when performance is poor. Tokenizer choice silently sets a ceiling on multilingual quality: a byte-level vocabulary trained primarily on English can spend two or three tokens on a single Chinese character (three bytes in UTF-8) that a multilingual vocabulary covers in one. It affects math accuracy because digit tokenization determines whether the model sees 1234 as one token or four. It affects code generation because identifiers and operators need compact representations to fit in context. Little of this shows up in the standard perplexity metric, because perplexity is normalized per token: two models with different tokenizers are not comparable on it unless the loss is renormalized per byte or per character.

The field has not converged on a single tokenization scheme. Byte-level BPE (GPT-2, GPT-4, GPT-4o, Llama 3), SentencePiece BPE with byte fallback (Llama 2, Mistral 7B), and Unigram (T5, ALBERT) are not interchangeable, and libraries such as tiktoken and SentencePiece are implementations rather than schemes in their own right. DeepSeek-V3 uses its own byte-level BPE tokenizer with a 128K vocabulary rather than an extension of the Llama tokenizer. GPT-4o moved to a roughly 200K-vocabulary byte-level scheme (o200k_base) for better multilingual compression. For DealLens processing multilingual deal memos, the choice between these affects how many tokens a 10-page Chinese VC memo consumes relative to an English one.
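The point about perplexity can be made concrete with a small sketch. The numbers below are hypothetical: the same text, the same total negative log-likelihood, and two tokenizers that split the text into different numbers of tokens. The function names are illustrative, not taken from any library.

```python
import math

def token_perplexity(total_nll_nats, n_tokens):
    """Per-token perplexity: depends on how many tokens the tokenizer produced."""
    return math.exp(total_nll_nats / n_tokens)

def bits_per_byte(total_nll_nats, text):
    """Total loss in bits divided by UTF-8 length: independent of the tokenizer."""
    n_bytes = len(text.encode("utf-8"))
    return total_nll_nats / (math.log(2) * n_bytes)

# Hypothetical setup: both models assign the same total likelihood to the text,
# but tokenizer A splits it into 10 tokens and tokenizer B into 20.
text = "tokenization matters"
total_nll = 30.0  # summed negative log-likelihood in nats (made-up value)

ppl_a = token_perplexity(total_nll, 10)
ppl_b = token_perplexity(total_nll, 20)
bpb = bits_per_byte(total_nll, text)

print(f"perplexity A={ppl_a:.2f}, B={ppl_b:.2f}, bits/byte={bpb:.3f}")
```

Model B looks far better on per-token perplexity (about 4.5 versus about 20.1) even though both models assign the text exactly the same probability. Bits per byte is identical for the two, which is why it is the safer number to compare across tokenizers.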

BPE — how the merge algorithm works

Byte Pair Encoding (Sennrich et al., arXiv:1508.07909, August 2015) starts with a character-level vocabulary and iteratively merges the most frequent pair of adjacent tokens in the training corpus. After N merges, the vocabulary contains N subword units beyond the initial character set. Encoding a new string: split it into characters, then repeatedly apply the applicable merge that was learned earliest (the highest-priority rule) until no learned merge applies. This is a priority-ordered procedure, not a longest-match one. The result is a sequence of subword tokens whose lengths balance frequency (common words become single tokens) and coverage (rare words split into known subwords).
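A minimal sketch of the merge loop and the encoder on a toy word-frequency table may help. The corpus, the tie-breaking rule, and the function names are assumptions made for illustration; production tokenizers add pre-tokenization, byte handling, and much faster data structures. The encoder here applies the earliest-learned merge first, which is how standard BPE implementations resolve competing merges.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(word_freqs, num_merges):
    """Learn an ordered list of merges from a {word: frequency} table."""
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself so the
        # result is deterministic (an arbitrary choice for this sketch).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        corpus = {merge_pair(s, best): f for s, f in corpus.items()}
    return merges

def encode(word, merges):
    """Apply the earliest-learned applicable merge until none applies."""
    rank = {pair: i for i, pair in enumerate(merges)}
    symbols = tuple(word)
    while len(symbols) > 1:
        candidates = [p for p in zip(symbols, symbols[1:]) if p in rank]
        if not candidates:
            break
        symbols = merge_pair(symbols, min(candidates, key=rank.get))
    return list(symbols)

# Toy corpus (hypothetical frequencies).
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = train_bpe(word_freqs, num_merges=4)
print(merges)
print(encode("lowest", merges))
```

With four merges on this corpus the learned rules are s+t, e+st, o+w, l+ow, so the unseen word "lowest" encodes as ["low", "est"]: two known subwords rather than six characters.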

Byte-level BPE (GPT-2, Llama 3) starts with the 256 possible byte values instead of a character vocabulary. Every possible byte sequence is representable, so there are no UNK tokens. SentencePiece-based tokenizers such as those of Llama 2 and Mistral 7B get a similar guarantee through byte fallback. The trade-off is that characters the learned merges do not cover (e.g., rare mathematical symbols, CJK extensions) fall back to their UTF-8 bytes, two to four tokens per character, increasing token count. The net effect for multilingual models is negative without a much larger vocabulary and multilingual training data, which is why GPT-4o went to roughly 200K tokens.
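The no-UNK guarantee and its cost can be checked directly with Python's UTF-8 encoder. This sketch shows only the base byte vocabulary, before any merges are applied; the sample characters are arbitrary illustrations.

```python
def utf8_byte_tokens(text):
    """Base byte-level token ids (0-255) for `text`, before any BPE merges."""
    return list(text.encode("utf-8"))

def from_byte_tokens(ids):
    """Inverse mapping: any string round-trips, so no UNK token is needed."""
    return bytes(ids).decode("utf-8")

samples = [
    "a",           # ASCII letter
    "é",           # Latin letter with accent
    "中",          # common CJK character
    "\u2211",      # N-ary summation sign
    "\U00020000",  # CJK Extension B character
]

for ch in samples:
    ids = utf8_byte_tokens(ch)
    print(f"U+{ord(ch):04X}: {len(ids)} byte token(s) {ids}")
```

The five samples cost 1, 2, 3, 3, and 4 base tokens respectively. A character only gets cheaper than its byte count if training learned merges that cover it, which depends on how often it appeared in the tokenizer's training data.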

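tiktoken's cl100k_base (used in GPT-3.5, GPT-4, and OpenAI's embedding models) is a byte-level BPE with regex pre-tokenization: a pattern first splits the text into chunks (contractions, letter runs, digit runs of at most three, punctuation runs, whitespace), and BPE merges are applied only within each chunk. This prevents merges across word boundaries and keeps code tokenization predictable. The Llama 3 tokenizer (128K vocabulary) uses the same approach: it builds on tiktoken's 100K-token vocabulary and adds about 28K tokens aimed at non-English languages.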
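The pre-tokenization step can be sketched with the standard library. The pattern below is a simplified approximation written for this example, not the actual cl100k_base pattern: the real one relies on Unicode property classes such as \p{L} that Python's built-in re module does not support, and it differs in several details.

```python
import re

# Simplified, hypothetical pattern in the spirit of tiktoken-style splitting:
# contractions | optional space + letters | 1-3 digits | punctuation | whitespace
PRE_TOKENIZE = re.compile(
    r"'(?:s|t|re|ve|m|ll|d)"   # common English contractions
    r"| ?[^\W\d_]+"            # a run of letters, with an optional leading space
    r"|\d{1,3}"                # digits in groups of at most three
    r"| ?(?:[^\w\s]|_)+"       # punctuation and underscores
    r"|\s+(?!\S)"              # whitespace not followed by a non-space character
    r"|\s+"                    # any remaining whitespace
)

def pre_tokenize(text):
    """Split text into chunks; BPE merges would then run inside each chunk only."""
    return PRE_TOKENIZE.findall(text)

code = "def add_one(x): return x+1"
chunks = pre_tokenize(code)
print(chunks)
```

Because merges never cross chunk boundaries, a token such as "x): r" cannot exist no matter how frequent that byte sequence is in the training data. Concatenating the chunks reproduces the input exactly, so the split itself loses nothing.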

Why digit tokenization matters for math

GPT-2's 50K-vocabulary tokenizer merges common digit strings: round numbers such as 1000 tend to be single tokens, while 1234 might be [123][4] or [12][34] depending on training frequency. This creates an inconsistency: the model has to do arithmetic over tokens whose boundaries do not align with place values. Llama 1 and Llama 2 (32K SentencePiece vocabulary) instead split every number into individual digits, so 0-9 each get their own token, sacrificing compression for arithmetic consistency. Llama 3's 128K tokenizer, like cl100k_base, takes a middle path and groups digits into chunks of at most three. Ablations on arithmetic benchmarks such as GSM8K are reported to favor individual-digit tokenization, with gains on the order of 3-7% on multi-step problems; treat the exact range as indicative rather than settled.
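To see what alignment with place values means, compare three ways of cutting a digit string. These helpers are illustrations written for this chapter, not the behavior of any specific model's tokenizer.

```python
import re

def split_single_digits(number):
    """One token per digit: boundaries always align with place values."""
    return list(number)

def split_left_to_right(number):
    """Chunks of up to three digits taken from the left, as a \\d{1,3} regex does."""
    return re.findall(r"\d{1,3}", number)

def split_place_aligned(number):
    """Chunks of three taken from the right, so each chunk is a thousands group."""
    head = len(number) % 3
    chunks = [number[:head]] if head else []
    chunks += [number[i:i + 3] for i in range(head, len(number), 3)]
    return chunks

for n in ["1234", "12345678"]:
    print(n, split_single_digits(n), split_left_to_right(n), split_place_aligned(n))
```

For 1234 the left-to-right scheme yields ["123", "4"], so the token "123" stands for one thousand two hundred thirty, while the place-aligned scheme yields ["1", "234"]. With single digits or place-aligned chunks, the position of a token within the number tells the model its magnitude; with left-to-right chunks it depends on the total length.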

For DealLens, numerical clauses in VC term sheets — valuations like $47.5M, equity percentages like 18.75%, and dates like 2024-Q3 — are affected by tokenizer digit handling. A tokenizer that splits 47.5 inconsistently will cause the model to see different sub-token patterns for semantically equivalent representations, degrading extraction reliability. Testing your scoring model's tokenizer behavior on representative numerical strings is a one-hour diagnostic that pays back in extraction accuracy.
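The one-hour diagnostic could start from something like the sketch below. It takes any tokenizer as a function from text to a list of string pieces and reports how a number is cut in different surrounding contexts. The toy tokenizer and its vocabulary are hypothetical stand-ins chosen to make the inconsistency visible; in practice you would pass in a wrapper around the base model's real tokenizer.

```python
TOY_VOCAB = {"$4", "47", "7.5", ".5"}  # hypothetical multi-character tokens

def toy_tokenize(text):
    """Hypothetical stand-in tokenizer: greedy longest match over TOY_VOCAB,
    falling back to single characters."""
    pieces, i = [], 0
    while i < len(text):
        for size in (3, 2):
            piece = text[i:i + size]
            if len(piece) == size and piece in TOY_VOCAB:
                break
        else:
            piece = text[i]
        pieces.append(piece)
        i += len(piece)
    return pieces

def number_segmentation(tokenize, text, number):
    """How token boundaries cut `number` at its first occurrence in `text`."""
    pieces = tokenize(text)
    assert "".join(pieces) == text, "sketch assumes pieces concatenate to the input"
    start = text.index(number)
    end = start + len(number)
    bounds, pos = [start], 0
    for piece in pieces:
        pos += len(piece)
        if start < pos < end:
            bounds.append(pos)
    bounds.append(end)
    return tuple(text[a:b] for a, b in zip(bounds, bounds[1:]))

def check_consistency(tokenize, number, templates):
    """Map each context template to the segmentation of `number` inside it."""
    return {t: number_segmentation(tokenize, t.format(n=number), number)
            for t in templates}

templates = ["{n}", "${n}M", "valued at {n} million"]
report = check_consistency(toy_tokenize, "47.5", templates)
inconsistent = len(set(report.values())) > 1

for template, cut in report.items():
    print(f"{template!r}: {cut}")
print("inconsistent:", inconsistent)
```

With this toy vocabulary, 47.5 is cut as 47|.5 on its own and after "valued at", but as 4|7.5 inside "$47.5M", because the dollar sign absorbs the leading digit. Running the same check over real valuations, percentages, and dates from your term sheets shows which formats deserve normalization before extraction.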

Vocabulary size and context economics

A larger vocabulary means fewer tokens per character — better context efficiency — but larger embedding tables (vocabulary size × d_model parameters) and output projection matrices. At 128K vocabulary and d_model=8192 (Llama 3 70B), the embedding table alone is 128K × 8192 × 2 bytes (BF16) ≈ 2.1 GB. At 200K vocabulary, this scales to 3.3 GB — meaningful at inference time on constrained hardware. The right vocabulary size is a function of the target languages, the serving hardware, and the context length.
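The arithmetic above is easy to reproduce and to rerun for other configurations. In this sketch the function name and the `tied` flag are assumptions for illustration: when the output projection does not share weights with the input embedding, the vocabulary-dependent parameters are stored twice.

```python
def embedding_bytes(vocab_size, d_model, bytes_per_param=2, tied=True):
    """Memory for the vocabulary-dependent matrices.

    bytes_per_param=2 corresponds to BF16. With tied=False the output
    projection is a separate matrix of the same shape, doubling the total.
    """
    copies = 1 if tied else 2
    return vocab_size * d_model * bytes_per_param * copies

for vocab in (128_000, 200_000):
    gb = embedding_bytes(vocab, 8192) / 1e9
    gb_untied = embedding_bytes(vocab, 8192, tied=False) / 1e9
    print(f"vocab={vocab:,}: {gb:.1f} GB embedding, {gb_untied:.1f} GB with untied output")
```

This gives about 2.1 GB at 128K and about 3.3 GB at 200K for the embedding table alone, matching the figures in the text, and twice that when a separate output projection is counted.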

For the JHU humanoid capstone, tokenizer choice is indirect but non-trivial: if you fine-tune SmolVLA or GR00T N1.5 on task-description strings (verbal instructions like 'pick up the red cup and place it on the tray'), the instruction token count determines how much of the context window is consumed by the instruction versus the action history. A tokenizer that splits 'pick up' into three tokens instead of two wastes context budget at every inference step.
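A back-of-the-envelope version of that budget, following the framing above in which instruction tokens and action history share one context window. All numbers here (window size, tokens per history step, instruction lengths) are hypothetical and should be replaced with measurements from the model you fine-tune.

```python
def history_steps_that_fit(context_window, instruction_tokens, tokens_per_step):
    """Number of whole history steps that fit after the instruction is placed."""
    remaining = context_window - instruction_tokens
    return max(0, remaining // tokens_per_step)

CONTEXT_WINDOW = 512   # hypothetical context length
TOKENS_PER_STEP = 16   # hypothetical cost of one history step

compact = history_steps_that_fit(CONTEXT_WINDOW, 12, TOKENS_PER_STEP)
verbose = history_steps_that_fit(CONTEXT_WINDOW, 20, TOKENS_PER_STEP)
print(f"12-token instruction: {compact} steps; 20-token instruction: {verbose} steps")
```

Under these made-up numbers, eight extra instruction tokens cost one full step of history (31 versus 30). The effect is small per call but applies at every inference step.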

SentencePiece and Unigram — the alternative family

SentencePiece (Kudo and Richardson, arXiv:1808.06226, August 2018) is a library that implements both BPE and Unigram tokenization directly on raw Unicode text without pre-tokenization. It is the implementation used in T5, ALBERT, and mT5. Unigram language model tokenization (Kudo, arXiv:1804.10959, April 2018) takes the opposite approach from BPE: start with a large vocabulary and prune tokens that reduce the language model likelihood the least. The result is a tokenizer that is probabilistically interpretable — each tokenization has an associated score — and supports subword regularization during training (sampling different valid tokenizations of the same word to improve robustness).
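The scoring view of Unigram tokenization can be sketched with a small Viterbi search: given a probability for each vocabulary piece, find the segmentation with the highest total log-probability. The vocabulary and probabilities below are invented for the example (and not normalized); this is not the SentencePiece implementation, which also handles training, pruning, and sampling.

```python
import math

def viterbi_segment(text, log_probs):
    """Return the highest-scoring segmentation of `text` and its log-score."""
    n = len(text)
    max_len = max(len(piece) for piece in log_probs)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs:
                score = best[start] + log_probs[piece]
                if score > best[end]:
                    best[end], back[end] = score, start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    pieces, end = [], n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1], best[n]

# Hypothetical piece probabilities; single characters act as a fallback.
probs = {"un": 0.1, "related": 0.05, "rel": 0.02, "ated": 0.02}
probs.update({ch: 0.01 for ch in "unrelated"})
log_probs = {piece: math.log(p) for piece, p in probs.items()}

pieces, score = viterbi_segment("unrelated", log_probs)
print(pieces, round(score, 3))
```

Here ["un", "related"] wins over ["un", "rel", "ated"] because its product of piece probabilities is larger. Subword regularization replaces this argmax with sampling among the scored alternatives during training.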

For production LLM work in 2025, BPE via tiktoken or the HuggingFace tokenizers library is the dominant choice. Unigram remains relevant for multilingual NLP tasks where subword regularization matters, and for any workflow that uses T5-family models as components. For DealLens, the relevant question is: what tokenizer does the base model you are fine-tuning use? The answer determines your chunking strategy, context budget calculation, and how you handle non-English deal memo passages.
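As one example of how the tokenizer feeds into chunking, here is a minimal sketch of greedy paragraph packing under a token budget. The `count_tokens` argument is a placeholder for whatever the base model's tokenizer provides; the whitespace counter used in the demonstration is a hypothetical stand-in and will undercount for real tokenizers, especially on non-English text.

```python
def chunk_paragraphs(paragraphs, count_tokens, budget):
    """Greedily pack consecutive paragraphs into chunks of at most `budget` tokens."""
    chunks, current, used = [], [], 0
    for para in paragraphs:
        cost = count_tokens(para)
        if cost > budget:
            raise ValueError("paragraph exceeds the budget; split it further first")
        if current and used + cost > budget:
            chunks.append(current)
            current, used = [], 0
        current.append(para)
        used += cost
    if current:
        chunks.append(current)
    return chunks

def whitespace_count(text):
    """Hypothetical stand-in for a real tokenizer's token count."""
    return len(text.split())

paragraphs = ["a b c", "d e", "f g h i", "j"]
chunks = chunk_paragraphs(paragraphs, whitespace_count, budget=5)
print(chunks)
```

Swapping in the real token counter is the whole point: the same memo can produce a different number of chunks under different tokenizers, and a Chinese passage may fill the budget in far fewer characters than an English one.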

The tokenizer is not the model

The tokenizer is fixed at training time and cannot be changed at fine-tuning time without retraining the embedding table. If the base model's tokenizer handles your domain poorly (e.g., financial abbreviations, chemical names, Chinese text), the fix is either training from scratch or accepting the overhead of embedding surgery: adding tokens, resizing the embedding and output matrices, and training the new rows until they are useful.
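A minimal sketch of what embedding surgery involves, using plain Python lists in place of real weight tensors. Initializing the new row as the mean of the embeddings of the pieces the old tokenizer produced is one common heuristic, assumed here for illustration; the vocabulary, vectors, and function names are hypothetical, and the new row still has to be trained before it is useful.

```python
def add_token(embeddings, vocab, new_token, tokenize):
    """Append a row for `new_token`, initialized to the mean of the rows of
    the pieces the existing tokenizer splits it into."""
    pieces = tokenize(new_token)
    rows = [embeddings[vocab[piece]] for piece in pieces]
    mean_row = [sum(column) / len(rows) for column in zip(*rows)]
    new_vocab = {**vocab, new_token: len(embeddings)}
    return embeddings + [mean_row], new_vocab

def old_tokenize(text):
    """Hypothetical existing tokenizer: fixed two-character pieces."""
    return [text[i:i + 2] for i in range(0, len(text), 2)]

# Toy 2-dimensional embedding table (made-up values).
vocab = {"EB": 0, "IT": 1, "DA": 2}
embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

new_embeddings, new_vocab = add_token(embeddings, vocab, "EBITDA", old_tokenize)
print(new_vocab["EBITDA"], new_embeddings[-1])
```

The new token gets id 3 and the starting vector [1.0, 1.0]. In a real model the same resize applies to the output projection, and every added row increases the vocabulary-dependent memory cost.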

Figure 6.1. BPE merge process (left) and vocabulary trade-offs across major tokenizers (right). Tokenizer choice is a training-time decision with irreversible downstream effects on multilingual quality, arithmetic accuracy, and context efficiency.

Retrieve before you continue

Three questions on what you just read

Q1 Factual Why does byte-level BPE guarantee zero UNK tokens, and what is the trade-off?

Q2 Conceptual Why does individual-digit tokenization (Llama 2 style) improve arithmetic accuracy compared to merged-digit tokenization (GPT-2 style)?

Q3 Synthetic For DealLens processing 10-page Chinese VC memos alongside English memos, which tokenizer would minimize token count and why?