From Tokens to Embodied Minds · Drill cards · Chapter 06
Drills
Tokenization is a design decision
10 atomic recall cards. Export to Anki and let spaced repetition do its slow work.
In Anki: File → Import, choose this TSV, set field separator to Tab, deck = Tokens to Embodied Minds · Ch 06, note type = Basic.
| Front | Back |
|---|---|
| What is the BPE merge algorithm and what determines which pairs get merged first? | BPE iteratively merges the most frequent pair of adjacent tokens in the training corpus. Pairs that appear most often across the entire corpus are merged first, producing subword units that efficiently compress common patterns. |
| What is the difference between character-level BPE and byte-level BPE? | Character-level BPE starts with a Unicode character vocabulary — rare characters may be OOV. Byte-level BPE starts with a 256-byte vocabulary — every byte sequence is representable with zero UNK tokens. |
| Why does GPT-4o use a 200K-vocabulary tokenizer? | To compress non-English text more efficiently: a larger vocabulary trained on multilingual data gives common non-English subwords their own tokens. Relative to the roughly 100K-token cl100k_base encoding, OpenAI reported about 1.4x fewer tokens for Chinese, 2x for Arabic, and roughly 3x to 4.4x for Indic languages such as Tamil, Telugu, and Gujarati. |
| What is SentencePiece and what advantage does it have over standard BPE? | SentencePiece operates directly on raw Unicode text without pre-tokenization, making it language-agnostic. It supports both BPE and Unigram tokenization and enables subword regularization — sampling multiple valid tokenizations during training for robustness. |
| How does Unigram tokenization differ from BPE in its construction approach? | BPE starts small and grows by merging frequent pairs (bottom-up). Unigram starts with a large candidate vocabulary and repeatedly prunes the tokens whose removal reduces the corpus likelihood the least (top-down). |
| Why does tokenizer choice affect the DealLens context budget for VC memos? | Tokens per page depends on the tokenizer. At 400 tokens/page a 10-page memo costs 4K tokens; at 600 tokens/page it costs 6K. The less efficient tokenizer uses 50% more tokens per memo, so about a third fewer memos fit in the same context window. |
| What is tiktoken's cl100k_base and how does it differ from a standard BPE tokenizer? | cl100k_base is a byte-level BPE encoding with a roughly 100K-token vocabulary. Before applying BPE merges it splits text with a regex into chunks (contractions, letter runs, digit runs of up to three digits, punctuation, whitespace). Merges never cross chunk boundaries, which keeps words intact and improves code tokenization by preserving common programming tokens as units. |
| Can you change the tokenizer when fine-tuning a pre-trained LLM? | Not without cost: the tokenizer is fixed at pre-training time, and each token ID indexes a learned row of the embedding table. Swapping it means either retraining the embedding table from scratch or accepting a mismatch between the new tokens and the existing embeddings. |
| Why does the Llama 3 tokenizer use a 128K vocabulary compared to Llama 2's 32K? | A 128K vocabulary improves context efficiency (fewer tokens per document) and covers code identifiers and multilingual text better. The larger embedding cost (about 2.1 GB at BF16, which matches the 8B model's untied input and output tables at hidden size 4096) is acceptable given the context window and serving hardware targeted. |
| How does tokenizer digit handling affect mathematical accuracy in LLMs? | Inconsistent digit merging (e.g., GPT-2's frequency-based digit merges) misaligns place values, so the model cannot compose arithmetic consistently. Individual-digit tokenization (as in Llama 1 and 2; Llama 3's tiktoken-style tokenizer instead groups digits in runs of up to three) gives every digit its own token, aligning place values and improving GSM8K-style arithmetic by 3-7%. |
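
Worked example for the BPE merge card: a minimal sketch of the training loop on a toy corpus. The function names (`most_frequent_pair`, `merge_pair`, `train_bpe`) are illustrative, not taken from any library. The sketch assumes whitespace-split words and character-level starting symbols; production tokenizers add pre-tokenization rules and byte-level fallback.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the adjacent symbol pair with the highest corpus-wide count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq  # weight each pair by its word's frequency
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Rewrite every word, fusing each occurrence of `pair` into one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules, most frequent pair first."""
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:  # every word is already a single symbol
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges


print(train_bpe("aab aab aac", 2))  # [('a', 'a'), ('aa', 'b')]
```

The pair `('a', 'a')` occurs three times in this corpus, so it merges first; `('aa', 'b')` then occurs twice and merges second.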
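
Worked example for the context-budget card: a small sketch of the arithmetic. The 128,000-token window is an assumed figure chosen for illustration, and `memos_per_window` is a hypothetical helper; the 400 and 600 tokens/page rates and the 10-page memo come from the card.

```python
def memos_per_window(context_tokens, pages_per_memo, tokens_per_page):
    """How many whole memos fit in one context window."""
    tokens_per_memo = pages_per_memo * tokens_per_page
    return context_tokens // tokens_per_memo


WINDOW = 128_000  # assumed context window size, for illustration only

compact = memos_per_window(WINDOW, pages_per_memo=10, tokens_per_page=400)
verbose = memos_per_window(WINDOW, pages_per_memo=10, tokens_per_page=600)

print(compact, verbose)  # 32 21
```

Under these assumptions the more compact tokenizer fits 32 memos against 21, close to the 1.5x ratio the per-page rates imply.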
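
Worked example for the digit-handling card: a simplified sketch comparing greedy left-to-right chunking of up to three digits (the digit rule in tiktoken-style regex pre-tokenizers) with one token per digit. `chunk_digits` and `single_digits` are hypothetical helpers that model only the pre-tokenization split, not the BPE merges that follow.

```python
import re


def chunk_digits(number, width=3):
    """Greedy left-to-right chunks of up to `width` digits."""
    return re.findall(r"\d{1,%d}" % width, number)


def single_digits(number):
    """One token per digit."""
    return list(number)


print(chunk_digits("1234567"))   # ['123', '456', '7']
print(chunk_digits("234567"))    # ['234', '567']
print(single_digits("1234567"))  # ['1', '2', '3', '4', '5', '6', '7']
```

With chunking, the ones digit `7` is a lone token in the first number but the tail of `567` in the second, so the same place value lands in different token positions. With single digits, the last token is always the ones place.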