From Sand to Superintelligence · Drill cards · Chapter 30
Drills
A Thought, Token by Token
10 atomic recall cards. Export to Anki and let spaced repetition do its slow work.
In Anki: File → Import, choose this TSV, set field separator to Tab, deck = Sand to Silicon · Ch 30, note type = Basic.
| Front | Back |
|---|---|
| What is a 'token' in the context of a frontier LLM? | A vocabulary unit — a subword piece — produced by a learned tokenizer like tiktoken or SentencePiece. Encoded as an integer in the range 0 to ~vocab_size. |
| Roughly how many multiplications per generated token on a frontier model? | ~10^11 — a few hundred billion. A full response of a few thousand tokens climbs into the quadrillions (~10^15). |
| Roughly how much wall-clock time per token, on a frontier system? | ~50 ms. |
| Roughly how much energy per generated token? | ~1 joule. |
| What is the embedding table's shape? | (vocab_size × hidden_dim), e.g. ~100,000 × ~16,000. |
| What three tensors does attention compute from the input? | Queries (Q), keys (K), and values (V), each via a linear projection of the input. |
| What does the QK^T matrix represent, post-softmax? | A weighted attention pattern: how much each token should attend to every other token. |
| What does the unembedding step produce? | Logits — one real number per vocabulary entry — which softmax converts into a probability distribution. |
| Why does KV caching speed up the autoregressive loop? | Previous tokens' K and V tensors are cached, so only the new token requires a fresh forward pass instead of recomputing the entire prefix. |
| Name 4 sampling rules used at the softmax-to-token step. | Argmax (greedy), top-k, top-p (nucleus), temperature-scaled sampling. |