From Sand to Superintelligence · Drill cards · Chapter 29
Drills
A Neural Network Lives in Numbers
10 atomic recall cards. Export to Anki and let spaced repetition do its slow work.
In Anki: File → Import, choose this TSV, set field separator to Tab, deck = Sand to Silicon · Ch 29, note type = Basic.
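For readers who would rather build the TSV themselves, here is a minimal sketch of the two-column, tab-separated layout the import step expects. It assumes the cards are held as (front, back) pairs. The `cards` list and the `cards_to_tsv` helper are hypothetical names for illustration, not part of the book's tooling, and only two of the ten cards are shown.

```python
import csv
import io

# Hypothetical sample: two of the ten cards, as (front, back) pairs.
cards = [
    ("How much memory do 200B parameters occupy in BF16?",
     "~400 GB, two bytes per parameter."),
    ("How many transformer blocks does a typical frontier model have?",
     "~80."),
]

def cards_to_tsv(rows):
    """Serialize (front, back) pairs as tab-separated text, one card per line."""
    buf = io.StringIO()
    # Tab delimiter matches the "field separator = Tab" import setting.
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()

tsv_text = cards_to_tsv(cards)
```

Writing `tsv_text` to a file with UTF-8 encoding should give something the Basic note type can take in as Front and Back.
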
| Front | Back |
|---|---|
| How many parameters does a frontier model typically have? | ~200 billion (the chapter’s stat; the range is 100B–1T depending on the model). |
| How much memory do 200B parameters occupy in BF16? | ~400 GB — two bytes per parameter. |
| What is the fundamental operation of a transformer linear layer? | y = Wx + b, followed by a nonlinearity (GELU, SiLU, or ReLU). |
| What is self-attention, expressed as a formula? | softmax(QKᵀ/√d)V, where d is the key dimension: a matmul of queries and keys scaled by 1/√d, a softmax, then a matmul with values. |
| How many transformer blocks does a typical frontier model have? | ~80. |
| What paper introduced the transformer’s self-attention mechanism? | ‘Attention Is All You Need’ by Vaswani et al. (2017). |
| What does BF16 sacrifice compared with FP32, and what does it keep? | It keeps FP32’s 8-bit exponent and therefore its range; it sacrifices mantissa precision, with 7 mantissa bits against FP32’s 23 (fewer even than FP16’s 10). |
| What is FP8 used for, and what did Hopper add to support it? | FP8 is the new precision floor for inference (and increasingly training); Hopper added native FP8 tensor cores. |
| What is FlashAttention? | An exact attention kernel that reorders the computation into tiles that fit in on-chip SRAM, so the full QKᵀ matrix is never materialized in HBM. |
| What is the vocabulary size and embedding dimension of a typical frontier model? | ~100,000 tokens in the vocabulary; each mapped to a ~16,000-dimensional embedding vector. |
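
Three of the cards above can be checked numerically. The following is a minimal NumPy sketch, not the chapter's own code: it reproduces the BF16 memory arithmetic, evaluates the attention formula on small random matrices, and imitates BF16 by truncating the low 16 bits of an FP32 value. The function names and the toy 4×8 shapes are assumptions for illustration, and real hardware typically rounds to nearest rather than truncating.

```python
import numpy as np

# Memory footprint: parameter count times bytes per parameter.
params = 200e9            # the chapter's ~200B figure
bytes_per_param = 2       # BF16 is 16 bits
memory_gb = params * bytes_per_param / 1e9   # decimal gigabytes

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Single-head softmax(Q K^T / sqrt(d)) V, with d the key dimension."""
    d = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # matmul of queries and keys, scaled
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights         # matmul with values

def to_bf16_truncate(x):
    """Imitate BF16 by zeroing the low 16 bits of an FP32 value (truncation)."""
    bits = np.array([x], dtype=np.float32).view(np.uint32)
    kept = bits & np.uint32(0xFFFF0000)  # sign, 8 exponent bits, 7 mantissa bits
    return float(kept.view(np.float32)[0])

# Toy shapes (assumed): 4 tokens, dimension 8.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out, weights = attention(Q, K, V)

coarse = to_bf16_truncate(1.001)   # mantissa step near 1.0 is 2**-7
large = to_bf16_truncate(3e38)     # still finite: FP32's exponent range is kept
```

Here `weights` is the small-scale analogue of the QKᵀ score matrix that FlashAttention avoids writing out in full, and `coarse` shows how little mantissa BF16 retains: any value between 1.0 and 1.0078125 truncates to 1.0.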