From Sand to Superintelligence  ·  Drill cards · Chapter 29

A Neural Network Lives in Numbers

10 atomic recall cards. Export to Anki and let spaced repetition do its slow work.

In Anki: File → Import, choose this TSV, set field separator to Tab, deck = Sand to Silicon · Ch 29, note type = Basic.
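
For readers who want to see what such a file looks like, here is a minimal sketch of writing cards as TSV with Python's standard csv module. It assumes one card per line, front then back, with no header row; the filename ch29_cards.tsv and the helper cards_to_tsv are hypothetical, and only two of the ten cards are shown.

```python
import csv
import io

# Two sample cards copied from this chapter's deck; a real export would hold all ten.
cards = [
    ("How many transformer blocks does a typical frontier model have?", "~80."),
    ("What is the vocabulary size and embedding dimension of a typical frontier model?",
     "~100,000 tokens in the vocabulary; each mapped to a ~16,000-dimensional embedding vector."),
]


def cards_to_tsv(rows):
    """Serialize (front, back) pairs as tab-separated text, one card per line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # "ch29_cards.tsv" is a hypothetical filename, not one the page specifies.
    with open("ch29_cards.tsv", "w", encoding="utf-8", newline="") as f:
        f.write(cards_to_tsv(cards))
```

Each line of the output holds exactly one tab, which is what the "field separator = Tab" setting in the import dialog expects.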

Front | Back
How many parameters does a frontier model typically have? | ~200 billion (the chapter’s stat; the range is 100B–1T depending on the model).
How much memory do 200B parameters occupy in BF16? | ~400 GB: two bytes per parameter.
What is the fundamental operation of a transformer linear layer? | y = Wx + b; in the MLP it is followed by a nonlinearity (GELU, SiLU, or ReLU).
What is self-attention, expressed as a formula? | softmax(QKᵀ/√d)V: a matmul of queries and keys scaled by 1/√d, a softmax, then a matmul with values.
How many transformer blocks does a typical frontier model have? | ~80.
What paper introduced the transformer, the architecture built on self-attention? | ‘Attention Is All You Need’ by Vaswani et al. (2017).
What does BF16 sacrifice compared with FP32, and what does it keep? | It keeps FP32’s exponent range (8 exponent bits); it sacrifices mantissa precision (7 mantissa bits versus FP32’s 23, narrower even than FP16’s 10).
What is FP8 used for, and what did Hopper add to support it? | FP8 is the new precision floor for inference (and increasingly training); Hopper added native FP8 tensor cores.
What is FlashAttention? | A kernel that rearranges the attention computation to be memory-efficient on real hardware, tiling it through fast on-chip SRAM to avoid materializing the full QKᵀ matrix in HBM.
What is the vocabulary size and embedding dimension of a typical frontier model? | ~100,000 tokens in the vocabulary; each mapped to a ~16,000-dimensional embedding vector.
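
Several of these cards reduce to arithmetic or to a single formula, so they can be checked by hand. The following is a minimal sketch in plain Python, assuming the chapter's round numbers (200B parameters, 100,000 tokens, 16,000 dimensions) and decimal gigabytes. The names weight_memory_gb and attention are illustrative, and the attention function is the naive formula from the card, not FlashAttention.

```python
import math

# Assumed figures, taken from the cards above (the chapter's round numbers).
PARAMS = 200e9          # ~200 billion parameters
VOCAB = 100_000         # ~100,000 tokens
D_MODEL = 16_000        # ~16,000-dimensional embeddings
BYTES_PER_PARAM = {"FP32": 4, "BF16": 2, "FP8": 1}


def weight_memory_gb(n_params, fmt):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[fmt] / 1e9


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """softmax(QK^T / sqrt(d)) V for small list-of-lists matrices."""
    d = len(K[0])
    out = []
    for q in Q:
        # One row of QK^T, scaled by 1/sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


if __name__ == "__main__":
    for fmt in BYTES_PER_PARAM:
        print(fmt, weight_memory_gb(PARAMS, fmt), "GB")
    print("embedding table parameters:", VOCAB * D_MODEL)
    Q = [[1.0, 0.0], [0.0, 1.0]]
    K = [[1.0, 0.0], [0.0, 1.0]]
    V = [[1.0, 2.0], [3.0, 4.0]]
    print(attention(Q, K, V))
```

The memory figures cover weights only; activations, the KV cache, and optimizer state are extra. The explicit score row inside attention is exactly the intermediate that FlashAttention avoids writing out to HBM.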