From Tokens to Embodied Minds · Drill cards · Chapter 01
Drills
Linear algebra you actually use
10 atomic recall cards. Export to Anki and let spaced repetition do its slow work.
In Anki: File → Import, choose this TSV, set field separator to Tab, deck = Tokens to Embodied Minds · Ch 01, note type = Basic.
| Front | Back |
|---|---|
| Name the five operations a transformer forward pass spends most of its time on. | Dense matmul, softmax, layer/RMS norm, residual add, and the einsum patterns of multi-head attention. |
| Why is BF16 storage with FP32 accumulate the standard training format? | BF16 keeps FP32's 8-bit exponent and therefore its range (up to ~3.4e38), so activations do not overflow; FP32 accumulators preserve precision across the long reduction axis. Pure FP16 has only a 5-bit exponent (max ~65504), so small gradients underflow to zero silently unless loss scaling is used. |
| What is the symptom of softmax overflow in a training run? | NaN loss, either at step one or at a later step when activations grow large enough to push exponentiated scores to infinity. |
| What is the subtract-max trick in softmax and why does it work? | Subtract the per-row maximum from the logits before exponentiation. The largest exponent becomes exp(0) = 1, so intermediate values stay finite. The output distribution is unchanged because the factor exp(-max) multiplies both the numerator and the denominator and cancels. |
| Why does grouped-query attention (GQA) reduce KV-cache memory? | GQA shares K and V projections across groups of query heads, reducing the number of KV heads from H to G (G < H). Since the cache stores one K and one V tensor per KV head, this cuts KV-cache memory by a factor of H/G, typically with little accuracy loss. |
| Write multi-head attention score computation as an einsum with labeled dimensions. | torch.einsum('bhqd,bhkd->bhqk', Q, K) * (1/sqrt(d_k)), where b=batch, h=head, q=query position, k=key position, d=head dimension, and d_k is the size of the d axis. |
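| Which FP8 format is typically used for the forward pass, and why? | E4M3 (4-bit exponent, 3-bit mantissa). Its extra mantissa bit gives activations and weights more precision than E5M2, at the cost of a narrower exponent range. E5M2 (5-bit exponent, 2-bit mantissa) is usually reserved for gradients in the backward pass, where the wider dynamic range matters more. |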
| What is the practical throughput gap between a naive PyTorch attention and a Triton kernel at sequence length 16K? | Typically 3-5x on an H100, driven by memory bandwidth: the naive version materializes the full N×N attention matrix in HBM, while a tiled Triton kernel computes it block by block in on-chip SRAM and never writes the full matrix to HBM. |
| Why does residual add matter architecturally despite consuming almost no FLOPs? | Residuals create gradient highways through the network; without them, gradients vanish in deep models. They also allow each layer to represent a small correction to the previous representation rather than a full re-encoding. |
| In the JHU humanoid control loop, what is the consequence of wasting 100µs in the attention kernel? | 100µs less budget for the PD controller per step. At 50Hz, control steps are 20ms each; 100µs is 0.5% of the budget and compounds into measurably rougher trajectories over a long-horizon task. |
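
The subtract-max card is easier to remember after seeing it fail once. Below is a minimal sketch in pure Python; the function names `softmax_naive` and `softmax_stable` are illustrative and do not come from the chapter. One difference from tensor libraries: `math.exp` raises `OverflowError` on overflow, where float tensors would produce `inf` and then NaN.

```python
import math

def softmax_naive(logits):
    # Exponentiates the raw logits; overflows once a logit exceeds ~709.
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(logits):
    # Subtract the row maximum so the largest exponent is exp(0) = 1.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

small = [1.0, 2.0, 3.0]
large = [1000.0, 1001.0, 1002.0]  # same gaps between logits as `small`

# On small logits both versions agree up to rounding.
print(softmax_naive(small))
print(softmax_stable(small))

# On large logits only the stable version survives.
print(softmax_stable(large))
try:
    softmax_naive(large)
except OverflowError as err:
    print("naive softmax overflowed:", err)
```

Because softmax depends only on the differences between logits, `softmax_stable(large)` and `softmax_stable(small)` return the same distribution.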