From Tokens to Embodied Minds · Drill cards · Chapter 05
Drills
The transformer block, end to end
10 atomic recall cards. Export to Anki and let spaced repetition do its slow work.
In Anki: File → Import, choose this TSV, set field separator to Tab, deck = Tokens to Embodied Minds · Ch 05, note type = Basic.
| Front | Back |
|---|---|
| Write the two residual update equations for a modern Llama-style transformer block. | x = x + Attention(RMSNorm(x)); x = x + FFN(RMSNorm(x)). Pre-norm RMSNorm before each sublayer, residual addition after each. |
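| What does RMSNorm compute compared to LayerNorm? | RMSNorm: x / RMS(x) * gamma, where RMS(x) = sqrt(mean(x^2) + eps). LayerNorm additionally subtracts the mean and adds a beta bias. RMSNorm skips the mean computation and drops the beta parameter, which makes it cheaper. The original RMSNorm paper reports running-time reductions of roughly 7% to 64% depending on the model, so treat any single speedup figure as workload-dependent. |
| What is the SwiGLU FFN formula? | FFN(x) = W_2 (SiLU(W_1 x) ⊙ W_3 x), where SiLU(z) = z * sigmoid(z). In Llama naming, W_1 is the gate projection, W_3 the up projection, and W_2 the down (output) projection. The SiLU-activated gate multiplies the linear up projection elementwise before the output projection. |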
| Where in the transformer block is RoPE applied? | Inside the attention sublayer, applied to the Q and K vectors before computing the dot product. NOT applied to the input embeddings or the residual stream. |
| Why does RoPE enable length generalization better than learned absolute positional embeddings? | RoPE encodes position by rotating Q and K, so the positional part of the QK dot product depends only on the offset (pos_q - pos_k), not on absolute positions. Learned absolute embeddings have no trained vector for positions beyond the training length. With RoPE, rescaling the rotation frequencies (NTK-aware base scaling, YaRN) extends the context window with little or no fine-tuning instead of full retraining. |
| Why were bias parameters removed from modern transformer blocks? | At scale, biases in the linear projections add very little representational capacity, and dropping them simplifies the block. PaLM reported that removing biases from dense layers and norms improved training stability for large models. The parameter and communication savings are real but small, since biases are a tiny fraction of total parameters. |
| What is the SwiGLU hidden dimension expansion ratio and why is it different from the standard 4x? | 8/3 × d_model (approximately 2.67x). This keeps total parameter count similar to a 4x GELU FFN since SwiGLU uses three weight matrices (W_1, W_2, W_3) instead of two. |
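| Why did ALiBi position encoding not survive to 2025 production models despite strong long-context results? | ALiBi adds a linear penalty proportional to the query-key distance (i - j) to the attention scores. That bias is relative, like RoPE, so it does not by itself break KV caching. The commonly cited reasons are practical: the additive bias needs attention kernels that support it, the recency penalty down-weights distant tokens and can hurt long-range retrieval, and the ecosystem standardized on RoPE, for which context-extension methods (position interpolation, NTK scaling, YaRN) were developed. |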
| In the Llama 3 block, is there a bias term in the Q, K, V projection matrices? | No. The Q, K, V, and O projection matrices are all bias-free linear transformations. Mistral and Falcon make the same choice. Qwen 2.5 is a notable exception: it keeps a bias on the Q, K, V projections. |
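| Which exact components are frozen vs fine-tuned when applying LoRA to GR00T N1.5's VLM backbone? | Standard LoRA recipe (check the GR00T N1.5 fine-tuning config for its exact target modules): freeze all base model weights and train only low-rank adapters (rank typically 8-64) added to attention projections such as W_Q, W_K, W_V, W_O. The RMSNorm gains, SwiGLU projections, and embedding layer stay frozen in typical default configs. For domain-shifted data, consider extending adapters to the SwiGLU projections or unfreezing those components. |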
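
Worked sketch for the block cards. The cards on the residual equations, RMSNorm, SwiGLU, and RoPE placement can be checked against a tiny pure-Python pre-norm block. This is a minimal illustration under stated assumptions, not any model's reference code: it uses a single attention head, consecutive-pair RoPE, and hypothetical helper names (`matvec`, `block`, `init_params`) with random toy weights.

```python
import math
import random

def matvec(W, x):
    """Bias-free linear map; W is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def rms_norm(x, gamma, eps=1e-6):
    """x / RMS(x) * gamma: no mean subtraction and no beta, unlike LayerNorm."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * g for v, g in zip(x, gamma)]

def silu(z):
    return z / (1.0 + math.exp(-z))  # same as z * sigmoid(z)

def swiglu_ffn(x, W1, W3, W2):
    """W2 (SiLU(W1 x) * W3 x); W1 = gate, W3 = up, W2 = down (Llama naming)."""
    gate = [silu(g) for g in matvec(W1, x)]
    up = matvec(W3, x)
    return matvec(W2, [g * u for g, u in zip(gate, up)])

def rope(v, pos, base=10000.0):
    """Rotate each pair (v[i], v[i+1]) by the angle pos * base**(-i/d)."""
    d = len(v)  # assumed even
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [v[i] * c - v[i + 1] * s, v[i] * s + v[i + 1] * c]
    return out

def causal_attention(xs, Wq, Wk, Wv, Wo):
    """Single-head causal attention; RoPE touches Q and K only, never V."""
    qs = [rope(matvec(Wq, x), t) for t, x in enumerate(xs)]
    ks = [rope(matvec(Wk, x), t) for t, x in enumerate(xs)]
    vs = [matvec(Wv, x) for x in xs]
    d = len(qs[0])
    outs = []
    for t, q in enumerate(qs):
        # Token t attends to positions 0..t only (causal mask).
        scores = [sum(a * b for a, b in zip(q, ks[j])) / math.sqrt(d)
                  for j in range(t + 1)]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        ctx = [sum(w[j] / z * vs[j][i] for j in range(t + 1)) for i in range(d)]
        outs.append(matvec(Wo, ctx))
    return outs

def block(xs, p):
    """x = x + Attention(RMSNorm(x)); x = x + FFN(RMSNorm(x))."""
    normed = [rms_norm(x, p["g_attn"]) for x in xs]
    attn = causal_attention(normed, p["Wq"], p["Wk"], p["Wv"], p["Wo"])
    xs = [[a + b for a, b in zip(x, o)] for x, o in zip(xs, attn)]
    ffn = [swiglu_ffn(rms_norm(x, p["g_ffn"]), p["W1"], p["W3"], p["W2"])
           for x in xs]
    return [[a + b for a, b in zip(x, f)] for x, f in zip(xs, ffn)]

def init_params(d_model, d_hidden, seed=0):
    """Random toy weights (hypothetical initialisation, for illustration only)."""
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {
        "g_attn": [1.0] * d_model, "g_ffn": [1.0] * d_model,
        "Wq": mat(d_model, d_model), "Wk": mat(d_model, d_model),
        "Wv": mat(d_model, d_model), "Wo": mat(d_model, d_model),
        "W1": mat(d_hidden, d_model), "W3": mat(d_hidden, d_model),
        "W2": mat(d_model, d_hidden),
    }

params = init_params(d_model=6, d_hidden=16)  # 16 = (8/3) * 6
rng = random.Random(1)
tokens = [[rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(4)]
out = block(tokens, params)
print(len(out), len(out[0]))
```

Note that position enters only through `rope` inside `causal_attention`; the residual stream `xs` itself never receives a positional term.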
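
Worked sketch for the RoPE card. The claim that the rotated QK score depends only on the offset pos_q - pos_k can be checked numerically. This is a small sketch, assuming consecutive-pair rotation and the common base of 10000; the vectors `q` and `k` are arbitrary illustrative values.

```python
import math

def rope(v, pos, base=10000.0):
    """Rotate each pair (v[i], v[i+1]) by the angle pos * base**(-i/d)."""
    d = len(v)  # assumed even
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [v[i] * c - v[i + 1] * s, v[i] * s + v[i + 1] * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]   # illustrative query vector
k = [1.1, 0.4, -0.6, 0.9]   # illustrative key vector

# Same offset (3) at very different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 1005), rope(k, 1002))

# A different offset (2) at nearby absolute positions.
score_other = dot(rope(q, 5), rope(k, 3))

print(abs(score_near - score_far) < 1e-9)    # equal up to float rounding
print(abs(score_near - score_other) > 1e-6)  # a different offset changes the score
```

The score still depends on the content of `q` and `k`; only its positional component is a function of the offset.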
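
Worked arithmetic for the 8/3 card. The parameter-parity argument can be verified directly. This sketch assumes d_model = 4096 and a round-up of the hidden width to a multiple of 256, a rounding step modeled on the Llama reference code; the function names are hypothetical.

```python
def gelu_ffn_params(d_model, ratio=4):
    """Two matrices: d_model -> ratio*d_model and back."""
    return 2 * d_model * (ratio * d_model)

def swiglu_hidden(d_model, multiple_of=256):
    """About (8/3) * d_model, rounded up to a multiple of `multiple_of`."""
    hidden = int(8 * d_model / 3)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)

def swiglu_ffn_params(d_model, multiple_of=256):
    """Three matrices (W_1, W_3 up to hidden; W_2 back down)."""
    return 3 * d_model * swiglu_hidden(d_model, multiple_of)

d_model = 4096
hidden = swiglu_hidden(d_model)
gelu = gelu_ffn_params(d_model)
swiglu = swiglu_ffn_params(d_model)
print(hidden, gelu, swiglu, round(swiglu / gelu, 4))
```

Without rounding, 3 × (8/3) d² = 8 d² exactly matches 2 × 4 d² = 8 d²; the rounding step adds under 1% here.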