The transformer block, end to end

10 atomic recall cards. Export to Anki and let spaced repetition do its slow work.

In Anki: File → Import, choose this TSV, set field separator to Tab, deck = Tokens to Embodied Minds · Ch 05, note type = Basic.

Front	Back
Write the two residual update equations for a modern Llama-style transformer block.	x = x + Attention(RMSNorm(x)); x = x + FFN(RMSNorm(x)). Pre-norm RMSNorm before each sublayer, residual addition after each.
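What does RMSNorm compute compared to LayerNorm?	RMSNorm: x / RMS(x) * gamma, where RMS(x) = sqrt(mean(x^2) + eps). LayerNorm additionally subtracts the mean before scaling and adds a beta bias. RMSNorm skips the mean computation and drops the beta parameter, so it is cheaper; the original RMSNorm paper reports running-time reductions of 7% to 64% depending on the model.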
What is the SwiGLU FFN formula?	FFN(x) = W_2 (SiLU(W_1 x) ⊙ W_3 x), where SiLU(z) = z * sigmoid(z). In Llama's naming, W_1 is the gate projection, W_3 the up projection, and W_2 the down (output) projection. The SiLU-activated gate multiplies the plain linear projection elementwise before the output projection.
Where in the transformer block is RoPE applied?	Inside the attention sublayer, applied to the Q and K vectors before computing the dot product. NOT applied to the input embeddings or the residual stream.
Why does RoPE enable length generalization better than learned absolute positional embeddings?	RoPE encodes relative position in the QK dot product via rotation: the dot product depends only on the offset (pos_q - pos_k), not on absolute positions. Rescaling the rotation frequencies (NTK-aware base scaling, YaRN) extends the context window with a short fine-tune or none at all, instead of full retraining.
Why were bias parameters removed from modern transformer blocks?	At scale, biases appear to add little: the weight matrices and norm gain parameters carry nearly all of the representational capacity, while biases still add parameters and communication overhead in distributed training. The PaLM authors also reported that removing biases improved training stability for large models.
What is the SwiGLU hidden dimension expansion ratio and why is it different from the standard 4x?	8/3 × d_model (approximately 2.67x). This keeps total parameter count similar to a 4x GELU FFN since SwiGLU uses three weight matrices (W_1, W_2, W_3) instead of two.
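Why did ALiBi position encoding not survive to 2025 production models despite strong long-context results?	ALiBi adds a linear penalty to attention scores proportional to the query-key distance, -m * |i - j|, with a fixed slope m per head. The bias is relative, not absolute, so it does not itself conflict with KV caching. Commonly cited reasons for its decline: the built-in recency bias suppresses attention to distant tokens, which hurts long-range retrieval, and the additive bias was not supported by early fused attention kernels. RoPE, combined with context-extension methods such as NTK scaling and YaRN, became the default instead.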
In the Llama 3 block, is there a bias term in the Q, K, V projection matrices?	No. The Q, K, V, and O projection matrices are all bias-free linear transformations. This is a deliberate design choice shared by Llama 3, Mistral, and Falcon. Qwen 2.5 is an exception: it keeps a bias on the Q, K, and V projections.
Which exact components are frozen vs fine-tuned when applying LoRA to GR00T N1.5's VLM backbone?	Standard LoRA: freeze all base model weights and train only the low-rank adapters (rank 8-64) added to W_Q, W_K, W_V, W_O. The RMSNorm gains, SwiGLU projections, and embedding layer stay frozen in typical default configs; for domain-shifted data, consider unfreezing them or extending adapters to them.