Chapter 05 · The transformer block, end to end — From Tokens to Embodied Minds

A modern transformer block is not the 2017 original. It has been quietly refactored across a dozen production systems into a stable canonical form: pre-norm with RMSNorm, grouped-query attention (GQA), a SwiGLU FFN, RoPE applied to the queries and keys, and no biases. This design appears, with minor variations, in Llama 3, Mistral 7B, and Qwen 2.5 (which keeps biases on its Q, K, and V projections), and as the VLM backbone of GR00T N1.5. Falcon is a near relative rather than an exact match: it shares RoPE and reduced-KV-head attention but keeps LayerNorm and a GELU FFN. Once you internalize the template, every new architecture paper reduces to a diff against it, and the diff is usually two or three sentences. The changes from the 2017 Transformer are not random improvements: each one addresses a specific failure mode or cost. Pre-norm fixed gradient instability in deep stacks. RMSNorm dropped LayerNorm's mean subtraction and bias, keeping only the RMS rescaling; the RMSNorm paper reports running-time reductions of 7% to 64% depending on the model, with comparable accuracy. SwiGLU replaced the ungated ReLU/GELU FFN with a gated one; the gate adds a third weight matrix, so the hidden dimension is shrunk to keep the parameter count roughly constant. RoPE replaced absolute positional embeddings so that attention scores depend on relative position, which is what makes context-extension methods practical. Removing biases drops parameters that contribute little at scale.

Block anatomy — the canonical form

The exact sequence in a Llama 3 block: (1) x = x + Attention(RMSNorm(x)), (2) x = x + FFN(RMSNorm(x)). That is two residual additions and two pre-norm applications per block, with GQA attention and a SwiGLU FFN. RoPE is applied inside the attention sublayer, to the query and key vectors after their projections, not to the block input. There are no bias parameters anywhere: not in the attention projections, not in the FFN linear layers, and not in the norm, which keeps a gain but no shift. The no-bias design rests on empirical evidence rather than theory: PaLM, for example, reported that removing biases improved training stability for large models, and in practice the weight matrices and the norm's gain parameters absorb what the biases would have learned.
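The two-step recipe above can be sketched in a few lines. This is a minimal pure-Python illustration, not Llama's implementation: `attention` and `ffn` are hypothetical placeholder callables, the `eps` value is an assumed default, and tokens are plain lists of floats.

```python
import math
from typing import Callable, List

Vector = List[float]
Tokens = List[Vector]


def rms_norm(x: Vector, gain: Vector, eps: float = 1e-6) -> Vector:
    """Scale x by 1/RMS(x), then by a per-channel gain. No mean, no bias."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)  # eps is an assumed default
    return [g * v * inv_rms for g, v in zip(gain, x)]


def pre_norm_block(
    xs: Tokens,
    attention: Callable[[Tokens], Tokens],  # placeholder: mixes across tokens
    ffn: Callable[[Vector], Vector],        # placeholder: acts per token
    attn_gain: Vector,
    ffn_gain: Vector,
) -> Tokens:
    # (1) x = x + Attention(RMSNorm(x)); attention sees the whole sequence.
    normed = [rms_norm(x, attn_gain) for x in xs]
    xs = [[a + b for a, b in zip(x, d)] for x, d in zip(xs, attention(normed))]
    # (2) x = x + FFN(RMSNorm(x)); the FFN is applied to each token on its own.
    xs = [[a + b for a, b in zip(x, ffn(rms_norm(x, ffn_gain)))] for x in xs]
    return xs


def zero_attention(seq: Tokens) -> Tokens:
    return [[0.0] * len(tok) for tok in seq]


def zero_ffn(tok: Vector) -> Vector:
    return [0.0] * len(tok)


tokens = [[1.0, -2.0, 3.0, 0.5], [0.0, 4.0, -1.0, 2.0]]
gain = [1.0] * 4
out = pre_norm_block(tokens, zero_attention, zero_ffn, gain, gain)
```

With both sublayers silenced, the block returns its input unchanged, which is the residual stream acting as an identity path. A real block would pass a GQA attention function and a SwiGLU FFN in place of the two placeholders.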

RMSNorm replaces LayerNorm's operation of (x - mean(x)) / std(x) * gamma + beta with just x / RMS(x) * gamma, where RMS(x) = sqrt(mean(x^2)). The mean subtraction is dropped and beta is dropped; the gain gamma is retained. The justification (Zhang and Sennrich, arXiv:1910.07467, October 2019) is that centering the activations is not necessary for training stability: the RMS rescaling is the load-bearing operation. In practice, RMSNorm converges as well as LayerNorm while being cheaper. The paper reports running-time reductions of 7% to 64% depending on the model and framework, so any single headline speedup figure should be treated as workload-specific.
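To make the difference between the two formulas concrete, here is a small side-by-side sketch in plain Python. The `eps` term and its placement inside the square root are assumptions borrowed from common implementations; the text above does not specify them.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-6):
    """(x - mean(x)) / std(x) * gamma + beta, per the LayerNorm formula."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]


def rms_norm(x, gamma, eps=1e-6):
    """x / RMS(x) * gamma, with RMS(x) = sqrt(mean(x^2)). No centering, no beta."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v * inv_rms for v, g in zip(x, gamma)]


gamma = [1.0, 1.0, 1.0, 1.0]
beta = [0.0, 0.0, 0.0, 0.0]

# A zero-mean input: with beta = 0 the two norms give the same result.
x = [2.0, -1.0, 0.5, -1.5]
ln_x = layer_norm(x, gamma, beta)
rn_x = rms_norm(x, gamma)

# A shifted input: LayerNorm removes the shift, RMSNorm does not.
shifted = [v + 10.0 for v in x]
ln_shifted = layer_norm(shifted, gamma, beta)
rn_shifted = rms_norm(shifted, gamma)
```

On zero-mean activations the two operations coincide, so the only thing RMSNorm gives up is invariance to a constant shift of the input, which is the property the paper argues is not needed.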

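SwiGLU (Shazeer, arXiv:2002.05202, February 2020) defines the FFN as: FFN(x) = W_2 (SiLU(W_1 x) ⊙ W_3 x), where SiLU(z) = z * sigmoid(z). Here W_1 is the gate projection, W_3 the up projection, and W_2 the down projection, following the Llama naming. The gating mechanism (the ⊙ product) lets the network modulate each hidden unit's contribution before the output projection. Because the gate adds a third weight matrix, the hidden dimension uses an 8/3 expansion ratio instead of the standard 4x, which keeps the total parameter count comparable to an ungated FFN. Llama 3 8B departs from this and uses a wider 3.5x hidden dimension (14336 for a model width of 4096). In Shazeer's matched-compute ablations on T5, the gated variants outperformed ReLU and GELU FFNs, and PaLM and Llama adopted SwiGLU on that basis.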
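The formula and the 8/3 parameter-budget argument can both be checked with a short sketch. The matrices are plain nested lists with rows as output units, the W_1/W_3/W_2 naming follows the Llama convention as an assumption, and the model width of 4096 is only an illustrative value.

```python
import math


def silu(z):
    """SiLU(z) = z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def swiglu_ffn(x, W1, W3, W2):
    """FFN(x) = W2 (SiLU(W1 x) * W3 x), with * taken element-wise."""
    gate = [silu(z) for z in matvec(W1, x)]  # gate projection
    up = matvec(W3, x)                       # up projection
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(W2, hidden)                # down projection


def ffn_params(d_model, d_hidden, n_matrices):
    """Weight count for a bias-free FFN with n_matrices of size d_model x d_hidden."""
    return n_matrices * d_model * d_hidden


d_model = 4096  # illustrative width
classic_params = ffn_params(d_model, 4 * d_model, n_matrices=2)      # ungated, 4x
swiglu_params = ffn_params(d_model, 8 * d_model // 3, n_matrices=3)  # gated, 8/3
ratio = swiglu_params / classic_params
```

Three matrices at 8/3 of the model width cost almost exactly the same as two matrices at 4x, which is why the gate is effectively free in parameter terms. Real models typically round the hidden size up to a hardware-friendly multiple, so the match is approximate.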

RoPE — the only position encoding that survived

Rotary position embeddings (Su et al., arXiv:2104.09864, April 2021) encode position by rotating the Q and K vectors by a position-dependent angle before computing the dot product. The key property: for a given query and key, the attention score depends on their positions only through the relative offset (position_q - position_k), not through the absolute positions. This makes RoPE amenable to context extension: methods such as NTK-aware scaling and YaRN rescale the rotation frequencies to reach longer contexts with little or no additional fine-tuning, rather than retraining the full model.
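The relative-offset property is easy to verify numerically. The sketch below treats each pair of channels as one complex number and rotates it by position times a per-pair frequency. The base of 10000 and the pairing of channels are assumptions for illustration; real implementations differ in how they pair channels.

```python
import cmath


def rope(vec, position, base=10000.0):
    """Rotate each complex channel pair by position * theta_i.

    vec is a list of complex numbers, one per 2-D channel pair.
    theta_i = base ** (-i / n_pairs), i.e. base ** (-2i / d) for d = 2 * n_pairs.
    """
    n_pairs = len(vec)
    rotated = []
    for i, z in enumerate(vec):
        theta = base ** (-i / n_pairs)
        rotated.append(z * cmath.exp(1j * position * theta))
    return rotated


def score(q, k):
    """Real dot product of the underlying 2-D pairs: Re(sum q_i * conj(k_i))."""
    return sum((a * b.conjugate()).real for a, b in zip(q, k))


q = [complex(1.0, 2.0), complex(-0.5, 0.3)]
k = [complex(0.7, -1.0), complex(2.0, 0.1)]

# Same offset (3) at very different absolute positions.
near = score(rope(q, 5), rope(k, 2))
far = score(rope(q, 103), rope(k, 100))

# Offset 0 recovers the unrotated score.
same_pos = score(rope(q, 5), rope(k, 5))
unrotated = score(q, k)
```

Here `near` and `far` agree to floating-point precision even though the absolute positions differ by about 100, while changing the offset changes the score. Content still matters: the property concerns how position enters the score, not whether q and k do.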

The alternatives that did not survive: Learned absolute positional embeddings (GPT-2, BERT) do not generalize beyond the training length. ALiBi (Press et al., arXiv:2108.12409, August 2021) adds a per-head linear distance penalty to the attention logits; it extrapolates well on perplexity, but the bias has to be supported inside the fused attention kernel, and the major open-weight model families standardized on RoPE instead. T5's relative position bias has the same kernel-support problem and is slower to implement efficiently in CUDA. RoPE dominates because it is cheap to compute (element-wise complex multiplications), composes with FlashAttention because it only modifies Q and K before the kernel runs, and supports the YaRN/NTK extensions used for long-context deployment.

RoPE is applied to the query and key vectors after their projections inside the attention sublayer, in every block, and not at the token embedding layer. This is a common point of confusion. Injecting position into the input embeddings would force position information to propagate through every attention and FFN computation, and the relative-offset property would no longer hold by construction. Rotating only Q and K injects position directly into the similarity computation where it matters, and leaves V and the residual stream free of explicit position terms.

Why pre-norm beats post-norm

Post-norm (the 2017 original) applies LayerNorm after each sublayer: x = LayerNorm(x + Sublayer(x)). The gradient of the loss with respect to early-layer weights must flow through every LayerNorm in sequence, because the norm sits on the residual path itself. In deep networks (48+ layers), these rescalings compound and make the gradient scale depth-dependent, which is why post-norm models need careful learning-rate warmup. Pre-norm (Wang et al., arXiv:1906.01787, June 2019) applies the norm before the sublayer: x = x + Sublayer(Norm(x)). The residual stream now bypasses the norm, providing a direct gradient path from output to input. This largely removes the depth-dependence of the gradient scale and reduces, and in many configurations eliminates, the need for warmup.
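A toy comparison makes the "residual stream bypasses the norm" point visible. This sketch is a deliberate simplification: both layouts use the same gain-free RMS normalization so that only the placement of the norm differs, and `silent` is a hypothetical sublayer that contributes nothing.

```python
import math


def rms_normalize(x, eps=1e-6):
    """Gain-free RMS normalization, used for both layouts in this sketch."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms for v in x]


def post_norm_layer(x, sublayer):
    # 2017 layout: the norm sits on the residual path itself.
    return rms_normalize([a + b for a, b in zip(x, sublayer(x))])


def pre_norm_layer(x, sublayer):
    # Canonical layout: the norm only feeds the sublayer; the stream bypasses it.
    return [a + b for a, b in zip(x, sublayer(rms_normalize(x)))]


def run_stack(x, layer, sublayer, depth):
    """Apply the same layer function depth times."""
    for _ in range(depth):
        x = layer(x, sublayer)
    return x


def silent(h):
    """Hypothetical sublayer that outputs zeros, isolating the residual path."""
    return [0.0] * len(h)


x = [3.0, -4.0, 0.0, 5.0]
pre_out = run_stack(x, pre_norm_layer, silent, depth=48)
post_out = run_stack(x, post_norm_layer, silent, depth=48)

# Scale the input by 10 and run both stacks again.
x_big = [10.0 * v for v in x]
pre_big = run_stack(x_big, pre_norm_layer, silent, depth=48)
post_big = run_stack(x_big, post_norm_layer, silent, depth=48)
```

Through 48 pre-norm layers the input arrives untouched, so the map from input to output contains an exact identity term. Through 48 post-norm layers the input is renormalized at every step and its original scale is erased, which is the forward-pass counterpart of gradients having to pass through every norm on the way back.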

The practical implication for fine-tuning: when fine-tuning GR00T N1.5 on your own humanoid demonstrations, the pre-norm design tends to tolerate a more aggressive learning rate before gradients explode. For DealLens, when adding a LoRA adapter on top of a frozen Llama 3 backbone, the pre-norm architecture means the adapter gradients reach every layer through the residual stream without passing through a chain of norm rescalings. This is a tendency, not a guarantee, so a short learning-rate sweep is still worthwhile.

This block is inside every model you will deploy

Eagle 2.5, the VLM inside GR00T N1.5 (NVIDIA Research, June 2025), uses this canonical block structure. The DiT action model uses a variant adapted for continuous action token prediction rather than discrete text. OpenVLA's Llama 2 7B backbone is this block with one difference: it uses full multi-head attention rather than GQA. SmolVLA-450M is this block at reduced width and depth. When you fine-tune any of these models on JHU humanoid demonstration data, you are fine-tuning a stack of these blocks. A common LoRA configuration targets the W_Q, W_K, W_V, and W_O projections within the attention sublayer, leaving the FFN and norm layers frozen. The target set is a configuration choice, so check the defaults of the fine-tuning recipe you use.
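What "a LoRA adapter on W_Q" means can be shown with a minimal sketch. The class below is a hypothetical illustration in plain Python, not the API of any LoRA library. The rank, alpha, and initialization scale are assumed values, and the projection shapes are Llama-3-8B-like assumptions (model width 4096, with K and V projecting to 8 KV heads of width 128).

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B A."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen base projection, d_out x d_in
        # A starts small and random, B starts at zero, so the update is zero at init.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def __call__(self, x):
        base = matvec(self.weight, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]


def lora_params(d_in, d_out, rank):
    """Trainable weights added by one adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


# Assumed Llama-3-8B-like shapes, as (d_in, d_out) per targeted projection.
D_MODEL, KV_DIM, RANK = 4096, 8 * 128, 16
ATTENTION_TARGETS = {
    "W_Q": (D_MODEL, D_MODEL),
    "W_K": (D_MODEL, KV_DIM),
    "W_V": (D_MODEL, KV_DIM),
    "W_O": (D_MODEL, D_MODEL),
}
per_block = sum(lora_params(i, o, RANK) for i, o in ATTENTION_TARGETS.values())
```

Because B starts at zero, the adapted layer reproduces the frozen projection exactly at the first step, and training only moves the low-rank factors. Under these assumed shapes the four attention adapters add a few hundred thousand trainable weights per block, against tens of millions of frozen ones.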

Knowing the block anatomy also tells you what to profile first when debugging latency: on an Orin AGX with a quantized SmolVLA, the attention sublayer (dominated by the GQA projections and KV-cache reads) accounts for roughly 60% of forward-pass latency at the sequence lengths used in VLA inference. Treat that figure as a starting hypothesis: a single torch.profiler run at sublayer granularity will confirm or correct it for your model and sequence length.

DeepSeek-V3's MLA differs from GQA

Multi-Head Latent Attention (MLA), introduced in DeepSeek-V2 (arXiv:2405.04434) and retained in DeepSeek-V3 (arXiv:2412.19437), compresses keys and values into a shared low-rank latent vector per token rather than grouping heads. Only the latent is cached, so MLA achieves a smaller KV cache than GQA at the cost of a more complex implementation. GQA remains the easier production choice for most teams.
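A back-of-the-envelope comparison shows what "compressing KV into a latent" buys at inference time. All numbers below are assumptions for illustration: the head counts are Llama-3-8B-like, and the latent width of 512 plus a separate 64-wide positional key are values believed to match the DeepSeek reports but not confirmed here. Verify them against the papers before relying on them.

```python
def kv_elements_per_token(n_kv_heads, head_dim):
    """Cached elements per token per layer when K and V are stored per KV head."""
    return 2 * n_kv_heads * head_dim


def mla_elements_per_token(d_latent, d_rope):
    """Cached elements per token per layer when only a shared latent is stored.

    d_rope is an assumed extra positional key cached alongside the latent.
    """
    return d_latent + d_rope


# Assumed, illustrative shapes.
N_HEADS, N_KV_HEADS, HEAD_DIM = 32, 8, 128
D_LATENT, D_ROPE = 512, 64

mha_cache = kv_elements_per_token(N_HEADS, HEAD_DIM)     # every head keeps K and V
gqa_cache = kv_elements_per_token(N_KV_HEADS, HEAD_DIM)  # heads share 8 KV groups
mla_cache = mla_elements_per_token(D_LATENT, D_ROPE)     # one latent per token

print(f"MHA: {mha_cache}  GQA: {gqa_cache}  MLA: {mla_cache} elements/token/layer")
```

GQA shrinks the cache by reducing how many K/V heads exist, while MLA shrinks it by storing a compressed representation and reconstructing per-head keys and values on the fly. That reconstruction step is where the extra implementation complexity comes from.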

Figure 5.1 The canonical modern transformer block: pre-norm RMSNorm, GQA with RoPE, SwiGLU FFN, two residual additions, no biases. Most production LLMs in 2024-2025 can be read as a diff against this template.

Retrieve before you continue

Three questions on what you just read

Q1 Factual What is the exact operation sequence in a Llama 3 transformer block?

Q2 Conceptual Why does pre-norm improve training stability over post-norm for deep networks?

Q3 Synthetic When applying LoRA to fine-tune GR00T N1.5's VLM backbone on humanoid demonstration data, which weight matrices in the transformer block should the LoRA adapters target?