Chapter 04 · Attention from scratch — From Tokens to Embodied Minds

Self-attention is a soft, content-addressable memory: each query token scores every key token and retrieves a weighted mix of the corresponding values. The standard form, softmax(QK^T / sqrt(d_k)) V, is differentiable end to end and reduces to a few dense matrix multiplications, which is a large part of why it became the default. The 1/sqrt(d_k) term is not an empirical hack; it follows from the variance of random dot products in high dimensions: if the components of q and k are independent with zero mean and unit variance, q·k has variance d_k, and dividing by sqrt(d_k) brings the variance back to 1 so the softmax does not saturate. The multi-head wrapper exists because a single head produces only one attention pattern per query; running h heads over separately learned projections lets the model attend to several kinds of relationship at once. The four production variants, full attention, causal attention, sliding-window (Mistral, Gemma), and grouped-query attention (Llama 3, Qwen 2.5), are for the most part not modelling improvements. They are engineering choices driven by KV-cache memory economics and inference latency targets. Understanding why each exists requires understanding the KV-cache, not the math of attention.
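The variance claim is easy to check empirically. The sketch below assumes query and key components are independent standard normal draws, which is an idealization and not a property of trained models; the function name and sample count are illustrative.

```python
import numpy as np

def dot_product_variance(d_k, n_samples=20000, seed=0):
    """Empirical variance of q.k before and after dividing by sqrt(d_k)."""
    rng = np.random.default_rng(seed)
    # Assumption: i.i.d. standard normal components for q and k.
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    raw = np.sum(q * k, axis=1)       # one dot product per sample
    scaled = raw / np.sqrt(d_k)       # the attention scaling
    return raw.var(), scaled.var()

for d_k in (16, 64, 256):
    raw_var, scaled_var = dot_product_variance(d_k)
    print(f"d_k={d_k}: raw variance {raw_var:.1f}, scaled variance {scaled_var:.2f}")
```

Under these assumptions the raw variance should grow roughly in proportion to d_k while the scaled variance stays near 1, which is what keeps the softmax inputs in a usable range as heads get wider.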

Q, K, V — the lookup mechanism

The QKV projection takes the matrix of token embeddings X, of shape (seq_len, d_model), and produces three matrices: Q = XW_Q, K = XW_K, V = XW_V. Q and K have shape (seq_len, d_k) and V has shape (seq_len, d_v); in the single-head case d_k = d_v = d_model. The scaled dot-product attention is then: A = softmax(QK^T / sqrt(d_k)) V, where A has shape (seq_len, d_v). The interpretation: QK^T computes the compatibility of each query with each key (an unnormalized similarity matrix of shape seq×seq). Softmax normalizes across the key dimension, so each row of attention weights sums to 1. V is the content retrieved: each output token is a weighted sum of all value vectors.
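A minimal NumPy sketch of the lookup described above may help make the shapes concrete. The dimensions and random weights are placeholders chosen for illustration, and the helper names are not from any particular library.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row maximum for numerical stability; the result is unchanged.
    scores = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) compatibility matrix
    weights = softmax(scores, axis=-1)   # normalize over the key dimension
    return weights @ V, weights          # weighted sum of value vectors

# Illustrative sizes (assumptions): 5 tokens, d_model = 16, d_k = d_v = 8.
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8
X = rng.standard_normal((seq_len, d_model))
W_Q = rng.standard_normal((d_model, d_k))
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_k))

out, weights = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape, weights.shape)
```

Each row of `weights` is a probability distribution over keys, so each output row is a convex combination of the rows of V.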

Multi-head attention partitions the d_model dimension into h heads of dimension d_k = d_model / h each. Each head independently computes scaled dot-product attention on its slice of Q, K, and V, and the h outputs are concatenated and projected back to d_model. The motivation: different heads learn to attend to different kinds of relationships (syntactic, semantic, positional) simultaneously. In practice, heads do specialize — some heads consistently attend to syntactic dependencies, others to long-range co-references — but the degree of specialization varies by layer.
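The split, attend, concatenate, project sequence can be sketched in a few lines of NumPy. This is an illustrative sketch with random weights and assumed sizes (d_model = 32, h = 4), not a reference implementation; production code adds masking, batching, and fused kernels.

```python
import numpy as np

def softmax(scores, axis=-1):
    scores = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads):
    seq_len, d_model = X.shape
    d_k = d_model // num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_k)
        return M.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_Q), split_heads(X @ W_K), split_heads(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, seq_len, seq_len)
    heads = softmax(scores, axis=-1) @ V               # (h, seq_len, d_k)
    # Concatenate the heads back to (seq_len, d_model), then mix them with W_O.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_O

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 6, 32, 4
X = rng.standard_normal((seq_len, d_model))
W_Q, W_K, W_V, W_O = (rng.standard_normal((d_model, d_model)) for _ in range(4))
mha_out = multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads)
print(mha_out.shape)
```

Each head sees only its own d_k-wide slice of the projected Q, K, and V, so the h attention patterns are computed independently before W_O recombines them.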

For GR00T N1.5's DiT action model, cross-attention is the mechanism connecting the VLM's language-conditioned features to the DiT's diffusion process. The query comes from the noisy action tokens; the key and value come from the VLM's output embeddings. Understanding this cross-attention pathway is necessary to debug why certain visual-linguistic conditioning fails to propagate into the action output.
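The shape of that pathway can be sketched generically. The code below is not GR00T's implementation: the variable names (`action_tokens`, `vlm_features`), the token counts, and the dimensions are all hypothetical, chosen only to show where queries and keys/values come from in cross-attention.

```python
import numpy as np

def softmax(scores, axis=-1):
    scores = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=axis, keepdims=True)

def cross_attention(query_source, context, W_Q, W_K, W_V):
    Q = query_source @ W_Q          # queries come from one sequence
    K = context @ W_K               # keys and values come from another
    V = context @ W_V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

# Hypothetical sizes: 16 noisy action tokens, 40 conditioning tokens.
rng = np.random.default_rng(0)
action_tokens = rng.standard_normal((16, 32))   # stand-in for DiT action tokens
vlm_features = rng.standard_normal((40, 64))    # stand-in for VLM output embeddings
W_Q = rng.standard_normal((32, 32))
W_K = rng.standard_normal((64, 32))
W_V = rng.standard_normal((64, 32))

conditioned, cross_weights = cross_attention(action_tokens, vlm_features, W_Q, W_K, W_V)
print(conditioned.shape, cross_weights.shape)
```

When debugging conditioning failures, the weight matrix (one row per action token, one column per conditioning token) is the natural thing to inspect: rows that are nearly uniform or that collapse onto a few tokens suggest the conditioning signal is not being used as intended.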

The four production variants

Full attention computes the complete seq×seq score matrix, which costs O(N^2) memory and compute. It is used for encoders (BERT-style) and for short-context decoders. Causal attention masks the upper triangle of the score matrix so each token attends only to itself and earlier tokens; this is the standard for autoregressive language models. The mask is applied before softmax by setting upper-triangle entries to -inf, so those positions receive zero weight. Sliding-window attention (Mistral, Gemma) restricts each token to a window of W nearby tokens, which in a causal decoder means the W most recent positions, for O(N*W) cost instead of O(N^2). For DealLens VC memo analysis, sliding-window alone is a poor fit because cross-document dependencies span the full 50K+ token context, far beyond a typical window.
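The masks themselves are small boolean matrices. The sketch below builds a causal mask and a sliding-window mask and applies them before softmax; it assumes the causal (look-back only) form of the window, as used in decoder-only models, and the helper names are illustrative.

```python
import numpy as np

def causal_mask(n):
    # True where attention is allowed: position i may see positions j <= i.
    return np.tril(np.ones((n, n), dtype=bool))

def sliding_window_mask(n, window):
    # Causal window (assumption): position i sees itself and the previous
    # window - 1 positions, so at most `window` keys per query.
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (i - j < window)

def masked_softmax(scores, mask):
    scores = np.where(mask, scores, -np.inf)            # disallowed -> -inf
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(scores)                                # exp(-inf) == 0
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.standard_normal((5, 5))
w_causal = masked_softmax(scores, causal_mask(5))
w_window = masked_softmax(scores, sliding_window_mask(5, 2))
print(sliding_window_mask(5, 2).astype(int))
```

Every row keeps its diagonal entry, so the row maximum is always finite and the softmax never divides by zero. The number of True entries per row is what drives cost: up to N for causal attention, at most W for the window.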

Grouped-query attention (Ainslie et al., arXiv:2305.13245, May 2023) is the most impactful recent variant for production serving. Instead of H independent K and V projections, GQA uses G groups (G < H) where each group shares one K and one V projection across H/G query heads. Llama 3 8B uses G=8 with H=32 (4 query heads per group); Llama 3 70B uses G=8 with H=64 (8 query heads per group). The KV-cache memory reduction is H/G, so 4x and 8x respectively, which at long context lengths can be the difference between fitting on one GPU and needing KV offloading.
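The sharing pattern can be sketched by storing only G key/value heads and broadcasting each one to its H/G query heads at attention time. This is a minimal illustration with random tensors and an assumed head dimension of 16; real implementations avoid materializing the repeated K and V.

```python
import numpy as np

def grouped_query_attention(Q, K, V):
    # Q: (H, seq_len, d_k); K and V: (G, seq_len, d_k) with H divisible by G.
    H, G = Q.shape[0], K.shape[0]
    assert H % G == 0, "number of query heads must be a multiple of KV groups"
    repeat = H // G
    # Consecutive query heads share a group: heads 0..repeat-1 use KV head 0, etc.
    K_full = np.repeat(K, repeat, axis=0)
    V_full = np.repeat(V, repeat, axis=0)
    scores = Q @ K_full.transpose(0, 2, 1) / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V_full                      # (H, seq_len, d_k)

H, G, seq_len, d_k = 32, 8, 5, 16
rng = np.random.default_rng(0)
Q = rng.standard_normal((H, seq_len, d_k))
K = rng.standard_normal((G, seq_len, d_k))       # only G heads are cached
V = rng.standard_normal((G, seq_len, d_k))
gqa_out = grouped_query_attention(Q, K, V)
print(gqa_out.shape, "cached KV elements:", K.size + V.size)
```

Only the (G, seq_len, d_k) tensors need to live in the cache, which is where the H/G saving comes from. Setting G = H gives standard multi-head attention and G = 1 gives multi-query attention.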

Multi-query attention (MQA, Shazeer 2019) is the extreme case of GQA with G=1, where all query heads share a single K and V. MQA reduces the KV-cache by a factor of H, but at a measurable quality cost. The empirical finding from the GQA paper is that an intermediate number of groups reaches inference speed close to MQA while keeping quality close to MHA, which helped make G=8 GQA a dominant production choice for decoder-only models in 2024-2025.

KV-cache economics drive the design space

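During autoregressive decoding, each new token needs to attend to all previous tokens. Without a cache, every decoding step would rerun the model over the whole prefix, recomputing K and V for all previous tokens and paying O(N^2) attention cost per generated token. The KV-cache stores K and V for all past tokens, reducing decode cost to O(N) per token at the expense of O(N * d_kv * num_layers * 2) persistent memory, where d_kv = num_kv_heads * head_dim (equal to d_model for full MHA). For a Llama 3 70B-sized model (d_model = 8192, 80 layers) serving a 128K-token context, a full-MHA KV-cache at FP16 would be approximately 128K * 8192 * 80 * 2 (K and V) * 2 bytes = ~344 GB per sequence, larger than the roughly 140 GB of FP16 weights. With the 8 KV heads (for 64 query heads) that Llama 3 70B actually uses, the cache is 8x smaller, about 43 GB.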
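Evaluating the product term by term is a useful check on any KV-cache estimate. The helper below is a hypothetical utility, not part of any serving library; the Llama 3 70B-like shape (80 layers, 64 query heads, 8 KV heads, head dimension 128) is taken as an assumption from the published model configuration.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem=2, batch_size=1):
    """Bytes needed to cache K and V for every layer and every position."""
    per_token_per_layer = 2 * num_kv_heads * head_dim   # 2 = one K plus one V
    return batch_size * seq_len * num_layers * per_token_per_layer * bytes_per_elem

context = 128 * 1024   # 128K tokens

# Full MHA: 64 KV heads of dimension 128 (64 * 128 = 8192 = d_model).
mha_bytes = kv_cache_bytes(context, num_layers=80, num_kv_heads=64, head_dim=128)
# GQA with 8 KV heads shared across the 64 query heads.
gqa_bytes = kv_cache_bytes(context, num_layers=80, num_kv_heads=8, head_dim=128)

print(f"MHA: {mha_bytes / 1e9:.1f} GB")   # 343.6 GB
print(f"GQA: {gqa_bytes / 1e9:.1f} GB")   # 42.9 GB
```

The figures are per sequence at FP16. Batch size multiplies them directly, which is why serving many long-context requests at once is usually limited by cache memory rather than by weights.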

This memory pressure is why GQA exists, why PagedAttention (in vLLM) virtualizes KV-cache into pages, and why speculative decoding (Chapter 19) drafts multiple tokens at once rather than one. For DealLens, the KV-cache budget determines how many VC memos you can hold in context simultaneously during a multi-memo synthesis query.

Attention in VLA models

Every VLA model (OpenVLA, GR00T N1.5, SmolVLA, π0) uses attention as the core information-mixing mechanism. OpenVLA's Llama 2 7B backbone uses standard multi-head attention (in the Llama 2 family only the larger models adopted GQA); GR00T N1.5's Eagle 2.5 VLM uses GQA; the DiT action model uses full cross-attention over short action-token sequences. Understanding which variant is used and why matters when you debug inference latency: GQA shrinks the KV-cache by H/G, typically 4-8x, and because long-sequence decode is largely bound by reading that cache from memory, decode latency can improve by up to a similar factor.

For the JHU humanoid, the practical implication is at inference time: SmolVLA at 450M parameters with GQA can run on a single consumer GPU; a hypothetical full-MHA model at the same parameter count would need H/G times the KV-cache memory (for example, 4x with four query heads per KV head). The 78.3% real-world task success reported for SmolVLA (Hugging Face, June 2025) comes from an architecture designed for low-latency inference, not just accuracy.

Flash Attention is not a new attention variant

FlashAttention (Dao et al., 2022) and FlashAttention-2/3 are IO-aware implementations of standard scaled dot-product attention — same math, same output, but tiled computation that avoids materializing the N×N matrix in HBM. Chapter 14 covers this in depth.
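The "same math, same output" point can be illustrated with the online-softmax idea that tiled attention relies on. The NumPy sketch below is a CPU toy, not the FlashAttention kernel: it tiles only over keys, uses an arbitrary block size, and exists just to show that processing K and V in blocks with running statistics reproduces the reference result without ever forming the full N×N matrix.

```python
import numpy as np

def reference_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # full (N, N) matrix
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return (weights / weights.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=4):
    n_q, d_k = Q.shape
    out = np.zeros((n_q, V.shape[-1]))
    running_max = np.full((n_q, 1), -np.inf)   # max score seen so far, per query
    running_sum = np.zeros((n_q, 1))           # softmax denominator so far
    for start in range(0, K.shape[0], block):
        K_blk, V_blk = K[start:start + block], V[start:start + block]
        s = Q @ K_blk.T / np.sqrt(d_k)         # only an (N, block) tile
        new_max = np.maximum(running_max, s.max(axis=-1, keepdims=True))
        correction = np.exp(running_max - new_max)   # rescale earlier accumulators
        p = np.exp(s - new_max)
        running_sum = running_sum * correction + p.sum(axis=-1, keepdims=True)
        out = out * correction + p @ V_blk
        running_max = new_max
    return out / running_sum

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((10, 8)) for _ in range(3))
max_diff = np.abs(tiled_attention(Q, K, V) - reference_attention(Q, K, V)).max()
print("max abs difference:", max_diff)
```

The two results agree up to floating-point rounding. The real kernels apply the same rescaling trick inside on-chip SRAM, which is where the speed and memory savings come from.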

Figure 4.1. Scaled dot-product attention (left) and the three production variants (right): Full MHA, GQA, and Sliding-Window. They differ by KV-cache cost, not by the core attention math.

Retrieve before you continue

Three questions on what you just read

Q1 (Factual): Why does scaled dot-product attention divide by sqrt(d_k)?

Q2 (Conceptual): What is the memory reduction from grouped-query attention with G=8 groups and H=32 heads, and why is this the production standard?

Q3 (Synthetic): For DealLens processing 50K-token VC memos, which attention variant is most appropriate and what is the approximate KV-cache memory at serving time?