Predict before you read

Why does attention use a 1/sqrt(d_k) scaling factor on the dot products before softmax?

Think about what happens to the variance of a dot product as the dimension d_k grows.

From Tokens to Embodied Minds  ·  Chapter 04 of 36

Attention from scratch

Q, K, V — and why scaled dot-product is the right shape

4 production attention variants: full, causal, sliding-window, GQA
√d_k scaling factor: prevents softmax saturation in high-dimensional heads
4–8× KV-cache memory reduction from grouped-query attention

Self-attention is a soft, content-addressable memory: each query token scores every key token and receives a weighted mix of the corresponding values. The mathematical form, softmax(QK^T / sqrt(d_k)) V, is differentiable end to end, reduces to dense matrix multiplications that run efficiently on accelerators, and keeps the scale of the pre-softmax scores independent of the head dimension. The 1/sqrt(d_k) term is not an empirical hack; it follows from the variance of random dot products in high dimensions: if the components of q and k are independent with zero mean and unit variance, q·k has variance d_k, and dividing by sqrt(d_k) brings it back to 1. The multi-head wrapper exists because a single head produces only one attention pattern per query; running several heads, each with its own learned projection into a lower-dimensional subspace, lets a layer track several kinds of relationship at once. The four production variants, full attention, causal attention, sliding-window (Mistral, Gemma), and grouped-query attention (Llama 3, Qwen 2.5), are not primarily modelling improvements. Sliding-window and grouped-query attention in particular are engineering choices driven by KV-cache memory economics and inference latency targets. Understanding why each exists requires understanding the KV-cache, not just the math of attention.
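The variance argument can be checked numerically. The following is a minimal sketch, assuming query and key components are i.i.d. standard normal (an idealization; trained projections only approximate it). The function name `dot_product_variance` and the sample count are illustrative choices, not part of any library.

```python
import numpy as np

def dot_product_variance(d_k, n_samples=10000, seed=0):
    """Empirical variance of q.k when q and k have i.i.d. N(0, 1) entries."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    raw = np.sum(q * k, axis=1)      # unscaled dot products
    scaled = raw / np.sqrt(d_k)      # scaled as in attention
    return raw.var(), scaled.var()

for d_k in (16, 64, 256):
    raw_var, scaled_var = dot_product_variance(d_k)
    # raw variance grows roughly like d_k; scaled variance stays near 1
    print(f"d_k={d_k:4d}  var(raw)={raw_var:8.1f}  var(scaled)={scaled_var:.2f}")
```

Without the scaling, scores at d_k = 256 have a standard deviation of about 16, which is large enough for softmax to put almost all weight on one key and leave very small gradients for the rest.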

Q, K, V — the lookup mechanism

The QKV projection takes the matrix of token embeddings x, of shape (seq_len, d_model), and produces three matrices: Q = xW_Q and K = xW_K, each of shape (seq_len, d_k), and V = xW_V, of shape (seq_len, d_v). The scaled dot-product attention is then A = softmax(QK^T / sqrt(d_k)) V, where A has shape (seq_len, d_v). The interpretation: QK^T computes the compatibility of each query with each key (an unnormalized similarity matrix of shape seq_len × seq_len). Softmax normalizes across the key dimension, so each row of attention weights sums to 1. V is the content retrieved: each output token is a weighted sum of all value vectors.
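The lookup described above can be written in a few lines of NumPy. This is a sketch with random weights, assuming projections that map d_model to d_k (for Q and K) and d_v (for V); the dimensions and the function names are chosen for illustration only.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(x, W_q, W_k, W_v):
    Q = x @ W_q                          # (seq_len, d_k)
    K = x @ W_k                          # (seq_len, d_k)
    V = x @ W_v                          # (seq_len, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) query-key compatibility
    weights = softmax(scores, axis=-1)   # normalize over the key dimension
    return weights @ V, weights          # output: (seq_len, d_v)

rng = np.random.default_rng(0)
seq_len, d_model, d_k, d_v = 5, 16, 8, 8   # illustrative sizes
x = rng.standard_normal((seq_len, d_model))
W_q = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
W_k = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
W_v = rng.standard_normal((d_model, d_v)) / np.sqrt(d_model)

out, weights = scaled_dot_product_attention(x, W_q, W_k, W_v)
print(out.shape, weights.shape)   # (5, 8) (5, 5)
```

Each row of `weights` sums to 1, which is what makes the output a convex combination of value vectors.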

Multi-head attention partitions the d_model dimension into h heads of dimension d_k = d_model / h each. Each head independently computes scaled dot-product attention on its slice of Q, K, and V, and the h outputs are concatenated and projected back to d_model. The motivation: different heads learn to attend to different kinds of relationships (syntactic, semantic, positional) simultaneously. In practice, heads do specialize — some heads consistently attend to syntactic dependencies, others to long-range co-references — but the degree of specialization varies by layer.
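The split, per-head attention, concatenate, and project steps can be sketched as follows. This assumes square (d_model, d_model) projection matrices whose output is reshaped into heads, which is one common way to implement the partition; `multi_head_attention` and its argument names are hypothetical.

```python
import numpy as np

def softmax(scores, axis=-1):
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0
    d_k = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_k)
        return t.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)

    Q, K, V = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (num_heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)                 # one attention pattern per head
    heads = weights @ V                                # (num_heads, seq_len, d_k)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o                                # project back to d_model

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 6, 32, 4                 # illustrative sizes
x = rng.standard_normal((seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                      for _ in range(4))
mha_out = multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads)
print(mha_out.shape)   # (6, 32)
```

Note that the scaling uses the per-head dimension d_k = d_model / h, not d_model, because each head's dot product sums over only d_k components.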

For GR00T N1.5's DiT action model, cross-attention is the mechanism connecting the VLM's language-conditioned features to the DiT's diffusion process. The query comes from the noisy action tokens; the key and value come from the VLM's output embeddings. Understanding this cross-attention pathway is necessary to debug why certain visual-linguistic conditioning fails to propagate into the action output.

The four production variants

Full attention computes the complete seq×seq score matrix: O(N^2) memory and compute. It is used for encoders (BERT-style) and for short-context decoders. Causal attention masks the upper triangle of the score matrix so each token attends only to itself and earlier tokens, which is the standard for autoregressive language models. The mask is applied before softmax by setting upper-triangle entries to -inf, so those positions receive exactly zero weight. Sliding-window attention (Mistral, Gemma) restricts each token to a window of W nearby tokens, which in a causal decoder means the current token and the W-1 before it: O(N*W) instead of O(N^2). For DealLens VC memo analysis, sliding-window is insufficient because cross-document dependencies span the full 50K+ token context.
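The masking step can be made concrete with boolean masks. The sketch below assumes the causal (decoder) form of sliding-window attention, in which the window covers the current token and the `window - 1` tokens before it; the helper names are illustrative.

```python
import numpy as np

def causal_mask(n):
    # True where attention is allowed: key index j <= query index i.
    return np.tril(np.ones((n, n), dtype=bool))

def sliding_window_mask(n, window):
    # Causal, and additionally limited to the last `window` positions (self included).
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (i - j < window)

def masked_softmax(scores, allowed):
    # Disallowed positions are set to -inf before softmax, so they get zero weight.
    masked = np.where(allowed, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)
    exp = np.exp(masked)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n = 6
scores = rng.standard_normal((n, n))
causal_weights = masked_softmax(scores, causal_mask(n))
window_weights = masked_softmax(scores, sliding_window_mask(n, window=3))

print(sliding_window_mask(n, window=3).sum(axis=1))   # [1 2 3 3 3 3]
```

Every row keeps at least its diagonal entry, so the row maximum is always finite and the softmax never divides by zero.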

Grouped-query attention (Ainslie et al., arXiv:2305.13245, May 2023) is the most impactful recent variant for production serving. Instead of H independent K and V projections, GQA uses G groups (G < H), where each group shares one K and one V projection across H/G query heads. Llama 3 8B uses G=8 with H=32 (4 query heads per group), so the KV-cache memory reduction is H/G = 4x; Llama 3 70B uses G=8 with H=64, an 8x reduction. At long context lengths this is the difference between fitting on one GPU and needing KV offloading.

Multi-query attention (MQA, Shazeer 2019) is the extreme case of GQA with G=1: all query heads share a single K and V. MQA reduces the KV-cache by a factor of H, but at a measurable quality cost. The empirical finding from the GQA paper is that an intermediate number of groups such as G=8 keeps most of MQA's speed and memory benefit while staying close to MHA quality, which made G=8 GQA the dominant production choice for decoder-only models in 2024-2025.
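The sharing scheme can be sketched by storing only G key/value heads and repeating them across the query heads at attention time. This is an illustrative NumPy version, not any framework's implementation; setting G = H recovers full multi-head attention and G = 1 recovers MQA. The function name and the sizes are assumptions.

```python
import numpy as np

def softmax(scores, axis=-1):
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def grouped_query_attention(Q, K, V):
    """Q: (H, seq_len, d_k); K and V: (G, seq_len, d_k) with H divisible by G."""
    H, G = Q.shape[0], K.shape[0]
    assert H % G == 0
    repeat = H // G
    # Query heads [g*repeat, (g+1)*repeat) all read the same K[g] and V[g].
    K_full = np.repeat(K, repeat, axis=0)              # (H, seq_len, d_k)
    V_full = np.repeat(V, repeat, axis=0)
    d_k = Q.shape[-1]
    scores = Q @ K_full.transpose(0, 2, 1) / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V_full           # (H, seq_len, d_k)

rng = np.random.default_rng(0)
H, G, seq_len, d_k = 32, 8, 10, 16                     # illustrative sizes
Q = rng.standard_normal((H, seq_len, d_k))
K = rng.standard_normal((G, seq_len, d_k))
V = rng.standard_normal((G, seq_len, d_k))

gqa_out = grouped_query_attention(Q, K, V)
cache_ratio = (H * seq_len * d_k) // K.size            # MHA cache size / GQA cache size
print(gqa_out.shape, cache_ratio)                      # (32, 10, 16) 4
```

Only `K` and `V` with G heads need to live in the cache; the repeat is a view-like expansion done at compute time, which is where the H/G memory saving comes from.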

KV-cache economics drive the design space

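During autoregressive decoding, each new token attends to all previous tokens. Without a cache, every decoding step would recompute K and V for the whole prefix and rerun attention over it, roughly O(N^2) work per generated token. The KV-cache stores K and V for all past tokens, reducing decode cost to O(N) per token at the expense of persistent memory of O(N * num_layers * 2 * num_kv_heads * d_head), which equals O(N * num_layers * 2 * d_model) for full multi-head attention. For a model with Llama 3 70B's dimensions (d_model = 8192, 80 layers) serving a 128K-token context with full MHA, the KV-cache at FP16 would be approximately 128K * 8192 * 80 * 2 * 2 bytes = ~344 GB, more than twice the ~140 GB of FP16 weights. Llama 3 70B itself uses GQA with 8 KV heads for its 64 query heads, which cuts this by 8x to ~43 GB.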
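The cache arithmetic is easy to get wrong by a factor of 2, so it helps to spell it out. The sketch below assumes a 128K context of 131,072 tokens, 80 layers, 64 query heads of dimension 128 (so d_model = 8192), FP16 storage at 2 bytes per value, and 8 KV heads for the GQA case; `kv_cache_bytes` is a hypothetical helper, not a library function.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    # One K tensor and one V tensor per layer, each (seq_len, num_kv_heads, head_dim).
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

seq_len = 128 * 1024   # assumed meaning of "128K"

full_mha = kv_cache_bytes(seq_len, num_layers=80, num_kv_heads=64, head_dim=128)
gqa_8 = kv_cache_bytes(seq_len, num_layers=80, num_kv_heads=8, head_dim=128)

print(f"full MHA: {full_mha / 1e9:.1f} GB")    # full MHA: 343.6 GB
print(f"GQA, 8 KV heads: {gqa_8 / 1e9:.1f} GB")  # GQA, 8 KV heads: 42.9 GB
print(f"ratio: {full_mha // gqa_8}x")          # ratio: 8x
```

These figures are per sequence; serving a batch multiplies them by the batch size, which is why cache size rather than weight size often sets the concurrency limit.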

This memory pressure is why GQA exists, why PagedAttention (in vLLM) virtualizes KV-cache into pages, and why speculative decoding (Chapter 19) drafts multiple tokens at once rather than one. For DealLens, the KV-cache budget determines how many VC memos you can hold in context simultaneously during a multi-memo synthesis query.

Every VLA model (OpenVLA, GR00T N1.5, SmolVLA, π0) uses attention as the core information-mixing mechanism. OpenVLA's Llama 2 7B backbone uses standard multi-head attention (in the Llama 2 family, only the larger models adopted GQA); GR00T N1.5's Eagle 2.5 VLM uses GQA; the DiT action model uses full cross-attention over short action-token sequences. Understanding which variant is used and why matters when you debug inference latency: long-sequence decode is dominated by reading the KV-cache from memory, and GQA shrinks that cache by H/G (4-8x in common configurations), so the attention part of each decode step speeds up by roughly the same factor.

For the JHU humanoid, the practical implication is at inference time: SmolVLA at 450M parameters with GQA can run on a single consumer GPU; a hypothetical full-MHA model at the same parameter count would need roughly 4x the KV-cache memory at the same context length. The 78.3% real-world task success reported for SmolVLA (Hugging Face, June 2025) is achieved partly because the architecture is designed for low-latency inference, not just accuracy.

FlashAttention is not a new attention variant

FlashAttention (Dao et al., 2022) and FlashAttention-2/3 are IO-aware implementations of standard scaled dot-product attention — same math, same output, but tiled computation that avoids materializing the N×N matrix in HBM. Chapter 14 covers this in depth.
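The "same math, tiled computation" point can be illustrated with the online-softmax rescaling trick. The sketch below is plain NumPy that tiles only over keys and keeps a running row maximum and row sum; it is meant to show why the full N×N matrix never needs to exist at once, and it is not the FlashAttention kernel itself, which also tiles over queries and manages on-chip memory explicitly. Function names and the block size are illustrative.

```python
import numpy as np

def naive_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])            # full (n, n) score matrix
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=4):
    n, d_k = Q.shape
    out = np.zeros((n, V.shape[-1]))
    row_max = np.full((n, 1), -np.inf)                 # running max of scores per query
    row_sum = np.zeros((n, 1))                         # running softmax denominator
    for start in range(0, K.shape[0], block):
        K_blk, V_blk = K[start:start + block], V[start:start + block]
        s = Q @ K_blk.T / np.sqrt(d_k)                 # only an (n, block) tile of scores
        new_max = np.maximum(row_max, s.max(axis=-1, keepdims=True))
        correction = np.exp(row_max - new_max)         # rescale earlier partial results
        p = np.exp(s - new_max)
        row_sum = row_sum * correction + p.sum(axis=-1, keepdims=True)
        out = out * correction + p @ V_blk
        row_max = new_max
    return out / row_sum

rng = np.random.default_rng(0)
Q = rng.standard_normal((10, 8))
K = rng.standard_normal((10, 8))
V = rng.standard_normal((10, 8))
print(np.allclose(naive_attention(Q, K, V), tiled_attention(Q, K, V)))   # True
```

The two functions agree up to floating-point rounding, which is the sense in which a tiled implementation computes the same attention.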

[Figure: "Scaled Dot-Product Attention and Production Variants". Left panel, the attention pipeline: Q = xW_Q, K = xW_K, V = xW_V → QK^T / sqrt(d_k) → softmax + mask → weighted sum of V. Right panel, variants driven by KV-cache economics: Full MHA (H K,V heads; O(N^2); max quality), GQA with G=8 (K,V shared within each group of H/G query heads; H/G times less KV memory; about the same quality), Sliding Window (attend to W neighbors; O(N·W); local only). Annotations: the KV-cache at FP16 for a 128K context with Llama-3-70B dimensions is ~344 GB under full MHA, larger than the model weights; GQA with 8 KV heads reduces it to ~43 GB, enabling single-node serving of 128K contexts. sqrt(d_k) scaling: the dot-product variance is d_k; dividing by sqrt(d_k) restores variance 1 and prevents softmax saturation.]
Figure 4.1: Scaled dot-product attention (left) and the three production variants (right), Full MHA, GQA, and Sliding-Window, differ by KV-cache cost, not by the core attention math.
Retrieve before you continue

Three questions on what you just read

Q1 (Factual): Why does scaled dot-product attention divide by sqrt(d_k)?
Q2 (Conceptual): What is the memory reduction from grouped-query attention with G=8 groups and H=32 heads, and why is this the production standard?
Q3 (Synthetic): For DealLens processing 50K-token VC memos, which attention variant is most appropriate and what is the approximate KV-cache memory at serving time?