Predict before you read

Which numerical pattern most often causes silent precision loss in modern transformer training — not a NaN, but quietly degraded model quality?

Pick the one that fails not by exploding but by silently squashing gradients.

From Tokens to Embodied Minds  ·  Chapter 01 of 36

Linear algebra you actually use

The five operations every forward pass spends its time on

5 operations that consume every forward pass
BF16+FP32, the only numerical recipe that survives at scale
100µs saved on attention is 100µs more for the controller

The five operations that decide every transformer forward pass (dense matmul, softmax, layer or RMS norm, residual add, and the einsum patterns that express multi-head attention) are not interesting because of their math. They are interesting because their numerical behavior at FP16, BF16, and FP8 is what bites in practice. Softmax overflows. Accumulators lose precision. BF16 matmul with FP32 accumulate is the only reason large-scale training works at all, and if you do not know why, you are one rogue kernel away from a training run that silently degrades for three days before anyone notices.

For the JHU humanoid capstone, this is not academic. Perception-to-action latency has a hard budget: every 100 microseconds you waste in the attention kernel is 100 microseconds the PD controller cannot use to track a smoother trajectory. For DealLens, the same matmul kernel sits below your retriever; understanding its precision trade-offs tells you why embedding quality degrades at certain dtypes.

The five operations

Dense matmul dominates wall-clock time in every transformer layer: the QKV projection, the attention output projection, and both FFN weight multiplications are all GEMM calls. At sequence length 512 on an H100, matmul accounts for roughly 60-70% of forward-pass runtime while contributing nearly all of the FLOPs. Softmax is a row-wise reduction over the key dimension and is cheap, but it is also where overflow enters: exponentiating a large unshifted score produces inf, and the normalization that follows turns inf/inf into NaN. Layer norm and RMS norm are normalization steps over the hidden dimension, cheap in FLOPs but critical for training stability because they reset the scale of activations at every sublayer boundary. Residual add costs almost nothing but is architecturally load-bearing: without the identity path, gradients vanish in deep stacks.
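The two normalization variants and the residual path can be sketched in a few lines. This is a minimal NumPy illustration, not any library's implementation: the function names, the `eps` values, and the pre-norm placement (normalize, apply the sublayer, then add) are assumptions made for the example.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each token's hidden vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gamma + beta

def rms_norm(x, gamma, eps=1e-6):
    # Rescale by the root mean square only: no mean subtraction, no bias.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gamma

def residual_block(x, sublayer, gamma):
    # Assumed pre-norm layout: the identity path carries x through unchanged.
    return x + sublayer(rms_norm(x, gamma))

rng = np.random.default_rng(0)
# (batch, seq, hidden) activations with a deliberately large scale
x = rng.normal(scale=50.0, size=(2, 4, 8)).astype(np.float32)
gamma = np.ones(8, dtype=np.float32)
beta = np.zeros(8, dtype=np.float32)

y_rms = rms_norm(x, gamma)
y_ln = layer_norm(x, gamma, beta)
out = residual_block(x, lambda h: 0.1 * h, gamma)  # toy stand-in for attention or FFN
```

Whatever scale the input arrives at, both norms hand the sublayer activations of roughly unit scale, which is the "reset" described above.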

The einsum patterns behind multi-head attention tie the other four together. A single forward pass of causal self-attention is:

Q = x W_Q, K = x W_K, V = x W_V (three matmuls)
scores = Q K^T / sqrt(d_k) (one batched matmul)
weights = softmax(scores + mask) (one softmax)
output = weights V (one batched matmul)
out_proj = output W_O (one matmul)

That is seven operations per attention sublayer, six GEMMs plus one softmax, all of which operate on BF16 tensors on modern training stacks. Writing the score step as an einsum, torch.einsum('bhqd,bhkd->bhqk', Q, K), is not just style; it forces you to track every dimension and spot shape errors before they become runtime bugs.
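The steps above can be sketched end to end. This is an illustrative NumPy version, used here instead of PyTorch so it runs without a GPU (np.einsum accepts the same subscript strings as torch.einsum). The function name, the head-splitting helper, and the toy sizes are assumptions for the example, and everything runs in float64 rather than BF16.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract-max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(x, W_Q, W_K, W_V, W_O, n_heads):
    b, t, d_model = x.shape
    d_k = d_model // n_heads

    def split(z):
        # (b, t, d_model) -> (b, h, t, d_k)
        return z.reshape(b, t, n_heads, d_k).transpose(0, 2, 1, 3)

    # Three projection matmuls
    Q, K, V = split(x @ W_Q), split(x @ W_K), split(x @ W_V)
    # One batched matmul for the scores
    scores = np.einsum('bhqd,bhkd->bhqk', Q, K) / np.sqrt(d_k)
    # Causal mask: query position q may attend only to key positions k <= q
    mask = np.where(np.tril(np.ones((t, t), dtype=bool)), 0.0, -np.inf)
    weights = softmax(scores + mask, axis=-1)
    # One batched matmul for the weighted values, then merge heads
    out = np.einsum('bhqk,bhkd->bhqd', weights, V)
    out = out.transpose(0, 2, 1, 3).reshape(b, t, d_model)
    # Output projection matmul
    return out @ W_O, weights

rng = np.random.default_rng(0)
b, t, d_model, n_heads = 2, 5, 16, 4
x = rng.normal(size=(b, t, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
                      for _ in range(4))
y, weights = causal_self_attention(x, W_Q, W_K, W_V, W_O, n_heads)
```

Each row of `weights` sums to one and is zero above the diagonal, which is a quick sanity check worth keeping in any custom attention path.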

Grouped-query attention (GQA), used in Llama 3, Qwen 2.5, and Mistral, reduces the K and V heads to a smaller group count, cutting KV-cache memory by 4-8x with minimal accuracy loss. This is not a theoretical curiosity — for DealLens running long-context VC memo analysis, GQA is the reason you can fit a 128K-context model on a single A100.
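The KV-cache saving is plain arithmetic: cache size scales linearly with the number of K/V heads. A back-of-envelope sketch follows. The configuration (32 layers, head dimension 128, 32 query heads versus 8 KV heads, 2-byte elements) is an assumed example resembling an 8B-class model, not a figure taken from this chapter.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    # Leading factor of 2: one cached tensor for K and one for V per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Assumed illustrative configuration at 128K context, BF16 cache entries
cfg = dict(n_layers=32, head_dim=128, seq_len=128 * 1024)

mha = kv_cache_bytes(n_kv_heads=32, **cfg)   # one K/V head per query head
gqa = kv_cache_bytes(n_kv_heads=8, **cfg)    # 4 query heads share each K/V head

print(f"MHA cache: {mha / 2**30:.0f} GiB")   # 64 GiB
print(f"GQA cache: {gqa / 2**30:.0f} GiB")   # 16 GiB
print(f"reduction: {mha / gqa:.0f}x")        # 4x
```

Under these assumed numbers the full multi-head cache alone would not leave room for weights on an 80 GB card, while the grouped version does.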

Numerics is the bug, not the algorithm

Softmax overflow is a recurring bug rather than a one-time fix because every new attention kernel — Triton, raw CUDA, KV-cache code paths — reintroduces the risk. PyTorch's built-in F.softmax subtracts the row maximum before exponentiation. Custom kernels routinely omit this. The symptom is NaN loss at step one or step N, depending on when activations first grow large enough to overflow. The fix is mechanical: subtract the per-row max before exponentiation, which does not change the output distribution but keeps intermediate values finite.
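The subtract-max fix rests on the identity softmax(x)_i = exp(x_i - m) / sum_j exp(x_j - m) for any constant m. A small NumPy sketch in FP16 shows both the failure and the fix. The function names and the score values are chosen for illustration: exp overflows in FP16 once a score exceeds about 11.09, since the largest finite FP16 value is 65504.

```python
import numpy as np

def naive_softmax(scores):
    e = np.exp(scores)                         # exp(12) does not fit in FP16 -> inf
    return e / e.sum(axis=-1, keepdims=True)   # inf / inf -> NaN

def stable_softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)   # largest entry becomes 0
    e = np.exp(shifted)                        # every exponent is now <= 1
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([[12.0, 11.0, 2.0]], dtype=np.float16)

with np.errstate(over='ignore', invalid='ignore'):
    bad = naive_softmax(scores)    # contains NaN
good = stable_softmax(scores)      # finite, sums to ~1

# Shift invariance, checked in float64: adding a constant changes nothing.
x64 = scores.astype(np.float64)
same = np.allclose(stable_softmax(x64), stable_softmax(x64 + 100.0))
```

Modest scores are enough to break the naive version at low precision, which is why the omission tends to surface only once activations grow during training.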

BF16 has the same 8 exponent bits as FP32 (range up to ~3.4e38) but only 7 mantissa bits instead of 23. That exponent range is what matters for activations: they do not overflow at the values LLMs routinely see. FP16, by contrast, has a 5-bit exponent and tops out at 65504, a value that large activations and unscaled attention scores can exceed. At the small end, FP16 values below roughly 6e-8 flush to zero, so tiny gradients disappear without ever producing a NaN. FP32 accumulators then recover the precision that 7 mantissa bits give up: products are summed in FP32 along the reduction axis, so rounding error does not compound over thousands of terms. The combination is not a heuristic; it is what large training runs such as PaLM converged on empirically, and what most major runs since 2021 have used.
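Both halves of the recipe can be checked with nothing but the standard library. The sketch below emulates BF16 by rounding an FP32 bit pattern to its top 16 bits. The helper names are made up for this example, NaN and infinity handling is omitted, and the accumulation loop sums ones instead of real matmul products. One caveat: Python's struct module raises OverflowError when a value does not fit in FP16, whereas hardware would produce inf.

```python
import struct

def to_fp32(x):
    return struct.unpack('<f', struct.pack('<f', x))[0]

def to_fp16(x):
    return struct.unpack('<e', struct.pack('<e', x))[0]

def to_bf16(x):
    # BF16 is the top 16 bits of FP32; round to nearest, ties to even.
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack('<f', struct.pack('<I', bits))[0]

# Range: a value above the FP16 maximum, and a gradient-sized small value.
big, tiny_grad = 70000.0, 1e-8
bf16_big = to_bf16(big)                 # finite, within about 1% of 70000
try:
    to_fp16(big)
    fp16_overflows = False
except OverflowError:                   # struct's way of reporting FP16 overflow
    fp16_overflows = True
fp16_tiny = to_fp16(tiny_grad)          # flushes to 0.0
bf16_tiny = to_bf16(tiny_grad)          # survives as a nonzero value

# Accumulation: add 1.0 a total of 4096 times.
acc_bf16 = 0.0
acc_fp32 = 0.0
for _ in range(4096):
    acc_bf16 = to_bf16(acc_bf16 + 1.0)  # accumulator rounded to BF16 each step
    acc_fp32 = to_fp32(acc_fp32 + 1.0)  # accumulator kept in FP32

print(acc_bf16, acc_fp32)               # 256.0 4096.0
```

The BF16 accumulator stalls at 256 because 257 is not representable with 8 significant bits and rounds back down. That is the precision loss FP32 accumulation exists to prevent.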

FP8 (E4M3 for forward, E5M2 for backward) is now production-default on Hopper and Blackwell. DeepSeek-V3 trained its 671B-parameter model with an FP8 mixed-precision framework (DeepSeek-AI, arXiv:2412.19437, December 2024), keeping some precision-sensitive operators in BF16 or FP32. The caveats: per-tensor scaling is insufficient for activation outliers; per-channel or per-group scaling is required. FP8 also depends on tensor-core support that arrived with the Ada and Hopper generations, so do not attempt it on older GPUs.
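Why per-tensor scaling struggles with outliers can be seen with a simulated E4M3 quantizer. This is a rough NumPy sketch, not a hardware-accurate kernel: `fake_e4m3` rounds to the E4M3 value grid (4 significant bits, maximum 448) and ignores NaN encodings. The outlier factor is deliberately extreme so the effect shows up in a few lines. All names are illustrative.

```python
import numpy as np

E4M3_MAX = 448.0   # largest finite E4M3 value

def fake_e4m3(x):
    # Round to 1 implicit + 3 stored mantissa bits, clipped to +-448.
    x = np.clip(x, -E4M3_MAX, E4M3_MAX)
    _, e = np.frexp(x)                 # x = m * 2**e with 0.5 <= |m| < 1
    e = np.maximum(e, -5)              # below the min normal 2**-6, spacing stays fixed
    step = np.ldexp(1.0, e - 4)        # gap between neighbouring representable values
    return np.round(x / step) * step

def quant_dequant(x, scale):
    return fake_e4m3(x / scale) * scale

rng = np.random.default_rng(0)
acts = rng.normal(size=(256, 64))      # (tokens, channels)
OUTLIER = 3
acts[:, OUTLIER] *= 1e5                # exaggerated outlier channel (assumption)

per_tensor_scale = np.abs(acts).max() / E4M3_MAX
per_channel_scale = np.abs(acts).max(axis=0, keepdims=True) / E4M3_MAX

def rel_err_normal_channels(deq):
    # Mean absolute error on the non-outlier channels, relative to their magnitude.
    keep = np.arange(acts.shape[1]) != OUTLIER
    return np.abs(deq - acts)[:, keep].mean() / np.abs(acts)[:, keep].mean()

err_tensor = rel_err_normal_channels(quant_dequant(acts, per_tensor_scale))
err_channel = rel_err_normal_channels(quant_dequant(acts, per_channel_scale))
```

With one scale for the whole tensor, the outlier sets the scale and the ordinary channels land in the subnormal end of the grid. With one scale per channel, each channel uses the full range.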

Einsum is the language of attention

Writing multi-head attention as a sequence of raw matmuls obscures the structure. Writing it as a set of einsums makes every dimension explicit and makes bugs obvious. The full forward pass collapses to: scores = einsum('bhqd,bhkd->bhqk', Q, K) * scale, weights = softmax(scores), out = einsum('bhqk,bhkd->bhqd', weights, V). The dimension labels are b (batch), h (head), q (query position), k (key position), d (head dimension). Any shape mismatch in a production kernel shows up immediately in the einsum notation before it becomes a CUDA illegal memory access.
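A small sketch of the kind of bug the labels catch, again using NumPy's einsum as a stand-in for torch.einsum. The sizes are chosen so that sequence length equals head dimension, an assumed worst case in which a missing transpose still produces a correctly shaped result.

```python
import numpy as np

rng = np.random.default_rng(0)
b, h, t, d = 2, 4, 8, 8                  # t == d on purpose
Q = rng.normal(size=(b, h, t, d))
K = rng.normal(size=(b, h, t, d))

# Raw matmul with the transpose forgotten: runs silently because t == d,
# but contracts Q's head dimension against K's key-position axis.
wrong = Q @ K

# The einsum names the contracted axis (d) explicitly, so it cannot drift.
right = np.einsum('bhqd,bhkd->bhqk', Q, K)

# A genuine shape mismatch is rejected before any kernel launches.
K_bad = rng.normal(size=(b, h, t, d + 1))
try:
    np.einsum('bhqd,bhkd->bhqk', Q, K_bad)
    mismatch_caught = False
except ValueError:
    mismatch_caught = True
```

`wrong` and `right` have identical shapes and different values, which is exactly the class of error that passes shape checks and shows up only as degraded quality.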

The practical benchmark every engineer should run at least once: implement scaled dot-product attention three ways — naive PyTorch loops, torch.einsum, and a Triton kernel — and measure throughput at sequence lengths 512, 4K, and 16K. The gap between the naive version and the Triton kernel at 16K sequences is typically 3-5x on an H100, and every microsecond in that gap is budgeted either to the policy inference or the controller in an embodied system.
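A starting point for that exercise might look like the harness below. It is a CPU-only NumPy sketch that covers only the first two variants (explicit loops and einsum); the Triton kernel is left out. It uses non-causal attention and small stand-in sequence lengths. The function names and sizes are assumptions, and the timings it produces say nothing about H100 behavior.

```python
import time
import numpy as np

def attention_loops(Q, K, V):
    b, h, t, d = Q.shape
    out = np.empty_like(Q)
    for i in range(b):
        for j in range(h):
            for q in range(t):
                s = K[i, j] @ Q[i, j, q] / np.sqrt(d)   # scores for one query row
                w = np.exp(s - s.max())
                out[i, j, q] = (w / w.sum()) @ V[i, j]
    return out

def attention_einsum(Q, K, V):
    d = Q.shape[-1]
    s = np.einsum('bhqd,bhkd->bhqk', Q, K) / np.sqrt(d)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return np.einsum('bhqk,bhkd->bhqd', w, V)

def time_it(fn, *args, repeats=3):
    # Best-of-N wall-clock time in seconds.
    best = float('inf')
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

rng = np.random.default_rng(0)
results = {}
for seq_len in (64, 128):                # small stand-ins for 512, 4K, 16K
    Q, K, V = (rng.normal(size=(1, 2, seq_len, 16)) for _ in range(3))
    # Verify agreement before timing; a fast wrong kernel is not a speedup.
    assert np.allclose(attention_loops(Q, K, V), attention_einsum(Q, K, V))
    results[seq_len] = (time_it(attention_loops, Q, K, V),
                        time_it(attention_einsum, Q, K, V))
    print(seq_len, results[seq_len])
```

Checking numerical agreement before timing is the design choice worth carrying over to the GPU version: every new kernel variant should be validated against a reference first.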

For the JHU humanoid capstone: GR00T N1.5 runs a DiT action model at roughly 50Hz. The VLM backbone (Eagle 2.5) runs slower — its attention kernel determines the System 2 latency floor. Every microsecond you recover through numerical precision choices and kernel selection feeds directly into the control loop budget. INT8 activation quantization on the attention layers alone can recover 15-20% of latency on Jetson AGX Orin without measurable accuracy loss.

For DealLens: your retriever embeds passages using a transformer encoder. The difference in embedding quality between BF16 and FP16 is not zero: outlier dimensions in the hidden state are more stable in BF16, and the similarity scores your reranker sees downstream are more consistent. The difference is small per query but compounds across a 10K-memo corpus sweep.

FP8 on Hopper and Blackwell

FP8 E4M3 for forward pass and E5M2 for backward is the 2024-2025 production default on H100 and B200. The precision loss is real, and you need per-channel or per-group scaling for activation outliers, but DeepSeek-V3's 671B FP8 mixed-precision training run (December 2024) showed that FP8 is viable at frontier scale.

[Figure: Five Operations of the Transformer Forward Pass. Panels: Dense Matmul (GEMM, BF16+FP32); Softmax (subtract-max, stable); RMSNorm (per hidden dim); Residual Add (gradient highway); Einsum (MHA patterns). Numerical recipe: BF16 storage + FP32 accumulate. BF16 exponent range prevents overflow; FP32 accumulator preserves precision across reduction. Pure FP16 saturates at ~65504 and silently underflows gradients without NaN. FP8 E4M3 (forward) + E5M2 (backward) is the Hopper/Blackwell production default and requires per-channel scaling.]
Figure 1.1. The five operations of a transformer forward pass, with the BF16+FP32 numerical recipe that makes large-scale training stable.
Retrieve before you continue

Three questions on what you just read

Q1 (Factual): What is the standard numerical format for modern large-scale transformer training, and what makes it work?
Q2 (Conceptual): Why does softmax overflow keep reappearing in new attention kernels even though the fix is well-known?
Q3 (Synthetic): How does saving 100µs on the attention kernel translate to concrete value in the JHU humanoid capstone?