The five operations that decide every transformer forward pass — dense matmul, softmax, layer or RMS norm, residual add, and the einsum patterns that express multi-head attention — are not interesting because of their math. They are interesting because their numerical behavior at FP16, BF16, and FP8 is what bites in practice. Softmax overflows. Accumulators lose precision. BF16 matmul with FP32 accumulate is the only reason large-scale training works at all, and if you do not know why, you are one rogue kernel away from a training run that silently degrades for three days before anyone notices. For the JHU humanoid capstone, this is not academic. Perception-to-action latency has a hard budget — every 100 microseconds you waste in the attention kernel is 100 microseconds the PD controller cannot use to track a smoother trajectory. For DealLens, the same matmul kernel sits below your retriever; understanding its precision trade-offs tells you why embedding quality degrades at certain dtypes.
The five operations
Dense matmul dominates wall-clock time in every transformer layer: the QKV projection, the attention output projection, and both FFN weight multiplications are all GEMM calls. At sequence length 512 on an H100, matmul accounts for roughly 60-70% of forward-pass wall-clock time; its share of FLOPs is far higher, because the remaining operations are cheap in arithmetic and bound by memory bandwidth instead. Softmax is a row-wise reduction over the key positions and is cheap, but it is also where overflow enters: poorly scaled dot products can grow large enough that exp() overflows to inf, and the normalizing division then turns inf/inf into NaN. Layer norm and RMS norm are normalization steps over the hidden dimension, cheap in FLOPs but critical for training stability because they reset the scale of activations at every sublayer boundary. Residual add costs almost nothing but is architecturally load-bearing: without the identity path, gradients vanish in deep stacks.
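The normalization and residual steps described above can be sketched in a few lines. This is a minimal NumPy illustration, not any particular library's implementation; the pre-norm wiring (x + f(norm(x))), the eps values, and the helper names are assumptions made for the example.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm over the hidden axis: center, scale to unit variance, then affine."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gamma + beta

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm over the hidden axis: divide by the root-mean-square, no centering, no bias."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gamma

def pre_norm_sublayer(x, sublayer, gamma):
    """Assumed pre-norm wiring: the identity path x bypasses the sublayer entirely."""
    return x + sublayer(rms_norm(x, gamma))

rng = np.random.default_rng(0)
hidden = 64
# Deliberately large activations, to show that both norms reset the scale.
x = rng.normal(scale=50.0, size=(2, 8, hidden)).astype(np.float32)
gamma = np.ones(hidden, dtype=np.float32)
beta = np.zeros(hidden, dtype=np.float32)

ln_out = layer_norm(x, gamma, beta)
rms_out = rms_norm(x, gamma)
res_out = pre_norm_sublayer(x, lambda h: 0.1 * h, gamma)  # toy stand-in for attention or FFN
```

Whatever the input scale, ln_out has roughly zero mean and unit variance per token, and rms_out has unit root-mean-square per token.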
The einsum patterns behind multi-head attention tie the matmul and softmax steps together. A single forward pass of causal self-attention is: Q = x W_Q, K = x W_K, V = x W_V (three matmuls), scores = Q K^T / sqrt(d_k) (one batched matmul), weights = softmax(scores + mask) (one softmax), output = weights V (one matmul), out_proj = output W_O (one matmul). That is six matmuls and one softmax, seven distinct GEMM or GEMM-adjacent operations per attention sublayer, all of which operate on BF16 tensors on modern training stacks. Writing each contraction as an einsum, for example torch.einsum('bhqd,bhkd->bhqk', Q, K) for the score step, is not just style; it forces you to name every dimension, so shape errors surface as explicit failures instead of silent broadcasting bugs.
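The seven steps listed above can be written out end to end. This is a minimal sketch in NumPy (whose einsum accepts the same subscript strings as torch.einsum), run in float64 and not BF16; the function name, the head-splitting helper, and the toy sizes are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # row-max subtraction keeps exp() finite
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(x, W_Q, W_K, W_V, W_O, n_heads):
    b, s, d_model = x.shape
    d_k = d_model // n_heads

    def split_heads(t):
        # (b, s, d_model) -> (b, h, s, d_k)
        return t.reshape(b, s, n_heads, d_k).transpose(0, 2, 1, 3)

    # Three projection matmuls.
    Q, K, V = split_heads(x @ W_Q), split_heads(x @ W_K), split_heads(x @ W_V)
    # One batched matmul for the scaled scores.
    scores = np.einsum('bhqd,bhkd->bhqk', Q, K) / np.sqrt(d_k)
    # Causal mask: -inf above the diagonal, 0 elsewhere.
    mask = np.triu(np.full((s, s), -np.inf), k=1)
    # One softmax over the key axis.
    weights = softmax(scores + mask, axis=-1)
    # One matmul to mix the values, then merge heads back to (b, s, d_model).
    out = np.einsum('bhqk,bhkd->bhqd', weights, V)
    merged = out.transpose(0, 2, 1, 3).reshape(b, s, d_model)
    # One output-projection matmul.
    return merged @ W_O, weights

rng = np.random.default_rng(0)
b, s, d_model, n_heads = 2, 5, 16, 4
x = rng.normal(size=(b, s, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(4))
y, weights = causal_self_attention(x, W_Q, W_K, W_V, W_O, n_heads)
```

A quick sanity check for causality is to perturb the last token of x and confirm that the outputs at earlier positions do not change.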
Grouped-query attention (GQA), used in Llama 3, Qwen 2.5, and Mistral, shares each K and V head across a group of query heads, so the model stores far fewer K/V heads than query heads. That cuts KV-cache memory by 4-8x with minimal accuracy loss. This is not a theoretical curiosity: for DealLens running long-context VC memo analysis, GQA is a large part of why the KV cache of a 128K-context model can fit on a single A100.
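The memory saving is straightforward arithmetic. The sketch below assumes a Llama-3-8B-like shape (32 layers, 32 query heads, 8 KV heads, head dimension 128) and a BF16 cache at 2 bytes per element; those numbers and the helper name are assumptions for illustration, so substitute your model's configuration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2, batch=1):
    """Size of the KV cache: one K and one V tensor per layer (hence the factor 2)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch

GiB = 1024 ** 3
seq_len = 128 * 1024  # 128K-token context

# Assumed shapes: full multi-head attention would keep 32 KV heads, GQA keeps 8.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=seq_len)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=seq_len)

print(f"MHA: {mha / GiB:.0f} GiB, GQA: {gqa / GiB:.0f} GiB, ratio {mha // gqa}x")
# prints "MHA: 64 GiB, GQA: 16 GiB, ratio 4x"
```

Under these assumptions the full multi-head cache alone would nearly fill an 80 GB A100 before any weights are loaded, while the GQA cache leaves room for them.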
Numerics is the bug, not the algorithm
Softmax overflow is a recurring bug rather than a one-time fix because every new attention kernel — Triton, raw CUDA, KV-cache code paths — reintroduces the risk. PyTorch's built-in F.softmax subtracts the row maximum before exponentiation. Custom kernels routinely omit this. The symptom is NaN loss at step one or step N, depending on when activations first grow large enough to overflow. The fix is mechanical: subtract the per-row max before exponentiation, which does not change the output distribution but keeps intermediate values finite.
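The failure and the fix are easy to reproduce on the CPU. This sketch uses NumPy's float16 to stand in for a half-precision kernel; the function names and the sample scores are made up for the example.

```python
import numpy as np

def naive_softmax(x):
    e = np.exp(x)  # in float16, exp() overflows to inf once x exceeds about 11.09
    return e / e.sum(axis=-1, keepdims=True)

def stable_softmax(x):
    shifted = x - x.max(axis=-1, keepdims=True)  # largest entry becomes 0, so exp() <= 1
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([[1.0, 5.0, 12.0]], dtype=np.float16)

with np.errstate(over="ignore", invalid="ignore"):
    bad = naive_softmax(scores)   # exp(12) -> inf, then inf / inf -> NaN
good = stable_softmax(scores)     # finite, rows sum to ~1

print("naive: ", bad)
print("stable:", good)
```

Both functions compute the same distribution in exact arithmetic; only the stable version keeps every intermediate value representable.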
BF16 stores the same exponent bits as FP32 (8 bits, range up to ~3.4e38) but only 7 mantissa bits instead of 23. That exponent range is what matters for activations: they do not overflow at the values LLMs routinely see. FP16, by contrast, has a 5-bit exponent and a largest finite value of 65504; anything beyond that overflows to inf, and activation outliers and unscaled attention scores can exceed it. Accumulating in FP32 then keeps the rounding error from BF16's short mantissa from growing along the reduction axis. The combination is not an arbitrary convention; it reflects what has held up empirically in large training runs such as PaLM and most of those since roughly 2021.
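Both halves of that argument, range and accumulation, can be shown with a toy emulation. NumPy has no native bfloat16, so the to_bf16 helper below is a hypothetical stand-in that rounds an FP32 encoding to its top 16 bits; it ignores NaN and inf and is meant only to illustrate the number format, not to mimic a tensor-core GEMM.

```python
import struct
import numpy as np

def to_bf16(x: float) -> float:
    """Round to the nearest bfloat16 value (ties to even), returned as a Python float.
    Illustrative only: keeps the top 16 bits of the FP32 encoding; NaN/inf not handled."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    lsb = (bits >> 16) & 1                      # lowest bit that survives the truncation
    bits = (bits + 0x7FFF + lsb) & 0xFFFF0000   # round to nearest even, drop the low 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Range: 70000 is beyond FP16's largest finite value (65504) but fine for BF16.
with np.errstate(over="ignore"):
    fp16_value = float(np.float16(70000.0))   # overflows to inf
bf16_value = to_bf16(70000.0)                 # finite, but coarsely rounded

# Precision: add 1.0 a thousand times with a BF16 accumulator vs a wide accumulator.
bf16_acc = 0.0
wide_acc = 0.0
for _ in range(1000):
    bf16_acc = to_bf16(bf16_acc + 1.0)  # stalls once the gap between BF16 values exceeds 1
    wide_acc += 1.0

print(fp16_value, bf16_value)  # inf 70144.0
print(bf16_acc, wide_acc)      # 256.0 1000.0
```

The BF16 accumulator stops at 256 because the next representable value is 258, so 256 + 1 rounds back to 256. Keeping the running sum in a wider format avoids exactly this stall.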
FP8 has hardware support on Hopper and Blackwell and is increasingly the production choice there; the common recipe is E4M3 for the forward pass and E5M2 for the backward pass. DeepSeek-V3 trained its 671B-parameter model with an FP8 mixed-precision framework in which most GEMMs run in FP8 while sensitive components stay in higher precision (DeepSeek-AI, arXiv:2412.19437, December 2024). The caveats: per-tensor scaling is insufficient for activation outliers; per-channel or per-group scaling is required. FP8 also needs tensor-core support that Ampere and earlier GPUs lack, so do not attempt it on that hardware.
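Why per-tensor scaling fails on outliers can be seen with a simulated quantizer. The fake_e4m3 function below is a hypothetical helper that rounds values to an E4M3-like grid (largest finite value 448, 4 significant bits, subnormal spacing 2**-9); it is not a bit-exact FP8 cast, and the group size of 128 is an assumption chosen for the example.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite E4M3 value

def fake_e4m3(v):
    """Round to an E4M3-like grid. Simulation for illustration, not a bit-exact cast."""
    v = np.clip(v, -E4M3_MAX, E4M3_MAX)
    _, exp = np.frexp(v)            # v = m * 2**exp with 0.5 <= |m| < 1
    exp = np.maximum(exp, -5)       # below 2**-6 the grid is uniform (subnormal range)
    step = np.ldexp(1.0, exp - 4)   # spacing between neighboring values at this exponent
    return np.round(v / step) * step

def quant_dequant(x, group_size):
    """Scale each group so its max maps to E4M3_MAX, round, then scale back."""
    groups = x.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / E4M3_MAX
    scale = np.where(scale == 0, 1.0, scale)
    return (fake_e4m3(groups / scale) * scale).reshape(x.shape)

rng = np.random.default_rng(0)
x = rng.normal(scale=0.01, size=4096)
x[7] = 1000.0  # a single activation outlier

per_tensor = quant_dequant(x, group_size=x.size)  # one scale for everything
per_group = quant_dequant(x, group_size=128)      # one scale per 128 values (assumed size)

small = np.arange(x.size) != 7                    # measure error on the non-outlier values
err_tensor = np.abs(per_tensor - x)[small].mean()
err_group = np.abs(per_group - x)[small].mean()
print(f"mean abs error, per-tensor: {err_tensor:.2e}, per-group: {err_group:.2e}")
```

With one scale for the whole tensor, the outlier pushes every ordinary value down into the coarse subnormal range. With per-group scales, only the group that contains the outlier pays that cost.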
Einsum is the language of attention
Writing multi-head attention as a sequence of raw matmuls obscures the structure. Writing it as a set of einsums makes every dimension explicit and makes bugs easier to see. The core of the forward pass collapses to: scores = einsum('bhqd,bhkd->bhqk', Q, K) * scale, weights = softmax(scores, dim=-1), out = einsum('bhqk,bhkd->bhqd', weights, V). The dimension labels are b (batch), h (head), q (query position), k (key position), d (head dimension). A mismatch in any labeled dimension fails at the einsum call with a shape error, which is a far better place to find it than a CUDA illegal memory access inside a custom kernel.
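To make the shape-checking claim concrete, the sketch below runs the two einsums, compares them against the raw-matmul form, and then feeds in a key tensor with the wrong head dimension. It uses NumPy, whose einsum takes the same subscript strings; the tensor sizes are arbitrary and chosen so that q and k differ.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

b, h, q, k, d = 2, 4, 5, 7, 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(b, h, q, d))
K = rng.normal(size=(b, h, k, d))
V = rng.normal(size=(b, h, k, d))
scale = 1.0 / np.sqrt(d)

# Einsum form: the contracted axis (d, then k) is named in the subscripts.
scores = np.einsum('bhqd,bhkd->bhqk', Q, K) * scale
weights = softmax(scores, axis=-1)
out = np.einsum('bhqk,bhkd->bhqd', weights, V)

# Raw-matmul form: same result, but the transpose hides which axes are contracted.
reference = softmax((Q @ K.swapaxes(-1, -2)) * scale, axis=-1) @ V

# A key tensor with the wrong head dimension fails at the einsum call itself.
K_bad = rng.normal(size=(b, h, k, d // 2))
try:
    np.einsum('bhqd,bhkd->bhqk', Q, K_bad)
    shape_error_caught = False
except ValueError:
    shape_error_caught = True
```

One caveat: einsum only checks the labels it is given. If q and k happen to be the same size, swapping them in the subscripts will not raise an error, so test with unequal query and key lengths.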
The practical benchmark every engineer should run at least once: implement scaled dot-product attention three ways (naive PyTorch loops, torch.einsum, and a Triton kernel) and measure throughput at sequence lengths 512, 4K, and 16K. The gap between the naive version and the Triton kernel at 16K sequences is typically in the 3-5x range on an H100, though the exact figure depends on batch size, head dimension, and how the baseline is written, so treat it as something to measure. Every microsecond in that gap is budget that goes either to policy inference or to the controller in an embodied system.
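A scaled-down version of that harness is sketched below. It is a CPU-only NumPy stand-in covering the first two variants (explicit loops and einsum); the Triton kernel and the 512 / 4K / 16K GPU sweep are left out, and the function names, sizes, and repeat count are assumptions for the example. The structure is the part to reuse: verify that the variants agree first, then time them.

```python
import time
import numpy as np

def attention_loops(Q, K, V):
    """Naive reference: explicit Python loops over batch, head, and query position."""
    b, h, s, d = Q.shape
    out = np.empty_like(Q)
    for bi in range(b):
        for hi in range(h):
            for qi in range(s):
                scores = K[bi, hi] @ Q[bi, hi, qi] / np.sqrt(d)  # shape (s,)
                w = np.exp(scores - scores.max())
                out[bi, hi, qi] = (w / w.sum()) @ V[bi, hi]
    return out

def attention_einsum(Q, K, V):
    """Same computation (non-causal) as two einsums and a stable softmax."""
    d = Q.shape[-1]
    scores = np.einsum('bhqd,bhkd->bhqk', Q, K) / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return np.einsum('bhqk,bhkd->bhqd', w, V)

def time_it(fn, *args, repeats=3):
    """Best-of-N wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

rng = np.random.default_rng(0)
results = {}
for seq_len in (64, 128):  # kept small so the sketch runs in well under a second
    Q, K, V = (rng.normal(size=(1, 2, seq_len, 32)).astype(np.float32) for _ in range(3))
    # Check correctness before trusting any timing.
    assert np.allclose(attention_loops(Q, K, V), attention_einsum(Q, K, V), atol=1e-4)
    results[seq_len] = (time_it(attention_loops, Q, K, V), time_it(attention_einsum, Q, K, V))

for seq_len, (t_loop, t_einsum) in results.items():
    print(f"seq_len={seq_len}: loops {t_loop * 1e3:.2f} ms, einsum {t_einsum * 1e3:.2f} ms")
```

On a GPU, the same harness needs a device synchronization before each timer read; otherwise it measures kernel launch time and not execution time.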
Why this maps directly to your two projects
For the JHU humanoid capstone: GR00T N1.5 runs a DiT action model at roughly 50Hz. The VLM backbone (Eagle 2.5) runs slower, and its attention kernel determines the System 2 latency floor. Every microsecond you recover through numerical precision choices and kernel selection feeds directly into the control loop budget. INT8 activation quantization on the attention layers alone can recover on the order of 15-20% of latency on Jetson AGX Orin, usually without measurable accuracy loss; confirm both numbers on your own policy before you rely on them.
For DealLens: your retriever embeds passages using a transformer encoder. The difference in embedding quality between BF16 and FP16 is not zero: outlier dimensions in the hidden state are more stable in BF16 because they stay within its exponent range, and the similarity scores your reranker sees downstream are more consistent. The difference is small per query but compounds across a 10K-memo corpus sweep.
FP8, with E4M3 for the forward pass and E5M2 for the backward pass, is the common 2024-2025 recipe on H100 and B200. The precision loss is real, and you need per-channel or per-group scaling for activation outliers, but DeepSeek-V3's 671B FP8 mixed-precision training run (December 2024) showed that FP8 training is viable at frontier scale.