Name the five operations a transformer forward pass spends most of its time on.	Dense matmul, softmax, layer/RMS norm, residual add, and the einsum patterns of multi-head attention.
Why is BF16 storage with FP32 accumulate the standard training format?	BF16 keeps FP32's 8-bit exponent and therefore its range (up to ~3.4e38), so activations do not overflow; FP32 accumulators preserve precision across the long reduction axis. FP16 has only a 5-bit exponent (max 65504), so without loss scaling large activations overflow and small gradients silently underflow to zero.
What is the symptom of softmax overflow in a training run?	NaN loss, either at step one or at a later step when activations grow large enough to push exponentiated scores to infinity.
What is the subtract-max trick in softmax and why does it work?	Subtract the per-row maximum from the logits before exponentiation. Every exponent is then at most 0, so exp() cannot overflow, and the output distribution is unchanged because the factor exp(-max) appears in both the numerator and the denominator and cancels.
Why does grouped-query attention (GQA) reduce KV-cache memory?	GQA shares each K and V head across a group of query heads, reducing the number of KV heads from H to G (G < H, for example 32 query heads sharing 8 KV heads). This cuts KV-cache memory by a factor of H/G, typically without significant accuracy loss.
Write multi-head attention score computation as an einsum with labeled dimensions.	torch.einsum('bhqd,bhkd->bhqk', Q, K) * (1/sqrt(d_k)), where b=batch, h=head, q=query position, k=key position, d=head dimension (of size d_k).
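Which FP8 format is used for the forward pass, and why?	E4M3 (4-bit exponent, 3-bit mantissa) for forward. Its extra mantissa bit gives more precision than E5M2, which weights and activations need, at the cost of a narrower range (max 448 versus 57344). E5M2 is typically reserved for gradients in the backward pass, where dynamic range matters more than precision.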
What is the practical throughput gap between a naive PyTorch attention and a Triton kernel at sequence length 16K?	Typically 3-5x on an H100, driven by memory traffic rather than FLOPs: the naive version materializes the full N×N score matrix in HBM, while a fused Triton kernel computes attention tile by tile and keeps each score tile in on-chip SRAM.
Why does residual add matter architecturally despite consuming almost no FLOPs?	Residuals create gradient highways through the network; without them, gradients vanish in deep models. They also allow each layer to represent a small correction to the previous representation rather than a full re-encoding.
In the JHU humanoid control loop, what is the consequence of wasting 100µs in the attention kernel?	100µs less budget for the PD controller per step. At 50Hz, control steps are 20ms each; 100µs is 0.5% of the budget and compounds into measurably rougher trajectories over a long-horizon task.
