From Tokens to Embodied Minds · Drill cards · Chapter 03
Drills
Backprop, autograd, and the chain rule on GPUs
10 atomic recall cards. Export to Anki and let spaced repetition do its slow work.
In Anki: File → Import, choose this TSV, set field separator to Tab, deck = Tokens to Embodied Minds · Ch 03, note type = Basic.
| Front | Back |
|---|---|
| What is a vector-Jacobian product (VJP) and how does it relate to backward()? | A VJP computes grad_output @ J, where J is the local Jacobian. backward() calls VJPs sequentially through the computation graph, accumulating scalar-loss gradients with respect to all parameters in O(parameter_count) time. |
| What happens if you call tensor.detach() inside a loss computation by accident? | Gradients will not flow through the detached tensor. All parameters upstream of the detach will receive zero gradients. The loss will still compute to a finite value, making this a silent bug. |
| What is the peak memory cost without gradient checkpointing for a 80-layer transformer? | O(depth) — all intermediate activations from 80 layers must be kept alive simultaneously during the backward pass. This can reach tens of gigabytes for large batch sizes. |
| What is the compute overhead of gradient checkpointing? | Roughly 33% extra compute — one additional forward pass is required during the backward pass to recompute the discarded activations. |
| What dtype should you use when running torch.autograd.gradcheck? | torch.float64. Finite-difference gradient estimates require high precision — FP32 is insufficient for the epsilon perturbations to be numerically accurate. |
| What does torch.autograd.set_detect_anomaly(True) do, and when should you use it? | It inserts a check after every backward op that verifies no gradients are NaN or Inf, with a stack trace to the forward op that produced the bad value. Use during debugging only — it adds 10-20% overhead. |
| When should you use a custom torch.autograd.Function instead of torch.compile? | When you need a numerically different backward pass — not just a faster one. FlashAttention is the canonical example: the backward recomputes Q, K, V from stored inputs to avoid materializing the full attention matrix. |
| What is retain_graph=True in backward() and when is it legitimately needed? | It prevents the autograd graph from being freed after backward(), allowing multiple backward passes through the same graph. Legitimate use: computing higher-order gradients. Accidental use: memory leak. |
| What is the difference between forward-mode autodiff (JVP) and reverse-mode autodiff (VJP)? | JVP propagates tangent vectors forward through the graph — cost is O(inputs). VJP propagates gradient vectors backward — cost is O(outputs). For neural nets (many inputs, scalar loss), VJP/backward is always the right choice. |
| Name two common causes of NaN loss in transformer training that root-cause to autograd. | Softmax overflow in a custom attention kernel (intermediate values exceed float range before subtract-max) and gradient explosion in a deep residual network (missing gradient clipping or missing layer norm between residual additions). |