Backprop, autograd, and the chain rule on GPUs

10 atomic recall cards. Export to Anki and let spaced repetition do its slow work.

In Anki: File → Import, choose this TSV, set field separator to Tab, deck = Tokens to Embodied Minds · Ch 03, note type = Basic.

Front	Back
What is a vector-Jacobian product (VJP) and how does it relate to backward()?	A VJP computes grad_output @ J, where J is the local Jacobian. backward() calls VJPs sequentially through the computation graph, accumulating scalar-loss gradients with respect to all parameters in O(parameter_count) time.
What happens if you call tensor.detach() inside a loss computation by accident?	Gradients will not flow through the detached tensor. All parameters upstream of the detach will receive zero gradients. The loss will still compute to a finite value, making this a silent bug.
What is the peak memory cost without gradient checkpointing for a 80-layer transformer?	O(depth) — all intermediate activations from 80 layers must be kept alive simultaneously during the backward pass. This can reach tens of gigabytes for large batch sizes.
What is the compute overhead of gradient checkpointing?	Roughly 33% extra compute — one additional forward pass is required during the backward pass to recompute the discarded activations.
What dtype should you use when running torch.autograd.gradcheck?	torch.float64. Finite-difference gradient estimates require high precision — FP32 is insufficient for the epsilon perturbations to be numerically accurate.
What does torch.autograd.set_detect_anomaly(True) do, and when should you use it?	It inserts a check after every backward op that verifies no gradients are NaN or Inf, with a stack trace to the forward op that produced the bad value. Use during debugging only — it adds 10-20% overhead.
When should you use a custom torch.autograd.Function instead of torch.compile?	When you need a numerically different backward pass — not just a faster one. FlashAttention is the canonical example: the backward recomputes Q, K, V from stored inputs to avoid materializing the full attention matrix.
What is retain_graph=True in backward() and when is it legitimately needed?	It prevents the autograd graph from being freed after backward(), allowing multiple backward passes through the same graph. Legitimate use: computing higher-order gradients. Accidental use: memory leak.
What is the difference between forward-mode autodiff (JVP) and reverse-mode autodiff (VJP)?	JVP propagates tangent vectors forward through the graph — cost is O(inputs). VJP propagates gradient vectors backward — cost is O(outputs). For neural nets (many inputs, scalar loss), VJP/backward is always the right choice.
Name two common causes of NaN loss in transformer training that root-cause to autograd.	Softmax overflow in a custom attention kernel (intermediate values exceed float range before subtract-max) and gradient explosion in a deep residual network (missing gradient clipping or missing layer norm between residual additions).