Quantization, distillation, pruning

10 atomic recall cards. Export to Anki and let spaced repetition do its slow work.

In Anki: File → Import, choose this TSV, set field separator to Tab, deck = Tokens to Embodied Minds · Ch 11, note type = Basic.

Front	Back
What is the W4A16 inference format and why is it preferred over W4A4?	4-bit weights, 16-bit activations. W4A4 fails because transformer activation outliers cause catastrophic 4-bit quantization error. W4A16 keeps activations in FP16 where outliers are represented accurately, with only a small weight dequantization overhead.
What does GPTQ do that naive round-to-nearest quantization does not?	GPTQ uses approximate second-order (Hessian) information, computed from calibration data, to minimize per-layer output error. It quantizes weights sequentially and adjusts the not-yet-quantized weights to compensate, reducing INT4 perplexity loss from ~3-5 points (round-to-nearest) to ~0.5-1.0 points.
What are activation outliers in transformer hidden states and why do they matter for quantization?	A small fraction of hidden-state dimensions consistently carry values 5-10× larger than typical (LLM.int8(), August 2022). A shared 4-bit scale must stretch to cover these outliers, so the normal values collapse onto only a few of the 16 available levels; clipping the range instead destroys the outliers. Either way the quantization error is catastrophic.
What is SmoothQuant and what problem does it solve?	SmoothQuant migrates the scale of activation outliers into the weight matrices before INT8 quantization: each activation channel is multiplied by 1/s and the matching weight row by s, which leaves the layer output unchanged. This smooths the activation distribution and enables W8A8 quantization. The factors are folded in offline, so there is no per-channel scaling overhead at runtime.
What is FP8 E4M3 vs E5M2 and why are different formats used for forward vs backward?	E4M3 (4-bit exponent, 3-bit mantissa): more precision but narrower dynamic range (max about 448), better for weights and activations in the forward pass. E5M2 (5-bit exponent, 2-bit mantissa): less precision but wider dynamic range (max about 57344), needed for gradients in the backward pass, which can span many orders of magnitude. The forward pass needs precision and the backward pass needs range, hence a different format per pass.
What is structured pruning and why does it provide GPU speedup while unstructured pruning does not?	Structured pruning removes whole components (heads, neurons, layers). The resulting network has smaller but still dense matrices that execute efficiently on GPU tensor cores. Unstructured pruning zeroes individual weights: the matrix keeps its shape and, on standard dense kernels, executes at the same throughput.
What is knowledge distillation and what does the temperature parameter control?	Train a student to minimize KL(p_teacher || p_student). Temperature T softens the teacher's distribution: higher T shifts probability mass from the argmax toward lower-ranked tokens, giving the student a richer learning signal than the argmax alone. T=2-4 is typical.
What is QAT and when should it be used instead of PTQ?	Quantization-aware training (QAT) trains with fake quantization nodes, allowing the model to adapt. Costs full training compute but recovers 0.5-1.0 perplexity points over PTQ at 4-bit. Use QAT only when the deployment precision is known before training and the quality gap from PTQ is unacceptable.
How much latency reduction does INT8 quantization of the attention layers provide on Jetson AGX Orin?	Approximately 40% reduction on the quantized attention layers. Total model latency reduction is lower (depends on FFN ratio) — typically 25-35% end-to-end on Orin's 275 TOPS INT8 compute.
Why does removing entire transformer layers (block pruning) provide more reliable speedup than head pruning?	Removing a whole layer eliminates all its operations — attention, FFN, and both norms. Head pruning reduces KV-cache and QKV compute but leaves the FFN and residual ops in place. Layer removal gives a larger, more predictable latency reduction.