From Tokens to Embodied Minds · Drill cards · Chapter 11
Drills
Quantization, distillation, pruning
10 atomic recall cards. Export to Anki and let spaced repetition do its slow work.
In Anki: File → Import, choose this TSV, set field separator to Tab, deck = Tokens to Embodied Minds · Ch 11, note type = Basic.
| Front | Back |
|---|---|
| What is the W4A16 inference format and why is it preferred over W4A4? | 4-bit weights, 16-bit activations. W4A4 fails because transformer activation outliers cause catastrophic 4-bit quantization error. W4A16 keeps activations in FP16 where outliers are represented accurately, with only a small weight dequantization overhead. |
| What does GPTQ do that naive round-to-nearest quantization does not? | GPTQ uses approximate second-order (Hessian) information, estimated from calibration data, to minimize each layer's output error rather than each weight's rounding error. It quantizes weights sequentially and adjusts the not-yet-quantized weights to compensate, reducing INT4 perplexity loss from ~3-5 to ~0.5-1.0 points. |
| What are activation outliers in transformer hidden states and why do they matter for quantization? | A small fraction of hidden-state dimensions consistently carry values 5-10× larger than typical (LLM.int8(), August 2022). A low-bit quantizer must fit outliers and typical values into one shared range: either the outliers are clipped, or the scale stretches to cover them and the typical values collapse onto a handful of levels, causing catastrophic relative error for everything that is not an outlier. |
| What is SmoothQuant and what problem does it solve? | SmoothQuant migrates quantization difficulty from activations to weights: each activation channel is divided by a per-channel factor s and the matching weight row is multiplied by s, which leaves the layer output mathematically unchanged. This smooths the activation distribution before INT8 quantization, enabling W8A8 without per-channel activation scaling at runtime. |
| What is FP8 E4M3 vs E5M2 and why are different formats used for forward vs backward? | E4M3 (4-bit exponent, 3-bit mantissa): more precision but narrower dynamic range (max ≈ 448), better for weights and activations in the forward pass. E5M2 (5-bit exponent, 2-bit mantissa): less precision but wider dynamic range (max 57344), needed for gradients in the backward pass, which can span many orders of magnitude. The two passes trade precision against range differently, so each gets its own format. |
| What is structured pruning and why does it provide GPU speedup while unstructured pruning does not? | Structured pruning removes whole components (heads, neurons, layers), so the resulting network has smaller but still dense matrices that execute efficiently on GPU tensor cores. Unstructured pruning zeroes individual weights: the matrix keeps the same shape, and dense kernels run it at the same throughput unless a sparse kernel or a hardware-supported sparsity pattern is used. |
| What is knowledge distillation and what does the temperature parameter control? | Train a student to minimize KL(p_teacher \|\| p_student), with both distributions computed at temperature T. Temperature softens the teacher's distribution: higher T moves probability mass from the argmax onto the other tokens, giving the student a richer learning signal about which wrong answers are nearly right. T=2-4 is typical. |
| What is QAT and when should it be used instead of PTQ? | Quantization-aware training (QAT) trains with fake quantization nodes, allowing the model to adapt. Costs full training compute but recovers 0.5-1.0 perplexity points over PTQ at 4-bit. Use QAT only when the deployment precision is known before training and the quality gap from PTQ is unacceptable. |
| How much latency reduction does INT8 quantization of the attention layers provide on Jetson AGX Orin? | Approximately 40% reduction on the quantized attention layers. Total model latency reduction is lower (depends on FFN ratio) — typically 25-35% end-to-end on Orin's 275 TOPS INT8 compute. |
| Why does removing entire transformer layers (block pruning) provide more reliable speedup than head pruning? | Removing a whole layer eliminates all its operations — attention, FFN, and both norms. Head pruning reduces KV-cache and QKV compute but leaves the FFN and residual ops in place. Layer removal gives a larger, more predictable latency reduction. |
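
The outlier card can be checked by hand with a few lines of Python. This is a minimal sketch, assuming symmetric absmax round-to-nearest quantization with one shared scale per vector; the function name `quantize_absmax` and the sample values are illustrative and do not come from the chapter.

```python
def quantize_absmax(values, bits):
    """Symmetric round-to-nearest quantization, then dequantization."""
    qmax = 2 ** (bits - 1) - 1                   # 7 for 4-bit, 127 for 8-bit
    scale = max(abs(v) for v in values) / qmax   # one shared scale for the whole vector
    return [round(v / scale) * scale for v in values]


typical = [0.1, -0.2, 0.3]        # illustrative "normal" activations (assumed values)
with_outlier = typical + [8.0]    # the same values plus one outlier channel

# Without the outlier, 4 bits keep every value distinct from zero.
print(quantize_absmax(typical, bits=4))
# With the outlier, the scale stretches and the typical values round to zero.
print(quantize_absmax(with_outlier, bits=4))
# 8 bits give the same vector enough levels to keep the typical values non-zero.
print(quantize_absmax(with_outlier, bits=8))
```

In this toy case the outlier itself survives 4-bit quantization almost exactly; the damage falls on the values that share its scale.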
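
The SmoothQuant card describes an equivalent transformation, which can be sketched in plain Python. The smoothing factor below uses the per-channel rule s_j = max|X_j|^α / max|W_j|^(1−α) with α = 0.5; the helper names `smooth` and `matmul` and the tiny matrices are assumptions for illustration, not the library's API.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def smooth(X, W, alpha=0.5):
    """Divide activation channel j by s[j] and multiply weight row j by s[j]."""
    n_ch = len(W)  # X: tokens x channels, W: channels x outputs
    s = []
    for j in range(n_ch):
        a_max = max(abs(row[j]) for row in X)   # largest activation in channel j
        w_max = max(abs(w) for w in W[j])       # largest weight in row j
        s.append(a_max ** alpha / w_max ** (1 - alpha))
    X_s = [[row[j] / s[j] for j in range(n_ch)] for row in X]
    W_s = [[w * s[j] for w in W[j]] for j in range(n_ch)]
    return X_s, W_s, s


X = [[0.5, 40.0], [-0.3, -32.0]]   # channel 1 is an outlier channel (assumed values)
W = [[0.2, -0.1], [0.05, 0.1]]

X_s, W_s, s = smooth(X, W)
print(s)                                # per-channel smoothing factors
print(matmul(X, W), matmul(X_s, W_s))   # layer output before and after smoothing
```

With α = 0.5 the largest activation and the largest weight in each channel end up with the same magnitude, so neither side of the product is left holding an extreme range before INT8 quantization.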
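
The effect of temperature in the distillation card is easy to see numerically. The sketch below assumes the common convention of scaling the KL term by T² so that gradient magnitudes stay comparable across temperatures; the logits and the function names are illustrative assumptions.

```python
import math


def softmax(logits, T=1.0):
    """Softmax at temperature T, shifted by the max logit for numerical stability."""
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distill_loss(teacher_logits, student_logits, T=2.0):
    """KL(p_teacher || p_student) at temperature T, scaled by T^2 (assumed convention)."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return T * T * kl(p, q)


teacher_logits = [4.0, 1.0, 0.5]   # illustrative values
student_logits = [2.0, 2.0, 0.0]

print(softmax(teacher_logits, T=1.0))   # sharply peaked on the argmax
print(softmax(teacher_logits, T=4.0))   # mass spread onto the other tokens
print(distill_loss(teacher_logits, student_logits, T=2.0))
```

In practice this term is usually combined with the ordinary cross-entropy loss on the ground-truth labels; that weighting is omitted here.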