Chapter 11 · Quantization, distillation, pruning

The three compression techniques (quantization, distillation, and pruning) exist because frontier models are too large to run on the hardware where much inference happens: phones, edge devices, a Jetson AGX Orin in a humanoid robot. Quantization reduces the bit-width of weights, activations, or both. Distillation trains a smaller student to match a larger teacher's output distribution. Pruning removes parameters. Each technique has a different failure mode, a different hardware dependency, and a different set of conditions under which it is worth the engineering cost.

For the JHU humanoid capstone, all three are directly relevant. GR00T N1.5 at full BF16 precision does not fit comfortably within the memory and latency budget on the Jetson AGX Orin. INT8 quantization of the attention-layer activations alone cuts latency by roughly 40% with negligible quality loss. SmolVLA-450M was designed explicitly for consumer-GPU deployment: it is the same design logic applied at the architecture level rather than post hoc. For DealLens, quantization determines the cost per inference at serving scale.

Quantization — the three knobs

Post-training quantization (PTQ) is the standard first approach: quantize a trained model without retraining. GPTQ (Frantar et al., arXiv:2210.17323, October 2022) quantizes each layer's weights a column at a time and uses approximate second-order (Hessian) information to adjust the not-yet-quantized weights so that they compensate for the error already introduced. It produces INT4 models with a perplexity loss of roughly 0.5-1.0 on WikiText-2 relative to FP16. AWQ (Lin et al., arXiv:2306.00978, June 2023) is activation-aware weight quantization: it identifies the small fraction of weight channels that multiply large-magnitude activations and protects them with per-channel scaling. Both produce INT4 weights with FP16 activations (W4A16), which is the practical serving sweet spot.
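As a point of reference for what GPTQ and AWQ improve on, here is a minimal sketch of the naive baseline: group-wise round-to-nearest INT4 weight quantization in NumPy. This is not GPTQ or AWQ. The function names, the group size of 64, and the asymmetric min/max scheme are assumptions chosen for illustration.

```python
import numpy as np

def quantize_rtn_int4(w, group_size=64):
    """Asymmetric round-to-nearest INT4, one scale and offset per group of weights."""
    rows, cols = w.shape
    assert cols % group_size == 0
    g = w.reshape(rows, cols // group_size, group_size)
    w_min = g.min(axis=-1, keepdims=True)
    w_max = g.max(axis=-1, keepdims=True)
    scale = (w_max - w_min) / 15.0            # 4 bits give 16 levels: codes 0..15
    scale = np.where(scale == 0, 1.0, scale)  # guard against constant groups
    codes = np.clip(np.round((g - w_min) / scale), 0, 15).astype(np.uint8)
    return codes, scale, w_min

def dequantize(codes, scale, w_min, shape):
    """Map 4-bit codes back to floating point (what a W4A16 kernel does on the fly)."""
    return (codes * scale + w_min).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(128, 256)).astype(np.float32)  # toy weight matrix
codes, scale, w_min = quantize_rtn_int4(w)
w_hat = dequantize(codes, scale, w_min, w.shape)
max_err = float(np.abs(w - w_hat).max())
print(f"max abs error: {max_err:.6f}, largest half-step: {float(scale.max()) / 2:.6f}")
```

Round-to-nearest treats every weight independently, so its error is bounded only by half a quantization step per weight. GPTQ's contribution is to choose the rounding so that the layer's output error, not the per-weight error, is minimized.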

The reason W4A16 beats W4A4: transformer hidden states contain systematic outlier dimensions (LLM.int8(), Dettmers et al., arXiv:2208.07339, August 2022). A small fraction of channels consistently carry values 5-10× larger than the rest, and sometimes far more. A 4-bit grid has only 16 levels, so a scale wide enough to cover the outliers leaves almost no resolution for the ordinary values, while a scale tuned to the ordinary values clips the outliers. Keeping activations in FP16 avoids the problem entirely. SmoothQuant (Xiao et al., arXiv:2211.10438, November 2022) divides each activation channel by a smoothing factor and multiplies the matching weight row by the same factor, which leaves the layer output unchanged while migrating outlier magnitude into the weights. This enables W8A8 without per-channel activation scaling at run time.
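The migration of outlier magnitude from activations into weights can be sketched in a few lines. The per-channel factor below follows the form given in the SmoothQuant paper, s_j = max|X_j|^alpha / max|W_j|^(1 - alpha). The function name, the toy shapes, and alpha = 0.5 are assumptions, and a real pipeline would estimate the activation maxima from calibration data rather than from a single batch.

```python
import numpy as np

def smooth(x, w, alpha=0.5):
    """x: (tokens, in_features), w: (in_features, out_features)."""
    act_max = np.abs(x).max(axis=0)   # largest activation per input channel
    w_max = np.abs(w).max(axis=1)     # largest weight per input channel
    s = act_max ** alpha / w_max ** (1 - alpha)
    s = np.maximum(s, 1e-5)           # avoid division by zero
    # Scale activations down and weights up so the product is unchanged.
    return x / s, w * s[:, None], s

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
x[:, 3] *= 50.0                       # inject one outlier channel
w = rng.normal(size=(8, 4))
x_s, w_s, s = smooth(x, w)

print("activation max before:", float(np.abs(x).max()))
print("activation max after: ", float(np.abs(x_s).max()))
print("output preserved:", bool(np.allclose(x @ w, x_s @ w_s)))
```

Because the scaling cancels in the matrix product, the smoothing can be folded into the weights offline, and only the flattened activations need to be quantized at run time.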

FP8 training is the 2024-2025 standard on H100 and B200 hardware; the common recipe uses E4M3 for the forward pass and E5M2 for gradients in the backward pass. DeepSeek-V3 (arXiv:2412.19437) trained its 671B-parameter model with an FP8 mixed-precision framework, keeping a few sensitive components in higher precision. The key requirement is fine-grained scaling (per-channel or per-group) to contain activation outliers; per-tensor scaling is insufficient. On H100 tensor cores, FP8 matmuls have roughly 2x the peak throughput of BF16.

Distillation — matching the teacher's distribution

Knowledge distillation (Hinton et al., arXiv:1503.02531, March 2015) trains a smaller student to minimize KL(p_teacher || p_student) rather than just cross-entropy on hard labels. The intuition: the teacher's soft predictions encode information about the structure of the problem ('this token is almost as likely as that token') that hard labels discard. Dividing the logits by a temperature T in the range 2-4 before the softmax flattens both distributions, so more probability mass reaches the runner-up tokens and the student gets a richer gradient signal.
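A minimal NumPy sketch of the distillation objective follows. It combines the temperature-softened KL term with ordinary cross-entropy on the hard labels. The mixing weight alpha, the function names, and the toy logits are assumptions; the T² factor on the soft term follows the suggestion in Hinton et al. for keeping gradient magnitudes comparable across temperatures.

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """alpha weights the soft (teacher) term, 1 - alpha the hard-label term."""
    p_t = np.clip(softmax(teacher_logits, T), 1e-12, None)
    p_s = np.clip(softmax(student_logits, T), 1e-12, None)
    # KL(p_teacher || p_student), averaged over the batch
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1).mean()
    # Standard cross-entropy on hard labels at temperature 1
    p_hard = np.clip(softmax(student_logits), 1e-12, None)
    ce = -np.log(p_hard[np.arange(len(labels)), labels]).mean()
    return alpha * T ** 2 * kl + (1 - alpha) * ce

teacher = np.array([[4.0, 3.5, 0.1], [0.2, 5.0, 4.6]])   # toy logits, 3-token vocab
student = np.array([[2.0, 0.5, 0.3], [1.0, 2.0, 0.5]])
labels = np.array([0, 1])
loss = distillation_loss(student, teacher, labels)
print(f"distillation loss: {loss:.4f}")
```

In the toy teacher logits, the first row puts nearly as much mass on token 1 as on token 0. That near-tie is exactly the information a hard label of 0 would throw away.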

Many small models released in 2024-2025, in the class of 'Llama 3.2 3B' or 'Qwen 2.5 1.5B', rely on distillation somewhere in their pipeline, although how much varies by model and is not always documented. The typical pattern: base pretraining creates a capable student at the target size, fine-tuning adds instruction following, and distillation from a larger model on the fine-tuning data recovers part of the quality that the small size would otherwise cost. For DealLens, distilling from GPT-4o on a labeled memo dataset can produce a domain-adapted scorer that is cheap to run at inference and more reliable than the generic base model on deal-specific terminology.

Pruning — what actually works on GPU

Unstructured pruning, which zeroes individual weights below a magnitude threshold, produces sparse weight matrices that are smaller in theory but offer no speedup on GPUs in practice. GPU tensor cores execute dense matmuls. The hardware sparse path, NVIDIA's 2:4 structured sparsity introduced with Ampere, requires at most 2 nonzero weights in every group of 4 consecutive elements. Arbitrary unstructured sparsity gets no hardware benefit without custom sparse kernels.
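The 2:4 pattern is easy to state in code. The sketch below keeps the two largest-magnitude weights in every group of four consecutive elements along a row and zeroes the other two. Magnitude as the selection criterion and the function name are assumptions for illustration, and the sketch only builds the pattern: the actual speedup requires a sparse tensor-core kernel to consume it.

```python
import numpy as np

def prune_2_of_4(w):
    """Zero the two smallest-magnitude weights in each group of 4 along a row."""
    rows, cols = w.shape
    assert cols % 4 == 0
    groups = w.reshape(rows, cols // 4, 4)
    # Indices of the two smallest-magnitude entries in each group of four
    drop = np.argsort(np.abs(groups), axis=-1)[..., :2]
    mask = np.ones(groups.shape, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return (groups * mask).reshape(rows, cols), mask.reshape(rows, cols)

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))
w_sparse, mask = prune_2_of_4(w)
print(w_sparse[0, :8])                 # two zeros in each group of four
print("sparsity:", 1 - mask.mean())
```

Compare this with thresholding the whole matrix at a global magnitude: the overall sparsity could be the same 50%, but the zeros would land in arbitrary positions that the hardware cannot skip.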

Structured pruning removes entire components: whole attention heads, entire FFN neurons, or entire transformer layers. Removing the lowest-magnitude 10% of attention heads in each layer of a 32-head model (about 3 heads) degrades perplexity by ~0.3 and reduces KV-cache size by about 10%, provided the pruned heads take their key and value projections with them; under grouped-query attention, dropping query heads alone leaves the cache unchanged. Removing entire transformer layers (the 4-5 layers with the lowest block importance) from a 32-layer model can cut inference latency by about 15% at a cost of 1-2 perplexity points, which is acceptable for applications where latency matters more than marginal accuracy.
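Head removal can be sketched as slicing whole blocks out of the attention output projection. The example below scores each head by the Frobenius norm of its slice of the output projection and drops the lowest-scoring ones. Both the scoring rule and the layout of `w_o` as `(n_heads * head_dim, d_model)` are assumptions for illustration; practical methods often use activation- or gradient-based importance instead.

```python
import numpy as np

def prune_heads(w_o, n_heads, n_drop):
    """w_o: (n_heads * head_dim, d_model). Returns kept head indices and smaller w_o."""
    head_dim = w_o.shape[0] // n_heads
    per_head = w_o.reshape(n_heads, head_dim, -1)
    # Hypothetical importance proxy: Frobenius norm of each head's slice
    importance = np.linalg.norm(per_head, axis=(1, 2))
    keep = np.sort(np.argsort(importance)[n_drop:])
    return keep, per_head[keep].reshape(len(keep) * head_dim, -1)

rng = np.random.default_rng(0)
n_heads, head_dim, d_model = 32, 8, 256
w_o = rng.normal(size=(n_heads * head_dim, d_model))
for h in (0, 5, 17):                        # make three heads nearly inert
    w_o[h * head_dim:(h + 1) * head_dim] *= 0.01

keep, w_o_pruned = prune_heads(w_o, n_heads, n_drop=3)   # about 10% of 32 heads
print("kept heads:", len(keep), "new shape:", w_o_pruned.shape)
```

The same `keep` indices would then be used to slice the query, key, and value projections, so the result is a smaller dense layer that runs on standard kernels with no sparse support required.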

Compression for edge deployment

For the JHU humanoid, the Jetson AGX Orin has 64 GB of unified memory and delivers up to ~275 TOPS at INT8. SmolVLA-450M at FP16 occupies ~900 MB of weights (450M parameters × 2 bytes) plus KV-cache and activations, which is well within budget. GR00T N1.5's full model (the VLM backbone plus DiT) is far larger. The practical deployment path: run the DiT action model at INT8 for a latency reduction of about 40% on Orin, run the VLM backbone at BF16 (or INT8 with SmoothQuant), and use structured head pruning to remove the 10% lowest-importance attention heads from the VLM at a cost of about 0.3 perplexity points.
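The weight-memory figure above is simple arithmetic, sketched here for several precisions. The helper name is hypothetical, and the numbers cover weights only: KV-cache, activations, and runtime overhead come on top.

```python
def weight_bytes(n_params, bits):
    """Bytes needed to store n_params weights at the given bit-width."""
    return n_params * bits / 8

ORIN_MEMORY_GB = 64        # Jetson AGX Orin unified memory, from the text
SMOLVLA_PARAMS = 450e6

footprint_mb = {
    name: weight_bytes(SMOLVLA_PARAMS, bits) / 1e6
    for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]
}
for name, mb in footprint_mb.items():
    share = mb / (ORIN_MEMORY_GB * 1000)
    print(f"{name}: {mb:.0f} MB of weights ({share:.1%} of unified memory)")
```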

For DealLens at serving scale (say 10K queries/day), the cost comparison runs as follows. Llama 3 70B at BF16 needs about 140 GB for weights alone, so it must be sharded across at least two 80 GB A100s, and costs approximately $0.015/query. Quantized to INT4 (W4A16 via GPTQ), the roughly 35 GB of weights fit on a single A100: $0.005/query. Distilled to a 7B model: $0.002/query. The three compression techniques are not academic. They determine whether DealLens is economically viable at scale without a frontier API dependency.
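To make the comparison concrete, the per-query figures above can be turned into daily costs at the stated 10K queries/day. The per-query prices are taken directly from the text as rough estimates, and the dictionary keys are labels invented for this sketch.

```python
QUERIES_PER_DAY = 10_000

# Approximate cost per query in USD, as quoted in the text
cost_per_query = {
    "70b_bf16": 0.015,
    "70b_int4_w4a16": 0.005,
    "7b_distilled": 0.002,
}

daily_cost = {name: c * QUERIES_PER_DAY for name, c in cost_per_query.items()}
baseline = daily_cost["70b_bf16"]

for name, cost in daily_cost.items():
    print(f"{name}: ${cost:,.0f}/day ({baseline / cost:.1f}x cheaper than BF16)")
```

At this volume the gap between the BF16 baseline and the distilled 7B model is on the order of $130 per day, which compounds quickly as query volume grows.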

QAT vs PTQ — when to pay the training cost

Quantization-aware training (QAT) inserts fake-quantization nodes into the forward pass (quantize, then immediately dequantize) so that the model adapts to quantization error during training. Gradients pass through the non-differentiable rounding step via a straight-through estimator. QAT costs training-scale compute but recovers 0.5-1.0 perplexity points over PTQ at 4-bit. Pay for QAT only if your deployment precision is known before training. Otherwise PTQ with GPTQ is the correct first approach.
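A fake-quantization node can be sketched without any training framework. The forward pass below rounds weights onto a signed 4-bit grid and immediately maps them back to floating point. The backward pass uses a straight-through estimator, a common choice for QAT that treats rounding as the identity and blocks gradients only where values were clipped. The symmetric per-tensor scale and the function names are assumptions for illustration.

```python
import numpy as np

def fake_quant(w, bits=4):
    """Quantize then dequantize, so downstream ops see quantization error in FP."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for signed 4-bit
    scale = np.abs(w).max() / qmax             # symmetric per-tensor scale
    q_raw = np.round(w / scale)
    q = np.clip(q_raw, -qmax - 1, qmax)
    in_range = q_raw == q                      # False where a value was clipped
    return q * scale, scale, in_range

def ste_backward(grad_out, in_range):
    """Straight-through estimator: pass gradients through, except where clipped."""
    return grad_out * in_range

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(64, 64))
w_q, scale, in_range = fake_quant(w)
grad_w = ste_backward(np.ones_like(w), in_range)

print("distinct values after fake quant:", np.unique(w_q).size)
print("max abs error:", float(np.abs(w - w_q).max()))
```

During QAT the full-precision weights `w` remain the trainable parameters, and `w_q` is what the forward pass actually uses. That is why the precision has to be fixed before training starts.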

Figure 11.1. The three compression techniques and their hardware dependencies: quantization (W4A16 is the GPU sweet spot, FP8 for H100 training), distillation (many small models are distilled models), and pruning (structured removal works, unstructured does not on tensor cores).

Retrieve before you continue

Three questions on what you just read

Q1 (Factual) What is GPTQ and why does it produce better quantized models than naive round-to-nearest at INT4?

Q2 (Conceptual) Why does unstructured weight pruning not provide inference speedup on modern GPUs?

Q3 (Synthetic) Propose a quantization strategy for deploying SmolVLA-450M on Jetson AGX Orin within a 15 ms per-step latency budget.