Predict before you read

Before you read — why does speculative decoding always produce the same output distribution as the target model, even though a weaker draft model proposes the tokens?

The target model does not blindly accept draft tokens — it applies a specific statistical correction.

From Tokens to Embodied Minds  ·  Chapter 19 of 36

KV-cache, speculative decoding, Medusa

Latency tricks that actually work

2–3×: decode speedup from speculative decoding with a well-matched draft model
~0 ms: effective prefill cost for cached prefixes; prefix caching is the biggest ROI optimization
Medusa: eliminates the separate draft model with multi-head speculative prediction on the target model itself

Three latency tricks dominate production LLM serving optimization in 2026, and they are independent enough to stack: KV-cache (without which serving is not possible), prefix caching (the biggest ROI optimization for any workload with repeated context), and speculative decoding (the 2–3× decode speedup that closes the gap between auto-regressive generation and human reading speed). Each deserves precise understanding, because each has failure modes that practitioners consistently misattribute to model quality. KV-cache is the single largest memory cost in serving and the constraint that drives every other architectural decision in this chapter. Speculative decoding is the family of techniques — draft model, Medusa, EAGLE — that exploits the observation that a weaker model can propose tokens cheaply, and the target model can verify a batch of them in one forward pass, yielding a 2–3× speedup with exact output distribution. The unifying theme is latency arithmetic: every millisecond matters in an agentic system where calls are sequential and deadlines are tight.

KV-cache: the memory cost formula

KV-cache stores the key and value projections for every layer and every token in the context. Memory formula: 2 (K and V) × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_element, where num_kv_heads is the number of key/value heads (under GQA this is smaller than the number of query heads). For Llama 3.1 70B (80 layers, GQA with 8 KV heads, head_dim=128, BF16): 2 × 80 × 8 × 128 × seq_len × 2 bytes = 327,680 × seq_len bytes. At seq=8192: 2.68 GB per request. At batch=32 and seq=8192: 85.9 GB, more than the full HBM of a single 80 GB H100. This is why PagedAttention is not optional at production batch sizes, and why quantizing the KV-cache (INT8 or FP8) reduces serving memory pressure proportionally.
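The arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the function name `kv_cache_bytes` and its parameter names are illustrative, and the model figures are the ones quoted in the text.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_element=2, batch_size=1):
    """Bytes of KV-cache: K and V tensors per layer, per KV head, per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return per_token * seq_len * batch_size

# Llama 3.1 70B figures quoted in the text: 80 layers, 8 KV heads, head_dim 128, BF16.
per_token = kv_cache_bytes(80, 8, 128, seq_len=1)
per_request = kv_cache_bytes(80, 8, 128, seq_len=8192)
per_batch = kv_cache_bytes(80, 8, 128, seq_len=8192, batch_size=32)
# Same batch with 1-byte (INT8 or FP8) cache entries.
quantized_batch = kv_cache_bytes(80, 8, 128, seq_len=8192,
                                 bytes_per_element=1, batch_size=32)

print(f"per token:        {per_token:,} bytes")
print(f"per request:      {per_request / 1e9:.2f} GB")
print(f"batch of 32:      {per_batch / 1e9:.1f} GB")
print(f"batch of 32, 8-bit KV: {quantized_batch / 1e9:.1f} GB")
```

The figures use decimal gigabytes (1 GB = 10^9 bytes), matching the 2.68 GB and 85.9 GB quoted above.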

Prefix caching eliminates KV recomputation for the shared portion of requests. For DealLens, the 6,000-token screening system prompt accounts for 2.68 GB × (6000/8192) ≈ 1.97 GB of KV state per request at Llama 3.1 70B. Without prefix caching, this is recomputed for every deal. With prefix caching (SGLang RadixAttention or vLLM's prefix cache), this KV state is computed once and reused, so a 6K-system-prompt + 2K-deal-memo workload prefills only the 2K new tokens, about 75% fewer prefill tokens per request. Cache hits on the shared prefix are essentially free; the remaining cost is prefilling the deal memo and generating the 200-token response.
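One way to picture how a server recognizes a shared prefix is block-level hashing: the token sequence is split into fixed-size blocks, and each block's key is a hash of its tokens chained with the previous block's key. The sketch below is a toy illustration of that idea, not the vLLM or SGLang implementation; `ToyPrefixCache`, `block_keys`, and the block size of 16 are assumptions made for the example, and it tracks only which blocks are cached, not the KV tensors themselves.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV block (illustrative value)

def block_keys(token_ids, block_size=BLOCK_SIZE):
    """One chained hash per full block; each key identifies the whole prefix up to that block."""
    keys, parent = [], b""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        keys.append(digest)
        parent = digest
    return keys

class ToyPrefixCache:
    def __init__(self):
        self._cached = set()

    def cached_prefix_len(self, token_ids):
        """Number of leading tokens whose blocks are already in the cache."""
        hits = 0
        for key in block_keys(token_ids):
            if key not in self._cached:
                break
            hits += 1
        return hits * BLOCK_SIZE

    def insert(self, token_ids):
        self._cached.update(block_keys(token_ids))

# Stand-ins for the 6,000-token system prompt and two 2,000-token deal memos.
system_prompt = list(range(6000))
memo_a = list(range(10_000, 12_000))
memo_b = list(range(20_000, 22_000))

cache = ToyPrefixCache()
cache.insert(system_prompt + memo_a)            # first request populates the cache
reused = cache.cached_prefix_len(system_prompt + memo_b)
total = len(system_prompt) + len(memo_b)
print(f"{reused} of {total} prefill tokens reused ({reused / total:.0%})")
```

Chaining each key to its parent means a block only matches when everything before it matches too, which is why only a shared leading prefix, not a shared substring in the middle of the prompt, produces cache hits.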

KV-cache quantization is a practical optimization: INT8 KV-cache (storing keys and values at 8-bit precision instead of BF16) halves the KV memory at a typical 0.1–0.5 perplexity point degradation. For the JHU humanoid's planning LLM running on a memory-constrained edge device, INT8 KV-cache may be the difference between fitting the required context window and not.
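To make the storage trade-off concrete, here is a minimal sketch of symmetric INT8 quantization applied to a single key vector, in pure Python. It is illustrative only: production kernels typically use per-channel or per-token scales and fused GPU code, and the helper names and sample values below are made up for the example.

```python
def quantize_int8(vector):
    """Symmetric per-vector INT8 quantization: returns (integer codes, scale)."""
    max_abs = max(abs(x) for x in vector)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in vector]
    return codes, scale

def dequantize_int8(codes, scale):
    """Recover approximate float values from the stored codes."""
    return [c * scale for c in codes]

key_vector = [0.12, -1.5, 0.73, 0.0, 2.4, -0.31]  # made-up values for one head
codes, scale = quantize_int8(key_vector)
restored = dequantize_int8(codes, scale)
max_error = max(abs(a - b) for a, b in zip(key_vector, restored))

print("codes:", codes)
print(f"scale: {scale:.5f}, max abs error: {max_error:.5f}")
```

Each element is stored in one byte instead of two, plus one scale per vector, and the rounding error per element is bounded by half the scale. That bounded error is the source of the small perplexity degradation mentioned above.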

Speculative decoding: the draft-then-verify pattern

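Speculative decoding (Leviathan et al., arXiv:2211.17192, November 30, 2022) exploits the fact that scoring K tokens in one target model forward pass is only marginally more expensive than generating 1 token. The algorithm: (1) a cheap draft model proposes K tokens autoregressively; (2) the target model scores all K positions in one forward pass, computing p_target for each; (3) each draft token is accepted, in order, with probability min(1, p_target / p_draft), stopping at the first rejection; (4) the rejected position is resampled from the corrected distribution, max(0, p_target − p_draft) renormalized, and if all K tokens are accepted, one extra token is sampled from the target's own next-token distribution. This accept-or-resample rule is what keeps the output distribution exactly equal to the target's. Because acceptance stops at the first rejection, the expected number of tokens per step is (1 − α^(K+1)) / (1 − α) for a per-token acceptance rate α (assuming independent acceptances), which is less than K × α + 1. Each step costs one target pass plus K draft passes, so the speedup over pure autoregressive decode is (1 − α^(K+1)) / ((1 − α)(1 + K × c)), where c is the cost of one draft pass relative to one target pass. For α = 0.8 and K = 4 with a draft model 10× cheaper than the target (c = 0.1): 3.36 / 1.4 ≈ 2.4×.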
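The accept-or-resample rule is short enough to write out. The sketch below is a toy version in pure Python that works on explicit probability dictionaries rather than real model outputs; the function names and the two-token vocabulary are assumptions for illustration. It accepts each draft token with probability min(1, p_target / p_draft), resamples a rejected position from the renormalized residual max(0, p_target − p_draft), and samples one bonus token from the target if every draft token is accepted.

```python
import random

def sample(weights, rng=random):
    """Draw one token from a dict of token -> (unnormalized) probability."""
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]

def speculative_step(draft_tokens, p_draft, p_target, rng=random):
    """One verify step. p_draft has K entries, p_target has K + 1 entries
    (the last is the target distribution after all K draft tokens).
    Returns the tokens emitted this step: between 1 and K + 1 of them."""
    emitted = []
    for i, tok in enumerate(draft_tokens):
        pt, pd = p_target[i].get(tok, 0.0), p_draft[i][tok]
        if rng.random() < min(1.0, pt / pd):
            emitted.append(tok)  # accepted: keep the draft token
            continue
        # Rejected: resample from max(0, p_target - p_draft), renormalized by sample().
        residual = {t: max(0.0, p - p_draft[i].get(t, 0.0))
                    for t, p in p_target[i].items()}
        emitted.append(sample(residual, rng))
        return emitted
    # Every draft token accepted: take one bonus token from the target.
    emitted.append(sample(p_target[len(draft_tokens)], rng))
    return emitted

# Empirical check on a deliberately mismatched draft (toy distributions).
rng = random.Random(0)
p_d = {"a": 0.9, "b": 0.1}   # draft strongly prefers "a"
p_t = {"a": 0.3, "b": 0.7}   # target prefers "b"
trials = 20_000
counts = {"a": 0, "b": 0}
for _ in range(trials):
    proposal = sample(p_d, rng)
    first = speculative_step([proposal], [p_d], [p_t, p_t], rng)[0]
    counts[first] += 1
print({t: round(c / trials, 3) for t, c in counts.items()})
```

Even though the draft proposes "a" 90% of the time, the emitted frequencies should land close to the target's 0.3 / 0.7 split: "a" survives with probability 0.9 × (0.3 / 0.9) = 0.3, and every rejection resamples to "b".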

Acceptance rate depends on domain match between draft and target. For Llama 3.2 1B (draft) vs Llama 3.1 8B (target) on general text: acceptance rate ~0.75–0.85, yielding ~2–2.5× speedup. For a more specialized target domain (financial analysis, code), acceptance rate drops to 0.55–0.70 unless the draft model was also trained on similar data. A domain-fine-tuned draft model is a significant engineering investment but recovers the full speedup. For DealLens using a VC-memo-fine-tuned 3B draft with a 70B target: expected acceptance rate 0.70–0.80, expected speedup ~2–2.5×.
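How acceptance rate translates into speedup can be estimated with the expected-token analysis from Leviathan et al. The sketch below assumes each draft token is accepted independently with probability `alpha`, that verification stops at the first rejection, and that each step costs one target pass plus `k` draft passes. The function names and the example values of `k` and `draft_cost` are illustrative, not measurements.

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens emitted per verify step, counting the resampled or bonus token."""
    if alpha >= 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def expected_speedup(alpha, k, draft_cost):
    """Speedup over plain decoding. draft_cost is the cost of one draft pass
    as a fraction of one target pass; a step costs k draft passes + 1 target pass."""
    return expected_tokens_per_step(alpha, k) / (k * draft_cost + 1)

# Sweep the acceptance-rate ranges discussed above at K = 4, draft 10x cheaper.
for alpha in (0.55, 0.70, 0.80, 0.85):
    speedup = expected_speedup(alpha, k=4, draft_cost=0.1)
    print(f"alpha={alpha:.2f}  speedup={speedup:.2f}x")
```

Under these assumptions an acceptance rate of 0.8 gives roughly 2.4×, and the speedup falls off quickly as acceptance drops. That is why a domain-matched draft model matters.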

Medusa and EAGLE: eliminating the draft model

Medusa (Cai et al., arXiv:2401.10774, January 19, 2024) adds K speculative prediction heads to the target model itself. The original LM head still predicts the next token; Medusa head k predicts the token k positions further ahead, and each head is trained with the same objective as the LM head but on shifted targets. At inference, the heads' top predictions are combined into candidate continuations, the target model verifies them in one forward pass using tree attention, and the longest accepted prefix is kept. Acceptance uses either standard rejection sampling, which preserves the target distribution, or Medusa's typical-acceptance rule, which gives up exactness for a higher acceptance rate. No separate draft model is needed. The accuracy of Medusa heads is lower than a strong draft model (acceptance rate ~0.6–0.7 vs ~0.8 for a capable draft), but the implementation cost is small: add K lightweight heads (each a residual feed-forward layer followed by a vocabulary projection) to the existing model and fine-tune them, optionally with the backbone frozen.
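The prefix-selection step can be illustrated for the greedy-decoding case, where a candidate token is accepted only if it equals the target's own top choice at that position. The sketch below is a simplification: a real implementation scores all candidates in a single forward pass with tree attention, whereas here a hypothetical callable `target_greedy_next` is queried per position purely to show the selection logic. The toy target and candidate lists are made up.

```python
def longest_accepted_prefix(context, candidates, target_greedy_next):
    """Return the longest candidate prefix that matches the target's greedy choices.

    context: tokens generated so far.
    candidates: candidate continuations built from the speculative heads' predictions.
    target_greedy_next: hypothetical callable, prefix -> target's top next token.
    """
    best = []
    for cand in candidates:
        accepted = []
        for tok in cand:
            if target_greedy_next(context + accepted) != tok:
                break  # first mismatch ends this candidate
            accepted.append(tok)
        if len(accepted) > len(best):
            best = accepted
    return best

# Toy target that always continues a fixed sentence (stand-in for a real model).
SENTENCE = ["the", "fund", "led", "the", "series", "a", "round"]

def toy_target(prefix):
    return SENTENCE[len(prefix)] if len(prefix) < len(SENTENCE) else "<eos>"

candidates = [
    ["led", "a", "seed"],         # first token matches, second does not
    ["led", "the", "series"],     # all three tokens match
    ["raised", "the", "series"],  # first token already mismatches
]
accepted = longest_accepted_prefix(["the", "fund"], candidates, toy_target)
print(accepted)
```

Proposing several candidates raises the chance that at least one of them agrees with the target for multiple positions, which is how multi-head prediction compensates for each head being individually less accurate than a dedicated draft model.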

EAGLE (Li et al., arXiv:2401.15077, January 26, 2024) and EAGLE-2 (arXiv:2406.16858) take a middle path: a shallow draft model (1–2 layers) that conditions on the target model's internal hidden states rather than generating autoregressively from scratch. This gives the draft model access to the target model's feature representations, enabling higher acceptance rates (~0.85+) than Medusa while being much lighter than a full draft model. EAGLE-2 adds dynamic tree construction — the draft tree branches are selected based on predicted acceptance probabilities per branch, rather than a fixed tree structure.

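Choice guide: the three options (Medusa, EAGLE, separate draft model) trade acceptance rate against implementation and maintenance cost. Medusa is the simplest to add but has the lowest acceptance rate. EAGLE reaches higher acceptance rates but requires training a draft head tied to the target's hidden states. A separate draft model needs no changes to the target and is the most widely supported option, but it requires maintaining a second model version and a matching tokenizer. For DealLens at scale, a vLLM-integrated Llama 3.2 1B draft with Llama 3.1 8B target is the fastest path to 2× speedup. Medusa heads on the 8B model are the cheapest path to 1.5× speedup with no second model.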

Putting it together: the latency budget for agentic systems

An agentic system's response latency is the sum of: (1) prefill latency (dominates for long prompts), (2) decode latency (dominates for long responses or fast-response requirements), (3) tool execution latency, and (4) network round-trip time. For DealLens: 6K-token prompt + 200-token response. With prefix caching, prefill latency ≈ 0 (cache hit). Decode latency at 1× (no spec decoding) at ~500 tokens/sec on one H100 serving Llama 3.1 70B: 200 / 500 = 400 ms. With 2.5× speculative decoding: 160 ms. Adding LangGraph orchestration overhead (~50 ms) and tool calls (~100 ms each) gives a total turn latency of about 310 ms with one tool call (160 + 50 + 100), before network round-trip time. That is acceptable for batch processing and borderline for interactive use once several sequential turns are chained.
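The budget is simple enough to keep as a small calculator, so that changing one assumption (throughput, speedup, number of tool calls) updates the total. This is a sketch using the figures quoted above; the function and parameter names are illustrative, and network round-trip time defaults to zero because the text does not quantify it.

```python
def turn_latency_ms(response_tokens, decode_tokens_per_s, spec_speedup=1.0,
                    prefill_ms=0.0, orchestration_ms=0.0,
                    tool_calls=0, tool_ms=0.0, network_ms=0.0):
    """Sum the latency components of one agent turn, in milliseconds."""
    decode_ms = response_tokens * 1000 / (decode_tokens_per_s * spec_speedup)
    return (prefill_ms + decode_ms + orchestration_ms
            + tool_calls * tool_ms + network_ms)

# DealLens figures from the text: 200-token response at ~500 tokens/sec.
decode_only = turn_latency_ms(200, 500)
decode_spec = turn_latency_ms(200, 500, spec_speedup=2.5)
one_tool_turn = turn_latency_ms(200, 500, spec_speedup=2.5,
                                orchestration_ms=50, tool_calls=1, tool_ms=100)
two_tool_turn = turn_latency_ms(200, 500, spec_speedup=2.5,
                                orchestration_ms=50, tool_calls=2, tool_ms=100)

print(f"decode, no speculation:   {decode_only:.0f} ms")
print(f"decode, 2.5x speculation: {decode_spec:.0f} ms")
print(f"turn with 1 tool call:    {one_tool_turn:.0f} ms")
print(f"turn with 2 tool calls:   {two_tool_turn:.0f} ms")
```

Once decode is down to 160 ms, each additional tool call adds more than half of the decode time again. That is why tool latency, not model latency, often becomes the next thing to optimize in sequential agent loops.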

For the JHU humanoid's planning LLM running on a Jetson AGX Orin: the planning call happens every few seconds (not every 20 ms control step; that is the policy's job). A plan-generation call over a 6K-token context that emits about 200 tokens at 80 tokens/sec takes ~2.5 seconds of decode time (200 / 80), which is acceptable for high-level task planning. Speculative decoding on Orin requires a draft model that fits alongside the planning LLM; at typical Orin memory constraints this requires heavy quantization of both models, which degrades acceptance rate. Prefix caching of the robot's home-environment description is the higher-ROI optimization.

KV quantization on Orin

INT8 KV-cache on Jetson AGX Orin halves the context memory requirement, allowing a Llama 3.2 3B planning LLM to hold 8K-token contexts in the 64 GB unified memory alongside the SmolVLA policy weights. FP8 KV-cache is not yet supported on Orin's Ampere GPU.

[Figure: Latency Decomposition: Prefix Cache + Speculative Decoding. Upper panel: three latency rows (prefill ~500 ms; decode 400 ms or 160 ms). Lower panel: draft → verify loop with acceptance rule p_accept = min(1, p_t/p_d).]
Figure 19.1 Top three rows: latency decomposition for a 6K-token-prompt + 200-token-response DealLens request. Without caching: 900 ms. With prefix caching: 400 ms (prefill eliminated). With caching + speculative decoding (2.5×): 160 ms. Bottom: the speculative decoding loop, in which the draft proposes K tokens, the target verifies them in one forward pass, and rejection sampling preserves the exact target output distribution.
Retrieve before you continue

Three questions on what you just read

Q1 (Factual): What is the KV-cache memory formula, and what is the size per request for Llama 3.1 70B at seq=8192 in BF16?
Q2 (Conceptual): Why does speculative decoding with rejection sampling produce the exact same output distribution as pure autoregressive sampling from the target model?
Q3 (Synthetic): For DealLens with Llama 3.1 70B and a 2.5× speculative decoding speedup, what is the approximate per-deal end-to-end latency if prefix caching handles the 6K-token system prompt?