From Tokens to Embodied Minds · Drill cards · Chapter 18
Drills
llm-d and disaggregated inference
10 atomic recall cards. Export to Anki and let spaced repetition do its slow work.
In Anki: File → Import, choose this TSV, set field separator to Tab, deck = Tokens to Embodied Minds · Ch 18, note type = Basic.
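
For readers who want to see what the import expects, here is a minimal sketch of writing Front/Back pairs to tab-separated text. The `cards_to_tsv` helper and the sample card are illustrative assumptions, not the tool that produced this page's TSV.

```python
import csv
import io

# Hypothetical sample; in practice this would hold all cards from the table.
SAMPLE_CARDS = [
    ("What infrastructure model does llm-d use for deployment?",
     "Kubernetes-native"),
]

def cards_to_tsv(cards):
    """Serialize (front, back) pairs as tab-separated lines, one card per line."""
    buf = io.StringIO()
    # csv.writer quotes a field only if it contains a tab, quote, or newline.
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    for front, back in cards:
        writer.writerow([front, back])
    return buf.getvalue()

if __name__ == "__main__":
    print(cards_to_tsv(SAMPLE_CARDS), end="")
```
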
| Front | Back |
|---|---|
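| Why is prefill compute-bound and decode memory-bound? | Prefill processes N tokens in parallel, so each weight byte read is reused across N tokens (high FLOPs per weight byte). Decode generates one token at a time: about 2 FLOPs per parameter, or roughly 1–2 FLOPs per weight byte at 8-bit or 16-bit precision, far below the H100 ridge point of a few hundred FLOPs/byte. |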
| What are the three core components of llm-d? | Prefill pods (compute-optimized), decode pods (bandwidth-optimized), and a KV-cache-aware gateway router. |
| Who announced llm-d and when? | Red Hat, with founding contributors CoreWeave, Google Cloud, IBM Research, and NVIDIA, at Red Hat Summit on May 20, 2025. |
| What is the typical throughput improvement from disaggregated inference on long-prompt workloads? | Approximately 25%. |
| How does KV-cache-aware routing differ from round-robin routing? | It routes requests to decode pods that already hold the request's prefix KV states, enabling cache reuse and eliminating redundant KV transfer. |
| Why does llm-d handle MoE expert parallelism differently for prefill vs decode? | Prefill activates experts evenly across many tokens; decode at batch size 1 activates experts unevenly. Per-phase EP configuration allows load balancing that monolithic serving cannot provide. |
| What infrastructure model does llm-d use for deployment? | Kubernetes-native — a Kubernetes operator that manages prefill and decode pod counts, autoscales each independently, and handles KV transfer via pod sidecars. |
| For a 30:1 prefill/decode token ratio, roughly how many decode pods can one prefill pod feed? | 5–10 decode pods, depending on response length and KV transfer speed. |
| What is the KV transfer mechanism between prefill and decode pods in llm-d? | A sidecar network buffer manages KV state transfer via NVLink (within node) or RoCE/InfiniBand (cross-node). |
| What scaling metric triggers prefill pod autoscaling vs decode pod autoscaling in llm-d? | Prefill scales on queue depth (CPU-measured prefill backlog). Decode scales on time-in-decode (long session duration indicates decode pod saturation). |
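
The roofline argument behind the first card can be checked with rough numbers. The sketch below counts only weight traffic (2 FLOPs per parameter per token, ignoring KV-cache and activation reads) and compares the result to an H100 ridge point. The peak-throughput and bandwidth constants are approximate published figures for the H100 SXM, treated here as assumptions, and the function names are hypothetical.

```python
# Assumed, approximate H100 SXM figures (dense BF16 peak and HBM3 bandwidth).
H100_PEAK_FLOPS = 989e12       # FLOPs per second
H100_HBM_BANDWIDTH = 3.35e12   # bytes per second
RIDGE_POINT = H100_PEAK_FLOPS / H100_HBM_BANDWIDTH  # FLOPs per byte

def arithmetic_intensity(tokens_per_pass, bytes_per_param):
    """FLOPs per weight byte when one read of the weights serves
    `tokens_per_pass` tokens (2 FLOPs per parameter per token)."""
    return 2.0 * tokens_per_pass / bytes_per_param

def regime(intensity, ridge=RIDGE_POINT):
    """Classify a workload against the roofline ridge point."""
    return "compute-bound" if intensity >= ridge else "memory-bound"

# Decode: one token per pass, 16-bit weights (2 bytes per parameter).
decode_ai = arithmetic_intensity(tokens_per_pass=1, bytes_per_param=2.0)
# Prefill: a 2048-token prompt processed in one pass, same weights.
prefill_ai = arithmetic_intensity(tokens_per_pass=2048, bytes_per_param=2.0)

if __name__ == "__main__":
    print(f"ridge point: {RIDGE_POINT:.0f} FLOPs/byte")
    print(f"decode:  {decode_ai:.0f} FLOPs/byte, {regime(decode_ai)}")
    print(f"prefill: {prefill_ai:.0f} FLOPs/byte, {regime(prefill_ai)}")
```

Under these assumptions, batching during decode raises intensity in proportion to batch size, which is one reason decode pods are tuned for bandwidth and large batches while prefill pods are tuned for compute.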