Chapter 18 · llm-d and disaggregated inference — From Tokens to Embodied Minds

Monolithic LLM serving rests on a category error: prefill (processing the prompt; compute-bound and parallelizable across tokens) and decode (generating tokens one at a time; memory-bandwidth-bound and inherently sequential) share a GPU but have opposite resource profiles. Running them together means the GPU's compute sits largely idle during decode and its memory bandwidth is under-used during prefill. Major cloud providers had begun disaggregating these phases in their internal serving infrastructure by 2024. llm-d (Red Hat, with Google, IBM, and NVIDIA, announced November 21, 2025) is the first Kubernetes-native open-source implementation: prefill and decode run as separate pod types, with a KV-cache-aware router that sends each request to the right tier. The immediate throughput lift is real: disaggregated inference typically delivers roughly a 25% throughput improvement on long-prompt workloads, because prefill pods can sustain larger batch sizes (higher arithmetic intensity, better tensor core utilization) while decode pods can be provisioned for memory bandwidth rather than compute. For workloads like DealLens with 6K-token prompts and 200-token responses, the prefill phase dominates compute cost, so a single prefill pod serving multiple decode pods is the right configuration.

Why the split matters computationally

Prefill processes the full prompt in one (or a few) forward passes. At prompt length N, the prefill attention computation scales as O(N²) FLOPs — compute-bound above a few hundred tokens. The prefill pod benefits from large batch accumulation, full tensor core utilization at BF16, and FlashAttention-3's async pipelines on Hopper. The optimal GPU configuration for prefill is maximum compute with sufficient HBM for model weights.

Decode generates one token per forward pass. Each step reads all model weights (for Llama 3.1 70B in BF16: 140 GB) and performs about 2 FLOPs per parameter, which is roughly 1 FLOP per weight byte at batch size 1. That arithmetic intensity is far below the H100 roofline ridge of about 295 FLOPs/byte, so the decode pod is bandwidth-limited. Increasing the batch size raises arithmetic intensity, because each weight read is shared across more sequences, but only until KV-cache capacity and bandwidth become the limit; beyond that, adding more decode pods (not faster GPUs) is the scaling path. The optimal GPU configuration for decode is maximum HBM bandwidth or, for long-running decode sessions, multiple smaller GPUs with aggregated bandwidth.
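The roofline argument can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes H100 SXM datasheet figures (about 989 TFLOPS dense BF16 and 3.35 TB/s of HBM bandwidth), counts weight traffic only, and ignores KV-cache reads, so it overstates how far batching alone can go.

```python
# Assumed H100 SXM datasheet figures; substitute your own hardware numbers.
PEAK_BF16_FLOPS = 989e12      # dense BF16 FLOPs per second
HBM_BANDWIDTH = 3.35e12       # bytes per second
BYTES_PER_PARAM = 2           # BF16 weights
FLOPS_PER_PARAM_PER_TOKEN = 2 # one multiply and one add per weight


def ridge_point(peak_flops=PEAK_BF16_FLOPS, bandwidth=HBM_BANDWIDTH):
    """Arithmetic intensity (FLOPs/byte) at which compute and bandwidth balance."""
    return peak_flops / bandwidth


def decode_intensity(batch_size):
    """FLOPs per weight byte for one decode step, counting weight reads only."""
    return FLOPS_PER_PARAM_PER_TOKEN * batch_size / BYTES_PER_PARAM


def is_bandwidth_bound(batch_size):
    """True while the decode step sits below the roofline ridge."""
    return decode_intensity(batch_size) < ridge_point()


for batch in (1, 16, 64, 256, 512):
    label = "bandwidth-bound" if is_bandwidth_bound(batch) else "compute-bound"
    print(f"batch={batch:4d} intensity={decode_intensity(batch):6.1f} FLOPs/byte -> {label}")
```

Under these assumptions the weights-only crossover sits near a batch size of 295. In practice KV-cache traffic and memory capacity usually cap the usable batch well before that point.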

In a monolithic serving pod, the GPU scheduler alternates between running prefill batches (compute-bound, high tensor core utilization) and decode batches (memory-bound, low tensor core utilization). Neither workload ever sees the GPU in its optimal state for a sustained period. Continuous batching in vLLM mitigates this somewhat by interleaving decode steps with new prefill requests, but it does not solve the fundamental hardware mismatch.

llm-d architecture: prefill pods, decode pods, routing

llm-d (llm-d.ai) is built on Kubernetes with three components: prefill pods (compute-optimized, handle the prefill phase), decode pods (bandwidth-optimized, handle auto-regressive generation), and a KV-cache-aware gateway router. When a request arrives: the router sends the full prompt to a prefill pod; the prefill pod processes the prompt and generates KV states; the KV states are transferred to a decode pod (via NVLink if on the same node, via RoCE or InfiniBand across nodes); the decode pod continues auto-regressive generation until the response is complete.
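The cost of the KV hand-off in this flow can be estimated from the model shape. The sketch below assumes Llama 3.1 70B's grouped-query attention layout (80 layers, 8 KV heads, head dimension 128) with a BF16 cache; the 50 GB/s link speed is a hypothetical figure chosen for illustration, not a measurement of any llm-d deployment.

```python
# Llama 3.1 70B attention shape (grouped-query attention), BF16 KV cache.
NUM_LAYERS = 80
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # BF16


def kv_bytes_per_token():
    """Bytes of KV cache produced per prompt token."""
    # Factor of 2: one key vector and one value vector per layer per KV head.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def kv_bytes(prompt_tokens):
    """Total KV cache size for a prompt of the given length."""
    return prompt_tokens * kv_bytes_per_token()


def transfer_seconds(prompt_tokens, link_bytes_per_sec):
    """Idealized transfer time, ignoring protocol overhead and overlap with compute."""
    return kv_bytes(prompt_tokens) / link_bytes_per_sec


ASSUMED_LINK_BYTES_PER_SEC = 50e9  # hypothetical cross-node link, about 400 Gb/s

prompt = 6_000
print(f"KV per token: {kv_bytes_per_token() / 1024:.0f} KiB")
print(f"KV for {prompt} tokens: {kv_bytes(prompt) / 1e9:.2f} GB")
print(f"Transfer time: {transfer_seconds(prompt, ASSUMED_LINK_BYTES_PER_SEC) * 1e3:.1f} ms")
```

For a 6K-token prompt this comes to roughly 2 GB of KV state per request, which is why the interconnect between prefill and decode pods matters as much as the GPUs themselves.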

KV-cache-aware routing is the key innovation beyond simple round-robin: the router tracks which decode pods hold which prefix KV states. For a request that begins with a system prompt already cached on decode pod 3, the router directs it to pod 3 — reusing the cached KV rather than re-transferring. This is disaggregated serving's analog of SGLang's RadixAttention: prefix reuse at the cluster level rather than the process level. The combination of disaggregation + prefix-aware routing is what produces the cumulative throughput gains above the 25% baseline.
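The routing idea can be sketched in a few dozen lines. This is not llm-d's router: the class name, the block size, and the tie-breaking rule are all hypothetical, and cache eviction is omitted. The sketch hashes the prompt in fixed-size blocks so that each key identifies the whole prefix up to that block, then sends a request to the pod holding the longest matching prefix.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per cache block (hypothetical; real systems use larger blocks)


def prefix_block_keys(tokens, block_size=BLOCK_SIZE):
    """Return one chained hash per full block; key i covers tokens[0:(i+1)*block_size]."""
    keys = []
    running = hashlib.sha256()
    full = len(tokens) - len(tokens) % block_size
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        running.update(",".join(map(str, block)).encode() + b";")
        keys.append(running.hexdigest())
    return keys


class PrefixAwareRouter:
    """Toy router: longest cached prefix wins, ties go to the least-loaded pod."""

    def __init__(self, pod_names):
        self.cached = {name: set() for name in pod_names}
        self.in_flight = {name: 0 for name in pod_names}

    def matched_blocks(self, pod, keys):
        """Count leading blocks of the request already cached on the pod."""
        count = 0
        for key in keys:
            if key not in self.cached[pod]:
                break
            count += 1
        return count

    def route(self, tokens):
        keys = prefix_block_keys(tokens)
        best = max(
            self.cached,
            key=lambda pod: (self.matched_blocks(pod, keys), -self.in_flight[pod]),
        )
        self.cached[best].update(keys)  # the pod now holds this prefix
        self.in_flight[best] += 1
        return best

    def finish(self, pod):
        """Mark one request on the pod as complete."""
        self.in_flight[pod] -= 1


router = PrefixAwareRouter(["decode-0", "decode-1"])
system_prompt = list(range(8))  # stand-in token IDs for a shared system prompt
print(router.route(system_prompt + [100, 101, 102, 103]))
print(router.route([50, 51, 52, 53, 54, 55, 56, 57]))
print(router.route(system_prompt + [200, 201, 202, 203]))
```

In the usage example the third request shares the system prompt with the first, so it returns to the same pod even though load is equal across pods at that point. A production router would also need eviction tracking and a cap on how much load a popular prefix can attract.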

Expert parallelism for MoE (DeepSeek-V3, Qwen3-MoE) is a first-class concept in llm-d: prefill pods and decode pods can each be configured with different expert-to-GPU assignments, because the expert utilization pattern differs between the phases (prefill processes many tokens and activates experts evenly; decode at batch size 1 activates experts unevenly). This allows per-phase load balancing that monolithic serving cannot provide.

llm-d for DealLens at scale

DealLens at 10,000 deals per day with 6K-token prompts and 200-token responses: the prefill-to-decode token ratio is 30:1. In a monolithic serving pod, each request carries about 30× more prompt tokens than generated tokens, so prefill dominates the compute, but decode remains the latency bottleneck because it is sequential, one token at a time. Disaggregation allows 1 prefill pod to serve several decode pods at this ratio (4–5 in the configuration used in this chapter): the prefill pod generates KV states quickly (compute-bound, fast) and hands them off, then immediately processes the next request's prompt. The decode pods generate 200-token responses in parallel across separate requests.

Estimated cost comparison: monolithic vLLM with Llama 3.1 70B on 4 H100s processing 10K deals/day at ~8K tokens/deal = 80M tokens/day. At ~3,000 tokens/second/H100 in monolithic serving, that is ~26.7K GPU-seconds, or about 7.4 GPU-hours (roughly 6,667 seconds, or 1.85 hours, of wall-clock time across 4 H100s). At $2/GPU-hour, 7.4 GPU-hours costs about $14.81/day. With llm-d's 25% throughput improvement, the same work takes 7.4 / 1.25 ≈ 5.9 GPU-hours, or about $11.85/day. Multiply this by analyst scale and the savings compound. The llm-d benefit is proportionally larger for workloads with longer prompts or requiring real-time response at high concurrency.
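Worked through in GPU-hours, the estimate looks like this. The sketch uses only the chapter's own inputs (10K deals/day, ~8K tokens/deal, ~3,000 tokens/second/H100, $2/GPU-hour, a 25% throughput gain) and assumes you pay only for the time the GPUs are busy; an always-on cluster would be billed for all 24 hours regardless.

```python
DEALS_PER_DAY = 10_000
TOKENS_PER_DEAL = 8_000
TOKENS_PER_SEC_PER_GPU = 3_000
NUM_GPUS = 4
PRICE_PER_GPU_HOUR = 2.0
DISAGGREGATION_SPEEDUP = 1.25  # the chapter's 25% throughput improvement


def gpu_hours(throughput_multiplier=1.0):
    """GPU-hours needed to process one day of traffic."""
    total_tokens = DEALS_PER_DAY * TOKENS_PER_DEAL
    gpu_seconds = total_tokens / (TOKENS_PER_SEC_PER_GPU * throughput_multiplier)
    return gpu_seconds / 3600


def daily_cost(throughput_multiplier=1.0):
    """Daily cost in dollars, billing busy GPU time only."""
    return gpu_hours(throughput_multiplier) * PRICE_PER_GPU_HOUR


def wall_clock_hours(throughput_multiplier=1.0):
    """Elapsed hours when the work is spread evenly across all GPUs."""
    return gpu_hours(throughput_multiplier) / NUM_GPUS


print(f"monolithic: {gpu_hours():.2f} GPU-h, ${daily_cost():.2f}/day")
print(f"llm-d:      {gpu_hours(DISAGGREGATION_SPEEDUP):.2f} GPU-h, "
      f"${daily_cost(DISAGGREGATION_SPEEDUP):.2f}/day")
```

Note that a 25% throughput gain divides cost by 1.25, which is a 20% saving, not 25%.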

Deployment: Kubernetes-native from the start

llm-d is designed as a Kubernetes operator from day one — unlike vLLM or SGLang, which were originally single-process serving frameworks with Kubernetes deployment as an afterthought. The llm-d operator manages prefill and decode pod counts, autoscales them independently based on queue depth, and handles KV transfer between pods via a sidecar that manages the network buffer. This means prefill pod count can autoscale on CPU-measured queue length (prefill backlog) while decode pod count autoscales on time-in-decode (long decode sessions drive up decode pod demand).
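Independent scaling of the two tiers can be illustrated with the proportional rule that the Kubernetes Horizontal Pod Autoscaler applies to a per-pod metric. The sketch below is not llm-d's autoscaler: the metric names (queued prompts per prefill pod, average seconds in decode per decode pod), the targets, and the replica bounds are all hypothetical, and HPA's tolerance band and stabilization window are left out.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=16):
    """Proportional scaling rule: ceil(current * metric / target), then clamp."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical observations: each tier is scaled on its own signal.
prefill = desired_replicas(
    current_replicas=1,
    current_metric=24,   # queued prompts per prefill pod (assumed metric)
    target_metric=8,     # target backlog per pod (assumed)
)
decode = desired_replicas(
    current_replicas=4,
    current_metric=30,   # average seconds in decode per pod (assumed metric)
    target_metric=60,    # target seconds in decode per pod (assumed)
)
print(f"prefill pods: {prefill}, decode pods: {decode}")
```

Because each tier reads a different signal, a burst of long prompts can grow the prefill tier while the decode tier shrinks, which a single monolithic deployment cannot express.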

For a DealLens deployment on a small cloud Kubernetes cluster: 1 prefill pod (2× H100 SXM5) handles the 6K-token prompt encoding; 4 decode pods (1× H100 each) handle the 200-token response generation in parallel. The router distributes incoming deals across decode pods based on KV cache residency. The llm-d GitHub (github.com/llm-d/llm-d) includes Kind and Helm deployment templates. As of the November 2025 announcement, it is in beta: stable enough for load testing, with production readiness projected for mid-2026.

The disaggregation paper

Zhong et al. (arXiv:2401.09670, January 2024), the DistServe paper, documented the theoretical case for disaggregation. llm-d is the Kubernetes-native implementation that Red Hat, Google, IBM, and NVIDIA shipped in 2025. Read the paper, then read the llm-d announcement: the gap between theory and production engineering is the chapter.

Figure 18.1. llm-d architecture: a KV-cache-aware router sends prompts to compute-optimized prefill pods, which transfer KV states to bandwidth-optimized decode pods. At a 30:1 prompt/response token ratio (DealLens use case), one prefill pod feeds 4–5 decode pods. The router's prefix cache awareness routes requests to decode pods that already hold the relevant KV state, eliminating redundant transfers.

Retrieve before you continue

Three questions on what you just read

Q1 Factual What is the typical throughput improvement from prefill/decode disaggregation on long-prompt workloads, and who announced llm-d?

Q2 Conceptual What does KV-cache-aware routing add to disaggregated serving beyond simple round-robin request distribution?

Q3 Synthetic For DealLens at 10K deals/day with 6K-token prompts and 200-token responses, what pod configuration does llm-d enable and what is the approximate cost benefit?