By 2026, the choice of inference serving framework is a cost and latency decision, measured in dollars per million tokens and milliseconds of time to first token (TTFT), not a philosophical one. Three frameworks dominate production: vLLM (Kwon et al., arXiv:2309.06180, September 12, 2023), with PagedAttention and continuous batching, as the open-source default; TensorRT-LLM (NVIDIA) as the fastest-on-hardware option for Hopper/Blackwell, with the steepest deployment cost; and SGLang (Zheng et al., arXiv:2312.07104, December 12, 2023), with RadixAttention, as the purpose-built agentic serving stack. You need to know all three, deploy one, and be honest about trade-offs. The key differentiator is not throughput on a single request, where all three are within roughly 20% of each other. It is the workload class. For random single requests with no shared context, vLLM's mature ecosystem and broad model support win. For workloads dominated by a shared long system prompt, such as DealLens calling the same VC screening system prompt 10,000 times per day, SGLang's RadixAttention prefix cache turns that repeated prefill compute into a cache hit, reducing effective TTFT from roughly 400 ms to 30 ms and prompt-processing cost by roughly 10×. For maximum raw throughput on NVIDIA hardware with FP8, TensorRT-LLM wins, if you can tolerate its deployment complexity and NVIDIA dependency.
vLLM and PagedAttention
The core insight of vLLM (Kwon et al., arXiv:2309.06180): KV cache allocation is the largest source of GPU memory waste in LLM serving. Before PagedAttention, serving systems pre-allocated a contiguous KV cache region for each request, sized for the maximum possible sequence length. Most requests finish far short of that maximum, so most of the reservation sat unused: Kwon et al. measured that existing systems used only about 20–38% of KV cache memory for actual token states, wasting the remaining 60–80% of that HBM. PagedAttention applies the OS virtual memory model: the KV cache is allocated in fixed-size blocks (typically 16 tokens), each request keeps a block table (the analogue of a page table), and non-contiguous physical blocks are stitched into one logical sequence through that table. Waste drops from 60–80% to under 4%, confined to the partially filled last block of each sequence. Continuous batching, which processes tokens from different requests in the same forward pass as they arrive rather than waiting for a batch to fill, can raise GPU utilization from 30–40% to 80–90%.
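The block-table idea above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not vLLM's implementation: the class names `BlockAllocator` and `Sequence`, the pool size, and the waste helpers are hypothetical, and only the bookkeeping is modeled (no tensors, no attention kernel).

```python
BLOCK_SIZE = 16  # tokens per KV block, the typical size mentioned above


class BlockAllocator:
    """Hands out fixed-size physical KV blocks from a shared free pool (hypothetical)."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()

    def release(self, block_id: int) -> None:
        self.free.append(block_id)


class Sequence:
    """Tracks one request's logical-to-physical block mapping."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table = []  # index = logical block, value = physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is requested only when the previous one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def physical_slot(self, pos: int):
        """Translate a token position into (physical block id, offset in block)."""
        return self.block_table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def wasted_slots(self) -> int:
        # Only the tail of the last block can be unused.
        return len(self.block_table) * BLOCK_SIZE - self.num_tokens


def contiguous_waste(max_len: int, actual_len: int) -> int:
    """Unused slots when a request reserves max_len contiguous slots up front."""
    return max_len - actual_len


if __name__ == "__main__":
    alloc = BlockAllocator(num_blocks=1024)
    a, b = Sequence(alloc), Sequence(alloc)
    for _ in range(40):  # interleave two requests so their blocks are non-contiguous
        a.append_token()
        b.append_token()
    print(a.block_table, a.wasted_slots(), contiguous_waste(4096, 40))
```

With two interleaved requests, each sequence ends up with physical blocks that are not adjacent, yet the per-sequence waste is bounded by one block, whereas a contiguous 4K reservation for a 40-token request leaves almost all of its slots idle.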
vLLM's production advantages: the broadest model support (Llama, Mistral, Qwen, DeepSeek, Gemma, Phi, Falcon), quantization support (AWQ, GPTQ, FP8, INT8), speculative decoding integration, and a large open-source community with rapid updates. vLLM's disadvantage: it is not the fastest on any specific hardware, and its prefix caching was added post-hoc rather than designed in. As of 2026, vLLM supports automatic prefix caching by hashing KV cache blocks, so reuse happens at block granularity rather than through a token-level tree, and the implementation is less aggressive than SGLang's tree structure.
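To make hash-based prefix caching concrete, here is a minimal sketch of the general technique: each full block of prompt tokens is hashed together with the hash of the block before it, so a block's hash identifies the entire prefix up to that point. This is an illustration only and does not reproduce vLLM's code; `block_hashes`, `HashPrefixCache`, and the use of SHA-256 are assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV block


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Return one chained hash per full block; a trailing partial block is skipped."""
    hashes = []
    parent = b""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(payload).digest()  # depends on all earlier blocks
        hashes.append(parent)
    return hashes


class HashPrefixCache:
    """Maps chained block hashes to (hypothetical) physical block ids."""

    def __init__(self):
        self.blocks = {}

    def lookup_and_insert(self, token_ids):
        """Return (cached_tokens, tokens_to_compute), then cache the new blocks."""
        hashes = block_hashes(token_ids)
        hit = 0
        while hit < len(hashes) and hashes[hit] in self.blocks:
            hit += 1
        for h in hashes[hit:]:
            self.blocks[h] = len(self.blocks)
        cached = hit * BLOCK_SIZE
        return cached, len(token_ids) - cached


if __name__ == "__main__":
    cache = HashPrefixCache()
    system_prompt = list(range(64))  # stand-in token ids for a shared prompt
    print(cache.lookup_and_insert(system_prompt + list(range(1000, 1020))))
    print(cache.lookup_and_insert(system_prompt + list(range(2000, 2020))))
```

One consequence of block granularity shows up directly in the sketch: a shared prefix whose length is not a multiple of the block size is reused only up to the last full block, and the remainder is recomputed.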
SGLang and RadixAttention
SGLang (Zheng et al., arXiv:2312.07104, December 2023) was designed for agentic multi-call workloads from the start. RadixAttention represents the KV cache as a radix tree: each edge is labeled with a sequence of tokens, each node stands for the prefix spelled out by the path from the root, and any request that shares a prefix with an existing cached sequence walks the tree and reuses those KV states at near-zero cost. For a DealLens workload where every request begins with the same 6,000-token system prompt (company description, scoring rubric, few-shot examples), RadixAttention computes those KV states once, caches them as a shared branch directly under the root, and serves all subsequent requests by reading from cache and computing only the per-deal portion. The prefix hit rate approaches 100%, reducing effective prefill computation per request to the unique portion only.
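The prefix matching described above can be sketched as a small radix tree over token ids. This is a simplified illustration, not SGLang's implementation: `RadixNode` and `RadixCache` are hypothetical names, nodes hold no KV tensors, and eviction and reference counting are left out. It shows only how a shared prefix becomes a single shared edge that later requests match against.

```python
class RadixNode:
    def __init__(self, tokens=()):
        self.tokens = tuple(tokens)  # label on the edge leading into this node
        self.children = {}           # first token of a child edge -> RadixNode


def _common_len(a, b):
    """Length of the common prefix of two token tuples."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


class RadixCache:
    def __init__(self):
        self.root = RadixNode()

    def match_prefix(self, tokens):
        """Number of leading tokens whose KV states would already be cached."""
        tokens = tuple(tokens)
        node, matched = self.root, 0
        while matched < len(tokens):
            child = node.children.get(tokens[matched])
            if child is None:
                break
            k = _common_len(child.tokens, tokens[matched:])
            matched += k
            if k < len(child.tokens):  # diverged in the middle of an edge
                break
            node = child
        return matched

    def insert(self, tokens):
        """Add a token sequence, splitting an edge where sequences diverge."""
        tokens = tuple(tokens)
        node, i = self.root, 0
        while i < len(tokens):
            child = node.children.get(tokens[i])
            if child is None:
                node.children[tokens[i]] = RadixNode(tokens[i:])
                return
            k = _common_len(child.tokens, tokens[i:])
            if k < len(child.tokens):
                # Split: the shared part becomes a new intermediate node.
                mid = RadixNode(child.tokens[:k])
                child.tokens = child.tokens[k:]
                mid.children[child.tokens[0]] = child
                node.children[tokens[i]] = mid
                child = mid
            node = child
            i += k


if __name__ == "__main__":
    cache = RadixCache()
    system_prompt = tuple(range(6000))         # stand-in for the shared prompt
    deal_a = system_prompt + (9001, 9002, 9003)
    deal_b = system_prompt + (9001, 7777)
    cache.insert(deal_a)
    print(cache.match_prefix(deal_b))          # tokens that need no prefill
```

After the first deal is inserted, a second deal that shares the system prompt matches every shared token, so only its unique tail would need prefill; inserting it splits the edge at the divergence point, leaving the shared prefix stored once.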
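SGLang pairs the radix tree with an eviction policy and a cache-aware scheduler. When GPU memory is under pressure, it evicts tree leaves (less-shared prefixes) before the nodes near the root (most-shared prefixes), choosing among leaves in least-recently-used order and skipping nodes that running requests still reference. Because a shared prefix only becomes a leaf after everything below it has been evicted, this LRU-on-leaves policy retains the cache entries that benefit the most requests. The scheduler complements it by prioritizing waiting requests with the longest cached prefix. The result for DealLens: at 10,000 deals per day with a 6K-token shared prefix, the difference between SGLang prefix caching and uncached vLLM is roughly 10× in prefill token compute when the per-deal portion is a few hundred tokens, and that translates directly to cost. Decode cost is unaffected. For the JHU humanoid's planning LLM, which calls the same environment-description system prompt across thousands of simulated episodes, the same principle applies.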
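The leaves-before-roots eviction order can be sketched as follows. This is a hedged illustration of LRU eviction restricted to leaf nodes, not SGLang's code: the `Node` fields, the `evict` function, and the logical timestamps are assumptions for the example, and `ref_count` stands in for whatever mechanism pins nodes used by in-flight requests.

```python
import heapq
from itertools import count


class Node:
    def __init__(self, name, num_tokens, parent=None, last_access=0):
        self.name = name
        self.num_tokens = num_tokens    # cached tokens stored at this node
        self.parent = parent
        self.children = []
        self.last_access = last_access  # logical timestamp of last use
        self.ref_count = 0              # > 0 while a running request uses the node
        if parent is not None:
            parent.children.append(self)


def collect_leaves(root):
    """All nodes without children, excluding the root itself."""
    stack, leaves = [root], []
    while stack:
        node = stack.pop()
        if node.children:
            stack.extend(node.children)
        elif node.parent is not None:
            leaves.append(node)
    return leaves


def evict(root, tokens_needed):
    """Evict least-recently-used leaves until enough tokens are freed."""
    tie = count()  # tie-breaker so the heap never compares Node objects
    heap = [(n.last_access, next(tie), n) for n in collect_leaves(root)]
    heapq.heapify(heap)
    freed, evicted = 0, []
    while heap and freed < tokens_needed:
        _, _, node = heapq.heappop(heap)
        if node.ref_count > 0:
            continue  # pinned by an in-flight request
        parent = node.parent
        parent.children.remove(node)
        freed += node.num_tokens
        evicted.append(node.name)
        # A parent becomes evictable only once all of its children are gone.
        if parent.parent is not None and not parent.children:
            heapq.heappush(heap, (parent.last_access, next(tie), parent))
    return freed, evicted


if __name__ == "__main__":
    root = Node("root", 0)
    system = Node("system_prompt", 6000, root, last_access=100)
    Node("deal_a", 500, system, last_access=10)
    Node("deal_b", 500, system, last_access=20)
    Node("deal_c", 500, system, last_access=30)
    print(evict(root, 800))
```

In this example the two oldest per-deal leaves are evicted first and the 6,000-token shared prefix survives, because it cannot become a leaf while any deal-specific suffix still hangs below it.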
SGLang's trade-offs: narrower model support than vLLM, less mature quantization pipeline, and the radix tree adds memory management complexity. For a diverse model zoo, vLLM is safer. For a single-model production deployment with high prefix reuse, SGLang consistently wins on both latency and cost.
TensorRT-LLM: maximum throughput on NVIDIA hardware
TensorRT-LLM (NVIDIA) is the hardware-optimized inference stack built on TensorRT's graph capture and kernel fusion infrastructure. The library itself is open source, but it sits on NVIDIA's proprietary TensorRT runtime and runs only on NVIDIA GPUs. It supports FP8 natively on Hopper and Blackwell, with custom CUDA kernels for the major attention variants, MoE routing, and quantized matmul. On H100 SXM5, TensorRT-LLM typically delivers 10–30% higher throughput than vLLM for the same model and batch configuration, largely because of its fused kernels and architecture-specific tuning. The Triton Inference Server provides the deployment wrapper.
The cost of TensorRT-LLM: vendor lock-in (NVIDIA only), a complex build pipeline (model checkpoint → TensorRT-LLM checkpoint conversion → trtllm-build → TensorRT engine, which is tied to a GPU architecture and build configuration and can take hours for large models), limited community support relative to vLLM, and aggressive version churn. For a team with a stable model on fixed hardware, it is the right choice. For a team that changes models frequently, iterates on serving code, or runs on non-NVIDIA hardware, it is a trap. For DealLens in production on a fixed NVIDIA deployment, TensorRT-LLM deserves a benchmark comparison against SGLang before committing.
Framework selection guide
Choose vLLM when: you need broad model support, your workload has diverse request prefixes (no prefix reuse), you are prototyping or frequently changing models, or you need speculative decoding with a draft model. Choose SGLang when: your workload has a dominant shared prefix (agentic multi-call, RAG with a fixed system prompt), you are serving a single model at scale, and cost per call is the primary optimization target. Choose TensorRT-LLM when: maximum throughput is required on a fixed NVIDIA hardware deployment, you have budget for the compilation overhead, and you have 3–6 months between model updates.
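The guide above can be encoded as a first-pass triage function. This is a sketch under loudly stated assumptions: the `Workload` fields, the 0.5 prefix-reuse threshold, and the one-month cutoff for "frequently changing models" are hypothetical values chosen for illustration. Only the three-month update interval comes from the guide itself. It does not replace benchmarking on the real workload.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    prefix_reuse: float                  # fraction of prompt tokens shared across requests (0..1)
    nvidia_only: bool                    # fixed NVIDIA hardware deployment
    months_between_model_updates: float
    needs_broad_model_support: bool
    max_throughput_priority: bool        # raw throughput matters more than flexibility


def choose_framework(w: Workload,
                     reuse_threshold: float = 0.5,    # hypothetical cutoff
                     churn_months: float = 1.0,       # hypothetical cutoff
                     stable_months: float = 3.0) -> str:
    # Broad model coverage or frequent model changes: stay on vLLM.
    if w.needs_broad_model_support or w.months_between_model_updates < churn_months:
        return "vLLM"
    # A dominant shared prefix is the case RadixAttention targets.
    if w.prefix_reuse >= reuse_threshold:
        return "SGLang"
    # Fixed NVIDIA hardware, stable model, throughput-bound: pay the build cost.
    if (w.nvidia_only and w.max_throughput_priority
            and w.months_between_model_updates >= stable_months):
        return "TensorRT-LLM"
    return "vLLM"


if __name__ == "__main__":
    deal_screening = Workload(prefix_reuse=0.9, nvidia_only=True,
                              months_between_model_updates=6,
                              needs_broad_model_support=False,
                              max_throughput_priority=False)
    print(choose_framework(deal_screening))
```

The ordering of the checks is itself a design choice: flexibility constraints are treated as hard requirements, prefix reuse is checked next because it changes cost by the largest factor, and TensorRT-LLM is selected only when the stability conditions for its build pipeline hold.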
For DealLens, SGLang is the correct choice for the deal-screening loop. The screening prompt is fixed per GP, so the prefix hit rate will be near 100% for repeat calls. Estimate: with a 6K-token shared prefix, Llama 3.1 70B, 10K deals/day, and $2/GPU-hour, if vLLM without prefix caching costs ~$X, SGLang costs ~0.1X. For the JHU humanoid's planning LLM, use SGLang for simulation-time planning calls (repeated system prompt), and vLLM or TensorRT-LLM for on-robot deployment, where single-request latency dominates.
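The ratio behind that estimate can be checked with simple token arithmetic. The sketch below counts prefill tokens only and deliberately does not fill in the dollar figure, which depends on measured throughput. The 600-token per-deal portion is an assumption made for illustration, the function name is hypothetical, and the cached case assumes the shared prefix is computed once and never evicted.

```python
def prefill_tokens_per_day(shared_prefix: int,
                           unique_per_request: int,
                           requests_per_day: int,
                           prefix_cached: bool) -> int:
    """Prompt tokens that must actually be computed per day."""
    if prefix_cached:
        # Shared prefix computed once, then only the unique tail per request.
        return shared_prefix + unique_per_request * requests_per_day
    # Without prefix caching, every request recomputes the full prompt.
    return (shared_prefix + unique_per_request) * requests_per_day


if __name__ == "__main__":
    SHARED = 6_000   # shared system prompt, from the text
    UNIQUE = 600     # per-deal tokens: an assumed value, not from the text
    DEALS = 10_000   # deals per day, from the text

    uncached = prefill_tokens_per_day(SHARED, UNIQUE, DEALS, prefix_cached=False)
    cached = prefill_tokens_per_day(SHARED, UNIQUE, DEALS, prefix_cached=True)
    print(uncached, cached, round(uncached / cached, 2))
```

Under these assumptions the uncached path processes 66,000,000 prompt tokens per day against 6,006,000 with the prefix cached, a ratio of about 11. The ratio shrinks as the per-deal portion grows, and decode tokens are not reduced at all, so the end-to-end saving is smaller than the prefill ratio whenever outputs are long.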
In their default deployment mode, all three frameworks run prefill and decode on the same GPU pod. llm-d (Chapter 18) separates them. In the llm-d world, the serving framework choice applies separately to prefill pods and decode pods: in principle, you could run SGLang on prefill pods (for prefix caching) and vLLM on decode pods (for broader hardware support) in the same cluster, provided the two can exchange KV cache state in a compatible format.