<?xml version="1.0" encoding="utf-8"?><?xml-stylesheet type="text/xsl" href="atom.xsl"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <id>https://llm-d.ai/blog</id>
    <title>llm-d Blog</title>
    <updated>2026-03-13T09:00:00.000Z</updated>
    <generator>https://github.com/jpmonette/feed</generator>
    <link rel="alternate" href="https://llm-d.ai/blog"/>
    <subtitle>llm-d Blog</subtitle>
    <icon>https://llm-d.ai/img/llm-d-favicon.png</icon>
    <entry>
        <title type="html"><![CDATA[Predicted-Latency Based Scheduling for LLMs]]></title>
        <id>https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms</id>
        <link href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms"/>
        <updated>2026-03-13T09:00:00.000Z</updated>
        <summary type="html"><![CDATA[A lightweight ML model trained online from live traffic replaces manually tuned heuristic weights with direct latency predictions, achieving 43% improvement in P50 end-to-end latency and 70% improvement in TTFT on a production-realistic workload.]]></summary>
        <content type="html"><![CDATA[<p>Not all LLM requests cost the same. A short prompt might complete in milliseconds, while a long one can occupy a GPU for seconds. If we can predict how long a request will take on each candidate server before dispatching it, we can make substantially better routing decisions. This post describes a system that does exactly that: a lightweight ML model trained online from live traffic that replaces manually tuned heuristic weights with direct latency predictions.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-load-balancing-problem-in-llm-serving">The Load Balancing Problem in LLM Serving<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#the-load-balancing-problem-in-llm-serving" class="hash-link" aria-label="Direct link to The Load Balancing Problem in LLM Serving" title="Direct link to The Load Balancing Problem in LLM Serving" translate="no">​</a></h2>
<p>The variation in request cost comes from how LLM inference works. It happens in two phases: first, the model processes the entire input prompt (the <strong>prefill</strong> phase), which is compute-heavy and scales with prompt length. Prefill can be accelerated when the server has already cached results from a similar prompt (prefix caching). Then it generates output tokens one at a time (the <strong>decode</strong> phase), which is memory-heavy and scales with the number of tokens generated.</p>
<p>Current load balancers try to account for this using signals like queue depth, memory pressure, cache locality, and batch size. But these signals often conflict: routing for cache reuse concentrates load, while routing for low utilization spreads it. Getting the balance right requires manual tuning of weights (see <a href="https://docs.nvidia.com/dynamo/latest/user-guides/kv-cache-aware-routing#cli-arguments" target="_blank" rel="noopener noreferrer" class="">NVIDIA Dynamo</a> or <a href="https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/config-text/?h=kv#kvcachescorer" target="_blank" rel="noopener noreferrer" class="">Inference Gateway</a>), and the right balance shifts as the workload varies.</p>
<p>Production traffic makes this concrete. The figures below show metrics from an internal Google service serving an open model over 24 hours; the patterns are representative of what we see across production LLM deployments.</p>
<ul>
<li class=""><strong>Request and response sizes are bursty and have huge variance</strong>: Input and output token counts swing by orders of magnitude over the course of hours, driven by traffic that arrives in waves rather than at a steady rate. (Token counts are normalized by the model's maximum context length.)</li>
</ul>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/predicted-latency/request_response_variance.webp" alt="Request and response size variance over 24 hours" style="width:100%;height:auto"></div>
<ul>
<li class=""><strong>Server load is uneven</strong>: These bursts hit servers differently. KV cache utilization (a measure of how much GPU memory is occupied by in-flight requests) spikes from 30% to over 70%. Queue depths spike just as unevenly.</li>
</ul>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/predicted-latency/uneven_server_load.webp" alt="Uneven server load across pods" style="width:100%;height:auto"></div>
<ul>
<li class=""><strong>Cache reuse is unstable</strong>: LLM servers cache previously computed results so that repeated prefixes (like a shared system prompt) don't need to be recomputed. Most requests see little to no cache reuse, while a subset benefits from high hit rates. But even that subset is unstable: hit rates collapse frequently as traffic patterns shift and cached prefixes get evicted. A load balancer tuned for high cache reuse will routinely encounter requests where that assumption doesn't hold.</li>
</ul>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/predicted-latency/unstable_cache_reuse.webp" alt="Unstable cache reuse patterns" style="width:85%;height:auto"></div>
<p>These patterns are consistent with observations across production LLM deployments. Azure inference traces show significant variation in request sizes and heavy-tailed token distributions over short time windows [<a href="https://jovans2.github.io/files/DynamoLLM_HPCA2025.pdf" target="_blank" rel="noopener noreferrer" class="">Stojkovic et al., "DynamoLLM," HPCA 2025</a>]. BurstGPT documents burstiness and diversified concurrency patterns across 10.31 million traces from Azure OpenAI services [<a href="https://arxiv.org/abs/2401.17644" target="_blank" rel="noopener noreferrer" class="">Wang et al., "BurstGPT," 2024</a>].</p>
<p>No fixed configuration can handle this. Weights tuned for high cache reuse cause latency violations during cache misses, and weights tuned for worst-case reuse waste capacity when caching is working well.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="predicted-latency-aware-scheduling">Predicted-Latency Aware Scheduling<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#predicted-latency-aware-scheduling" class="hash-link" aria-label="Direct link to Predicted-Latency Aware Scheduling" title="Direct link to Predicted-Latency Aware Scheduling" translate="no">​</a></h2>
<p>The two phases of LLM inference each have a standard latency metric: <strong>Time to First Token (TTFT)</strong> measures how long prefill takes, and <strong>Time Per Output Token (TPOT)</strong> measures how long each subsequent token takes to generate.</p>
<p>We train a lightweight XGBoost regression model in real time on the relationship between request and server characteristics -- prompt length, prefix cache hit rate, number of running requests, queue depth, KV cache utilization -- and the observed TTFT and TPOT for completed requests. The model learns to approximate the underlying performance behavior of the model server and accelerator hardware, continuously retraining on a sliding window of recent data to track shifting workload patterns.</p>
<p>At scheduling time, the model predicts the TTFT and TPOT a new request would experience on each candidate server, given that server's current state. The scheduler then routes the request to the server with the best predicted outcome. When SLOs are provided, the scheduler prefers servers with positive headroom (predicted latency below the SLO); otherwise, it simply picks the server with the lowest predicted latency.</p>
<p>This largely eliminates manual weight tuning. Rather than deciding how much to value cache locality versus queue depth versus memory pressure, the model learns those tradeoffs directly from observed latency data.</p>
<p>Across five benchmark scenarios ranging from cache-friendly to cache-intensive workloads, predicted-latency aware scheduling outperforms load+prefix-aware routing in four out of five cases and matches it in the fifth. On a representative MaaS workload, it achieves a 43% improvement in P50 end-to-end latency and a 70% improvement in TTFT.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="how-it-works">How It Works<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#how-it-works" class="hash-link" aria-label="Direct link to How It Works" title="Direct link to How It Works" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="design-goals">Design Goals<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#design-goals" class="hash-link" aria-label="Direct link to Design Goals" title="Direct link to Design Goals" translate="no">​</a></h3>
<p>The predicted-latency approach improves on utilization-based balancing in two key ways:</p>
<ul>
<li class="">
<p><strong>Balancing spread vs consolidation under changing traffic:</strong> Minimizing TPOT requires spreading load to reduce batch size, while minimizing TTFT benefits from consolidation to maximize prefix cache reuse. The optimal balance between these strategies shifts as traffic patterns change. Utilization-based balancers rely on manually tuned weights to make this tradeoff, and any fixed configuration will be wrong as the workload varies. The predicted-latency model learns the relationship between server state, request characteristics, and actual latency outcomes, allowing it to dynamically balance between spreading and consolidation as conditions change.</p>
</li>
<li class="">
<p><strong>Best-fit scheduling in the presence of SLOs:</strong> When requests have latency SLOs, the optimal strategy is best-fit: pack requests into servers that can still meet SLO targets, keeping other servers free for future requests with higher GPU requirements. Utilization-based balancers have no way to determine whether a server can meet a request's SLO; they only see proxy signals like queue depth and memory pressure. The predicted-latency model directly estimates TTFT and TPOT per server, allowing the scheduler to compute headroom against SLO targets and route accordingly. This best-fit strategy can be especially useful with a heterogeneous pool of model servers (say, a mix of H100 and B200 GPUs).</p>
</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="predicting-ttft-and-tpot">Predicting TTFT and TPOT<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#predicting-ttft-and-tpot" class="hash-link" aria-label="Direct link to Predicting TTFT and TPOT" title="Direct link to Predicting TTFT and TPOT" translate="no">​</a></h3>
<p>We use an XGBoost regression model that takes request and server state as input and outputs predicted TTFT and TPOT. The model was designed to be fast, accurate, pluggable, and able to learn online as workload characteristics shift.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="features">Features<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#features" class="hash-link" aria-label="Direct link to Features" title="Direct link to Features" translate="no">​</a></h4>
<p>We assume all pods use the same accelerator class, a simplification that can be addressed in the future. Beyond that, load and request shape drive most of the variation in TTFT and TPOT.</p>
<table><thead><tr><th style="text-align:left">Feature</th><th style="text-align:left">What It Captures</th><th style="text-align:left">Why It Matters</th></tr></thead><tbody><tr><td style="text-align:left"><strong>KV Cache Usage %</strong></td><td style="text-align:left">How full the decode state is</td><td style="text-align:left">High KV cache -&gt; higher TPOT and slower TTFT when memory is saturated</td></tr><tr><td style="text-align:left"><strong>Input Length</strong></td><td style="text-align:left">Weight of the prefill step</td><td style="text-align:left">Longer prompts -&gt; higher prefill cost -&gt; higher TTFT</td></tr><tr><td style="text-align:left"><strong>Queue Depth</strong></td><td style="text-align:left">Backlog before scheduling</td><td style="text-align:left">More waiting requests -&gt; higher TTFT; correlates with prefill interruptions -&gt; affects TPOT</td></tr><tr><td style="text-align:left"><strong>Running Requests</strong></td><td style="text-align:left">Active GPU concurrency</td><td style="text-align:left">Higher concurrency -&gt; higher TTFT; larger decode batches -&gt; higher TPOT</td></tr><tr><td style="text-align:left"><strong>Prefix Cache Match %</strong></td><td style="text-align:left">How much KV reuse is possible</td><td style="text-align:left">High match -&gt; faster prefill -&gt; lower TTFT; low match -&gt; full attention -&gt; higher TTFT</td></tr><tr><td style="text-align:left"><strong>Input Tokens In Flight</strong></td><td style="text-align:left">Input tokens dispatched but not yet prefilled, plus input tokens already prefilled but still occupying KV cache (request not complete)</td><td style="text-align:left">Captures both incoming prefill pressure and lingering memory footprint -&gt; higher TTFT; helps the model anticipate load before it hits the server</td></tr></tbody></table>
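<p>As a sketch of how these features might feed the regressors (the type and function names here are illustrative, not the actual implementation), the feature vector for one (request, server) pair could be assembled like this:</p>

```python
from dataclasses import dataclass

@dataclass
class ServerState:
    kv_cache_pct: float          # KV cache usage, 0-100
    queue_depth: int             # requests waiting to be scheduled
    running_requests: int        # active GPU concurrency
    input_tokens_in_flight: int  # dispatched or still occupying KV cache

@dataclass
class Request:
    input_len: int               # prompt length in tokens
    prefix_match_pct: float      # fraction of prompt covered by cached prefix, 0-1

def feature_vector(req: Request, srv: ServerState) -> list[float]:
    """Order matters: the TTFT/TPOT regressors are trained on this layout."""
    return [
        srv.kv_cache_pct,
        float(req.input_len),
        float(srv.queue_depth),
        float(srv.running_requests),
        req.prefix_match_pct,
        float(srv.input_tokens_in_flight),
    ]
```

The same vector is used both at training time (paired with observed TTFT/TPOT) and at prediction time (paired with each candidate server's current state).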
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="training-data">Training Data<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#training-data" class="hash-link" aria-label="Direct link to Training Data" title="Direct link to Training Data" translate="no">​</a></h4>
<p>Performance characteristics in LLM serving are batch-dependent and shift too quickly for long-term historical averages to remain meaningful. To stay aligned with real traffic, the model:</p>
<ul>
<li class="">Collects the most recent samples using a <strong>sliding window</strong></li>
<li class="">Stratifies them into coarse <strong>buckets</strong> (KV cache % in steps of 10, prefix hit rate in steps of 0.25, etc.)</li>
<li class="">Continuously retrains on this stratified dataset</li>
</ul>
<p>Bucketing with a sliding window is important because it maintains samples from regimes that aren't showing up in the latest traffic. Without bucketing, a single global sliding window would let the newest data overwrite everything. For example, if current traffic sits around 60% KV cache utilization, older samples from 30% KV cache would eventually disappear -- and the model would forget how to predict in that regime.</p>
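<p>A minimal sketch of this bucketed sliding window (bucket granularities follow the text; class and parameter names are illustrative):</p>

```python
from collections import deque

class StratifiedWindow:
    """Keeps a bounded sliding window of samples per (KV-cache, prefix-hit)
    bucket, so rare regimes are not overwritten by whatever traffic is newest."""

    def __init__(self, per_bucket: int = 200):
        self.per_bucket = per_bucket
        self.buckets: dict[tuple[int, int], deque] = {}

    @staticmethod
    def key(kv_cache_pct: float, prefix_hit: float) -> tuple[int, int]:
        # KV cache % in steps of 10, prefix hit rate in steps of 0.25
        return (int(kv_cache_pct // 10), int(prefix_hit // 0.25))

    def add(self, kv_cache_pct: float, prefix_hit: float, sample) -> None:
        b = self.buckets.setdefault(self.key(kv_cache_pct, prefix_hit),
                                    deque(maxlen=self.per_bucket))
        b.append(sample)  # oldest sample in this bucket is evicted first

    def training_set(self) -> list:
        return [s for b in self.buckets.values() for s in b]
```

With a single global window, a burst of 60%-utilization traffic would evict every 30%-utilization sample; here it can only evict samples within its own bucket.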
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="request-scheduling">Request Scheduling<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#request-scheduling" class="hash-link" aria-label="Direct link to Request Scheduling" title="Direct link to Request Scheduling" translate="no">​</a></h3>
<p>The latency predictor runs as a sidecar to the <a href="https://gateway-api-inference-extension.sigs.k8s.io/" target="_blank" rel="noopener noreferrer" class="">Inference Gateway Endpoint Picker</a> (EPP), hosting both training and prediction servers:</p>
<ul>
<li class=""><strong>Training server:</strong> Continuously trains the model from live traffic, sampling data across KV cache, queue states, and prefix scores to maintain a stratified training dataset. As new requests complete, it refreshes the dataset and retrains the TTFT/TPOT models.</li>
<li class=""><strong>Prediction servers:</strong> Serve the trained models, returning predicted TTFT and TPOT given the current server load and request features.</li>
</ul>
<p>We added a <a href="https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/0845-scheduler-architecture-proposal#score" target="_blank" rel="noopener noreferrer" class="">predicted-latency scorer</a> to the EPP. The scorer compares predicted latencies to per-request SLOs and computes <strong>headroom</strong> (the SLO target minus the predicted latency). It then gives higher scores to servers with positive headroom, packing requests into servers that can still meet SLOs while keeping other pods free for future requests. If no SLOs are provided, it simply prefers servers with lower predicted latencies.</p>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/predicted-latency/architecture_sidecars.webp" alt="Architecture: predicted-latency trainer and predictor sidecars" style="width:85%;height:auto"><p style="font-size:0.9em;margin-top:8px"><em>The predicted-latency trainer and predictor modules are deployed as sidecars to the EPP. The trainer is invoked at the post-response stage. The predictor is invoked during scheduling and optionally at the post-response stage. A new predicted-latency scorer utilizes predictions from the ML model.</em></p></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="prediction-accuracy">Prediction Accuracy<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#prediction-accuracy" class="hash-link" aria-label="Direct link to Prediction Accuracy" title="Direct link to Prediction Accuracy" translate="no">​</a></h3>
<p>Below we show predicted vs actual TTFT and TPOT over a benchmark run (scenario C below) as QPS increases stepwise toward saturation.</p>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/predicted-latency/predicted_vs_actual_tpot.webp" alt="Predicted vs actual TPOT over time" style="width:85%;height:auto"><p style="font-size:0.9em;margin-top:8px"><em>Predicted (red) vs actual (blue) TPOT over time. The model tracks steady-state TPOT well, even at transient spikes.</em></p></div>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/predicted-latency/predicted_vs_actual_ttft.webp" alt="Predicted vs actual TTFT over time" style="width:85%;height:auto"><p style="font-size:0.9em;margin-top:8px"><em>Predicted (red) vs actual (blue) TTFT over time. The model tracks TTFT closely as it ramps from near zero to 5 minutes across increasing QPS levels.</em></p></div>
<p>Across multiple benchmark runs, the model achieves a Mean Absolute Percentage Error (MAPE) of approximately 5%. This is not surprising -- accelerator performance is fairly deterministic given the current server state and request characteristics. The same prompt length, at the same KV cache utilization, with the same number of running requests, will produce similar TTFT and TPOT. The model simply learns this mapping.</p>
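<p>For reference, MAPE is the mean of |predicted - actual| / actual, expressed as a percentage; a quick sketch:</p>

```python
def mape(actual: list[float], predicted: list[float]) -> float:
    """Mean Absolute Percentage Error, in percent."""
    return 100.0 * sum(abs(p - a) / a
                       for a, p in zip(actual, predicted)) / len(actual)

# e.g. predicted TTFTs within a few percent of the observed values
print(round(mape([100.0, 200.0, 400.0], [105.0, 190.0, 400.0]), 2))  # 3.33
```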
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="endpoint-selection">Endpoint Selection<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#endpoint-selection" class="hash-link" aria-label="Direct link to Endpoint Selection" title="Direct link to Endpoint Selection" translate="no">​</a></h3>
<p>Given accurate TTFT and TPOT predictions, there are multiple possible algorithms for choosing the optimal endpoint -- for example, optimization-based approaches or multi-armed bandit strategies. We chose a greedy approach for its simplicity and low overhead, combined with a cache-aware affinity gate to prevent fragmentation.</p>
<p><strong>Latency-based scoring.</strong> When SLOs are provided, the scorer computes <strong>headroom</strong> for each candidate server: how much room remains before the predicted latency exceeds the SLO target. To combine TTFT and TPOT into a single headroom score, we use a weighted combination -- by default 80% TTFT and 20% TPOT -- reflecting that TTFT is typically the more constraining metric in practice. The scheduler then does <strong>best-fit</strong>: it routes to the server with the least positive headroom, packing requests into servers that can still meet SLOs while keeping other servers free for future requests. When no SLOs are provided, the scorer simply routes to the server with the lowest predicted latency (most room).</p>
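<p>The scoring and best-fit selection described above can be sketched as follows (the 0.8/0.2 weights are the stated defaults; function names and the candidate tuple layout are assumptions for illustration):</p>

```python
DEFAULT_TTFT_WEIGHT, DEFAULT_TPOT_WEIGHT = 0.8, 0.2

def headroom_ms(pred_ttft, pred_tpot, slo_ttft, slo_tpot,
                w_ttft=DEFAULT_TTFT_WEIGHT, w_tpot=DEFAULT_TPOT_WEIGHT):
    """Weighted room (ms) before predicted latency exceeds the SLO targets.
    Positive means the server is predicted to meet the SLO."""
    return w_ttft * (slo_ttft - pred_ttft) + w_tpot * (slo_tpot - pred_tpot)

def pick_server(candidates, slo=None):
    """candidates: list of (name, predicted_ttft_ms, predicted_tpot_ms)."""
    if slo is None:
        # No SLO: lowest weighted predicted latency wins.
        return min(candidates,
                   key=lambda c: DEFAULT_TTFT_WEIGHT * c[1]
                               + DEFAULT_TPOT_WEIGHT * c[2])[0]
    slo_ttft, slo_tpot = slo
    scored = [(name, headroom_ms(t, p, slo_ttft, slo_tpot))
              for name, t, p in candidates]
    fitting = [s for s in scored if s[1] > 0]
    if fitting:
        # Best-fit: least positive headroom, packing requests tightly.
        return min(fitting, key=lambda s: s[1])[0]
    # No server fits: fall back to the one closest to fitting.
    return max(scored, key=lambda s: s[1])[0]
```

Note how best-fit inverts the no-SLO preference: with an SLO it deliberately picks the <em>tightest</em> fitting server, not the fastest one.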
<p><strong>Cache-aware affinity gate.</strong> Pure greedy routing can be counterproductive: a pod with no load but no cached prefix may have a lower predicted latency right now, but routing there means paying the full prefill cost and abandoning the cache built up on another pod. Over many requests, this leads to cache fragmentation -- prefixes are scattered across pods, no pod builds deep cache reuse, and the cluster loses the latency benefit of caching altogether. The opposite extreme is equally harmful: always routing to the pod with the best cache match concentrates popular prefixes on a few pods, which collapse under memory pressure.</p>
<p>To balance cache exploitation with exploration, the scorer uses an epsilon-greedy affinity gate:</p>
<ul>
<li class="">
<p><strong>Exploit (99%):</strong> Filter candidates to pods whose prefix cache score exceeds a threshold (<code>affinityGateTau</code>, default 0.80). Among these, select the pod with the best predicted latency. Because the latency model already credits the cache benefit, this naturally picks the pod where cache reuse translates into actual lower latency, not just the highest raw cache score.</p>
</li>
<li class="">
<p><strong>Explore (1%):</strong> With probability <code>epsilonExploreSticky</code> (default 0.01), ignore the affinity gate entirely and consider all pods. These seeds cache entries on non-sticky pods, ensuring the cluster maintains cache diversity. Over time, these seeded entries grow into viable affinity targets, preventing the system from collapsing into a few overloaded cache-hot pods.</p>
</li>
<li class="">
<p><strong>Load gate:</strong> Even in the exploit path, if the best sticky pod's predicted TTFT exceeds the best overall pod's TTFT by more than <code>affinityMaxTTFTPenaltyMs</code> (default 5000ms), affinity is broken. This catches the case where queueing cost on a cache-hot pod has grown to outweigh the cache benefit; the predictor's latency estimate makes this comparison possible without manual thresholds on queue depth or memory.</p>
</li>
</ul>
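<p>Putting the three rules together, the affinity gate might look like this sketch (parameter values are the stated defaults; the pod tuple layout and function name are assumptions):</p>

```python
import random

AFFINITY_GATE_TAU = 0.80               # minimum prefix cache score for "sticky" pods
EPSILON_EXPLORE_STICKY = 0.01          # probability of ignoring the gate
AFFINITY_MAX_TTFT_PENALTY_MS = 5000.0  # break affinity beyond this queueing cost

def select_pod(pods, rng=random.random):
    """pods: list of (name, prefix_cache_score, predicted_ttft_ms)."""
    best_overall = min(pods, key=lambda p: p[2])
    sticky = [p for p in pods if p[1] >= AFFINITY_GATE_TAU]
    # Explore path: occasionally seed cache entries on non-sticky pods.
    if not sticky or rng() < EPSILON_EXPLORE_STICKY:
        return best_overall[0]
    best_sticky = min(sticky, key=lambda p: p[2])
    # Load gate: a cache-hot pod whose queueing outweighs the cache
    # benefit is abandoned in favor of the overall fastest pod.
    if best_sticky[2] - best_overall[2] > AFFINITY_MAX_TTFT_PENALTY_MS:
        return best_overall[0]
    return best_sticky[0]
```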
<p>The default threshold of 0.80 comes from production observation: prefix cache scores follow a bimodal distribution, roughly half of request-pod pairs have very high cache match (&gt;0.80) and half have low match (&lt;0.80). This reflects how prefix caching works in practice: in multi-turn conversations, a pod either has the conversation history cached from prior turns or it doesn't. Partial matches from unrelated conversations contribute very little because caching is block-based. The 0.80 threshold cleanly separates these two populations, so the affinity gate routes to pods that genuinely have your conversation cached rather than pods with incidental partial matches.</p>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/predicted-latency/prefix_hit_ratio_dist.webp" alt="Prefix cache hit ratio distribution showing bimodal pattern" style="width:85%;height:auto"><p style="font-size:0.9em;margin-top:8px"><em>A typical production prefix hit ratio distribution observed in internal workloads, showing the bimodal pattern that motivates the 0.80 affinity threshold.</em></p></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="benchmark-scenario-comparison">Benchmark Scenario Comparison<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#benchmark-scenario-comparison" class="hash-link" aria-label="Direct link to Benchmark Scenario Comparison" title="Direct link to Benchmark Scenario Comparison" translate="no">​</a></h2>
<p>The table below contrasts five scenarios, ranging from cache-friendly (high prefix-sharing) to cache-intensive scenarios. We used the <a href="https://github.com/kubernetes-sigs/inference-perf/pull/301" target="_blank" rel="noopener noreferrer" class=""><strong>inference-perf</strong></a> library to enable shared prefix benchmarking configurations with multi-turn chat support. See <a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#appendix" class="">Appendix</a> for a complete analysis of the workloads.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="load-balancing-scorers">Load Balancing Scorers<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#load-balancing-scorers" class="hash-link" aria-label="Direct link to Load Balancing Scorers" title="Direct link to Load Balancing Scorers" translate="no">​</a></h3>
<ul>
<li class="">
<p><strong>Predicted Latency Scorer</strong></p>
</li>
<li class="">
<p><strong>Load+Prefix Scorer:</strong> Combines pod load metrics (KV cache utilization and queued requests) with prefix cache awareness, balancing between resource utilization and cache locality.</p>
<ul>
<li class=""><strong>Load metrics:</strong> KV cache utilization and queued request count</li>
<li class=""><strong>Prefix cache awareness:</strong> Considers cached prefix availability</li>
</ul>
<p>The set of <a href="https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/0f0dff6866da606b6439254eda787b24be1b5110/config/charts/inferencepool/templates/epp-config.yaml#L55" target="_blank" rel="noopener noreferrer" class="">weights</a> used was:
(1, 1, 1) prefix scorer: 1, queue scorer: 1, kv cache scorer: 1</p>
</li>
<li class="">
<p><strong>K8s Default Load Balancer:</strong> Standard Kubernetes round-robin or least-connection load balancing without cache or latency awareness (baseline).</p>
</li>
</ul>
<p>Note that <strong>Predicted Latency Scorer</strong> eliminates the need to manually tune relative weights between different scoring components, as the ML model learns optimal trade-offs from historical data.</p>
<hr>
<p><strong>Hardware Configuration:</strong> 10 model servers, each with 2x H100 80GB GPUs (TP=2, DP=1, EP=1, no disaggregation) for scenarios A–D. For the ShareGPT workload, which has much shorter prompts, we used 8 model servers, each with 1x H100 80GB GPU (TP=1), to achieve high KV cache utilization.</p>
<p>Benchmark configuration: we tested multiple scenarios detailed in the following table. Think of <em>num_groups</em> as the number of unique system prompts and <em>num_prompts_per_group</em> as the number of users that share a system prompt.</p>
<table><thead><tr><th style="text-align:left">Scenario</th><th style="text-align:left">Description</th><th style="text-align:left">Benchmark Configuration</th><th style="text-align:left">Best Performing Scorer</th></tr></thead><tbody><tr><td style="text-align:left"><strong>A. Shared Prefix -- High System Prompt Overlap, No System Cache Pressure</strong></td><td style="text-align:left">This workload represents a regime where shared system prefixes amortize extremely well, and user context grows slowly enough that cache pressure remains low.</td><td style="text-align:left"><code>num_groups=6</code> <code>num_prompts_per_group=1000</code> <code>system_prompt_len=1000</code> <code>question_len=30 +/- 9</code> <code>output_len=1000 +/- 300</code> <code>enable_multi_turn_chat=true</code></td><td style="text-align:left"><strong>Predicted Latency Scorer</strong></td></tr><tr><td style="text-align:left"><strong>B. Shared Prefix -- Moderate System Prompt Overlap, No System Cache Pressure</strong></td><td style="text-align:left">This workload also represents a regime where shared system prefixes amortize extremely well. However user context grows faster than workload A leading to onset of prefix cache evictions at a lower QPS.</td><td style="text-align:left"><code>num_groups=6</code> <code>num_prompts_per_group=1000</code> <code>system_prompt_len=1000</code> <code>question_len=3000 +/- 900</code> <code>output_len=1000 +/- 300</code> <code>enable_multi_turn_chat=true</code></td><td style="text-align:left"><strong>Predicted Latency Scorer</strong></td></tr><tr><td style="text-align:left"><strong>C. Shared Prefix -- Low System Prompt Overlap, High System Cache Pressure</strong></td><td style="text-align:left">In this workload large system prompts fail to amortize due to low reuse, quickly consuming cache capacity. 
System-prefix eviction destroys shared reuse, leading to abrupt performance degradation across users.</td><td style="text-align:left"><code>num_groups=150</code> <code>num_prompts_per_group=5</code> <code>system_prompt_len=6000</code> <code>question_len=1200 +/- 360</code> <code>output_len=1000 +/- 300</code> <code>enable_multi_turn_chat=true</code></td><td style="text-align:left"><strong>Comparable</strong></td></tr><tr><td style="text-align:left"><strong>D. Shared Prefix -- Low System Prompt Overlap, Low System Cache Pressure</strong></td><td style="text-align:left">In this workload, like in workload C, system prompts fail to amortize due to low reuse. But smaller system prompts relative to user prompts ensure that performance is dominated by per-request computation rather than system cache.</td><td style="text-align:left"><code>num_groups=150</code> <code>num_prompts_per_group=5</code> <code>system_prompt_len=1000</code> <code>question_len=6200 +/- 1860</code> <code>output_len=1000 +/- 300</code> <code>enable_multi_turn_chat=true</code></td><td style="text-align:left"><strong>Predicted Latency Scorer</strong></td></tr><tr><td style="text-align:left"><strong>ShareGPT</strong></td><td style="text-align:left">A chatbot-style workload with minimal prefix overlap across prompts.</td><td style="text-align:left"></td><td style="text-align:left"><strong>Predicted Latency Scorer</strong></td></tr></tbody></table>
<hr>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="results">Results<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#results" class="hash-link" aria-label="Direct link to Results" title="Direct link to Results" translate="no">​</a></h3>
<p>Below we compare different load balancing strategies across the four scenarios above and the ShareGPT dataset. In every scenario, the QPS is increased until throughput saturation occurs and queues begin to form. At each point, requests are sent for 100 seconds, and we wait for their completion before moving on to the next QPS. No SLOs were assumed; the predicted latency scorer simply selects pods with lower predicted latency.</p>
<p>The charts show two metrics: <strong>NTPOT</strong> (Normalized Time Per Output Token — E2E Latency divided by output length to make it comparable across requests of different output lengths), and <strong>output tokens per sec</strong>.</p>
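<p>The NTPOT normalization is simply end-to-end latency divided by output length:</p>

```python
def ntpot_ms(e2e_latency_ms: float, output_tokens: int) -> float:
    """Normalized Time Per Output Token: comparable across output lengths."""
    return e2e_latency_ms / output_tokens

print(ntpot_ms(12000.0, 400))  # 30.0 ms per output token
```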
<p>In Scenarios A and B, where system cache is amortized across pods, the predicted latency scorer performs best. In Scenario D, which has some system cache churn but user prompts much larger than system prompts, the predicted latency scorer performs as well as load+prefix-aware routing with weights (1, 1, 1). In Scenario C, which has very high system cache churn, the predicted latency scorer performs comparably to load+prefix-aware scorers, while still outperforming standard Kubernetes load balancing. In this scenario, performance is governed by <strong>discrete cache-eviction events rather than gradual saturation</strong>, while the latency predictor's <em>greedy</em> routing strategy is inherently better suited to modeling <strong>continuous resource contention</strong>, such as queueing and KV-cache utilization. Alternate prefix distribution strategies could further improve performance in high-churn scenarios. For instance, <a href="https://github.com/vllm-project/vllm/issues/23083" target="_blank" rel="noopener noreferrer" class=""><strong>pinning critical KV cache</strong></a> prefixes (like system prompts in this case) ensures they remain non-evictable. Similarly, using a <a href="https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators" target="_blank" rel="noopener noreferrer" class=""><strong>no-hit-lru-scorer</strong></a> can improve performance by intelligently distributing "cold" requests to prevent hotspots during the formation of new prefix-caches.</p>
<p><strong>Overall, predicted-latency aware routing consistently performs as well as or better than standard Kubernetes routing and load+prefix-aware routing in all tested scenarios, while eliminating the need for manual parameter tuning.</strong></p>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/predicted-latency/benchmark_scenario_a.webp" alt="Scenario A benchmark results" style="width:85%;height:auto"></div>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/predicted-latency/benchmark_scenario_b.webp" alt="Scenario B benchmark results" style="width:85%;height:auto"></div>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/predicted-latency/benchmark_scenario_c.webp" alt="Scenario C benchmark results" style="width:85%;height:auto"></div>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/predicted-latency/benchmark_scenario_d.webp" alt="Scenario D benchmark results" style="width:85%;height:auto"></div>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/predicted-latency/benchmark_sharegpt.webp" alt="ShareGPT benchmark results" style="width:85%;height:auto"></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="production-workload">Production Workload<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#production-workload" class="hash-link" aria-label="Direct link to Production Workload" title="Direct link to Production Workload" translate="no">​</a></h2>
<p>In addition to the synthetic scenarios above, we evaluated against a workload derived from real production traffic at an internal Google service. The benchmarking profile was constructed from 7 days of traffic serving a large open model via vLLM, analyzing input token counts, output token counts, request rates, and prefix cache hit rates. The resulting profile represents the median (p50) production request.</p>
<p>This workload exhibits the characteristics discussed in the introduction: high variance in both input token counts (mean 729, std 13550, max 131072) and output token counts (mean 300, std 2213, max 8192), along with high but unstable prefix cache reuse (~94% at peak, with frequent collapses). This makes it a strong test of whether routing strategies can adapt to rapidly shifting traffic patterns.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="benchmark-configuration">Benchmark Configuration<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#benchmark-configuration" class="hash-link" aria-label="Direct link to Benchmark Configuration" title="Direct link to Benchmark Configuration" translate="no">​</a></h3>
<ul>
<li class=""><strong>Model:</strong> Qwen3-480B on vLLM</li>
<li class=""><strong>Hardware:</strong> 13 servers, each with 8x NVIDIA H200 GPUs</li>
<li class=""><strong>Traffic shape:</strong> 8-stage load ladder alternating between 1.0 and 5.0 RPS to simulate realistic traffic spikes</li>
<li class=""><strong>Request distribution:</strong> system_prompt_len=1000, question_len=729 +/- 13550, output_len=300 +/- 2213, multi_turn=true</li>
<li class=""><strong>Cache hit rate:</strong> ~94% (matching production peak)</li>
<li class=""><strong>Load type:</strong> Poisson distribution with concurrency limit of 1000</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="routing-strategies-compared">Routing Strategies Compared<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#routing-strategies-compared" class="hash-link" aria-label="Direct link to Routing Strategies Compared" title="Direct link to Routing Strategies Compared" translate="no">​</a></h3>
<table><thead><tr><th style="text-align:left">Strategy</th><th style="text-align:left">Description</th></tr></thead><tbody><tr><td style="text-align:left"><strong>No Gateway</strong></td><td style="text-align:left">Direct connection to model servers (k8s round-robin baseline)</td></tr><tr><td style="text-align:left"><strong>Old Params (111)</strong></td><td style="text-align:left">Load+prefix scorer with weights (1, 1, 1)</td></tr><tr><td style="text-align:left"><strong>New Params (322)</strong></td><td style="text-align:left">Load+prefix scorer with tuned weights (3, 2, 2)</td></tr><tr><td style="text-align:left"><strong>Latency Prediction</strong></td><td style="text-align:left">Predicted-latency based routing, replacing all heuristic scorers</td></tr></tbody></table>
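<p>For intuition, the heuristic strategies above combine per-pod sub-scores with fixed weights, while latency prediction replaces the weighted sum with a direct estimate. The sketch below is an illustrative model only; the scorer names, normalization, and formula are our assumptions, not the actual EPP scorer plugins:</p>

```python
# Illustrative model of load+prefix weighted scoring. The three weights
# correspond loosely to the (queue, kv-cache, prefix) tuples like (1, 1, 1)
# and (3, 2, 2) in the table above. All field names and the normalization
# are hypothetical, for illustration only.

def heuristic_score(pod: dict, weights=(1, 1, 1)) -> float:
    w_queue, w_kv, w_prefix = weights
    # Each sub-score is assumed normalized to [0, 1]; higher is better.
    return (w_queue * (1 - pod["queue_depth_norm"])
            + w_kv * (1 - pod["kv_utilization"])
            + w_prefix * pod["prefix_hit_ratio"])

pods = [
    {"name": "pod-a", "queue_depth_norm": 0.5, "kv_utilization": 0.5, "prefix_hit_ratio": 1.0},
    {"name": "pod-b", "queue_depth_norm": 0.0, "kv_utilization": 0.1, "prefix_hit_ratio": 0.0},
]

# With equal weights, the warm prefix cache on pod-a wins:
best_111 = max(pods, key=lambda p: heuristic_score(p, (1, 1, 1)))
print(best_111["name"])  # pod-a  (scores: 2.0 vs 1.9)

# With load-favoring weights, the idle pod-b wins instead:
best_322 = max(pods, key=lambda p: heuristic_score(p, (3, 2, 2)))
print(best_322["name"])  # pod-b  (scores: 4.5 vs 4.8)
```

<p>The point of the sketch is that the winner depends on the hand-picked weights; that tuning burden is exactly what direct latency prediction removes.</p>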
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="results-1">Results<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#results-1" class="hash-link" aria-label="Direct link to Results" title="Direct link to Results" translate="no">​</a></h3>
<table><thead><tr><th style="text-align:left">Scenario</th><th style="text-align:left">Success Rate</th><th style="text-align:left">E2E p50</th><th style="text-align:left">E2E p95</th><th style="text-align:left">TTFT p50</th><th style="text-align:left">TTFT p95</th><th style="text-align:left">TPOT p50</th><th style="text-align:left">TPOT p99</th></tr></thead><tbody><tr><td style="text-align:left">No Gateway</td><td style="text-align:left">~99.8%</td><td style="text-align:left">15.98s</td><td style="text-align:left">38.85s</td><td style="text-align:left">4.47s</td><td style="text-align:left">24.04s</td><td style="text-align:left">35ms</td><td style="text-align:left">93ms</td></tr><tr><td style="text-align:left">Load + Prefix (111)</td><td style="text-align:left">~99.9%</td><td style="text-align:left">16.42s</td><td style="text-align:left">35.06s</td><td style="text-align:left">2.86s</td><td style="text-align:left">18.06s</td><td style="text-align:left">39ms</td><td style="text-align:left">103ms</td></tr><tr><td style="text-align:left">Load + Prefix (322)</td><td style="text-align:left">100%</td><td style="text-align:left">13.42s</td><td style="text-align:left">26.55s</td><td style="text-align:left">3.38s</td><td style="text-align:left">16.78s</td><td style="text-align:left">28ms</td><td style="text-align:left">63ms</td></tr><tr><td style="text-align:left"><strong>Latency Prediction</strong></td><td style="text-align:left"><strong>~99.9%</strong></td><td style="text-align:left"><strong>9.06s</strong></td><td style="text-align:left"><strong>22.57s</strong></td><td style="text-align:left"><strong>0.97s</strong></td><td style="text-align:left"><strong>11.34s</strong></td><td style="text-align:left"><strong>22ms</strong></td><td style="text-align:left"><strong>53ms</strong></td></tr></tbody></table>
<p>Latency prediction delivers the lowest latency across all metrics:</p>
<ul>
<li class=""><strong>E2E latency:</strong> 9.06s p50 (43% lower than the Kubernetes baseline and 32% lower than the best heuristic-based approach) and 22.57s p95 (15% lower than the best heuristic)</li>
<li class=""><strong>TTFT:</strong> 0.97s p50 (70% lower than the tuned heuristic's 3.38s) and 11.34s p95 (32% lower)</li>
<li class=""><strong>TPOT:</strong> 22ms p50 and 53ms p99, the lowest across all strategies</li>
</ul>
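<p>The percentages above can be rechecked directly from the Results table; a small sketch (values copied from the table, in seconds):</p>

```python
# Recompute the relative improvements from the Results table above.

def pct_lower(new: float, old: float) -> float:
    """How much lower `new` is than `old`, in percent (1 decimal)."""
    return round((old - new) / old * 100, 1)

# E2E p50: latency prediction (9.06 s) vs. tuned heuristic (13.42 s)
# and vs. the no-gateway Kubernetes baseline (15.98 s).
print(pct_lower(9.06, 13.42))    # 32.5 (vs. best heuristic)
print(pct_lower(9.06, 15.98))    # 43.3 (vs. k8s baseline)
# TTFT: 0.97 s p50 and 11.34 s p95 vs. the tuned heuristic (3.38 s / 16.78 s).
print(pct_lower(0.97, 3.38))     # 71.3
print(pct_lower(11.34, 16.78))   # 32.4
```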
<p>The improvement is most pronounced at the high-load stages (stages 3, 5, 7), where latency prediction consistently stays well below all other strategies.</p>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/predicted-latency/benchmark_production_realistic.webp" alt="Production-realistic workload results" style="width:85%;height:auto"></div>
<p>These results are particularly notable because the Load+Prefix (3,2,2) weights were specifically tuned for this workload based on analysis of the production traffic profile. Latency prediction outperforms it -- and all other heuristic-based approaches -- without any workload-specific tuning.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="prediction-server-scalability">Prediction Server Scalability<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#prediction-server-scalability" class="hash-link" aria-label="Direct link to Prediction Server Scalability" title="Direct link to Prediction Server Scalability" translate="no">​</a></h2>
<p>At high replica counts, the EPP issues one prediction call per candidate pod per incoming request, so prediction QPS scales with both request rate and cluster size. At each QPS level in the table below, we assume 100 model server endpoints, so the prediction server is generating predictions for all 100 pods per request. To handle this, the Go sidecar coalesces concurrent EPP prediction requests within a 1ms window into a single batched HTTP call, and load-balances across multiple prediction server instances -- each running 28 uvicorn workers. Latency scales roughly linearly with QPS, and adding a prediction server adds 28 cores of inference capacity.</p>
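<p>The coalescing behavior can be modeled as a small asyncio batcher. This is a sketch of the idea only, not the actual Go sidecar; all names here are hypothetical:</p>

```python
import asyncio

class PredictionCoalescer:
    """Illustrative model of the sidecar's batching: prediction calls that
    arrive within a short window are grouped into one batched backend call."""

    def __init__(self, backend, window_s: float = 0.001):
        self.backend = backend     # async fn: list of features -> list of predictions
        self.window_s = window_s   # coalescing window (1 ms in the setup above)
        self._pending = []         # (features, future) pairs awaiting the next flush
        self._flusher = None

    async def predict(self, features):
        fut = asyncio.get_running_loop().create_future()
        self._pending.append((features, fut))
        if self._flusher is None:  # first request in the window starts the timer
            self._flusher = asyncio.ensure_future(self._flush_after_window())
        return await fut

    async def _flush_after_window(self):
        await asyncio.sleep(self.window_s)
        batch, self._pending, self._flusher = self._pending, [], None
        results = await self.backend([f for f, _ in batch])  # one batched call
        for (_, fut), result in zip(batch, results):
            fut.set_result(result)

async def demo():
    calls = []
    async def fake_backend(batch):      # stand-in for the HTTP prediction server
        calls.append(len(batch))
        return [x * 2 for x in batch]   # dummy "prediction"
    c = PredictionCoalescer(fake_backend)
    # 100 concurrent per-pod prediction requests land in a single backend call.
    results = await asyncio.gather(*(c.predict(i) for i in range(100)))
    return calls, results

calls, results = asyncio.run(demo())
print(calls)        # [100] -- one batched call for all 100 requests
print(results[:3])  # [0, 2, 4]
```

<p>The same window-then-flush pattern amortizes HTTP overhead across candidate pods, which is why prediction latency grows only modestly with QPS in the table below.</p>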
<table><thead><tr><th style="text-align:center">QPS</th><th style="text-align:center">Avg (ms)</th><th style="text-align:center">p50 (ms)</th><th style="text-align:center">p99 (ms)</th><th style="text-align:center">p99.9 (ms)</th><th style="text-align:center">Prediction Servers</th></tr></thead><tbody><tr><td style="text-align:center">1</td><td style="text-align:center">15.7</td><td style="text-align:center">15</td><td style="text-align:center">25</td><td style="text-align:center">25</td><td style="text-align:center">1</td></tr><tr><td style="text-align:center">10</td><td style="text-align:center">13.5</td><td style="text-align:center">13</td><td style="text-align:center">18</td><td style="text-align:center">46</td><td style="text-align:center">1</td></tr><tr><td style="text-align:center">100</td><td style="text-align:center">12.8</td><td style="text-align:center">12</td><td style="text-align:center">16</td><td style="text-align:center">46</td><td style="text-align:center">1</td></tr><tr><td style="text-align:center">1,000</td><td style="text-align:center">15.0</td><td style="text-align:center">15</td><td style="text-align:center">26</td><td style="text-align:center">49</td><td style="text-align:center">1</td></tr><tr><td style="text-align:center">2,500</td><td style="text-align:center">~19</td><td style="text-align:center">~17</td><td style="text-align:center">~36</td><td style="text-align:center">~65</td><td style="text-align:center">1</td></tr><tr><td style="text-align:center">5,000</td><td style="text-align:center">~27</td><td style="text-align:center">~23</td><td style="text-align:center">~74</td><td style="text-align:center">~99</td><td style="text-align:center">2</td></tr><tr><td style="text-align:center">7,500</td><td style="text-align:center">~35</td><td style="text-align:center">~30</td><td style="text-align:center">~96</td><td style="text-align:center">~137</td><td style="text-align:center">3</td></tr><tr><td style="text-align:center">10,000</td><td 
style="text-align:center">~48</td><td style="text-align:center">~40</td><td style="text-align:center">~137</td><td style="text-align:center">~189</td><td style="text-align:center">4</td></tr></tbody></table>
<p><em>5-minute stability test at each level. Each prediction server runs 28 uvicorn workers on a C4 machine (28 cores). All runs achieved 100% success rate.</em></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="try-it-a-well-lit-path">Try It: A Well-Lit Path<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#try-it-a-well-lit-path" class="hash-link" aria-label="Direct link to Try It: A Well-Lit Path" title="Direct link to Try It: A Well-Lit Path" translate="no">​</a></h2>
<p><strong>Prereqs</strong></p>
<ul>
<li class="">Install the Inference Gateway extension with the latency prediction sidecars:
<a href="https://gateway-api-inference-extension.sigs.k8s.io/guides/latency-based-predictor/" target="_blank" rel="noopener noreferrer" class="">https://gateway-api-inference-extension.sigs.k8s.io/guides/latency-based-predictor/</a></li>
</ul>
<p><strong>Smoke test</strong></p>
<ol>
<li class=""><strong>Health checks</strong></li>
</ol>
<div class="language-shell codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-shell codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">kubectl get pods</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">curl http://&lt;pod-ip&gt;:8000/readyz   # training</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">curl http://&lt;pod-ip&gt;:8001/readyz   # prediction (and 8002, 8003, ...)</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"># EPP health on 9003</span><br></span></code></pre></div></div>
<ol start="2">
<li class=""><strong>Send an SLO-aware request</strong></li>
</ol>
<div class="language-shell codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-shell codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">curl -v $GW_IP/v1/completions \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  -H 'Content-Type: application/json' \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  -H 'x-prediction-based-scheduling: true' \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  -H 'x-slo-ttft-ms: 200' \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  -H 'x-slo-tpot-ms: 50' \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  -d '{</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    "model": "Qwen/Qwen3-32B",</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    "prompt": "what is the difference between Franz and Apache Kafka?",</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    "max_tokens": 200,</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    "temperature": 0,</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    "stream_options": {"include_usage": true},</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    "stream": true</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  }'</span><br></span></code></pre></div></div>
<ol start="3">
<li class=""><strong>Watch the picker think (EPP logs, <code>-v=4</code>)</strong></li>
</ol>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">msg:"Running profile handler, Pick profiles" plugin:"slo-aware-profile-handler/slo-aware-profile-handler"</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">msg:"Before running scorer plugins" pods:[{... "pod_name":"...-5k7qr"}, {... "pod_name":"...-9lp5g"}]</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">msg:"Pod score" scorer_type:"slo-scorer" pod_name:"vllm-...-9b4wt" score:0.82</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">msg:"Picked endpoint" scorer_type:"slo-scorer" selected_pod:"vllm-...-9b4wt"</span><br></span></code></pre></div></div>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="tradeoffs--gaps">Tradeoffs &amp; Gaps<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#tradeoffs--gaps" class="hash-link" aria-label="Direct link to Tradeoffs &amp; Gaps" title="Direct link to Tradeoffs &amp; Gaps" translate="no">​</a></h2>
<p>The following areas highlight current limitations and ongoing work for the Predicted-Latency Aware Scheduling system:</p>
<ul>
<li class=""><strong>Addressing Homogeneous Pool Assumptions</strong>
Current models assume uniform serving pods regarding GPU types and runtimes. Future updates will incorporate richer features and per-variant training to better support heterogeneous pools.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="takeaway">Takeaway<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#takeaway" class="hash-link" aria-label="Direct link to Takeaway" title="Direct link to Takeaway" translate="no">​</a></h2>
<p>Accelerator performance is fairly predictable when we account for both the current model server GPU state and request characteristics. By applying online machine learning with a narrow horizon, we can train a model that avoids overfitting while staying accurate to changing workloads. With a good predictor in place, we can route requests based on expected latency, leading to smarter and more efficient load balancing.</p>
<hr>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="get-involved">Get Involved<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#get-involved" class="hash-link" aria-label="Direct link to Get Involved" title="Direct link to Get Involved" translate="no">​</a></h3>
<ul>
<li class=""><strong>Docs &amp; guides</strong>: <a href="https://gateway-api-inference-extension.sigs.k8s.io/guides/latency-based-predictor/" target="_blank" rel="noopener noreferrer" class="">https://gateway-api-inference-extension.sigs.k8s.io/guides/latency-based-predictor/</a></li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="appendix">Appendix<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#appendix" class="hash-link" aria-label="Direct link to Appendix" title="Direct link to Appendix" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="multi-turn-cache-capacity-analysis">Multi-Turn Cache Capacity Analysis<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#multi-turn-cache-capacity-analysis" class="hash-link" aria-label="Direct link to Multi-Turn Cache Capacity Analysis" title="Direct link to Multi-Turn Cache Capacity Analysis" translate="no">​</a></h3>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="theoretical-capacity-estimates">Theoretical Capacity Estimates<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#theoretical-capacity-estimates" class="hash-link" aria-label="Direct link to Theoretical Capacity Estimates" title="Direct link to Theoretical Capacity Estimates" translate="no">​</a></h4>
<p>This analysis evaluates what percentage of user and system prompts can be prefix cached assuming perfect load balancing. Assuming <strong>10 pods</strong> (H100 80 GB), the total cluster capacity is ~<strong>5,120,000 tokens</strong>.</p>
<p><strong>Multi-Round Context Assumptions:</strong></p>
<ul>
<li class="">We assume a <strong>multi-turn chat</strong> scenario where the conversation context grows cumulatively.</li>
<li class="">For every round, the previous question AND the previous model response are appended to the context of the next question.</li>
<li class="">To satisfy a 5-turn session, the cache must hold the "Working Set" for 4 rounds (the accumulated history of Turns 1-4 serves as the prefix for Turn 5).</li>
<li class="">The calculation assumes that System Prompts are prioritized and pinned in the cache. User history is only allocated space from the remaining capacity after all unique system prompts for the active groups are stored.</li>
</ul>
<p><strong>Key Variables:</strong></p>
<ul>
<li class=""><strong>Total Capacity:</strong> 32000 blocks * 16 tokens * 10 pods = 5,120,000 tokens</li>
<li class=""><strong>Unique Sys Tokens:</strong> #Groups * System Prompt Tokens</li>
<li class=""><strong>Unique User Tokens (4 Rounds):</strong> #Users * (User Prompt Tokens + Output Len) * 4</li>
</ul>
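<p>These formulas reproduce the capacity table below row by row; a quick sketch:</p>

```python
# Recompute the theoretical capacity table from the Key Variables above.
# Per workload: (groups, users_per_group, sys_prompt_len, user_prompt_len, output_len)
WORKLOADS = {
    "A": (6,   1000, 1000, 30,   1000),
    "B": (6,   1000, 1000, 3000, 1000),
    "C": (150, 5,    6000, 1200, 1000),
    "D": (150, 5,    1000, 6200, 1000),
}

CAPACITY = 32_000 * 16 * 10  # blocks * tokens/block * pods = 5,120,000 tokens
ROUNDS = 4                   # accumulated history of turns 1-4 is the prefix for turn 5

for name, (groups, users_per_group, sys_len, user_len, out_len) in WORKLOADS.items():
    users = groups * users_per_group
    sys_tokens = groups * sys_len                         # unique system-prompt tokens
    user_tokens = users * (user_len + out_len) * ROUNDS   # 4-round user working set
    # System prompts are assumed pinned first; user history gets the remainder.
    sys_fit = min(1.0, CAPACITY / sys_tokens)
    user_fit = max(0.0, CAPACITY - sys_tokens) / user_tokens
    print(f"{name}: sys={sys_tokens:>9,}  user={user_tokens:>11,}  "
          f"sys_fit={sys_fit:6.2%}  user_fit={user_fit:6.2%}")
# Prints, per row: A -> 20.69%, B -> 5.33%, C -> 63.94%, D -> 23.01% user fit.
```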
<table><thead><tr><th style="text-align:left">Workload</th><th style="text-align:center">Groups</th><th style="text-align:center">Users Per Group</th><th style="text-align:center">Sys Prompt Tokens</th><th style="text-align:center">User Prompt Tokens</th><th style="text-align:center">Output Len</th><th style="text-align:center">Total Unique Sys Tokens</th><th style="text-align:center">Total Rounds</th><th style="text-align:center">Total Unique User Tokens (4 rounds)</th><th style="text-align:center">Total Cache Capacity</th><th style="text-align:center">% sys prompts fit</th><th style="text-align:center">% user prompts fit</th></tr></thead><tbody><tr><td style="text-align:left">A</td><td style="text-align:center">6</td><td style="text-align:center">1,000</td><td style="text-align:center">1,000</td><td style="text-align:center">30</td><td style="text-align:center">1,000</td><td style="text-align:center">6,000</td><td style="text-align:center">4</td><td style="text-align:center">24,720,000</td><td style="text-align:center">5,120,000</td><td style="text-align:center">100.00%</td><td style="text-align:center">20.69%</td></tr><tr><td style="text-align:left">B</td><td style="text-align:center">6</td><td style="text-align:center">1,000</td><td style="text-align:center">1,000</td><td style="text-align:center">3,000</td><td style="text-align:center">1,000</td><td style="text-align:center">6,000</td><td style="text-align:center">4</td><td style="text-align:center">96,000,000</td><td style="text-align:center">5,120,000</td><td style="text-align:center">100.00%</td><td style="text-align:center">5.33%</td></tr><tr><td style="text-align:left">C</td><td style="text-align:center">150</td><td style="text-align:center">5</td><td style="text-align:center">6,000</td><td style="text-align:center">1,200</td><td style="text-align:center">1,000</td><td style="text-align:center">900,000</td><td style="text-align:center">4</td><td style="text-align:center">6,600,000</td><td style="text-align:center">5,120,000</td><td style="text-align:center">100.00%</td><td style="text-align:center">63.94%</td></tr><tr><td style="text-align:left">D</td><td style="text-align:center">150</td><td style="text-align:center">5</td><td style="text-align:center">1,000</td><td style="text-align:center">6,200</td><td style="text-align:center">1,000</td><td style="text-align:center">150,000</td><td style="text-align:center">4</td><td style="text-align:center">21,600,000</td><td style="text-align:center">5,120,000</td><td style="text-align:center">100.00%</td><td style="text-align:center">23.01%</td></tr></tbody></table>
<p>In theory, then, there is enough capacity to store all system prompts. However, as shown below, the ratio between user and system prompt tokens also matters, because it determines the cache-eviction dynamics.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="kv-cache-behavior-across-workloads-a-d">KV Cache Behavior Across Workloads A-D<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#kv-cache-behavior-across-workloads-a-d" class="hash-link" aria-label="Direct link to KV Cache Behavior Across Workloads A-D" title="Direct link to KV Cache Behavior Across Workloads A-D" translate="no">​</a></h3>
<p>This section analyzes how <strong>workload inputs shape cache behavior</strong>, as manifested in prefix reuse, cache pressure, and eviction dynamics, using a simple event-based <a href="https://colab.research.google.com/drive/1ZsSIDHROFklDPoX8RnqKpw4D3RQs34d8#scrollTo=sAjdTFITo6DV" target="_blank" rel="noopener noreferrer" class="">simulation</a>.</p>
<p>Each workload varies six core inputs:</p>
<ul>
<li class=""><strong>System prompt size</strong> (<code>system_prompt_len</code>)</li>
<li class=""><strong>Prefix sharing structure</strong> (<code>num_groups</code>, each group having a unique system prompt, and <code>num_prompts_per_group</code>, the number of users per group)</li>
<li class=""><strong>User context growth</strong> (<code>enable_multi_turn_chat</code>): when enabled, each user appends their new prompt to their previous prompts and responses.</li>
<li class=""><strong>Request shape</strong> (<code>question_len</code>, <code>output_len</code>)</li>
</ul>
<p>The simulation produces the figures below, which should be read as answering four questions:</p>
<ul>
<li class="">How much of the system prompt and user prompt is typically reused?</li>
<li class="">How quickly does cache pressure build under load?</li>
<li class="">When evictions occur, <em>what</em> is being evicted?</li>
<li class="">At what QPS do evictions occur?</li>
</ul>
<p>Here we assume a single model server running on a <strong>2 x H100 80 GB pod (TP=2)</strong>. The requests are sent at different QPS for a fixed duration (100 secs).</p>
<p><strong>Note:</strong> We assume cache creation and eviction is instantaneous; in reality, it depends on the prefill and decode times which we are not simulating. So effectively, in this simulation, QPS functions only as a proxy for the <strong>total number of prompts processed</strong>, ignoring the concurrency overhead that actual QPS imposes. Thus, in practice, the actual onset of cache eviction can be much earlier than what the simulation suggests. For example, in Scenario A, we see the onset of cache eviction happen around <strong>QPS = 14 with 10 pods (which roughly translates to QPS = 1.4 with 1 pod)</strong> experimentally, whereas the theoretical results below indicate a higher threshold. For other scenarios, the theoretical cache eviction QPS matches closely with what we observed experimentally.</p>
<hr>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="workload-a----high-system-prompt-overlap-no-system-cache-pressure">Workload A -- High System Prompt Overlap, No System Cache Pressure<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#workload-a----high-system-prompt-overlap-no-system-cache-pressure" class="hash-link" aria-label="Direct link to Workload A -- High System Prompt Overlap, No System Cache Pressure" title="Direct link to Workload A -- High System Prompt Overlap, No System Cache Pressure" translate="no">​</a></h4>
<p><strong>Inputs</strong></p>
<ul>
<li class=""><strong>Structure:</strong> Few groups, many users per group (6 groups, 1000 users per group).</li>
<li class=""><strong>Shape:</strong> Large system prompt relative to user prompt (1000:30 ratio).</li>
<li class=""><strong>Context:</strong> Multi-turn enabled.</li>
<li class=""><strong>Variance:</strong> 30% variance in user prompt and output length.</li>
</ul>
<p><strong>Interpretation</strong> Workload A represents a regime where shared system prefixes amortize extremely well, and user context grows slowly enough that cache pressure remains low.</p>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/predicted-latency/workload_a_cache_behavior.webp" alt="Workload A cache behavior" style="width:85%;height:auto"></div>
<hr>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="workload-b----moderate-system-prompt-overlap-no-system-cache-pressure">Workload B -- Moderate System Prompt Overlap, No System Cache Pressure<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#workload-b----moderate-system-prompt-overlap-no-system-cache-pressure" class="hash-link" aria-label="Direct link to Workload B -- Moderate System Prompt Overlap, No System Cache Pressure" title="Direct link to Workload B -- Moderate System Prompt Overlap, No System Cache Pressure" translate="no">​</a></h4>
<p><strong>Inputs</strong></p>
<ul>
<li class=""><strong>Structure:</strong> Few groups, many users per group (6 groups, 1000 users per group).</li>
<li class=""><strong>Shape:</strong> Smaller system prompt relative to user prompt (1000:3000 ratio).</li>
<li class=""><strong>Context:</strong> Multi-turn enabled.</li>
<li class=""><strong>Variance:</strong> 30% variance in user prompt and output length.</li>
</ul>
<p><strong>Interpretation</strong> Although this workload benefits from strong system reuse, longer user turns accelerate user context growth. Importantly, cache degradation is localized: user-block eviction preserves shared prefixes, preventing a global collapse in reuse.</p>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/predicted-latency/workload_b_cache_behavior.webp" alt="Workload B cache behavior" style="width:85%;height:auto"></div>
<hr>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="workload-c----low-system-prompt-overlap-high-system-cache-pressure">Workload C -- Low System Prompt Overlap, High System Cache Pressure<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#workload-c----low-system-prompt-overlap-high-system-cache-pressure" class="hash-link" aria-label="Direct link to Workload C -- Low System Prompt Overlap, High System Cache Pressure" title="Direct link to Workload C -- Low System Prompt Overlap, High System Cache Pressure" translate="no">​</a></h4>
<p><strong>Inputs</strong></p>
<ul>
<li class=""><strong>Structure:</strong> Many groups, few users per group (150 groups, 5 users per group).</li>
<li class=""><strong>Shape:</strong> Large system prompt relative to user prompt (6000:1200 ratio).</li>
<li class=""><strong>Context:</strong> Multi-turn enabled.</li>
<li class=""><strong>Variance:</strong> 30% variance in user prompt and output length.</li>
</ul>
<p><strong>Interpretation</strong> In Workload C, large system prompts fail to amortize due to low reuse, quickly consuming cache capacity. System-prefix eviction destroys shared reuse, leading to abrupt performance degradation across users.</p>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/predicted-latency/workload_c_cache_behavior.webp" alt="Workload C cache behavior" style="width:85%;height:auto"></div>
<hr>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="workload-d----low-system-prompt-overlap-low-system-cache-pressure">Workload D -- Low System Prompt Overlap, Low System Cache Pressure<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#workload-d----low-system-prompt-overlap-low-system-cache-pressure" class="hash-link" aria-label="Direct link to Workload D -- Low System Prompt Overlap, Low System Cache Pressure" title="Direct link to Workload D -- Low System Prompt Overlap, Low System Cache Pressure" translate="no">​</a></h4>
<p><strong>Inputs</strong></p>
<ul>
<li class=""><strong>Structure:</strong> Many groups, few users per group (150 groups, 5 users per group).</li>
<li class=""><strong>Shape:</strong> Moderate system prompt with very large user questions. (1000:6200 ratio)</li>
<li class=""><strong>Context:</strong> Multi-turn enabled.</li>
<li class=""><strong>Variance:</strong> 30% variance in user prompt and output length.</li>
</ul>
<p><strong>Interpretation</strong> In Workload D, cache overlap is low because of the large number of groups, but the small system prompts relative to user prompts mean that performance is dominated by per-request computation rather than cache churn.</p>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/predicted-latency/workload_d_cache_behavior.webp" alt="Workload D cache behavior" style="width:85%;height:auto"></div>
<hr>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="summary">Summary<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#summary" class="hash-link" aria-label="Direct link to Summary" title="Direct link to Summary" translate="no">​</a></h4>
<table><thead><tr><th style="text-align:left">Metric</th><th style="text-align:left">Workload A</th><th style="text-align:left">Workload B</th><th style="text-align:left">Workload C</th><th style="text-align:left">Workload D</th></tr></thead><tbody><tr><td style="text-align:left"><strong>Prompt Reuse</strong></td><td style="text-align:left">High (~80%)</td><td style="text-align:left">Moderate (~20%)</td><td style="text-align:left">Low (~10%)</td><td style="text-align:left">Very Low (5%)</td></tr><tr><td style="text-align:left"><strong>Eviction Onset QPS</strong></td><td style="text-align:left">7</td><td style="text-align:left">0.6</td><td style="text-align:left">0.4</td><td style="text-align:left">0.4</td></tr><tr><td style="text-align:left"><strong>Eviction Target</strong></td><td style="text-align:left">User History</td><td style="text-align:left">User History</td><td style="text-align:left">System Prompts</td><td style="text-align:left">Mix of user history and system prompts though largely dominated by user prompt</td></tr><tr><td style="text-align:left"><strong>Best Scheduling Strategy</strong></td><td style="text-align:left">Latency Aware Scheduling</td><td style="text-align:left">Latency Aware Scheduling</td><td style="text-align:left">Comparable</td><td style="text-align:left">Latency Aware</td></tr></tbody></table>]]></content>
        <author>
            <name>Kaushik Mitra</name>
            <uri>https://github.com/kaushikmitr</uri>
        </author>
        <author>
            <name>Benjamin Braun</name>
            <uri>https://github.com/BenjaminBraunDev</uri>
        </author>
        <author>
            <name>Abdullah Gharaibeh</name>
            <uri>https://github.com/ahg-g</uri>
        </author>
        <author>
            <name>Clayton Coleman</name>
            <email>claytoncoleman@google.com</email>
            <uri>https://github.com/smarterclayton</uri>
        </author>
        <category label="blog posts" term="blog posts"/>
        <category label="Scheduling" term="Scheduling"/>
        <category label="Inference" term="Inference"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Native KV Cache Offloading to Any Filesystem with llm-d]]></title>
        <id>https://llm-d.ai/blog/native-kv-cache-offloading-to-any-file-system-with-llm-d</id>
        <link href="https://llm-d.ai/blog/native-kv-cache-offloading-to-any-file-system-with-llm-d"/>
        <updated>2026-02-10T09:00:00.000Z</updated>
        <summary type="html"><![CDATA[llm-d's new filesystem backend offloads KV cache to shared storage, enabling cross-replica reuse and up to 16.8x faster TTFT — scaling inference throughput without GPU or CPU memory limits.]]></summary>
        <content type="html"><![CDATA[<p>llm-d is a distributed inference platform spanning multiple vLLM instances. KV cache hits are critical to achieving high inference throughput. Yet, in a distributed environment, cache hits do not occur across different nodes as the KV cache is local to each vLLM instance. In addition, this local cache is limited in size, further limiting KV data reuse. This blog presents a new way to offload KV cache to storage, tackling both aforementioned challenges – KV cache sharing and KV cache scale. llm-d's filesystem (FS) backend is a KV cache storage connector for vLLM that offloads KV blocks to shared storage based on vLLM's native Offloading Connector. While the llm-d FS backend can speed up serving of single requests (improve TTFT), its main goal is rather to preserve stable throughput and low latency at scale, as concurrency and context lengths grow. This is accomplished by significantly enlarging the cache space and enabling KV reuse across multiple replicas and nodes in llm-d.</p>
<p>While there are a number of existing solutions for KV cache offload to storage (e.g. LMCache or Dynamo KVBM), the new connector offers simplicity, can run with llm-d and vLLM as the only dependency, and exhibits improved performance over state-of-the-art shared storage connectors.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-importance-of-kv-cache-reuse">The Importance of KV-cache Reuse<a href="https://llm-d.ai/blog/native-kv-cache-offloading-to-any-file-system-with-llm-d#the-importance-of-kv-cache-reuse" class="hash-link" aria-label="Direct link to The Importance of KV-cache Reuse" title="Direct link to The Importance of KV-cache Reuse" translate="no">​</a></h2>
<p>In transformer-based inference, the prefill stage computes key and value (KV) tensors for the input tokens, which are then used when decoding output tokens. This stage is computationally intensive, especially for long input contexts. But once the KV tensors are available, they are kept in a KV cache and can be reused, avoiding the prefill computation entirely.</p>
<p>When the same prefix appears repeatedly - for example, shared system prompts, common documents, agentic loops, or multi-turn conversations - recomputing the KV tensors wastes significant compute. Reusing the KV cache allows the system to skip a large portion of the prefill work, reducing latency and improving overall throughput (a deeper dive on KV reuse use cases appears <a href="https://llm-d.ai/blog/kvcache-wins-you-can-see" target="_blank" rel="noopener noreferrer" class="">here</a>).</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-storage-offloading-is-needed">Why Storage Offloading is Needed<a href="https://llm-d.ai/blog/native-kv-cache-offloading-to-any-file-system-with-llm-d#why-storage-offloading-is-needed" class="hash-link" aria-label="Direct link to Why Storage Offloading is Needed" title="Direct link to Why Storage Offloading is Needed" translate="no">​</a></h2>
<p>vLLM already supports keeping KV-cache data in GPU High Bandwidth Memory (HBM) and, more recently, offloading KV to host memory. These approaches work well for a single server or small deployments, but they become limited at scale. GPU HBM is typically on the order of tens of gigabytes per GPU. CPU memory is usually larger but still on the same order of magnitude. For example, consider a high-end node hosting 8 GPUs with 2TB of DRAM. Divided by 8, the CPU DRAM per GPU is 250GB, while the HBM is around 80GB.</p>
<p>On the other hand, <strong>KV-cache takes up lots of space</strong>, especially with longer context lengths and higher concurrency from multiple users and requests. Even a medium-sized model, such as Llama-3.1-70B, requires 305 GB of KV-cache for one million tokens. Storage scales nearly infinitely compared to memory and offers a far lower cost per GB.</p>
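<p>The 305 GB figure follows from simple arithmetic on the model's public configuration, assuming a 16-bit KV cache. The sketch below (an illustration, not llm-d code) computes the per-token footprint as 2 tensors (K and V) × layers × KV heads × head dimension × bytes per element:</p>

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    # One K and one V tensor per layer; with grouped-query attention only
    # the kv_heads (not all attention heads) are stored in the cache.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Llama-3.1-70B public config: 80 layers, 8 KV heads, head dim 128, 16-bit KV cache
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
gib_per_million = per_token * 1_000_000 / 2**30
print(f"{per_token} bytes/token -> {gib_per_million:.0f} GiB per 1M tokens")
# 327680 bytes/token -> 305 GiB per 1M tokens
```

At roughly 320 KiB per token, even a modest number of long contexts quickly exceeds HBM and DRAM budgets, which is the core motivation for a storage tier.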
<p>In addition, shared storage is a simple way to share KV data across an entire cluster spanning multiple vLLM instances and physical nodes. New nodes added to a cluster can immediately benefit from existing KV-cache data without warming the cache from scratch. Shared persistent KV-cache also benefits post-peak scale-down, since localized KV-cache data is not lost. Finally, KV-cache persistence ensures cached data survives restarts and rescheduling events.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-we-built-llm-d-fs-backend">What We Built: llm-d FS Backend<a href="https://llm-d.ai/blog/native-kv-cache-offloading-to-any-file-system-with-llm-d#what-we-built-llm-d-fs-backend" class="hash-link" aria-label="Direct link to What We Built: llm-d FS Backend" title="Direct link to What We Built: llm-d FS Backend" translate="no">​</a></h2>
<p>The llm-d FS backend is a storage backend that plugs into vLLM's Offloading Connector. It stores KV blocks as files on a shared filesystem and loads them back on demand. It uses the filesystem directory as the index of what KV values are in the storage, and as such is persistent and sharable across all nodes connected to the filesystem.</p>
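<p>Using the directory as the index means a block's identity is derived from the token prefix it covers, so any node that mounts the share can locate a block by recomputing the same name. The sketch below illustrates the idea with a chained prefix hash; the naming scheme and function names are hypothetical, not the connector's actual on-disk layout:</p>

```python
import hashlib
from pathlib import Path

def block_paths(root: Path, token_ids: list[int], block_size: int = 256) -> list[Path]:
    """Map each complete block of `block_size` tokens to a content-addressed file.

    The hash is chained over all preceding tokens, so a block's name encodes its
    full prefix: two requests sharing a prefix resolve to the same files.
    """
    paths, h = [], hashlib.sha256()
    full_blocks = len(token_ids) // block_size
    for i in range(full_blocks):
        chunk = token_ids[i * block_size:(i + 1) * block_size]
        h.update(b"".join(t.to_bytes(4, "little") for t in chunk))
        digest = h.hexdigest()
        paths.append(root / digest[:2] / f"{digest}.kv")  # fan out by hash prefix
    return paths
```

Because the mapping is deterministic, a replica checking for a cached prefix only needs a filesystem lookup, with no separate index service; this is what makes the cache persistent and shareable across nodes.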
<p>The following are some key properties of our solution:</p>
<ul>
<li class=""><strong>Filesystem agnostic</strong>: Relies on standard POSIX file operations, so it works with any standard filesystem.</li>
<li class=""><strong>KV sharing across instances and nodes:</strong> Multiple vLLM servers can reuse cached prefixes by accessing the same shared path.</li>
<li class=""><strong>Persistence across restarts or failures:</strong> KV data can survive pod restarts, rescheduling, and node failures (depending on storage durability).</li>
<li class=""><strong>Enterprise storage integration:</strong> Can leverage mature storage systems with existing durability, monitoring, and access control.</li>
</ul>
<p>In addition, the following performance-related design choices were made:</p>
<ul>
<li class="">
<p><strong>Fully asynchronous I/O:</strong> By using vLLM's Offloading Connector, KV reads and writes run without blocking the main path. Details about the vLLM offloading connector can be found <a href="https://blog.vllm.ai/2026/01/08/kv-offloading-connector.html" target="_blank" rel="noopener noreferrer" class="">here</a>.</p>
</li>
<li class="">
<p><strong>High throughput via parallelism:</strong> I/O operations are parallelized across worker threads to increase bandwidth and reduce tail latency.</p>
</li>
<li class="">
<p><strong>Minimal interference with GPU computations</strong>: Default transfers use GPU DMA, reducing interference with compute kernels.</p>
</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="how-to-use-it">How to Use it<a href="https://llm-d.ai/blog/native-kv-cache-offloading-to-any-file-system-with-llm-d#how-to-use-it" class="hash-link" aria-label="Direct link to How to Use it" title="Direct link to How to Use it" translate="no">​</a></h3>
<p>Using the FS offloading connector is simple: it requires a <code>pip install</code> and a directory path to the storage being used. Other optional tunable parameters are the storage block size (in tokens) and the number of worker threads.</p>
<p>Detailed instructions can be found in the llm-d well-lit path <a href="https://github.com/llm-d/llm-d/tree/main/guides/tiered-prefix-cache/storage/README.md" target="_blank" rel="noopener noreferrer" class="">guide</a>.</p>
<p>Note that while the results presented in this blog were obtained with IBM Storage Scale, the connector was also tested with other storage options, including local storage (an NVMe drive with a filesystem mounted on it) and CephFS. In general, it works with any storage that supports a filesystem API or has a filesystem mounted on it.</p>
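<p>For orientation, a deployment might look like the sketch below. <code>--kv-transfer-config</code>, <code>OffloadingConnector</code>, and <code>kv_connector_extra_config</code> are real vLLM concepts, but the package name and the extra-config keys shown here are illustrative assumptions only; the exact values are in the guide linked above.</p>

```shell
# Illustrative sketch -- package name and extra-config keys are assumptions;
# see the llm-d tiered-prefix-cache guide for the exact values.
pip install llm-d-kv-offload-fs  # hypothetical package name

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --kv-transfer-config '{
      "kv_connector": "OffloadingConnector",
      "kv_role": "kv_both",
      "kv_connector_extra_config": {
          "path": "/mnt/shared-kv",
          "block_size": 256,
          "num_workers": 8
      }
  }'
```

The only hard requirement is that <code>/mnt/shared-kv</code> (or whatever path is configured) is the same mounted filesystem on every node that should share the cache.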
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="results-and-benchmarks">Results and Benchmarks<a href="https://llm-d.ai/blog/native-kv-cache-offloading-to-any-file-system-with-llm-d#results-and-benchmarks" class="hash-link" aria-label="Direct link to Results and Benchmarks" title="Direct link to Results and Benchmarks" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="single-request-speed-up">Single request speed-up<a href="https://llm-d.ai/blog/native-kv-cache-offloading-to-any-file-system-with-llm-d#single-request-speed-up" class="hash-link" aria-label="Direct link to Single request speed-up" title="Direct link to Single request speed-up" translate="no">​</a></h3>
<p>Depending on storage speed, loading KV data from storage can dramatically reduce TTFT compared to prefilling. We start by examining the benefit that offloading of KV data to storage can have on the speed of a single request, while emphasizing that the main benefit of offloading KV to storage (supporting high throughput at scale) will be shown later on.</p>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/fs-connector/StorageBlogimage1.webp" alt="Single Request Latency" style="width:75%;height:auto"><p style="font-size:0.9em;margin-top:8px"><em>Figure 1: Single-request KV-cache load across tiers</em></p></div>
<p>In Figure 1, we measure the TTFT for a single request as the number of tokens increases, using various offloading mediums. This shows the evaluation for a Llama-3.1-70B model on a system with 4x NVIDIA H100 GPUs and an IBM Storage Scale filesystem. We compare KV loading from GPU memory, CPU memory, and shared storage and measure the speedup achieved vs. prefill (a KV cache miss).<br>
<!-- -->As the number of tokens increases, KV loading becomes increasingly efficient compared to recomputing the same tokens during prefill, achieving a speedup of up to 16.8X on long prompts. However, for single-request workloads, GPU and CPU caching remain the faster options. This is because storage, for the most part, is slower than DRAM (with some exceptions).  Moreover, transferring data between storage and GPU HBM usually involves an additional hop via CPU DRAM, which adds to its overhead.  This is a fundamental reality that reinforces why <strong>CPU offloading and smart routing</strong> based on prefix cache hits remain valuable optimizations in the llm-d system, even with shared storage available. Storage is not intended to replace these mechanisms, but to complement them. Its main value lies in providing far greater scalability, significantly lower cost per gigabyte, and persistent KV storage.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="scalability-test">Scalability test<a href="https://llm-d.ai/blog/native-kv-cache-offloading-to-any-file-system-with-llm-d#scalability-test" class="hash-link" aria-label="Direct link to Scalability test" title="Direct link to Scalability test" translate="no">​</a></h3>
<p>In order to exemplify the benefit of storage for scalability, we start by examining a somewhat artificial workload that consists of multiple users, where each user has their own distinct system prompt (we will consider a more realistic workload in the following section). We ask how many concurrent users a single vLLM node can support without a significant drop in throughput due to cache misses.</p>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/fs-connector/StorageBlogimage2.webp" alt="Full Hit Workload" style="width:75%;height:auto"><p style="font-size:0.9em;margin-top:8px"><em>Figure 2: Multiple-request KV-cache load across tiers</em></p></div>
<p>In Figure 2, we evaluate KV-cache loading under varying concurrency levels by issuing 16K token requests from a growing number of users. In this test, all prompts have previously appeared, and the decode is of a single token. We chose this extreme workload just to emphasize the point, and we will show a more realistic workload next.  Again, we used a single node running Llama-3.1-70B on a system with 4x NVIDIA H100 GPUs and an IBM Storage Scale filesystem.</p>
<p>We see that only a small number of user prompts can fit in GPU memory. With such a small number of users, performance is extremely fast, but once we grow beyond it, performance drops significantly as essentially all requests undergo prefill. With CPU offloading, this drop-off is postponed: the system can handle a higher (yet still limited) number of users with only a small drop-off. Storage-backed KV caching, on the other hand, shines once we scale up. While the speed offered by storage does not match that of the GPU or CPU, it allows the system to sustain throughput as the working set grows nearly without bound.</p>
<p>This experiment highlights the key benefit of storage offloading: <strong>it prevents performance collapse when workloads outgrow GPU or CPU cache capacity</strong>. The gains come from higher cache hit ratios at scale and from shared access across replicas. Its value lies in maintaining throughput and latency stability as workloads exceed the capacity of GPU or CPU memory, <strong>supporting efficient scaling of the system.</strong> This also highlights that storage performance is key to achieving high throughput on KV cache hits.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="scalability-in-realistic-workloads">Scalability in realistic workloads<a href="https://llm-d.ai/blog/native-kv-cache-offloading-to-any-file-system-with-llm-d#scalability-in-realistic-workloads" class="hash-link" aria-label="Direct link to Scalability in realistic workloads" title="Direct link to Scalability in realistic workloads" translate="no">​</a></h3>
<p>Finally, we evaluate a more realistic workload that mixes KV loading, prefill, and decode operations. We use the llm-d benchmarking framework to run inference-perf with a shared-prefix synthetic workload. Each query consists of a previously seen user-specific system prompt of 2000 tokens and a 256-token question, with 256 tokens decoded in response. The queries are issued at a rate of 40 QPS from a pool of users of variable size. The more users in the system, the larger the working set, and hence the more we expect storage to shine. This setup helps us study how the different caching options behave with a growing number of users. It runs llm-d with two decode nodes executing a Llama-3.1-8B model on a system with 2x NVIDIA H100 GPUs and a cloud-based storage offering approximately 10GB/s of I/O throughput.</p>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/fs-connector/StorageBlogimage3.webp" alt="Real Life Workload" style="width:75%;height:auto"><p style="font-size:0.9em;margin-top:8px"><em>Figure 3: Multiple requests with mixed workload (load + prefill + decode)</em></p></div>
<p>The results, in Figure 3, show similar behavior to the previous test. By extending KV-cache capacity using shared storage, llm-d can reuse KV data more effectively across requests and replicas, maintaining an improved overall throughput and TTFT as the system scales. An additional important observation is that even though the storage in this test was not top-end, the asynchronous use of storage frees up precious GPU cycles for prefill and decode operations and hence achieves higher throughput.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="summary-and-next-steps">Summary and Next Steps<a href="https://llm-d.ai/blog/native-kv-cache-offloading-to-any-file-system-with-llm-d#summary-and-next-steps" class="hash-link" aria-label="Direct link to Summary and Next Steps" title="Direct link to Summary and Next Steps" translate="no">​</a></h2>
<p>Storage offloading is an important and essential capability for scalable AI inference platforms such as llm-d. It increases effective KV-cache capacity, enables cross-replica reuse, and makes llm-d clusters more elastic by allowing them to scale efficiently with growing request volume and user concurrency. The FS backend keeps the integration native to vLLM and llm-d, using an asynchronous design and high-throughput transfers built around parallelism.</p>
<p>The initial FS backend is the first llm-d native solution for storage KV offloading. As next steps, we are working on new features, including tiered storage offload (as a second tier to CPU DRAM), integration with NIXL backends, offloading to object storage, support for GPU Direct Storage (GDS), and more.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="acknowledgement"><em>Acknowledgement</em><a href="https://llm-d.ai/blog/native-kv-cache-offloading-to-any-file-system-with-llm-d#acknowledgement" class="hash-link" aria-label="Direct link to acknowledgement" title="Direct link to acknowledgement" translate="no">​</a></h4>
<p>For performance testing, we used IBM Storage Scale, an enterprise-level storage system offering high performance, scalability, and reliability.</p>]]></content>
        <author>
            <name>Kfir Toledo</name>
            <uri>https://www.linkedin.com/in/kfir-toledo-394a8811a/</uri>
        </author>
        <author>
            <name>Danny Harnik</name>
            <uri>https://www.linkedin.com/in/danny-harnik-19a95436/</uri>
        </author>
        <author>
            <name>Effi Ofer</name>
            <uri>https://www.linkedin.com/in/effi-ofer-91a261b0/</uri>
        </author>
        <author>
            <name>Or Ozeri</name>
            <uri>https://www.linkedin.com/in/or-ozeri-a942859a/</uri>
        </author>
        <author>
            <name>Guy Margalit</name>
            <uri>https://www.linkedin.com/in/guymargalit/</uri>
        </author>
        <category label="blog posts" term="blog posts"/>
        <category label="KV Cache" term="KV Cache"/>
        <category label="Storage" term="Storage"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[llm-d 0.5: Sustaining Performance at Scale]]></title>
        <id>https://llm-d.ai/blog/llm-d-v0.5-sustaining-performance-at-scale</id>
        <link href="https://llm-d.ai/blog/llm-d-v0.5-sustaining-performance-at-scale"/>
        <updated>2026-02-04T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[llm-d v0.5 introduces hierarchical KV-cache offloading, LoRA-aware scheduling, UCCL networking, and scale-to-zero autoscaling for sustained inference performance at scale.]]></summary>
        <content type="html"><![CDATA[<p>In our previous release (<a class="" href="https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators">v0.4</a>), we focused on improving the end-to-end latency of production inference, introducing speculative decoding and extending prefill/decode disaggregation across a broader set of accelerator architectures. That work established llm-d’s ability to deliver state-of-the-art latency along the critical serving path. Sustaining low latency increasingly depended on how KV-cache pressure is handled once GPU memory is saturated, whether cached state can be reused across replicas instead of being repeatedly rebuilt, and how requests are routed when workloads mix adapters, models, and availability requirements.</p>
<p>With v0.5, llm-d expands its focus from peak performance to the operational rigor required to sustain performance at scale. This release prioritizes reproducibility, resilience, and cost efficiency, with concrete improvements across the following areas:</p>
<ol>
<li class=""><strong>Developer Experience and reproducibility:</strong> We have simplified the benchmarking workflow with dedicated, in-guide benchmark support, allowing users to validate each “well-lit path” with a single command.</li>
<li class=""><strong>Hierarchical KV Offloading:</strong> A new storage architecture decouples cache capacity from GPU memory through native CPU and filesystem tiers.</li>
<li class=""><strong>Advanced Scheduling:</strong> Cache-aware routing now supports LoRA adapters and active-active high availability.</li>
<li class=""><strong>Resilient Networking:</strong> A new transport backend (UCCL) improves stability in congested networks.</li>
<li class=""><strong>Autoscaling Updates:</strong> We have introduced scale-to-zero capabilities for cost-efficient intermittent workloads.</li>
</ol>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="updated-developer-experience"><strong>Updated developer experience</strong><a href="https://llm-d.ai/blog/llm-d-v0.5-sustaining-performance-at-scale#updated-developer-experience" class="hash-link" aria-label="Direct link to updated-developer-experience" title="Direct link to updated-developer-experience" translate="no">​</a></h2>
<p>A core challenge in distributed inference is the variance between reported benchmark numbers and realized production performance. In v0.5, we address this by adopting the "Research Paper Principle": the requirement that every performance claim be accompanied by a reproducible, version-controlled configuration.</p>
<ul>
<li class=""><strong>Simplified benchmarking (one guide, one benchmark):</strong> We have moved away from generic "one-size-fits-all" scripts. v0.5 introduces in-guide <a href="https://github.com/llm-d/llm-d/blob/main/guides/benchmark/README.md" target="_blank" rel="noopener noreferrer" class="">benchmark support</a> and a standalone benchmark script. Each "well-lit path" (e.g., <a href="https://github.com/llm-d/llm-d/blob/main/guides/inference-scheduling/README.md#benchmarking" target="_blank" rel="noopener noreferrer" class="">Inference-Scheduling</a>, <a href="https://github.com/llm-d/llm-d/blob/main/guides/wide-ep-lws/README.md#benchmarking" target="_blank" rel="noopener noreferrer" class="">Wide EP</a>, <a href="https://github.com/llm-d/llm-d/blob/main/guides/pd-disaggregation/README.md#benchmarking" target="_blank" rel="noopener noreferrer" class="">Disaggregated</a>) now has a defined benchmark configuration that reproduces the benchmark results shown in the guides.</li>
<li class=""><strong>Use cases and personalities:</strong> The new harness supports distinct "personalities" or use cases. While the “in-guide” benchmarking is targeted at “feature developers” who quickly and efficiently wanted to check if their code changes had a positive effect, taking into account a baseline, the llm-d-benchmark has a toolset for a variety of other users. For “config tuners”, aiming to carry out parameter exploration sweeps, there is support for <a href="https://github.com/llm-d/llm-d-benchmark/blob/main/docs/doe.md" target="_blank" rel="noopener noreferrer" class="">experiments</a> crafted with the purpose of finding optimal configurations and reference architecture. An llm-d service owner can track performance regressions on his production stack by selecting a mix of <a href="https://github.com/llm-d/llm-d-benchmark/blob/main/docs/run.md#workloads" target="_blank" rel="noopener noreferrer" class="">workloads</a> to be periodically run.</li>
<li class=""><strong>Parameter sweep exploration:</strong> Capacity planning often relies on trial-and-error. The new <a href="https://github.com/llm-d/llm-d-benchmark/tree/main/config_explorer" target="_blank" rel="noopener noreferrer" class="">Configuration Explorer</a> allows analysts, researchers, or customers to perform their own parameter sweep exploration, simulating activation memory usage against specific hardware constraints. This allows this category of users to validate hardware feasibility and generate memory-safe configurations prior to provisioning physical resources.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="performance-update"><strong>Performance update</strong><a href="https://llm-d.ai/blog/llm-d-v0.5-sustaining-performance-at-scale#performance-update" class="hash-link" aria-label="Direct link to performance-update" title="Direct link to performance-update" translate="no">​</a></h2>
<p>With the overhauled benchmarking harness, we have re-validated our core architectural patterns. Below are the updated performance numbers for v0.5, representing the "well-lit paths" enabled by this release. These results are fully reproducible using the configurations provided in the corresponding guides.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="throughput-oriented-wide-ep-on-nvidia-b200"><strong>Throughput-oriented: Wide-EP on Nvidia B200</strong><a href="https://llm-d.ai/blog/llm-d-v0.5-sustaining-performance-at-scale#throughput-oriented-wide-ep-on-nvidia-b200" class="hash-link" aria-label="Direct link to throughput-oriented-wide-ep-on-nvidia-b200" title="Direct link to throughput-oriented-wide-ep-on-nvidia-b200" translate="no">​</a></h3>
<p>For batch-intensive workloads, throughput is the primary metric. We validated a Wide-EP topology on a cluster of Nvidia B200 GPUs, specifically targeting the "Batch Processing" use case where latency constraints are relaxed in favor of massive token generation.</p>
<ul>
<li class=""><strong>Topology:</strong> 16x Prefill GPUs / 16x Decode GPUs (EP=16, DP=16, TP=1)</li>
<li class=""><strong>Workload:</strong> Random dataset, 1k/1k input/output length</li>
<li class=""><strong>Result:</strong> ~50k output tokens/sec total throughput (~3.1k output tokens/sec per decode GPU).</li>
</ul>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/assets/images/wideep-on-b200-cf15bcd5de616e8f9638c4be6d794f32.webp" alt="Wide-EP on Nvidia B200" style="width:90%;height:auto"></div>
<p><em>Figure 1: Wide-EP throughput on NVIDIA B200 for batch-oriented inference. Total output throughput scales with decode parallelism in a disaggregated 16×16 prefill/decode topology, demonstrating efficient utilization of B200 GPUs for batch-intensive workloads where latency constraints are relaxed.</em></p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="latency-oriented-inference-scheduling"><strong>Latency-oriented: Inference Scheduling</strong><a href="https://llm-d.ai/blog/llm-d-v0.5-sustaining-performance-at-scale#latency-oriented-inference-scheduling" class="hash-link" aria-label="Direct link to latency-oriented-inference-scheduling" title="Direct link to latency-oriented-inference-scheduling" translate="no">​</a></h3>
<p>For workloads with shared context and compute reuse opportunities, intelligent inference scheduling maximizes cache reuse and increases throughput. We validated our inference scheduling guides on a cluster deploying Qwen/Qwen3-32B, specifically targeting the "Multi-tenant SaaS" use case where shared customer contexts enable significant computational savings through prefix caching.</p>
<ul>
<li class=""><strong>Topology:</strong> 8x vLLM pods / 16x NVIDIA H100 GPUs (TP=2)</li>
<li class=""><strong>Workload:</strong> Shared prefix synthetic, 150 groups × 5 prompts, 6k/1.2k/1k system/question/output length</li>
<li class=""><strong>Result:</strong> 4.5-11k output tokens/sec throughput, P50 TTFT 136–157ms, up to 109% higher throughput and 99% lower TTFT vs baseline Kubernetes service</li>
</ul>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/assets/images/tftp-vs-qps-cec4771e4366444802380277adc7bb09.webp" alt="TFTP vs QPS" style="width:90%;height:auto"></div>
<p><em>Figure 2: TTFT and throughput vs QPS on Qwen3-32B (8×vLLM pods, 16×NVIDIA H100). llm-d inference scheduling maintains near-zero TTFT and scales to ~120k tok/s, while baseline Kubernetes service degrades rapidly under load.</em></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="optimized-offloading-hierarchical-kv-caching"><strong>Optimized offloading: Hierarchical KV-caching</strong><a href="https://llm-d.ai/blog/llm-d-v0.5-sustaining-performance-at-scale#optimized-offloading-hierarchical-kv-caching" class="hash-link" aria-label="Direct link to optimized-offloading-hierarchical-kv-caching" title="Direct link to optimized-offloading-hierarchical-kv-caching" translate="no">​</a></h2>
<p>In transformer-based inference, the KV-cache is a critical resource for reducing latency, yet it is strictly bounded by the finite capacity of GPU HBM and local CPU DRAM. Even in high-end nodes, local memory creates a hard ceiling on the number of concurrent contexts a system can serve before reverting to expensive prefill computations.</p>
<p>In v0.5, we introduce the <code>llm-d FS backend</code>, a storage connector that plugs into vLLM’s native offloading interface. This architecture effectively establishes a three-tier memory hierarchy (GPU, CPU, and disk) to address the twin challenges of scale and sharing.</p>
<ul>
<li class=""><strong>Decoupling capacity from compute:</strong> By offloading KV blocks to a shared file system, the system decouples cache capacity from local node memory. This allows the inference engine to sustain high throughput even as the active working set grows significantly beyond the aggregate RAM of the cluster.</li>
<li class=""><strong>Cross-replica reuse:</strong> Unlike local CPU caches, which are isolated to a single instance, a shared file system acts as a persistent, global store of KV states. New nodes added to the cluster can "hydrate" their cache immediately from the shared tier, bypassing the warm-up phase typically required for new replicas.</li>
<li class=""><strong>Asynchronous design:</strong> To minimize interference with the decoding loop, the backend utilizes fully asynchronous I/O and parallel worker threads. This ensures that the latency cost of fetching blocks from disk does not block the main compute path.</li>
</ul>
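<p>The resulting behavior is that of a classic multi-level cache. The sketch below (hypothetical names, not llm-d's actual classes) shows the lookup cascade: check GPU HBM first, then CPU DRAM, then the shared filesystem, promoting hits upward so hot blocks migrate to fast memory:</p>

```python
class TieredKVCache:
    """Illustrative three-tier KV-block cache: GPU HBM -> CPU DRAM -> shared FS."""

    def __init__(self):
        # Dicts stand in for the real tiers (HBM pages, pinned host memory, files).
        self.gpu, self.cpu, self.fs = {}, {}, {}

    def get(self, block_hash):
        for tier in (self.gpu, self.cpu, self.fs):
            if block_hash in tier:
                block = tier[block_hash]
                self.gpu[block_hash] = block  # promote on hit (eviction logic elided)
                return block
        return None  # full miss: the block must be recomputed by prefill

    def put(self, block_hash, block):
        self.gpu[block_hash] = block  # serve future hits from fast memory
        self.fs[block_hash] = block   # write through so other replicas can reuse it
```

In the real system the writes are asynchronous and each tier has its own capacity and eviction policy; the cascade order is the essential idea.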
<p>Our internal benchmarks illustrate that the primary value of this architecture is operational stability under load. As shown in Figure 3, standard GPU-only deployments experience a sharp performance collapse once HBM is saturated. In contrast, the storage-backed configuration creates a "performance floor," sustaining throughput as user concurrency increases well beyond local memory limits.</p>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/assets/images/kvcache-throughput-8aba710cd220e3c399c572e1da3a833a.webp" alt="KV-cache throughput" style="width:75%;height:auto"></div>
<p><em>Figure 3: KV-cache throughput under growing user concurrency on Llama-3.1-70B (4×NVIDIA H100 with IBM Storage Scale). 16K token requests with previously-seen prompts. GPU-only collapses once HBM is saturated; storage offload sustains ~185k tok/s as concurrency scales (13.9x improvement at 250 users).</em></p>
<p>For high-performance computing environments, we have also validated Tiered Prefix Caching on Lustre, demonstrating how parallel file systems can provide persistence and high throughput while minimizing the latency overhead of this tertiary storage tier.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="advancements-in-scheduling"><strong>Advancements in scheduling</strong><a href="https://llm-d.ai/blog/llm-d-v0.5-sustaining-performance-at-scale#advancements-in-scheduling" class="hash-link" aria-label="Direct link to advancements-in-scheduling" title="Direct link to advancements-in-scheduling" translate="no">​</a></h2>
<p>In v0.5, the scheduler has evolved to handle more complex routing scenarios, specifically targeting multi-adapter workloads and high-availability topologies.</p>
<ul>
<li class=""><strong>Unified tokenization pipeline:</strong> We have standardized the preprocessing stack by reusing the native vLLM Python module for all tokenization and rendering paths. This lays the groundwork for disaggregated tokenization by converging on a single, comprehensive rendering pipeline built on vLLM’s evolving Renderer API.</li>
<li class=""><strong>Active-active HA with dynamic discovery:</strong> We have introduced a new subscription manager for KV-Events that moves away from static connections. Instead of relying on deployment-time configuration, the scheduler now dynamically discovers vLLM pods and manages ZeroMQ (ZMQ) subscriptions based on the pod lifecycle. This enables robust active-active high availability across multiple scheduler replicas, ensuring that cache state is tracked accurately even as replicas scale up or down.</li>
<li class=""><strong>LoRA-precise prefix caching:</strong> The precise scheduling path now fully supports LoRA adapters. This allows the scheduler to route requests based on the specific cache locality of LoRA adapters, preventing the "thundering herd" problem where every replica attempts to load every adapter and maximizing cache efficiency for multi-tenant workloads.</li>
</ul>
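<p>The idea behind LoRA-precise prefix caching can be illustrated with a toy model of the scoring path. This sketch is an assumption about the mechanism, not llm-d's implementation: seeding the block-hash chain with the adapter ID keeps each adapter's cache entries distinct, so the scorer only credits a replica for blocks it holds for that specific adapter.</p>

```python
import hashlib

# Illustrative only: chained block hashes seeded with the LoRA adapter ID,
# so identical prompts under different adapters produce disjoint cache keys.
def prefix_block_hashes(adapter: str, tokens: list[int], block_size: int = 16) -> list[str]:
    hashes, parent = [], adapter  # the adapter ID seeds the hash chain
    for i in range(0, len(tokens) - len(tokens) % block_size, block_size):
        chunk = ",".join(map(str, tokens[i:i + block_size]))
        parent = hashlib.sha256(f"{parent}|{chunk}".encode()).hexdigest()
        hashes.append(parent)
    return hashes

def score_replica(replica_blocks: set[str], adapter: str, tokens: list[int]) -> int:
    # Count the longest run of leading blocks already resident on the replica.
    score = 0
    for h in prefix_block_hashes(adapter, tokens):
        if h not in replica_blocks:
            break
        score += 1
    return score
```

<p>A replica that is warm for <code>adapter-a</code> scores zero for the same prompt under <code>adapter-b</code>, which is what steers each adapter's traffic to the replicas that already hold its cache.</p>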
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/assets/images/lora-aware-scheduling-516752378a62e673105e4a7ae15e5520.webp" alt="LoRA-precise prefix caching" style="width:75%;height:auto"></div>
<p><em>Figure 4: LoRA-aware prefix-cache scheduling improves throughput. Prefix-cache aware scheduling mitigates LoRA overhead by minimizing effective compute and avoiding redundant LoRA kernel execution, delivering higher throughput than standard K8s round-robin load-balancing.</em></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="resilient-networking-nixl---uccl-backend"><strong>Resilient networking: NIXL - UCCL backend</strong><a href="https://llm-d.ai/blog/llm-d-v0.5-sustaining-performance-at-scale#resilient-networking-nixl---uccl-backend" class="hash-link" aria-label="Direct link to resilient-networking-nixl---uccl-backend" title="Direct link to resilient-networking-nixl---uccl-backend" translate="no">​</a></h2>
<p>In disaggregated architectures, tail latency is governed by the efficiency of KV-cache transport between prefill and decode nodes. In v0.5, we have contributed a UCCL (Unified Collective Communication Library) backend and integrated it into the NIXL networking layer.</p>
<p>UCCL provides a unified abstraction over vendor-specific collective primitives (NCCL, RCCL, MCCL). Crucially, for point-to-point (P2P) operations (the dominant traffic pattern in context migration), UCCL implements a host-resident software transport stack. By managing transport logic on the CPU rather than relying solely on hardware offload, the backend enables fine-grained flow splitting and adaptive congestion control strategies. This architecture currently supports both native RDMA and GPUDirect TCP-X transport.</p>
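<p>To make the host-driven approach concrete, here is an illustrative sketch of proportional flow splitting (not UCCL's actual API): a chunked payload is allocated across paths according to each path's observed throughput, so a congested path naturally receives fewer chunks instead of stalling the whole transfer.</p>

```python
# Illustrative only: assign transfer chunks across network paths in
# proportion to each path's currently observed throughput.
def split_chunks(num_chunks: int, path_throughput: dict[str, float]) -> dict[str, int]:
    total = sum(path_throughput.values())
    alloc = {p: int(num_chunks * bw / total) for p, bw in path_throughput.items()}
    # Hand leftover chunks (from integer rounding) to the fastest path.
    leftover = num_chunks - sum(alloc.values())
    fastest = max(path_throughput, key=path_throughput.get)
    alloc[fastest] += leftover
    return alloc
```

<p>An adaptive transport would re-measure throughput between rounds, shifting load away from paths under cross-traffic congestion.</p>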
<p>To quantify the impact of this architecture, we evaluated a Llama-3.1-8B deployment on an OpenShift cluster with 200 Gb/s interconnects, measuring the transfer latency of a 4 GB KV-cache payload.</p>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/assets/images/pdtransfer-under-congestion-497e61f8af883b67a1badb3562a029a0.webp" alt="PD transfer under congestion" style="width:75%;height:auto"></div>
<p><em>Figure 5: PD transfer latency growth under congestion. Comparing baseline vs. congested states for UCX and UCCL transport.</em></p>
<p>While baseline latency remained comparable between transports (~360ms), the architectures diverged significantly under network stress. When subjecting the cluster to heavy cross-traffic congestion:</p>
<ul>
<li class=""><strong>UCX backend:</strong> Latency degraded from 362ms to 424ms (+17.1%).</li>
<li class=""><strong>UCCL backend:</strong> Latency degraded from 359ms to 384ms (+7.1%).</li>
</ul>
<p>The UCCL backend demonstrated <strong>2.4x greater resilience</strong> to network contention (7.1% degradation vs. 17.1%). These results validate the use of host-driven congestion control for ensuring consistent tail latency in shared, multi-tenant production environments.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="updates-to-autoscaling"><strong>Updates to autoscaling</strong><a href="https://llm-d.ai/blog/llm-d-v0.5-sustaining-performance-at-scale#updates-to-autoscaling" class="hash-link" aria-label="Direct link to updates-to-autoscaling" title="Direct link to updates-to-autoscaling" translate="no">​</a></h2>
<p>In release v0.4, we introduced the Workload Variant Autoscaler (WVA) as an experimental feature. In v0.5, we have iterated on this foundation to support more aggressive cost-saving strategies for intermittent workloads.</p>
<ul>
<li class=""><strong>Scale-to-zero and from-zero:</strong> We have implemented support for scaling inference pools down to zero replicas during idle periods. Unlike simple timeout-based shutdowns, this implementation uses a specialized activator component to handle the "cold start" sequence, ensuring that incoming requests trigger provisioning without being dropped. This capability is critical for cost-sensitive environments, such as development clusters or internal RAG applications that do not require 24/7 GPU allocation.</li>
<li class=""><strong>Saturation-based scaling:</strong> The control loop has been refined to better detect saturation points based on queue depth and KV-cache pressure, allowing for more responsive scaling decisions before latency SLOs are violated.</li>
</ul>
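<p>The cold-start handling described above can be sketched as a small buffering component. All names here are assumptions for illustration, not the actual activator: the first request against a zero-replica pool triggers provisioning, subsequent requests are buffered rather than dropped, and the buffer is drained once a replica reports ready.</p>

```python
import queue
import threading

# Speculative sketch of a scale-from-zero activator (names are assumptions).
class Activator:
    def __init__(self, provision):
        self.provision = provision      # callback that starts a replica
        self.buffer = queue.Queue()
        self.ready = False
        self.lock = threading.Lock()

    def handle(self, request, forward):
        with self.lock:
            if not self.ready:
                if self.buffer.empty():
                    self.provision()    # first request triggers the cold start
                self.buffer.put(request)
                return "buffered"
        forward(request)
        return "forwarded"

    def on_replica_ready(self, forward):
        with self.lock:
            self.ready = True
            while not self.buffer.empty():
                forward(self.buffer.get())  # drain without dropping anything
```
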
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="community-and-ecosystem"><strong>Community and ecosystem</strong><a href="https://llm-d.ai/blog/llm-d-v0.5-sustaining-performance-at-scale#community-and-ecosystem" class="hash-link" aria-label="Direct link to community-and-ecosystem" title="Direct link to community-and-ecosystem" translate="no">​</a></h2>
<p>This progress wouldn't be possible without close collaboration with our broader community. llm-d exists within a rich ecosystem of open-source projects, and we are proud to contribute our findings and code back upstream. A few contribution highlights from this release:</p>
<ul>
<li class=""><strong>vLLM KV-Connector:</strong> We worked closely with the vLLM maintainers to define and implement the <a href="https://blog.vllm.ai/2026/01/08/kv-offloading-connector.html" target="_blank" rel="noopener noreferrer" class="">KV Offloading Connector</a>, ensuring that the storage hierarchy we built for llm-d rests on standard, upstream interfaces.</li>
<li class=""><strong>NIXL Integration:</strong> Our UCCL backend has been <a href="https://github.com/ai-dynamo/nixl/pull/895" target="_blank" rel="noopener noreferrer" class="">merged into the NIXL 0.9 release</a>, enabling the wider community to benefit from the host-driven congestion control strategies we developed for disaggregated serving.</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="broader-hardware-ecosystem-integration">Broader hardware ecosystem integration<a href="https://llm-d.ai/blog/llm-d-v0.5-sustaining-performance-at-scale#broader-hardware-ecosystem-integration" class="hash-link" aria-label="Direct link to Broader hardware ecosystem integration" title="Direct link to Broader hardware ecosystem integration" translate="no">​</a></h3>
<p>We continue to validate llm-d across a broad and growing hardware ecosystem, reinforcing its role as a hardware-agnostic inference control plane. Additional results will be shared as validations complete.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-is-next"><strong>What Is Next?</strong><a href="https://llm-d.ai/blog/llm-d-v0.5-sustaining-performance-at-scale#what-is-next" class="hash-link" aria-label="Direct link to what-is-next" title="Direct link to what-is-next" translate="no">​</a></h2>
<p>The v0.5 release establishes a foundation for reproducible research and production stability. We invite the community to validate these findings using the new benchmarking tools available on our GitHub repository.</p>
<p>Looking ahead, our focus shifts to four key areas:</p>
<ul>
<li class=""><strong>Scheduling:</strong> Moving toward predictive, latency-aware routing (using TTFT/TPOT targets rather than queue depth) and native batch inference support.</li>
<li class=""><strong>Offloading:</strong> Developing proactive state management APIs to "pin" critical context blocks and exploring semantic caching policies.</li>
<li class=""><strong>Autoscaling:</strong> Introducing pluggable optimizers that leverage queuing theory and ML for predictive scaling, alongside direct SLO integration.</li>
<li class=""><strong>Observability:</strong> Implementing end-to-end distributed tracing, from the gateway through the scheduler to the engine, to expose granular latency bottlenecks in disaggregated architectures.</li>
</ul>
<p>We will publish a feature roadmap for v0.6, outlining the next set of priorities informed by ongoing community discussions and feedback.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="upcoming-events"><strong>Upcoming Events</strong><a href="https://llm-d.ai/blog/llm-d-v0.5-sustaining-performance-at-scale#upcoming-events" class="hash-link" aria-label="Direct link to upcoming-events" title="Direct link to upcoming-events" translate="no">​</a></h2>
<p>Please stay tuned for updates on upcoming community events. You can follow these and other events where you can find the llm-d contributors on our <a href="https://llm-d.ai/docs/community/events" target="_blank" rel="noopener noreferrer" class="">community events page</a>.</p>
<p>Follow the project updates on <a href="https://twitter.com/_llm_d_" target="_blank" rel="noopener noreferrer" class="">Twitter/X</a>, <a href="https://bsky.app/profile/llm-d.ai" target="_blank" rel="noopener noreferrer" class="">Bluesky</a>, and <a href="https://www.linkedin.com/company/llm-d" target="_blank" rel="noopener noreferrer" class="">LinkedIn</a>. You can also find recordings of our community calls, as well as demos of current and upcoming features, by subscribing to the <a href="https://www.youtube.com/@llm-d-project" target="_blank" rel="noopener noreferrer" class="">llm-d YouTube channel</a>.</p>
<p>Contribute on <a href="https://github.com/llm-d" target="_blank" rel="noopener noreferrer" class="">GitHub</a>, <a href="https://red.ht/llm-d-public-calendar" target="_blank" rel="noopener noreferrer" class="">join our community calls</a> (Wed 12:30pm ET), join the SIGs and come build with us!</p>]]></content>
        <author>
            <name>Robert Shaw</name>
            <email>robshaw@redhat.com</email>
            <uri>https://github.com/robertgshaw2-redhat</uri>
        </author>
        <author>
            <name>Clayton Coleman</name>
            <email>claytoncoleman@google.com</email>
            <uri>https://github.com/smarterclayton</uri>
        </author>
        <author>
            <name>Carlos Costa</name>
            <email>chcost@us.ibm.com</email>
            <uri>https://github.com/chcost</uri>
        </author>
        <category label="Releases" term="Releases"/>
        <category label="Announcements" term="Announcements"/>
        <category label="llm-d release news" term="llm-d release news"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[llm-d 0.4: Achieve SOTA Performance Across Accelerators]]></title>
        <id>https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators</id>
        <link href="https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators"/>
        <updated>2025-12-02T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[llm-d v0.4 delivers 50% lower latency for MoE models via speculative decoding, expands TPU and XPU support, and adds prefix cache offloading for faster TTFT.]]></summary>
        <content type="html"><![CDATA[<p>llm-d’s mission is to provide the fastest time to SOTA inference performance across any accelerator and cloud. In our <a href="https://llm-d.ai/blog/llm-d-v0.3-expanded-hardware-faster-perf-and-igw-ga" target="_blank" rel="noopener noreferrer" class="">0.3 release</a> we enabled wide expert parallelism for large mixture-of-expert models to provide extremely high output token throughput - a key enabler for reinforcement learning - and we added preliminary support for multiple non-GPU accelerator families.</p>
<p>This release brings the complement to expert parallelism throughput: <strong>improving end-to-end request latency of production serving</strong>. We reduce DeepSeek per-token latency by up to 50% with speculative decoding and vLLM optimizations for latency-critical workloads. We add dynamic disaggregated serving support for Google TPU and Intel XPU to further reduce time-to-first-token latency when traffic is unpredictable, while our new well-lit path for prefix cache offloading helps you leverage CPU memory and high-performance remote storage to increase hit rates and reduce tail latency. For users with multiple model deployments, our workload autoscaler preview takes real-time server capacity and traffic into account to reduce the amount of time a model deployment spends queuing requests - lessening the operational toil of running multiple models over constrained accelerator capacity.</p>
<p>These OSS inference stack optimizations, surfaced through our <a href="https://github.com/llm-d/llm-d/blob/main/guides/README.md#well-lit-path-guides" target="_blank" rel="noopener noreferrer" class="">well-lit paths</a>, ensure you reach SOTA latency on frontier OSS models in real world scenarios.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="sota-low-latency-serving-on-large-moe">SOTA Low Latency Serving on Large MoE<a href="https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators#sota-low-latency-serving-on-large-moe" class="hash-link" aria-label="Direct link to SOTA Low Latency Serving on Large MoE" title="Direct link to SOTA Low Latency Serving on Large MoE" translate="no">​</a></h2>
<p>While our previous releases focused on throughput, v0.4 delivers the features necessary to guarantee ultra-low latency, particularly for demanding wide EP MoE models.</p>
<p>We integrated critical low-latency optimizations for MoE models in vLLM and observed an over 40% reduction in per-token latency for DeepSeek V3.1 on H200 GPUs:</p>
<ul>
<li class=""><strong>Speculative decoding</strong> leverages underutilized compute at low concurrency to run a smaller “draft” model that guesses the next tokens - correct guesses are cheaper to validate than to generate, reducing per-output-token latency. We also tested and improved a number of MoE-specific optimizations, including DeepSeek’s native MTP support. We launched the <a href="https://github.com/vllm-project/speculators" target="_blank" rel="noopener noreferrer" class="">speculators OSS project</a> to broaden support for speculative decoding among popular models and are continuously adding new architectures to our <a href="https://huggingface.co/collections/RedHatAI/speculator-models" target="_blank" rel="noopener noreferrer" class="">Hugging Face Hub</a>.</li>
<li class=""><strong>Async Scheduling:</strong> integration of asynchronous scheduling in vLLM allows for more efficient overlapping of compute and CPU scheduling operations, ensuring that the system can sustain high request rates while maintaining low latencies.</li>
<li class=""><strong>Block-FP8 kernel optimizations</strong> that fuse key elementwise operations, overlap shared-expert computation with grouped expert routing, and select more efficient kernels.</li>
</ul>
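<p>As a rough illustration of the draft-and-verify idea, here is a simplified greedy variant (not vLLM's implementation, which batches all verification into a single target-model forward pass): the draft model proposes a few tokens, and the target keeps the longest agreeing prefix, so several tokens can be emitted per target step.</p>

```python
# Simplified greedy draft-and-verify loop. target_next / draft_next stand in
# for single-token greedy decoding by the target and draft models.
def speculative_step(target_next, draft_next, prefix: list[int], k: int = 4) -> list[int]:
    draft, proposals = list(prefix), []
    for _ in range(k):
        t = draft_next(draft)              # cheap guess from the draft model
        proposals.append(t)
        draft.append(t)
    accepted = []
    for t in proposals:
        expected = target_next(prefix + accepted)  # what the target would emit
        if t != expected:
            accepted.append(expected)      # replace first mismatch, stop
            break
        accepted.append(t)                 # correct guess: accept for free
    return accepted
```

<p>When the draft agrees with the target, all <code>k</code> tokens land in one step; even a rejected proposal still yields one valid target token, so latency never regresses below ordinary decoding in this sketch.</p>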
<p>The recommendations and tradeoffs in tuning for latency will become a well-lit path in the next release of llm-d.</p>
<p><img decoding="async" loading="lazy" alt="Improved DeepSeek 3.1 per output token latency in llm-d 0.4" src="https://llm-d.ai/assets/images/deepseek-per-output-token-latency-cc4530593b1e33643bc15685eac59950.webp" width="1200" height="742" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="expanding-hardware-choice">Expanding Hardware Choice<a href="https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators#expanding-hardware-choice" class="hash-link" aria-label="Direct link to Expanding Hardware Choice" title="Direct link to Expanding Hardware Choice" translate="no">​</a></h2>
<p>The community has <a href="https://github.com/llm-d/llm-d/tree/main/docs/accelerators#support" target="_blank" rel="noopener noreferrer" class="">continued to integrate support across a diverse range of accelerators</a> and <a href="https://github.com/llm-d/llm-d/tree/main/docs/infra-providers" target="_blank" rel="noopener noreferrer" class="">providers</a>, making llm-d the definitive hardware-agnostic control plane:</p>
<ul>
<li class=""><strong>Google TPU Disaggregation over DCN:</strong> We integrate the latest vLLM + TPU architecture with our <a href="https://github.com/llm-d/llm-d/blob/main/guides/pd-disaggregation/README.tpu.md" target="_blank" rel="noopener noreferrer" class="">dynamic disaggregation recipe for llama3-70b</a>, allowing high performance KV cache transfer over datacenter TCP and allowing on-demand scaling of prefill instances in response to prefill load.</li>
<li class=""><strong>Intel XPU Integration:</strong> <a href="https://github.com/llm-d/llm-d/tree/main/docs/accelerators#intel-xpu" target="_blank" rel="noopener noreferrer" class="">Initial support</a> and verification for the intelligent scheduling and disaggregation well-lit paths.</li>
<li class=""><strong>AWS EFA Support:</strong> Support for AWS Elastic Fabric Adapters (EFA) has been added to the llm-d CUDA image in the form of NIXL libfabric libraries to enable low-latency communication on AWS accelerator clusters with <a href="https://github.com/llm-d/llm-d/issues/522" target="_blank" rel="noopener noreferrer" class="">full EFA support in a future release</a>.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="expanding-the-well-lit-paths">Expanding the well-lit paths<a href="https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators#expanding-the-well-lit-paths" class="hash-link" aria-label="Direct link to Expanding the well-lit paths" title="Direct link to Expanding the well-lit paths" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="new-well-lit-path-for-prefix-cache-offloading"><strong>New well-lit path for prefix cache offloading</strong><a href="https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators#new-well-lit-path-for-prefix-cache-offloading" class="hash-link" aria-label="Direct link to new-well-lit-path-for-prefix-cache-offloading" title="Direct link to new-well-lit-path-for-prefix-cache-offloading" translate="no">​</a></h3>
<p>For long-context or high-concurrency multi-turn workloads, GPU memory is the bottleneck. To address this, we are recommending <a href="https://github.com/llm-d/llm-d/blob/main/guides/tiered-prefix-cache/README.md" target="_blank" rel="noopener noreferrer" class=""><strong>tiered prefix-cache offloading</strong></a> as a standard practice in llm-d v0.4.</p>
<p>We have formalized <a href="https://github.com/vllm-project/vllm/issues/19854" target="_blank" rel="noopener noreferrer" class="">vLLM-native CPU offloading</a> and <a href="https://lmcache.ai/" target="_blank" rel="noopener noreferrer" class="">LMCache</a> connectors under a new well-lit path that allows the system to transparently utilize host CPU RAM - which is often underutilized - as a secondary storage tier for the KV cache. Instead of dropping context when VRAM fills up, llm-d allows you to swap blocks to the CPU and retrieve them when needed.</p>
<p>In our <a href="https://github.com/llm-d/llm-d/blob/main/guides/tiered-prefix-cache/cpu/README.md#benchmark" target="_blank" rel="noopener noreferrer" class="">user guide benchmarking</a>, enabling CPU offloading when KV cache working set exceeded the available HBM capacity yielded a <strong>25% reduction in mean Time To First Token (TTFT)</strong> and a <strong>21% increase in total throughput</strong>.</p>
<table><thead><tr><th style="text-align:left">HBM &lt; KVCache &lt; HBM + CPU RAM</th><th style="text-align:left">Mean TTFT (second)</th><th style="text-align:left">P90 TTFT (second)</th><th style="text-align:left">Mean E2E Latency (second)</th><th style="text-align:left">P90 E2E Latency (second)</th><th style="text-align:left">Overall Throughput (token per second)</th></tr></thead><tbody><tr><td style="text-align:left"><strong>Baseline vLLM</strong></td><td style="text-align:left">9.0</td><td style="text-align:left">20.9</td><td style="text-align:left">37.8</td><td style="text-align:left">49.7</td><td style="text-align:left">38534.8</td></tr><tr><td style="text-align:left"><strong>vLLM + CPU offloading 100GB</strong></td><td style="text-align:left">6.7 (-25.6%)</td><td style="text-align:left">20.2 (-3.3%)</td><td style="text-align:left">30.9 (-18.3%)</td><td style="text-align:left">44.2 (-11.1%)</td><td style="text-align:left">46751.0 (+21.3%)</td></tr><tr><td style="text-align:left"><strong>vLLM + LMCache CPU offloading 100GB</strong></td><td style="text-align:left">6.5 (-27.8%)</td><td style="text-align:left">18.8 (-10.0%)</td><td style="text-align:left">30.8 (-18.5%)</td><td style="text-align:left">43.0 (-13.5%)</td><td style="text-align:left">46910.6 (+21.7%)</td></tr></tbody></table>
<p><a href="https://github.com/llm-d/llm-d/blob/main/guides/tiered-prefix-cache/cpu/README.md#high-cache-performance" target="_blank" rel="noopener noreferrer" class=""><em>Table: High Performance</em></a> <em>compares the performance of baseline vLLM with vLLM using the CPU offloading connector when the KV-cache size is larger than the available HBM.</em></p>
<p>By treating CPU memory as an active extension of the GPU, operators can run larger models or higher concurrency on existing hardware, <strong>significantly improving token-per-dollar value</strong>. This sets the stage for deeper tiering strategies, including upcoming guides on distributed storage offloading.</p>
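<p>A toy two-tier cache illustrates the swap-instead-of-drop behavior. This is a sketch under simplifying assumptions (plain Python dicts standing in for HBM and CPU RAM), not the vLLM connector API:</p>

```python
from collections import OrderedDict

# Toy two-tier prefix cache: a small LRU "GPU" tier backed by a larger "CPU"
# tier. Evicted GPU blocks are swapped to CPU instead of dropped, and CPU
# hits are promoted back into the GPU tier on access.
class TieredKVCache:
    def __init__(self, gpu_capacity: int):
        self.gpu = OrderedDict()  # insertion order doubles as LRU order
        self.cpu = {}
        self.gpu_capacity = gpu_capacity

    def put(self, block_hash: str, block):
        self.gpu[block_hash] = block
        self.gpu.move_to_end(block_hash)
        while len(self.gpu) > self.gpu_capacity:
            evicted_hash, evicted = self.gpu.popitem(last=False)
            self.cpu[evicted_hash] = evicted   # offload instead of drop

    def get(self, block_hash: str):
        if block_hash in self.gpu:
            self.gpu.move_to_end(block_hash)
            return self.gpu[block_hash]
        if block_hash in self.cpu:
            self.put(block_hash, self.cpu.pop(block_hash))  # promote
            return self.gpu[block_hash]
        return None  # true miss: the prefix must be recomputed in prefill
```

<p>Only a true miss in both tiers forces a prefill recomputation, which is why TTFT improves once the working set exceeds HBM.</p>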
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="enhanced-intelligent-scheduling"><strong>Enhanced intelligent scheduling</strong><a href="https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators#enhanced-intelligent-scheduling" class="hash-link" aria-label="Direct link to enhanced-intelligent-scheduling" title="Direct link to enhanced-intelligent-scheduling" translate="no">​</a></h3>
<p>To complement the new offloading capabilities, we have updated our <strong>intelligent inference scheduling</strong> well-lit path to fully leverage the tiered memory architecture.</p>
<ul>
<li class=""><strong>Tier-aware precision:</strong> Prefix-cache affinity now accounts for the cost dynamics of the new cache tier. We implemented granular KV-cache tracking with <strong>configurable weighted scoring</strong> (e.g., weighting GPU hits higher than CPU hits). This allows the router to calculate the most efficient path for KV retrieval, balancing the high value of GPU-resident data against the available capacity of the CPU tier.</li>
</ul>
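<p>The weighted scoring idea reduces to a simple calculation. Weights and function names below are illustrative assumptions, not llm-d's configuration schema:</p>

```python
# GPU-resident block hits are worth more than CPU-tier hits because
# retrieval from HBM is cheaper than from host RAM. Weights are illustrative.
def cache_affinity_score(gpu_hit_blocks: int, cpu_hit_blocks: int,
                         w_gpu: float = 1.0, w_cpu: float = 0.5) -> float:
    return w_gpu * gpu_hit_blocks + w_cpu * cpu_hit_blocks

def pick_replica(replicas: dict[str, tuple[int, int]]) -> str:
    # replicas maps name -> (gpu hits, cpu hits) for the incoming prompt
    return max(replicas, key=lambda r: cache_affinity_score(*replicas[r]))
```

<p>With these example weights, a replica holding one GPU-resident block and four CPU-resident blocks (score 3.0) beats one holding two GPU-resident blocks (score 2.0): a large CPU-tier hit can outweigh a small GPU-tier hit.</p>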
<p>Drawing on extensive benchmarking data, we also refined our <strong>load-oriented distribution</strong> scorers, highlighting:</p>
<ul>
<li class=""><strong>Saturation stability:</strong> We introduced the <code>no-hit-lru-scorer</code>. This scorer intelligently distributes "cold" requests - the seeds that form new prefix-cache trails - to prevent hotspots during cache formation. This eliminates oscillating wait-queues, allowing the cluster to <strong>maintain stability at higher concurrency on the same hardware</strong> for workloads with prefix reuse opportunities.</li>
</ul>
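<p>One plausible reading of the <code>no-hit-lru-scorer</code> (a sketch of the concept, not the actual implementation) is a least-recently-used rotation over replicas for cold requests, spreading new cache trails evenly instead of piling them onto one hotspot:</p>

```python
from collections import OrderedDict

# Conceptual sketch: a request with no prefix-cache hit anywhere goes to the
# replica that least recently received such a "cold" request.
class NoHitLRUScorer:
    def __init__(self, replicas: list[str]):
        self.order = OrderedDict((r, None) for r in replicas)

    def pick_cold_target(self) -> str:
        replica, _ = self.order.popitem(last=False)  # least recently used
        self.order[replica] = None                   # now most recently used
        return replica
```
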
<p>For workloads with <strong>prefix-reuse opportunities</strong>, this improved placement logic <strong>cut P99 Time To First Token (TTFT) in half</strong> (18.3s -&gt; 8s) while maintaining stability at peak concurrency.</p>
<p>Benchmark comparison: Running 8x Qwen-32B on 16 NVIDIA H100 GPUs at 50 QPS with <strong>5500:1000 ISL/OSL requests</strong> sharing <strong>150 unique prefixes</strong>. The v0.4 scheduler (bottom) eliminates the wait-queue oscillations seen in v0.3, resulting in a 50% reduction in P99 TTFT.</p>
<p>High queueing in llm-d v0.3 for cold prefix cache requests:<br>
<img decoding="async" loading="lazy" alt="Cold cache request queueing in 0.3" src="https://llm-d.ai/assets/images/cold-request-0.3-3fdad4ac82ae7d8c6cee51f69415f1be.webp" width="1834" height="817" class="img_ev3q"></p>
<p>Significant reduction in queueing in llm-d v0.4 for cold prefix cache requests:<br>
<img decoding="async" loading="lazy" alt="Lower queueing of requests in 0.4 for cold cache requests" src="https://llm-d.ai/assets/images/cold-request-0.4-e8a298bb2eabc181efa8a6e2ceb5d003.webp" width="1999" height="616" class="img_ev3q"></p>
<p>Best practices and benchmarking for configuring the appropriate scoring for prefix-cache dependent workloads is planned for our next release.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="benchmarking-and-validation"><strong>Benchmarking and validation</strong><a href="https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators#benchmarking-and-validation" class="hash-link" aria-label="Direct link to benchmarking-and-validation" title="Direct link to benchmarking-and-validation" translate="no">​</a></h3>
<p>To help the community understand the trade-offs in distributed inference, we are making it easier to benchmark the well-lit paths. <a href="https://github.com/llm-d/llm-d-benchmark" target="_blank" rel="noopener noreferrer" class="">llm-d-benchmark</a> provides all the tooling needed for performance characterization using a "design of experiments" methodology, ensuring results are reproducible and standardized.</p>
<ul>
<li class=""><strong>Comprehensive automation:</strong> We now fully automate the setup and execution of the llm-d well-lit paths in multiple scenarios using supported harnesses like inference-perf, guidellm, vllm-benchmark, and inferenceMAX.</li>
<li class=""><strong>Flexible execution:</strong> new guides covering automated stack setup, running against existing stacks, and interactive benchmarking.</li>
<li class=""><strong>Data-driven insights:</strong> Benchmark data is collected in a <a href="https://github.com/llm-d/llm-d-benchmark/blob/main/docs/benchmark_report.md" target="_blank" rel="noopener noreferrer" class="">standardized report format</a>. Our "Configuration Explorer" allows you to parse this data to visualize Pareto curves and find optimal deployment parameters for your specific SLOs.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="improving-efficiency">Improving efficiency<a href="https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators#improving-efficiency" class="hash-link" aria-label="Direct link to Improving efficiency" title="Direct link to Improving efficiency" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="introducing-the-workload-variant-autoscaler-experimental"><strong>Introducing the Workload Variant Autoscaler (experimental)</strong><a href="https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators#introducing-the-workload-variant-autoscaler-experimental" class="hash-link" aria-label="Direct link to introducing-the-workload-variant-autoscaler-experimental" title="Direct link to introducing-the-workload-variant-autoscaler-experimental" translate="no">​</a></h3>
<p>The <a href="https://github.com/llm-d/llm-d/blob/main/guides/workload-autoscaling/README.md" target="_blank" rel="noopener noreferrer" class="">Workload Variant Autoscaler</a> uses a reactive saturation-based optimizer that operates on per-replica metrics such as queue length and KV-cache utilization to identify saturated replicas and calculate the spare capacity on non-saturated replicas. It relies on similar signals as the inference scheduler with a wider safety margin to avoid oscillation. This approach is robust for workloads where performance parameters are uncertain and allows scaling on observed load to prevent queue overflow.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="saturation-based-scaling"><strong>Saturation-based scaling</strong><a href="https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators#saturation-based-scaling" class="hash-link" aria-label="Direct link to saturation-based-scaling" title="Direct link to saturation-based-scaling" translate="no">​</a></h4>
<p>Saturation-based scaling is the default autoscaling approach for the v0.4 release and is recommended for most workloads, including hybrid state space models (HSSM), MoE, and diffusion architectures.</p>
<ul>
<li class=""><strong>How it works:</strong> Scales reactively based on observed metrics (arrival rate vs. capacity).</li>
<li class=""><strong>Benefit:</strong> Does not require complex performance parameters and tuning, making it robust and less sensitive to architecture-specific behavior differences.</li>
<li class=""><strong>Limitation:</strong> It is reactive only: it scales when the average observed spare capacity across non-saturated replicas falls below statically configured thresholds, and it may be less cost-efficient than accurate predictive scaling. Scaling happens in increments of one replica at a time.</li>
</ul>
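<p>The saturation logic above can be sketched as a small decision function. Thresholds and field names are illustrative assumptions, not WVA's configuration:</p>

```python
# Reactive saturation-based scaling sketch: a replica counts as saturated
# when queue depth or KV-cache utilization crosses a limit; the pool grows
# by one replica when average spare capacity on healthy replicas is too low.
def desired_replicas(replicas: list[dict], queue_limit: int = 8,
                     kv_limit: float = 0.9, spare_floor: float = 0.2) -> int:
    def saturated(r):
        return r["queue"] >= queue_limit or r["kv_util"] >= kv_limit
    healthy = [r for r in replicas if not saturated(r)]
    if not healthy:
        return len(replicas) + 1  # everything saturated: add a replica
    spare = sum(1.0 - r["kv_util"] for r in healthy) / len(healthy)
    # Scaling happens in increments of one replica, as described above.
    return len(replicas) + 1 if spare < spare_floor else len(replicas)
```
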
<p>Future releases will enhance WVA with an SLO-driven, proactive, adaptive optimizer that responds not only to queries per second (QPS) but also to evolving traffic patterns, striking a balance between ease of use and performance. Following this release, we are planning a detailed blog on autoscaling, including performance evaluations and a roadmap.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="making-it-easier-to-operate">Making it easier to operate<a href="https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators#making-it-easier-to-operate" class="hash-link" aria-label="Direct link to Making it easier to operate" title="Direct link to Making it easier to operate" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="production-readiness-for-chat"><strong>Production readiness for chat</strong><a href="https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators#production-readiness-for-chat" class="hash-link" aria-label="Direct link to production-readiness-for-chat" title="Direct link to production-readiness-for-chat" translate="no">​</a></h3>
<ul>
<li class=""><strong>Chat completion API support:</strong> Full support for the OpenAI-compatible <strong><code>/v1/chat/completions</code></strong> API has been merged and tested, simplifying the operational deployment of all conversational and agentic workloads.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="breaking-changes">Breaking changes<a href="https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators#breaking-changes" class="hash-link" aria-label="Direct link to Breaking changes" title="Direct link to Breaking changes" translate="no">​</a></h2>
<p>As part of our refactoring to align with best practices in the Kubernetes Gateway API Inference Extension (GAIE), we are announcing a critical breaking change for all deployments using our Helm charts: The method for setting configuration flags for the Inference Scheduler has changed from an array format to a map format. If you are upgrading to v0.4, please review the latest Helm chart <code>values.yaml</code> files and update your overrides immediately. This change is necessary to decouple configuration from code and enable future advanced scheduling features.</p>
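<p>Schematically, the change looks like the following. The flag name shown is hypothetical - consult the v0.4 chart <code>values.yaml</code> files for the real keys and structure:</p>

```yaml
# Before (v0.3): scheduler flags as an array of strings (hypothetical flag)
# inferenceScheduler:
#   flags:
#     - "--prefix-cache-weight=2"

# After (v0.4): scheduler flags as a map of flag name to value
inferenceScheduler:
  flags:
    prefix-cache-weight: "2"
```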
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-is-next">What Is Next?<a href="https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators#what-is-next" class="hash-link" aria-label="Direct link to What Is Next?" title="Direct link to What Is Next?" translate="no">​</a></h2>
<p>0.5 will continue adding accelerator support across a wider range of scenarios and models, and will also increase our focus on tuning for specific use cases, like multi-turn chat or reinforcement learning. If there is a use case important to you, please suggest it in our <a href="https://github.com/llm-d/llm-d/issues/517" target="_blank" rel="noopener noreferrer" class="">0.5 roadmap issue</a> or raise it in a community meeting!</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="community-and-upcoming-events">Community and Upcoming Events<a href="https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators#community-and-upcoming-events" class="hash-link" aria-label="Direct link to Community and Upcoming Events" title="Direct link to Community and Upcoming Events" translate="no">​</a></h2>
<p>Driven by our community, llm-d continues to evolve. Release v0.4 incorporates feedback and code from a diverse range of contributors, cementing our status as the definitive hardware-agnostic control plane. We are incredibly grateful to everyone involved.</p>
<p>Please stay tuned for updates on upcoming community events. You can follow these and other events on our <a href="https://llm-d.ai/docs/community/events" target="_blank" rel="noopener noreferrer" class="">community events page</a>.</p>
<p>Contribute on <a href="https://github.com/llm-d" target="_blank" rel="noopener noreferrer" class="">GitHub</a>, <a href="https://red.ht/llm-d-public-calendar" target="_blank" rel="noopener noreferrer" class="">join our community calls</a> (Wed 12:30pm ET), join the SIGs and come build with us!</p>]]></content>
        <author>
            <name>Robert Shaw</name>
            <email>robshaw@redhat.com</email>
            <uri>https://github.com/robertgshaw2-redhat</uri>
        </author>
        <author>
            <name>Clayton Coleman</name>
            <email>claytoncoleman@google.com</email>
            <uri>https://github.com/smarterclayton</uri>
        </author>
        <author>
            <name>Carlos Costa</name>
            <email>chcost@us.ibm.com</email>
            <uri>https://github.com/chcost</uri>
        </author>
        <category label="Releases" term="Releases"/>
        <category label="Announcements" term="Announcements"/>
        <category label="llm-d release news" term="llm-d release news"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[llm-d 0.3: Wider Well-Lit Paths for Scalable Inference]]></title>
        <id>https://llm-d.ai/blog/llm-d-v0.3-expanded-hardware-faster-perf-and-igw-ga</id>
        <link href="https://llm-d.ai/blog/llm-d-v0.3-expanded-hardware-faster-perf-and-igw-ga"/>
        <updated>2025-10-10T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[llm-d v0.3 adds Google TPU and Intel XPU support, wide expert parallelism at 2.2k tokens/sec per GPU, predicted latency scheduling, and Inference Gateway GA.]]></summary>
<content type="html"><![CDATA[<p>In our <a href="https://llm-d.ai/blog/llm-d-v0.2-our-first-well-lit-paths" target="_blank" rel="noopener noreferrer" class="">0.2 release</a>, we introduced the first <em>well-lit paths</em>, tested blueprints for scaling inference on Kubernetes. With our <a href="https://github.com/llm-d/llm-d/releases/tag/v0.3.0" target="_blank" rel="noopener noreferrer" class="">0.3 release</a>, we double down on the mission: to provide a fast path to deploying high-performance, hardware-agnostic, easy-to-operationalize inference at scale.</p>
<p>This release delivers:</p>
<ul>
<li class="">Expanded hardware support, now including Google TPU and Intel XPU support</li>
<li class="">TCP and RDMA over RoCE validated for disaggregation</li>
<li class="">A predicted-latency-based balancing preview that improves P90 latency by up to 3x in long-prefill workloads</li>
<li class="">Wide expert parallel (EP) scaling to 2.2k tokens per second per H200 GPU</li>
<li class="">The GA release of the Inference Gateway (IGW v1.0)</li>
</ul>
<p>Taken together, these results redefine the operating envelope for inference. llm-d enables clusters to run hotter before scaling out, extract more value from each GPU, and still meet strict latency objectives. The result is a control plane built not just for speed, but for predictable, cost-efficient scale.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="commit-to-the-mission"><strong>Commit to the Mission</strong><a href="https://llm-d.ai/blog/llm-d-v0.3-expanded-hardware-faster-perf-and-igw-ga#commit-to-the-mission" class="hash-link" aria-label="Direct link to commit-to-the-mission" title="Direct link to commit-to-the-mission" translate="no">​</a></h2>
<p>llm-d’s mission is to be a hardware-agnostic, upstream-aligned control plane for inference. 0.3 takes a decisive step forward, not only by expanding platform coverage, but also by making it easier to adopt and experiment with.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="broader-hardware-support"><strong>Broader Hardware Support</strong><a href="https://llm-d.ai/blog/llm-d-v0.3-expanded-hardware-faster-perf-and-igw-ga#broader-hardware-support" class="hash-link" aria-label="Direct link to broader-hardware-support" title="Direct link to broader-hardware-support" translate="no">​</a></h3>
<p>llm-d support now spans an expanded array of accelerator families: NVIDIA, Google TPU, and Intel XPU. TPU artifacts are tested in cloud environments, and Intel XPU builds run through CI runners alongside NVIDIA.</p>
<p>Disaggregation support is now more portable. With 0.3, TCP and RDMA over RoCE have been validated as transports for prefill/decode separation. This extends disaggregation beyond specialized InfiniBand fabrics, opening the path to reproducible deployments in mainstream cloud networking environments.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="simplified-user-experience-across-a-variety-of-infrastructure"><strong>Simplified User Experience Across A Variety of Infrastructure</strong><a href="https://llm-d.ai/blog/llm-d-v0.3-expanded-hardware-faster-perf-and-igw-ga#simplified-user-experience-across-a-variety-of-infrastructure" class="hash-link" aria-label="Direct link to simplified-user-experience-across-a-variety-of-infrastructure" title="Direct link to simplified-user-experience-across-a-variety-of-infrastructure" translate="no">​</a></h3>
<p>Serving large language models is complex - our documentation and configuration should be simple. Quickstarts have been streamlined and renamed to guides, with fewer options and more context around the key decisions you need to make. They now live in the main repository and are treated as living documents alongside our growing documentation for common scenarios. Since llm-d is about exposing key tradeoffs and useful patterns, we’ve split out the key prerequisites for each guide - cluster configuration, client setup, and gateway choice - into their own sections, and replaced our all-in-one installer scripts with better step-by-step instructions.</p>
<p>As more cluster providers integrate with llm-d, we’ve expanded our infrastructure documentation with provider-specific troubleshooting, configuration, and testing. This release adds documentation and steps for CoreWeave, DigitalOcean, Google Kubernetes Engine, and OpenShift.</p>
<p>Guides now include curated Inference Gateway installs and static manifests for clarity, with overlays available for benchmarking sweeps. RBAC patterns were refactored toward namespace scope for smoother multi-tenancy.</p>
<p><strong>Why it matters:</strong> With 0.3, experimenting with intelligent scheduling or disaggregation is as simple as running a documented guide. The control plane is more transparent, reproducible, and extensible, independently of the platform you are running.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="brighten-the-well-lit-paths"><strong>Brighten the well-lit paths</strong><a href="https://llm-d.ai/blog/llm-d-v0.3-expanded-hardware-faster-perf-and-igw-ga#brighten-the-well-lit-paths" class="hash-link" aria-label="Direct link to brighten-the-well-lit-paths" title="Direct link to brighten-the-well-lit-paths" translate="no">​</a></h2>
<p>The well-lit paths we introduced in 0.2 - wide-EP, intelligent scheduling, disaggregation - are now sharper and more reproducible.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="wide-ep-performance"><strong>Wide-EP Performance</strong><a href="https://llm-d.ai/blog/llm-d-v0.3-expanded-hardware-faster-perf-and-igw-ga#wide-ep-performance" class="hash-link" aria-label="Direct link to wide-ep-performance" title="Direct link to wide-ep-performance" translate="no">​</a></h3>
<p>The wide-EP path, which parallelizes across experts to maximize throughput, has reached <strong>2.2k tokens/s per GPU</strong> in community benchmarks on H200 clusters.</p>
<p><img decoding="async" loading="lazy" alt="prefill throughput" src="https://llm-d.ai/assets/images/wide_ep_prefill-e200554d4ed651fdbf8d388d8f2bf509.webp" width="1200" height="742" class="img_ev3q"><img decoding="async" loading="lazy" alt="decode throughput" src="https://llm-d.ai/assets/images/wide_ep_decode-43e7ec6a9221e955939b046728200894.webp" width="1200" height="742" class="img_ev3q"></p>
<p>The result reflects sustained throughput in multi-node deployments with production-like settings. Earlier results hovered around 1.5k tokens/s per GPU; the jump to 2.2k confirms that kernel optimizations (silu-mul-quant fusion, Cutlass QKV kernels, TP attention bugfixes) and the addition of Dual Batch Overlap (DBO) for decode are delivering compound gains.</p>
<p><strong>Why it matters:</strong> At these throughput levels operators can consolidate workloads onto fewer nodes, reduce the number of replicas needed for a given QPS, and lower token-per-dollar cost.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="inference-gateway-v10-ga"><strong>Inference Gateway v1.0 GA</strong><a href="https://llm-d.ai/blog/llm-d-v0.3-expanded-hardware-faster-perf-and-igw-ga#inference-gateway-v10-ga" class="hash-link" aria-label="Direct link to inference-gateway-v10-ga" title="Direct link to inference-gateway-v10-ga" translate="no">​</a></h3>
<p>In v0.2, we brought a number of new features to the inference scheduling space. With v0.3, the Inference Gateway (IGW) reaches v1.0 GA, making intelligent routing a stable primitive in llm-d.</p>
<p><img decoding="async" loading="lazy" alt="Different strategies for balancing" src="https://llm-d.ai/assets/images/igw_ga-fc8828394a38eca5c8b2c7f9d2825a6d.webp" width="1979" height="455" class="img_ev3q"><br>
<em>Figure: A tri-panel of TTFT, TPoT and throughput measured through progressively rising QPS rates, comparing load-, prefix-, and prefix+load-aware scheduling</em>. <em>Check out our blog <a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#performance-results" target="_blank" rel="noopener noreferrer" class="">KV-Cache Wins You Can See</a> for more details.</em></p>
<p>IGW integrates tightly with the llm-d scheduler, supporting load-aware (kv-cache utilization and queue length) and prefix-cache aware scheduling. Benchmarks show that adding prefix-aware scorers achieves near-100% KV cache hit rates and significantly reduces TTFT compared to earlier routing approaches based on load alone. For cluster-level throughput, that means sustained &gt;60k tokens/s with lower latency spread. It also demonstrates the importance of combining prefix- and load-aware scorers to ensure that cache hits are exploited without overloading replicas.</p>
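<p>The interaction between the two signals can be sketched as a weighted score per replica. The following is a minimal illustration only - llm-d's actual scorer lives in the Go scheduler with its own plugin interfaces, and the weights, signal names, and normalization here are all assumptions:</p>

```python
# Hypothetical sketch of combining prefix-cache affinity with load signals.
# All names and weights are illustrative, not llm-d's real configuration.
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    prefix_cache_hit_ratio: float  # 0..1, fraction of prompt blocks already cached
    kv_cache_utilization: float    # 0..1, fraction of KV blocks in use
    queue_length: int              # requests waiting

def score(r: Replica, max_queue: int = 32,
          w_prefix: float = 2.0, w_kv: float = 1.0, w_queue: float = 1.0) -> float:
    """Higher is better: reward cache reuse, penalize load."""
    queue_pressure = min(r.queue_length / max_queue, 1.0)
    return (w_prefix * r.prefix_cache_hit_ratio
            - w_kv * r.kv_cache_utilization
            - w_queue * queue_pressure)

replicas = [
    Replica("pod-a", prefix_cache_hit_ratio=0.9, kv_cache_utilization=0.95, queue_length=30),
    Replica("pod-b", prefix_cache_hit_ratio=0.6, kv_cache_utilization=0.40, queue_length=4),
]
best = max(replicas, key=score)
print(best.name)  # pod-b: a cache hit alone doesn't win if the replica is saturated
```

<p>The point of the sketch is the tension the paragraph describes: with a prefix score alone, pod-a would win and overload; the load terms pull the decision back toward the lightly loaded replica.</p>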
<p><strong>Why it matters:</strong> This is the difference between inference systems that react to congestion and ones that proactively route requests to where cache reuse and concurrency balance are best. For operators it translates into fewer cache misses, reduced latency spikes, and predictable scaling.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="predicted-latency-based-scheduling"><strong>Predicted Latency-Based Scheduling</strong><a href="https://llm-d.ai/blog/llm-d-v0.3-expanded-hardware-faster-perf-and-igw-ga#predicted-latency-based-scheduling" class="hash-link" aria-label="Direct link to predicted-latency-based-scheduling" title="Direct link to predicted-latency-based-scheduling" translate="no">​</a></h3>
<p>We are introducing an experimental scheduling scorer designed to optimize latency. The predictor incorporates input length, in-flight tokens, and concurrency into a unified cost model to forecast TTFT and TPOT for each request across available servers. The system proactively balances requests within the 25–75% saturation window, where most clusters operate and where reactive schedulers fall short.</p>
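<p>Conceptually, the scorer reduces routing to "pick the server with the lowest forecast latency." A toy linear sketch of that idea follows - the coefficients and feature set are invented for illustration; the real predictor learns its model online from live traffic:</p>

```python
# Toy sketch of predicted-latency routing: forecast TTFT and TPOT per
# server from simple features, then pick the server with the lowest
# predicted end-to-end latency. All coefficients are illustrative.
def predict_ttft(input_tokens: int, queued_tokens: int, concurrency: int) -> float:
    # Prefill cost grows with prompt length and with work already in flight.
    return 0.0004 * input_tokens + 0.0002 * queued_tokens + 0.01 * concurrency

def predict_tpot(concurrency: int) -> float:
    # Per-output-token time degrades as the decode batch grows.
    return 0.008 + 0.0015 * concurrency

def predicted_latency(input_tokens, expected_output_tokens, queued_tokens, concurrency):
    return (predict_ttft(input_tokens, queued_tokens, concurrency)
            + expected_output_tokens * predict_tpot(concurrency))

servers = {
    "s1": {"queued_tokens": 40_000, "concurrency": 48},
    "s2": {"queued_tokens": 5_000, "concurrency": 12},
}
req = {"input_tokens": 8_000, "expected_output_tokens": 200}
best = min(servers, key=lambda s: predicted_latency(
    req["input_tokens"], req["expected_output_tokens"], **servers[s]))
print(best)  # s2: far lower predicted TTFT and TPOT
```

<p>Note the asymmetry the post discusses: the TTFT term depends only on known quantities (prompt length, in-flight work), while the decode term needs an output-length estimate, which is why prefill-heavy workloads benefit most.</p>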
<p><img decoding="async" loading="lazy" alt="Tail latency improvements with predicted latency scheduling" src="https://llm-d.ai/assets/images/predicted_1-f117713a68745829b646a3356ab40374.webp" width="1999" height="515" class="img_ev3q">
<em>Figure: TTFT and TPOT tail latency measured on increasing QPS rates (6 servers), comparing predicted-latency with prefix+load-aware scheduling for a 4:1 ISL:OSL ratio.</em></p>
<p>The first graph demonstrates that the new scorer slightly improves tail latency on a balanced ISL:OSL workload. Another major advantage over manually tuned load- and prefix-aware scorers is simpler configuration: thresholds are learned automatically from ingested features. This inherent adaptability makes it more resilient to shifts in traffic patterns and allows for easier integration of new scheduling signals.</p>
<p>The subsequent figure illustrates that as the workload shifts towards a more prefill-intensive distribution, the new scorer offers even better performance. This is because it more precisely predicts prefill duration, which is tied to a known request length, in contrast to decode time, which depends on an unpredictable output length.</p>
<p><img decoding="async" loading="lazy" alt="Strong latency improvement for extreme prefill / decode ratios" src="https://llm-d.ai/assets/images/predicted_2-2b9db2acc8fccdf86035fe83231708b1.webp" width="1999" height="515" class="img_ev3q"><em>Figure: TTFT and TPOT tail latency measured on increasing QPS rates (6 servers), comparing predicted-latency with prefix+load-aware scheduling for a 400:1 ISL:OSL ratio.</em></p>
<p>With better latency prediction, the new approach also improves SLO attainability for sheddable workloads by detecting saturation before it occurs rather than after.</p>
<p><img decoding="async" loading="lazy" alt="Better saturation anticipation when accelerators are almost full" src="https://llm-d.ai/assets/images/predicted_3-18aef05ada5f1a90c19686754500130e.webp" width="1999" height="737" class="img_ev3q">
<em>Figure: SLO attainability (1 server), comparing predicted-latency with load-aware scheduling and shedding for a balanced ~1:1 ISL:OSL ratio.</em></p>
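<p>The shedding decision can be sketched as a simple admission check on the forecast: if even the best predicted TTFT across replicas already violates the request's SLO, a sheddable request is rejected up front instead of queueing. This is an illustrative sketch, not llm-d's actual saturation detector:</p>

```python
# Sketch of SLO-aware admission with shedding. Values are illustrative.
def admit(predicted_ttfts_s: list, ttft_slo_s: float, sheddable: bool = True):
    """Return the index of the chosen replica, or None if the request is shed."""
    best = min(predicted_ttfts_s)
    if sheddable and best > ttft_slo_s:
        return None  # reject early, before the cluster saturates further
    return predicted_ttfts_s.index(best)

print(admit([2.4, 3.1], ttft_slo_s=1.0))  # None: every replica would miss the SLO
print(admit([0.3, 0.9], ttft_slo_s=1.0))  # 0: route to the fastest replica
```

<p>Rejecting on a forecast rather than on an observed queue backlog is what makes the detection proactive: saturation is anticipated before latency has already degraded.</p>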
<p><strong>Why it matters:</strong></p>
<ul>
<li class="">Users experience steadier, lower tail latencies: critical for interactive workloads.</li>
<li class="">Operators can push GPUs closer to saturation before scaling out, reducing cost.</li>
<li class="">SLOs evolve from reactive guardrails to proactive dials for tuning responsiveness vs. stability.</li>
<li class="">This lays the foundation for closed-loop autoscaling that responds directly to SLO objectives.</li>
</ul>
<p>For MaaS providers, these gains are not just academic; they directly translate into lower variance for customers and better utilization of costly accelerators.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="kv-disaggregation-and-chat-completions"><strong>KV Disaggregation and Chat Completions</strong><a href="https://llm-d.ai/blog/llm-d-v0.3-expanded-hardware-faster-perf-and-igw-ga#kv-disaggregation-and-chat-completions" class="hash-link" aria-label="Direct link to kv-disaggregation-and-chat-completions" title="Direct link to kv-disaggregation-and-chat-completions" translate="no">​</a></h3>
<p>The KV path also matures in v0.3. Synchronous tokenization has stabilized efficient caching in high-availability deployments, while precise prefix-cache aware scorers significantly reduce KV-cache duplication across instances and limit the impact of prefix-cache expiration pressure, cutting TTFT dramatically (57x in <a href="https://llm-d.ai/blog/kvcache-wins-you-can-see" target="_blank" rel="noopener noreferrer" class="">our benchmark</a>) and doubling throughput.</p>
<p>The benchmark simulates a B2B SaaS platform serving 150 enterprise customers, each with their own 6,000-token context shared among 5 concurrent users (750 total) under sustained load, where cache demand exceeds single-instance capacity and forces distributed scheduling - testing the system's ability to avoid cache thrashing.<br>
<img decoding="async" loading="lazy" alt="Precise prefix caching" src="https://llm-d.ai/assets/images/kv_1-4565767a57f55b90aa1f200d1cbaf09a.webp" width="1979" height="455" class="img_ev3q"><br>
<em>Figure: A tri-panel of TTFT, TPoT and Throughput measured through progressively rising QPS rates in a <a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#performance-results" target="_blank" rel="noopener noreferrer" class="">demanding benchmark</a>.</em></p>
<p><em><img decoding="async" loading="lazy" alt="Total computational work saved" src="https://llm-d.ai/assets/images/kv_2-004ee2b885ca03a955d63a0522d37b1f.webp" width="1580" height="724" class="img_ev3q"></em><br>
<em>Figure: The total computational work - tokens per second - <strong>saved</strong> by effective KV-cache use across the cluster, over the course of the <a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#1-effective-cache-throughput-quantifying-saved-work" target="_blank" rel="noopener noreferrer" class="">benchmarks</a>.</em></p>
<p><strong>Why it matters:</strong> The stronger cache affinity guarantees provided by precise prefix-cache aware routing keep latency low without requiring the use of coarse-grained sticky sessions or high operational overhead. Real-world chat completion workloads can keep the high cache hit rates of sticky sessions but gain utilization based balancing, reducing hot spotting and underutilized capacity.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="observability-and-benchmarking"><strong>Observability and Benchmarking</strong><a href="https://llm-d.ai/blog/llm-d-v0.3-expanded-hardware-faster-perf-and-igw-ga#observability-and-benchmarking" class="hash-link" aria-label="Direct link to observability-and-benchmarking" title="Direct link to observability-and-benchmarking" translate="no">​</a></h3>
<p>Behind these performance and scheduling gains is a stronger foundation for visibility and reproducibility.</p>
<ul>
<li class=""><strong>Observability</strong>: Gateway metrics are now exposed via ServiceMonitor, with PromQL queries and Grafana dashboards published in the docs.</li>
<li class=""><strong>Benchmarking</strong>: v0.3.0 RC1 supports all well-lit path scenarios in CI/CD. Inference-perf has been hardened with saturation testing, trace replay, and more accurate scheduling.</li>
</ul>
<p><strong>Why it matters:</strong> Operators can now validate configs before launch, monitor latency compliance in real time, and detect regressions early. This ensures the well-lit paths are not just concepts, but reproducible and production-ready.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-do-you-get-today"><strong>What Do You Get Today?</strong><a href="https://llm-d.ai/blog/llm-d-v0.3-expanded-hardware-faster-perf-and-igw-ga#what-do-you-get-today" class="hash-link" aria-label="Direct link to what-do-you-get-today" title="Direct link to what-do-you-get-today" translate="no">​</a></h2>
<p>With v0.3, here’s what you can run, experiment with, and rely on:</p>
<ul>
<li class="">Cross-platform hardware support across NVIDIA, Intel XPU, and Google TPU.</li>
<li class="">Simplified installation with curated guides, overlays, static manifests, and a capacity planner.</li>
<li class="">Wide-EP throughput (reaching 2.2k tokens/s per GPU on multi-node H200 clusters).</li>
<li class="">Inference Gateway v1.0 GA, with cache-aware routing and stable APIs.</li>
<li class="">Adaptive SLO predictor (preview) showing up to 3x better P90 latency in long-prefill workloads.</li>
<li class="">KV disaggregation with synchronous tokenization, precise scorers, and chat completions support.</li>
<li class="">Disaggregation over TCP and RDMA/RoCE, extending beyond InfiniBand networks.</li>
<li class="">Observability tools with ServiceMonitor and Grafana dashboards.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-is-next"><strong>What Is Next?</strong><a href="https://llm-d.ai/blog/llm-d-v0.3-expanded-hardware-faster-perf-and-igw-ga#what-is-next" class="hash-link" aria-label="Direct link to what-is-next" title="Direct link to what-is-next" translate="no">​</a></h2>
<p>While v0.3 stabilizes today’s well-lit paths, the community is also laying the foundations for tomorrow’s:</p>
<ul>
<li class=""><strong>Native CPU offload in vLLM</strong>: Async KV transfer APIs and CPU buffer transport are under review, enabling cache spillover to CPU memory with scheduler awareness through KV events.</li>
<li class=""><strong>Deferred decode and async protocols</strong>: design work is underway for finer scheduling control and a lower-level “tokens in / tokens out” serving engine in vLLM for high-scale multi-modal serving.</li>
<li class=""><strong>Public performance database</strong>: building on benchmarking harnesses to provide transparent, reproducible performance data across platforms.</li>
<li class=""><strong>Autoscaling incubation</strong>: WVA autoscaler integrated with Kubernetes HPA; demonstrated scale-to-zero. Next steps include tying autoscaler decisions directly to adaptive SLOs.</li>
</ul>
<p>If there is an area you’d like for us to focus on, please suggest it in our <a href="https://github.com/llm-d/llm-d/issues/347" target="_blank" rel="noopener noreferrer" class="">0.4 release tracking issue.</a></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="community-and-upcoming-events"><strong>Community and Upcoming Events</strong><a href="https://llm-d.ai/blog/llm-d-v0.3-expanded-hardware-faster-perf-and-igw-ga#community-and-upcoming-events" class="hash-link" aria-label="Direct link to community-and-upcoming-events" title="Direct link to community-and-upcoming-events" translate="no">​</a></h2>
<p>The strength of llm-d comes from its community. v0.3 reflects contributions from a growing number of contributors and collaborators. Thank you to everyone who helped out!</p>
<p>Upcoming highlights include llm-d sessions at All Things Open (Oct 12–14, 2025), PyTorch Conference (Oct 22–23, 2025), AMD AI DevDay (Oct 20), and KubeCon 2025 (Nov 10–14). You can follow these and other events on our <a href="https://llm-d.ai/docs/community/events" target="_blank" rel="noopener noreferrer" class="">community events page</a>.</p>
<p>Contribute on <a href="https://github.com/llm-d" target="_blank" rel="noopener noreferrer" class="">GitHub</a>, <a href="https://red.ht/llm-d-public-calendar" target="_blank" rel="noopener noreferrer" class="">join our community calls</a> (Wed 12:30pm ET), join the SIGs and come build with us!</p>]]></content>
        <author>
            <name>Robert Shaw</name>
            <email>robshaw@redhat.com</email>
            <uri>https://github.com/robertgshaw2-redhat</uri>
        </author>
        <author>
            <name>Clayton Coleman</name>
            <email>claytoncoleman@google.com</email>
            <uri>https://github.com/smarterclayton</uri>
        </author>
        <author>
            <name>Carlos Costa</name>
            <email>chcost@us.ibm.com</email>
            <uri>https://github.com/chcost</uri>
        </author>
        <category label="Releases" term="Releases"/>
        <category label="Announcements" term="Announcements"/>
        <category label="llm-d release news" term="llm-d release news"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[KV-Cache Wins You Can See: From Prefix Caching in vLLM to Distributed Scheduling with llm-d]]></title>
        <id>https://llm-d.ai/blog/kvcache-wins-you-can-see</id>
        <link href="https://llm-d.ai/blog/kvcache-wins-you-can-see"/>
        <updated>2025-09-24T09:00:00.000Z</updated>
        <summary type="html"><![CDATA[See how llm-d's precise KV-cache aware scheduling delivers 57x faster responses and 2x throughput in production distributed LLM inference benchmarks.]]></summary>
        <content type="html"><![CDATA[<p>The llm-d project provides a series of “well-lit paths” - tested, benchmarked solutions for deploying large language models in production. Our first path, <a class="" href="https://llm-d.ai/blog/intelligent-inference-scheduling-with-llm-d"><strong>Intelligent Inference Scheduling</strong></a>, established a baseline for AI-aware routing by balancing both cluster load and prefix-cache affinities. The default configuration for that path uses an <em>approximate</em> method for the latter, predicting cache locality based on request traffic.</p>
<p>This blog illuminates a more advanced and powerful path: <a class="" href="https://llm-d.ai/docs/guide/Installation/precise-prefix-cache-aware"><strong>precise prefix-cache aware scheduling</strong></a>.</p>
<p>We take a deep dive into the next generation of this feature, which moves beyond prediction and gives the scheduler direct introspection into distributed vLLM caches. This precision is key to maximizing cache hit rates and achieving a new level of performance and cost-efficiency in your distributed deployments.</p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>Blog key takeaways</div><div class="admonitionContent_BuS1"><ul>
<li class=""><strong>KV-cache hit rates directly impact your bottom line</strong>: With <strong>10x cost differences</strong> between cached and uncached tokens, cache efficiency isn't just a performance optimization — it's a fundamental cost and performance driver</li>
<li class=""><strong>This isn't theoretical</strong>: Real production workloads like conversational AI and agentic workflows naturally create the prefix-heavy patterns where this approach excels</li>
<li class=""><strong>vLLM's prefix caching breaks in distributed deployments</strong>: Standard load balancers scatter related requests across pods, destroying cache locality and forcing expensive re-computation</li>
<li class=""><strong>Precise prefix-cache aware scheduling delivers order-of-magnitude gains</strong>: Our benchmarks show <strong>57x faster response</strong> times and <strong>double the throughput</strong> on identical hardware</li>
</ul></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-most-important-metric-in-production-ai"><strong>The Most Important Metric in Production AI</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#the-most-important-metric-in-production-ai" class="hash-link" aria-label="Direct link to the-most-important-metric-in-production-ai" title="Direct link to the-most-important-metric-in-production-ai" translate="no">​</a></h2>
<p>In production LLM inference, we track dozens of metrics - latency, throughput, GPU utilization, and cost, to name a few. But one metric significantly stands out. As engineers building production-grade AI agents have noted:</p>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>Manus, <a href="https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus" target="_blank" rel="noopener noreferrer" class="">Context Engineering for AI Agents</a></div><div class="admonitionContent_BuS1"><p>"The <em>KV-cache hit rate is the single most important metric for a production-stage AI agent. It directly affects both latency and cost</em>."</p></div></div>
<p>This isn't just an academic claim; it has a direct and dramatic impact on the bottom line. Consider the pricing model for a state-of-the-art model like Anthropic's Claude Sonnet. The <a href="https://www.anthropic.com/pricing#api" target="_blank" rel="noopener noreferrer" class="">cost</a> for processing tokens that are already in the cache is <strong>10 times lower</strong> than for uncached tokens ($0.30 vs. $3.00 per million). The same pattern can be seen in OpenAI’s <a href="http://openai.com/api/pricing/" target="_blank" rel="noopener noreferrer" class="">API pricing</a> page. A high cache hit rate doesn't just make your application faster; it makes it <strong>fundamentally cheaper to operate</strong>. This is the power of the KV-cache.</p>
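<p>The arithmetic is worth making concrete. At the cited rates ($3.00 vs. $0.30 per million input tokens), a quick back-of-envelope calculation shows how strongly the hit rate drives cost:</p>

```python
# Back-of-envelope prompt cost at the Claude Sonnet rates cited above:
# $3.00 per million uncached input tokens, $0.30 per million cached.
UNCACHED_PER_M = 3.00
CACHED_PER_M = 0.30

def prompt_cost(tokens: int, hit_rate: float) -> float:
    """Dollar cost of processing `tokens` prompt tokens at a given cache hit rate."""
    cached = tokens * hit_rate
    uncached = tokens - cached
    return (uncached * UNCACHED_PER_M + cached * CACHED_PER_M) / 1_000_000

print(prompt_cost(100_000, 0.0))  # $0.30 per request with a cold cache
print(prompt_cost(100_000, 0.9))  # $0.057 at a 90% hit rate - over 5x cheaper
```

<p>At agentic-workload volumes, where the same long context is replayed on every turn, that multiple compounds across every request.</p>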
<p>In a single-instance environment, engines like vLLM leverage Automatic Prefix Caching to cut redundant work, reusing prior computations to drive faster, more efficient performance. However, the moment you scale to a distributed, multi-replica environment, these finely tuned optimizations can fall apart.</p>
<p>This blog explores that challenge: how the wins of vLLM's prefix caching are lost in naive distributed systems, and how llm-d's precise prefix-cache aware scheduling restores and enhances them. To fully grasp this, we first need to understand what makes vLLM so performant in a single instance. Let's dive in.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="inside-vllm-mastering-the-cache-in-a-single-instance"><strong>Inside vLLM: Mastering the Cache in a Single Instance</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#inside-vllm-mastering-the-cache-in-a-single-instance" class="hash-link" aria-label="Direct link to inside-vllm-mastering-the-cache-in-a-single-instance" title="Direct link to inside-vllm-mastering-the-cache-in-a-single-instance" translate="no">​</a></h2>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>For the Experts</div><div class="admonitionContent_BuS1"><p>Already know how vLLM uses the KV-cache and prefix caching to optimize inference? Feel free to skip right into the <a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#the-challenge-of-scale-out" class=""><strong>challenges of scale-out</strong></a>.</p></div></div>
<p>At the heart of every transformer model is the <strong>self-attention mechanism</strong> - how the model understands context by computing attention scores between every pair of tokens. This all-pairs comparison scales quadratically with input length, making the initial <strong>prefill</strong> computation the most expensive part of generation.</p>
<p>Prefill produces <strong>Key (K)</strong> and <strong>Value (V)</strong> tensors that are stored in the <strong>KV-cache</strong> - the model's short-term memory. For each subsequent token generated during <strong>decode</strong>, the model simply reads these existing values from memory rather than recomputing them.</p>
<p>vLLM takes this further with <strong>Automatic Prefix Caching</strong>: it intelligently identifies when requests share the same token sequence prefix. Instead of recomputing, it reuses the exact same memory pages from the cache through hash-based block matching. This principle of reusing computed work drives vLLM's performance:</p>
<ul>
<li class=""><strong>Time to First Token (TTFT)</strong> plummets because the expensive prefill step is mostly skipped</li>
<li class="">Overall <strong>throughput</strong> increases because the GPU is freed up to serve more requests</li>
</ul>
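<p>A toy sketch makes the block-matching mechanics concrete. The block size below matches vLLM's default, but the hashing scheme is simplified for illustration and is not vLLM's actual implementation:</p>

```python
import hashlib

BLOCK_SIZE = 16  # tokens per cache block (vLLM's default)

def block_hashes(token_ids):
    """Chained hashes: each block's hash covers the entire prefix up to it,
    so two sequences share a block hash only if they share the whole prefix."""
    hashes, prev = [], b""
    usable = len(token_ids) - len(token_ids) % BLOCK_SIZE  # full blocks only
    for i in range(0, usable, BLOCK_SIZE):
        digest = hashlib.sha256(prev + repr(token_ids[i:i + BLOCK_SIZE]).encode())
        hashes.append(digest.hexdigest())
        prev = hashes[-1].encode()
    return hashes

def cached_prefix_tokens(prompt_ids, cache):
    """How many leading tokens can skip prefill, given cached block hashes."""
    hit = 0
    for h in block_hashes(prompt_ids):
        if h not in cache:
            break  # chained hashing means the first miss ends the match
        hit += BLOCK_SIZE
    return hit

cache = set()
first = list(range(100))            # 100-token prompt: 6 full blocks
cache.update(block_hashes(first))   # prefill populates the cache
follow_up = first[:64] + [7] * 40   # shares the first 64 tokens, then diverges
print(cached_prefix_tokens(follow_up, cache))  # 64 tokens of prefill saved
```

Because the hashes are chained, a match can only ever be a true shared prefix: the moment two requests diverge, every later block hash differs as well.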
<p>In a simple test, sending the same ~10,000-token prompt to a Qwen/Qwen3-32B instance a second time drops time-to-first-token from <strong>4.3 seconds</strong> to just <strong>0.6 seconds</strong>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="prefix-reuse-in-practical-use-cases"><strong>Prefix Reuse in Practical Use Cases</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#prefix-reuse-in-practical-use-cases" class="hash-link" aria-label="Direct link to prefix-reuse-in-practical-use-cases" title="Direct link to prefix-reuse-in-practical-use-cases" translate="no">​</a></h2>
<p>The power of vLLM's caching isn't theoretical; it directly maps to the structure of the most common and valuable LLM workloads. By understanding this pattern, we can see exactly what's at stake when serving in production.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="conversational-ai"><strong>Conversational AI</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#conversational-ai" class="hash-link" aria-label="Direct link to conversational-ai" title="Direct link to conversational-ai" translate="no">​</a></h4>
<p>In any multi-turn dialogue, from a customer service bot to a long-form assistant, the entire chat history and system prompt form a massive <strong>prefix</strong>. Each new user message is a tiny <strong>suffix</strong>. Effective caching means only the latest turn is prefilled, keeping the conversation fluid and responsive and preventing latency from growing as the dialogue gets longer.</p>
<p><img decoding="async" loading="lazy" alt="Conversational AI prefix caching diagram" src="https://llm-d.ai/assets/images/image1-f56fc52a69a8c2874de96009de0ef719.webp" width="1672" height="666" class="img_ev3q"></p>
<small><em><strong>FIGURE 1</strong>: A diagram showing the conversational history as a growing prefix that gets cached, with only the new user query requiring prefill.</em></small>
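<p>The arithmetic behind this pattern is easy to check. With assumed sizes - a 500-token system prompt, 50-token user turns, 150-token replies, all invented for illustration - the cached share of each prompt keeps growing:</p>

```python
SYSTEM, USER_TURN, REPLY = 500, 50, 150   # assumed sizes, for illustration

cached_share = []
history = SYSTEM                  # system prompt; cached after the first prefill
for turn in range(1, 6):
    prompt_len = history + USER_TURN       # full context sent this turn
    cached_share.append(history / prompt_len)
    print(f"turn {turn}: prefill {USER_TURN} of {prompt_len} tokens "
          f"({cached_share[-1]:.0%} served from cache)")
    history = prompt_len + REPLY           # reply KV is written during decode
```

By turn 5, over 96% of the prompt is already resident in the cache - and the fraction only grows as the conversation continues.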
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="agentic-workflows"><strong>Agentic Workflows</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#agentic-workflows" class="hash-link" aria-label="Direct link to agentic-workflows" title="Direct link to agentic-workflows" translate="no">​</a></h4>
<p>AI agents represent the most extreme case of prefix dominance. These systems operate in reasoning loops where the prefix contains the agent's goals, tool definitions, and a long history of actions and observations. Production data shows this can lead to input-to-output ratios exceeding <strong>100:1</strong> <em>(from the Manus <a href="https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus" target="_blank" rel="noopener noreferrer" class="">blog</a>)</em>, making the prefix overwhelmingly large.</p>
<p><img decoding="async" loading="lazy" alt="Agentic workflow prefix caching diagram" src="https://llm-d.ai/assets/images/image2-4d74f6f25930749280d1e926bcbdbb49.webp" width="1158" height="670" class="img_ev3q"></p>
<small><em><strong>FIGURE 2</strong>: A visual of an agent loop, showing the massive, static context (tools, step-history) as the cached prefix and the new observation/action as the small suffix.</em></small>
<br>
<br>
<p>Reusing this massive context on each turn is essential for complex agents to be computationally viable and cost-effective.</p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>What about RAG?</div><div class="admonitionContent_BuS1"><p>While Retrieval-Augmented Generation also relies on large prefixes (system prompts + documents), reusing the KVs is more challenging. The exact documents and their order often change between queries, breaking simple prefix patterns. This requires a more sophisticated approach, which we will discuss briefly at the end of this post.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-challenge-of-scale-out"><strong>The Challenge of Scale-Out</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#the-challenge-of-scale-out" class="hash-link" aria-label="Direct link to the-challenge-of-scale-out" title="Direct link to the-challenge-of-scale-out" translate="no">​</a></h2>
<p>What happens when we move from a single-instance environment to distributed production clusters? The once-unified KV-cache becomes <strong>disaggregated</strong>. Each vLLM pod manages its own cache in complete isolation. Standard load balancers naively spread traffic evenly using cache-blind metrics, scattering related requests across different pods and destroying cache locality.</p>
<p>Let's revisit our agentic workflow example to see the direct impact of being blind to this unmanaged, disaggregated cache:</p>
<p><img decoding="async" loading="lazy" alt="KV-cache miss scenario diagram" src="https://llm-d.ai/assets/images/image3-fd643d27f8380909e5a26315118ce19e.webp" width="1368" height="472" class="img_ev3q"></p>
<small><em><strong>FIGURE 3</strong>: A heartbreaking KV-cache miss scenario.</em></small>
<br>
<br>
<p>This single routing decision triggers a cascade of failures:</p>
<ul>
<li class=""><strong>Cache Miss:</strong> The warm cache benefit on Pod A is completely lost</li>
<li class=""><strong>Duplicated Work:</strong> The most expensive computation is needlessly performed twice</li>
<li class=""><strong>Increased Latency:</strong> Users experience significantly higher Time to First Token (TTFT)</li>
<li class=""><strong>Wasted GPU Resources:</strong> Expensive hardware is tied up re-doing work instead of serving new requests, lowering overall system throughput</li>
</ul>
<p>In a production environment with thousands of concurrent requests, this isn't a rare event; it's the default behavior. The result is a system that is <strong>significantly</strong> <strong>slower</strong> and <strong>more expensive</strong> than it should be. This is the core challenge that <a href="https://github.com/llm-d/llm-d/blob/main/guides/precise-prefix-cache-aware/README.md" target="_blank" rel="noopener noreferrer" class="">llm-d's precise prefix-cache aware scheduling</a> is designed to solve.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="llm-d-precise-prefix-cache-aware-scheduling"><strong>llm-d: Precise Prefix-Cache Aware Scheduling</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#llm-d-precise-prefix-cache-aware-scheduling" class="hash-link" aria-label="Direct link to llm-d-precise-prefix-cache-aware-scheduling" title="Direct link to llm-d-precise-prefix-cache-aware-scheduling" translate="no">​</a></h2>
<p>We've just seen how scaling out a vLLM cluster naturally disaggregates the KV-cache, creating a distributed pool of memory that leads to costly cache misses. The solution, then, is to bridge the disaggregation. To restore the benefits of prefix caching, the scheduler needs a new sense: vision into the real-time state of the distributed cache.</p>
<p>This is precisely what llm-d provides (pun intended). It creates a <strong>global view</strong> of the cluster's KV-cache, allowing it to treat the disaggregated memory as a single, manageable pool and route requests with precision.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="how-it-works-a-global-cache-view-via-kvevents"><strong>How It Works: A Global Cache View via KVEvents</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#how-it-works-a-global-cache-view-via-kvevents" class="hash-link" aria-label="Direct link to how-it-works-a-global-cache-view-via-kvevents" title="Direct link to how-it-works-a-global-cache-view-via-kvevents" translate="no">​</a></h3>
<p>The global cache view is built upon a continuous stream of <a href="https://docs.vllm.ai/en/latest/api/vllm/config/kv_events.html" target="_blank" rel="noopener noreferrer" class=""><strong><code>KVEvents</code></strong></a> from each vLLM pod, which are processed efficiently by the open-source <a href="https://github.com/llm-d/llm-d-kv-cache" target="_blank" rel="noopener noreferrer" class=""><strong><code>llm-d-kv-cache</code></strong></a> library.</p>
<p>The <code>KVEvents</code> provide a live feed of all physical cache changes across the cluster, firing every time a cache block is created or evicted. This stream is then ingested and organized by the llm-d-kv-cache library's components:</p>
<ol>
<li class=""><strong><code>kvevents.Pool</code></strong>: This component consumes the high-throughput stream of events. As it digests them, it continuously updates a low-level <strong>KV-Block Index</strong>, which maintains a simple, real-time map of block-hashes to the pod and memory-medium (GPU/CPU) it resides on.</li>
<li class=""><strong><code>kvcache.Index</code></strong>: This is the higher-level index used by the scheduler. It uses the underlying KV-Block Index to map logical sequences of tokens (i.e., prefixes) to the pods that hold them. This provides the direct answer to the question, "what percentage of this request's prefix is on the accessible Pods?"</li>
</ol>
<p>This two-layered architecture provides a continuously updated, scalable view of the cluster's cache state, which is the key to enabling intelligent, cache-aware routing.</p>
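<p>In Python-flavored pseudocode (the real library is Go, and the event schema and class names here are invented for illustration), the two layers compose roughly like this:</p>

```python
from collections import defaultdict

class KVBlockIndex:
    """Low-level map: block hash -> set of (pod, medium) locations."""
    def __init__(self):
        self.locations = defaultdict(set)
    def apply_event(self, event):
        pod, medium = event["pod"], event["medium"]
        if event["type"] == "block_stored":
            for h in event["hashes"]:
                self.locations[h].add((pod, medium))
        elif event["type"] == "block_removed":
            for h in event["hashes"]:
                self.locations[h].discard((pod, medium))

class PrefixIndex:
    """Higher-level view: cached-prefix fraction per pod for one request."""
    def __init__(self, block_index):
        self.blocks = block_index
    def match(self, prefix_hashes, pods):
        scores = {}
        for pod in pods:
            hit = 0
            for h in prefix_hashes:   # hashes are chained: stop at first miss
                if not any(p == pod for p, _ in self.blocks.locations[h]):
                    break
                hit += 1
            scores[pod] = hit / len(prefix_hashes) if prefix_hashes else 0.0
        return scores

idx = KVBlockIndex()
idx.apply_event({"type": "block_stored", "pod": "pod-a", "medium": "gpu",
                 "hashes": ["h1", "h2", "h3"]})
idx.apply_event({"type": "block_stored", "pod": "pod-b", "medium": "gpu",
                 "hashes": ["h1"]})
prefix = PrefixIndex(idx)
print(prefix.match(["h1", "h2", "h3", "h4"], ["pod-a", "pod-b"]))
# pod-a holds 3 of the 4 prefix blocks, pod-b just 1
```

The write path (events flowing into the block index) and the read path (prefix lookups by the scheduler) are independent, which is what lets the index keep up with a high-throughput event stream.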
<p><img decoding="async" loading="lazy" alt="llm-d architecture diagram" src="https://llm-d.ai/assets/images/image4-15b3f914b5bdb03e4fe74b393fcadb5e.webp" width="1674" height="898" class="img_ev3q"></p>
<small><em><strong>FIGURE 4</strong>: Simplified architecture diagram. (1) - (3) show the read path, while (A) - (B) show the write pipeline.</em></small>
<br>
<br>
<p><strong>What about the overhead?</strong> The memory overhead for this global index is negligible - see <strong>Appendix A.3</strong> for the scaling analysis showing a <strong>1,000,000:1</strong> data-to-metadata ratio.</p>
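<p>The exact ratio depends on model shape and index layout - Appendix A.3 has the real analysis - but a back-of-envelope sketch with assumed model dimensions shows why the index is so cheap relative to the data it tracks:</p>

```python
# All dimensions below are assumptions for illustration, not Appendix A.3's.
BLOCK_SIZE = 16                       # tokens per KV block
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128
BYTES_PER_ELEM = 2                    # fp16/bf16

kv_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM  # K and V
kv_per_block = kv_per_token * BLOCK_SIZE
index_per_block = 64                  # one hash plus a few pod/medium entries

print(f"KV data per block: {kv_per_block / 2**20:.0f} MiB")
print(f"index metadata:    {index_per_block} B")
print(f"ratio:             {kv_per_block // index_per_block:,}:1")
```

Even under these rough assumptions the metadata is four to five orders of magnitude smaller than the KV data it describes, so indexing an entire cluster's cache costs only megabytes.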
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>High availability support</div><div class="admonitionContent_BuS1"><p>This design naturally supports active-active or active-passive deployments, with either full view duplication or sharding through configurations.</p></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-precise-prefix-cache-scorer"><strong>The Precise Prefix-Cache Scorer</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#the-precise-prefix-cache-scorer" class="hash-link" aria-label="Direct link to the-precise-prefix-cache-scorer" title="Direct link to the-precise-prefix-cache-scorer" translate="no">​</a></h3>
<p>With an accurate, real-time global view of the cache, the scheduler can now perform intelligent routing. The component responsible for this is the <strong>Precise Prefix-Cache Scorer</strong>. It sits within the scheduler and utilizes the <code>kvcache.Index</code> to perform a simple but critical task for every incoming request:</p>
<ol>
<li class="">It queries the <code>kvcache.Index</code> to determine what percentage of that prefix is already available on each active vLLM pod.</li>
<li class="">It outputs a "cache affinity score" for each pod, directly representing the computational work that can be saved.</li>
</ol>
<p>This scorer provides a strong <strong>stickiness</strong> signal, scheduling requests to maximize the probability of a cache hit. However, relying solely on stickiness can create new problems, like sending a stream of requests to an already overloaded pod while others sit idle.</p>
<p>Therefore, the final routing decision isn't based on this score alone. As detailed in our previous post on the <a class="" href="https://llm-d.ai/blog/intelligent-inference-scheduling-with-llm-d"><strong>Intelligent Inference Scheduling</strong></a> well-lit path, the KV-cache affinity score is combined with distributive, load-aware scores, creating a balanced decision.</p>
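<p>Conceptually, the combination is a weighted sum over scorers. The weights and scorer names below are illustrative, not the scheduler's actual configuration:</p>

```python
# Illustrative weights and scorer names; not the actual llm-d configuration.
WEIGHTS = {"prefix_affinity": 2.0, "queue": 1.0, "kv_util": 1.0}

def pick_pod(pods):
    def score(pod):
        return (WEIGHTS["prefix_affinity"] * pod["prefix_affinity"]
                + WEIGHTS["queue"] * (1.0 - pod["queue_fill"])    # prefer short queues
                + WEIGHTS["kv_util"] * (1.0 - pod["kv_util"]))    # prefer free cache
    return max(pods, key=score)["name"]

pods = [
    # Warm prefix, but completely saturated:
    {"name": "pod-a", "prefix_affinity": 0.9, "queue_fill": 1.0, "kv_util": 1.0},
    # Partial prefix match and mostly idle:
    {"name": "pod-b", "prefix_affinity": 0.5, "queue_fill": 0.2, "kv_util": 0.3},
]
print(pick_pod(pods))  # load scores outweigh pure stickiness: pod-b wins
```

When pod-a is healthy its warm cache wins the request; when it saturates, the load scorers steer traffic elsewhere, which is exactly the balance the blended decision is designed to strike.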
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="performance-results"><strong>Performance Results</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#performance-results" class="hash-link" aria-label="Direct link to performance-results" title="Direct link to performance-results" translate="no">​</a></h2>
<p>To validate this approach, we benchmarked four scheduling strategies on a cluster of <strong>8 vLLM pods (16 H100 GPUs total)</strong> using a realistic B2B workload simulating <strong>150 enterprise customers</strong> each with <strong>6,000-token contexts</strong>, with <strong>5 concurrent users per customer</strong> submitting <strong>1,200-token queries</strong> under sustained load from <strong>3-60 QPS</strong>.</p>
<p>The total KV-cache demand this workload generates is <strong>73% of the cluster’s capacity</strong>, which is <strong>six times</strong> more than any single pod could hold, forcing the system to distribute prefixes across the cluster - exactly where intelligent scheduling becomes essential.</p>
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>Benchmark details</div><div class="admonitionContent_BuS1"><p>See <strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#a1-benchmark-setup-details" class="">Appendix A.1</a></strong> and <strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#a2-workload-details---real-world-b2b-saas-scenario" class="">Appendix A.2</a></strong> for complete benchmark methodology and workload details.</p></div></div>
<p>The four strategies compared:</p>
<ul>
<li class=""><strong><code>random-scheduling</code></strong>: A naive scheduler, acting as the control group.</li>
<li class=""><strong><code>load-scheduling</code></strong>: A scheduler aware of only of load scorers: vLLM queueing + kv-cache-utilization</li>
<li class=""><strong><code>approximate-scheduling</code></strong>: The default configuration in the intelligent inference scheduling path, extending load-aware scheduling with the <a href="https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/prefix-aware/" target="_blank" rel="noopener noreferrer" class=""><strong>approximate</strong> prefix-cache scorer</a>.<!-- -->
<ul>
<li class="">This plugin builds an approximate-locality index based on routing history.</li>
</ul>
</li>
<li class=""><strong><code>precise-scheduling</code></strong>: The advanced well-lit path described in this post.</li>
</ul>
<p>This benchmark, therefore, tests the scheduler's ability to efficiently manage the disaggregated KV-cache. In a production environment, if the total cache demand were to exceed the cluster's capacity, an autoscaling system would be responsible for spinning up more replicas to maintain SLOs. Here, we focus on <strong>maximizing the performance of the existing hardware</strong>.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-results-a-leap-in-performance"><strong>The Results: A Leap in Performance</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#the-results-a-leap-in-performance" class="hash-link" aria-label="Direct link to the-results-a-leap-in-performance" title="Direct link to the-results-a-leap-in-performance" translate="no">​</a></h3>
<p>The summary table below shows the difference across the key performance indicators.</p>
<table><thead><tr><th style="text-align:left">Experiment</th><th style="text-align:left">Output toks/s</th><th style="text-align:left">TTFT p90 (s)</th><th style="text-align:left">TTFT mean (s)</th><th style="text-align:left">vLLM Wait Queue (mean)</th></tr></thead><tbody><tr><td style="text-align:left"><strong>precise-scheduling</strong></td><td style="text-align:left"><strong>8730.0</strong></td><td style="text-align:left"><strong>0.542</strong></td><td style="text-align:left"><strong>0.298</strong></td><td style="text-align:left"><strong>0.1</strong></td></tr><tr><td style="text-align:left">approximate-scheduling</td><td style="text-align:left">6944.4</td><td style="text-align:left">31.083</td><td style="text-align:left">13.316</td><td style="text-align:left">8.1</td></tr><tr><td style="text-align:left">load-scheduling</td><td style="text-align:left">4428.7</td><td style="text-align:left">94.865</td><td style="text-align:left">46.987</td><td style="text-align:left">28.9</td></tr><tr><td style="text-align:left">random-scheduling</td><td style="text-align:left">4428.7</td><td style="text-align:left">92.551</td><td style="text-align:left">45.281</td><td style="text-align:left">27.3</td></tr></tbody></table>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="time-to-first-token-ttft"><strong>Time to First Token (TTFT)</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#time-to-first-token-ttft" class="hash-link" aria-label="Direct link to time-to-first-token-ttft" title="Direct link to time-to-first-token-ttft" translate="no">​</a></h4>
<p>The most dramatic impact was on user-facing latency. <code>precise-scheduling</code> delivered a P90 TTFT of just <strong>0.542 seconds</strong>. In contrast, the approximate scheduler took over <strong>31 seconds</strong>, and the cache-blind schedulers took over <strong>90 seconds</strong>.</p>
<ul>
<li class=""><strong><code>precise-scheduling</code> is 57x faster than <code>approximate-scheduling</code>.</strong></li>
<li class=""><strong><code>precise-scheduling</code> is over 170x faster than <code>random-scheduling</code>.</strong></li>
</ul>
<p>This is the difference between an interactive experience and a system that is functionally unusable at scale.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="total-system-throughput"><strong>Total System Throughput</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#total-system-throughput" class="hash-link" aria-label="Direct link to total-system-throughput" title="Direct link to total-system-throughput" translate="no">​</a></h4>
<p>This efficiency in latency directly translates to higher system capacity. <code>precise-scheduling</code> achieved a total throughput of <strong>8,730 output tokens/second</strong>. This represents:</p>
<ul>
<li class="">A <strong>25% increase</strong> over the <strong><code>approximate-scheduling</code></strong> baseline.</li>
<li class="">Over <strong>double the throughput</strong> of the cache-blind configurations.</li>
</ul>
<p>This allows you to handle significantly more traffic on the exact same hardware, simply by eliminating the waste of cache misses.</p>
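<p>The headline factors can be checked directly against the summary table:</p>

```python
# Figures copied from the summary table above.
ttft_p90 = {"precise": 0.542, "approximate": 31.083, "random": 92.551}
tput = {"precise": 8730.0, "approximate": 6944.4, "random": 4428.7}

print(f"TTFT vs approximate: {ttft_p90['approximate'] / ttft_p90['precise']:.0f}x")  # ~57x
print(f"TTFT vs random:      {ttft_p90['random'] / ttft_p90['precise']:.0f}x")       # ~171x
print(f"throughput vs approximate: +{tput['precise'] / tput['approximate'] - 1:.1%}")
print(f"throughput vs random:      {tput['precise'] / tput['random']:.2f}x")
```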
<p><img decoding="async" loading="lazy" alt="Performance benchmark charts" src="https://llm-d.ai/assets/images/image5-4565767a57f55b90aa1f200d1cbaf09a.webp" width="1979" height="455" class="img_ev3q"></p>
<small><em><strong>FIGURE 5</strong>: A tri-panel of TTFT, TPoT and Throughput measured through progressively rising QPS rates.</em></small>
<br>
<br>
<p>The charts above clearly illustrate these wins. The blue line (<code>precise-scheduling</code>) maintains the lowest Mean TTFT and achieves the highest Total Throughput as the request rate increases.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-why-from-saved-work-to-system-throughput"><strong>The "Why": From Saved Work to System Throughput</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#the-why-from-saved-work-to-system-throughput" class="hash-link" aria-label="Direct link to the-why-from-saved-work-to-system-throughput" title="Direct link to the-why-from-saved-work-to-system-throughput" translate="no">​</a></h4>
<p>The dramatic performance gains seen in the benchmarks are a direct result of <strong>system efficiency</strong>, a difference that is immediately visible in the <strong>real-time Grafana metrics</strong>.</p>
<p>The following graphs were captured throughout the benchmark runs. Schedulers are shown in order: <code>precise-scheduling</code> <em>(left)</em>, <code>approximate-scheduling</code> <em>(center)</em>, and <code>random-scheduling</code> <em>(right)</em>.</p>
<h5 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-effective-cache-throughput-quantifying-saved-work"><strong>1. Effective Cache Throughput: Quantifying Saved Work</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#1-effective-cache-throughput-quantifying-saved-work" class="hash-link" aria-label="Direct link to 1-effective-cache-throughput-quantifying-saved-work" title="Direct link to 1-effective-cache-throughput-quantifying-saved-work" translate="no">​</a></h5>
<p>First, we measure the <strong>Effective Cache Throughput</strong> - the number of prompt <strong>tokens</strong> per second served directly from the cache. This metric quantifies the computational work the GPUs <em><strong>avoided</strong></em>. A high value means the system is consistently saving massive amounts of expensive prefill computation.</p>
<p><img decoding="async" loading="lazy" alt="Effective cache throughput metrics" src="https://llm-d.ai/assets/images/image6-5f3e088e2b67914bf09e6048acc7b97c.webp" width="1322" height="602" class="img_ev3q"></p>
<small><em><strong>FIGURE 6</strong>: The total computational work <strong>saved</strong> by the KV-cache across the cluster, over the course of the benchmarks.</em></small>
<br>
<br>
<p>The chart clearly shows that <code>precise-scheduling</code> sustains a massive and stable throughput of saved work by consistently hitting cached prefixes. In the middle, <code>approximate-scheduling</code> shows good but lower efficiency, and on the right, <code>random-scheduling</code> saves almost no work.</p>
<h5 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-system-state-the-consequence-of-efficiency"><strong>2. System State: The Consequence of Efficiency</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#2-system-state-the-consequence-of-efficiency" class="hash-link" aria-label="Direct link to 2-system-state-the-consequence-of-efficiency" title="Direct link to 2-system-state-the-consequence-of-efficiency" translate="no">​</a></h5>
<p>This saved work translates directly into system health. By avoiding prefill bottlenecks, the GPUs can focus on productive decoding. We can see this by comparing the number of "<strong>Waiting</strong>" requests (<strong>queued</strong>) and "<strong>Running</strong>" requests (<strong>in decode</strong>).</p>
<p><img decoding="async" loading="lazy" alt="vLLM waiting requests metrics" src="https://llm-d.ai/assets/images/image7-11d29248e610592d25359670ee7c378b.webp" width="1999" height="886" class="img_ev3q"><br>
<small><em><strong>FIGURE 7</strong>: The number of <strong>waiting requests</strong> in vLLM over the course of the benchmark.</em></small></p>
<p><img decoding="async" loading="lazy" alt="vLLM running requests metrics" src="https://llm-d.ai/assets/images/image8-8648ded7551815d27aeb8d00ff10bc16.webp" width="1999" height="886" class="img_ev3q"><br>
<small><em><strong>FIGURE 8</strong>: The number of <strong>running requests</strong> <strong>(decoding)</strong> in vLLM over the course of the benchmark.</em></small></p>
<p>The <strong><code>precise-scheduling</code></strong> plots on the left show a stable system. By effectively utilizing the disaggregated KV-cache, it maintains minimal waiting queues and maximizes the number of actively running requests. In contrast, the other schedulers are clearly overwhelmed; their growing waiting queues choke the system and prevent work from being done efficiently.</p>
<p>This instability is caused by <strong>"cache thrashing."</strong> Cache-blind schedulers constantly <strong>duplicate and evict</strong> the same prefixes across different pods, wasting GPU cycles on <strong>redundant prefill</strong>. <code>precise-scheduling</code> avoids this entirely: it knows exactly where each prefix lives and consistently routes requests toward cache hits - as long as load allows - resulting in less work, virtually no queues, and a healthy system.</p>
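<p>A tiny simulation illustrates the thrashing dynamic. Two pods, each with room for two cached prefixes, serve four recurring customer prefixes; the cache sizes and routing policies are invented purely for illustration:</p>

```python
import random
from collections import OrderedDict

class PodCache:
    """Tiny LRU standing in for one pod's KV-block pool."""
    def __init__(self, capacity):
        self.blocks, self.capacity, self.misses = OrderedDict(), capacity, 0
    def access(self, prefix):
        if prefix in self.blocks:
            self.blocks.move_to_end(prefix)        # cache hit: refresh LRU order
        else:
            self.misses += 1                       # miss: redundant prefill
            self.blocks[prefix] = True
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)    # evict the coldest prefix

def run(route):
    pods = [PodCache(2), PodCache(2)]
    prefixes = ["cust-a", "cust-b", "cust-c", "cust-d"]
    for step in range(200):
        prefix = prefixes[step % len(prefixes)]
        pods[route(prefix)].access(prefix)
    return sum(p.misses for p in pods)

random.seed(0)
blind = lambda prefix: random.randrange(2)             # cache-blind routing
sticky_map = {"cust-a": 0, "cust-b": 0, "cust-c": 1, "cust-d": 1}
sticky = lambda prefix: sticky_map[prefix]             # prefix-aware routing

print("cache-blind misses: ", run(blind))
print("prefix-aware misses:", run(sticky))             # only 4 compulsory misses
```

Prefix-aware routing pays only the four compulsory first-visit misses, while blind routing keeps duplicating and evicting the same prefixes on both pods for the entire run.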
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>Session-Based Scheduling</div><div class="admonitionContent_BuS1"><p>Session-based scheduling provides affinity for individual users but misses cross-user scenarios. In our benchmark with <strong>150 enterprise customers</strong> each having <strong>6,000-token system prompts</strong>, session-scheduling would create 750 separate sessions but miss cross-user cache reuse within customer groups, leaving the majority of computational work uncaptured. Precise prefix-cache aware scheduling guarantees <strong>maximal reuse</strong> across the system.</p></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="adoption"><strong>Adoption</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#adoption" class="hash-link" aria-label="Direct link to adoption" title="Direct link to adoption" translate="no">​</a></h3>
<p>The dramatic performance improvements demonstrated in these benchmarks are what drive real-world adoption.</p>
<p><strong>Alibaba Cloud</strong>, for instance, is integrating this precise routing strategy into its <strong>Alibaba Cloud Container Service for Kubernetes (ACK) Gateway with Inference Extension (GIE)</strong>. To further enhance its production deployments for models like <strong>Qwen</strong> and <strong>DeepSeek</strong>, the team is developing a disaggregated tokenization service to support complementary features, with the goal of contributing this work back to the llm-d community. The end-to-end capabilities are already being validated in customer simulation environments.</p>
<p>This same potential has led <strong>DaoCloud</strong> to enhance its <strong>d.run</strong> <strong>Model-as-a-Service (MaaS)</strong> platform to accelerate inference for <strong>DeepSeek</strong> and other advanced models, adopting distributed inference with P/D disaggregation and advanced KV-cache architectures via <strong>Kubernetes</strong>, <strong>vLLM</strong>, and <strong>llm-d</strong>. Kay Yan emphasizes: "intelligent KV-cache management enables more adaptive and cost-effective inference architectures."</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="whats-next-extending-the-cache-aware-paradigm"><strong>What's Next: Extending the Cache-Aware Paradigm</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#whats-next-extending-the-cache-aware-paradigm" class="hash-link" aria-label="Direct link to whats-next-extending-the-cache-aware-paradigm" title="Direct link to whats-next-extending-the-cache-aware-paradigm" translate="no">​</a></h2>
<p>Precise prefix-aware scheduling is a massive step forward, but it's just one part of a broader, cache-centric vision for inference. The llm-d project is rapidly evolving, with several exciting directions on the horizon:</p>
<ul>
<li class=""><strong>Enhanced CPU Offloading:</strong> For KV-cache pools at larger scales, we are deepening the native vLLM integration for CPU offloading. This will allow for massive cache pools that intelligently tier between GPU VRAM and cheaper CPU RAM, with the scheduler making latency-aware decisions.</li>
<li class=""><strong>KV-Cache-Fusion for RAG:</strong> As noted earlier, RAG workloads present a unique challenge because retrieved documents can appear in different orders, breaking simple prefix patterns. The next frontier is position-independent KV-fusion, a technique enabling flexible and powerful cache reuse across varied RAG queries. This will come hand-in-hand with <strong>large-scale storage offloading</strong>.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="conclusion"><strong>Conclusion</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#conclusion" class="hash-link" aria-label="Direct link to conclusion" title="Direct link to conclusion" translate="no">​</a></h2>
<p>The journey of llm-d reflects a broader shift in how we think about LLM inference - not as a set of stateless function calls, but as a dynamic, stateful orchestration problem. The benchmark data is clear: <strong>prefix-cache aware scheduling is not just an optimization; it is essential for production performance and cost efficiency.</strong></p>
<p>By moving from AI-blind routing to a precise, KV-cache aware strategy, <strong>we unlock order-of-magnitude improvements in latency and throughput on the same hardware</strong>. The well-lit path of precise prefix-cache awareness offers a tested, benchmarked solution to make your distributed deployments dramatically more efficient.</p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>Choosing the Right Strategy</div><div class="admonitionContent_BuS1"><p>The optimal scheduler depends on the complexity of the workload. Below is a hierarchy of supported strategies, where each level addresses the limitations of the one before it.</p><ul>
<li class="">
<p><strong>1. Random/Round-Robin Scheduling</strong>: this simple approach works well for symmetric workloads where all requests have similar computational costs and minimal cache reuse.</p>
</li>
<li class="">
<p><strong>2. Load-Aware Scheduling</strong>: the necessary next step for asymmetric workloads. By routing requests based on Pod serving capacity, it prevents overload and improves resource utilization.</p>
</li>
<li class="">
<p><strong>3. Approximate Prefix-Cache Scheduling</strong>: this strategy introduces cache-awareness for workloads with context reuse patterns.</p>
<ul>
<li class=""><strong>Limitation:</strong> The estimations can become unreliable at high scale or with dynamic workloads, leading to suboptimal routing - as seen in our benchmarks.</li>
</ul>
</li>
<li class="">
<p><strong>4. Precise Prefix-Cache Aware Scheduling</strong>: the most effective strategy for production environments with tight SLOs and for dynamic, high-scale workloads where maximizing the cache-hit ratio is a primary performance driver.</p>
</li>
</ul></div></div>
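<p>As an illustration, the hierarchy above can be condensed into a toy scorer. The following is a minimal Python sketch with made-up pod fields and weights, not the llm-d scheduler API: level 1 ignores all state, level 2 penalizes load, and levels 3-4 add a prefix-hit term (precise and approximate scheduling differ only in how that hit rate is obtained).</p>

```python
import random

# Hypothetical pod state; field names and values are illustrative,
# not the llm-d or Inference Gateway API.
pods = [
    {"name": "pod-0", "queue_depth": 1, "kv_util": 0.40, "prefix_hit": 0.90},
    {"name": "pod-1", "queue_depth": 0, "kv_util": 0.10, "prefix_hit": 0.00},
    {"name": "pod-2", "queue_depth": 5, "kv_util": 0.85, "prefix_hit": 0.10},
]

def random_pick(pods):                      # Level 1: load- and cache-blind
    return random.choice(pods)["name"]

def load_score(p):                          # Level 2: prefer lightly loaded pods
    return -(p["queue_depth"] + p["kv_util"])

def combined_score(p, w_prefix=2.0, w_load=1.0):
    # Levels 3-4: add cache-awareness. Precise scheduling differs from
    # approximate only in how prefix_hit is computed (exact KV-block
    # index lookup vs. an estimate).
    return w_prefix * p["prefix_hit"] + w_load * load_score(p)

best = max(pods, key=combined_score)
print(best["name"])  # pod-0: its cache hit outweighs its modest load
```

<p>With these (arbitrary) weights, the cache-hit pod wins despite not being the least loaded - the trade-off that levels 3 and 4 are designed to make.</p>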
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="get-involved-with-llm-d"><strong>Get Involved with llm-d</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#get-involved-with-llm-d" class="hash-link" aria-label="Direct link to get-involved-with-llm-d" title="Direct link to get-involved-with-llm-d" translate="no">​</a></h2>
<p>The llm-d project thrives on community contributions, and there are many ways to get involved:</p>
<ul>
<li class="">Explore the llm-d Community Quickstart Guide → <a class="" href="https://llm-d.ai/docs/community">Start here</a> to learn more about getting involved in the llm-d project.</li>
<li class="">Join our Slack → <a class="" href="https://llm-d.ai/slack">Get your invite</a> and connect with maintainers and contributors</li>
<li class="">Explore the code → Browse our <a href="https://github.com/llm-d" target="_blank" rel="noopener noreferrer" class="">GitHub organization</a> and find issues that interest you</li>
<li class="">Attend meetings → All meetings are open! Add our <a class="" href="https://llm-d.ai/docs/community#public-meeting-calendar">public calendar</a> and join discussions</li>
</ul>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="appendix"><strong>Appendix</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#appendix" class="hash-link" aria-label="Direct link to appendix" title="Direct link to appendix" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="a1-benchmark-setup-details"><strong>A.1: Benchmark Setup Details</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#a1-benchmark-setup-details" class="hash-link" aria-label="Direct link to a1-benchmark-setup-details" title="Direct link to a1-benchmark-setup-details" translate="no">​</a></h3>
<ul>
<li class=""><strong>Model</strong>: <code>Qwen/Qwen-32B</code>.</li>
<li class=""><strong>Hardware</strong>: A cluster of <strong>8</strong> vLLM pods, each running on <strong>2 NVIDIA H100 GPUs</strong> (16 total).<!-- -->
<ul>
<li class="">Each instance holds a 307,328-token KV-cache.</li>
</ul>
</li>
<li class=""><strong>Schedulers Compared</strong>:<!-- -->
<ul>
<li class=""><strong><code>random-scheduling</code></strong>: A naive scheduler, acting as the control group.</li>
<li class=""><strong><code>load-scheduling</code></strong>: A scheduler aware only of load scorers: vLLM queue depth + KV-cache utilization.</li>
<li class=""><strong><code>approximate-scheduling</code></strong>: The baseline intelligent scheduler extending load-scheduling with the approximate prefix-cache scorer.</li>
<li class=""><strong><code>precise-scheduling</code></strong>: The advanced well-lit path described in this post.</li>
</ul>
</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="a2-workload-details---real-world-b2b-saas-scenario"><strong>A.2: Workload Details - Real-World B2B SaaS Scenario</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#a2-workload-details---real-world-b2b-saas-scenario" class="hash-link" aria-label="Direct link to a2-workload-details---real-world-b2b-saas-scenario" title="Direct link to a2-workload-details---real-world-b2b-saas-scenario" translate="no">​</a></h3>
<p>The benchmark was designed to simulate a high-value, multi-tenant B2B application under a heavy, sustained load. Imagine a platform providing a specialized AI assistant to a large number of enterprise customers.</p>
<ul>
<li class="">There are <strong>150 distinct enterprise customers</strong> (groups) using the platform simultaneously.</li>
<li class="">Each customer has a unique and substantial context of <strong>6,000 tokens</strong>. This can be thought of as their company's internal knowledge base or a detailed set of instructions, forming the valuable <strong>shared prefix</strong>.</li>
<li class="">For each customer, <strong>5 employees</strong> are concurrently interacting with the assistant, submitting their own unique questions of <strong>1,200 tokens</strong> each.</li>
<li class="">The system is stressed by a continuous Poisson arrival process of requests, ramping up from 3 QPS to a demanding <strong>60 QPS</strong> to simulate peak business hours.</li>
</ul>
<p>For this workload, in an ideal state, caching the shared prefixes for all active customers requires <strong>~73%</strong> of the cluster's total KV-cache capacity. This is nearly <strong>six times larger</strong> than any single pod's individual capacity (~12.5%). This makes it impossible for one replica to handle the load and forces the scheduler to intelligently distribute the prefixes across the entire cluster.</p>
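<p>The quoted figures can be reproduced with a quick back-of-the-envelope calculation, assuming the cached working set counts each customer's shared prefix plus the five concurrent 1,200-token questions:</p>

```python
# Illustrative arithmetic only; numbers are taken from the workload
# description above.
customers = 150
prefix_tokens = 6_000            # shared per-customer context
employees = 5
question_tokens = 1_200          # unique per-employee question

pods = 8
pod_kv_capacity = 307_328        # tokens of KV-cache per pod

# Tokens resident in cache when every customer's prefix and all
# concurrent questions are held simultaneously.
working_set = customers * (prefix_tokens + employees * question_tokens)
cluster_capacity = pods * pod_kv_capacity

print(f"working set: {working_set:,} tokens")                        # 1,800,000
print(f"fraction of cluster: {working_set / cluster_capacity:.0%}")  # 73%
print(f"vs. one pod: {working_set / pod_kv_capacity:.1f}x")          # 5.9x
```

<p>The working set is ~73% of the cluster's total KV-cache and nearly six times any single pod's capacity, matching the figures above.</p>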
<p>This benchmark, therefore, tests the scheduler's ability to efficiently manage the disaggregated KV-cache. In a real-world scenario, if the total cache demand were to exceed the cluster's capacity, an autoscaling system would be responsible for spinning up more replicas to maintain SLOs. Here, we focus on <strong>maximizing the performance of the existing hardware</strong> - a task where cache-blind configurations create massive queues and high latency.</p>
<p>The tools and specifics of the experiment are captured in this <a href="https://github.com/llm-d/llm-d-kv-cache/blob/main/benchmarking/73-capacity/README.md" target="_blank" rel="noopener noreferrer" class="">llm-d-kv-cache benchmarking report</a>.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="a3-indexing-scale-analysis"><strong>A.3: Indexing Scale Analysis</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#a3-indexing-scale-analysis" class="hash-link" aria-label="Direct link to a3-indexing-scale-analysis" title="Direct link to a3-indexing-scale-analysis" translate="no">​</a></h3>
<p>The overhead for this global bookkeeping is that of storing KV-block hashes, and not the massive KV tensors themselves. Let's consider this vLLM example: <a href="https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3.html#installing-vllm" target="_blank" rel="noopener noreferrer" class="">Running DeepSeek-R1 with FP8 on 8xH200</a>.</p>
<p>A full KV-cache for the <strong>DeepSeek R1</strong> model running with FP8 on <strong>8x</strong> <strong>NVIDIA H200 GPUs</strong>, with a total of <strong>45.7 GB * 8 = 365 GB VRAM</strong> dedicated to the KV-cache pool, would be composed of tens of thousands of individual memory blocks. Each block, representing <strong>128 tokens</strong>, consumes about <strong>8.6 MB</strong> of VRAM. However, the metadata needed to track each of these blocks in the global index is just a single 64-bit hash - <strong>8 bytes</strong>. The overhead of the data structure that organizes these hashes is trivial when implemented efficiently.</p>
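<p>The arithmetic behind these figures is easy to reproduce (illustrative numbers taken directly from the paragraph above):</p>

```python
# Back-of-the-envelope index-size estimate; illustrative arithmetic only.
kv_pool_bytes = 365 * 1024**3        # ~365 GB of KV-cache across 8x H200
block_tokens = 128
block_bytes = 8.6 * 1024**2          # ~8.6 MB of VRAM per 128-token block
hash_bytes = 8                       # one 64-bit hash per block in the index

num_blocks = kv_pool_bytes / block_bytes
index_bytes = num_blocks * hash_bytes

print(f"blocks tracked: {num_blocks:,.0f}")
print(f"index size: {index_bytes / 1024:,.0f} KB")   # ~340 KB
print(f"data-to-metadata ratio: {kv_pool_bytes / index_bytes:,.0f}:1")
```

<p>The index footprint works out to a few hundred kilobytes against hundreds of gigabytes of tracked VRAM - a ratio of over a million to one.</p>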
<p>This means that managing the entire <strong>365 GB</strong> cache pool requires only about <strong>339 KB</strong> of memory for the scheduler's index - a data-to-metadata ratio of over <strong>1,000,000-to-1</strong>. The index's memory footprint is several orders of magnitude smaller than the GPU VRAM it tracks, making this a highly efficient, low-overhead solution.</p>]]></content>
        <author>
            <name>Maroon Ayoub</name>
            <uri>https://www.linkedin.com/in/v-maroon/</uri>
        </author>
        <author>
            <name>Danny Harnik</name>
            <uri>https://www.linkedin.com/in/danny-harnik-19a95436/</uri>
        </author>
        <author>
            <name>Tyler Smith</name>
            <uri>https://www.linkedin.com/in/tyler-michael-smith-017b28102/</uri>
        </author>
        <author>
            <name>Kellen Swain</name>
            <uri>https://www.linkedin.com/in/kellen-swain/</uri>
        </author>
        <author>
            <name>Xining Wang</name>
        </author>
        <author>
            <name>Hang Yin</name>
        </author>
        <author>
            <name>Kay Yan</name>
            <uri>https://www.linkedin.com/in/yankay/</uri>
        </author>
        <category label="blog posts" term="blog posts"/>
        <category label="Updates" term="Updates"/>
        <category label="llm-d release news" term="llm-d release news"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Intelligent Inference Scheduling with llm-d]]></title>
        <id>https://llm-d.ai/blog/intelligent-inference-scheduling-with-llm-d</id>
        <link href="https://llm-d.ai/blog/intelligent-inference-scheduling-with-llm-d"/>
        <updated>2025-09-03T09:00:00.000Z</updated>
        <summary type="html"><![CDATA[Learn how llm-d's intelligent inference scheduling uses prefix-aware, load-balanced routing to maximize LLM throughput and minimize latency on Kubernetes.]]></summary>
        <content type="html"><![CDATA[<p>The llm-d project lays out clear, “well-lit” paths for anyone to adopt the leading inference optimizations within their existing deployment framework - Kubernetes. These are tested approaches designed to make complex deployments easier and more efficient. In this post, we explore the first of these paths: <strong>intelligent inference scheduling</strong>. Unlike basic round-robin load balancing, this method takes the unique demands of LLMs into account, leading to better performance across the board: higher throughput, lower latency, and efficient use of resources.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-intelligent-inference-is-needed-for-llm-inference">Why Intelligent Inference Is Needed for LLM Inference<a href="https://llm-d.ai/blog/intelligent-inference-scheduling-with-llm-d#why-intelligent-inference-is-needed-for-llm-inference" class="hash-link" aria-label="Direct link to Why Intelligent Inference Is Needed for LLM Inference" title="Direct link to Why Intelligent Inference Is Needed for LLM Inference" translate="no">​</a></h3>
<p>Deploying large language models (LLMs) on Kubernetes has become the norm, but LLM inference workloads behave very differently from standard microservices. Traditional patterns like uniform replicas paired with round-robin load balancing assume each request uses the same amount of resources and finishes in roughly the same time. In contrast, LLM requests can vary wildly in token count and compute needs, making simple load-spread strategies prone to bottlenecks and imbalanced traffic.</p>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/inference-scheduling/image01.webp" alt="Intelligent inference scheduling diagram" style="width:75%;height:auto"></div>
<p>LLM inference pipelines also consist of two distinct phases - a compute-bound prefill stage and a memory-bound decode stage - that have fundamentally different resource profiles. Without specialization, every replica must handle both phases, leading to wasted GPU cycles or memory bandwidth. At the same time, many LLM use cases involve multi-turn chats or agentic flows where cached prefix computations dramatically speed up response times if the request is routed back to the same instance.</p>
<p>On top of these challenges, LLM endpoints often serve a spectrum of quality-of-service needs. Interactive tasks like code completion demand millisecond-level latency, chat agents can tolerate a few seconds, and batch jobs might take minutes or more. Satisfying tight latency SLOs for expensive inference calls can be prohibitively costly if every pod is treated identically.<br>
<!-- -->To address these unique demands, an intelligent inference scheduler that understands both the shape of incoming requests and the real-time state of your cluster can boost throughput, slash tail latencies, and maximize GPU resource utilization.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="recap-inference-serving-in-kubernetes-the-gateway-api-and-inference-gateway-extension">Recap: Inference Serving in Kubernetes, the Gateway API and Inference Gateway Extension<a href="https://llm-d.ai/blog/intelligent-inference-scheduling-with-llm-d#recap-inference-serving-in-kubernetes-the-gateway-api-and-inference-gateway-extension" class="hash-link" aria-label="Direct link to Recap: Inference Serving in Kubernetes, the Gateway API and Inference Gateway Extension" title="Direct link to Recap: Inference Serving in Kubernetes, the Gateway API and Inference Gateway Extension" translate="no">​</a></h3>
<p>Kubernetes Services paired with Deployments and standard load balancing distribute traffic evenly across identical replicas. That model works well for stateless microservices with uniform, short-lived requests. But as we saw earlier, LLM inference calls vary wildly in compute intensity, benefit from stateful routing (e.g., prefix caches), and demand tight tail-latency control - none of which vanilla load balancing handles well.</p>
<p>The Gateway API modernizes Kubernetes networking by offering a CRD-based, L7 routing framework that replaces and extends traditional Ingress. It gives you fine-grained route definitions, pluggable data planes, and native compatibility with multi-cluster or cross-team routing policies. Yet on its own, the Gateway API lacks any notion of LLM inference serving based on inference-specific characteristics and metrics.</p>
<p>To bridge that gap, the Gateway API Inference Extension project introduces the Inference Gateway (IGW). IGW reuses Gateway API‘s core primitives but adds new CRDs - most notably <strong>InferencePool</strong> - to represent collections of model-serving pods. InferencePools can carry additional metadata such as base model, accelerator type, and runtime capabilities. Gateways then invoke a pluggable <strong>EndpointPicker (EPP)</strong> to perform “smart” load balancing, leveraging Envoy’s External Processing (ext-proc) to steer traffic to the right inference endpoint.</p>
<p>The default EPP in IGW follows a structured scheduling cycle for each incoming request:</p>
<ul>
<li class=""><strong>Endpoint discovery:</strong> Enumerate all InferencePool pods and gather their metadata (waiting queue state, loaded models, cache contents, etc.).</li>
<li class=""><strong>Filtering:</strong> Exclude pods that can’t serve the request due to overload, incompatible resources, or memory pressure.</li>
<li class=""><strong>Scoring:</strong> Assign each remaining pod a score via extensible scorers - evaluating factors like queue depth, session affinity, prefix cache hits, and custom SLO indicators.</li>
<li class=""><strong>Selection:</strong> Pick appropriate endpoints, with built-in tie-breaking and fallback logic.</li>
</ul>
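<p>The four-step cycle above can be sketched schematically. This is an illustrative Python toy, not the actual EPP implementation or IGW API; the pod fields, overload threshold, and scorer weights are all assumptions:</p>

```python
# Schematic sketch of the EPP scheduling cycle: discover -> filter -> score -> select.
def schedule(request, pods):
    # 1. Endpoint discovery: `pods` is the enumerated InferencePool state.
    # 2. Filtering: drop pods that cannot serve the request
    #    (wrong model, or overloaded past an assumed queue threshold).
    candidates = [p for p in pods
                  if request["model"] in p["models"] and p["queue_depth"] < 10]
    if not candidates:
        return None
    # 3. Scoring: each scorer contributes; weights are operator-configurable.
    def score(p):
        return (2.0 * p["prefix_hit"]      # prefix-cache scorer
                - 1.0 * p["queue_depth"]   # queue-depth scorer
                - 1.0 * p["kv_util"])      # KV-utilization scorer
    # 4. Selection: highest score wins; ties broken by iteration order.
    return max(candidates, key=score)["name"]

pods = [
    {"name": "pod-a", "models": {"llama-3"}, "queue_depth": 1,
     "kv_util": 0.3, "prefix_hit": 0.8},
    {"name": "pod-b", "models": {"llama-3"}, "queue_depth": 12,
     "kv_util": 0.9, "prefix_hit": 0.9},
]
print(schedule({"model": "llama-3"}, pods))  # pod-a (pod-b filtered on queue depth)
```

<p>Note how filtering removes the better cache match here: a pod past its overload threshold never reaches the scoring step at all.</p>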
<p>Building on IGW’s foundation, <strong>llm-d</strong> <strong>augments the EPP with more advanced scheduling capabilities</strong>. It introduces scorers that optimize for KV cache locality (boosting prefix-cache hit rates) and orchestrates multiple scheduling passes to disaggregate prefill and decode phases onto specialized pod variants. The result is a fully LLM-aware scheduler that drives higher throughput, lower tail latencies, and finer resource efficiency across the board.</p>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/inference-scheduling/image02.webp" alt="Diagram" style="width:75%;height:auto"></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="intelligent-inference-scheduling-with-llm-d">Intelligent Inference Scheduling with llm-d<a href="https://llm-d.ai/blog/intelligent-inference-scheduling-with-llm-d#intelligent-inference-scheduling-with-llm-d" class="hash-link" aria-label="Direct link to Intelligent Inference Scheduling with llm-d" title="Direct link to Intelligent Inference Scheduling with llm-d" translate="no">​</a></h3>
<p>A key differentiator of llm-d is the ability to plug in configurable, AI-aware scorers into the inference gateway scheduling pipeline. These scorers go beyond generic load balancing by factoring in LLM-specific workload characteristics such as token count variability, compute/memory phase differences, and KV-cache locality - when deciding where each request should run.</p>
<p>LLM workloads are not uniform. Some use cases - like multi-turn conversations, RAG pipelines, or agentic flows - naturally lead to <strong>high prefix reuse</strong>, where requests repeatedly share large portions of the prompt. Others - like diverse batch inference jobs or single-shot completions - exhibit <strong>low prefix sharing</strong>, where cache hits are rare and every request is essentially unique.</p>
<p>Because of this diversity, llm-d’s pluggable, AI-aware scorers allow operators to tailor scheduling strategies to workload profiles. We evaluated two configurations:</p>
<ul>
<li class=""><strong>Prefix-only scorer</strong> – routes to maximize KV-cache hits.</li>
<li class=""><strong>Prefix + Load scorer</strong> – adds dynamic load-awareness while still exploiting cache opportunities.</li>
</ul>
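<p>A toy simulation illustrates the difference between the two configurations. Everything here - the scorer weights, pod fields, and tie-breaking - is an illustrative assumption rather than llm-d's implementation: with prefix scoring alone, every request sharing a hot prefix lands on the same replica, while adding a load penalty lets traffic spill over once that replica saturates.</p>

```python
# Toy comparison of prefix-only vs. prefix+load routing.
def route(request_prefix, pods, use_load=True):
    def score(p):
        hit = 1.0 if request_prefix in p["cached_prefixes"] else 0.0
        load_penalty = p["inflight"] if use_load else 0
        return 3.0 * hit - load_penalty
    best = max(pods, key=score)
    best["inflight"] += 1          # the request is dispatched there
    return best["name"]

def run(use_load):
    # Four pods; only pod-0 has tenant-A's prefix cached.
    pods = [{"name": f"pod-{i}",
             "cached_prefixes": {"tenant-A"} if i == 0 else set(),
             "inflight": 0} for i in range(4)]
    # Twelve requests, all sharing tenant-A's prefix.
    picks = [route("tenant-A", pods, use_load) for _ in range(12)]
    return picks.count("pod-0")

print(run(use_load=False))  # prefix-only: all 12 requests pile onto pod-0
print(run(use_load=True))   # prefix+load: spills over once pod-0's score drops
```

<p>This is the mechanism behind the benchmark results below: prefix affinity is valuable, but without a load term it concentrates traffic until the favored replica becomes the bottleneck.</p>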
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-ai-aware-scorers-win"><strong>Why AI-Aware Scorers Win</strong><a href="https://llm-d.ai/blog/intelligent-inference-scheduling-with-llm-d#why-ai-aware-scorers-win" class="hash-link" aria-label="Direct link to why-ai-aware-scorers-win" title="Direct link to why-ai-aware-scorers-win" translate="no">​</a></h4>
<p>The following benchmarks show how performance evolves as cache opportunities range from abundant to minimal, and they illustrate an important point: <strong>the optimal scheduling strategy depends on the workload profile</strong>.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="high-prefix-sharing-workload"><strong>High Prefix Sharing Workload</strong><a href="https://llm-d.ai/blog/intelligent-inference-scheduling-with-llm-d#high-prefix-sharing-workload" class="hash-link" aria-label="Direct link to high-prefix-sharing-workload" title="Direct link to high-prefix-sharing-workload" translate="no">​</a></h4>
<p>When cache locality is abundant, the results are dramatic:</p>
<ul>
<li class="">
<p><strong>Success rate:</strong> The prefix-only scorer frequently overloaded replicas, succeeding in only ~55% of requests, while Prefix + Load maintained 100% success across all QPS levels.</p>
</li>
<li class="">
<p><strong>Time to First Token (TTFT):</strong> Prefix + Load kept TTFT consistently near-zero, while Prefix-only degraded rapidly, exceeding 140s at high QPS.</p>
</li>
<li class="">
<p><strong>Inter-Token Latency (ITL):</strong> Prefix + Load achieved ITL of ~30ms, versus ~160ms with Prefix-only — more than 5× improvement in responsiveness.</p>
</li>
<li class="">
<p><strong>Throughput:</strong> Prefix + Load scaled linearly with QPS, reaching ~60k tokens/sec at 20 QPS. Prefix-only flatlined near 2k–3k tokens/sec.</p>
</li>
</ul>
<div style="margin:20px 0"><div style="margin-bottom:20px"><img src="https://llm-d.ai/img/blogs/inference-scheduling/image03.webp" alt="Throughput vs Request Rate" style="width:100%;height:auto"><p style="text-align:center;font-size:0.9em;margin-top:8px"><em>Throughput vs Request Rate</em></p></div><div style="display:grid;grid-template-columns:1fr 1fr 1fr;gap:15px;align-items:start"><div style="display:flex;flex-direction:column;justify-content:center;height:100%"><img src="https://llm-d.ai/img/blogs/inference-scheduling/image04.webp" alt="Success Rate" style="width:100%;height:auto"><p style="text-align:center;font-size:0.85em;margin-top:6px"><em>Success Rate</em></p></div><div style="display:flex;flex-direction:column;justify-content:center;height:100%"><img src="https://llm-d.ai/img/blogs/inference-scheduling/image05.webp" alt="TTFT and QPS" style="width:100%;height:auto"><p style="text-align:center;font-size:0.85em;margin-top:6px"><em>TTFT and QPS</em></p></div><div style="display:flex;flex-direction:column;justify-content:center;height:100%"><img src="https://llm-d.ai/img/blogs/inference-scheduling/image06.webp" alt="Intertoken Latency" style="width:100%;height:auto"><p style="text-align:center;font-size:0.85em;margin-top:6px"><em>Intertoken Latency</em></p></div></div></div>
<p>In workloads with heavy prefix reuse, prefix-aware scheduling combined with load-awareness is essential to avoid bottlenecks and maximize GPU utilization. By combining prefix scoring with load awareness, llm-d achieves <strong>100% request success, lower latencies, and linear throughput scaling</strong> — the essence of intelligent, AI-aware scheduling.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="low-prefix-sharing-workload"><strong>Low Prefix Sharing Workload</strong><a href="https://llm-d.ai/blog/intelligent-inference-scheduling-with-llm-d#low-prefix-sharing-workload" class="hash-link" aria-label="Direct link to low-prefix-sharing-workload" title="Direct link to low-prefix-sharing-workload" translate="no">​</a></h4>
<p>When cache hits are rare, prefix-awareness provides little benefit, and both scorers perform similarly:</p>
<p><strong>Throughput:</strong>  Both scorers perform <strong>nearly identically</strong>, scaling linearly with QPS. Output throughput reaches ~400 tokens/sec and total throughput ~60k tokens/sec at 20 QPS for both strategies.</p>
<p><strong>Latency:</strong></p>
<ul>
<li class="">
<p><strong>Time to First Token (TTFT):</strong> Both remain stable in the <strong>300–380 ms range</strong> as load increases. Small variations exist, but neither scorer shows a clear advantage.</p>
</li>
<li class="">
<p><strong>Normalized time per token:</strong> Flat around <strong>0.65 ms/token</strong>, with both scorers tightly overlapping across QPS levels.</p>
</li>
<li class="">
<p><strong>Inter-Token Latency (ITL):</strong> Increases linearly with load, from ~25 ms at 2 QPS to ~50 ms at 20 QPS — again, no significant gap between scorers.</p>
</li>
</ul>
<p><strong>Reliability:</strong> Both scorers achieve <strong>100% success rate</strong> across the full load range, confirming that load balancing alone is sufficient when prefix reuse is low.</p>
<p>Under low prefix sharing workloads, the benefits of prefix-aware routing naturally diminish. In this case, adding load-awareness or prefix-awareness makes little difference - both strategies scale smoothly and meet latency targets.</p>
<p><img decoding="async" loading="lazy" alt="Latency vs request rate" src="https://llm-d.ai/assets/images/image07-64242d3915269e30596065e232cf7d62.webp" width="1999" height="458" class="img_ev3q">
<img decoding="async" loading="lazy" alt="Throughput vs Request rate" src="https://llm-d.ai/assets/images/image08-5c33a7ebd58e8e1db2326fb77bf12aab.webp" width="1668" height="588" class="img_ev3q"></p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="takeaway"><strong>Takeaway</strong><a href="https://llm-d.ai/blog/intelligent-inference-scheduling-with-llm-d#takeaway" class="hash-link" aria-label="Direct link to takeaway" title="Direct link to takeaway" translate="no">​</a></h3>
<p>These benchmarks illustrate why <strong>configurable scorers matter in llm-d</strong>.</p>
<ul>
<li class="">
<p>In <strong>prefix-heavy workloads</strong>, Prefix + Load scoring ensures cache hits are exploited without overloading replicas — yielding linear throughput scaling, low latencies, and high success rates.</p>
</li>
<li class="">
<p>In <strong>prefix-light workloads</strong>, simple load balancing suffices, and the system avoids unnecessary complexity.</p>
</li>
</ul>
<p>This adaptability means operators can choose (or combine) scorers based on workload characteristics, achieving the best <strong>token-per-dollar efficiency</strong> while consistently meeting latency and throughput SLOs.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="looking-ahead-roadmap-and-future-plans">Looking Ahead: Roadmap and Future Plans<a href="https://llm-d.ai/blog/intelligent-inference-scheduling-with-llm-d#looking-ahead-roadmap-and-future-plans" class="hash-link" aria-label="Direct link to Looking Ahead: Roadmap and Future Plans" title="Direct link to Looking Ahead: Roadmap and Future Plans" translate="no">​</a></h3>
<p>The IGW and <code>llm-d</code> projects are evolving rapidly, with several exciting directions on the horizon:</p>
<ul>
<li class=""><strong>Dynamic Scheduling Goals</strong>: Support for runtime reconfiguration of scheduling strategies based on workload type, latency targets, or user-defined policies.</li>
<li class=""><strong>Multi-Model Awareness</strong>: Enhanced routing logic that accounts for model compatibility, adapter stacking, and ensemble inference. (next blog)</li>
<li class=""><strong>Plugin Ecosystem</strong>: A curated set of reusable plugins for common LLM use cases, contributed by the community. We’re considering supporting out-of-process plugins, written in any language, to allow researchers to experiment with new scheduling algorithms and ideas - let us know if you have an idea we can help enable!</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="closing-thoughts">Closing Thoughts<a href="https://llm-d.ai/blog/intelligent-inference-scheduling-with-llm-d#closing-thoughts" class="hash-link" aria-label="Direct link to Closing Thoughts" title="Direct link to Closing Thoughts" translate="no">​</a></h3>
<p>The journey of llm-d reflects a broader shift in how we think about LLM inference - not just as a stateless function call, but as a dynamic, resource-aware orchestration problem. By building on IGW and pushing its boundaries, llm-d offers a flexible, extensible foundation for intelligent scheduling at scale.<br>
<!-- -->Whether you're running a single model or a fleet of fine-tuned variants, the goal is the same: <strong>maximize performance, minimize latency, and make smarter use of available compute</strong>.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="get-involved-with-llm-d">Get Involved with llm-d<a href="https://llm-d.ai/blog/intelligent-inference-scheduling-with-llm-d#get-involved-with-llm-d" class="hash-link" aria-label="Direct link to Get Involved with llm-d" title="Direct link to Get Involved with llm-d" translate="no">​</a></h3>
<p>The llm-d project thrives on community contributions, and there are many ways to get involved:</p>
<ul>
<li class=""><strong>Explore the llm-d Community Quickstart Guide</strong> → <a href="https://llm-d.ai/docs/community" target="_blank" rel="noopener noreferrer" class="">Start here</a> to learn more about getting involved in the llm-d project.</li>
<li class=""><strong>Join our Slack</strong> → <a href="https://llm-d.ai/slack" target="_blank" rel="noopener noreferrer" class="">Get your invite</a> and connect with maintainers and contributors</li>
<li class=""><strong>Explore the code</strong> → Browse our <a href="https://github.com/llm-d" target="_blank" rel="noopener noreferrer" class="">GitHub organization</a> and find issues that interest you</li>
<li class=""><strong>Attend meetings</strong> → All meetings are open! Add our <a href="https://llm-d.ai/docs/community#public-meeting-calendar" target="_blank" rel="noopener noreferrer" class="">public calendar</a> and join discussions</li>
</ul>]]></content>
        <author>
            <name>Nili Guy</name>
            <uri>https://www.linkedin.com/in/nilig/</uri>
        </author>
        <author>
            <name>Vita Bortnikov</name>
            <uri>https://www.linkedin.com/in/vita-bortnikov/</uri>
        </author>
        <author>
            <name>Etai Lev Ran</name>
            <uri>https://www.linkedin.com/in/elevran/</uri>
        </author>
        <author>
            <name>Robert Shaw</name>
            <email>robshaw@redhat.com</email>
            <uri>https://github.com/robertgshaw2-redhat</uri>
        </author>
        <author>
            <name>Clayton Coleman</name>
            <email>claytoncoleman@google.com</email>
            <uri>https://github.com/smarterclayton</uri>
        </author>
        <category label="blog posts" term="blog posts"/>
        <category label="Updates" term="Updates"/>
        <category label="llm-d release news" term="llm-d release news"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[llm-d 0.2: Our first well-lit paths (mind the tree roots!)]]></title>
        <id>https://llm-d.ai/blog/llm-d-v0.2-our-first-well-lit-paths</id>
        <link href="https://llm-d.ai/blog/llm-d-v0.2-our-first-well-lit-paths"/>
        <updated>2025-07-29T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[llm-d v0.2 introduces well-lit paths for Kubernetes LLM deployment: intelligent scheduling, P/D disaggregation, and MoE support with vLLM optimizations.]]></summary>
        <content type="html"><![CDATA[<p>Our <a href="https://github.com/llm-d/llm-d/releases/tag/v0.2.0" target="_blank" rel="noopener noreferrer" class="">0.2 release</a> delivers progress against our three well-lit paths to accelerate deploying large scale inference on Kubernetes - better load balancing, lower latency with disaggregation, and native vLLM support for very large Mixture of Expert models like DeepSeek-R1.</p>
<p>We’ve also enhanced our deployment and benchmarking tooling, incorporating lessons from real-world infrastructure deployments and addressing key antipatterns. This release gives llm-d users, contributors, researchers, and operators clearer guides for efficient use in tested, reproducible scenarios.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="new-paths-to-production">New Paths to Production<a href="https://llm-d.ai/blog/llm-d-v0.2-our-first-well-lit-paths#new-paths-to-production" class="hash-link" aria-label="Direct link to New Paths to Production" title="Direct link to New Paths to Production" translate="no">​</a></h2>
<p>In this release we've focused on delivering a clear and reproducible set of scenarios that teams can depend on and tested end-to-end on real hardware and models.</p>
<p>Our deployments have been tested and benchmarked on recent GPUs, such as H200 nodes, with models like Llama-3, Llama-4, and DeepSeek-R1. We provide deployment guidance and performance analysis to help teams understand when P/D separation is most beneficial and where tradeoffs emerge.</p>
<p>We’ve defined and improved three well-lit paths that form the foundation of this release:</p>
<ul>
<li class=""><a href="https://github.com/llm-d/llm-d/tree/main/guides/inference-scheduling" target="_blank" rel="noopener noreferrer" class=""><strong>Intelligent inference scheduling over any vLLM deployment</strong></a>: support for precise prefix-cache aware routing with no additional infrastructure, out-of-the-box load-aware scheduling for better tail latency that “just works”, and a new configurable scheduling profile system that lets teams see immediate latency wins while still customizing scheduling behavior for their workloads and infrastructure.</li>
<li class=""><a href="https://github.com/llm-d/llm-d/tree/main/guides/pd-disaggregation" target="_blank" rel="noopener noreferrer" class=""><strong>P/D disaggregation</strong>:</a> support for separating prefill and decode workloads to improve latency and GPU utilization for long-context scenarios.</li>
<li class=""><a href="https://github.com/llm-d/llm-d/tree/main/guides/wide-ep-lws" target="_blank" rel="noopener noreferrer" class=""><strong>Wide expert parallelism for DeepSeek R1 (EP/DP)</strong></a>: support for large-scale multi-node deployments using expert and data parallelism patterns for MoE models. This includes optimized deployments leveraging NIXL+UCX for inter-node communication, with fixes and improvements to reduce latency, and demonstrates the use of LeaderWorkerSet for Kubernetes-native inference orchestration.</li>
</ul>
<p>All of these scenarios are reproducible: we provide reference hardware specs, workloads, and benchmarking harness support, so others can evaluate, reproduce, and extend these benchmarks easily. This also reflects improvements to our deployment tooling and benchmarking framework, a new "machinery" that allows users to set up, test, and analyze these scenarios consistently.</p>
<p>While this is our first release and there are still some rough edges, our goal is to continue to refine and widen these paths to speed adoption. Please <a href="https://github.com/llm-d/llm-d/issues/146" target="_blank" rel="noopener noreferrer" class="">provide feedback</a> about where to go next!</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="key-enabling-changes-and-technical-milestones">Key enabling changes and technical milestones<a href="https://llm-d.ai/blog/llm-d-v0.2-our-first-well-lit-paths#key-enabling-changes-and-technical-milestones" class="hash-link" aria-label="Direct link to Key enabling changes and technical milestones" title="Direct link to Key enabling changes and technical milestones" translate="no">​</a></h2>
<p>llm-d 0.2 builds on the progress of our Special Interest Groups (SIGs), delivering the following key features:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="modular-deployer-refactor">Modular deployer refactor<a href="https://llm-d.ai/blog/llm-d-v0.2-our-first-well-lit-paths#modular-deployer-refactor" class="hash-link" aria-label="Direct link to Modular deployer refactor" title="Direct link to Modular deployer refactor" translate="no">​</a></h3>
<p>We've refactored the deployer into a Helm-first, modular structure, splitting charts for infrastructure, model services, and the inference gateway. These charts are now central to our documentation, complete with clear prerequisites for Kubernetes versions, networking, and GPU hardware. This refactor not only makes it easier for both first-time and advanced users to deploy llm-d but also positions us to work directly with end-user environments, where modularity and flexibility are essential in production.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="pd-disaggregation-and-dpep-for-moe-deployments">P/D disaggregation and DP/EP for MoE deployments<a href="https://llm-d.ai/blog/llm-d-v0.2-our-first-well-lit-paths#pd-disaggregation-and-dpep-for-moe-deployments" class="hash-link" aria-label="Direct link to P/D disaggregation and DP/EP for MoE deployments" title="Direct link to P/D disaggregation and DP/EP for MoE deployments" translate="no">​</a></h3>
<p>The path for Prefill/Decode (P/D) disaggregation and multi-node DP/EP MoE deployments is now more clearly defined and tested. This work integrates and optimizes key <a href="https://github.com/vllm-project/vllm/releases/tag/v0.10.0" target="_blank" rel="noopener noreferrer" class="">vLLM 0.10.0</a> kernel improvements, including DeepGEMM and CUTLASS for expert-parallel compute, PPLX and DeepEP kernels, and intra- and inter-node communication fixes and optimizations for multi-node scenarios. We now include:</p>
<ul>
<li class="">Kubernetes-native deployment recipes now support API servers per DP rank for one-pod-per-rank placement, enhancing scalability and control</li>
<li class="">Helm charts are updated to support LeaderWorkerSet (LWS) for multi-node setups and direct one-pod-per-DP-rank deployments</li>
<li class="">Optimized intra-node communication by enabling DeepEP to use cuda_ipc efficiently</li>
<li class="">Enhanced NIXL+UCX performance, with fixes and optimizations that significantly reduce inter-node communication overhead, particularly for long context workloads</li>
</ul>
<p>These validated scenarios are backed by benchmark baselines and example deployments via our quickstarts, offering clearer guidance on what works well today. As part of the "well-lit path," we have also identified limitations, including known edge cases around response sizes and failure modes, where more work is required.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="inference-scheduler-extensibility">Inference scheduler extensibility<a href="https://llm-d.ai/blog/llm-d-v0.2-our-first-well-lit-paths#inference-scheduler-extensibility" class="hash-link" aria-label="Direct link to Inference scheduler extensibility" title="Direct link to Inference scheduler extensibility" translate="no">​</a></h3>
<p>The llm-d-inference-scheduler is now far more extensible and aligned with the latest upstream Inference Gateway code base. It’s fully configurable and supports flexible, label selector–based filtering to enable a variety of model server topologies, including LWS-based deployments. We’ve improved the user experience for prefix-aware scheduling by allowing a simple configuration switch between the current estimated prefix tracking at the gateway and a new precise prefix cache–based tracking that reads KV events directly from vLLM for better hit rates.</p>
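<p>To illustrate the idea behind prefix cache–based routing, here is a toy sketch (not the llm-d-inference-scheduler API; the pod names and block size are invented for the example): the gateway scores each candidate pod by how many leading blocks of the prompt are already present in that pod's KV-cache index, and routes to the best match.</p>

```python
# Illustrative sketch of prefix-cache-aware routing. BLOCK, pod names, and the
# cache index structure are assumptions for the example, not llm-d internals.
BLOCK = 16  # tokens per KV-cache block (assumed value)

def cached_prefix_blocks(prompt_tokens, pod_cached_blocks):
    """Count leading prompt blocks already present in a pod's cache index."""
    hits = 0
    for i in range(0, len(prompt_tokens), BLOCK):
        block = tuple(prompt_tokens[i:i + BLOCK])
        if block in pod_cached_blocks:
            hits += 1
        else:
            break  # a prefix match must be contiguous from the start
    return hits

def pick_pod(prompt_tokens, cache_index):
    """Route to the pod with the longest cached prefix (ties: first wins)."""
    return max(cache_index,
               key=lambda pod: cached_prefix_blocks(prompt_tokens, cache_index[pod]))

prompt = list(range(48))  # 3 blocks of 16 tokens
cache_index = {
    "pod-a": {tuple(range(16))},                       # 1 block cached
    "pod-b": {tuple(range(16)), tuple(range(16, 32))}, # 2 contiguous blocks cached
}
print(pick_pod(prompt, cache_index))  # pod-b
```

<p>The difference between "estimated" and "precise" tracking is then simply how <code>cache_index</code> is populated: approximated at the gateway, or rebuilt from KV events streamed directly from vLLM.</p>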
<p>Our Helm charts now support deployment of scheduler configurations out of the box, making it easier for researchers and operators to iterate on custom scheduling and routing strategies without modifying core components. Internally, testing and development workflows have been updated for improved velocity and quality and this release includes numerous bug fixes.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="improved-benchmarking-suite">Improved benchmarking suite<a href="https://llm-d.ai/blog/llm-d-v0.2-our-first-well-lit-paths#improved-benchmarking-suite" class="hash-link" aria-label="Direct link to Improved benchmarking suite" title="Direct link to Improved benchmarking suite" translate="no">​</a></h3>
<p>Our benchmarking suite has matured significantly. It now supports testing any pre-deployed llm-d workload, accommodates multiple load generators, and includes automated analysis and plot generation for easier performance data interpretation.</p>
<p>For this release, we ran sweeps to characterize throughput and scaling and to demonstrate the benefits of P/D disaggregation for long context workloads. Scenarios cover representative workload shapes (input/output ratios of 10:1 and 100:1) and explore various parallelism schemes and P/D disaggregation ratios. For each setup, we measured throughput scaling (tokens per second per user and tokens per second per GPU) across increasing concurrency levels. These results provide a direct comparison with and without P/D separation (load-aware scheduling only), highlighting where llm-d’s optimizations deliver significant benefits.</p>
<p><img decoding="async" loading="lazy" alt="v0.2-llama-scout-1" src="https://llm-d.ai/assets/images/v0.2-llama-scout-1-7a33d8bad0b3b42dce3615602bf8a45a.webp" width="1566" height="1168" class="img_ev3q"></p>
<p>Fig. 1: Pareto curve for Llama-Scout on dual 8×H200 IB nodes, comparing monolithic (4tp4) and P/D-disaggregated (4ptp2–2dtp4) topologies.</p>
<p>The chart above shows the standard Pareto curve for Llama-Scout on two 8×H200 nodes with InfiniBand networking, comparing the monolithic 4tp4 topology to a disaggregated 4ptp2-2dtp4 configuration (maintaining the same total GPU count). The X-axis measures the latency observed by each user and the Y-axis measures the total throughput per GPU. Each point on the chart represents a particular concurrency level.</p>
<p>While both configurations perform similarly at very low or very high user output speeds, the disaggregated setup delivers significantly higher per-GPU throughput at medium concurrency levels (particularly around 64-128 concurrent requests), where contention between prefill and decode phases tends to dominate. This validates that disaggregation doesn’t just increase throughput; it exposes saturation points and unlocks headroom that is otherwise lost to phase interference. These insights are critical to autoscaling, role assignment, and future predictor-driven and SLO-aware scheduling.</p>
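<p>The Pareto framing above can be made concrete with a few lines of code. This is an illustrative sketch with made-up numbers, not our benchmarking harness: each sweep point pairs the per-user latency observed at one concurrency level (lower is better) with the per-GPU throughput (higher is better), and the frontier keeps only the points that no other point dominates on both axes.</p>

```python
# Illustrative Pareto-frontier computation over benchmark sweep points.
# The numbers below are invented for the example.

def pareto_front(points):
    """Keep points not dominated by another (lower latency AND higher throughput)."""
    return [
        p for p in points
        if not any(q[0] <= p[0] and q[1] >= p[1] and q != p for q in points)
    ]

# (per-user latency, tokens/sec/GPU) at increasing concurrency levels
sweep = [(20, 50), (35, 120), (60, 180), (90, 190), (150, 185)]
print(sorted(pareto_front(sweep)))
# (150, 185) is dropped: (90, 190) is both faster for users and higher-throughput
```

<p>On a chart like Fig. 1, the saturation point is visible where the frontier bends: past it, extra concurrency buys little throughput while per-user latency keeps growing.</p>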
<p>These results follow a broader trend previously reported [<a href="https://developer.nvidia.com/blog/how-nvidia-gb200-nvl72-and-nvidia-dynamo-boost-inference-performance-for-moe-models/" target="_blank" rel="noopener noreferrer" class="">1</a>, <a href="https://arxiv.org/abs/2506.05508" target="_blank" rel="noopener noreferrer" class="">2</a>]: disaggregated serving consistently provides the greatest benefit under medium‑concurrency conditions, particularly for prefill‑heavy traffic and larger models. Our results confirm this trend, showing improved throughput and clearer saturation dynamics at mid‑throughput, strongly validating our architectural direction. By decoupling prefill and decode phases, we not only boost raw performance but also surface scaling limits that static, monolithic systems obscure. This lays the foundation for dynamic topology adaptation, predictor‑informed routing, and autoscaling strategies driven by the real‑time behavior of workloads. These are key priorities for our upcoming releases.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="image-improvements">Image improvements<a href="https://llm-d.ai/blog/llm-d-v0.2-our-first-well-lit-paths#image-improvements" class="hash-link" aria-label="Direct link to Image improvements" title="Direct link to Image improvements" translate="no">​</a></h3>
<p>Multi-arch support, smaller images, and hardened configurations ensure a reliable out-of-the-box experience.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-weve-learned-and-sharing-with-the-community">What we've learned and sharing with the community<a href="https://llm-d.ai/blog/llm-d-v0.2-our-first-well-lit-paths#what-weve-learned-and-sharing-with-the-community" class="hash-link" aria-label="Direct link to What we've learned and sharing with the community" title="Direct link to What we've learned and sharing with the community" translate="no">​</a></h2>
<p>Here are some key lessons we've learned so far in our progress with llm-d:</p>
<ul>
<li class=""><strong>Low-hanging fruit matters.</strong> Targeted optimizations, like reducing KV‑cache transfer overhead between prefill and decode workers and refining prefix‑aware scheduling, delivered significant gains in throughput and tail latency. These quick wins required minimal change but paved the way for the deeper architectural improvements planned in upcoming releases.</li>
<li class=""><strong>Using bleeding-edge libraries is hard.</strong> Many key libraries associated with distributed inference are immature. Through our applied experiments in our well-lit paths and in close collaboration with ecosystem partners, we have improved much of the key infrastructure the larger community relies on in real-world conditions.</li>
<li class=""><strong>Build on proven paths.</strong> This validates why llm-d exists: to help users avoid discovering these problems themselves, offering reproducible deployments, performance baselines, and extensibility. llm-d focuses on building these paths so our users don’t need to troubleshoot these complex challenges in isolation.</li>
<li class=""><strong>Community matters.</strong> Working closely with the NVIDIA Dynamo community, we've tackled NIXL/UCX performance overheads for long context workloads, leading to significant improvements and active upstream contributions.</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="our-survey">Our survey<a href="https://llm-d.ai/blog/llm-d-v0.2-our-first-well-lit-paths#our-survey" class="hash-link" aria-label="Direct link to Our survey" title="Direct link to Our survey" translate="no">​</a></h3>
<p>In our first <a href="https://llm-d.ai/blog/llm-d-community-update-june-2025" target="_blank" rel="noopener noreferrer" class="">community survey</a>, we invited users to share their deployment requirements and challenges to help shape the future of the llm-d project and better understand how teams are serving LLMs today. We heard from a broad and diverse group of platform engineers, business leaders, and data practitioners, reflecting a wide range of workloads, architectures, and operational maturity.</p>
<p>Conversational AI (82.9%) and real-time applications (56.1%) stood out as the most common workloads, with nearly half of respondents supporting 4-10 models concurrently. Hardware choices highlight a diverse landscape: 85% use NVIDIA GPUs, 29% run AMD GPUs, and 27% deploy on CPUs alone, a surprising signal. Model preferences show Llama (73%), Qwen (63%), and Mistral (56%) leading the way. Yet despite this activity, SLO maturity is still emerging: 46% report no formal SLOs, and 39% are still defining theirs, suggesting many teams are in exploratory or early production phases. What came through clearly is strong demand for operational features like routing, caching, observability, and flexibility, signaling that ease of use and manageability are top priorities as teams scale. See a full summary of <a href="https://docs.google.com/document/d/1vwcIsCxrql0yObBEf-ziEThHIWAw8cY0FG68iGt3sj8" target="_blank" rel="noopener noreferrer" class="">the survey responses here</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-you-get-today">What you get today<a href="https://llm-d.ai/blog/llm-d-v0.2-our-first-well-lit-paths#what-you-get-today" class="hash-link" aria-label="Direct link to What you get today" title="Direct link to What you get today" translate="no">​</a></h2>
<p>Today, <a href="https://github.com/llm-d/llm-d/releases/tag/v0.2.0" target="_blank" rel="noopener noreferrer" class="">llm-d 0.2</a> offers:</p>
<ul>
<li class="">Modular Helm charts and clear deployment workflows.</li>
<li class="">Verified support for P/D, DP/EP, pod-per-rank, and heterogeneous GPUs (H200, B200).</li>
<li class="">Reproducible performance baselines, now with MoE support.</li>
<li class="">New foundations for routing and scheduler extensibility.</li>
<li class="">A developer- and researcher-friendly platform with tested examples, and detailed guides on the way.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="a-growing-community">A growing community<a href="https://llm-d.ai/blog/llm-d-v0.2-our-first-well-lit-paths#a-growing-community" class="hash-link" aria-label="Direct link to A growing community" title="Direct link to A growing community" translate="no">​</a></h2>
<p>The best part of llm-d has been watching the community grow around it. We're thrilled that over 700 people have already joined our Slack channel, and the project has been starred over 1,400 times on GitHub. This isn't just about numbers; it’s about the active, collaborative spirit that drives the project forward.</p>
<p>Much of the work happens within our seven Special Interest Groups (SIGs), each focused on a key area:</p>
<ul>
<li class=""><strong>Inference Scheduler</strong> – Developing smarter routing and load‑balancing strategies, including KV‑cache‑aware scheduling.</li>
<li class=""><strong>P/D Disaggregation</strong> – Advancing phase‑separation strategies to improve resource‑utilization efficiency.</li>
<li class=""><strong>KV Disaggregation</strong> – Advancing and optimizing distributed KV‑cache management.</li>
<li class=""><strong>Installation</strong> – Streamlining deployment on Kubernetes, from single‑node setups to large multi‑node clusters.</li>
<li class=""><strong>Benchmarking</strong> – Building tools to automate performance validation and make scenarios easier to reproduce and extend.</li>
<li class=""><strong>Autoscaling</strong> – Adapting resources dynamically based on workload demands.</li>
<li class=""><strong>Observability</strong> – Providing deep visibility into system performance and health.</li>
</ul>
<p>We're also collaborating with other great open-source communities like vLLM, Dynamo, and LMCache. Every one of these groups is open, and we’d love for you to join in. Whether you want to contribute code, share ideas, or just listen in, you are welcome. You can find details for each SIG, including their leaders and meeting times, on <a href="https://llm-d.ai/docs/community/sigs" target="_blank" rel="noopener noreferrer" class="">our community page</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="whats-next">What's next:<a href="https://llm-d.ai/blog/llm-d-v0.2-our-first-well-lit-paths#whats-next" class="hash-link" aria-label="Direct link to What's next:" title="Direct link to What's next:" translate="no">​</a></h2>
<p>Looking ahead, our community is focusing on these key areas:</p>
<ul>
<li class=""><strong>Core optimizations</strong>
<ul>
<li class="">TCP-based request dispatch upstream</li>
<li class="">Disaggregation protocol refinements, including possible sidecar removal</li>
<li class="">CPU cache offloading to expand memory capacity</li>
<li class="">KV event awareness baked directly into routing decisions</li>
<li class="">SLO-driven scheduling architecture for predictable performance</li>
</ul>
</li>
<li class=""><strong>Benchmarking enhancements:</strong>
<ul>
<li class="">Expanded reproducibility guides</li>
<li class="">Complete performance validation for core scenarios</li>
</ul>
</li>
<li class=""><strong>Developer experience improvements:</strong>
<ul>
<li class="">Expanded examples for inference gateway and scheduler extensibility</li>
<li class="">Central Helm charts and expanded documentation</li>
</ul>
</li>
</ul>
<p>See our <a href="https://github.com/llm-d/llm-d/issues/146" target="_blank" rel="noopener noreferrer" class="">roadmap issue</a> for what's coming next, and make your voice heard!</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="join-the-llm-d-community"><strong>Join the llm-d community!</strong><a href="https://llm-d.ai/blog/llm-d-v0.2-our-first-well-lit-paths#join-the-llm-d-community" class="hash-link" aria-label="Direct link to join-the-llm-d-community" title="Direct link to join-the-llm-d-community" translate="no">​</a></h2>
<p>We welcome researchers exploring scheduling, autoscaling, and routing optimization challenges. Your contributions are invaluable!</p>
<p>Community engagement is key to our success:</p>
<ul>
<li class=""><a href="https://red.ht/llm-d-public-calendar" target="_blank" rel="noopener noreferrer" class=""><strong>Join our community calls</strong></a> (Wed 12:30pm ET)</li>
</ul>
<p>Contribute on <a href="https://github.com/llm-d" target="_blank" rel="noopener noreferrer" class="">GitHub</a>, join the SIGs, and build with us!</p>]]></content>
        <author>
            <name>Robert Shaw</name>
            <email>robshaw@redhat.com</email>
            <uri>https://github.com/robertgshaw2-redhat</uri>
        </author>
        <author>
            <name>Clayton Coleman</name>
            <email>claytoncoleman@google.com</email>
            <uri>https://github.com/smarterclayton</uri>
        </author>
        <author>
            <name>Carlos Costa</name>
            <email>chcost@us.ibm.com</email>
            <uri>https://github.com/chcost</uri>
        </author>
        <category label="Releases" term="Releases"/>
        <category label="Announcements" term="Announcements"/>
        <category label="llm-d release news" term="llm-d release news"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[llm-d Community Update - June 2025]]></title>
        <id>https://llm-d.ai/blog/llm-d-community-update-june-2025</id>
        <link href="https://llm-d.ai/blog/llm-d-community-update-june-2025"/>
        <updated>2025-06-25T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Help shape llm-d's future: Take our 5-minute community survey, subscribe to our YouTube channel, and access exclusive resources for LLM serving innovation.]]></summary>
        <content type="html"><![CDATA[<p>Hey everyone! We've been making great progress with the llm-d project, and I wanted to share some important updates and opportunities to get involved.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="help-shape-the-future-of-the-llm-d-project">Help Shape the Future of the llm-d Project<a href="https://llm-d.ai/blog/llm-d-community-update-june-2025#help-shape-the-future-of-the-llm-d-project" class="hash-link" aria-label="Direct link to Help Shape the Future of the llm-d Project" title="Direct link to Help Shape the Future of the llm-d Project" translate="no">​</a></h2>
<p>To guide the future development of the llm-d project, we need to understand the real-world challenges, configurations, and performance needs of our community. We've created a short survey to gather insight into how you serve Large Language Models, from the hardware you use to the features you need most.</p>
<p>This <strong>anonymous</strong>, vendor-agnostic survey will take approximately 5 minutes to complete. Your input will directly influence the project's roadmap and priorities. The aggregated results will be shared with the <a href="https://groups.google.com/g/llm-d-contributors" target="_blank" rel="noopener noreferrer" class="">llm-d-contributors</a> mailing list to benefit the entire community.</p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>Your Input Will Define Our Roadmap</div><div class="admonitionContent_BuS1"><p>We've created an <a href="https://docs.google.com/forms/d/e/1FAIpQLScENfBNrN1q8XpugCmDUjy8sSn1cMzi_V0mCWTRwzixcPAQkw/viewform" target="_blank" rel="noopener noreferrer" class="">llm-d Community Roadmap Survey</a> to gather information about your LLM workloads. We are looking to learn more about:</p><ul>
<li class=""><strong>Your Serving Environment:</strong> This includes the hardware you use now and anticipate using in a year (like NVIDIA GPUs, AMD GPUs, or CPUs), and whether you run on-premise, in the cloud, or on edge devices.</li>
<li class=""><strong>Your Model Strategy:</strong> Do you serve a few large models or many smaller ones, which model families (like Llama or Mistral) are most common, and how you utilize techniques like LoRA adapters.</li>
<li class=""><strong>Your Performance Requirements:</strong> Your real-world SLOs for latency and throughput and the biggest LLM serving challenges you face—from cost optimization to operational ease of use.</li>
<li class=""><strong>Your Future Needs:</strong> What single new feature you would prioritize for an LLM Model-as-a-Service to help guide our innovation.</li>
</ul></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="take-the-5-minute-survey"><a href="https://docs.google.com/forms/d/e/1FAIpQLScENfBNrN1q8XpugCmDUjy8sSn1cMzi_V0mCWTRwzixcPAQkw/viewform" target="_blank" rel="noopener noreferrer" class="">Take the 5-Minute Survey</a><a href="https://llm-d.ai/blog/llm-d-community-update-june-2025#take-the-5-minute-survey" class="hash-link" aria-label="Direct link to take-the-5-minute-survey" title="Direct link to take-the-5-minute-survey" translate="no">​</a></h3>
<p><strong>Your participation is invaluable.</strong> Please take a few minutes to complete the survey. We encourage you to share it with other users or proxy their needs in your response to ensure our direction reflects the community's diverse requirements.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="new-youtube-channel">New YouTube Channel<a href="https://llm-d.ai/blog/llm-d-community-update-june-2025#new-youtube-channel" class="hash-link" aria-label="Direct link to New YouTube Channel" title="Direct link to New YouTube Channel" translate="no">​</a></h2>
<p>We're excited to announce our new YouTube channel! We've been recording our SIG meetings and creating tutorial content to help you get started with llm-d.</p>
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>Subscribe to our YouTube</div><div class="admonitionContent_BuS1"><p>Subscribe to get notified when new videos are posted, presentations, demos, and SIG meeting recordings: <a href="https://youtube.com/@llm-d-project" target="_blank" rel="noopener noreferrer" class="">@llm-d-project</a></p></div></div>
<div style="position:relative;padding-bottom:56.25%;height:0;overflow:hidden;max-width:100%"><iframe style="position:absolute;top:0;left:0;width:100%;height:100%" src="https://www.youtube.com/embed/playlist?list=PLU-3MgIuzv8oIRX7o2NLrxi5-Jzx9PoLr" title="llm-d introduction" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="reminder-join-our-google-group-to-access-project-docs">Reminder: Join Our Google Group to Access Project Docs<a href="https://llm-d.ai/blog/llm-d-community-update-june-2025#reminder-join-our-google-group-to-access-project-docs" class="hash-link" aria-label="Direct link to Reminder: Join Our Google Group to Access Project Docs" title="Direct link to Reminder: Join Our Google Group to Access Project Docs" translate="no">​</a></h2>
<div class="theme-admonition theme-admonition-warning admonition_xJq3 alert alert--warning"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 16 16"><path fill-rule="evenodd" d="M8.893 1.5c-.183-.31-.52-.5-.887-.5s-.703.19-.886.5L.138 13.499a.98.98 0 0 0 0 1.001c.193.31.53.501.886.501h13.964c.367 0 .704-.19.877-.5a1.03 1.03 0 0 0 .01-1.002L8.893 1.5zm.133 11.497H6.987v-2.003h2.039v2.003zm0-3.004H6.987V5.987h2.039v4.006z"></path></svg></span>Join our Google Group</div><div class="admonitionContent_BuS1"><p>We use Google Groups to share architecture diagrams, SIG meeting notes, and other important project content. To get full access to all project resources, please join: <a href="https://groups.google.com/g/llm-d-contributors" target="_blank" rel="noopener noreferrer" class="">llm-d-contributors Google Group</a></p></div></div>
<p>The Google Group is where all the magic happens! Here you'll find:</p>
<ul>
<li class="">Detailed architecture diagrams and design documents</li>
<li class="">SIG meeting notes and recordings</li>
<li class="">Early access to new features and proposals</li>
<li class="">Direct communication with the core team and other contributors</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="get-involved">Get Involved<a href="https://llm-d.ai/blog/llm-d-community-update-june-2025#get-involved" class="hash-link" aria-label="Direct link to Get Involved" title="Direct link to Get Involved" translate="no">​</a></h2>
<p>There are many ways to contribute to llm-d:</p>
<ol>
<li class="">Join the <a href="https://groups.google.com/g/llm-d-contributors" target="_blank" rel="noopener noreferrer" class="">Google Group</a> for full access to project resources</li>
<li class="">Review the <a href="https://red.ht/llm-d-public-calendar" target="_blank" rel="noopener noreferrer" class="">project public calendar</a> and join an upcoming community meeting</li>
<li class="">Subscribe to our <a href="https://youtube.com/@llm-d-project" target="_blank" rel="noopener noreferrer" class="">YouTube channel</a> for tutorials and meeting recordings</li>
<li class="">Fill out the <a href="https://docs.google.com/forms/d/e/1FAIpQLScENfBNrN1q8XpugCmDUjy8sSn1cMzi_V0mCWTRwzixcPAQkw/viewform" target="_blank" rel="noopener noreferrer" class="">Use Case Survey</a> to help shape our benchmarks</li>
<li class="">Join our <a href="https://llm-d.ai/slack" target="_self">Slack workspace</a> for real-time discussions</li>
<li class="">Check out our <a href="https://llm-d.ai/docs/community/contribute" target="_blank" rel="noopener noreferrer" class="">Contributor Guidelines</a> to start contributing code</li>
</ol>
<p>We're looking forward to hearing from you and working together to make llm-d even better!</p>]]></content>
        <author>
            <name>Pete Cheslock</name>
            <email>pete.cheslock@redhat.com</email>
            <uri>https://github.com/petecheslock</uri>
        </author>
        <category label="Community" term="Community"/>
        <category label="Updates" term="Updates"/>
        <category label="SIG-Benchmarking" term="SIG-Benchmarking"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[llm-d Week 1 Project News Round-Up]]></title>
        <id>https://llm-d.ai/blog/llm-d-week-1-round-up</id>
        <link href="https://llm-d.ai/blog/llm-d-week-1-round-up"/>
        <updated>2025-06-03T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[llm-d hits 1000 GitHub stars! Week 1-2 round-up covers KVTransfer Protocol, InferenceModel API updates, and community resources for LLM inference developers.]]></summary>
        <content type="html"><![CDATA[<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="june-3-2025">June 3, 2025<a href="https://llm-d.ai/blog/llm-d-week-1-round-up#june-3-2025" class="hash-link" aria-label="Direct link to June 3, 2025" title="Direct link to June 3, 2025" translate="no">​</a></h2>
<h1><strong>llm-d Week 1+2 Project News Round-Up</strong></h1>
<p>Hey, the llm-d project team has been really busy since the launch on May 20.</p>
<p>We've hit 1,000 ⭐️s on <a href="https://github.com/llm-d/llm-d" target="_blank" rel="noopener noreferrer" class="">GitHub</a>!</p>
<p><img decoding="async" loading="lazy" alt="llm-d Star Chart" src="https://llm-d.ai/assets/images/star-history-202563-5bd50fe0c920335bc7e17657cb50bd1e.webp" width="3664" height="2616" class="img_ev3q"></p>
<p><strong>Here are some of the active design conversations:</strong></p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>Join our Google Group</div><div class="admonitionContent_BuS1"><p>We use Google Groups to share architecture diagrams and other content. Please join: <a href="https://groups.google.com/g/llm-d-contributors" target="_blank" rel="noopener noreferrer" class="">llm-d-contributors Google Group</a></p></div></div>
<ul>
<li class=""><a href="https://docs.google.com/document/d/1zBkToR9XWjvBYLxu15JeoGpq16nH5sFFensZP_3lJQU/view" target="_blank" rel="noopener noreferrer" class="">2025-06-01 [PUBLIC] llm-d KVTransfer Protocol</a></li>
<li class=""><a href="https://docs.google.com/document/d/1x6aI9pbTF5oOsaEQYc9n4pBBY3_AuEY2X51VKxmBSnU/view" target="_blank" rel="noopener noreferrer" class="">Revisiting The InferenceModel API</a></li>
<li class=""><a href="https://docs.google.com/document/d/1HA-2yNZpc1F4KhyeYA30shjZpYEDqGIJXqVgDVv3SWU/view" target="_blank" rel="noopener noreferrer" class="">ModelService: Declarative Inference Serving on llm-d</a></li>
<li class=""><a href="https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk/view" target="_blank" rel="noopener noreferrer" class="">[External] Standardizing Large Model Server Metrics in Kubernetes</a></li>
<li class=""><a href="https://docs.google.com/document/d/1Ia0oRGnkPS8anB4g-_XPGnxfmOTOeqjJNb32Hlo_Tp0" target="_blank" rel="noopener noreferrer" class="">[PUBLIC] LLM Workloads</a></li>
<li class="">[<a href="https://docs.google.com/document/d/1Giim_pIGKb4MDue9YFJBlsVkPsWomfVTIFuTmP3PZIo" target="_blank" rel="noopener noreferrer" class="">Public] LLM-D Inference Scheduler Roadmap</a></li>
</ul>
<p><strong>Project links for those wanting to track us or get involved:</strong></p>
<ul>
<li class=""><a href="https://www.youtube.com/live/Gr8jomztY2s?si=_FT4gGpWGU6MQLyl&amp;t=4994" target="_blank" rel="noopener noreferrer" class="">Project Launch</a></li>
<li class=""><a href="https://llm-d.ai/docs/community/contribute" target="_blank" rel="noopener noreferrer" class="">Contributor Guidelines</a></li>
<li class=""><a href="http://red.ht/llm-d-public-calendar" target="_blank" rel="noopener noreferrer" class="">Project Calendar</a>
<ul>
<li class="">All SIG meetings</li>
<li class="">Weekly Community Meeting, every Wednesday @ 12:30pm Eastern Time</li>
</ul>
</li>
<li class="">Join the <a href="http://red.ht/llm-d-google-group" target="_blank" rel="noopener noreferrer" class="">llm-d Google group</a> - this is where all engineering drafts are shared. You must join the group to have access to project google docs.</li>
<li class="">Join <a href="https://llm-d.ai/slack" target="_self">Slack</a></li>
<li class=""><a href="https://github.com/llm-d/llm-d/issues/26" target="_blank" rel="noopener noreferrer" class="">llm-d Roadmap for Q2</a></li>
</ul>
<p><strong>Where to follow us on Social Media</strong></p>
<ul>
<li class=""><a href="https://llm-d.ai/" target="_blank" rel="noopener noreferrer" class="">https://llm-d.ai</a></li>
<li class=""><a href="http://linkedin.com/company/llm-d" target="_blank" rel="noopener noreferrer" class="">LinkedIn</a></li>
<li class=""><a href="https://bsky.app/profile/llm-d.ai" target="_blank" rel="noopener noreferrer" class="">Bluesky - @llm-d.ai</a></li>
<li class=""><a href="https://twitter.com/_llm_d_" target="_blank" rel="noopener noreferrer" class="">@_llm_d_</a></li>
<li class=""><a href="https://www.reddit.com/r/llm_d/" target="_blank" rel="noopener noreferrer" class="">r/llm_d</a></li>
<li class=""><a href="https://www.youtube.com/@llm-d-project" target="_blank" rel="noopener noreferrer" class="">YouTube</a></li>
</ul>]]></content>
        <author>
            <name>Pete Cheslock</name>
            <email>pete.cheslock@redhat.com</email>
            <uri>https://github.com/petecheslock</uri>
        </author>
        <category label="News Releases" term="News Releases"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Announcing the llm-d community!]]></title>
        <id>https://llm-d.ai/blog/llm-d-announce</id>
        <link href="https://llm-d.ai/blog/llm-d-announce"/>
        <updated>2025-05-20T08:00:00.000Z</updated>
        <summary type="html"><![CDATA[Introducing llm-d: Kubernetes-native distributed LLM inference with KV-cache routing, disaggregated serving, and SOTA performance per dollar. Built on vLLM.]]></summary>
        <content type="html"><![CDATA[<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="announcing-the-llm-d-community">Announcing the llm-d community<a href="https://llm-d.ai/blog/llm-d-announce#announcing-the-llm-d-community" class="hash-link" aria-label="Direct link to Announcing the llm-d community" title="Direct link to Announcing the llm-d community" translate="no">​</a></h2>
<p>llm-d is a Kubernetes-native high-performance distributed LLM inference framework<br>
<!-- -->- a well-lit path for anyone to serve at scale, with the fastest time-to-value and competitive performance per dollar for most models across most hardware accelerators.</p>
<p>With llm-d, users can operationalize gen AI deployments with a modular, high-performance, end-to-end serving solution that leverages the latest distributed inference optimizations like KV-cache aware routing and disaggregated serving, co-designed and integrated with the Kubernetes operational tooling in <a href="https://github.com/kubernetes-sigs/gateway-api-inference-extension?tab=readme-ov-file" target="_blank" rel="noopener noreferrer" class="">Inference Gateway (IGW)</a>.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="llm-inference-goes-distributed">LLM Inference Goes Distributed<a href="https://llm-d.ai/blog/llm-d-announce#llm-inference-goes-distributed" class="hash-link" aria-label="Direct link to LLM Inference Goes Distributed" title="Direct link to LLM Inference Goes Distributed" translate="no">​</a></h3>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-standard-scale-out-falls-short">Why Standard Scale Out Falls Short<a href="https://llm-d.ai/blog/llm-d-announce#why-standard-scale-out-falls-short" class="hash-link" aria-label="Direct link to Why Standard Scale Out Falls Short" title="Direct link to Why Standard Scale Out Falls Short" translate="no">​</a></h4>
<p>Kubernetes typically scales out application workloads with uniform replicas and round-robin load balancing.</p>
<p><img decoding="async" loading="lazy" alt="Figure 1: Deploying a service to multiple vLLM instances" src="https://llm-d.ai/assets/images/image5_46-dd662cfda15dbf7174d0ea6e97d745c0.webp" width="4605" height="3122" class="img_ev3q"></p>
<p>This simple pattern is very effective for most request patterns, which have the following characteristics:</p>
<ul>
<li class="">Requests are short-lived and generally uniform in resource utilization</li>
<li class="">Requests have generally uniform latency service level objectives (SLOs)</li>
<li class="">Each replica can process each request equally well</li>
<li class="">Specializing variants and coordinating replicas to process a single request is not useful</li>
</ul>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="llm-serving-is-unique">LLM Serving Is Unique<a href="https://llm-d.ai/blog/llm-d-announce#llm-serving-is-unique" class="hash-link" aria-label="Direct link to LLM Serving Is Unique" title="Direct link to LLM Serving Is Unique" translate="no">​</a></h4>
<p>The LLM inference workload, however, is unique with slow, non-uniform, expensive requests. This means that typical scale-out and load-balancing patterns fall short of optimal performance.</p>
<p><img decoding="async" loading="lazy" alt="Figure 2: Comparison of modern HTTP requests" src="https://llm-d.ai/assets/images/image7_33-73f3854277241ca80527a1994eda0c92.webp" width="1768" height="726" class="img_ev3q"></p>
<p>Let's take a look at each one step-by-step:</p>
<p><em>A. Requests are expensive with significant variance in resource utilization.</em></p>
<ul>
<li class="">Each LLM inference request has a different "shape" to it, as measured by the number of input tokens and output tokens. There is significant variance in these parameters across requests and workloads.<!-- -->
<ul>
<li class="">RAG has long inputs - prompt and retrieved docs - and short generated outputs</li>
<li class="">Reasoning has a short or medium inputs and long generated outputs</li>
</ul>
</li>
</ul>
<p><img decoding="async" loading="lazy" alt="Figure 3: Comparing the RAG pattern and Thinking/Reasoning pattern with prefill and decode stages" src="https://llm-d.ai/assets/images/image2_4-78ce4211a3ce8a2db443dfff5e98ba72.webp" width="1999" height="741" class="img_ev3q"></p>
<ul>
<li class="">These differences in request times can lead to significant imbalances across instances, which are compounded as loaded instances get overwhelmed. Overloads lead to longer ITL (Inter-Token Latency), which leads to more load, which leads to more ITL.</li>
</ul>
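<p>To see why uniform load balancing struggles with non-uniform requests, here is a toy Python simulation (illustrative only, not llm-d code): with a mix of short and very long requests, round-robin can leave replicas with wildly different amounts of queued work, while routing each request to the currently least-loaded replica bounds the imbalance by a single request's duration.</p>

```python
import random

# Toy simulation: 1000 requests whose durations are mostly short but
# occasionally very long, dispatched to 4 replicas.
random.seed(0)
durations = [random.choice([0.05, 0.1, 5.0]) for _ in range(1000)]

def round_robin(durations, n=4):
    # Assign request i to replica i mod n, ignoring current load.
    load = [0.0] * n
    for i, d in enumerate(durations):
        load[i % n] += d
    return load

def least_loaded(durations, n=4):
    # Assign each request to the replica with the least accumulated work.
    load = [0.0] * n
    for d in durations:
        load[load.index(min(load))] += d
    return load

rr, ll = round_robin(durations), least_loaded(durations)
print("round-robin spread: ", max(rr) - min(rr))
print("least-loaded spread:", max(ll) - min(ll))
```

<p>With least-loaded routing the final spread between the busiest and idlest replica can never exceed the longest single request, while round-robin's spread grows with the variance of the workload.</p>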
<p><em>B. Routing to specific replicas with cached prior computation can achieve orders of magnitude better latency.</em></p>
<ul>
<li class="">Many common LLM workloads have "multi-turn" request patterns, where the same prompt is sent iteratively to the same instance.<!-- -->
<ul>
<li class="">Agentic (tool calls are iterative request flow)</li>
<li class="">Code completion task (requests reuse current codebase as context)</li>
</ul>
</li>
</ul>
<p><img decoding="async" loading="lazy" alt="The agentic pattern sequence" src="https://llm-d.ai/assets/images/image8_0-65addf6114f9b264ec4146f199c094da.webp" width="1999" height="722" class="img_ev3q"></p>
<ul>
<li class="">LLM inference servers like vLLM implement a method called "automatic prefix caching", which enables "skipping" a significant amount of prefill computation when there is a cache hit. If requests are routed to vLLM replicas that have the data in the cache, we skip computation. Increasing the likelihood of prefix cache hits with a larger cache size can dramatically improve tail latencies.</li>
</ul>
<p><img decoding="async" loading="lazy" alt="The prefix aching method" src="https://llm-d.ai/assets/images/image3-ffd77c48ca9dba8df7c8461f7e445bd1.webp" width="1999" height="1744" class="img_ev3q"></p>
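<p>The idea behind automatic prefix caching can be sketched in a few lines of Python (a simplification of vLLM's block-based design, not its actual implementation): fixed-size token blocks are hashed with the previous block's hash chained in, so a cached block is only reusable when the entire prefix up to it matches.</p>

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM uses fixed-size blocks)

def block_hashes(token_ids):
    """Hash each full block of tokens, chaining in the previous block's
    hash so a block matches only when the whole prefix matches."""
    hashes, prev = [], b""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, full, BLOCK_SIZE):
        block = token_ids[i:i + BLOCK_SIZE]
        prev = hashlib.sha256(prev + str(block).encode("utf-8")).digest()
        hashes.append(prev)
    return hashes

def cached_prefix_len(token_ids, cache):
    """Count leading tokens whose blocks are already in a replica's cache;
    prefill can skip computation for these tokens."""
    n = 0
    for h in block_hashes(token_ids):
        if h not in cache:
            break
        n += BLOCK_SIZE
    return n

# A replica that already served a shared 64-token system prompt has its
# blocks cached; a follow-up request reuses that prefix with a new suffix.
system_prompt = list(range(64))
cache = set(block_hashes(system_prompt))
request = system_prompt + list(range(100, 120))
print(cached_prefix_len(request, cache))  # 64 tokens of prefill skipped
```

<p>Routing the follow-up request to this replica skips prefill for the first 64 tokens; routing it anywhere else recomputes them from scratch.</p>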
<p><em>C. Specializing and coordinating replicas to process a single request can lead to more throughput per GPU.</em></p>
<ul>
<li class="">
<p>Inference is split into two phases – prefill and decode.  Prefill generates the first output token and runs in parallel over all the prompt tokens - this phase is compute bound.  Decode generates tokens one at a time by doing a full pass over the model, making this phase memory bandwidth-bound.</p>
</li>
<li class="">
<p>Standard LLM deployments perform the prefill and decode phases of inference within a single replica. Given that the prefill and decode phases have different resource requirements, co-locating them on the same replica leads to inefficient resource use, especially for long sequences.</p>
</li>
<li class="">
<p><strong>Disaggregation</strong> (e.g. <a href="https://arxiv.org/abs/2401.09670" target="_blank" rel="noopener noreferrer" class="">Distserve</a>) separates prefill and decode phases onto different variants, enabling independent optimization and scaling of each phase.</p>
<ul>
<li class="">
<p>Google <a href="https://cloud.google.com/blog/products/compute/whats-new-with-ai-hypercomputer" target="_blank" rel="noopener noreferrer" class="">leverages disaggregated serving on TPUs</a> to provide better first-token latency and simplify operational scaling.</p>
</li>
<li class="">
<p>DeepSeek released a <a href="https://github.com/deepseek-ai/open-infra-index/blob/main/202502OpenSourceWeek/day_6_one_more_thing_deepseekV3R1_inference_system_overview.md" target="_blank" rel="noopener noreferrer" class="">discussion of the design of their inference system</a>, which leverages aggressive disaggregation to achieve remarkable performance at scale.</p>
</li>
</ul>
</li>
</ul>
<p><img decoding="async" loading="lazy" alt="Disaggregation separates the prefill and decode phases" src="https://llm-d.ai/assets/images/image4_57-9bb38b156a80392da3b3d1ab6852ac04.webp" width="1999" height="1144" class="img_ev3q"></p>
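<p>A back-of-envelope calculation makes the asymmetry concrete. The numbers below (model size, peak FLOP/s, memory bandwidth) are assumed for illustration only, not measurements of any particular accelerator:</p>

```python
# Illustrative roofline-style estimate for a hypothetical 70B-parameter
# dense model on a hypothetical accelerator.
params = 70e9
flops_per_token = 2 * params          # ~2 FLOPs per parameter per token
peak_flops = 1e15                     # 1 PFLOP/s (assumed)
mem_bw = 3e12                         # 3 TB/s (assumed)
bytes_per_pass = params * 2           # fp16 weights read once per decode step

# Prefill: all prompt tokens processed in parallel -> compute dominates.
prompt_tokens = 8000
prefill_compute_s = prompt_tokens * flops_per_token / peak_flops

# Decode: one token per forward pass -> weight reads dominate.
decode_mem_s_per_token = bytes_per_pass / mem_bw

print(f"prefill: ~{prefill_compute_s:.2f}s of compute for {prompt_tokens} tokens")
print(f"decode:  ~{decode_mem_s_per_token * 1000:.0f}ms of memory traffic per token")
```

<p>Even with toy numbers, the shape of the result holds: prefill saturates compute while decode saturates memory bandwidth, which is why serving both phases on identically configured replicas wastes one resource or the other.</p>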
<p><em>D. Production deployments often have a range of quality of service (QoS) requirements.</em></p>
<ul>
<li class="">
<p>Use cases for a single LLM endpoint can have a wide variety of quality of service requirements. Consider the following examples:</p>
<ul>
<li class="">Latency is the most important factor: Code completion requests and search responses need to minimize latency to provide an "in the loop" experience. O(ms) latency tolerance.</li>
<li class="">Latency is important: Chat agent sessions and email drafting with interactive use cases. O(seconds) latency tolerance.</li>
<li class="">Latency tolerant: Video call and email summarization and "deep research" agents with daily or hourly usage patterns. O(minutes) latency tolerance.</li>
<li class="">Latency agnostic: Overnight batch processing workloads, meeting minute generation, and autonomous agents. O(hours) latency tolerance.</li>
</ul>
</li>
<li class="">
<p>Given the compute intensity (and, therefore, high costs) of LLMs, tight latency SLOs are substantially more expensive to achieve. This spectrum of latency requirements presents an opportunity to further optimize infrastructure efficiency – the more latency tolerant a workload is, the more we can optimize infrastructure efficiency amongst other workloads.</p>
</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-llm-d">Why llm-d?<a href="https://llm-d.ai/blog/llm-d-announce#why-llm-d" class="hash-link" aria-label="Direct link to Why llm-d?" title="Direct link to Why llm-d?" translate="no">​</a></h3>
<p>To exploit these characteristics and achieve optimal performance for LLM workloads, the inference serving landscape is rapidly transitioning towards distributed cluster-scale architectures. For instance, in its "Open Source Week", the DeepSeek team published the design of its <a href="https://github.com/deepseek-ai/open-infra-index/blob/main/202502OpenSourceWeek/day_6_one_more_thing_deepseekV3R1_inference_system_overview.md" target="_blank" rel="noopener noreferrer" class="">inference system</a>, which aggressively leverages disaggregation and KV caching to achieve remarkable performance per $ of compute.</p>
<p>However, for most GenAI innovators, ML platform teams, and IT operations groups, these benefits remain out of reach. Building and operating a complex, monolithic system is time-consuming and challenging, especially in the context of the rapid pace of innovation and enterprise deployments with tens or hundreds of models for divergent use cases. This complexity risks time to market, higher operational costs and sprawl, and difficulty adopting and experimenting.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="our-objective">Our Objective<a href="https://llm-d.ai/blog/llm-d-announce#our-objective" class="hash-link" aria-label="Direct link to Our Objective" title="Direct link to Our Objective" translate="no">​</a></h4>
<p>The objective of llm-d is to create a well-lit path for anyone to adopt the leading distributed inference optimizations <em>within their existing deployment framework</em> - Kubernetes.</p>
<p>To achieve this goal, we have the following design principles for the project:</p>
<ul>
<li class=""><strong>Operationalizability:</strong> modular and resilient architecture with native integration into Kubernetes via Inference Gateway API</li>
<li class=""><strong>Flexibility:</strong> cross-platform (active work to support NVIDIA, Google TPU, AMD, and Intel), with extensible implementations of key composable layers of the stack</li>
<li class=""><strong>Performance</strong>: leverage distributed optimizations like disaggregation and prefix-aware routing to achieve the highest tok/$ while meeting SLOs</li>
</ul>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="architecture">Architecture<a href="https://llm-d.ai/blog/llm-d-announce#architecture" class="hash-link" aria-label="Direct link to Architecture" title="Direct link to Architecture" translate="no">​</a></h4>
<p>To achieve this objective, we designed llm-d with a modular and layered architecture on top of industry-standard open-source technologies - vLLM, Kubernetes, and Inference Gateway.</p>
<ul>
<li class="">
<p><a href="https://docs.vllm.ai/en/latest/" target="_blank" rel="noopener noreferrer" class=""><strong>vLLM</strong>. vLLM</a> is the leading open-source LLM inference engine, supporting a wide range of models (including Llama and DeepSeek) and hardware accelerators (including NVIDIA GPU, Google TPU, AMD ) with high performance.</p>
</li>
<li class="">
<p><a href="https://kubernetes.io/docs/home/" target="_blank" rel="noopener noreferrer" class=""><strong>Kubernetes</strong></a> <strong>(K8s)</strong>. K8s is an open source container orchestration engine for automating deployment, scaling, and management of containerized applications. It is the industry standard for deploying and updating LLM inference engines across various hardware accelerators.</p>
</li>
<li class="">
<p><a href="https://gateway-api-inference-extension.sigs.k8s.io/" target="_blank" rel="noopener noreferrer" class=""><strong>Inference Gateway</strong></a> <strong>(IGW)</strong>. IGW is an official Kubernetes project that extends the <a href="https://gateway-api.sigs.k8s.io/" target="_blank" rel="noopener noreferrer" class="">Gateway API</a> (the next generation of Kubernetes Ingress and Load Balancing API) with inference-specific routing. IGW includes many important features like model routing, serving priority, and extensible scheduling logic for "smart" load balancing. IGW integrates with many different gateway implementations, such as Envoy, making it widely portable across Kubernetes clusters.</p>
</li>
</ul>
<p><img decoding="async" loading="lazy" src="https://llm-d.ai/assets/images/llm-d-arch-simplified-d41875ab8b1fcf94a1a42df44940ceae.svg" width="859" height="474" class="img_ev3q"></p>
<p>And our key new contributions:</p>
<ul>
<li class="">
<p><strong>vLLM Optimized Inference Scheduler</strong> - IGW defines a pattern for customizable "smart" load-balancing via the <a href="https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/004-endpoint-picker-protocol" target="_blank" rel="noopener noreferrer" class="">Endpoint Picker Protocol (EPP)</a>. Leveraging enhanced operational telemetry exposed by vLLM, the inference scheduler implements the filtering and scoring algorithms necessary to make "smart" scheduling decisions around disaggregated serving, prefix-cache-awareness, and load-awareness, validated to be used out-of-the-box by llm-d users. Advanced teams can also tweak or implement their own scorers and filterers to further customize for their use cases, while still benefiting from upcoming operational features in the inference gateway, like flow control and latency-aware balancing.</p>
<ul>
<li class="">For more details, see our Northstar: <a href="https://docs.google.com/document/d/1kE1LY8OVjiOgKVD9-9Po96HODbTIbgHp4qgvw06BCOc/edit?tab=t.0" target="_blank" rel="noopener noreferrer" class="">[PUBLIC] llm-d Scheduler Northstar</a></li>
</ul>
</li>
<li class="">
<p><strong>Disaggregated Serving with <a href="https://github.com/vllm-project/vllm" target="_blank" rel="noopener noreferrer" class="">vLLM</a> -</strong> llm-d leverages vLLM's recently enabled support for disaggregated serving via a pluggable KV Connector API to run prefill and decode on independent instances, using high-performance transport libraries like <a href="https://github.com/ai-dynamo/nixl" target="_blank" rel="noopener noreferrer" class="">NVIDIA's NIXL</a>.</p>
<p>In llm-d, we plan to support two "well-lit" paths for prefill/decode (P/D) disaggregation:</p>
<ul>
<li class="">Latency optimized implementation using fast interconnects (IB, RDMA, ICI)</li>
<li class="">Throughput optimized implementation using data center networking</li>
<li class="">For more details, see our Northstar:<a href="https://docs.google.com/document/d/1FNN5snmipaTxEA1FGEeSH7Z_kEqskouKD1XYhVyTHr8/edit?tab=t.0#heading=h.ycwld2oth1kj" target="_blank" rel="noopener noreferrer" class="">[PUBLIC] llm-d Disaggregated Serving Northstar</a></li>
</ul>
</li>
<li class="">
<p><strong>Disaggregated Prefix Caching with vLLM</strong> -  llm-d uses the same vLLM KV connector API used in disaggregated serving to provide a pluggable cache for previous calculations, including offloading KVs to host, remote storage, and systems like <a href="https://github.com/LMCache/LMCache" target="_blank" rel="noopener noreferrer" class="">LMCache</a>.</p>
<p>In llm-d, we plan to support two "well-lit" paths for KV cache disaggregation:</p>
<ul>
<li class="">Independent caching with basic offloading to host memory and disk, providing a zero operational cost mechanism that utilizes all system resources</li>
<li class="">Shared caching with KV transfer between instances and shared storage with global indexing, providing potential for higher performance at the cost of a more operationally complex system.</li>
<li class="">For more details, see our Northstar: <a href="https://docs.google.com/document/d/1d-jKVHpTJ_tkvy6Pfbl3q2FM59NpfnqPAh__Uz_bEZ8/edit?tab=t.0#heading=h.6qazyl873259" target="_blank" rel="noopener noreferrer" class="">[PUBLIC] llm-d Prefix Caching Northstar</a></li>
</ul>
</li>
<li class="">
<p><strong>Variant Autoscaling over Hardware, Workload, and Traffic</strong> - Accelerator hardware varies dramatically in compute, memory, and cost; workloads sharing the same models vary in their required quality of service; the distinct phases of LLM inference and large mixture-of-experts models vary in whether they are compute-, memory-, or network-bound; and incoming traffic varies over time and by workload. Today, all of these decisions are made at deployment time, and almost all deployers struggle to enable autoscaling that safely reduces their costs.</p>
<p>Drawing on extensive experience from end users and OSS collaborators like AIBrix, we plan to implement a traffic- and hardware-aware autoscaler that:</p>
<ul>
<li class="">Measures the capacity of each model server instance</li>
<li class="">Derive a load function that takes into account different request shapes and QoS</li>
<li class="">Using the recent traffic mix - QPS (Queries Per Second), QoS, and shape distribution - calculate the optimal mix of instances to handle prefill, decode, and latency-tolerant requests, and label each instance with a grouping</li>
<li class="">Report load metrics per grouping that allows Kubernetes horizontal pod autoscaling to match hardware in use to hardware needed without violating SLOs</li>
<li class="">For more details, see our Northstar: <a href="https://docs.google.com/document/d/1inTneLEZTv3rDEBB9KLOB9K6oMq8c3jkogARJqdt_58/edit?tab=t.0" target="_blank" rel="noopener noreferrer" class="">[PUBLIC] llm-d Autoscaling Northstar</a></li>
</ul>
</li>
</ul>
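<p>The instance-mix calculation described in the autoscaling bullets can be sketched as follows (toy capacities and a deliberately simplified load function; the planned autoscaler is considerably more sophisticated):</p>

```python
import math

# Measured capacity of one instance per role (requests/sec it can absorb),
# assumed numbers for illustration.
capacity = {"prefill": 20.0, "decode": 8.0}

# Recent traffic mix: QPS per request shape, with a per-role cost derived
# from the shape (long-input RAG stresses prefill; long-output reasoning
# stresses decode).
traffic = [
    {"qps": 30, "prefill_cost": 1.0, "decode_cost": 0.2},  # RAG-like
    {"qps": 10, "prefill_cost": 0.2, "decode_cost": 1.0},  # reasoning-like
]

def required_instances(traffic, capacity):
    # Aggregate demand per role, then size each instance grouping.
    demand = {"prefill": 0.0, "decode": 0.0}
    for t in traffic:
        demand["prefill"] += t["qps"] * t["prefill_cost"]
        demand["decode"] += t["qps"] * t["decode_cost"]
    return {role: math.ceil(demand[role] / capacity[role]) for role in demand}

print(required_instances(traffic, capacity))  # {'prefill': 2, 'decode': 2}
```

<p>Exposing the per-grouping demand as a load metric is what lets standard Kubernetes horizontal pod autoscaling converge each grouping toward the required count.</p>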
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="example-llm-d-features">Example llm-d Features<a href="https://llm-d.ai/blog/llm-d-announce#example-llm-d-features" class="hash-link" aria-label="Direct link to Example llm-d Features" title="Direct link to Example llm-d Features" translate="no">​</a></h4>
<p>llm-d integrates IGW and vLLM together, enabling a high performance distributed serving stack. Let's discuss some of the example features enabled by llm-d.</p>
<p><strong>Prefix and KV cache-aware routing</strong></p>
<p>The first key collaboration between IGW and vLLM in llm-d was developing prefix-cache aware routing to complement the existing KV cache utilization aware load balancing in IGW.</p>
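<p>Schematically, an endpoint picker combines per-endpoint signals into a single score and dispatches to the best replica. The sketch below is illustrative Python with made-up signal names and weights; the actual llm-d-inference-scheduler is written in Go, and its scorers and weights differ:</p>

```python
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str
    queue_depth: int        # requests waiting on the replica
    kv_utilization: float   # fraction of KV-cache memory in use
    prefix_hit: float       # estimated prefix-cache hit ratio

# Hypothetical weights chosen for illustration.
WEIGHTS = {"queue": 0.3, "kv": 0.3, "prefix": 0.4}

def score(ep: Endpoint) -> float:
    # Higher is better: penalize queued work and KV-cache pressure,
    # reward a likely prefix-cache hit.
    return (WEIGHTS["queue"] * 1.0 / (1 + ep.queue_depth)
            + WEIGHTS["kv"] * (1.0 - ep.kv_utilization)
            + WEIGHTS["prefix"] * ep.prefix_hit)

def pick(endpoints):
    return max(endpoints, key=score)

pods = [Endpoint("pod-a", 2, 0.9, 0.0),
        Endpoint("pod-b", 5, 0.4, 0.9),
        Endpoint("pod-c", 0, 0.2, 0.0)]
print(pick(pods).name)  # pod-b: prefix-cache hits outweigh its deeper queue
```

<p>The interesting trade-off is visible even in the toy: the idle replica (pod-c) loses to a busier one (pod-b) because skipping prefill via a cache hit saves more time than the extra queueing costs.</p>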
<p>We conducted a series of experiments to evaluate the performance of the <a href="https://github.com/llm-d/llm-d-inference-scheduler" target="_blank" rel="noopener noreferrer" class="">llm-d-inference-scheduler</a> with prefix-aware routing on 2 NVIDIA 8xH100 nodes using the <a href="https://github.com/LMCache/LMBenchmark/tree/main/synthetic-multi-round-qa" target="_blank" rel="noopener noreferrer" class="">LMbenchmark in a long-input/short-output configuration designed</a> to stress KV cache reuse and routing decision quality.</p>
<table><thead><tr><th style="text-align:left"></th><th style="text-align:left">Model</th><th style="text-align:left">Configuration</th><th style="text-align:left">ISL</th><th style="text-align:left">OSL</th><th style="text-align:left">Latency SLO</th></tr></thead><tbody><tr><td style="text-align:left"><strong>S1</strong></td><td style="text-align:left">Llama 4 Scout FP8</td><td style="text-align:left">TP2, 2 replicas</td><td style="text-align:left">20,000</td><td style="text-align:left">100</td><td style="text-align:left">None</td></tr><tr><td style="text-align:left"><strong>S2</strong></td><td style="text-align:left">Llama 4 Scout FP8</td><td style="text-align:left">TP2, 4 replicas</td><td style="text-align:left">12,000</td><td style="text-align:left">100</td><td style="text-align:left">P95 TTFT &lt;= 2s</td></tr><tr><td style="text-align:left"><strong>S3</strong></td><td style="text-align:left">Llama 3.1 70B FP16</td><td style="text-align:left">TP2, 4 replicas</td><td style="text-align:left">8,000</td><td style="text-align:left">100</td><td style="text-align:left">P95 TTFT &lt;= 2s</td></tr></tbody></table>
<p><img decoding="async" loading="lazy" src="https://llm-d.ai/assets/images/image1_116-5419e52e9870560d530d03e729516a25.webp" width="1999" height="1415" class="img_ev3q"></p>
<p><strong>Key Observations:</strong></p>
<ul>
<li class=""><strong>S1:</strong> At 4 QPS, llm-d achieves a mean TTFT approximately 3X lower than the baseline (lower is better).</li>
<li class=""><strong>S2:</strong> llm-d delivers ~50% higher QPS than the baseline while meeting SLO requirements (higher is better).</li>
<li class=""><strong>S3:</strong> llm-d sustains 2X the baseline QPS under SLO constraints (higher is better).</li>
</ul>
<p>These results show that llm-d's cache- and prefix-aware scheduling effectively reduces TTFT and increases QPS compared to the baseline, while consistently meeting SLA requirements.</p>
<p>Try it out with the <code>base.yaml</code> config in our <a href="https://github.com/llm-d/llm-d-deployer/tree/main/quickstart" target="_blank" rel="noopener noreferrer" class="">quickstart</a>. And as a customization example, see the <a href="https://github.com/llm-d/llm-d-inference-scheduler/blob/main/docs/create_new_filter.md" target="_blank" rel="noopener noreferrer" class="">template</a> for adding your own scheduler filter.</p>
<p><strong>P/D disaggregation</strong></p>
<p>We've completed an initial implementation of P/D disaggregation with vLLM and llm-d-inference-scheduler, which delivers promising speedups for prefill-heavy workloads (20:1 ISL | OSL). Our next focus is finalizing the implementation with heterogeneous TP and completing comprehensive benchmarks for disaggregated serving. Short-term priorities include enabling heterogeneous TP, scaling with high-performance P/D + EP&lt;&gt;DP for large scale MoEs, and DP-aware load balancing. We will follow up with a detailed performance blog in the coming weeks.</p>
<p>Try it out with the <code>pd-nixl.yaml</code> config in our <a href="https://github.com/llm-d/llm-d-deployer/tree/main/quickstart" target="_blank" rel="noopener noreferrer" class="">quickstart</a>.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="get-started-with-llm-d">Get started with llm-d<a href="https://llm-d.ai/blog/llm-d-announce#get-started-with-llm-d" class="hash-link" aria-label="Direct link to Get started with llm-d" title="Direct link to Get started with llm-d" translate="no">​</a></h3>
<p>llm-d brings together the performance of vLLM with the operationalizability of Kubernetes, creating a modular architecture for distributed LLM inference, targeting high performance on the latest models and agentic architectures.</p>
<p>We welcome AI engineers and researchers to join the llm-d community and contribute:</p>
<ul>
<li class="">Check out our repository on Github: <a href="https://github.com/llm-d/llm-d" target="_blank" rel="noopener noreferrer" class="">https://github.com/llm-d/llm-d</a></li>
<li class="">Join our developer slack: <a href="https://llm-d.ai/slack" target="_self">/slack</a></li>
<li class="">Try out our quick starts to deploy llm-d on your Kubernetes cluster: <a href="https://github.com/llm-d/llm-d-deployer/tree/main/quickstart" target="_blank" rel="noopener noreferrer" class="">https://github.com/llm-d/llm-d-deployer/tree/main/quickstart</a></li>
</ul>
<p>Please join us. The future of AI is open.</p>]]></content>
        <author>
            <name>Robert Shaw</name>
            <email>robshaw@redhat.com</email>
            <uri>https://github.com/robertgshaw2-redhat</uri>
        </author>
        <author>
            <name>Clayton Coleman</name>
            <email>claytoncoleman@google.com</email>
            <uri>https://github.com/smarterclayton</uri>
        </author>
        <author>
            <name>Carlos Costa</name>
            <email>chcost@us.ibm.com</email>
            <uri>https://github.com/chcost</uri>
        </author>
        <category label="Hello" term="Hello"/>
        <category label="Welcome!" term="Welcome!"/>
        <category label="llm-d release news" term="llm-d release news"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[llm-d Press Release]]></title>
        <id>https://llm-d.ai/blog/llm-d-press-release</id>
        <link href="https://llm-d.ai/blog/llm-d-press-release"/>
        <updated>2025-05-20T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Red Hat launches llm-d: Open source distributed AI inference platform backed by NVIDIA, Google Cloud, IBM. Scale generative AI with intelligent routing on Kubernetes.]]></summary>
        <content type="html"><![CDATA[<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="may-20-2025">May 20, 2025<a href="https://llm-d.ai/blog/llm-d-press-release#may-20-2025" class="hash-link" aria-label="Direct link to May 20, 2025" title="Direct link to May 20, 2025" translate="no">​</a></h2>
<h1><strong>Red Hat Launches the llm-d Community, Powering Distributed Gen AI Inference at Scale</strong></h1>
<p>Forged in collaboration with founding contributors CoreWeave, Google Cloud, IBM Research and NVIDIA and joined by industry leaders AMD, Cisco, Hugging Face, Intel, Lambda and Mistral AI and university supporters at the University of California, Berkeley, and the University of Chicago, the project aims to make production generative AI as omnipresent as Linux</p>
<p><strong>BOSTON – RED HAT SUMMIT – MAY 20, 2025 —</strong> Red Hat, the world's leading provider of open source solutions, today announced the launch of llm-d, a new open source project that answers the most crucial need of generative AI's (gen AI) future: Inference at scale. Tapping breakthrough inference technologies for gen AI at scale, llm-d is powered by a native Kubernetes architecture, vLLM-based distributed inference and intelligent AI-aware network routing, empowering robust, large language model (LLM) inference clouds to meet the most demanding production service-level objectives (SLOs).</p>
<p>While training remains vital, the true impact of gen AI hinges on more efficient and scalable inference - the engine that transforms AI models into actionable insights and user experiences. According to Gartner<sup><a href="https://llm-d.ai/blog/llm-d-press-release#user-content-fn-1-984a1d" id="user-content-fnref-1-984a1d" data-footnote-ref="true" aria-describedby="footnote-label" class="anchorTargetStickyNavbar_Vzrq">1</a></sup>, "By 2028, as the market matures, more than 80% of data center workload accelerators will be specifically deployed for inference as opposed to training use." This underscores that the future of gen AI lies in the ability to execute. The escalating resource demands of increasingly sophisticated and larger reasoning models limits the viability of centralized inference and threatens to bottleneck AI innovation with prohibitive costs and crippling latency.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="answering-the-need-for-scalable-gen-ai-inference-with-llm-d">Answering the need for scalable gen AI inference with llm-d<a href="https://llm-d.ai/blog/llm-d-press-release#answering-the-need-for-scalable-gen-ai-inference-with-llm-d" class="hash-link" aria-label="Direct link to Answering the need for scalable gen AI inference with llm-d" title="Direct link to Answering the need for scalable gen AI inference with llm-d" translate="no">​</a></h3>
<p>Red Hat and its industry partners are directly confronting this challenge with llm-d, a visionary project that amplifies the power of vLLM to transcend single-server limitations and unlock production at scale for AI inference. Using the proven orchestration prowess of Kubernetes, llm-d integrates advanced inference capabilities into existing enterprise IT infrastructures. This unified platform empowers IT teams to meet the diverse serving demands of business-critical workloads, all while deploying innovative techniques to maximize efficiency and dramatically minimize the total cost of ownership (TCO) associated with high-performance AI accelerators.</p>
<p>llm-d delivers a powerful suite of innovations, highlighted by:</p>
<ul>
<li class=""><strong>vLLM</strong>, which has quickly become the de facto standard open source inference server, providing day 0 model support for emerging frontier models, and support for a broad list of accelerators, now including Google Cloud Tensor Processing Units (TPUs).</li>
<li class=""><strong>Prefill and Decode Disaggregation</strong> to separate the input context and token generation phases of inference into discrete operations, which can then be distributed across multiple servers.</li>
<li class=""><strong>KV (key-value) Cache Offloading</strong>, based on LMCache, shifts the memory burden of the KV cache from GPU memory to more cost-efficient and abundant standard storage, like CPU memory or network storage.</li>
<li class=""><strong>Kubernetes-powered clusters and controllers</strong> for more efficient scheduling of compute and storage resources as workload demands fluctuate, while maintaining performance and reducing latency.</li>
<li class=""><strong>AI-Aware Network Routing</strong> for scheduling incoming requests to the servers and accelerators that are most likely to have hot caches of past inference calculations.</li>
<li class=""><strong>High-performance communication APIs</strong> for faster and more efficient data transfer between servers, with support for NVIDIA Inference Xfer Library (NIXL).</li>
</ul>
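<p>To make the AI-aware routing idea above concrete, the sketch below scores candidate replicas by how much of a request's prompt prefix is likely already in their KV cache, penalized by current load, and routes to the highest-scoring one. This is a hypothetical illustration only, not llm-d's actual scheduler: the <code>Replica</code> type, the block-hashing scheme, and the weights are all invented for the example.</p>

```python
# Hypothetical sketch of cache-aware ("AI-aware") request routing:
# prefer the replica whose KV cache most likely already holds the
# request's prompt prefix, penalized by current queue depth.
# Names and weights are illustrative, not llm-d's implementation.
from dataclasses import dataclass, field

@dataclass
class Replica:
    name: str
    queue_depth: int  # requests currently waiting on this server
    cached_prefixes: set = field(default_factory=set)  # hashes of cached prompt blocks

def prefix_hashes(prompt: str, block: int = 16) -> list:
    """Hash the prompt in fixed-size blocks, mirroring paged KV-cache keys."""
    return [hash(prompt[: i + block]) for i in range(0, len(prompt), block)]

def score(replica: Replica, prompt: str,
          w_cache: float = 1.0, w_load: float = 0.5) -> float:
    blocks = prefix_hashes(prompt)
    # Count the longest run of already-cached leading blocks: a longer
    # cached prefix means fewer prefill tokens to recompute on arrival.
    hits = 0
    for h in blocks:
        if h not in replica.cached_prefixes:
            break
        hits += 1
    cache_ratio = hits / len(blocks) if blocks else 0.0
    return w_cache * cache_ratio - w_load * replica.queue_depth

def route(replicas: list, prompt: str) -> Replica:
    """Send the request to the replica with the best cache/load trade-off."""
    return max(replicas, key=lambda r: score(r, prompt))
```

A real scheduler would track cache contents approximately (e.g. from server-reported block tables) and fold in many more signals, but the trade-off shown here, cache affinity versus load spreading, is the core of the technique.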
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="llm-d-backed-by-industry-leaders">llm-d: Backed by industry leaders<a href="https://llm-d.ai/blog/llm-d-press-release#llm-d-backed-by-industry-leaders" class="hash-link" aria-label="Direct link to llm-d: Backed by industry leaders" title="Direct link to llm-d: Backed by industry leaders" translate="no">​</a></h3>
<p>This new open source project has already garnered the support of a formidable coalition of leading gen AI model providers, AI accelerator pioneers, and premier AI cloud platforms. CoreWeave, Google Cloud, IBM Research and NVIDIA are founding contributors, with AMD, Cisco, Hugging Face, Intel, Lambda and Mistral AI as partners, underscoring the industry's deep collaboration to architect the future of large-scale LLM serving. The llm-d community is further joined by founding supporters at the Sky Computing Lab at the University of California, originators of vLLM, and the LMCache Lab at the University of Chicago, originators of <a href="https://github.com/LMCache/LMCache" target="_blank" rel="noopener noreferrer" class="">LMCache</a>.</p>
<p>Rooted in its unwavering commitment to open collaboration, Red Hat recognizes the critical importance of vibrant and accessible communities in the rapidly evolving landscape of gen AI inference. Red Hat will actively champion the growth of the llm-d community, fostering an inclusive environment for new members and fueling its continued evolution.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="red-hats-vision-any-model-any-accelerator-any-cloud">Red Hat's vision: Any model, any accelerator, any cloud.<a href="https://llm-d.ai/blog/llm-d-press-release#red-hats-vision-any-model-any-accelerator-any-cloud" class="hash-link" aria-label="Direct link to Red Hat's vision: Any model, any accelerator, any cloud." title="Direct link to Red Hat's vision: Any model, any accelerator, any cloud." translate="no">​</a></h3>
<p>The future of AI must be defined by limitless opportunity, not constrained by infrastructure silos. Red Hat sees a horizon where organizations can deploy any model, on any accelerator, across any cloud, delivering an exceptional, more consistent user experience without exorbitant costs. To unlock the true potential of gen AI investments, enterprises require a universal inference platform - a standard for more seamless, high-performance AI innovation, both today and in the years to come.</p>
<p>Just as Red Hat pioneered the open enterprise by transforming Linux into the bedrock of modern IT, the company is now poised to architect the future of AI inference. vLLM's potential is that of a linchpin for standardized gen AI inference, and Red Hat is committed to building a thriving ecosystem around not just the vLLM community but also llm-d for distributed inference at scale. The vision is clear: regardless of the AI model or the underlying accelerator or the deployment environment, Red Hat intends to make vLLM the definitive open standard for inference across the new hybrid cloud.</p>
<p><strong>Red Hat Summit</strong><br>
<!-- -->Join the Red Hat Summit keynotes to hear the latest from Red Hat executives, customers and partners:</p>
<ul>
<li class=""><a href="https://events.experiences.redhat.com/widget/redhat/sum25/SessionCatalog2025/session/1737554802676001HJ8q" target="_blank" rel="noopener noreferrer" class=""><strong>Modernized infrastructure meets enterprise-ready AI</strong></a> — Tuesday, May 20, 8-10 a.m. EDT (<a href="https://youtube.com/live/Gr8jomztY2s?feature=share" target="_blank" rel="noopener noreferrer" class="">YouTube</a>)</li>
<li class=""><a href="https://events.experiences.redhat.com/widget/redhat/sum25/SessionCatalog2025/session/1737554802763001Hr0T" target="_blank" rel="noopener noreferrer" class=""><strong>Hybrid cloud evolves to deliver enterprise innovation</strong></a> — Wednesday, May 21, 8-9:30 a.m. EDT (<a href="https://youtube.com/live/g0K0pJIKHBU?feature=share" target="_blank" rel="noopener noreferrer" class="">YouTube</a>)</li>
</ul>
<p><strong>Supporting Quotes</strong><br>
<em>Brian Stevens, senior vice president and AI CTO, Red Hat</em><br>
<!-- -->"The launch of the llm-d community, backed by a vanguard of AI leaders, marks a pivotal moment in addressing the need for scalable gen AI inference, a crucial obstacle that must be overcome to enable broader enterprise AI adoption. By tapping the innovation of vLLM and the proven capabilities of Kubernetes, llm-d paves the way for distributed, scalable and high-performing AI inference across the expanded hybrid cloud, supporting any model, any accelerator, on any cloud environment and helping realize a vision of limitless AI potential."</p>
<p><em>Ramine Roane, corporate vice president, AI Product Management, AMD</em><br>
<!-- -->"AMD is proud to be a founding member of the llm-d community, contributing our expertise in high-performance GPUs to advance AI inference for evolving enterprise AI needs. As organizations navigate the increasing complexity of generative AI to achieve greater scale and efficiency, AMD looks forward to meeting this industry demand through the llm-d project."</p>
<p><em>Shannon McFarland, vice president, Cisco Open Source Program Office &amp; Head of Cisco DevNet</em><br>
<!-- -->"The llm-d project is an exciting step forward for practical generative AI. llm-d empowers developers to programmatically integrate and scale generative AI inference, unlocking new levels of innovation and efficiency in the modern AI landscape. Cisco is proud to be part of the llm-d community, where we're working together to explore real-world use cases that help organizations apply AI more effectively and efficiently."</p>
<p><em>Chen Goldberg, senior vice president, Engineering, CoreWeave</em><br>
<!-- -->"CoreWeave is proud to be a founding contributor to the llm-d project and to deepen our long-standing commitment to open source AI. From our early partnership with EleutherAI to our ongoing work advancing inference at scale, we've consistently invested in making powerful AI infrastructure more accessible. We're excited to collaborate with an incredible group of partners and the broader developer community to build a flexible, high-performance inference engine that accelerates innovation and lays the groundwork for open, interoperable AI."</p>
<p><em>Mark Lohmeyer, vice president and general manager, AI &amp; Computing Infrastructure, Google Cloud</em><br>
<!-- -->"Efficient AI inference is paramount as organizations move to deploying AI at scale and deliver value for their users. As we enter this new age of inference, Google Cloud is proud to build upon our legacy of open source contributions as a founding contributor to the llm-d project. This new community will serve as a critical catalyst for distributed AI inference at scale, helping users realize enhanced workload efficiency with increased optionality for their infrastructure resources."</p>
<p><em>Jeff Boudier, Head of Product, Hugging Face</em><br>
<!-- -->"We believe every company should be able to build and run their own models. With vLLM leveraging the Hugging Face transformers library as the source of truth for model definitions, a wide diversity of models large and small is available to power text, audio, image and video AI applications. Eight million AI Builders use Hugging Face to collaborate on over two million AI models and datasets openly shared with the global community. We are excited to support the llm-d project to enable developers to take these applications to scale."</p>
<p><em>Priya Nagpurkar, vice president, Hybrid Cloud and AI Platform, IBM Research</em><br>
<!-- -->"At IBM, we believe the next phase of AI is about efficiency and scale. We're focused on unlocking value for enterprises through AI solutions they can deploy effectively. As a founding contributor to llm-d, IBM is proud to be a key part of building a differentiated hardware agnostic distributed AI inference platform. We're looking forward to continued contributions towards the growth and success of this community to transform the future of AI inference."</p>
<p><em>Bill Pearson, vice president, Data Center &amp; AI Software Solutions and Ecosystem, Intel</em><br>
<!-- -->"The launch of llm-d will serve as a key inflection point for the industry in driving AI transformation at scale, and Intel is excited to participate as a founding supporter.  Intel's involvement with llm-d is the latest milestone in our decades-long collaboration with Red Hat to empower enterprises with open source solutions that they can deploy anywhere, on their platform of choice. We look forward to further extending and building AI innovation through the llm-d community."</p>
<p><em>Eve Callicoat, senior staff engineer, ML Platform, Lambda</em><br>
<!-- -->"Inference is where the real-world value of AI is delivered, and llm-d represents a major leap forward. Lambda is proud to support a project that makes state-of-the-art inference accessible, efficient, and open."</p>
<p><em>Ujval Kapasi, vice president, Engineering AI Frameworks, NVIDIA</em><br>
<!-- -->"The llm-d project is an important addition to the open source AI ecosystem and reflects NVIDIA's support for collaboration to drive innovation in generative AI. Scalable, highly performant inference is key to the next wave of generative and agentic AI. We're working with Red Hat and other supporting partners to foster llm-d community engagement and industry adoption, helping accelerate llm-d with innovations from NVIDIA Dynamo such as NIXL."</p>
<p><em>Ion Stoica, Professor and Director of Sky Computing Lab, University of California, Berkeley</em><br>
<!-- -->"We are pleased to see Red Hat build upon the established success of vLLM, which originated in our lab to help address the speed and memory challenges that come with running large AI models. Open source projects like vLLM, and now llm-d anchored in vLLM, are at the frontier of AI innovation tackling the most demanding AI inference requirements and moving the needle for the industry at large."</p>
<p><em>Junchen Jiang, CS Professor, LMCache Lab, University of Chicago</em><br>
<!-- -->"Distributed KV cache optimizations, such as offloading, compression, and blending, have been a key focus of our lab, and we are excited to see llm-d leveraging LMCache as a core component to reduce time to first token as well as improve throughput, particularly in long-context inference."</p>
<p><strong>Additional Resources</strong></p>
<ul>
<li class="">Learn more about <a href="https://www.llm-d.ai/" target="_blank" rel="noopener noreferrer" class="">llm-d</a></li>
<li class="">Read more about <a href="https://www.redhat.com/en/topics/ai/what-is-vllm" target="_blank" rel="noopener noreferrer" class="">vLLM</a></li>
<li class="">Find out more about <a href="https://github.com/llm-d" target="_blank" rel="noopener noreferrer" class="">contributing to llm-d</a></li>
<li class="">Learn more about <a href="http://red.ht/I2Zk1e" target="_blank" rel="noopener noreferrer" class="">Red Hat Summit</a></li>
<li class="">See all of Red Hat's announcements this week in the <a href="https://red.ht/3QrRUAh" target="_blank" rel="noopener noreferrer" class="">Red Hat Summit newsroom</a></li>
<li class="">Follow <a href="https://twitter.com/redhatsummit" target="_blank" rel="noopener noreferrer" class="">@RedHatSummit</a> or <a href="https://twitter.com/hashtag/rhsummit" target="_blank" rel="noopener noreferrer" class="">#RHSummit</a> on X for event-specific updates</li>
</ul>
<p><strong>Connect with Red Hat</strong></p>
<ul>
<li class="">Learn more about <a href="http://red.ht/IOS5vm" target="_blank" rel="noopener noreferrer" class="">Red Hat</a></li>
<li class="">Get more news in the <a href="http://red.ht/1qeXuma" target="_blank" rel="noopener noreferrer" class="">Red Hat newsroom</a></li>
<li class="">Read the <a href="http://red.ht/1zzgkXp" target="_blank" rel="noopener noreferrer" class="">Red Hat blog</a></li>
<li class="">Follow <a href="https://red.ht/3Ghe0TT" target="_blank" rel="noopener noreferrer" class="">Red Hat on X</a></li>
<li class="">Follow <a href="https://red.ht/4iBsqwB" target="_blank" rel="noopener noreferrer" class="">Red Hat on Instagram</a></li>
<li class="">Follow <a href="https://red.ht/4hHewrv" target="_blank" rel="noopener noreferrer" class="">Red Hat on LinkedIn</a></li>
<li class="">Watch <a href="https://red.ht/44B8oxL" target="_blank" rel="noopener noreferrer" class="">Red Hat videos on YouTube</a></li>
</ul>
<p><strong>About Red Hat</strong><br>
<a href="https://www.redhat.com/en" target="_blank" rel="noopener noreferrer" class="">Red Hat</a> is the open hybrid cloud technology leader, delivering a trusted, consistent and comprehensive foundation for transformative IT innovation and AI applications. Its portfolio of cloud, developer, AI, Linux, automation and application platform technologies enables any application, anywhere—from the datacenter to the edge. As the world's leading provider of enterprise open source software solutions, Red Hat invests in open ecosystems and communities to solve tomorrow's IT challenges. Collaborating with partners and customers, Red Hat helps them build, connect, automate, secure and manage their IT environments, supported by consulting services and <a href="https://access.redhat.com/recognition" target="_blank" rel="noopener noreferrer" class="">award-winning</a> training and certification offerings.</p>
<p><strong>Forward-Looking Statements</strong><br>
<!-- -->Except for the historical information and discussions contained herein, statements contained in this press release may constitute forward-looking statements within the meaning of the Private Securities Litigation Reform Act of 1995. Forward-looking statements are based on the company's current assumptions regarding future business and financial performance. These statements involve a number of risks, uncertainties and other factors that could cause actual results to differ materially. Any forward-looking statement in this press release speaks only as of the date on which it is made. Except as required by law, the company assumes no obligation to update or revise any forward-looking statements.</p>
<p><strong>Media Contact:</strong><br>
<!-- -->John Terrill<br>
<!-- -->Red Hat<br>
<!-- -->+1-571-421-8132<br>
<a href="mailto:jterrill@redhat.com" target="_blank" rel="noopener noreferrer" class="">jterrill@redhat.com</a></p>
<p><em>###</em></p>
<p><em>Red Hat and the Red Hat logo are trademarks or registered trademarks of Red Hat, Inc. or its subsidiaries in the U.S. and other countries.</em></p>
<!-- -->
<section data-footnotes="true" class="footnotes"><h2 class="anchor anchorTargetStickyNavbar_Vzrq sr-only" id="footnote-label">Footnotes<a href="https://llm-d.ai/blog/llm-d-press-release#footnote-label" class="hash-link" aria-label="Direct link to Footnotes" title="Direct link to Footnotes" translate="no">​</a></h2>
<ol>
<li class="anchorTargetStickyNavbar_Vzrq" id="user-content-fn-1-984a1d">
<p>Forecast Analysis: AI Semiconductors, Worldwide, Alan Priestley, Gartner, 2 August 2024 - ID G00818912. GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally. <a href="https://llm-d.ai/blog/llm-d-press-release#user-content-fnref-1-984a1d" data-footnote-backref="" aria-label="Back to reference 1" class="data-footnote-backref">↩</a></p>
</li>
</ol>
</section>]]></content>
        <category label="News Releases" term="News Releases"/>
    </entry>
</feed>