<?xml version="1.0" encoding="utf-8"?><?xml-stylesheet type="text/xsl" href="atom.xsl"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <id>https://llm-d.ai/blog</id>
    <title>llm-d Blog</title>
    <updated>2026-03-13T09:00:00.000Z</updated>
    <generator>https://github.com/jpmonette/feed</generator>
    <link rel="alternate" href="https://llm-d.ai/blog"/>
    <subtitle>llm-d Blog</subtitle>
    <icon>https://llm-d.ai/img/llm-d-favicon.png</icon>
    <entry>
        <title type="html"><![CDATA[Predicted-Latency Based Scheduling for LLMs]]></title>
        <id>https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms</id>
        <link href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms"/>
        <updated>2026-03-13T09:00:00.000Z</updated>
        <summary type="html"><![CDATA[A lightweight ML model trained online from live traffic replaces manually tuned heuristic weights with direct latency predictions, achieving 43% improvement in P50 end-to-end latency and 70% improvement in TTFT on a production-realistic workload.]]></summary>
        <content type="html"><![CDATA[<p>Not all LLM requests cost the same. A short prompt might complete in milliseconds, while a long one can occupy a GPU for seconds. If we can predict how long a request will take on each candidate server before dispatching it, we can make substantially better routing decisions. This post describes a system that does exactly that: a lightweight ML model trained online from live traffic that replaces manually tuned heuristic weights with direct latency predictions.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-load-balancing-problem-in-llm-serving">The Load Balancing Problem in LLM Serving<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#the-load-balancing-problem-in-llm-serving" class="hash-link" aria-label="Direct link to The Load Balancing Problem in LLM Serving" title="Direct link to The Load Balancing Problem in LLM Serving" translate="no">​</a></h2>
<p>The variation in request cost comes from how LLM inference works. It happens in two phases: first, the model processes the entire input prompt (the <strong>prefill</strong> phase), which is compute-heavy and scales with prompt length. Prefill can be accelerated when the server has already cached results from a similar prompt (prefix caching). Then it generates output tokens one at a time (the <strong>decode</strong> phase), which is memory-heavy and scales with the number of tokens generated.</p>
<p>Current load balancers try to account for this using signals like queue depth, memory pressure, cache locality, and batch size. But these signals often conflict: routing for cache reuse concentrates load, while routing for low utilization spreads it. Getting the balance right requires manual tuning of weights (see <a href="https://docs.nvidia.com/dynamo/latest/user-guides/kv-cache-aware-routing#cli-arguments" target="_blank" rel="noopener noreferrer" class="">NVIDIA Dynamo</a> or <a href="https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/config-text/?h=kv#kvcachescorer" target="_blank" rel="noopener noreferrer" class="">Inference Gateway</a>), and the right balance shifts as the workload varies.</p>
<p>Production traffic makes this concrete. The figures below show metrics from an internal Google service serving an open model over 24 hours; the patterns are representative of what we see across production LLM deployments.</p>
<ul>
<li class=""><strong>Request and response sizes are bursty and have huge variance</strong>: Input and output token counts swing by orders of magnitude over the course of hours, driven by traffic that arrives in waves rather than at a steady rate. (Token counts are normalized by the model's maximum context length.)</li>
</ul>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/predicted-latency/request_response_variance.webp" alt="Request and response size variance over 24 hours" style="width:100%;height:auto"></div>
<ul>
<li class=""><strong>Server load is uneven</strong>: These bursts hit servers differently. KV cache utilization (a measure of how much GPU memory is occupied by in-flight requests) spikes from 30% to over 70%. Queue depths spike just as unevenly.</li>
</ul>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/predicted-latency/uneven_server_load.webp" alt="Uneven server load across pods" style="width:100%;height:auto"></div>
<ul>
<li class=""><strong>Cache reuse is unstable</strong>: LLM servers cache previously computed results so that repeated prefixes (like a shared system prompt) don't need to be recomputed. Most requests see little to no cache reuse, while a subset benefits from high hit rates. But even that subset is unstable: hit rates collapse frequently as traffic patterns shift and cached prefixes get evicted. A load balancer tuned for high cache reuse will routinely encounter requests where that assumption doesn't hold.</li>
</ul>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/predicted-latency/unstable_cache_reuse.webp" alt="Unstable cache reuse patterns" style="width:85%;height:auto"></div>
<p>These patterns are consistent with observations across production LLM deployments. Azure inference traces show significant variation in request sizes and heavy-tailed token distributions over short time windows [<a href="https://jovans2.github.io/files/DynamoLLM_HPCA2025.pdf" target="_blank" rel="noopener noreferrer" class="">Stojkovic et al., "DynamoLLM," HPCA 2025</a>]. BurstGPT documents burstiness and diversified concurrency patterns across 10.31 million traces from Azure OpenAI services [<a href="https://arxiv.org/abs/2401.17644" target="_blank" rel="noopener noreferrer" class="">Wang et al., "BurstGPT," 2024</a>].</p>
<p>No fixed configuration can handle this. Weights tuned for high cache reuse cause latency violations during cache misses, and weights tuned for worst-case reuse waste capacity when caching is working well.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="predicted-latency-aware-scheduling">Predicted-Latency Aware Scheduling<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#predicted-latency-aware-scheduling" class="hash-link" aria-label="Direct link to Predicted-Latency Aware Scheduling" title="Direct link to Predicted-Latency Aware Scheduling" translate="no">​</a></h2>
<p>The two phases of LLM inference each have a standard latency metric: <strong>Time to First Token (TTFT)</strong> measures how long prefill takes, and <strong>Time Per Output Token (TPOT)</strong> measures how long each subsequent token takes to generate.</p>
<p>We train a lightweight XGBoost regression model in real time on the relationship between request and server characteristics -- prompt length, prefix cache hit rate, number of running requests, queue depth, KV cache utilization -- and the observed TTFT and TPOT for completed requests. The model learns to approximate the underlying performance behavior of the model server and accelerator hardware, continuously retraining on a sliding window of recent data to track shifting workload patterns.</p>
<p>At scheduling time, the model predicts the TTFT and TPOT a new request would experience on each candidate server, given that server's current state. The scheduler then routes the request to the server with the best predicted outcome. When SLOs are provided, the scheduler prefers servers with positive headroom (predicted latency below the SLO); otherwise, it simply picks the server with the lowest predicted latency.</p>
<p>This largely eliminates manual weight tuning. Rather than deciding how much to value cache locality versus queue depth versus memory pressure, the model learns those tradeoffs directly from observed latency data.</p>
<p>Across five benchmark scenarios ranging from cache-friendly to cache-intensive workloads, predicted-latency aware scheduling outperforms load+prefix-aware routing in four out of five cases and matches it in the fifth. On a representative MaaS workload, it achieves a 43% improvement in P50 end-to-end latency and a 70% improvement in TTFT.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="how-it-works">How It Works<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#how-it-works" class="hash-link" aria-label="Direct link to How It Works" title="Direct link to How It Works" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="design-goals">Design Goals<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#design-goals" class="hash-link" aria-label="Direct link to Design Goals" title="Direct link to Design Goals" translate="no">​</a></h3>
<p>The predicted-latency approach improves on utilization-based balancing in two key ways:</p>
<ul>
<li class="">
<p><strong>Balancing spread vs consolidation under changing traffic:</strong> Minimizing TPOT requires spreading load to reduce batch size, while minimizing TTFT benefits from consolidation to maximize prefix cache reuse. The optimal balance between these strategies shifts as traffic patterns change. Utilization-based balancers rely on manually tuned weights to make this tradeoff, and any fixed configuration will be wrong as the workload varies. The predicted-latency model learns the relationship between server state, request characteristics, and actual latency outcomes, allowing it to dynamically balance between spreading and consolidation as conditions change.</p>
</li>
<li class="">
<p><strong>Best-fit scheduling in the presence of SLOs:</strong> When requests have latency SLOs, the optimal strategy is best-fit: pack requests into servers that can still meet SLO targets, keeping other servers free for future requests with higher GPU requirements. Utilization-based balancers have no way to determine whether a server can meet a request's SLO; they only see proxy signals like queue depth and memory pressure. The predicted-latency model directly estimates TTFT and TPOT per server, allowing the scheduler to compute headroom against SLO targets and route accordingly. This best-fit strategy can be especially useful with a heterogeneous pool of model servers (say, a mix of H100 and B200 GPUs).</p>
</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="predicting-ttft-and-tpot">Predicting TTFT and TPOT<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#predicting-ttft-and-tpot" class="hash-link" aria-label="Direct link to Predicting TTFT and TPOT" title="Direct link to Predicting TTFT and TPOT" translate="no">​</a></h3>
<p>We use an XGBoost regression model that takes request and server state as input and outputs predicted TTFT and TPOT. The model was designed to be fast, accurate, pluggable, and able to learn online as workload characteristics shift.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="features">Features<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#features" class="hash-link" aria-label="Direct link to Features" title="Direct link to Features" translate="no">​</a></h4>
<p>We assume all pods use the same accelerator class, a simplification that can be addressed in the future. Beyond that, load and request shape drive most of the variation in TTFT and TPOT.</p>
<table><thead><tr><th style="text-align:left">Feature</th><th style="text-align:left">What It Captures</th><th style="text-align:left">Why It Matters</th></tr></thead><tbody><tr><td style="text-align:left"><strong>KV Cache Usage %</strong></td><td style="text-align:left">How full the decode state is</td><td style="text-align:left">High KV cache -&gt; higher TPOT and slower TTFT when memory is saturated</td></tr><tr><td style="text-align:left"><strong>Input Length</strong></td><td style="text-align:left">Weight of the prefill step</td><td style="text-align:left">Longer prompts -&gt; higher prefill cost -&gt; higher TTFT</td></tr><tr><td style="text-align:left"><strong>Queue Depth</strong></td><td style="text-align:left">Backlog before scheduling</td><td style="text-align:left">More waiting requests -&gt; higher TTFT; correlates with prefill interruptions -&gt; affects TPOT</td></tr><tr><td style="text-align:left"><strong>Running Requests</strong></td><td style="text-align:left">Active GPU concurrency</td><td style="text-align:left">Higher concurrency -&gt; higher TTFT; larger decode batches -&gt; higher TPOT</td></tr><tr><td style="text-align:left"><strong>Prefix Cache Match %</strong></td><td style="text-align:left">How much KV reuse is possible</td><td style="text-align:left">High match -&gt; faster prefill -&gt; lower TTFT; low match -&gt; full attention -&gt; higher TTFT</td></tr><tr><td style="text-align:left"><strong>Input Tokens In Flight</strong></td><td style="text-align:left">Input tokens dispatched but not yet prefilled, plus input tokens already prefilled but still occupying KV cache (request not complete)</td><td style="text-align:left">Captures both incoming prefill pressure and lingering memory footprint -&gt; higher TTFT; helps the model anticipate load before it hits the server</td></tr></tbody></table>
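<p>As a sketch of how these features might feed the regressors (the type and function names here are illustrative, not the actual implementation), the feature vector for one (request, server) pair could be assembled like this:</p>

```python
from dataclasses import dataclass

@dataclass
class ServerState:
    kv_cache_pct: float          # KV cache usage, 0-100
    queue_depth: int             # requests waiting to be scheduled
    running_requests: int        # active GPU concurrency
    input_tokens_in_flight: int  # dispatched or still occupying KV cache

@dataclass
class Request:
    input_len: int               # prompt length in tokens
    prefix_match_pct: float      # fraction of prompt covered by cached prefix, 0-1

def feature_vector(req: Request, srv: ServerState) -> list[float]:
    """Order matters: the TTFT/TPOT regressors are trained on this layout."""
    return [
        srv.kv_cache_pct,
        float(req.input_len),
        float(srv.queue_depth),
        float(srv.running_requests),
        req.prefix_match_pct,
        float(srv.input_tokens_in_flight),
    ]
```

The same vector is used both at training time (paired with observed TTFT/TPOT) and at prediction time (paired with each candidate server's current state).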
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="training-data">Training Data<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#training-data" class="hash-link" aria-label="Direct link to Training Data" title="Direct link to Training Data" translate="no">​</a></h4>
<p>Performance characteristics in LLM serving are batch-dependent and shift too quickly for long-term historical averages to remain meaningful. To stay aligned with real traffic, the model:</p>
<ul>
<li class="">Collects the most recent samples using a <strong>sliding window</strong></li>
<li class="">Stratifies them into coarse <strong>buckets</strong> (KV cache % in steps of 10, prefix hit rate in steps of 0.25, etc.)</li>
<li class="">Continuously retrains on this stratified dataset</li>
</ul>
<p>Bucketing with a sliding window is important because it maintains samples from regimes that aren't showing up in the latest traffic. Without bucketing, a single global sliding window would let the newest data overwrite everything. For example, if current traffic sits around 60% KV cache utilization, older samples from 30% KV cache would eventually disappear -- and the model would forget how to predict in that regime.</p>
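<p>A minimal sketch of this bucketed sliding window (bucket granularities follow the text; class and parameter names are illustrative):</p>

```python
from collections import deque

class StratifiedWindow:
    """Keeps a bounded sliding window of samples per (KV-cache, prefix-hit)
    bucket, so rare regimes are not overwritten by whatever traffic is newest."""

    def __init__(self, per_bucket: int = 200):
        self.per_bucket = per_bucket
        self.buckets: dict[tuple[int, int], deque] = {}

    @staticmethod
    def key(kv_cache_pct: float, prefix_hit: float) -> tuple[int, int]:
        # KV cache % in steps of 10, prefix hit rate in steps of 0.25
        return (int(kv_cache_pct // 10), int(prefix_hit // 0.25))

    def add(self, kv_cache_pct: float, prefix_hit: float, sample) -> None:
        b = self.buckets.setdefault(self.key(kv_cache_pct, prefix_hit),
                                    deque(maxlen=self.per_bucket))
        b.append(sample)  # oldest sample in this bucket is evicted first

    def training_set(self) -> list:
        return [s for b in self.buckets.values() for s in b]
```

With a single global window, a burst of 60%-utilization traffic would evict every 30%-utilization sample; here it can only evict samples within its own bucket.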
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="request-scheduling">Request Scheduling<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#request-scheduling" class="hash-link" aria-label="Direct link to Request Scheduling" title="Direct link to Request Scheduling" translate="no">​</a></h3>
<p>The latency predictor runs as a sidecar to the <a href="https://gateway-api-inference-extension.sigs.k8s.io/" target="_blank" rel="noopener noreferrer" class="">Inference Gateway Endpoint Picker</a> (EPP), hosting both training and prediction servers:</p>
<ul>
<li class=""><strong>Training server:</strong> Continuously trains the model from live traffic, sampling data across KV cache, queue states, and prefix scores to maintain a stratified training dataset. As new requests complete, it refreshes the dataset and retrains the TTFT/TPOT models.</li>
<li class=""><strong>Prediction servers:</strong> Serve the trained models, returning predicted TTFT and TPOT given the current server load and request features.</li>
</ul>
<p>We added a <a href="https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/0845-scheduler-architecture-proposal#score" target="_blank" rel="noopener noreferrer" class="">predicted-latency scorer</a> to the EPP. The scorer compares predicted latencies to per-request SLOs and computes <strong>headroom</strong> (the SLO target minus the predicted latency). It then gives higher scores to servers with positive headroom, packing requests into servers that can still meet SLOs while keeping other pods free for future requests. If no SLOs are provided, it simply prefers servers with lower predicted latencies.</p>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/predicted-latency/architecture_sidecars.webp" alt="Architecture: predicted-latency trainer and predictor sidecars" style="width:85%;height:auto"><p style="font-size:0.9em;margin-top:8px"><em>The predicted-latency trainer and predictor modules are deployed as sidecars to the EPP. The trainer is invoked at the post-response stage. The predictor is invoked during scheduling and optionally at the post-response stage. A new predicted-latency scorer utilizes predictions from the ML model.</em></p></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="prediction-accuracy">Prediction Accuracy<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#prediction-accuracy" class="hash-link" aria-label="Direct link to Prediction Accuracy" title="Direct link to Prediction Accuracy" translate="no">​</a></h3>
<p>Below we show predicted vs actual TTFT and TPOT over a benchmark run (scenario C below) as QPS increases stepwise toward saturation.</p>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/predicted-latency/predicted_vs_actual_tpot.webp" alt="Predicted vs actual TPOT over time" style="width:85%;height:auto"><p style="font-size:0.9em;margin-top:8px"><em>Predicted (red) vs actual (blue) TPOT over time. The model tracks steady-state TPOT well, even at transient spikes.</em></p></div>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/predicted-latency/predicted_vs_actual_ttft.webp" alt="Predicted vs actual TTFT over time" style="width:85%;height:auto"><p style="font-size:0.9em;margin-top:8px"><em>Predicted (red) vs actual (blue) TTFT over time. The model tracks TTFT closely as it ramps from near zero to 5 minutes across increasing QPS levels.</em></p></div>
<p>Across multiple benchmark runs, the model achieves a Mean Absolute Percentage Error (MAPE) of approximately 5%. This is not surprising -- accelerator performance is fairly deterministic given the current server state and request characteristics. The same prompt length, at the same KV cache utilization, with the same number of running requests, will produce similar TTFT and TPOT. The model simply learns this mapping.</p>
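<p>For reference, MAPE is the mean of |predicted - actual| / actual, expressed as a percentage; a quick sketch:</p>

```python
def mape(actual: list[float], predicted: list[float]) -> float:
    """Mean Absolute Percentage Error, in percent."""
    return 100.0 * sum(abs(p - a) / a
                       for a, p in zip(actual, predicted)) / len(actual)

# e.g. predicted TTFTs within a few percent of the observed values
print(round(mape([100.0, 200.0, 400.0], [105.0, 190.0, 400.0]), 2))  # 3.33
```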
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="endpoint-selection">Endpoint Selection<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#endpoint-selection" class="hash-link" aria-label="Direct link to Endpoint Selection" title="Direct link to Endpoint Selection" translate="no">​</a></h3>
<p>Given accurate TTFT and TPOT predictions, there are multiple possible algorithms for choosing the optimal endpoint -- for example, optimization-based approaches or multi-armed bandit strategies. We chose a greedy approach for its simplicity and low overhead, combined with a cache-aware affinity gate to prevent fragmentation.</p>
<p><strong>Latency-based scoring.</strong> When SLOs are provided, the scorer computes <strong>headroom</strong> for each candidate server: how much room remains before the predicted latency exceeds the SLO target. To combine TTFT and TPOT into a single headroom score, we use a weighted combination -- by default 80% TTFT and 20% TPOT -- reflecting that TTFT is typically the more constraining metric in practice. The scheduler then does <strong>best-fit</strong>: it routes to the server with the least positive headroom, packing requests into servers that can still meet SLOs while keeping other servers free for future requests. When no SLOs are provided, the scorer simply routes to the server with the lowest predicted latency (most room).</p>
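<p>The scoring and best-fit selection described above can be sketched as follows (the 0.8/0.2 weights are the stated defaults; function names and the candidate tuple layout are assumptions for illustration):</p>

```python
DEFAULT_TTFT_WEIGHT, DEFAULT_TPOT_WEIGHT = 0.8, 0.2

def headroom_ms(pred_ttft, pred_tpot, slo_ttft, slo_tpot,
                w_ttft=DEFAULT_TTFT_WEIGHT, w_tpot=DEFAULT_TPOT_WEIGHT):
    """Weighted room (ms) before predicted latency exceeds the SLO targets.
    Positive means the server is predicted to meet the SLO."""
    return w_ttft * (slo_ttft - pred_ttft) + w_tpot * (slo_tpot - pred_tpot)

def pick_server(candidates, slo=None):
    """candidates: list of (name, predicted_ttft_ms, predicted_tpot_ms)."""
    if slo is None:
        # No SLO: lowest weighted predicted latency wins.
        return min(candidates,
                   key=lambda c: DEFAULT_TTFT_WEIGHT * c[1]
                               + DEFAULT_TPOT_WEIGHT * c[2])[0]
    slo_ttft, slo_tpot = slo
    scored = [(name, headroom_ms(t, p, slo_ttft, slo_tpot))
              for name, t, p in candidates]
    fitting = [s for s in scored if s[1] > 0]
    if fitting:
        # Best-fit: least positive headroom, packing requests tightly.
        return min(fitting, key=lambda s: s[1])[0]
    # No server fits: fall back to the one closest to fitting.
    return max(scored, key=lambda s: s[1])[0]
```

Note how best-fit inverts the no-SLO preference: with an SLO it deliberately picks the <em>tightest</em> fitting server, not the fastest one.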
<p><strong>Cache-aware affinity gate.</strong> Pure greedy routing can be counterproductive: a pod with no load but no cached prefix may have a lower predicted latency right now, but routing there means paying the full prefill cost and abandoning the cache built up on another pod. Over many requests, this leads to cache fragmentation -- prefixes are scattered across pods, no pod builds deep cache reuse, and the cluster loses the latency benefit of caching altogether. The opposite extreme is equally harmful: always routing to the pod with the best cache match concentrates popular prefixes on a few pods, which collapse under memory pressure.</p>
<p>To balance cache exploitation with exploration, the scorer uses an epsilon-greedy affinity gate:</p>
<ul>
<li class="">
<p><strong>Exploit (99%):</strong> Filter candidates to pods whose prefix cache score exceeds a threshold (<code>affinityGateTau</code>, default 0.80). Among these, select the pod with the best predicted latency. Because the latency model already credits the cache benefit, this naturally picks the pod where cache reuse translates into actual lower latency, not just the highest raw cache score.</p>
</li>
<li class="">
<p><strong>Explore (1%):</strong> With probability <code>epsilonExploreSticky</code> (default 0.01), ignore the affinity gate entirely and consider all pods. These seeds cache entries on non-sticky pods, ensuring the cluster maintains cache diversity. Over time, these seeded entries grow into viable affinity targets, preventing the system from collapsing into a few overloaded cache-hot pods.</p>
</li>
<li class="">
<p><strong>Load gate:</strong> Even in the exploit path, if the best sticky pod's predicted TTFT exceeds the best overall pod's TTFT by more than <code>affinityMaxTTFTPenaltyMs</code> (default 5000ms), affinity is broken. This catches the case where queueing cost on a cache-hot pod has grown to outweigh the cache benefit; the predictor's latency estimate makes this comparison possible without manual thresholds on queue depth or memory.</p>
</li>
</ul>
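<p>Putting the three rules together, the affinity gate might look like this sketch (parameter values are the stated defaults; the pod tuple layout and function name are assumptions):</p>

```python
import random

AFFINITY_GATE_TAU = 0.80               # minimum prefix cache score for "sticky" pods
EPSILON_EXPLORE_STICKY = 0.01          # probability of ignoring the gate
AFFINITY_MAX_TTFT_PENALTY_MS = 5000.0  # break affinity beyond this queueing cost

def select_pod(pods, rng=random.random):
    """pods: list of (name, prefix_cache_score, predicted_ttft_ms)."""
    best_overall = min(pods, key=lambda p: p[2])
    sticky = [p for p in pods if p[1] >= AFFINITY_GATE_TAU]
    # Explore path: occasionally seed cache entries on non-sticky pods.
    if not sticky or rng() < EPSILON_EXPLORE_STICKY:
        return best_overall[0]
    best_sticky = min(sticky, key=lambda p: p[2])
    # Load gate: a cache-hot pod whose queueing outweighs the cache
    # benefit is abandoned in favor of the overall fastest pod.
    if best_sticky[2] - best_overall[2] > AFFINITY_MAX_TTFT_PENALTY_MS:
        return best_overall[0]
    return best_sticky[0]
```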
<p>The default threshold of 0.80 comes from production observation: prefix cache scores follow a bimodal distribution, roughly half of request-pod pairs have very high cache match (&gt;0.80) and half have low match (&lt;0.80). This reflects how prefix caching works in practice: in multi-turn conversations, a pod either has the conversation history cached from prior turns or it doesn't. Partial matches from unrelated conversations contribute very little because caching is block-based. The 0.80 threshold cleanly separates these two populations, so the affinity gate routes to pods that genuinely have your conversation cached rather than pods with incidental partial matches.</p>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/predicted-latency/prefix_hit_ratio_dist.webp" alt="Prefix cache hit ratio distribution showing bimodal pattern" style="width:85%;height:auto"><p style="font-size:0.9em;margin-top:8px"><em>A typical production prefix hit ratio distribution observed in internal workloads, showing the bimodal pattern that motivates the 0.80 affinity threshold.</em></p></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="benchmark-scenario-comparison">Benchmark Scenario Comparison<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#benchmark-scenario-comparison" class="hash-link" aria-label="Direct link to Benchmark Scenario Comparison" title="Direct link to Benchmark Scenario Comparison" translate="no">​</a></h2>
<p>The table below contrasts five scenarios, ranging from cache-friendly (high prefix-sharing) to cache-intensive scenarios. We used the <a href="https://github.com/kubernetes-sigs/inference-perf/pull/301" target="_blank" rel="noopener noreferrer" class=""><strong>inference-perf</strong></a> library to enable shared prefix benchmarking configurations with multi-turn chat support. See <a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#appendix" class="">Appendix</a> for a complete analysis of the workloads.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="load-balancing-scorers">Load Balancing Scorers<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#load-balancing-scorers" class="hash-link" aria-label="Direct link to Load Balancing Scorers" title="Direct link to Load Balancing Scorers" translate="no">​</a></h3>
<ul>
<li class="">
<p><strong>Predicted Latency Scorer</strong></p>
</li>
<li class="">
<p><strong>Load+Prefix Scorer:</strong> Combines pod load metrics (KV cache utilization and queued requests) with prefix cache awareness, balancing between resource utilization and cache locality.</p>
<ul>
<li class=""><strong>Load metrics:</strong> KV cache utilization and queued request count</li>
<li class=""><strong>Prefix cache awareness:</strong> Considers cached prefix availability</li>
</ul>
<p>The set of <a href="https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/0f0dff6866da606b6439254eda787b24be1b5110/config/charts/inferencepool/templates/epp-config.yaml#L55" target="_blank" rel="noopener noreferrer" class="">weights</a> used was:
(1, 1, 1) prefix scorer: 1, queue scorer: 1, kv cache scorer: 1</p>
</li>
<li class="">
<p><strong>K8s Default Load Balancer:</strong> Standard Kubernetes round-robin or least-connection load balancing without cache or latency awareness (baseline).</p>
</li>
</ul>
<p>Note that <strong>Predicted Latency Scorer</strong> eliminates the need to manually tune relative weights between different scoring components, as the ML model learns optimal trade-offs from historical data.</p>
<hr>
<p><strong>Hardware Configuration:</strong> 10 model servers, each with 2x H100 80GB GPUs (TP=2, DP=1, EP=1, no disaggregation) for scenarios A–D. For the ShareGPT workload, which has much shorter prompts, we used 8 model servers, each with 1x H100 80GB GPU (TP=1), to achieve high KV cache utilization.</p>
<p>Benchmark configuration: we tested multiple scenarios detailed in the following table. Think of <em>num_groups</em> as the number of unique system prompts and <em>num_prompts_per_group</em> as the number of users that share a system prompt.</p>
<table><thead><tr><th style="text-align:left">Scenario</th><th style="text-align:left">Description</th><th style="text-align:left">Benchmark Configuration</th><th style="text-align:left">Best Performing Scorer</th></tr></thead><tbody><tr><td style="text-align:left"><strong>A. Shared Prefix -- High System Prompt Overlap, No System Cache Pressure</strong></td><td style="text-align:left">This workload represents a regime where shared system prefixes amortize extremely well, and user context grows slowly enough that cache pressure remains low.</td><td style="text-align:left"><code>num_groups=6</code> <code>num_prompts_per_group=1000</code> <code>system_prompt_len=1000</code> <code>question_len=30 +/- 9</code> <code>output_len=1000 +/- 300</code> <code>enable_multi_turn_chat=true</code></td><td style="text-align:left"><strong>Predicted Latency Scorer</strong></td></tr><tr><td style="text-align:left"><strong>B. Shared Prefix -- Moderate System Prompt Overlap, No System Cache Pressure</strong></td><td style="text-align:left">This workload also represents a regime where shared system prefixes amortize extremely well. However user context grows faster than workload A leading to onset of prefix cache evictions at a lower QPS.</td><td style="text-align:left"><code>num_groups=6</code> <code>num_prompts_per_group=1000</code> <code>system_prompt_len=1000</code> <code>question_len=3000 +/- 900</code> <code>output_len=1000 +/- 300</code> <code>enable_multi_turn_chat=true</code></td><td style="text-align:left"><strong>Predicted Latency Scorer</strong></td></tr><tr><td style="text-align:left"><strong>C. Shared Prefix -- Low System Prompt Overlap, High System Cache Pressure</strong></td><td style="text-align:left">In this workload large system prompts fail to amortize due to low reuse, quickly consuming cache capacity. 
System-prefix eviction destroys shared reuse, leading to abrupt performance degradation across users.</td><td style="text-align:left"><code>num_groups=150</code> <code>num_prompts_per_group=5</code> <code>system_prompt_len=6000</code> <code>question_len=1200 +/- 360</code> <code>output_len=1000 +/- 300</code> <code>enable_multi_turn_chat=true</code></td><td style="text-align:left"><strong>Comparable</strong></td></tr><tr><td style="text-align:left"><strong>D. Shared Prefix -- Low System Prompt Overlap, Low System Cache Pressure</strong></td><td style="text-align:left">In this workload, like in workload C, system prompts fail to amortize due to low reuse. But smaller system prompts relative to user prompts ensure that performance is dominated by per-request computation rather than system cache.</td><td style="text-align:left"><code>num_groups=150</code> <code>num_prompts_per_group=5</code> <code>system_prompt_len=1000</code> <code>question_len=6200 +/- 1860</code> <code>output_len=1000 +/- 300</code> <code>enable_multi_turn_chat=true</code></td><td style="text-align:left"><strong>Predicted Latency Scorer</strong></td></tr><tr><td style="text-align:left"><strong>ShareGPT</strong></td><td style="text-align:left">A chatbot-style workload with minimal prefix overlap across prompts.</td><td style="text-align:left"></td><td style="text-align:left"><strong>Predicted Latency Scorer</strong></td></tr></tbody></table>
<hr>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="results">Results<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#results" class="hash-link" aria-label="Direct link to Results" title="Direct link to Results" translate="no">​</a></h3>
<p>Below we compare different load balancing strategies across the four scenarios above and the ShareGPT dataset. In every scenario, the QPS is increased until throughput saturation occurs and queues begin to form. At each point, requests are sent for 100 seconds, and we wait for their completion before moving on to the next QPS. No SLOs were assumed; the predicted latency scorer simply selects pods with lower predicted latency.</p>
<p>The charts show two metrics: <strong>NTPOT</strong> (Normalized Time Per Output Token — E2E Latency divided by output length to make it comparable across requests of different output lengths), and <strong>output tokens per sec</strong>.</p>
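<p>The NTPOT normalization is simply end-to-end latency divided by output length:</p>

```python
def ntpot_ms(e2e_latency_ms: float, output_tokens: int) -> float:
    """Normalized Time Per Output Token: comparable across output lengths."""
    return e2e_latency_ms / output_tokens

print(ntpot_ms(12000.0, 400))  # 30.0 ms per output token
```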
<p>In Scenarios A and B, where system cache is amortized across pods, the predicted latency scorer performs best. In Scenario D, which has some system cache churn but user prompts much larger than system prompts, the predicted latency scorer performs as well as load+prefix-aware routing with weights (1, 1, 1). In Scenario C, which has very high system cache churn, the predicted latency scorer performs comparably to load+prefix-aware scorers, while still outperforming standard Kubernetes load balancing. In this scenario, performance is governed by <strong>discrete cache-eviction events rather than gradual saturation</strong>, while the latency predictor's <em>greedy</em> routing strategy is inherently better suited to modeling <strong>continuous resource contention</strong>, such as queueing and KV-cache utilization. Alternate prefix distribution strategies could further improve performance in high-churn scenarios. For instance, <a href="https://github.com/vllm-project/vllm/issues/23083" target="_blank" rel="noopener noreferrer" class=""><strong>pinning critical KV cache</strong></a> prefixes (like system prompts in this case) ensures they remain non-evictable. Similarly, using a <a href="https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators" target="_blank" rel="noopener noreferrer" class=""><strong>no-hit-lru-scorer</strong></a> can improve performance by intelligently distributing "cold" requests to prevent hotspots during the formation of new prefix-caches.</p>
<p><strong>Overall, predicted-latency aware routing consistently performs as well as or better than standard Kubernetes routing and load+prefix-aware routing in all tested scenarios, while eliminating the need for manual parameter tuning.</strong></p>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/predicted-latency/benchmark_scenario_a.webp" alt="Scenario A benchmark results" style="width:85%;height:auto"></div>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/predicted-latency/benchmark_scenario_b.webp" alt="Scenario B benchmark results" style="width:85%;height:auto"></div>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/predicted-latency/benchmark_scenario_c.webp" alt="Scenario C benchmark results" style="width:85%;height:auto"></div>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/predicted-latency/benchmark_scenario_d.webp" alt="Scenario D benchmark results" style="width:85%;height:auto"></div>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/predicted-latency/benchmark_sharegpt.webp" alt="ShareGPT benchmark results" style="width:85%;height:auto"></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="production-workload">Production Workload<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#production-workload" class="hash-link" aria-label="Direct link to Production Workload" title="Direct link to Production Workload" translate="no">​</a></h2>
<p>In addition to the synthetic scenarios above, we evaluated against a workload derived from real production traffic at an internal Google service. The benchmarking profile was constructed from 7 days of traffic serving a large open model via vLLM, analyzing input token counts, output token counts, request rates, and prefix cache hit rates. The resulting profile represents the median (p50) production request.</p>
<p>This workload exhibits the characteristics discussed in the introduction: high variance in both input token counts (mean 729, std 13550, max 131072) and output token counts (mean 300, std 2213, max 8192), along with high but unstable prefix cache reuse (~94% at peak, with frequent collapses). This makes it a strong test of whether routing strategies can adapt to rapidly shifting traffic patterns.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="benchmark-configuration">Benchmark Configuration<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#benchmark-configuration" class="hash-link" aria-label="Direct link to Benchmark Configuration" title="Direct link to Benchmark Configuration" translate="no">​</a></h3>
<ul>
<li class=""><strong>Model:</strong> Qwen3-480B on vLLM</li>
<li class=""><strong>Hardware:</strong> 13 servers, each with 8x NVIDIA H200 GPUs</li>
<li class=""><strong>Traffic shape:</strong> 8-stage load ladder alternating between 1.0 and 5.0 RPS to simulate realistic traffic spikes</li>
<li class=""><strong>Request distribution:</strong> system_prompt_len=1000, question_len=729 +/- 13550, output_len=300 +/- 2213, multi_turn=true</li>
<li class=""><strong>Cache hit rate:</strong> ~94% (matching production peak)</li>
<li class=""><strong>Load type:</strong> Poisson distribution with concurrency limit of 1000</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="routing-strategies-compared">Routing Strategies Compared<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#routing-strategies-compared" class="hash-link" aria-label="Direct link to Routing Strategies Compared" title="Direct link to Routing Strategies Compared" translate="no">​</a></h3>
<table><thead><tr><th style="text-align:left">Strategy</th><th style="text-align:left">Description</th></tr></thead><tbody><tr><td style="text-align:left"><strong>No Gateway</strong></td><td style="text-align:left">Direct connection to model servers (k8s round-robin baseline)</td></tr><tr><td style="text-align:left"><strong>Old Params (111)</strong></td><td style="text-align:left">Load+prefix scorer with weights (1, 1, 1)</td></tr><tr><td style="text-align:left"><strong>New Params (322)</strong></td><td style="text-align:left">Load+prefix scorer with tuned weights (3, 2, 2)</td></tr><tr><td style="text-align:left"><strong>Latency Prediction</strong></td><td style="text-align:left">Predicted-latency based routing, replacing all heuristic scorers</td></tr></tbody></table>
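<p>For intuition, the heuristic strategies above combine per-pod sub-scores with fixed weights, while latency prediction replaces the weighted sum with a direct estimate. The sketch below is an illustrative model only; the scorer names, normalization, and formula are our assumptions, not the actual EPP scorer plugins:</p>

```python
# Illustrative model of load+prefix weighted scoring. The three weights
# correspond loosely to the (queue, kv-cache, prefix) tuples like (1, 1, 1)
# and (3, 2, 2) in the table above. All field names and the normalization
# are hypothetical, for illustration only.

def heuristic_score(pod: dict, weights=(1, 1, 1)) -> float:
    w_queue, w_kv, w_prefix = weights
    # Each sub-score is assumed normalized to [0, 1]; higher is better.
    return (w_queue * (1 - pod["queue_depth_norm"])
            + w_kv * (1 - pod["kv_utilization"])
            + w_prefix * pod["prefix_hit_ratio"])

pods = [
    {"name": "pod-a", "queue_depth_norm": 0.5, "kv_utilization": 0.5, "prefix_hit_ratio": 1.0},
    {"name": "pod-b", "queue_depth_norm": 0.0, "kv_utilization": 0.1, "prefix_hit_ratio": 0.0},
]

# With equal weights, the warm prefix cache on pod-a wins:
best_111 = max(pods, key=lambda p: heuristic_score(p, (1, 1, 1)))
print(best_111["name"])  # pod-a  (scores: 2.0 vs 1.9)

# With load-favoring weights, the idle pod-b wins instead:
best_322 = max(pods, key=lambda p: heuristic_score(p, (3, 2, 2)))
print(best_322["name"])  # pod-b  (scores: 4.5 vs 4.8)
```

<p>The point of the sketch is that the winner depends on the hand-picked weights; that tuning burden is exactly what direct latency prediction removes.</p>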
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="results-1">Results<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#results-1" class="hash-link" aria-label="Direct link to Results" title="Direct link to Results" translate="no">​</a></h3>
<table><thead><tr><th style="text-align:left">Scenario</th><th style="text-align:left">Success Rate</th><th style="text-align:left">E2E p50</th><th style="text-align:left">E2E p95</th><th style="text-align:left">TTFT p50</th><th style="text-align:left">TTFT p95</th><th style="text-align:left">TPOT p50</th><th style="text-align:left">TPOT p99</th></tr></thead><tbody><tr><td style="text-align:left">No Gateway</td><td style="text-align:left">~99.8%</td><td style="text-align:left">15.98s</td><td style="text-align:left">38.85s</td><td style="text-align:left">4.47s</td><td style="text-align:left">24.04s</td><td style="text-align:left">35ms</td><td style="text-align:left">93ms</td></tr><tr><td style="text-align:left">Load + Prefix (111)</td><td style="text-align:left">~99.9%</td><td style="text-align:left">16.42s</td><td style="text-align:left">35.06s</td><td style="text-align:left">2.86s</td><td style="text-align:left">18.06s</td><td style="text-align:left">39ms</td><td style="text-align:left">103ms</td></tr><tr><td style="text-align:left">Load + Prefix (322)</td><td style="text-align:left">100%</td><td style="text-align:left">13.42s</td><td style="text-align:left">26.55s</td><td style="text-align:left">3.38s</td><td style="text-align:left">16.78s</td><td style="text-align:left">28ms</td><td style="text-align:left">63ms</td></tr><tr><td style="text-align:left"><strong>Latency Prediction</strong></td><td style="text-align:left"><strong>~99.9%</strong></td><td style="text-align:left"><strong>9.06s</strong></td><td style="text-align:left"><strong>22.57s</strong></td><td style="text-align:left"><strong>0.97s</strong></td><td style="text-align:left"><strong>11.34s</strong></td><td style="text-align:left"><strong>22ms</strong></td><td style="text-align:left"><strong>53ms</strong></td></tr></tbody></table>
<p>Latency prediction delivers the lowest latency across all metrics:</p>
<ul>
<li class=""><strong>E2E latency:</strong> 9.06s p50 (43% lower than the Kubernetes baseline and 32% lower than the best heuristic-based approach) and 22.57s p95 (15% lower than the best heuristic)</li>
<li class=""><strong>TTFT:</strong> 0.97s p50 (70% lower than the tuned heuristic's 3.38s) and 11.34s p95 (32% lower)</li>
<li class=""><strong>TPOT:</strong> 22ms p50 and 53ms p99, the lowest across all strategies</li>
</ul>
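<p>The percentages above can be rechecked directly from the Results table; a small sketch (values copied from the table, in seconds):</p>

```python
# Recompute the relative improvements from the Results table above.

def pct_lower(new: float, old: float) -> float:
    """How much lower `new` is than `old`, in percent (1 decimal)."""
    return round((old - new) / old * 100, 1)

# E2E p50: latency prediction (9.06 s) vs. tuned heuristic (13.42 s)
# and vs. the no-gateway Kubernetes baseline (15.98 s).
print(pct_lower(9.06, 13.42))    # 32.5 (vs. best heuristic)
print(pct_lower(9.06, 15.98))    # 43.3 (vs. k8s baseline)
# TTFT: 0.97 s p50 and 11.34 s p95 vs. the tuned heuristic (3.38 s / 16.78 s).
print(pct_lower(0.97, 3.38))     # 71.3
print(pct_lower(11.34, 16.78))   # 32.4
```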
<p>The improvement is most pronounced at the high-load stages (stages 3, 5, 7), where latency prediction consistently stays well below all other strategies.</p>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/predicted-latency/benchmark_production_realistic.webp" alt="Production-realistic workload results" style="width:85%;height:auto"></div>
<p>These results are particularly notable because the Load+Prefix (3,2,2) weights were specifically tuned for this workload based on analysis of the production traffic profile. Latency prediction outperforms it -- and all other heuristic-based approaches -- without any workload-specific tuning.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="prediction-server-scalability">Prediction Server Scalability<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#prediction-server-scalability" class="hash-link" aria-label="Direct link to Prediction Server Scalability" title="Direct link to Prediction Server Scalability" translate="no">​</a></h2>
<p>At high replica counts, the EPP issues one prediction call per candidate pod per incoming request, so prediction QPS scales with both request rate and cluster size. At each QPS level in the table below, we assume 100 model server endpoints, so the prediction server is generating predictions for all 100 pods per request. To handle this, the Go sidecar coalesces concurrent EPP prediction requests within a 1ms window into a single batched HTTP call, and load-balances across multiple prediction server instances -- each running 28 uvicorn workers. Latency scales roughly linearly with QPS, and adding a prediction server adds 28 cores of inference capacity.</p>
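<p>The coalescing behavior can be modeled as a small asyncio batcher. This is a sketch of the idea only, not the actual Go sidecar; all names here are hypothetical:</p>

```python
import asyncio

class PredictionCoalescer:
    """Illustrative model of the sidecar's batching: prediction calls that
    arrive within a short window are grouped into one batched backend call."""

    def __init__(self, backend, window_s: float = 0.001):
        self.backend = backend     # async fn: list of features -> list of predictions
        self.window_s = window_s   # coalescing window (1 ms in the setup above)
        self._pending = []         # (features, future) pairs awaiting the next flush
        self._flusher = None

    async def predict(self, features):
        fut = asyncio.get_running_loop().create_future()
        self._pending.append((features, fut))
        if self._flusher is None:  # first request in the window starts the timer
            self._flusher = asyncio.ensure_future(self._flush_after_window())
        return await fut

    async def _flush_after_window(self):
        await asyncio.sleep(self.window_s)
        batch, self._pending, self._flusher = self._pending, [], None
        results = await self.backend([f for f, _ in batch])  # one batched call
        for (_, fut), result in zip(batch, results):
            fut.set_result(result)

async def demo():
    calls = []
    async def fake_backend(batch):      # stand-in for the HTTP prediction server
        calls.append(len(batch))
        return [x * 2 for x in batch]   # dummy "prediction"
    c = PredictionCoalescer(fake_backend)
    # 100 concurrent per-pod prediction requests land in a single backend call.
    results = await asyncio.gather(*(c.predict(i) for i in range(100)))
    return calls, results

calls, results = asyncio.run(demo())
print(calls)        # [100] -- one batched call for all 100 requests
print(results[:3])  # [0, 2, 4]
```

<p>The same window-then-flush pattern amortizes HTTP overhead across candidate pods, which is why prediction latency grows only modestly with QPS in the table below.</p>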
<table><thead><tr><th style="text-align:center">QPS</th><th style="text-align:center">Avg (ms)</th><th style="text-align:center">p50 (ms)</th><th style="text-align:center">p99 (ms)</th><th style="text-align:center">p99.9 (ms)</th><th style="text-align:center">Prediction Servers</th></tr></thead><tbody><tr><td style="text-align:center">1</td><td style="text-align:center">15.7</td><td style="text-align:center">15</td><td style="text-align:center">25</td><td style="text-align:center">25</td><td style="text-align:center">1</td></tr><tr><td style="text-align:center">10</td><td style="text-align:center">13.5</td><td style="text-align:center">13</td><td style="text-align:center">18</td><td style="text-align:center">46</td><td style="text-align:center">1</td></tr><tr><td style="text-align:center">100</td><td style="text-align:center">12.8</td><td style="text-align:center">12</td><td style="text-align:center">16</td><td style="text-align:center">46</td><td style="text-align:center">1</td></tr><tr><td style="text-align:center">1,000</td><td style="text-align:center">15.0</td><td style="text-align:center">15</td><td style="text-align:center">26</td><td style="text-align:center">49</td><td style="text-align:center">1</td></tr><tr><td style="text-align:center">2,500</td><td style="text-align:center">~19</td><td style="text-align:center">~17</td><td style="text-align:center">~36</td><td style="text-align:center">~65</td><td style="text-align:center">1</td></tr><tr><td style="text-align:center">5,000</td><td style="text-align:center">~27</td><td style="text-align:center">~23</td><td style="text-align:center">~74</td><td style="text-align:center">~99</td><td style="text-align:center">2</td></tr><tr><td style="text-align:center">7,500</td><td style="text-align:center">~35</td><td style="text-align:center">~30</td><td style="text-align:center">~96</td><td style="text-align:center">~137</td><td style="text-align:center">3</td></tr><tr><td style="text-align:center">10,000</td><td 
style="text-align:center">~48</td><td style="text-align:center">~40</td><td style="text-align:center">~137</td><td style="text-align:center">~189</td><td style="text-align:center">4</td></tr></tbody></table>
<p><em>5-minute stability test at each level. Each prediction server runs 28 uvicorn workers on a C4 machine (28 cores). All runs achieved 100% success rate.</em></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="try-it-a-well-lit-path">Try It: A Well-Lit Path<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#try-it-a-well-lit-path" class="hash-link" aria-label="Direct link to Try It: A Well-Lit Path" title="Direct link to Try It: A Well-Lit Path" translate="no">​</a></h2>
<p><strong>Prereqs</strong></p>
<ul>
<li class="">Install the Inference Gateway extension with the latency prediction sidecars:
<a href="https://gateway-api-inference-extension.sigs.k8s.io/guides/latency-based-predictor/" target="_blank" rel="noopener noreferrer" class="">https://gateway-api-inference-extension.sigs.k8s.io/guides/latency-based-predictor/</a></li>
</ul>
<p><strong>Smoke test</strong></p>
<ol>
<li class=""><strong>Health checks</strong></li>
</ol>
<div class="language-shell codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-shell codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">kubectl get pods</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">curl http://&lt;pod-ip&gt;:8000/readyz   # training</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">curl http://&lt;pod-ip&gt;:8001/readyz   # prediction (and 8002, 8003, ...)</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"># EPP health on 9003</span><br></span></code></pre></div></div>
<ol start="2">
<li class=""><strong>Send an SLO-aware request</strong></li>
</ol>
<div class="language-shell codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-shell codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">curl -v $GW_IP/v1/completions \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  -H 'Content-Type: application/json' \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  -H 'x-prediction-based-scheduling: true' \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  -H 'x-slo-ttft-ms: 200' \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  -H 'x-slo-tpot-ms: 50' \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  -d '{</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    "model": "Qwen/Qwen3-32B",</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    "prompt": "what is the difference between Franz and Apache Kafka?",</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    "max_tokens": 200,</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    "temperature": 0,</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    "stream_options": {"include_usage": true},</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    "stream": true</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  }'</span><br></span></code></pre></div></div>
<ol start="3">
<li class=""><strong>Watch the picker think (EPP logs, <code>-v=4</code>)</strong></li>
</ol>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">msg:"Running profile handler, Pick profiles" plugin:"slo-aware-profile-handler/slo-aware-profile-handler"</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">msg:"Before running scorer plugins" pods:[{... "pod_name":"...-5k7qr"}, {... "pod_name":"...-9lp5g"}]</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">msg:"Pod score" scorer_type:"slo-scorer" pod_name:"vllm-...-9b4wt" score:0.82</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">msg:"Picked endpoint" scorer_type:"slo-scorer" selected_pod:"vllm-...-9b4wt"</span><br></span></code></pre></div></div>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="tradeoffs--gaps">Tradeoffs &amp; Gaps<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#tradeoffs--gaps" class="hash-link" aria-label="Direct link to Tradeoffs &amp; Gaps" title="Direct link to Tradeoffs &amp; Gaps" translate="no">​</a></h2>
<p>The following areas highlight current limitations and ongoing work for the Predicted-Latency Aware Scheduling system:</p>
<ul>
<li class=""><strong>Addressing Homogeneous Pool Assumptions</strong>
Current models assume uniform serving pods regarding GPU types and runtimes. Future updates will incorporate richer features and per-variant training to better support heterogeneous pools.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="takeaway">Takeaway<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#takeaway" class="hash-link" aria-label="Direct link to Takeaway" title="Direct link to Takeaway" translate="no">​</a></h2>
<p>Accelerator performance is fairly predictable when we account for both the current model server GPU state and request characteristics. By applying online machine learning with a narrow horizon, we can train a model that avoids overfitting while staying accurate to changing workloads. With a good predictor in place, we can route requests based on expected latency, leading to smarter and more efficient load balancing.</p>
<hr>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="get-involved">Get Involved<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#get-involved" class="hash-link" aria-label="Direct link to Get Involved" title="Direct link to Get Involved" translate="no">​</a></h3>
<ul>
<li class=""><strong>Docs &amp; guides</strong>: <a href="https://gateway-api-inference-extension.sigs.k8s.io/guides/latency-based-predictor/" target="_blank" rel="noopener noreferrer" class="">https://gateway-api-inference-extension.sigs.k8s.io/guides/latency-based-predictor/</a></li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="appendix">Appendix<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#appendix" class="hash-link" aria-label="Direct link to Appendix" title="Direct link to Appendix" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="multi-turn-cache-capacity-analysis">Multi-Turn Cache Capacity Analysis<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#multi-turn-cache-capacity-analysis" class="hash-link" aria-label="Direct link to Multi-Turn Cache Capacity Analysis" title="Direct link to Multi-Turn Cache Capacity Analysis" translate="no">​</a></h3>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="theoretical-capacity-estimates">Theoretical Capacity Estimates<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#theoretical-capacity-estimates" class="hash-link" aria-label="Direct link to Theoretical Capacity Estimates" title="Direct link to Theoretical Capacity Estimates" translate="no">​</a></h4>
<p>This analysis evaluates what percentage of user and system prompts can be prefix cached assuming perfect load balancing. Assuming <strong>10 pods</strong> (H100 80 GB), the total cluster capacity is ~<strong>5,120,000 tokens</strong>.</p>
<p><strong>Multi-Round Context Assumptions:</strong></p>
<ul>
<li class="">We assume a <strong>multi-turn chat</strong> scenario where the conversation context grows cumulatively.</li>
<li class="">For every round, the previous question AND the previous model response are appended to the context of the next question.</li>
<li class="">To satisfy a 5-turn session, the cache must hold the "Working Set" for 4 rounds (the accumulated history of Turns 1-4 serves as the prefix for Turn 5).</li>
<li class="">The calculation assumes that System Prompts are prioritized and pinned in the cache. User history is only allocated space from the remaining capacity after all unique system prompts for the active groups are stored.</li>
</ul>
<p><strong>Key Variables:</strong></p>
<ul>
<li class=""><strong>Total Capacity:</strong> 32000 blocks * 16 tokens * 10 pods = 5,120,000 tokens</li>
<li class=""><strong>Unique Sys Tokens:</strong> #Groups * System Prompt Tokens</li>
<li class=""><strong>Unique User Tokens (4 Rounds):</strong> #Users * (User Prompt Tokens + Output Len) * 4</li>
</ul>
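<p>These formulas reproduce the capacity table below row by row; a quick sketch:</p>

```python
# Recompute the theoretical capacity table from the Key Variables above.
# Per workload: (groups, users_per_group, sys_prompt_len, user_prompt_len, output_len)
WORKLOADS = {
    "A": (6,   1000, 1000, 30,   1000),
    "B": (6,   1000, 1000, 3000, 1000),
    "C": (150, 5,    6000, 1200, 1000),
    "D": (150, 5,    1000, 6200, 1000),
}

CAPACITY = 32_000 * 16 * 10  # blocks * tokens/block * pods = 5,120,000 tokens
ROUNDS = 4                   # accumulated history of turns 1-4 is the prefix for turn 5

for name, (groups, users_per_group, sys_len, user_len, out_len) in WORKLOADS.items():
    users = groups * users_per_group
    sys_tokens = groups * sys_len                         # unique system-prompt tokens
    user_tokens = users * (user_len + out_len) * ROUNDS   # 4-round user working set
    # System prompts are assumed pinned first; user history gets the remainder.
    sys_fit = min(1.0, CAPACITY / sys_tokens)
    user_fit = max(0.0, CAPACITY - sys_tokens) / user_tokens
    print(f"{name}: sys={sys_tokens:>9,}  user={user_tokens:>11,}  "
          f"sys_fit={sys_fit:6.2%}  user_fit={user_fit:6.2%}")
# Prints, per row: A -> 20.69%, B -> 5.33%, C -> 63.94%, D -> 23.01% user fit.
```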
<table><thead><tr><th style="text-align:left">Workload</th><th style="text-align:center">Groups</th><th style="text-align:center">Users Per Group</th><th style="text-align:center">Sys Prompt Tokens</th><th style="text-align:center">User Prompt Tokens</th><th style="text-align:center">Output Len</th><th style="text-align:center">Total Unique Sys Tokens</th><th style="text-align:center">Total Rounds</th><th style="text-align:center">Total Unique User Tokens (4 rounds)</th><th style="text-align:center">Total Cache Capacity</th><th style="text-align:center">% sys prompts fit</th><th style="text-align:center">% user prompts fit</th></tr></thead><tbody><tr><td style="text-align:left">A</td><td style="text-align:center">6</td><td style="text-align:center">1,000</td><td style="text-align:center">1,000</td><td style="text-align:center">30</td><td style="text-align:center">1,000</td><td style="text-align:center">6,000</td><td style="text-align:center">4</td><td style="text-align:center">24,720,000</td><td style="text-align:center">5,120,000</td><td style="text-align:center">100.00%</td><td style="text-align:center">20.69%</td></tr><tr><td style="text-align:left">B</td><td style="text-align:center">6</td><td style="text-align:center">1,000</td><td style="text-align:center">1,000</td><td style="text-align:center">3,000</td><td style="text-align:center">1,000</td><td style="text-align:center">6,000</td><td style="text-align:center">4</td><td style="text-align:center">96,000,000</td><td style="text-align:center">5,120,000</td><td style="text-align:center">100.00%</td><td style="text-align:center">5.33%</td></tr><tr><td style="text-align:left">C</td><td style="text-align:center">150</td><td style="text-align:center">5</td><td style="text-align:center">6,000</td><td style="text-align:center">1,200</td><td style="text-align:center">1,000</td><td style="text-align:center">900,000</td><td style="text-align:center">4</td><td style="text-align:center">6,600,000</td><td style="text-align:center">5,120,000</td><td style="text-align:center">100.00%</td><td style="text-align:center">63.94%</td></tr><tr><td style="text-align:left">D</td><td style="text-align:center">150</td><td style="text-align:center">5</td><td style="text-align:center">1,000</td><td style="text-align:center">6,200</td><td style="text-align:center">1,000</td><td style="text-align:center">150,000</td><td style="text-align:center">4</td><td style="text-align:center">21,600,000</td><td style="text-align:center">5,120,000</td><td style="text-align:center">100.00%</td><td style="text-align:center">23.01%</td></tr></tbody></table>
<p>In theory, then, there is enough capacity to store all system prompts. However, as shown below, the ratio between user and system prompt tokens also matters, because it determines the cache-eviction dynamics.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="kv-cache-behavior-across-workloads-a-d">KV Cache Behavior Across Workloads A-D<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#kv-cache-behavior-across-workloads-a-d" class="hash-link" aria-label="Direct link to KV Cache Behavior Across Workloads A-D" title="Direct link to KV Cache Behavior Across Workloads A-D" translate="no">​</a></h3>
<p>This section analyzes how <strong>workload inputs shape cache behavior</strong>, as manifested in prefix reuse, cache pressure, and eviction dynamics, using a simple event-based <a href="https://colab.research.google.com/drive/1ZsSIDHROFklDPoX8RnqKpw4D3RQs34d8#scrollTo=sAjdTFITo6DV" target="_blank" rel="noopener noreferrer" class="">simulation</a>.</p>
<p>Each workload varies six core inputs:</p>
<ul>
<li class=""><strong>System prompt size</strong> (<code>system_prompt_len</code>)</li>
<li class=""><strong>Prefix sharing structure</strong> (<code>num_groups</code>, each group having a unique system prompt, and <code>num_prompts_per_group</code>, the number of users per group)</li>
<li class=""><strong>User context growth</strong> (<code>enable_multi_turn_chat</code>): when enabled, each user appends their new prompt to their previous prompts and responses.</li>
<li class=""><strong>Request shape</strong> (<code>question_len</code>, <code>output_len</code>)</li>
</ul>
<p>The simulation produces the figures below, which should be read as answering four questions:</p>
<ul>
<li class="">How much of the system prompt and user prompt is typically reused?</li>
<li class="">How quickly does cache pressure build under load?</li>
<li class="">When evictions occur, <em>what</em> is being evicted?</li>
<li class="">At what QPS do evictions occur?</li>
</ul>
<p>Here we assume a single model server running on a <strong>2 x H100 80 GB pod (TP=2)</strong>. The requests are sent at different QPS for a fixed duration (100 secs).</p>
<p><strong>Note:</strong> We assume cache creation and eviction is instantaneous; in reality, it depends on the prefill and decode times which we are not simulating. So effectively, in this simulation, QPS functions only as a proxy for the <strong>total number of prompts processed</strong>, ignoring the concurrency overhead that actual QPS imposes. Thus, in practice, the actual onset of cache eviction can be much earlier than what the simulation suggests. For example, in Scenario A, we see the onset of cache eviction happen around <strong>QPS = 14 with 10 pods (which roughly translates to QPS = 1.4 with 1 pod)</strong> experimentally, whereas the theoretical results below indicate a higher threshold. For other scenarios, the theoretical cache eviction QPS matches closely with what we observed experimentally.</p>
<hr>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="workload-a----high-system-prompt-overlap-no-system-cache-pressure">Workload A -- High System Prompt Overlap, No System Cache Pressure<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#workload-a----high-system-prompt-overlap-no-system-cache-pressure" class="hash-link" aria-label="Direct link to Workload A -- High System Prompt Overlap, No System Cache Pressure" title="Direct link to Workload A -- High System Prompt Overlap, No System Cache Pressure" translate="no">​</a></h4>
<p><strong>Inputs</strong></p>
<ul>
<li class=""><strong>Structure:</strong> Few groups, many users per group (6 groups, 1000 users per group).</li>
<li class=""><strong>Shape:</strong> Large system prompt relative to user prompt (1000:30 ratio).</li>
<li class=""><strong>Context:</strong> Multi-turn enabled.</li>
<li class=""><strong>Variance:</strong> 30% variance in user prompt and output length.</li>
</ul>
<p><strong>Interpretation</strong> Workload A represents a regime where shared system prefixes amortize extremely well, and user context grows slowly enough that cache pressure remains low.</p>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/predicted-latency/workload_a_cache_behavior.webp" alt="Workload A cache behavior" style="width:85%;height:auto"></div>
<hr>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="workload-b----moderate-system-prompt-overlap-no-system-cache-pressure">Workload B -- Moderate System Prompt Overlap, No System Cache Pressure<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#workload-b----moderate-system-prompt-overlap-no-system-cache-pressure" class="hash-link" aria-label="Direct link to Workload B -- Moderate System Prompt Overlap, No System Cache Pressure" title="Direct link to Workload B -- Moderate System Prompt Overlap, No System Cache Pressure" translate="no">​</a></h4>
<p><strong>Inputs</strong></p>
<ul>
<li class=""><strong>Structure:</strong> Few groups, many users per group (6 groups, 1000 users per group).</li>
<li class=""><strong>Shape:</strong> Smaller system prompt relative to user prompt (1000:3000 ratio).</li>
<li class=""><strong>Context:</strong> Multi-turn enabled.</li>
<li class=""><strong>Variance:</strong> 30% variance in user prompt and output length.</li>
</ul>
<p><strong>Interpretation</strong> Although this workload benefits from strong system reuse, longer user turns accelerate user context growth. Importantly, cache degradation is localized: user-block eviction preserves shared prefixes, preventing a global collapse in reuse.</p>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/predicted-latency/workload_b_cache_behavior.webp" alt="Workload B cache behavior" style="width:85%;height:auto"></div>
<hr>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="workload-c----low-system-prompt-overlap-high-system-cache-pressure">Workload C -- Low System Prompt Overlap, High System Cache Pressure<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#workload-c----low-system-prompt-overlap-high-system-cache-pressure" class="hash-link" aria-label="Direct link to Workload C -- Low System Prompt Overlap, High System Cache Pressure" title="Direct link to Workload C -- Low System Prompt Overlap, High System Cache Pressure" translate="no">​</a></h4>
<p><strong>Inputs</strong></p>
<ul>
<li class=""><strong>Structure:</strong> Many groups, few users per group (150 groups, 5 users per group).</li>
<li class=""><strong>Shape:</strong> Large system prompt relative to user prompt (6000:1200 ratio).</li>
<li class=""><strong>Context:</strong> Multi-turn enabled.</li>
<li class=""><strong>Variance:</strong> 30% variance in user prompt and output length.</li>
</ul>
<p><strong>Interpretation</strong> In Workload C, large system prompts fail to amortize due to low reuse, quickly consuming cache capacity. System-prefix eviction destroys shared reuse, leading to abrupt performance degradation across users.</p>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/predicted-latency/workload_c_cache_behavior.webp" alt="Workload C cache behavior" style="width:85%;height:auto"></div>
<hr>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="workload-d----low-system-prompt-overlap-low-system-cache-pressure">Workload D -- Low System Prompt Overlap, Low System Cache Pressure<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#workload-d----low-system-prompt-overlap-low-system-cache-pressure" class="hash-link" aria-label="Direct link to Workload D -- Low System Prompt Overlap, Low System Cache Pressure" title="Direct link to Workload D -- Low System Prompt Overlap, Low System Cache Pressure" translate="no">​</a></h4>
<p><strong>Inputs</strong></p>
<ul>
<li class=""><strong>Structure:</strong> Many groups, few users per group (150 groups, 5 users per group).</li>
<li class=""><strong>Shape:</strong> Moderate system prompt with very large user questions. (1000:6200 ratio)</li>
<li class=""><strong>Context:</strong> Multi-turn enabled.</li>
<li class=""><strong>Variance:</strong> 30% variance in user prompt and output length.</li>
</ul>
<p><strong>Interpretation</strong> In Workload D, cache overlap is low because of the large number of groups, but the small system prompts relative to user prompts mean that performance is dominated by per-request computation rather than cache churn.</p>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/predicted-latency/workload_d_cache_behavior.webp" alt="Workload D cache behavior" style="width:85%;height:auto"></div>
<hr>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="summary">Summary<a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms#summary" class="hash-link" aria-label="Direct link to Summary" title="Direct link to Summary" translate="no">​</a></h4>
<table><thead><tr><th style="text-align:left">Metric</th><th style="text-align:left">Workload A</th><th style="text-align:left">Workload B</th><th style="text-align:left">Workload C</th><th style="text-align:left">Workload D</th></tr></thead><tbody><tr><td style="text-align:left"><strong>Prompt Reuse</strong></td><td style="text-align:left">High (~80%)</td><td style="text-align:left">Moderate (~20%)</td><td style="text-align:left">Low (~10%)</td><td style="text-align:left">Very Low (5%)</td></tr><tr><td style="text-align:left"><strong>Eviction Onset QPS</strong></td><td style="text-align:left">7</td><td style="text-align:left">0.6</td><td style="text-align:left">0.4</td><td style="text-align:left">0.4</td></tr><tr><td style="text-align:left"><strong>Eviction Target</strong></td><td style="text-align:left">User History</td><td style="text-align:left">User History</td><td style="text-align:left">System Prompts</td><td style="text-align:left">Mix of user history and system prompts though largely dominated by user prompt</td></tr><tr><td style="text-align:left"><strong>Best Scheduling Strategy</strong></td><td style="text-align:left">Latency Aware Scheduling</td><td style="text-align:left">Latency Aware Scheduling</td><td style="text-align:left">Comparable</td><td style="text-align:left">Latency Aware</td></tr></tbody></table>]]></content>
        <author>
            <name>Kaushik Mitra</name>
            <uri>https://github.com/kaushikmitr</uri>
        </author>
        <author>
            <name>Benjamin Braun</name>
            <uri>https://github.com/BenjaminBraunDev</uri>
        </author>
        <author>
            <name>Abdullah Gharaibeh</name>
            <uri>https://github.com/ahg-g</uri>
        </author>
        <author>
            <name>Clayton Coleman</name>
            <email>claytoncoleman@google.com</email>
            <uri>https://github.com/smarterclayton</uri>
        </author>
        <category label="blog posts" term="blog posts"/>
        <category label="Scheduling" term="Scheduling"/>
        <category label="Inference" term="Inference"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Native KV Cache Offloading to Any Filesystem with llm-d]]></title>
        <id>https://llm-d.ai/blog/native-kv-cache-offloading-to-any-file-system-with-llm-d</id>
        <link href="https://llm-d.ai/blog/native-kv-cache-offloading-to-any-file-system-with-llm-d"/>
        <updated>2026-02-10T09:00:00.000Z</updated>
        <summary type="html"><![CDATA[llm-d's new filesystem backend offloads KV cache to shared storage, enabling cross-replica reuse and up to 16.8x faster TTFT — scaling inference throughput without GPU or CPU memory limits.]]></summary>
        <content type="html"><![CDATA[<p>llm-d is a distributed inference platform spanning multiple vLLM instances. KV cache hits are critical to achieving high inference throughput. Yet, in a distributed environment, cache hits do not occur across different nodes as the KV cache is local to each vLLM instance. In addition, this local cache is limited in size, further limiting KV data reuse. This blog presents a new way to offload KV cache to storage, tackling both aforementioned challenges – KV cache sharing and KV cache scale. llm-d's filesystem (FS) backend is a KV cache storage connector for vLLM that offloads KV blocks to shared storage based on vLLM's native Offloading Connector. While the llm-d FS backend can speed up serving of single requests (improve TTFT), its main goal is rather to preserve stable throughput and low latency at scale, as concurrency and context lengths grow. This is accomplished by significantly enlarging the cache space and enabling KV reuse across multiple replicas and nodes in llm-d.</p>
<p>While there are a number of existing solutions for KV cache offload to storage (e.g. LMCache or Dynamo KVBM), the new connector offers simplicity, can run with llm-d and vLLM as the only dependency, and exhibits improved performance over state-of-the-art shared storage connectors.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-importance-of-kv-cache-reuse">The Importance of KV-cache Reuse<a href="https://llm-d.ai/blog/native-kv-cache-offloading-to-any-file-system-with-llm-d#the-importance-of-kv-cache-reuse" class="hash-link" aria-label="Direct link to The Importance of KV-cache Reuse" title="Direct link to The Importance of KV-cache Reuse" translate="no">​</a></h2>
<p>In transformer-based inference, the prefill stage computes key and value (KV) tensors for the input tokens, which are then used when decoding output tokens. This stage is computationally intensive, especially for long input contexts. But once the KV tensors are available, they are kept in a KV cache and can be reused, avoiding the prefill computation entirely.</p>
<p>When the same prefix appears repeatedly - for example, shared system prompts, common documents, agentic loops, or multi-turn conversations - recomputing the KV tensors wastes significant compute. Reusing the KV cache allows the system to skip a large portion of the prefill work, reducing latency and improving overall throughput (a deeper dive on KV reuse use cases appears <a href="https://llm-d.ai/blog/kvcache-wins-you-can-see" target="_blank" rel="noopener noreferrer" class="">here</a>).</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-storage-offloading-is-needed">Why Storage Offloading is Needed<a href="https://llm-d.ai/blog/native-kv-cache-offloading-to-any-file-system-with-llm-d#why-storage-offloading-is-needed" class="hash-link" aria-label="Direct link to Why Storage Offloading is Needed" title="Direct link to Why Storage Offloading is Needed" translate="no">​</a></h2>
<p>vLLM already supports keeping KV-cache data in GPU High Bandwidth Memory (HBM) and, more recently, offloading KV to host memory. These approaches work well for a single server or small deployments, but they become limited at scale. GPU HBM is typically on the order of tens of gigabytes per GPU. CPU memory is usually larger but still on the same order of magnitude. For example, consider a high-end node hosting 8 GPUs with 2TB of DRAM. Divided by 8, the CPU DRAM per GPU is 250GB, while the HBM is around 80GB.</p>
<p>On the other hand, <strong>KV-cache takes up lots of space</strong>, especially with longer context lengths and higher concurrency from multiple users and requests. Even a medium-sized model, such as Llama-3.1-70B, requires 305 GB of KV-cache for one million tokens. Storage scales nearly infinitely compared to memory and offers a far lower cost per GB.</p>
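<p>The 305 GB figure follows from simple arithmetic on the model's public configuration, assuming a 16-bit KV cache. The sketch below (an illustration, not llm-d code) computes the per-token footprint as 2 tensors (K and V) × layers × KV heads × head dimension × bytes per element:</p>

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    # One K and one V tensor per layer; with grouped-query attention only
    # the kv_heads (not all attention heads) are stored in the cache.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Llama-3.1-70B public config: 80 layers, 8 KV heads, head dim 128, 16-bit KV cache
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
gib_per_million = per_token * 1_000_000 / 2**30
print(f"{per_token} bytes/token -> {gib_per_million:.0f} GiB per 1M tokens")
# 327680 bytes/token -> 305 GiB per 1M tokens
```

At roughly 320 KiB per token, even a modest number of long contexts quickly exceeds HBM and DRAM budgets, which is the core motivation for a storage tier.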
<p>In addition, shared storage is a simple way to share KV data across an entire cluster spanning multiple vLLM instances and physical nodes. New nodes added to a cluster can immediately benefit from existing KV-cache data without warming the cache from scratch. Shared persistent KV-cache also benefits post-peak scale-down, since localized KV-cache data is not lost. Finally, KV-cache persistence ensures cached data survives restarts and rescheduling events.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-we-built-llm-d-fs-backend">What We Built: llm-d FS Backend<a href="https://llm-d.ai/blog/native-kv-cache-offloading-to-any-file-system-with-llm-d#what-we-built-llm-d-fs-backend" class="hash-link" aria-label="Direct link to What We Built: llm-d FS Backend" title="Direct link to What We Built: llm-d FS Backend" translate="no">​</a></h2>
<p>The llm-d FS backend is a storage backend that plugs into vLLM's Offloading Connector. It stores KV blocks as files on a shared filesystem and loads them back on demand. It uses the filesystem directory as the index of what KV values are in the storage, and as such is persistent and sharable across all nodes connected to the filesystem.</p>
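<p>Using the directory as the index means a block's identity is derived from the token prefix it covers, so any node that mounts the share can locate a block by recomputing the same name. The sketch below illustrates the idea with a chained prefix hash; the naming scheme and function names are hypothetical, not the connector's actual on-disk layout:</p>

```python
import hashlib
from pathlib import Path

def block_paths(root: Path, token_ids: list[int], block_size: int = 256) -> list[Path]:
    """Map each complete block of `block_size` tokens to a content-addressed file.

    The hash is chained over all preceding tokens, so a block's name encodes its
    full prefix: two requests sharing a prefix resolve to the same files.
    """
    paths, h = [], hashlib.sha256()
    full_blocks = len(token_ids) // block_size
    for i in range(full_blocks):
        chunk = token_ids[i * block_size:(i + 1) * block_size]
        h.update(b"".join(t.to_bytes(4, "little") for t in chunk))
        digest = h.hexdigest()
        paths.append(root / digest[:2] / f"{digest}.kv")  # fan out by hash prefix
    return paths
```

Because the mapping is deterministic, a replica checking for a cached prefix only needs a filesystem lookup, with no separate index service; this is what makes the cache persistent and shareable across nodes.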
<p>The following are some key properties of our solution:</p>
<ul>
<li class=""><strong>Filesystem agnostic</strong>: Relies on standard POSIX file operations, so it works with any standard filesystem.</li>
<li class=""><strong>KV sharing across instances and nodes:</strong> Multiple vLLM servers can reuse cached prefixes by accessing the same shared path.</li>
<li class=""><strong>Persistence across restarts or failures:</strong> KV data can survive pod restarts, rescheduling, and node failures (depending on storage durability).</li>
<li class=""><strong>Enterprise storage integration:</strong> Can leverage mature storage systems with existing durability, monitoring, and access control.</li>
</ul>
<p>In addition, the following performance-related design choices were made:</p>
<ul>
<li class="">
<p><strong>Fully asynchronous I/O:</strong> By using vLLM's Offloading Connector, KV reads and writes run without blocking the main path. Details about the vLLM offloading connector can be found <a href="https://blog.vllm.ai/2026/01/08/kv-offloading-connector.html" target="_blank" rel="noopener noreferrer" class="">here</a>.</p>
</li>
<li class="">
<p><strong>High throughput via parallelism:</strong> I/O operations are parallelized across worker threads to increase bandwidth and reduce tail latency.</p>
</li>
<li class="">
<p><strong>Minimal interference with GPU computations</strong>: Default transfers use GPU DMA, reducing interference with compute kernels.</p>
</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="how-to-use-it">How to Use it<a href="https://llm-d.ai/blog/native-kv-cache-offloading-to-any-file-system-with-llm-d#how-to-use-it" class="hash-link" aria-label="Direct link to How to Use it" title="Direct link to How to Use it" translate="no">​</a></h3>
<p>Using the FS offloading connector is simple: it requires a <code>pip install</code> and a directory path to the storage being used. Other optional tunable parameters are the storage block size (in tokens) and the number of worker threads.</p>
<p>Detailed instructions can be found in the llm-d well-lit path <a href="https://github.com/llm-d/llm-d/tree/main/guides/tiered-prefix-cache/storage/README.md" target="_blank" rel="noopener noreferrer" class="">guide</a>.</p>
<p>Note that while the results presented in this blog were obtained with IBM Storage Scale, the connector was also tested with other storage options, including local storage (an NVMe drive with a filesystem mounted on it) and CephFS. In general, it works with any storage that supports a filesystem API or has a filesystem mounted on it.</p>
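<p>For orientation, a deployment might look like the sketch below. <code>--kv-transfer-config</code>, <code>OffloadingConnector</code>, and <code>kv_connector_extra_config</code> are real vLLM concepts, but the package name and the extra-config keys shown here are illustrative assumptions only; the exact values are in the guide linked above.</p>

```shell
# Illustrative sketch -- package name and extra-config keys are assumptions;
# see the llm-d tiered-prefix-cache guide for the exact values.
pip install llm-d-kv-offload-fs  # hypothetical package name

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --kv-transfer-config '{
      "kv_connector": "OffloadingConnector",
      "kv_role": "kv_both",
      "kv_connector_extra_config": {
          "path": "/mnt/shared-kv",
          "block_size": 256,
          "num_workers": 8
      }
  }'
```

The only hard requirement is that <code>/mnt/shared-kv</code> (or whatever path is configured) is the same mounted filesystem on every node that should share the cache.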
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="results-and-benchmarks">Results and Benchmarks<a href="https://llm-d.ai/blog/native-kv-cache-offloading-to-any-file-system-with-llm-d#results-and-benchmarks" class="hash-link" aria-label="Direct link to Results and Benchmarks" title="Direct link to Results and Benchmarks" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="single-request-speed-up">Single request speed-up<a href="https://llm-d.ai/blog/native-kv-cache-offloading-to-any-file-system-with-llm-d#single-request-speed-up" class="hash-link" aria-label="Direct link to Single request speed-up" title="Direct link to Single request speed-up" translate="no">​</a></h3>
<p>Depending on storage speed, loading KV data from storage can dramatically reduce TTFT compared to prefilling. We start by examining the benefit that offloading of KV data to storage can have on the speed of a single request, while emphasizing that the main benefit of offloading KV to storage (supporting high throughput at scale) will be shown later on.</p>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/fs-connector/StorageBlogimage1.webp" alt="Single Request Latency" style="width:75%;height:auto"><p style="font-size:0.9em;margin-top:8px"><em>Figure 1: Single-request KV-cache load across tiers</em></p></div>
<p>In Figure 1, we measure the TTFT for a single request as the number of tokens increases, using various offloading mediums. This shows the evaluation for a Llama-3.1-70B model on a system with 4x NVIDIA H100 GPUs and an IBM Storage Scale filesystem. We compare KV loading from GPU memory, CPU memory, and shared storage and measure the speedup achieved vs. prefill (a KV cache miss).<br>
<!-- -->As the number of tokens increases, KV loading becomes increasingly efficient compared to recomputing the same tokens during prefill, achieving a speedup of up to 16.8X on long prompts. However, for single-request workloads, GPU and CPU caching remain the faster options. This is because storage, for the most part, is slower than DRAM (with some exceptions).  Moreover, transferring data between storage and GPU HBM usually involves an additional hop via CPU DRAM, which adds to its overhead.  This is a fundamental reality that reinforces why <strong>CPU offloading and smart routing</strong> based on prefix cache hits remain valuable optimizations in the llm-d system, even with shared storage available. Storage is not intended to replace these mechanisms, but to complement them. Its main value lies in providing far greater scalability, significantly lower cost per gigabyte, and persistent KV storage.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="scalability-test">Scalability test<a href="https://llm-d.ai/blog/native-kv-cache-offloading-to-any-file-system-with-llm-d#scalability-test" class="hash-link" aria-label="Direct link to Scalability test" title="Direct link to Scalability test" translate="no">​</a></h3>
<p>In order to exemplify the benefit of storage for scalability, we start by examining a somewhat artificial workload that consists of multiple users, where each user has their own distinct system prompt (we will consider a more realistic workload in the following section). We ask how many concurrent users a single vLLM node can support without a significant drop in throughput due to cache misses.</p>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/fs-connector/StorageBlogimage2.webp" alt="Full Hit Workload" style="width:75%;height:auto"><p style="font-size:0.9em;margin-top:8px"><em>Figure 2: Multiple-request KV-cache load across tiers</em></p></div>
<p>In Figure 2, we evaluate KV-cache loading under varying concurrency levels by issuing 16K token requests from a growing number of users. In this test, all prompts have previously appeared, and the decode is of a single token. We chose this extreme workload just to emphasize the point, and we will show a more realistic workload next.  Again, we used a single node running Llama-3.1-70B on a system with 4x NVIDIA H100 GPUs and an IBM Storage Scale filesystem.</p>
<p>We see that only a small number of user prompts can fit in GPU memory. With such a small number of users, performance is extremely fast, but once we grow beyond it, performance drops significantly as essentially all requests undergo prefill. With CPU offloading, this drop-off is postponed: the system can handle a higher (yet still limited) number of users with only a small drop-off. Storage-backed KV caching, on the other hand, shines once we scale up. While the speed offered by storage does not match that of the GPU or CPU, it allows the system to sustain throughput as the working set grows nearly without bound.</p>
<p>This experiment highlights the key benefit of storage offloading: <strong>it prevents performance collapse when workloads outgrow GPU or CPU cache capacity</strong>. The gains come from higher cache hit ratios at scale and from shared access across replicas. Its value lies in maintaining throughput and latency stability as workloads exceed the capacity of GPU or CPU memory, <strong>supporting efficient scaling of the system.</strong> This also highlights that storage performance is key to achieving high throughput on KV cache hits.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="scalability-in-realistic-workloads">Scalability in realistic workloads<a href="https://llm-d.ai/blog/native-kv-cache-offloading-to-any-file-system-with-llm-d#scalability-in-realistic-workloads" class="hash-link" aria-label="Direct link to Scalability in realistic workloads" title="Direct link to Scalability in realistic workloads" translate="no">​</a></h3>
<p>Finally, we evaluate a more realistic workload that mixes KV loading, prefill, and decode operations. We use the llm-d benchmarking framework to run inference-perf with a shared-prefix synthetic workload. Each query consists of a previously seen user-specific system prompt of 2000 tokens and a 256-token question, with 256 tokens decoded in response. The queries are issued at a rate of 40 QPS from a pool of users of variable size. The more users in the system, the larger the working set, and hence the more we expect storage to shine. This setup helps us study how the different caching options behave with a growing number of users. It runs llm-d with two decode nodes executing a Llama-3.1-8B model on a system with 2x NVIDIA H100 GPUs and a cloud-based storage offering approximately 10GB/s of I/O throughput.</p>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/fs-connector/StorageBlogimage3.webp" alt="Real Life Workload" style="width:75%;height:auto"><p style="font-size:0.9em;margin-top:8px"><em>Figure 3: Multiple requests with mixed workload (load + prefill + decode)</em></p></div>
<p>The results, in Figure 3, show similar behavior to the previous test. By extending KV-cache capacity using shared storage, llm-d can reuse KV data more effectively across requests and replicas, maintaining an improved overall throughput and TTFT as the system scales. An additional important observation is that even though the storage in this test was not top-end, the asynchronous use of storage frees up precious GPU cycles for prefill and decode operations and hence achieves higher throughput.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="summary-and-next-steps">Summary and Next Steps<a href="https://llm-d.ai/blog/native-kv-cache-offloading-to-any-file-system-with-llm-d#summary-and-next-steps" class="hash-link" aria-label="Direct link to Summary and Next Steps" title="Direct link to Summary and Next Steps" translate="no">​</a></h2>
<p>Storage offloading is an important and essential capability for scalable AI inference platforms such as llm-d. It increases effective KV-cache capacity, enables cross-replica reuse, and makes llm-d clusters more elastic by allowing them to scale efficiently with growing request volume and user concurrency. The FS backend keeps the integration native to vLLM and llm-d, using an asynchronous design and high-throughput transfers built around parallelism.</p>
<p>The initial FS backend is the first llm-d native solution for storage KV offloading. As next steps, we are working on new features, including tiered storage offload (as a second tier to CPU DRAM), integration with NIXL backends, offloading to object storage, support for GPU Direct Storage (GDS), and more.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="acknowledgement"><em>Acknowledgement</em><a href="https://llm-d.ai/blog/native-kv-cache-offloading-to-any-file-system-with-llm-d#acknowledgement" class="hash-link" aria-label="Direct link to acknowledgement" title="Direct link to acknowledgement" translate="no">​</a></h4>
<p>For performance testing, we used IBM Storage Scale, an enterprise-level storage system offering high performance, scalability, and reliability.</p>]]></content>
        <author>
            <name>Kfir Toledo</name>
            <uri>https://www.linkedin.com/in/kfir-toledo-394a8811a/</uri>
        </author>
        <author>
            <name>Danny Harnik</name>
            <uri>https://www.linkedin.com/in/danny-harnik-19a95436/</uri>
        </author>
        <author>
            <name>Effi Ofer</name>
            <uri>https://www.linkedin.com/in/effi-ofer-91a261b0/</uri>
        </author>
        <author>
            <name>Or Ozeri</name>
            <uri>https://www.linkedin.com/in/or-ozeri-a942859a/</uri>
        </author>
        <author>
            <name>Guy Margalit</name>
            <uri>https://www.linkedin.com/in/guymargalit/</uri>
        </author>
        <category label="blog posts" term="blog posts"/>
        <category label="KV Cache" term="KV Cache"/>
        <category label="Storage" term="Storage"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[llm-d 0.5: Sustaining Performance at Scale]]></title>
        <id>https://llm-d.ai/blog/llm-d-v0.5-sustaining-performance-at-scale</id>
        <link href="https://llm-d.ai/blog/llm-d-v0.5-sustaining-performance-at-scale"/>
        <updated>2026-02-04T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[llm-d v0.5 introduces hierarchical KV-cache offloading, LoRA-aware scheduling, UCCL networking, and scale-to-zero autoscaling for sustained inference performance at scale.]]></summary>
        <content type="html"><![CDATA[<p>In our previous release (<a class="" href="https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators">v0.4</a>), we focused on improving the end-to-end latency of production inference, introducing speculative decoding and extending prefill/decode disaggregation across a broader set of accelerator architectures. That work established llm-d’s ability to deliver state-of-the-art latency along the critical serving path. Sustaining low latency increasingly depended on how KV-cache pressure is handled once GPU memory is saturated, whether cached state can be reused across replicas instead of being repeatedly rebuilt, and how requests are routed when workloads mix adapters, models, and availability requirements.</p>
<p>With v0.5, llm-d expands its focus from peak performance to the operational rigor required to sustain performance at scale. This release prioritizes reproducibility, resilience, and cost efficiency, with concrete improvements across the following areas:</p>
<ol>
<li class=""><strong>Developer Experience and reproducibility:</strong> We have simplified the benchmarking workflow with dedicated, in-guide benchmark support, allowing users to validate each “well-lit path” with a single command.</li>
<li class=""><strong>Hierarchical KV Offloading:</strong> A new storage architecture decouples cache capacity from GPU memory through native CPU and filesystem tiers.</li>
<li class=""><strong>Advanced Scheduling:</strong> Cache-aware routing now supports LoRA adapters and active-active high availability.</li>
<li class=""><strong>Resilient Networking:</strong> A new transport backend (UCCL) improves stability in congested networks.</li>
<li class=""><strong>Autoscaling Updates:</strong> We have introduced scale-to-zero capabilities for cost-efficient intermittent workloads.</li>
</ol>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="updated-developer-experience"><strong>Updated developer experience</strong><a href="https://llm-d.ai/blog/llm-d-v0.5-sustaining-performance-at-scale#updated-developer-experience" class="hash-link" aria-label="Direct link to updated-developer-experience" title="Direct link to updated-developer-experience" translate="no">​</a></h2>
<p>A core challenge in distributed inference is the variance between reported benchmark numbers and realized production performance. In v0.5, we address this by adopting the "Research Paper Principle": the requirement that every performance claim be accompanied by a reproducible, version-controlled configuration.</p>
<ul>
<li class=""><strong>Simplified benchmarking (one guide, one benchmark):</strong> We have moved away from generic "one-size-fits-all" scripts. v0.5 introduces in-guide <a href="https://github.com/llm-d/llm-d/blob/main/guides/benchmark/README.md" target="_blank" rel="noopener noreferrer" class="">benchmark support</a> and a standalone benchmark script. Each "well-lit path" (e.g., <a href="https://github.com/llm-d/llm-d/blob/main/guides/inference-scheduling/README.md#benchmarking" target="_blank" rel="noopener noreferrer" class="">Inference-Scheduling</a>, <a href="https://github.com/llm-d/llm-d/blob/main/guides/wide-ep-lws/README.md#benchmarking" target="_blank" rel="noopener noreferrer" class="">Wide EP</a>, <a href="https://github.com/llm-d/llm-d/blob/main/guides/pd-disaggregation/README.md#benchmarking" target="_blank" rel="noopener noreferrer" class="">Disaggregated</a>) now has a defined benchmark configuration that reproduces the benchmark results shown in the guides.</li>
<li class=""><strong>Use cases and personalities:</strong> The new harness supports distinct "personalities" or use cases. While the “in-guide” benchmarking is targeted at “feature developers” who quickly and efficiently wanted to check if their code changes had a positive effect, taking into account a baseline, the llm-d-benchmark has a toolset for a variety of other users. For “config tuners”, aiming to carry out parameter exploration sweeps, there is support for <a href="https://github.com/llm-d/llm-d-benchmark/blob/main/docs/doe.md" target="_blank" rel="noopener noreferrer" class="">experiments</a> crafted with the purpose of finding optimal configurations and reference architecture. An llm-d service owner can track performance regressions on his production stack by selecting a mix of <a href="https://github.com/llm-d/llm-d-benchmark/blob/main/docs/run.md#workloads" target="_blank" rel="noopener noreferrer" class="">workloads</a> to be periodically run.</li>
<li class=""><strong>Parameter sweep exploration:</strong> Capacity planning often relies on trial-and-error. The new <a href="https://github.com/llm-d/llm-d-benchmark/tree/main/config_explorer" target="_blank" rel="noopener noreferrer" class="">Configuration Explorer</a> allows analysts, researchers, or customers to perform their own parameter sweep exploration, simulating activation memory usage against specific hardware constraints. This allows this category of users to validate hardware feasibility and generate memory-safe configurations prior to provisioning physical resources.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="performance-update"><strong>Performance update</strong><a href="https://llm-d.ai/blog/llm-d-v0.5-sustaining-performance-at-scale#performance-update" class="hash-link" aria-label="Direct link to performance-update" title="Direct link to performance-update" translate="no">​</a></h2>
<p>With the overhauled benchmarking harness, we have re-validated our core architectural patterns. Below are the updated performance numbers for v0.5, representing the "well-lit paths" enabled by this release. These results are fully reproducible using the configurations provided in the corresponding guides.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="throughput-oriented-wide-ep-on-nvidia-b200"><strong>Throughput-oriented: Wide-EP on Nvidia B200</strong><a href="https://llm-d.ai/blog/llm-d-v0.5-sustaining-performance-at-scale#throughput-oriented-wide-ep-on-nvidia-b200" class="hash-link" aria-label="Direct link to throughput-oriented-wide-ep-on-nvidia-b200" title="Direct link to throughput-oriented-wide-ep-on-nvidia-b200" translate="no">​</a></h3>
<p>For batch-intensive workloads, throughput is the primary metric. We validated a Wide-EP topology on a cluster of Nvidia B200 GPUs, specifically targeting the "Batch Processing" use case where latency constraints are relaxed in favor of massive token generation.</p>
<ul>
<li class=""><strong>Topology:</strong> 16x Prefill GPUs / 16x Decode GPUs (EP=16, DP=16, TP=1)</li>
<li class=""><strong>Workload:</strong> Random dataset, 1k/1k input/output length</li>
<li class=""><strong>Result:</strong> ~50k output tokens/sec total throughput (~3.1k output tokens/sec per decode GPU).</li>
</ul>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/assets/images/wideep-on-b200-cf15bcd5de616e8f9638c4be6d794f32.webp" alt="Wide-EP on Nvidia B200" style="width:90%;height:auto"></div>
<p><em>Figure 1: Wide-EP throughput on NVIDIA B200 for batch-oriented inference. Total output throughput scales with decode parallelism in a disaggregated 16×16 prefill/decode topology, demonstrating efficient utilization of B200 GPUs for batch-intensive workloads where latency constraints are relaxed.</em></p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="latency-oriented-inference-scheduling"><strong>Latency-oriented: Inference Scheduling</strong><a href="https://llm-d.ai/blog/llm-d-v0.5-sustaining-performance-at-scale#latency-oriented-inference-scheduling" class="hash-link" aria-label="Direct link to latency-oriented-inference-scheduling" title="Direct link to latency-oriented-inference-scheduling" translate="no">​</a></h3>
<p>For workloads with shared context and compute reuse opportunities, intelligent inference scheduling maximizes cache reuse and increases throughput. We validated our inference scheduling guides on a cluster deploying Qwen/Qwen3-32B, specifically targeting the "Multi-tenant SaaS" use case where shared customer contexts enable significant computational savings through prefix caching.</p>
<ul>
<li class=""><strong>Topology:</strong> 8x vLLM pods / 16x NVIDIA H100 GPUs (TP=2)</li>
<li class=""><strong>Workload:</strong> Shared prefix synthetic, 150 groups × 5 prompts, 6k/1.2k/1k system/question/output length</li>
<li class=""><strong>Result:</strong> 4.5-11k output tokens/sec throughput, P50 TTFT 136–157ms, up to 109% higher throughput and 99% lower TTFT vs baseline Kubernetes service</li>
</ul>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/assets/images/tftp-vs-qps-cec4771e4366444802380277adc7bb09.webp" alt="TFTP vs QPS" style="width:90%;height:auto"></div>
<p><em>Figure 2: TTFT and throughput vs QPS on Qwen3-32B (8×vLLM pods, 16×NVIDIA H100). llm-d inference scheduling maintains near-zero TTFT and scales to ~120k tok/s, while baseline Kubernetes service degrades rapidly under load.</em></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="optimized-offloading-hierarchical-kv-caching"><strong>Optimized offloading: Hierarchical KV-caching</strong><a href="https://llm-d.ai/blog/llm-d-v0.5-sustaining-performance-at-scale#optimized-offloading-hierarchical-kv-caching" class="hash-link" aria-label="Direct link to optimized-offloading-hierarchical-kv-caching" title="Direct link to optimized-offloading-hierarchical-kv-caching" translate="no">​</a></h2>
<p>In transformer-based inference, the KV-cache is a critical resource for reducing latency, yet it is strictly bounded by the finite capacity of GPU HBM and local CPU DRAM. Even in high-end nodes, local memory creates a hard ceiling on the number of concurrent contexts a system can serve before reverting to expensive prefill computations.</p>
<p>In v0.5, we introduce the <code>llm-d FS backend</code>, a storage connector that plugs into vLLM’s native offloading interface. This architecture effectively establishes a three-tier memory hierarchy (GPU, CPU, and disk) to address the twin challenges of scale and sharing.</p>
<ul>
<li class=""><strong>Decoupling capacity from compute:</strong> By offloading KV blocks to a shared file system, the system decouples cache capacity from local node memory. This allows the inference engine to sustain high throughput even as the active working set grows significantly beyond the aggregate RAM of the cluster.</li>
<li class=""><strong>Cross-replica reuse:</strong> Unlike local CPU caches, which are isolated to a single instance, a shared file system acts as a persistent, global store of KV states. New nodes added to the cluster can "hydrate" their cache immediately from the shared tier, bypassing the warm-up phase typically required for new replicas.</li>
<li class=""><strong>Asynchronous design:</strong> To minimize interference with the decoding loop, the backend utilizes fully asynchronous I/O and parallel worker threads. This ensures that the latency cost of fetching blocks from disk does not block the main compute path.</li>
</ul>
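<p>The resulting behavior is that of a classic multi-level cache. The sketch below (hypothetical names, not llm-d's actual classes) shows the lookup cascade: check GPU HBM first, then CPU DRAM, then the shared filesystem, promoting hits upward so hot blocks migrate to fast memory:</p>

```python
class TieredKVCache:
    """Illustrative three-tier KV-block cache: GPU HBM -> CPU DRAM -> shared FS."""

    def __init__(self):
        # Dicts stand in for the real tiers (HBM pages, pinned host memory, files).
        self.gpu, self.cpu, self.fs = {}, {}, {}

    def get(self, block_hash):
        for tier in (self.gpu, self.cpu, self.fs):
            if block_hash in tier:
                block = tier[block_hash]
                self.gpu[block_hash] = block  # promote on hit (eviction logic elided)
                return block
        return None  # full miss: the block must be recomputed by prefill

    def put(self, block_hash, block):
        self.gpu[block_hash] = block  # serve future hits from fast memory
        self.fs[block_hash] = block   # write through so other replicas can reuse it
```

In the real system the writes are asynchronous and each tier has its own capacity and eviction policy; the cascade order is the essential idea.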
<p>Our internal benchmarks illustrate that the primary value of this architecture is operational stability under load. As shown in Figure 3, standard GPU-only deployments experience a sharp performance collapse once HBM is saturated. In contrast, the storage-backed configuration creates a "performance floor," sustaining throughput as user concurrency increases well beyond local memory limits.</p>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/assets/images/kvcache-throughput-8aba710cd220e3c399c572e1da3a833a.webp" alt="KV-cache throughput" style="width:75%;height:auto"></div>
<p><em>Figure 3: KV-cache throughput under growing user concurrency on Llama-3.1-70B (4×NVIDIA H100 with IBM Storage Scale). 16K token requests with previously-seen prompts. GPU-only collapses once HBM is saturated; storage offload sustains ~185k tok/s as concurrency scales (13.9x improvement at 250 users).</em></p>
<p>For high-performance computing environments, we have also validated Tiered Prefix Caching on Lustre, demonstrating how parallel file systems can provide persistence and high throughput while minimizing the latency overhead of this tertiary storage tier.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="advancements-in-scheduling"><strong>Advancements in scheduling</strong><a href="https://llm-d.ai/blog/llm-d-v0.5-sustaining-performance-at-scale#advancements-in-scheduling" class="hash-link" aria-label="Direct link to advancements-in-scheduling" title="Direct link to advancements-in-scheduling" translate="no">​</a></h2>
<p>In v0.5, the scheduler has evolved to handle more complex routing scenarios, specifically targeting multi-adapter workloads and high-availability topologies.</p>
<ul>
<li class=""><strong>Unified tokenization pipeline:</strong> We have standardized the preprocessing stack by reusing the native vLLM Python module for all tokenization and rendering paths. This lays the groundwork for disaggregated tokenization by converging on a single, comprehensive rendering pipeline built on vLLM’s evolving Renderer API.</li>
<li class=""><strong>Active-active HA with dynamic discovery:</strong> We have introduced a new subscription manager for KV-Events that moves away from static connections. Instead of relying on deployment-time configuration, the scheduler now dynamically discovers vLLM pods and manages ZeroMQ (ZMQ) subscriptions based on the pod lifecycle. This enables robust active-active high availability across multiple scheduler replicas, ensuring that cache state is tracked accurately even as replicas scale up or down.</li>
<li class=""><strong>LoRA-precise prefix caching:</strong> The precise scheduling path now fully supports LoRA adapters. This allows the scheduler to route requests based on the specific cache locality of LoRA adapters, preventing the "thundering herd" problem where every replica attempts to load every adapter and maximizing cache efficiency for multi-tenant workloads.</li>
</ul>
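<p>The idea behind LoRA-precise prefix caching can be illustrated with a toy model of the scoring path. This sketch is an assumption about the mechanism, not llm-d's implementation: seeding the block-hash chain with the adapter ID keeps each adapter's cache entries distinct, so the scorer only credits a replica for blocks it holds for that specific adapter.</p>

```python
import hashlib

# Illustrative only: chained block hashes seeded with the LoRA adapter ID,
# so identical prompts under different adapters produce disjoint cache keys.
def prefix_block_hashes(adapter: str, tokens: list[int], block_size: int = 16) -> list[str]:
    hashes, parent = [], adapter  # the adapter ID seeds the hash chain
    for i in range(0, len(tokens) - len(tokens) % block_size, block_size):
        chunk = ",".join(map(str, tokens[i:i + block_size]))
        parent = hashlib.sha256(f"{parent}|{chunk}".encode()).hexdigest()
        hashes.append(parent)
    return hashes

def score_replica(replica_blocks: set[str], adapter: str, tokens: list[int]) -> int:
    # Count the longest run of leading blocks already resident on the replica.
    score = 0
    for h in prefix_block_hashes(adapter, tokens):
        if h not in replica_blocks:
            break
        score += 1
    return score
```

<p>A replica that is warm for <code>adapter-a</code> scores zero for the same prompt under <code>adapter-b</code>, which is what steers each adapter's traffic to the replicas that already hold its cache.</p>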
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/assets/images/lora-aware-scheduling-516752378a62e673105e4a7ae15e5520.webp" alt="LoRA-precise prefix caching" style="width:75%;height:auto"></div>
<p><em>Figure 4: LoRA-aware prefix-cache scheduling improves throughput. Prefix-cache aware scheduling mitigates LoRA overhead by minimizing effective compute and avoiding redundant LoRA kernel execution, delivering higher throughput than standard K8s round-robin load-balancing.</em></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="resilient-networking-nixl---uccl-backend"><strong>Resilient networking: NIXL - UCCL backend</strong><a href="https://llm-d.ai/blog/llm-d-v0.5-sustaining-performance-at-scale#resilient-networking-nixl---uccl-backend" class="hash-link" aria-label="Direct link to resilient-networking-nixl---uccl-backend" title="Direct link to resilient-networking-nixl---uccl-backend" translate="no">​</a></h2>
<p>In disaggregated architectures, tail latency is governed by the efficiency of KV-cache transport between prefill and decode nodes. In v0.5, we have contributed a UCCL (Unified Collective Communication Library) backend and integrated it into the NIXL networking layer.</p>
<p>UCCL provides a unified abstraction over vendor-specific collective primitives (NCCL, RCCL, MCCL). Crucially, for point-to-point (P2P) operations (the dominant traffic pattern in context migration), UCCL implements a host-resident software transport stack. By managing transport logic on the CPU rather than relying solely on hardware offload, the backend enables fine-grained flow splitting and adaptive congestion control strategies. This architecture currently supports both native RDMA and GPUDirect TCP-X transport.</p>
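<p>To make the host-driven approach concrete, here is an illustrative sketch of proportional flow splitting (not UCCL's actual API): a chunked payload is allocated across paths according to each path's observed throughput, so a congested path naturally receives fewer chunks instead of stalling the whole transfer.</p>

```python
# Illustrative only: assign transfer chunks across network paths in
# proportion to each path's currently observed throughput.
def split_chunks(num_chunks: int, path_throughput: dict[str, float]) -> dict[str, int]:
    total = sum(path_throughput.values())
    alloc = {p: int(num_chunks * bw / total) for p, bw in path_throughput.items()}
    # Hand leftover chunks (from integer rounding) to the fastest path.
    leftover = num_chunks - sum(alloc.values())
    fastest = max(path_throughput, key=path_throughput.get)
    alloc[fastest] += leftover
    return alloc
```

<p>An adaptive transport would re-measure throughput between rounds, shifting load away from paths under cross-traffic congestion.</p>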
<p>To quantify the impact of this architecture, we evaluated a Llama-3.1-8B deployment on an OpenShift cluster with 200 Gb/s interconnects, measuring the transfer latency of a 4 GB KV-cache payload.</p>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/assets/images/pdtransfer-under-congestion-497e61f8af883b67a1badb3562a029a0.webp" alt="PD transfer under congestion" style="width:75%;height:auto"></div>
<p><em>Figure 5: PD transfer latency growth under congestion. Comparing baseline vs. congested states for UCX and UCCL transport.</em></p>
<p>While baseline latency remained comparable between transports (~360ms), the architectures diverged significantly under network stress. When subjecting the cluster to heavy cross-traffic congestion:</p>
<ul>
<li class=""><strong>UCX backend:</strong> Latency degraded from 362ms to 424ms (+17.1%).</li>
<li class=""><strong>UCCL backend:</strong> Latency degraded from 359ms to 384ms (+7.1%).</li>
</ul>
<p>The UCCL backend demonstrated <strong>2.4x greater resilience</strong> to network contention (7.1% degradation vs. 17.1%). These results validate the use of host-driven congestion control for ensuring consistent tail latency in shared, multi-tenant production environments.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="updates-to-autoscaling"><strong>Updates to autoscaling</strong><a href="https://llm-d.ai/blog/llm-d-v0.5-sustaining-performance-at-scale#updates-to-autoscaling" class="hash-link" aria-label="Direct link to updates-to-autoscaling" title="Direct link to updates-to-autoscaling" translate="no">​</a></h2>
<p>In release v0.4, we introduced the Workload Variant Autoscaler (WVA) as an experimental feature. In v0.5, we have iterated on this foundation to support more aggressive cost-saving strategies for intermittent workloads.</p>
<ul>
<li class=""><strong>Scale-to-zero and from-zero:</strong> We have implemented support for scaling inference pools down to zero replicas during idle periods. Unlike simple timeout-based shutdowns, this implementation uses a specialized activator component to handle the "cold start" sequence, ensuring that incoming requests trigger provisioning without being dropped. This capability is critical for cost-sensitive environments, such as development clusters or internal RAG applications that do not require 24/7 GPU allocation.</li>
<li class=""><strong>Saturation-based scaling:</strong> The control loop has been refined to better detect saturation points based on queue depth and KV-cache pressure, allowing for more responsive scaling decisions before latency SLOs are violated.</li>
</ul>
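<p>The cold-start handling described above can be sketched as a small buffering component. All names here are assumptions for illustration, not the actual activator: the first request against a zero-replica pool triggers provisioning, subsequent requests are buffered rather than dropped, and the buffer is drained once a replica reports ready.</p>

```python
import queue
import threading

# Speculative sketch of a scale-from-zero activator (names are assumptions).
class Activator:
    def __init__(self, provision):
        self.provision = provision      # callback that starts a replica
        self.buffer = queue.Queue()
        self.ready = False
        self.lock = threading.Lock()

    def handle(self, request, forward):
        with self.lock:
            if not self.ready:
                if self.buffer.empty():
                    self.provision()    # first request triggers the cold start
                self.buffer.put(request)
                return "buffered"
        forward(request)
        return "forwarded"

    def on_replica_ready(self, forward):
        with self.lock:
            self.ready = True
            while not self.buffer.empty():
                forward(self.buffer.get())  # drain without dropping anything
```
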
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="community-and-ecosystem"><strong>Community and ecosystem</strong><a href="https://llm-d.ai/blog/llm-d-v0.5-sustaining-performance-at-scale#community-and-ecosystem" class="hash-link" aria-label="Direct link to community-and-ecosystem" title="Direct link to community-and-ecosystem" translate="no">​</a></h2>
<p>This progress wouldn't be possible without close collaboration with our broader community. llm-d exists within a rich ecosystem of open-source projects, and we are proud to contribute our findings and code back upstream. A few contribution highlights from this release:</p>
<ul>
<li class=""><strong>vLLM KV-Connector:</strong> We worked closely with the vLLM maintainers to define and implement the <a href="https://blog.vllm.ai/2026/01/08/kv-offloading-connector.html" target="_blank" rel="noopener noreferrer" class="">KV Offloading Connector</a>, ensuring that the storage hierarchy we built for llm-d rests on standard, upstream interfaces.</li>
<li class=""><strong>NIXL Integration:</strong> Our UCCL backend has been <a href="https://github.com/ai-dynamo/nixl/pull/895" target="_blank" rel="noopener noreferrer" class="">merged into the NIXL 0.9 release</a>, enabling the wider community to benefit from the host-driven congestion control strategies we developed for disaggregated serving.</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="broader-hardware-ecosystem-integration">Broader hardware ecosystem integration<a href="https://llm-d.ai/blog/llm-d-v0.5-sustaining-performance-at-scale#broader-hardware-ecosystem-integration" class="hash-link" aria-label="Direct link to Broader hardware ecosystem integration" title="Direct link to Broader hardware ecosystem integration" translate="no">​</a></h3>
<p>We continue to validate llm-d across a broad and growing hardware ecosystem, reinforcing its role as a hardware-agnostic inference control plane. Additional results will be shared as validations complete.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-is-next"><strong>What Is Next?</strong><a href="https://llm-d.ai/blog/llm-d-v0.5-sustaining-performance-at-scale#what-is-next" class="hash-link" aria-label="Direct link to what-is-next" title="Direct link to what-is-next" translate="no">​</a></h2>
<p>The v0.5 release establishes a foundation for reproducible research and production stability. We invite the community to validate these findings using the new benchmarking tools available on our GitHub repository.</p>
<p>Looking ahead, our focus shifts to four key areas:</p>
<ul>
<li class=""><strong>Scheduling:</strong> Moving toward predictive, latency-aware routing (using TTFT/TPOT targets rather than queue depth) and native batch inference support.</li>
<li class=""><strong>Offloading:</strong> Developing proactive state management APIs to "pin" critical context blocks and exploring semantic caching policies.</li>
<li class=""><strong>Autoscaling:</strong> Introducing pluggable optimizers that leverage queuing theory and ML for predictive scaling, alongside direct SLO integration.</li>
<li class=""><strong>Observability:</strong> Implementing end-to-end distributed tracing, from the gateway through the scheduler to the engine, to expose granular latency bottlenecks in disaggregated architectures.</li>
</ul>
<p>We will publish a feature roadmap for v0.6, outlining the next set of priorities informed by ongoing community discussions and feedback.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="upcoming-events"><strong>Upcoming Events</strong><a href="https://llm-d.ai/blog/llm-d-v0.5-sustaining-performance-at-scale#upcoming-events" class="hash-link" aria-label="Direct link to upcoming-events" title="Direct link to upcoming-events" translate="no">​</a></h2>
<p>Please stay tuned for updates on upcoming community events. You can follow these and other events where you can find the llm-d contributors on our <a href="https://llm-d.ai/docs/community/events" target="_blank" rel="noopener noreferrer" class="">community events page</a>.</p>
<p>Follow the project updates on <a href="https://twitter.com/_llm_d_" target="_blank" rel="noopener noreferrer" class="">Twitter/X</a>, <a href="https://bsky.app/profile/llm-d.ai" target="_blank" rel="noopener noreferrer" class="">Bluesky</a>, and <a href="https://www.linkedin.com/company/llm-d" target="_blank" rel="noopener noreferrer" class="">LinkedIn</a>. You can also find recordings of our community calls, as well as demos of current and upcoming features, by subscribing to the <a href="https://www.youtube.com/@llm-d-project" target="_blank" rel="noopener noreferrer" class="">llm-d YouTube channel</a>.</p>
<p>Contribute on <a href="https://github.com/llm-d" target="_blank" rel="noopener noreferrer" class="">GitHub</a>, <a href="https://red.ht/llm-d-public-calendar" target="_blank" rel="noopener noreferrer" class="">join our community calls</a> (Wed 12:30pm ET), join the SIGs and come build with us!</p>]]></content>
        <author>
            <name>Robert Shaw</name>
            <email>robshaw@redhat.com</email>
            <uri>https://github.com/robertgshaw2-redhat</uri>
        </author>
        <author>
            <name>Clayton Coleman</name>
            <email>claytoncoleman@google.com</email>
            <uri>https://github.com/smarterclayton</uri>
        </author>
        <author>
            <name>Carlos Costa</name>
            <email>chcost@us.ibm.com</email>
            <uri>https://github.com/chcost</uri>
        </author>
        <category label="Releases" term="Releases"/>
        <category label="Announcements" term="Announcements"/>
        <category label="llm-d release news" term="llm-d release news"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[llm-d 0.4: Achieve SOTA Performance Across Accelerators]]></title>
        <id>https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators</id>
        <link href="https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators"/>
        <updated>2025-12-02T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[llm-d v0.4 delivers 50% lower latency for MoE models via speculative decoding, expands TPU and XPU support, and adds prefix cache offloading for faster TTFT.]]></summary>
        <content type="html"><![CDATA[<p>llm-d’s mission is to provide the fastest time to SOTA inference performance across any accelerator and cloud. In our <a href="https://llm-d.ai/blog/llm-d-v0.3-expanded-hardware-faster-perf-and-igw-ga" target="_blank" rel="noopener noreferrer" class="">0.3 release</a> we enabled wide expert parallelism for large mixture-of-expert models to provide extremely high output token throughput - a key enabler for reinforcement learning - and we added preliminary support for multiple non-GPU accelerator families.</p>
<p>This release brings the complement to expert parallelism throughput: <strong>improving end-to-end request latency of production serving</strong>. We reduce DeepSeek per-token latency by up to 50% with speculative decoding and vLLM optimizations for latency-critical workloads. We add dynamic disaggregated serving support for Google TPU and Intel XPU to further reduce time-to-first-token latency when traffic is unpredictable, while our new well-lit path for prefix cache offloading helps you leverage CPU memory and high-performance remote storage to increase hit rates and reduce tail latency. For users with multiple model deployments, our workload autoscaler preview takes real-time server capacity and traffic into account to reduce the amount of time a model deployment spends queuing requests - lessening the operational toil of running multiple models over constrained accelerator capacity.</p>
<p>These OSS inference stack optimizations, surfaced through our <a href="https://github.com/llm-d/llm-d/blob/main/guides/README.md#well-lit-path-guides" target="_blank" rel="noopener noreferrer" class="">well-lit paths</a>, ensure you reach SOTA latency on frontier OSS models in real world scenarios.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="sota-low-latency-serving-on-large-moe">SOTA Low Latency Serving on Large MoE<a href="https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators#sota-low-latency-serving-on-large-moe" class="hash-link" aria-label="Direct link to SOTA Low Latency Serving on Large MoE" title="Direct link to SOTA Low Latency Serving on Large MoE" translate="no">​</a></h2>
<p>While our previous releases focused on throughput, v0.4 delivers the features necessary to guarantee ultra-low latency, particularly for demanding wide EP MoE models.</p>
<p>We integrated critical low-latency optimizations for MoE models in vLLM and observed an over 40% reduction in per-token latency for DeepSeek V3.1 on H200 GPUs:</p>
<ul>
<li class=""><strong>Speculative decoding</strong> leverages underutilized compute at low concurrency to run a smaller “draft” model that guesses the next tokens - correct guesses are cheaper to validate than to generate, reducing per-output-token latency. We also tested and improved a number of MoE-specific optimizations, including DeepSeek’s native MTP support. We launched the <a href="https://github.com/vllm-project/speculators" target="_blank" rel="noopener noreferrer" class="">speculators OSS project</a> to broaden support for speculative decoding among popular models and are continuously adding new architectures to our <a href="https://huggingface.co/collections/RedHatAI/speculator-models" target="_blank" rel="noopener noreferrer" class="">Hugging Face Hub</a>.</li>
<li class=""><strong>Async Scheduling:</strong> integration of asynchronous scheduling in vLLM allows for more efficient overlapping of compute and CPU scheduling operations, ensuring that the system can sustain high request rates while maintaining low latencies.</li>
<li class=""><strong>Block-FP8 kernel optimizations</strong> that fuse key elementwise operations, overlap shared-expert computation with grouped expert routing, and select more efficient kernels.</li>
</ul>
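<p>As a rough illustration of the draft-and-verify idea, here is a simplified greedy variant (not vLLM's implementation, which batches all verification into a single target-model forward pass): the draft model proposes a few tokens, and the target keeps the longest agreeing prefix, so several tokens can be emitted per target step.</p>

```python
# Simplified greedy draft-and-verify loop. target_next / draft_next stand in
# for single-token greedy decoding by the target and draft models.
def speculative_step(target_next, draft_next, prefix: list[int], k: int = 4) -> list[int]:
    draft, proposals = list(prefix), []
    for _ in range(k):
        t = draft_next(draft)              # cheap guess from the draft model
        proposals.append(t)
        draft.append(t)
    accepted = []
    for t in proposals:
        expected = target_next(prefix + accepted)  # what the target would emit
        if t != expected:
            accepted.append(expected)      # replace first mismatch, stop
            break
        accepted.append(t)                 # correct guess: accept for free
    return accepted
```

<p>When the draft agrees with the target, all <code>k</code> tokens land in one step; even a rejected proposal still yields one valid target token, so latency never regresses below ordinary decoding in this sketch.</p>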
<p>The recommendations and tradeoffs in tuning for latency will become a well-lit path in the next release of llm-d.</p>
<p><img decoding="async" loading="lazy" alt="Improved DeepSeek 3.1 per output token latency in llm-d 0.4" src="https://llm-d.ai/assets/images/deepseek-per-output-token-latency-cc4530593b1e33643bc15685eac59950.webp" width="1200" height="742" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="expanding-hardware-choice">Expanding Hardware Choice<a href="https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators#expanding-hardware-choice" class="hash-link" aria-label="Direct link to Expanding Hardware Choice" title="Direct link to Expanding Hardware Choice" translate="no">​</a></h2>
<p>The community has <a href="https://github.com/llm-d/llm-d/tree/main/docs/accelerators#support" target="_blank" rel="noopener noreferrer" class="">continued to integrate support across a diverse range of accelerators</a> and <a href="https://github.com/llm-d/llm-d/tree/main/docs/infra-providers" target="_blank" rel="noopener noreferrer" class="">providers</a>, making llm-d the definitive hardware-agnostic control plane:</p>
<ul>
<li class=""><strong>Google TPU Disaggregation over DCN:</strong> We integrate the latest vLLM + TPU architecture with our <a href="https://github.com/llm-d/llm-d/blob/main/guides/pd-disaggregation/README.tpu.md" target="_blank" rel="noopener noreferrer" class="">dynamic disaggregation recipe for llama3-70b</a>, allowing high performance KV cache transfer over datacenter TCP and allowing on-demand scaling of prefill instances in response to prefill load.</li>
<li class=""><strong>Intel XPU Integration:</strong> <a href="https://github.com/llm-d/llm-d/tree/main/docs/accelerators#intel-xpu" target="_blank" rel="noopener noreferrer" class="">Initial support</a> and verification for the intelligent scheduling and disaggregation well-lit paths.</li>
<li class=""><strong>AWS EFA Support:</strong> Support for AWS Elastic Fabric Adapters (EFA) has been added to the llm-d CUDA image in the form of NIXL libfabric libraries to enable low-latency communication on AWS accelerator clusters with <a href="https://github.com/llm-d/llm-d/issues/522" target="_blank" rel="noopener noreferrer" class="">full EFA support in a future release</a>.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="expanding-the-well-lit-paths">Expanding the well-lit paths<a href="https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators#expanding-the-well-lit-paths" class="hash-link" aria-label="Direct link to Expanding the well-lit paths" title="Direct link to Expanding the well-lit paths" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="new-well-lit-path-for-prefix-cache-offloading"><strong>New well-lit path for prefix cache offloading</strong><a href="https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators#new-well-lit-path-for-prefix-cache-offloading" class="hash-link" aria-label="Direct link to new-well-lit-path-for-prefix-cache-offloading" title="Direct link to new-well-lit-path-for-prefix-cache-offloading" translate="no">​</a></h3>
<p>For long-context or high-concurrency multi-turn workloads, GPU memory is the bottleneck. To address this, we are recommending <a href="https://github.com/llm-d/llm-d/blob/main/guides/tiered-prefix-cache/README.md" target="_blank" rel="noopener noreferrer" class=""><strong>tiered prefix-cache offloading</strong></a> as a standard practice in llm-d v0.4.</p>
<p>We have formalized <a href="https://github.com/vllm-project/vllm/issues/19854" target="_blank" rel="noopener noreferrer" class="">vLLM-native CPU offloading</a> and <a href="https://lmcache.ai/" target="_blank" rel="noopener noreferrer" class="">LMCache</a> connectors under a new well-lit path that allows the system to transparently utilize host CPU RAM - which is often underutilized - as a secondary storage tier for the KV cache. Instead of dropping context when VRAM fills up, llm-d allows you to swap blocks to the CPU and retrieve them when needed.</p>
<p>In our <a href="https://github.com/llm-d/llm-d/blob/main/guides/tiered-prefix-cache/cpu/README.md#benchmark" target="_blank" rel="noopener noreferrer" class="">user guide benchmarking</a>, enabling CPU offloading when KV cache working set exceeded the available HBM capacity yielded a <strong>25% reduction in mean Time To First Token (TTFT)</strong> and a <strong>21% increase in total throughput</strong>.</p>
<table><thead><tr><th style="text-align:left">HBM &lt; KVCache &lt; HBM + CPU RAM</th><th style="text-align:left">Mean TTFT (second)</th><th style="text-align:left">P90 TTFT (second)</th><th style="text-align:left">Mean E2E Latency (second)</th><th style="text-align:left">P90 E2E Latency (second)</th><th style="text-align:left">Overall Throughput (token per second)</th></tr></thead><tbody><tr><td style="text-align:left"><strong>Baseline vLLM</strong></td><td style="text-align:left">9.0</td><td style="text-align:left">20.9</td><td style="text-align:left">37.8</td><td style="text-align:left">49.7</td><td style="text-align:left">38534.8</td></tr><tr><td style="text-align:left"><strong>vLLM + CPU offloading 100GB</strong></td><td style="text-align:left">6.7 (-25.6%)</td><td style="text-align:left">20.2 (-3.3%)</td><td style="text-align:left">30.9 (-18.3%)</td><td style="text-align:left">44.2 (-11.1%)</td><td style="text-align:left">46751.0 (+21.3%)</td></tr><tr><td style="text-align:left"><strong>vLLM + LMCache CPU offloading 100GB</strong></td><td style="text-align:left">6.5 (-27.8%)</td><td style="text-align:left">18.8 (-10.0%)</td><td style="text-align:left">30.8 (-18.5%)</td><td style="text-align:left">43.0 (-13.5%)</td><td style="text-align:left">46910.6 (+21.7%)</td></tr></tbody></table>
<p><a href="https://github.com/llm-d/llm-d/blob/main/guides/tiered-prefix-cache/cpu/README.md#high-cache-performance" target="_blank" rel="noopener noreferrer" class=""><em>Table: High Performance</em></a> <em>compares the performance of baseline vLLM with vLLM using the CPU offloading connector when the KV-cache size is larger than the available HBM.</em></p>
<p>By treating CPU memory as an active extension of the GPU, operators can run larger models or higher concurrency on existing hardware, <strong>significantly improving token-per-dollar value</strong>. This sets the stage for deeper tiering strategies, including upcoming guides on distributed storage offloading.</p>
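<p>A toy two-tier cache illustrates the swap-instead-of-drop behavior. This is a sketch under simplifying assumptions (plain Python dicts standing in for HBM and CPU RAM), not the vLLM connector API:</p>

```python
from collections import OrderedDict

# Toy two-tier prefix cache: a small LRU "GPU" tier backed by a larger "CPU"
# tier. Evicted GPU blocks are swapped to CPU instead of dropped, and CPU
# hits are promoted back into the GPU tier on access.
class TieredKVCache:
    def __init__(self, gpu_capacity: int):
        self.gpu = OrderedDict()  # insertion order doubles as LRU order
        self.cpu = {}
        self.gpu_capacity = gpu_capacity

    def put(self, block_hash: str, block):
        self.gpu[block_hash] = block
        self.gpu.move_to_end(block_hash)
        while len(self.gpu) > self.gpu_capacity:
            evicted_hash, evicted = self.gpu.popitem(last=False)
            self.cpu[evicted_hash] = evicted   # offload instead of drop

    def get(self, block_hash: str):
        if block_hash in self.gpu:
            self.gpu.move_to_end(block_hash)
            return self.gpu[block_hash]
        if block_hash in self.cpu:
            self.put(block_hash, self.cpu.pop(block_hash))  # promote
            return self.gpu[block_hash]
        return None  # true miss: the prefix must be recomputed in prefill
```

<p>Only a true miss in both tiers forces a prefill recomputation, which is why TTFT improves once the working set exceeds HBM.</p>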
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="enhanced-intelligent-scheduling"><strong>Enhanced intelligent scheduling</strong><a href="https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators#enhanced-intelligent-scheduling" class="hash-link" aria-label="Direct link to enhanced-intelligent-scheduling" title="Direct link to enhanced-intelligent-scheduling" translate="no">​</a></h3>
<p>To complement the new offloading capabilities, we have updated our <strong>intelligent inference scheduling</strong> well-lit path to fully leverage the tiered memory architecture.</p>
<ul>
<li class=""><strong>Tier-aware precision:</strong> Prefix-cache affinity now accounts for the cost dynamics of the new cache tier. We implemented granular KV-cache tracking with <strong>configurable weighted scoring</strong> (e.g., weighting GPU hits higher than CPU hits). This allows the router to calculate the most efficient path for KV retrieval, balancing the high value of GPU-resident data against the available capacity of the CPU tier.</li>
</ul>
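<p>The weighted scoring idea reduces to a simple calculation. Weights and function names below are illustrative assumptions, not llm-d's configuration schema:</p>

```python
# GPU-resident block hits are worth more than CPU-tier hits because
# retrieval from HBM is cheaper than from host RAM. Weights are illustrative.
def cache_affinity_score(gpu_hit_blocks: int, cpu_hit_blocks: int,
                         w_gpu: float = 1.0, w_cpu: float = 0.5) -> float:
    return w_gpu * gpu_hit_blocks + w_cpu * cpu_hit_blocks

def pick_replica(replicas: dict[str, tuple[int, int]]) -> str:
    # replicas maps name -> (gpu hits, cpu hits) for the incoming prompt
    return max(replicas, key=lambda r: cache_affinity_score(*replicas[r]))
```

<p>With these example weights, a replica holding one GPU-resident block and four CPU-resident blocks (score 3.0) beats one holding two GPU-resident blocks (score 2.0): a large CPU-tier hit can outweigh a small GPU-tier hit.</p>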
<p>Drawing on extensive benchmarking data, we also refined our <strong>load-oriented distribution</strong> scorers, highlighting:</p>
<ul>
<li class=""><strong>Saturation stability:</strong> We introduced the <code>no-hit-lru-scorer</code>. This scorer intelligently distributes "cold" requests - the seeds that form new prefix-cache trails - to prevent hotspots during cache formation. This eliminates oscillating wait-queues, allowing the cluster to <strong>maintain stability at higher concurrency on the same hardware</strong> for workloads with prefix reuse opportunities.</li>
</ul>
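<p>One plausible reading of the <code>no-hit-lru-scorer</code> (a sketch of the concept, not the actual implementation) is a least-recently-used rotation over replicas for cold requests, spreading new cache trails evenly instead of piling them onto one hotspot:</p>

```python
from collections import OrderedDict

# Conceptual sketch: a request with no prefix-cache hit anywhere goes to the
# replica that least recently received such a "cold" request.
class NoHitLRUScorer:
    def __init__(self, replicas: list[str]):
        self.order = OrderedDict((r, None) for r in replicas)

    def pick_cold_target(self) -> str:
        replica, _ = self.order.popitem(last=False)  # least recently used
        self.order[replica] = None                   # now most recently used
        return replica
```
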
<p>For workloads with <strong>prefix-reuse opportunities</strong>, this improved placement logic <strong>cut P99 Time To First Token (TTFT) in half</strong> (18.3s -&gt; 8s) while maintaining stability at peak concurrency.</p>
<p>Benchmark comparison: Running 8x Qwen-32B on 16 NVIDIA H100 GPUs at 50 QPS with <strong>5500:1000 ISL/OSL requests</strong> sharing <strong>150 unique prefixes</strong>. The v0.4 scheduler (bottom) eliminates the wait-queue oscillations seen in v0.3, resulting in a 50% reduction in P99 TTFT.</p>
<p>High queueing in llm-d v0.3 for cold prefix cache requests:<br>
<img decoding="async" loading="lazy" alt="Cold cache request queueing in 0.3" src="https://llm-d.ai/assets/images/cold-request-0.3-3fdad4ac82ae7d8c6cee51f69415f1be.webp" width="1834" height="817" class="img_ev3q"></p>
<p>Significant reduction in queueing in llm-d v0.4 for cold prefix cache requests:<br>
<img decoding="async" loading="lazy" alt="Lower queueing of requests in 0.4 for cold cache requests" src="https://llm-d.ai/assets/images/cold-request-0.4-e8a298bb2eabc181efa8a6e2ceb5d003.webp" width="1999" height="616" class="img_ev3q"></p>
<p>Best practices and benchmarking for configuring the appropriate scoring for prefix-cache dependent workloads is planned for our next release.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="benchmarking-and-validation"><strong>Benchmarking and validation</strong><a href="https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators#benchmarking-and-validation" class="hash-link" aria-label="Direct link to benchmarking-and-validation" title="Direct link to benchmarking-and-validation" translate="no">​</a></h3>
<p>To help the community understand the trade-offs in distributed inference, we are making it easier to benchmark the well-lit paths. <a href="https://github.com/llm-d/llm-d-benchmark" target="_blank" rel="noopener noreferrer" class="">llm-d-benchmark</a> provides all the tooling needed for performance characterization using a "design of experiments" methodology, ensuring results are reproducible and standardized.</p>
<ul>
<li class=""><strong>Comprehensive automation:</strong> We now fully automate the setup and execution of the llm-d well-lit paths in multiple scenarios using supported harnesses like inference-perf, guidellm, vllm-benchmark, and inferenceMAX.</li>
<li class=""><strong>Flexible execution:</strong> new guides covering automated stack setup, running against existing stacks, and interactive benchmarking.</li>
<li class=""><strong>Data-driven insights:</strong> Benchmark data is collected in a <a href="https://github.com/llm-d/llm-d-benchmark/blob/main/docs/benchmark_report.md" target="_blank" rel="noopener noreferrer" class="">standardized report format</a>. Our "Configuration Explorer" allows you to parse this data to visualize Pareto curves and find optimal deployment parameters for your specific SLOs.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="improving-efficiency">Improving efficiency<a href="https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators#improving-efficiency" class="hash-link" aria-label="Direct link to Improving efficiency" title="Direct link to Improving efficiency" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="introducing-the-workload-variant-autoscaler-experimental"><strong>Introducing the Workload Variant Autoscaler (experimental)</strong><a href="https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators#introducing-the-workload-variant-autoscaler-experimental" class="hash-link" aria-label="Direct link to introducing-the-workload-variant-autoscaler-experimental" title="Direct link to introducing-the-workload-variant-autoscaler-experimental" translate="no">​</a></h3>
<p>The <a href="https://github.com/llm-d/llm-d/blob/main/guides/workload-autoscaling/README.md" target="_blank" rel="noopener noreferrer" class="">Workload Variant Autoscaler</a> uses a reactive saturation-based optimizer that operates on per-replica metrics such as queue length and KV-cache utilization to identify saturated replicas and calculate the spare capacity on non-saturated replicas. It relies on similar signals as the inference scheduler with a wider safety margin to avoid oscillation. This approach is robust for workloads where performance parameters are uncertain and allows scaling on observed load to prevent queue overflow.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="saturation-based-scaling"><strong>Saturation-based scaling</strong><a href="https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators#saturation-based-scaling" class="hash-link" aria-label="Direct link to saturation-based-scaling" title="Direct link to saturation-based-scaling" translate="no">​</a></h4>
<p>Saturation-based scaling is the default autoscaling approach for the v0.4 release and is recommended for most workloads, including hybrid state space models (HSSM), MoE, and diffusion architectures.</p>
<ul>
<li class=""><strong>How it works:</strong> Scales reactively based on observed metrics (arrival rate vs. capacity).</li>
<li class=""><strong>Benefit:</strong> Does not require complex performance parameters and tuning, making it robust and less sensitive to architecture-specific behavior differences.</li>
<li class=""><strong>Limitation:</strong> It is reactive only: it scales when the average observed spare capacity across non-saturated replicas falls below statically configured thresholds, and it may be less cost-efficient than accurate predictive scaling. Scaling happens in increments of one replica at a time.</li>
</ul>
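<p>The saturation logic above can be sketched as a small decision function. Thresholds and field names are illustrative assumptions, not WVA's configuration:</p>

```python
# Reactive saturation-based scaling sketch: a replica counts as saturated
# when queue depth or KV-cache utilization crosses a limit; the pool grows
# by one replica when average spare capacity on healthy replicas is too low.
def desired_replicas(replicas: list[dict], queue_limit: int = 8,
                     kv_limit: float = 0.9, spare_floor: float = 0.2) -> int:
    def saturated(r):
        return r["queue"] >= queue_limit or r["kv_util"] >= kv_limit
    healthy = [r for r in replicas if not saturated(r)]
    if not healthy:
        return len(replicas) + 1  # everything saturated: add a replica
    spare = sum(1.0 - r["kv_util"] for r in healthy) / len(healthy)
    # Scaling happens in increments of one replica, as described above.
    return len(replicas) + 1 if spare < spare_floor else len(replicas)
```
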
<p>Future releases will enhance WVA with an SLO-driven, proactive, adaptive optimizer that responds not only to queries per second (QPS) but also to evolving traffic patterns, striking a balance between ease of use and performance. Following this release, we are planning a detailed blog on autoscaling, including performance evaluations and a roadmap.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="making-it-easier-to-operate">Making it easier to operate<a href="https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators#making-it-easier-to-operate" class="hash-link" aria-label="Direct link to Making it easier to operate" title="Direct link to Making it easier to operate" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="production-readiness-for-chat"><strong>Production readiness for chat</strong><a href="https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators#production-readiness-for-chat" class="hash-link" aria-label="Direct link to production-readiness-for-chat" title="Direct link to production-readiness-for-chat" translate="no">​</a></h3>
<ul>
<li class=""><strong>Chat completion API support:</strong> Full support for the OpenAI-compatible <strong><code>/v1/chat/completions</code></strong> API has been merged and tested, simplifying the operational deployment of all conversational and agentic workloads.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="breaking-changes">Breaking changes<a href="https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators#breaking-changes" class="hash-link" aria-label="Direct link to Breaking changes" title="Direct link to Breaking changes" translate="no">​</a></h2>
<p>As part of our refactoring to align with best practices in the Kubernetes Gateway API Inference Extension (GAIE), we are announcing a critical breaking change for all deployments using our Helm charts: The method for setting configuration flags for the Inference Scheduler has changed from an array format to a map format. If you are upgrading to v0.4, please review the latest Helm chart <code>values.yaml</code> files and update your overrides immediately. This change is necessary to decouple configuration from code and enable future advanced scheduling features.</p>
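<p>Schematically, the change looks like the following. The flag name shown is hypothetical - consult the v0.4 chart <code>values.yaml</code> files for the real keys and structure:</p>

```yaml
# Before (v0.3): scheduler flags as an array of strings (hypothetical flag)
# inferenceScheduler:
#   flags:
#     - "--prefix-cache-weight=2"

# After (v0.4): scheduler flags as a map of flag name to value
inferenceScheduler:
  flags:
    prefix-cache-weight: "2"
```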
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-is-next">What Is Next?<a href="https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators#what-is-next" class="hash-link" aria-label="Direct link to What Is Next?" title="Direct link to What Is Next?" translate="no">​</a></h2>
<p>0.5 will continue adding accelerator support across a wider range of scenarios and models, and will also increase our focus on tuning for specific use cases, like multi-turn chat or reinforcement learning. If there is a use case important to you, please suggest it in our <a href="https://github.com/llm-d/llm-d/issues/517" target="_blank" rel="noopener noreferrer" class="">0.5 roadmap issue</a> or raise it in a community meeting!</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="community-and-upcoming-events">Community and Upcoming Events<a href="https://llm-d.ai/blog/llm-d-v0.4-achieve-sota-inference-across-accelerators#community-and-upcoming-events" class="hash-link" aria-label="Direct link to Community and Upcoming Events" title="Direct link to Community and Upcoming Events" translate="no">​</a></h2>
<p>Driven by our community, llm-d continues to evolve. Release v0.4 incorporates feedback and code from a diverse range of contributors, cementing our status as the definitive hardware-agnostic control plane. We are incredibly grateful to everyone involved.</p>
<p>Please stay tuned for updates on upcoming community events. You can follow these and other events on our <a href="https://llm-d.ai/docs/community/events" target="_blank" rel="noopener noreferrer" class="">community events page</a>.</p>
<p>Contribute on <a href="https://github.com/llm-d" target="_blank" rel="noopener noreferrer" class="">GitHub</a>, <a href="https://red.ht/llm-d-public-calendar" target="_blank" rel="noopener noreferrer" class="">join our community calls</a> (Wed 12:30pm ET), join the SIGs and come build with us!</p>]]></content>
        <author>
            <name>Robert Shaw</name>
            <email>robshaw@redhat.com</email>
            <uri>https://github.com/robertgshaw2-redhat</uri>
        </author>
        <author>
            <name>Clayton Coleman</name>
            <email>claytoncoleman@google.com</email>
            <uri>https://github.com/smarterclayton</uri>
        </author>
        <author>
            <name>Carlos Costa</name>
            <email>chcost@us.ibm.com</email>
            <uri>https://github.com/chcost</uri>
        </author>
        <category label="Releases" term="Releases"/>
        <category label="Announcements" term="Announcements"/>
        <category label="llm-d release news" term="llm-d release news"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[llm-d 0.3: Wider Well-Lit Paths for Scalable Inference]]></title>
        <id>https://llm-d.ai/blog/llm-d-v0.3-expanded-hardware-faster-perf-and-igw-ga</id>
        <link href="https://llm-d.ai/blog/llm-d-v0.3-expanded-hardware-faster-perf-and-igw-ga"/>
        <updated>2025-10-10T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[llm-d v0.3 adds Google TPU and Intel XPU support, wide expert parallelism at 2.2k tokens/sec per GPU, predicted latency scheduling, and Inference Gateway GA.]]></summary>
<content type="html"><![CDATA[<p>In our <a href="https://llm-d.ai/blog/llm-d-v0.2-our-first-well-lit-paths" target="_blank" rel="noopener noreferrer" class="">0.2 release</a>, we introduced the first <em>well-lit paths</em>, tested blueprints for scaling inference on Kubernetes. With our <a href="https://github.com/llm-d/llm-d/releases/tag/v0.3.0" target="_blank" rel="noopener noreferrer" class="">0.3 release</a>, we double down on the mission: to provide a fast path to deploying high-performance, hardware-agnostic, easy-to-operationalize inference at scale.</p>
<p>This release delivers:</p>
<ul>
<li class="">Expanded hardware support, now including Google TPU and Intel XPU support</li>
<li class="">TCP and RDMA over RoCE validated for disaggregation</li>
<li class="">A predicted-latency-based balancing preview that improves P90 latency by up to 3x in long-prefill workloads</li>
<li class="">Wide expert parallel (EP) scaling to 2.2k tokens per second per H200 GPU</li>
<li class="">The GA release of the Inference Gateway (IGW v1.0)</li>
</ul>
<p>Taken together, these results redefine the operating envelope for inference. llm-d enables clusters to run hotter before scaling out, extract more value from each GPU, and still meet strict latency objectives. The result is a control plane built not just for speed, but for predictable, cost-efficient scale.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="commit-to-the-mission"><strong>Commit to the Mission</strong><a href="https://llm-d.ai/blog/llm-d-v0.3-expanded-hardware-faster-perf-and-igw-ga#commit-to-the-mission" class="hash-link" aria-label="Direct link to commit-to-the-mission" title="Direct link to commit-to-the-mission" translate="no">​</a></h2>
<p>llm-d’s mission is to be a hardware-agnostic, upstream-aligned control plane for inference. 0.3 takes a decisive step forward, not only by expanding platform coverage, but also by making it easier to adopt and experiment with.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="broader-hardware-support"><strong>Broader Hardware Support</strong><a href="https://llm-d.ai/blog/llm-d-v0.3-expanded-hardware-faster-perf-and-igw-ga#broader-hardware-support" class="hash-link" aria-label="Direct link to broader-hardware-support" title="Direct link to broader-hardware-support" translate="no">​</a></h3>
<p>llm-d support now spans an expanded array of accelerator families: NVIDIA, Google TPU, and Intel XPU. TPU artifacts are tested in cloud environments, and Intel XPU builds run through CI runners alongside NVIDIA.</p>
<p>Disaggregation support is now more portable. With 0.3, TCP and RDMA over RoCE have been validated as transports for prefill/decode separation. This extends disaggregation beyond specialized InfiniBand fabrics, opening the path to reproducible deployments in mainstream cloud networking environments.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="simplified-user-experience-across-a-variety-of-infrastructure"><strong>Simplified User Experience Across A Variety of Infrastructure</strong><a href="https://llm-d.ai/blog/llm-d-v0.3-expanded-hardware-faster-perf-and-igw-ga#simplified-user-experience-across-a-variety-of-infrastructure" class="hash-link" aria-label="Direct link to simplified-user-experience-across-a-variety-of-infrastructure" title="Direct link to simplified-user-experience-across-a-variety-of-infrastructure" translate="no">​</a></h3>
<p>Serving large language models is complex - our documentation and configuration should be simple. Quickstarts have been streamlined and renamed to guides, with fewer options and more context around the key decisions you need to make. They now live in the main repository and are treated as living documents alongside our growing documentation for common scenarios. Since llm-d is about exposing key tradeoffs and useful patterns, we’ve split out the key prerequisites for each guide - cluster configuration, client setup, and gateway choice - into their own sections, and replaced our all-in-one installer scripts with better step-by-step instructions.</p>
<p>As more cluster providers integrate with llm-d, we’ve expanded our infrastructure documentation with provider-specific troubleshooting, configuration, and testing. This release adds documentation and steps for CoreWeave, DigitalOcean, Google Kubernetes Engine, and OpenShift.</p>
<p>Guides now include curated Inference Gateway installs and static manifests for clarity, with overlays available for benchmarking sweeps. RBAC patterns were refactored toward namespace scope for smoother multi-tenancy.</p>
<p><strong>Why it matters:</strong> With 0.3, experimenting with intelligent scheduling or disaggregation is as simple as running a documented guide. The control plane is more transparent, reproducible, and extensible, independently of the platform you are running.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="brighten-the-well-lit-paths"><strong>Brighten the well-lit paths</strong><a href="https://llm-d.ai/blog/llm-d-v0.3-expanded-hardware-faster-perf-and-igw-ga#brighten-the-well-lit-paths" class="hash-link" aria-label="Direct link to brighten-the-well-lit-paths" title="Direct link to brighten-the-well-lit-paths" translate="no">​</a></h2>
<p>The well-lit paths we introduced in 0.2 - wide-EP, intelligent scheduling, disaggregation - are now sharper and more reproducible.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="wide-ep-performance"><strong>Wide-EP Performance</strong><a href="https://llm-d.ai/blog/llm-d-v0.3-expanded-hardware-faster-perf-and-igw-ga#wide-ep-performance" class="hash-link" aria-label="Direct link to wide-ep-performance" title="Direct link to wide-ep-performance" translate="no">​</a></h3>
<p>The wide-EP path, which parallelizes across experts to maximize throughput, has reached <strong>2.2k tokens/s per GPU</strong> in community benchmarks on H200 clusters.</p>
<p><img decoding="async" loading="lazy" alt="prefill throughput" src="https://llm-d.ai/assets/images/wide_ep_prefill-e200554d4ed651fdbf8d388d8f2bf509.webp" width="1200" height="742" class="img_ev3q"><img decoding="async" loading="lazy" alt="decode throughput" src="https://llm-d.ai/assets/images/wide_ep_decode-43e7ec6a9221e955939b046728200894.webp" width="1200" height="742" class="img_ev3q"></p>
<p>The result reflects sustained throughput in multi-node deployments with production-like settings. Earlier results hovered around 1.5k tokens/s per GPU; the jump to 2.2k confirms that kernel optimizations (silu-mul-quant fusion, Cutlass QKV kernels, TP attention bugfixes) and the addition of Dual Batch Overlap (DBO) for decode are delivering compound gains.</p>
<p><strong>Why it matters:</strong> At these throughput levels operators can consolidate workloads onto fewer nodes, reduce the number of replicas needed for a given QPS, and lower token-per-dollar cost.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="inference-gateway-v10-ga"><strong>Inference Gateway v1.0 GA</strong><a href="https://llm-d.ai/blog/llm-d-v0.3-expanded-hardware-faster-perf-and-igw-ga#inference-gateway-v10-ga" class="hash-link" aria-label="Direct link to inference-gateway-v10-ga" title="Direct link to inference-gateway-v10-ga" translate="no">​</a></h3>
<p>In v0.2, we brought a number of new features to the inference scheduling space. With v0.3, the Inference Gateway (IGW) reaches v1.0 GA, making intelligent routing a stable primitive in llm-d.</p>
<p><img decoding="async" loading="lazy" alt="Different strategies for balancing" src="https://llm-d.ai/assets/images/igw_ga-fc8828394a38eca5c8b2c7f9d2825a6d.webp" width="1979" height="455" class="img_ev3q"><br>
<em>Figure: A tri-panel of TTFT, TPoT and throughput measured through progressively rising QPS rates, comparing load-, prefix-, and prefix+load-aware scheduling</em>. <em>Check out our blog <a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#performance-results" target="_blank" rel="noopener noreferrer" class="">KV-Cache Wins You Can See</a> for more details.</em></p>
<p>IGW integrates tightly with the llm-d scheduler, supporting load-aware (kv-cache utilization and queue length) and prefix-cache aware scheduling. Benchmarks show that adding prefix-aware scorers achieves near-100% KV cache hit rates and significantly reduces TTFT compared to earlier routing approaches based on load alone. For cluster-level throughput, that means sustained &gt;60k tokens/s with lower latency spread. It also demonstrates the importance of combining prefix- and load-aware scorers to ensure that cache hits are exploited without overloading replicas.</p>
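<p>The interaction between the two signals can be sketched as a weighted score per replica. The following is a minimal illustration only - llm-d's actual scorer lives in the Go scheduler with its own plugin interfaces, and the weights, signal names, and normalization here are all assumptions:</p>

```python
# Hypothetical sketch of combining prefix-cache affinity with load signals.
# All names and weights are illustrative, not llm-d's real configuration.
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    prefix_cache_hit_ratio: float  # 0..1, fraction of prompt blocks already cached
    kv_cache_utilization: float    # 0..1, fraction of KV blocks in use
    queue_length: int              # requests waiting

def score(r: Replica, max_queue: int = 32,
          w_prefix: float = 2.0, w_kv: float = 1.0, w_queue: float = 1.0) -> float:
    """Higher is better: reward cache reuse, penalize load."""
    queue_pressure = min(r.queue_length / max_queue, 1.0)
    return (w_prefix * r.prefix_cache_hit_ratio
            - w_kv * r.kv_cache_utilization
            - w_queue * queue_pressure)

replicas = [
    Replica("pod-a", prefix_cache_hit_ratio=0.9, kv_cache_utilization=0.95, queue_length=30),
    Replica("pod-b", prefix_cache_hit_ratio=0.6, kv_cache_utilization=0.40, queue_length=4),
]
best = max(replicas, key=score)
print(best.name)  # pod-b: a cache hit alone doesn't win if the replica is saturated
```

<p>The point of the sketch is the tension the paragraph describes: with a prefix score alone, pod-a would win and overload; the load terms pull the decision back toward the lightly loaded replica.</p>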
<p><strong>Why it matters:</strong> This is the difference between inference systems that react to congestion and ones that proactively route requests to where cache reuse and concurrency balance are best. For operators it translates into fewer cache misses, reduced latency spikes, and predictable scaling.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="predicted-latency-based-scheduling"><strong>Predicted Latency-Based Scheduling</strong><a href="https://llm-d.ai/blog/llm-d-v0.3-expanded-hardware-faster-perf-and-igw-ga#predicted-latency-based-scheduling" class="hash-link" aria-label="Direct link to predicted-latency-based-scheduling" title="Direct link to predicted-latency-based-scheduling" translate="no">​</a></h3>
<p>We are introducing an experimental scheduling scorer designed to optimize latency. The predictor incorporates input length, in-flight tokens, and concurrency into a unified cost model to forecast TTFT and TPOT for each request across available servers. The system proactively balances requests within the 25–75% saturation window, where most clusters operate and where reactive schedulers fall short.</p>
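<p>Conceptually, the scorer reduces routing to "pick the server with the lowest forecast latency." A toy linear sketch of that idea follows - the coefficients and feature set are invented for illustration; the real predictor learns its model online from live traffic:</p>

```python
# Toy sketch of predicted-latency routing: forecast TTFT and TPOT per
# server from simple features, then pick the server with the lowest
# predicted end-to-end latency. All coefficients are illustrative.
def predict_ttft(input_tokens: int, queued_tokens: int, concurrency: int) -> float:
    # Prefill cost grows with prompt length and with work already in flight.
    return 0.0004 * input_tokens + 0.0002 * queued_tokens + 0.01 * concurrency

def predict_tpot(concurrency: int) -> float:
    # Per-output-token time degrades as the decode batch grows.
    return 0.008 + 0.0015 * concurrency

def predicted_latency(input_tokens, expected_output_tokens, queued_tokens, concurrency):
    return (predict_ttft(input_tokens, queued_tokens, concurrency)
            + expected_output_tokens * predict_tpot(concurrency))

servers = {
    "s1": {"queued_tokens": 40_000, "concurrency": 48},
    "s2": {"queued_tokens": 5_000, "concurrency": 12},
}
req = {"input_tokens": 8_000, "expected_output_tokens": 200}
best = min(servers, key=lambda s: predicted_latency(
    req["input_tokens"], req["expected_output_tokens"], **servers[s]))
print(best)  # s2: far lower predicted TTFT and TPOT
```

<p>Note the asymmetry the post discusses: the TTFT term depends only on known quantities (prompt length, in-flight work), while the decode term needs an output-length estimate, which is why prefill-heavy workloads benefit most.</p>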
<p><img decoding="async" loading="lazy" alt="Tail latency improvements with predicted latency scheduling" src="https://llm-d.ai/assets/images/predicted_1-f117713a68745829b646a3356ab40374.webp" width="1999" height="515" class="img_ev3q">
<em>Figure: TTFT and TPOT tail latency measured on increasing QPS rates (6 servers), comparing predicted-latency with prefix+load-aware scheduling for a 4:1 ISL:OSL ratio.</em></p>
<p>The first graph demonstrates that the new scorer slightly improves tail latency on a balanced ISL:OSL workload. Another major advantage over manually tuned load- and prefix-aware scorers is simpler configuration: thresholds are learned automatically from ingested features. This inherent adaptability makes it more resilient to shifts in traffic patterns and allows for easier integration of new scheduling signals.</p>
<p>The subsequent figure illustrates that as the workload shifts towards a more prefill-intensive distribution, the new scorer offers even better performance. This is because it more precisely predicts prefill duration, which is tied to a known request length, in contrast to decode time, which depends on an unpredictable output length.</p>
<p><img decoding="async" loading="lazy" alt="Strong latency improvement for extreme prefill / decode ratios" src="https://llm-d.ai/assets/images/predicted_2-2b9db2acc8fccdf86035fe83231708b1.webp" width="1999" height="515" class="img_ev3q"><em>Figure: TTFT and TPOT tail latency measured on increasing QPS rates (6 servers), comparing predicted-latency with prefix+load-aware scheduling for a 400:1 ISL:OSL ratio.</em></p>
<p>With better latency prediction, the new approach also improves SLO attainability for sheddable workloads by detecting saturation before it occurs rather than after.</p>
<p><img decoding="async" loading="lazy" alt="Better saturation anticipation when accelerators are almost full" src="https://llm-d.ai/assets/images/predicted_3-18aef05ada5f1a90c19686754500130e.webp" width="1999" height="737" class="img_ev3q">
<em>Figure: SLO attainability (1 server), comparing predicted-latency with load-aware scheduling and shedding for a balanced ~1:1 ISL:OSL ratio.</em></p>
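<p>The shedding decision can be sketched as a simple admission check on the forecast: if even the best predicted TTFT across replicas already violates the request's SLO, a sheddable request is rejected up front instead of queueing. This is an illustrative sketch, not llm-d's actual saturation detector:</p>

```python
# Sketch of SLO-aware admission with shedding. Values are illustrative.
def admit(predicted_ttfts_s: list, ttft_slo_s: float, sheddable: bool = True):
    """Return the index of the chosen replica, or None if the request is shed."""
    best = min(predicted_ttfts_s)
    if sheddable and best > ttft_slo_s:
        return None  # reject early, before the cluster saturates further
    return predicted_ttfts_s.index(best)

print(admit([2.4, 3.1], ttft_slo_s=1.0))  # None: every replica would miss the SLO
print(admit([0.3, 0.9], ttft_slo_s=1.0))  # 0: route to the fastest replica
```

<p>Rejecting on a forecast rather than on an observed queue backlog is what makes the detection proactive: saturation is anticipated before latency has already degraded.</p>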
<p><strong>Why it matters:</strong></p>
<ul>
<li class="">Users experience steadier, lower tail latencies: critical for interactive workloads.</li>
<li class="">Operators can push GPUs closer to saturation before scaling out, reducing cost.</li>
<li class="">SLOs evolve from reactive guardrails to proactive dials for tuning responsiveness vs. stability.</li>
<li class="">This lays the foundation for closed-loop autoscaling that responds directly to SLO objectives.</li>
</ul>
<p>For MaaS providers, these gains are not just academic; they directly translate into lower variance for customers and better utilization of costly accelerators.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="kv-disaggregation-and-chat-completions"><strong>KV Disaggregation and Chat Completions</strong><a href="https://llm-d.ai/blog/llm-d-v0.3-expanded-hardware-faster-perf-and-igw-ga#kv-disaggregation-and-chat-completions" class="hash-link" aria-label="Direct link to kv-disaggregation-and-chat-completions" title="Direct link to kv-disaggregation-and-chat-completions" translate="no">​</a></h3>
<p>The KV path also matures in v0.3. Synchronous tokenization has stabilized efficient caching in high-availability deployments, while precise prefix-cache aware scorers significantly reduce KV-cache duplication across instances and limit the impact of prefix-cache expiration pressure, cutting TTFT dramatically (57x in <a href="https://llm-d.ai/blog/kvcache-wins-you-can-see" target="_blank" rel="noopener noreferrer" class="">our benchmark</a>) and doubling throughput.</p>
<p>The benchmark simulates a B2B SaaS platform serving 150 enterprise customers, each with their own 6,000-token context shared among 5 concurrent users (750 total) under sustained load, where cache demand exceeds single-instance capacity and forces distributed scheduling - testing the system's ability to avoid cache thrashing.<br>
<img decoding="async" loading="lazy" alt="Precise prefix caching" src="https://llm-d.ai/assets/images/kv_1-4565767a57f55b90aa1f200d1cbaf09a.webp" width="1979" height="455" class="img_ev3q"><br>
<em>Figure: A tri-panel of TTFT, TPoT and Throughput measured through progressively rising QPS rates in a <a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#performance-results" target="_blank" rel="noopener noreferrer" class="">demanding benchmark</a>.</em></p>
<p><em><img decoding="async" loading="lazy" alt="Total computational work saved" src="https://llm-d.ai/assets/images/kv_2-004ee2b885ca03a955d63a0522d37b1f.webp" width="1580" height="724" class="img_ev3q"></em><br>
<em>Figure: The total computational work - tokens per second - <strong>saved</strong> by effective KV-cache use across the cluster, over the course of the <a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#1-effective-cache-throughput-quantifying-saved-work" target="_blank" rel="noopener noreferrer" class="">benchmarks</a>.</em></p>
<p><strong>Why it matters:</strong> The stronger cache affinity guarantees provided by precise prefix-cache aware routing keep latency low without requiring the use of coarse-grained sticky sessions or high operational overhead. Real-world chat completion workloads can keep the high cache hit rates of sticky sessions but gain utilization based balancing, reducing hot spotting and underutilized capacity.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="observability-and-benchmarking"><strong>Observability and Benchmarking</strong><a href="https://llm-d.ai/blog/llm-d-v0.3-expanded-hardware-faster-perf-and-igw-ga#observability-and-benchmarking" class="hash-link" aria-label="Direct link to observability-and-benchmarking" title="Direct link to observability-and-benchmarking" translate="no">​</a></h3>
<p>Behind these performance and scheduling gains is a stronger foundation for visibility and reproducibility.</p>
<ul>
<li class=""><strong>Observability</strong>: Gateway metrics are now exposed via ServiceMonitor, with PromQL queries and Grafana dashboards published in the docs.</li>
<li class=""><strong>Benchmarking</strong>: v0.3.0 RC1 supports all well-lit path scenarios in CI/CD. Inference-perf has been hardened with saturation testing, trace replay, and more accurate scheduling.</li>
</ul>
<p><strong>Why it matters:</strong> Operators can now validate configs before launch, monitor latency compliance in real time, and detect regressions early. This ensures the well-lit paths are not just concepts, but reproducible and production-ready.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-do-you-get-today"><strong>What Do You Get Today?</strong><a href="https://llm-d.ai/blog/llm-d-v0.3-expanded-hardware-faster-perf-and-igw-ga#what-do-you-get-today" class="hash-link" aria-label="Direct link to what-do-you-get-today" title="Direct link to what-do-you-get-today" translate="no">​</a></h2>
<p>With v0.3, here’s what you can run, experiment with, and rely on:</p>
<ul>
<li class="">Cross-platform hardware support across NVIDIA, Intel XPU, and Google TPU.</li>
<li class="">Simplified installation with curated guides, overlays, static manifests, and a capacity planner.</li>
<li class="">Wide-EP throughput (reaching 2.2k tokens/s per GPU on multi-node H200 clusters).</li>
<li class="">Inference Gateway v1.0 GA, with cache-aware routing and stable APIs.</li>
<li class="">Adaptive SLO predictor (preview) showing up to 3x better P90 latency in long-prefill workloads.</li>
<li class="">KV disaggregation with synchronous tokenization, precise scorers, and chat completions support.</li>
<li class="">Disaggregation over TCP and RDMA/RoCE, extending beyond InfiniBand networks.</li>
<li class="">Observability tools with ServiceMonitor and Grafana dashboards.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-is-next"><strong>What Is Next?</strong><a href="https://llm-d.ai/blog/llm-d-v0.3-expanded-hardware-faster-perf-and-igw-ga#what-is-next" class="hash-link" aria-label="Direct link to what-is-next" title="Direct link to what-is-next" translate="no">​</a></h2>
<p>While v0.3 stabilizes today’s well-lit paths, the community is also laying the foundations for tomorrow’s:</p>
<ul>
<li class=""><strong>Native CPU offload in vLLM</strong>: Async KV transfer APIs and CPU buffer transport are under review, enabling cache spillover to CPU memory with scheduler awareness through KV events.</li>
<li class=""><strong>Deferred decode and async protocols</strong>: design work is underway for finer scheduling control and a lower-level “tokens in / tokens out” serving engine in vLLM for high-scale multi-modal serving.</li>
<li class=""><strong>Public performance database</strong>: building on benchmarking harnesses to provide transparent, reproducible performance data across platforms.</li>
<li class=""><strong>Autoscaling incubation</strong>: WVA autoscaler integrated with Kubernetes HPA; demonstrated scale-to-zero. Next steps include tying autoscaler decisions directly to adaptive SLOs.</li>
</ul>
<p>If there is an area you’d like for us to focus on, please suggest it in our <a href="https://github.com/llm-d/llm-d/issues/347" target="_blank" rel="noopener noreferrer" class="">0.4 release tracking issue.</a></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="community-and-upcoming-events"><strong>Community and Upcoming Events</strong><a href="https://llm-d.ai/blog/llm-d-v0.3-expanded-hardware-faster-perf-and-igw-ga#community-and-upcoming-events" class="hash-link" aria-label="Direct link to community-and-upcoming-events" title="Direct link to community-and-upcoming-events" translate="no">​</a></h2>
<p>The strength of llm-d comes from its community. v0.3 reflects contributions from a growing number of contributors and collaborators. Thank you to everyone who helped out!</p>
<p>Upcoming highlights include llm-d sessions at All Things Open (Oct 12–14, 2025), PyTorch Conference (Oct 22–23, 2025), AMD AI DevDay (Oct 20), and KubeCon 2025 (Nov 10–14). You can follow these and other events on our <a href="https://llm-d.ai/docs/community/events" target="_blank" rel="noopener noreferrer" class="">community events page</a>.</p>
<p>Contribute on <a href="https://github.com/llm-d" target="_blank" rel="noopener noreferrer" class="">GitHub</a>, <a href="https://red.ht/llm-d-public-calendar" target="_blank" rel="noopener noreferrer" class="">join our community calls</a> (Wed 12:30pm ET), join the SIGs and come build with us!</p>]]></content>
        <author>
            <name>Robert Shaw</name>
            <email>robshaw@redhat.com</email>
            <uri>https://github.com/robertgshaw2-redhat</uri>
        </author>
        <author>
            <name>Clayton Coleman</name>
            <email>claytoncoleman@google.com</email>
            <uri>https://github.com/smarterclayton</uri>
        </author>
        <author>
            <name>Carlos Costa</name>
            <email>chcost@us.ibm.com</email>
            <uri>https://github.com/chcost</uri>
        </author>
        <category label="Releases" term="Releases"/>
        <category label="Announcements" term="Announcements"/>
        <category label="llm-d release news" term="llm-d release news"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[KV-Cache Wins You Can See: From Prefix Caching in vLLM to Distributed Scheduling with llm-d]]></title>
        <id>https://llm-d.ai/blog/kvcache-wins-you-can-see</id>
        <link href="https://llm-d.ai/blog/kvcache-wins-you-can-see"/>
        <updated>2025-09-24T09:00:00.000Z</updated>
        <summary type="html"><![CDATA[See how llm-d's precise KV-cache aware scheduling delivers 57x faster responses and 2x throughput in production distributed LLM inference benchmarks.]]></summary>
        <content type="html"><![CDATA[<p>The llm-d project provides a series of “well-lit paths” - tested, benchmarked solutions for deploying large language models in production. Our first path, <a class="" href="https://llm-d.ai/blog/intelligent-inference-scheduling-with-llm-d"><strong>Intelligent Inference Scheduling</strong></a>, established a baseline for AI-aware routing by balancing both cluster load and prefix-cache affinities. The default configuration for that path uses an <em>approximate</em> method for the latter, predicting cache locality based on request traffic.</p>
<p>This blog illuminates a more advanced and powerful path: <a class="" href="https://llm-d.ai/docs/guide/Installation/precise-prefix-cache-aware"><strong>precise prefix-cache aware scheduling</strong></a>.</p>
<p>We take a deep dive into the next generation of this feature, which moves beyond prediction and gives the scheduler direct introspection into distributed vLLM caches. This precision is key to maximizing cache hit rates and achieving a new level of performance and cost-efficiency in your distributed deployments.</p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>Blog key takeaways</div><div class="admonitionContent_BuS1"><ul>
<li class=""><strong>KV-cache hit rates directly impact your bottom line</strong>: With <strong>10x cost differences</strong> between cached and uncached tokens, cache efficiency isn't just a performance optimization — it's a fundamental cost and performance driver</li>
<li class=""><strong>This isn't theoretical</strong>: Real production workloads like conversational AI and agentic workflows naturally create the prefix-heavy patterns where this approach excels</li>
<li class=""><strong>vLLM's prefix caching breaks in distributed deployments</strong>: Standard load balancers scatter related requests across pods, destroying cache locality and forcing expensive re-computation</li>
<li class=""><strong>Precise prefix-cache aware scheduling delivers order-of-magnitude gains</strong>: Our benchmarks show <strong>57x faster response</strong> times and <strong>double the throughput</strong> on identical hardware</li>
</ul></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-most-important-metric-in-production-ai"><strong>The Most Important Metric in Production AI</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#the-most-important-metric-in-production-ai" class="hash-link" aria-label="Direct link to the-most-important-metric-in-production-ai" title="Direct link to the-most-important-metric-in-production-ai" translate="no">​</a></h2>
<p>In production LLM inference, we track dozens of metrics - latency, throughput, GPU utilization, and cost, to name a few. But one metric significantly stands out. As engineers building production-grade AI agents have noted:</p>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>Manus, <a href="https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus" target="_blank" rel="noopener noreferrer" class="">Context Engineering for AI Agents</a></div><div class="admonitionContent_BuS1"><p>"The <em>KV-cache hit rate is the single most important metric for a production-stage AI agent. It directly affects both latency and cost</em>."</p></div></div>
<p>This isn't just an academic claim; it has a direct and dramatic impact on the bottom line. Consider the pricing model for a state-of-the-art model like Anthropic's Claude Sonnet. The <a href="https://www.anthropic.com/pricing#api" target="_blank" rel="noopener noreferrer" class="">cost</a> for processing tokens that are already in the cache is <strong>10 times lower</strong> than for uncached tokens ($0.30 vs. $3.00 per million). The same pattern can be seen in OpenAI’s <a href="http://openai.com/api/pricing/" target="_blank" rel="noopener noreferrer" class="">API pricing</a> page. A high cache hit rate doesn't just make your application faster; it makes it <strong>fundamentally cheaper to operate</strong>. This is the power of the KV-cache.</p>
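<p>The arithmetic is worth making concrete. At the cited rates ($3.00 vs. $0.30 per million input tokens), a quick back-of-envelope calculation shows how strongly the hit rate drives cost:</p>

```python
# Back-of-envelope prompt cost at the Claude Sonnet rates cited above:
# $3.00 per million uncached input tokens, $0.30 per million cached.
UNCACHED_PER_M = 3.00
CACHED_PER_M = 0.30

def prompt_cost(tokens: int, hit_rate: float) -> float:
    """Dollar cost of processing `tokens` prompt tokens at a given cache hit rate."""
    cached = tokens * hit_rate
    uncached = tokens - cached
    return (uncached * UNCACHED_PER_M + cached * CACHED_PER_M) / 1_000_000

print(prompt_cost(100_000, 0.0))  # $0.30 per request with a cold cache
print(prompt_cost(100_000, 0.9))  # $0.057 at a 90% hit rate - over 5x cheaper
```

<p>At agentic-workload volumes, where the same long context is replayed on every turn, that multiple compounds across every request.</p>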
<p>In a single-instance environment, engines like vLLM leverage Automatic Prefix Caching to cut redundant work, reusing prior computations to drive faster, more efficient performance. However, the moment you scale to a distributed, multi-replica environment, these finely tuned optimizations can fall apart.</p>
<p>This blog explores that challenge: how the wins of vLLM's prefix caching are lost in naive distributed systems, and how llm-d's precise prefix-cache aware scheduling restores and enhances them. To fully grasp this, we first need to understand what makes vLLM so performant in a single instance. Let's dive in.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="inside-vllm-mastering-the-cache-in-a-single-instance"><strong>Inside vLLM: Mastering the Cache in a Single Instance</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#inside-vllm-mastering-the-cache-in-a-single-instance" class="hash-link" aria-label="Direct link to inside-vllm-mastering-the-cache-in-a-single-instance" title="Direct link to inside-vllm-mastering-the-cache-in-a-single-instance" translate="no">​</a></h2>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>For the Experts</div><div class="admonitionContent_BuS1"><p>Already know how vLLM uses the KV-cache and prefix caching to optimize inference? Feel free to skip right into the <a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#the-challenge-of-scale-out" class=""><strong>challenges of scale-out</strong></a>.</p></div></div>
<p>At the heart of every transformer model is the <strong>self-attention mechanism</strong> - how the model understands context by computing attention scores between every pair of tokens. This all-pairs comparison scales quadratically with input length, making the initial <strong>prefill</strong> computation the most expensive part of generation.</p>
<p>Prefill produces <strong>Key (K)</strong> and <strong>Value (V)</strong> tensors that are stored in the <strong>KV-cache</strong> - the model's short-term memory. For each subsequent token generated during <strong>decode</strong>, the model simply reads these existing values from memory rather than recomputing them.</p>
<p>vLLM takes this further with <strong>Automatic Prefix Caching</strong>: it intelligently identifies when requests share the same token sequence prefix. Instead of recomputing, it reuses the exact same memory pages from the cache through hash-based block matching. This principle of reusing computed work drives vLLM's performance:</p>
<ul>
<li class=""><strong>Time to First Token (TTFT)</strong> plummets because the expensive prefill step is mostly skipped</li>
<li class="">Overall <strong>throughput</strong> increases because the GPU is freed up to serve more requests</li>
</ul>
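<p>A toy sketch makes the block-matching mechanics concrete. The block size below matches vLLM's default, but the hashing scheme is simplified for illustration and is not vLLM's actual implementation:</p>

```python
import hashlib

BLOCK_SIZE = 16  # tokens per cache block (vLLM's default)

def block_hashes(token_ids):
    """Chained hashes: each block's hash covers the entire prefix up to it,
    so two sequences share a block hash only if they share the whole prefix."""
    hashes, prev = [], b""
    usable = len(token_ids) - len(token_ids) % BLOCK_SIZE  # full blocks only
    for i in range(0, usable, BLOCK_SIZE):
        digest = hashlib.sha256(prev + repr(token_ids[i:i + BLOCK_SIZE]).encode())
        hashes.append(digest.hexdigest())
        prev = hashes[-1].encode()
    return hashes

def cached_prefix_tokens(prompt_ids, cache):
    """How many leading tokens can skip prefill, given cached block hashes."""
    hit = 0
    for h in block_hashes(prompt_ids):
        if h not in cache:
            break  # chained hashing means the first miss ends the match
        hit += BLOCK_SIZE
    return hit

cache = set()
first = list(range(100))            # 100-token prompt: 6 full blocks
cache.update(block_hashes(first))   # prefill populates the cache
follow_up = first[:64] + [7] * 40   # shares the first 64 tokens, then diverges
print(cached_prefix_tokens(follow_up, cache))  # 64 tokens of prefill saved
```

Because the hashes are chained, a match can only ever be a true shared prefix: the moment two requests diverge, every later block hash differs as well.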
<p>In a simple test, sending the same ~10,000-token prompt to a Qwen/Qwen3-32B instance a second time drops time-to-first-token from <strong>4.3 seconds</strong> to just <strong>0.6 seconds</strong>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="prefix-reuse-in-practical-use-cases"><strong>Prefix Reuse in Practical Use Cases</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#prefix-reuse-in-practical-use-cases" class="hash-link" aria-label="Direct link to prefix-reuse-in-practical-use-cases" title="Direct link to prefix-reuse-in-practical-use-cases" translate="no">​</a></h2>
<p>The power of vLLM's caching isn't theoretical; it directly maps to the structure of the most common and valuable LLM workloads. By understanding this pattern, we can see exactly what's at stake when serving in production.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="conversational-ai"><strong>Conversational AI</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#conversational-ai" class="hash-link" aria-label="Direct link to conversational-ai" title="Direct link to conversational-ai" translate="no">​</a></h4>
<p>In any multi-turn dialogue, from a customer service bot to a long-form assistant, the entire chat history and system prompt form a massive <strong>prefix</strong>. Each new user message is a tiny <strong>suffix</strong>. Effective caching means only the latest turn is prefilled, keeping the conversation fluid and responsive and preventing latency from growing as the dialogue gets longer.</p>
<p><img decoding="async" loading="lazy" alt="Conversational AI prefix caching diagram" src="https://llm-d.ai/assets/images/image1-f56fc52a69a8c2874de96009de0ef719.webp" width="1672" height="666" class="img_ev3q"></p>
<small><em><strong>FIGURE 1</strong>: A diagram showing the conversational history as a growing prefix that gets cached, with only the new user query requiring prefill.</em></small>
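<p>The arithmetic behind this pattern is easy to check. With assumed sizes - a 500-token system prompt, 50-token user turns, 150-token replies, all invented for illustration - the cached share of each prompt keeps growing:</p>

```python
SYSTEM, USER_TURN, REPLY = 500, 50, 150   # assumed sizes, for illustration

cached_share = []
history = SYSTEM                  # system prompt; cached after the first prefill
for turn in range(1, 6):
    prompt_len = history + USER_TURN       # full context sent this turn
    cached_share.append(history / prompt_len)
    print(f"turn {turn}: prefill {USER_TURN} of {prompt_len} tokens "
          f"({cached_share[-1]:.0%} served from cache)")
    history = prompt_len + REPLY           # reply KV is written during decode
```

By turn 5, over 96% of the prompt is already resident in the cache - and the fraction only grows as the conversation continues.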
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="agentic-workflows"><strong>Agentic Workflows</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#agentic-workflows" class="hash-link" aria-label="Direct link to agentic-workflows" title="Direct link to agentic-workflows" translate="no">​</a></h4>
<p>AI agents represent the most extreme case of prefix dominance. These systems operate in reasoning loops where the prefix contains the agent's goals, tool definitions, and a long history of actions and observations. Production data shows this can lead to input-to-output ratios exceeding <strong>100:1</strong> <em>(from the Manus <a href="https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus" target="_blank" rel="noopener noreferrer" class="">blog</a>)</em>, making the prefix overwhelmingly large.</p>
<p><img decoding="async" loading="lazy" alt="Agentic workflow prefix caching diagram" src="https://llm-d.ai/assets/images/image2-4d74f6f25930749280d1e926bcbdbb49.webp" width="1158" height="670" class="img_ev3q"></p>
<small><em><strong>FIGURE 2</strong>: A visual of an agent loop, showing the massive, static context (tools, step-history) as the cached prefix and the new observation/action as the small suffix.</em></small>
<br>
<br>
<p>Reusing this massive context on each turn is essential for complex agents to be computationally viable and cost-effective.</p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>What about RAG?</div><div class="admonitionContent_BuS1"><p>While Retrieval-Augmented Generation also relies on large prefixes (system prompts + documents), reusing the KVs is more challenging. The exact documents and their order often change between queries, breaking simple prefix patterns. This requires a more sophisticated approach, which we will discuss briefly at the end of this post.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-challenge-of-scale-out"><strong>The Challenge of Scale-Out</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#the-challenge-of-scale-out" class="hash-link" aria-label="Direct link to the-challenge-of-scale-out" title="Direct link to the-challenge-of-scale-out" translate="no">​</a></h2>
<p>What happens when we move from a single-instance environment to distributed production clusters? The once-unified KV-cache becomes <strong>disaggregated</strong>. Each vLLM pod manages its own cache in complete isolation. Standard load balancers naively spread traffic evenly using cache-blind metrics, scattering related requests across different pods and destroying cache locality.</p>
<p>Let's revisit our agentic workflow example to see the direct impact of being blind to this unmanaged, disaggregated cache:</p>
<p><img decoding="async" loading="lazy" alt="KV-cache miss scenario diagram" src="https://llm-d.ai/assets/images/image3-fd643d27f8380909e5a26315118ce19e.webp" width="1368" height="472" class="img_ev3q"></p>
<small><em><strong>FIGURE 3</strong>: A heartbreaking KV-cache miss scenario.</em></small>
<br>
<br>
<p>This single routing decision triggers a cascade of failures:</p>
<ul>
<li class=""><strong>Cache Miss:</strong> The warm cache benefit on Pod A is completely lost</li>
<li class=""><strong>Duplicated Work:</strong> The most expensive computation is needlessly performed twice</li>
<li class=""><strong>Increased Latency:</strong> Users experience significantly higher Time to First Token (TTFT)</li>
<li class=""><strong>Wasted GPU Resources:</strong> Expensive hardware is tied up re-doing work instead of serving new requests, lowering overall system throughput</li>
</ul>
<p>In a production environment with thousands of concurrent requests, this isn't a rare event; it's the default behavior. The result is a system that is <strong>significantly</strong> <strong>slower</strong> and <strong>more expensive</strong> than it should be. This is the core challenge that <a href="https://github.com/llm-d/llm-d/blob/main/guides/precise-prefix-cache-aware/README.md" target="_blank" rel="noopener noreferrer" class="">llm-d's precise prefix-cache aware scheduling</a> is designed to solve.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="llm-d-precise-prefix-cache-aware-scheduling"><strong>llm-d: Precise Prefix-Cache Aware Scheduling</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#llm-d-precise-prefix-cache-aware-scheduling" class="hash-link" aria-label="Direct link to llm-d-precise-prefix-cache-aware-scheduling" title="Direct link to llm-d-precise-prefix-cache-aware-scheduling" translate="no">​</a></h2>
<p>We've just seen how scaling out a vLLM cluster naturally disaggregates the KV-cache, creating a distributed pool of memory that leads to costly cache misses. The solution, then, is to bridge the disaggregation. To restore the benefits of prefix caching, the scheduler needs a new sense: vision into the real-time state of the distributed cache.</p>
<p>This is precisely what llm-d provides (pun intended). It creates a <strong>global view</strong> of the cluster's KV-cache, allowing it to treat the disaggregated memory as a single, manageable pool and route requests with precision.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="how-it-works-a-global-cache-view-via-kvevents"><strong>How It Works: A Global Cache View via KVEvents</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#how-it-works-a-global-cache-view-via-kvevents" class="hash-link" aria-label="Direct link to how-it-works-a-global-cache-view-via-kvevents" title="Direct link to how-it-works-a-global-cache-view-via-kvevents" translate="no">​</a></h3>
<p>The global cache view is built upon a continuous stream of <a href="https://docs.vllm.ai/en/latest/api/vllm/config/kv_events.html" target="_blank" rel="noopener noreferrer" class=""><strong><code>KVEvents</code></strong></a> from each vLLM pod, which are processed efficiently by the open-source <a href="https://github.com/llm-d/llm-d-kv-cache" target="_blank" rel="noopener noreferrer" class=""><strong><code>llm-d-kv-cache</code></strong></a> library.</p>
<p>The <code>KVEvents</code> provide a live feed of all physical cache changes across the cluster, firing every time a cache block is created or evicted. This stream is then ingested and organized by the llm-d-kv-cache library's components:</p>
<ol>
<li class=""><strong><code>kvevents.Pool</code></strong>: This component consumes the high-throughput stream of events. As it digests them, it continuously updates a low-level <strong>KV-Block Index</strong>, which maintains a simple, real-time map of block-hashes to the pod and memory-medium (GPU/CPU) it resides on.</li>
<li class=""><strong><code>kvcache.Index</code></strong>: This is the higher-level index used by the scheduler. It uses the underlying KV-Block Index to map logical sequences of tokens (i.e., prefixes) to the pods that hold them. This provides the direct answer to the question, "what percentage of this request's prefix is on the accessible Pods?"</li>
</ol>
<p>This two-layered architecture provides a continuously updated, scalable view of the cluster's cache state, which is the key to enabling intelligent, cache-aware routing.</p>
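<p>In Python-flavored pseudocode (the real library is Go, and the event schema and class names here are invented for illustration), the two layers compose roughly like this:</p>

```python
from collections import defaultdict

class KVBlockIndex:
    """Low-level map: block hash -> set of (pod, medium) locations."""
    def __init__(self):
        self.locations = defaultdict(set)
    def apply_event(self, event):
        pod, medium = event["pod"], event["medium"]
        if event["type"] == "block_stored":
            for h in event["hashes"]:
                self.locations[h].add((pod, medium))
        elif event["type"] == "block_removed":
            for h in event["hashes"]:
                self.locations[h].discard((pod, medium))

class PrefixIndex:
    """Higher-level view: cached-prefix fraction per pod for one request."""
    def __init__(self, block_index):
        self.blocks = block_index
    def match(self, prefix_hashes, pods):
        scores = {}
        for pod in pods:
            hit = 0
            for h in prefix_hashes:   # hashes are chained: stop at first miss
                if not any(p == pod for p, _ in self.blocks.locations[h]):
                    break
                hit += 1
            scores[pod] = hit / len(prefix_hashes) if prefix_hashes else 0.0
        return scores

idx = KVBlockIndex()
idx.apply_event({"type": "block_stored", "pod": "pod-a", "medium": "gpu",
                 "hashes": ["h1", "h2", "h3"]})
idx.apply_event({"type": "block_stored", "pod": "pod-b", "medium": "gpu",
                 "hashes": ["h1"]})
prefix = PrefixIndex(idx)
print(prefix.match(["h1", "h2", "h3", "h4"], ["pod-a", "pod-b"]))
# pod-a holds 3 of the 4 prefix blocks, pod-b just 1
```

The write path (events flowing into the block index) and the read path (prefix lookups by the scheduler) are independent, which is what lets the index keep up with a high-throughput event stream.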
<p><img decoding="async" loading="lazy" alt="llm-d architecture diagram" src="https://llm-d.ai/assets/images/image4-15b3f914b5bdb03e4fe74b393fcadb5e.webp" width="1674" height="898" class="img_ev3q"></p>
<small><em><strong>FIGURE 4</strong>: Simplified architecture diagram. (1) - (3) show the read path, while (A) - (B) show the write pipeline.</em></small>
<br>
<br>
<p><strong>What about the overhead?</strong> The memory overhead for this global index is negligible - see <strong>Appendix A.3</strong> for the scaling analysis showing a <strong>1,000,000:1</strong> data-to-metadata ratio.</p>
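<p>The exact ratio depends on model shape and index layout - Appendix A.3 has the real analysis - but a back-of-envelope sketch with assumed model dimensions shows why the index is so cheap relative to the data it tracks:</p>

```python
# All dimensions below are assumptions for illustration, not Appendix A.3's.
BLOCK_SIZE = 16                       # tokens per KV block
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128
BYTES_PER_ELEM = 2                    # fp16/bf16

kv_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM  # K and V
kv_per_block = kv_per_token * BLOCK_SIZE
index_per_block = 64                  # one hash plus a few pod/medium entries

print(f"KV data per block: {kv_per_block / 2**20:.0f} MiB")
print(f"index metadata:    {index_per_block} B")
print(f"ratio:             {kv_per_block // index_per_block:,}:1")
```

Even under these rough assumptions the metadata is four to five orders of magnitude smaller than the KV data it describes, so indexing an entire cluster's cache costs only megabytes.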
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>High availability support</div><div class="admonitionContent_BuS1"><p>This design naturally supports active-active or active-passive deployments, with either full view duplication or sharding through configurations.</p></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-precise-prefix-cache-scorer"><strong>The Precise Prefix-Cache Scorer</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#the-precise-prefix-cache-scorer" class="hash-link" aria-label="Direct link to the-precise-prefix-cache-scorer" title="Direct link to the-precise-prefix-cache-scorer" translate="no">​</a></h3>
<p>With an accurate, real-time global view of the cache, the scheduler can now perform intelligent routing. The component responsible for this is the <strong>Precise Prefix-Cache Scorer</strong>. It sits within the scheduler and utilizes the <code>kvcache.Index</code> to perform a simple but critical task for every incoming request:</p>
<ol>
<li class="">It queries the <code>kvcache.Index</code> to determine what percentage of that prefix is already available on each active vLLM pod.</li>
<li class="">It outputs a "cache affinity score" for each pod, directly representing the computational work that can be saved.</li>
</ol>
<p>This scorer provides a strong <strong>stickiness</strong> signal, scheduling requests to maximize the probability of a cache hit. However, relying solely on stickiness can create new problems, like sending a stream of requests to an already overloaded pod while others sit idle.</p>
<p>Therefore, the final routing decision isn't based on this score alone. As detailed in our previous post on the <a class="" href="https://llm-d.ai/blog/intelligent-inference-scheduling-with-llm-d"><strong>Intelligent Inference Scheduling</strong></a> well-lit path, the KV-cache affinity score is combined with distributive, load-aware scores, creating a balanced decision.</p>
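<p>Conceptually, the combination is a weighted sum over scorers. The weights and scorer names below are illustrative, not the scheduler's actual configuration:</p>

```python
# Illustrative weights and scorer names; not the actual llm-d configuration.
WEIGHTS = {"prefix_affinity": 2.0, "queue": 1.0, "kv_util": 1.0}

def pick_pod(pods):
    def score(pod):
        return (WEIGHTS["prefix_affinity"] * pod["prefix_affinity"]
                + WEIGHTS["queue"] * (1.0 - pod["queue_fill"])    # prefer short queues
                + WEIGHTS["kv_util"] * (1.0 - pod["kv_util"]))    # prefer free cache
    return max(pods, key=score)["name"]

pods = [
    # Warm prefix, but completely saturated:
    {"name": "pod-a", "prefix_affinity": 0.9, "queue_fill": 1.0, "kv_util": 1.0},
    # Partial prefix match and mostly idle:
    {"name": "pod-b", "prefix_affinity": 0.5, "queue_fill": 0.2, "kv_util": 0.3},
]
print(pick_pod(pods))  # load scores outweigh pure stickiness: pod-b wins
```

When pod-a is healthy its warm cache wins the request; when it saturates, the load scorers steer traffic elsewhere, which is exactly the balance the blended decision is designed to strike.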
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="performance-results"><strong>Performance Results</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#performance-results" class="hash-link" aria-label="Direct link to performance-results" title="Direct link to performance-results" translate="no">​</a></h2>
<p>To validate this approach, we benchmarked four scheduling strategies on a cluster of <strong>8 vLLM pods (16 H100 GPUs total)</strong> using a realistic B2B workload simulating <strong>150 enterprise customers</strong> each with <strong>6,000-token contexts</strong>, with <strong>5 concurrent users per customer</strong> submitting <strong>1,200-token queries</strong> under sustained load from <strong>3-60 QPS</strong>.</p>
<p>The total KV-cache demand this workload generates is <strong>73% of the cluster’s capacity</strong>, which is <strong>six times</strong> more than any single pod could hold, forcing the system to distribute prefixes across the cluster - exactly where intelligent scheduling becomes essential.</p>
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>Benchmark details</div><div class="admonitionContent_BuS1"><p>See <strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#a1-benchmark-setup-details" class="">Appendix A.1</a></strong> and <strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#a2-workload-details---real-world-b2b-saas-scenario" class="">Appendix A.2</a></strong> for complete benchmark methodology and workload details.</p></div></div>
<p>The four strategies compared:</p>
<ul>
<li class=""><strong><code>random-scheduling</code></strong>: A naive scheduler, acting as the control group.</li>
<li class=""><strong><code>load-scheduling</code></strong>: A scheduler aware of only of load scorers: vLLM queueing + kv-cache-utilization</li>
<li class=""><strong><code>approximate-scheduling</code></strong>: The default configuration in the intelligent inference scheduling path, extending load-aware scheduling with the <a href="https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/prefix-aware/" target="_blank" rel="noopener noreferrer" class=""><strong>approximate</strong> prefix-cache scorer</a>.<!-- -->
<ul>
<li class="">This plugin builds an approximate-locality index based on routing history.</li>
</ul>
</li>
<li class=""><strong><code>precise-scheduling</code></strong>: The advanced well-lit path described in this post.</li>
</ul>
<p>This benchmark, therefore, tests the scheduler's ability to efficiently manage the disaggregated KV-cache. In a production environment, if the total cache demand were to exceed the cluster's capacity, an autoscaling system would be responsible for spinning up more replicas to maintain SLOs. Here, we focus on <strong>maximizing the performance of the existing hardware</strong>.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-results-a-leap-in-performance"><strong>The Results: A Leap in Performance</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#the-results-a-leap-in-performance" class="hash-link" aria-label="Direct link to the-results-a-leap-in-performance" title="Direct link to the-results-a-leap-in-performance" translate="no">​</a></h3>
<p>The summary table below shows the difference across the key performance indicators.</p>
<table><thead><tr><th style="text-align:left">Experiment</th><th style="text-align:left">Output toks/s</th><th style="text-align:left">TTFT p90 (s)</th><th style="text-align:left">TTFT mean (s)</th><th style="text-align:left">vLLM Wait Queue (mean)</th></tr></thead><tbody><tr><td style="text-align:left"><strong>precise-scheduling</strong></td><td style="text-align:left"><strong>8730.0</strong></td><td style="text-align:left"><strong>0.542</strong></td><td style="text-align:left"><strong>0.298</strong></td><td style="text-align:left"><strong>0.1</strong></td></tr><tr><td style="text-align:left">approximate-scheduling</td><td style="text-align:left">6944.4</td><td style="text-align:left">31.083</td><td style="text-align:left">13.316</td><td style="text-align:left">8.1</td></tr><tr><td style="text-align:left">load-scheduling</td><td style="text-align:left">4428.7</td><td style="text-align:left">94.865</td><td style="text-align:left">46.987</td><td style="text-align:left">28.9</td></tr><tr><td style="text-align:left">random-scheduling</td><td style="text-align:left">4428.7</td><td style="text-align:left">92.551</td><td style="text-align:left">45.281</td><td style="text-align:left">27.3</td></tr></tbody></table>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="time-to-first-token-ttft"><strong>Time to First Token (TTFT)</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#time-to-first-token-ttft" class="hash-link" aria-label="Direct link to time-to-first-token-ttft" title="Direct link to time-to-first-token-ttft" translate="no">​</a></h4>
<p>The most dramatic impact was on user-facing latency. <code>precise-scheduling</code> delivered a P90 TTFT of just <strong>0.542 seconds</strong>. In contrast, the approximate scheduler took over <strong>31 seconds</strong>, and the cache-blind schedulers took over <strong>90 seconds</strong>.</p>
<ul>
<li class=""><strong><code>precise-scheduling</code> is 57x faster than <code>approximate-scheduling</code>.</strong></li>
<li class=""><strong><code>precise-scheduling</code> is over 170x faster than <code>random-scheduling</code>.</strong></li>
</ul>
<p>This is the difference between an interactive experience and a system that is functionally unusable at scale.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="total-system-throughput"><strong>Total System Throughput</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#total-system-throughput" class="hash-link" aria-label="Direct link to total-system-throughput" title="Direct link to total-system-throughput" translate="no">​</a></h4>
<p>This efficiency in latency directly translates to higher system capacity. <code>precise-scheduling</code> achieved a total throughput of <strong>8,730 output tokens/second</strong>. This represents:</p>
<ul>
<li class="">A <strong>25% increase</strong> over the <strong><code>approximate-scheduling</code></strong> baseline.</li>
<li class="">Over <strong>double the throughput</strong> of the cache-blind configurations.</li>
</ul>
<p>This allows you to handle significantly more traffic on the exact same hardware, simply by eliminating the waste of cache misses.</p>
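<p>The headline factors can be checked directly against the summary table:</p>

```python
# Figures copied from the summary table above.
ttft_p90 = {"precise": 0.542, "approximate": 31.083, "random": 92.551}
tput = {"precise": 8730.0, "approximate": 6944.4, "random": 4428.7}

print(f"TTFT vs approximate: {ttft_p90['approximate'] / ttft_p90['precise']:.0f}x")  # ~57x
print(f"TTFT vs random:      {ttft_p90['random'] / ttft_p90['precise']:.0f}x")       # ~171x
print(f"throughput vs approximate: +{tput['precise'] / tput['approximate'] - 1:.1%}")
print(f"throughput vs random:      {tput['precise'] / tput['random']:.2f}x")
```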
<p><img decoding="async" loading="lazy" alt="Performance benchmark charts" src="https://llm-d.ai/assets/images/image5-4565767a57f55b90aa1f200d1cbaf09a.webp" width="1979" height="455" class="img_ev3q"></p>
<small><em><strong>FIGURE 5</strong>: A tri-panel of TTFT, TPoT and Throughput measured through progressively rising QPS rates.</em></small>
<br>
<br>
<p>The charts above clearly illustrate these wins. The blue line (<code>precise-scheduling</code>) maintains the lowest Mean TTFT and achieves the highest Total Throughput as the request rate increases.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-why-from-saved-work-to-system-throughput"><strong>The "Why": From Saved Work to System Throughput</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#the-why-from-saved-work-to-system-throughput" class="hash-link" aria-label="Direct link to the-why-from-saved-work-to-system-throughput" title="Direct link to the-why-from-saved-work-to-system-throughput" translate="no">​</a></h4>
<p>The dramatic performance gains seen in the benchmarks are a direct result of <strong>system efficiency</strong>, a difference that is immediately visible in the <strong>real-time Grafana metrics</strong>.</p>
<p>The following graphs were captured throughout the benchmark runs. Schedulers are shown in order: <code>precise-scheduling</code> <em>(left)</em>, <code>approximate-scheduling</code> <em>(center)</em>, and <code>random-scheduling</code> <em>(right)</em>.</p>
<h5 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-effective-cache-throughput-quantifying-saved-work"><strong>1. Effective Cache Throughput: Quantifying Saved Work</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#1-effective-cache-throughput-quantifying-saved-work" class="hash-link" aria-label="Direct link to 1-effective-cache-throughput-quantifying-saved-work" title="Direct link to 1-effective-cache-throughput-quantifying-saved-work" translate="no">​</a></h5>
<p>First, we measure the <strong>Effective Cache Throughput</strong> - the number of prompt <strong>tokens</strong> per second served directly from the cache. This metric quantifies the computational work the GPUs <em><strong>avoided</strong></em>. A high value means the system is consistently saving massive amounts of expensive prefill computation.</p>
<p><img decoding="async" loading="lazy" alt="Effective cache throughput metrics" src="https://llm-d.ai/assets/images/image6-5f3e088e2b67914bf09e6048acc7b97c.webp" width="1322" height="602" class="img_ev3q"></p>
<small><em><strong>FIGURE 6</strong>: The total computational work <strong>saved</strong> by the KV-cache across the cluster, over the course of the benchmarks.</em></small>
<br>
<br>
<p>The chart clearly shows that <code>precise-scheduling</code> sustains a massive and stable throughput of saved work by consistently hitting cached prefixes. In the middle, <code>approximate-scheduling</code> shows good but lower efficiency, and on the right, <code>random-scheduling</code> saves almost no work.</p>
<h5 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-system-state-the-consequence-of-efficiency"><strong>2. System State: The Consequence of Efficiency</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#2-system-state-the-consequence-of-efficiency" class="hash-link" aria-label="Direct link to 2-system-state-the-consequence-of-efficiency" title="Direct link to 2-system-state-the-consequence-of-efficiency" translate="no">​</a></h5>
<p>This saved work translates directly into system health. By avoiding prefill bottlenecks, the GPUs can focus on productive decoding. We can see this by comparing the number of "<strong>Waiting</strong>" requests (<strong>queued</strong>) and "<strong>Running</strong>" requests (<strong>in decode</strong>).</p>
<p><img decoding="async" loading="lazy" alt="vLLM waiting requests metrics" src="https://llm-d.ai/assets/images/image7-11d29248e610592d25359670ee7c378b.webp" width="1999" height="886" class="img_ev3q"><br>
<small><em><strong>FIGURE 7</strong>: The number of <strong>waiting requests</strong> in vLLM over the course of the benchmark.</em></small></p>
<p><img decoding="async" loading="lazy" alt="vLLM running requests metrics" src="https://llm-d.ai/assets/images/image8-8648ded7551815d27aeb8d00ff10bc16.webp" width="1999" height="886" class="img_ev3q"><br>
<small><em><strong>FIGURE 8</strong>: The number of <strong>running requests</strong> <strong>(decoding)</strong> in vLLM over the course of the benchmark.</em></small></p>
<p>The <strong><code>precise-scheduling</code></strong> plots on the left show a stable system. By effectively utilizing the disaggregated KV-cache, it maintains minimal waiting queues and maximizes the number of actively running requests. In contrast, the other schedulers are clearly overwhelmed; their growing waiting queues choke the system and prevent work from being done efficiently.</p>
<p>This instability is caused by <strong>"cache thrashing."</strong> Cache-blind schedulers constantly <strong>duplicate and evict</strong> the same prefixes across different pods, wasting GPU cycles on <strong>redundant prefill</strong>. <code>precise-scheduling</code> avoids this entirely: it knows exactly where each prefix lives and consistently routes requests toward cache hits - as long as load allows - resulting in less work, virtually no queues, and a healthy system.</p>
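<p>A tiny simulation illustrates the thrashing dynamic. Two pods, each with room for two cached prefixes, serve four recurring customer prefixes; the cache sizes and routing policies are invented purely for illustration:</p>

```python
import random
from collections import OrderedDict

class PodCache:
    """Tiny LRU standing in for one pod's KV-block pool."""
    def __init__(self, capacity):
        self.blocks, self.capacity, self.misses = OrderedDict(), capacity, 0
    def access(self, prefix):
        if prefix in self.blocks:
            self.blocks.move_to_end(prefix)        # cache hit: refresh LRU order
        else:
            self.misses += 1                       # miss: redundant prefill
            self.blocks[prefix] = True
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)    # evict the coldest prefix

def run(route):
    pods = [PodCache(2), PodCache(2)]
    prefixes = ["cust-a", "cust-b", "cust-c", "cust-d"]
    for step in range(200):
        prefix = prefixes[step % len(prefixes)]
        pods[route(prefix)].access(prefix)
    return sum(p.misses for p in pods)

random.seed(0)
blind = lambda prefix: random.randrange(2)             # cache-blind routing
sticky_map = {"cust-a": 0, "cust-b": 0, "cust-c": 1, "cust-d": 1}
sticky = lambda prefix: sticky_map[prefix]             # prefix-aware routing

print("cache-blind misses: ", run(blind))
print("prefix-aware misses:", run(sticky))             # only 4 compulsory misses
```

Prefix-aware routing pays only the four compulsory first-visit misses, while blind routing keeps duplicating and evicting the same prefixes on both pods for the entire run.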
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>Session-Based Scheduling</div><div class="admonitionContent_BuS1"><p>Session-based scheduling provides affinity for individual users but misses cross-user scenarios. In our benchmark with <strong>150 enterprise customers</strong> each having <strong>6,000-token system prompts</strong>, session-scheduling would create 750 separate sessions but miss cross-user cache reuse within customer groups, leaving the majority of computational work uncaptured. Precise prefix-cache aware scheduling guarantees <strong>maximal reuse</strong> across the system.</p></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="adoption"><strong>Adoption</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#adoption" class="hash-link" aria-label="Direct link to adoption" title="Direct link to adoption" translate="no">​</a></h3>
<p>The dramatic performance improvements demonstrated in these benchmarks are what drive real-world adoption.</p>
<p><strong>Alibaba Cloud</strong>, for instance, is integrating this precise routing strategy into its <strong>Alibaba Cloud Container Service for Kubernetes (ACK) Gateway with Inference Extension (GIE)</strong>. To further enhance its production deployments for models like <strong>Qwen</strong> and <strong>DeepSeek</strong>, the team is developing a disaggregated tokenization service to support complementary features, with the goal of contributing this work back to the llm-d community. The end-to-end capabilities are already being validated in customer simulation environments.</p>
<p>This same potential has led <strong>DaoCloud</strong> to enhance its <strong>d.run</strong> <strong>Model-as-a-Service (MaaS)</strong> platform to accelerate inference for <strong>DeepSeek</strong> and other advanced models, adopting distributed inference with P/D disaggregation and advanced KV-cache architectures via <strong>Kubernetes</strong>, <strong>vLLM</strong>, and <strong>llm-d</strong>. Kay Yan emphasizes: "intelligent KV-cache management enables more adaptive and cost-effective inference architectures."</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="whats-next-extending-the-cache-aware-paradigm"><strong>What's Next: Extending the Cache-Aware Paradigm</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#whats-next-extending-the-cache-aware-paradigm" class="hash-link" aria-label="Direct link to whats-next-extending-the-cache-aware-paradigm" title="Direct link to whats-next-extending-the-cache-aware-paradigm" translate="no">​</a></h2>
<p>Precise prefix-aware scheduling is a massive step forward, but it's just one part of a broader, cache-centric vision for inference. The llm-d project is rapidly evolving, with several exciting directions on the horizon:</p>
<ul>
<li class=""><strong>Enhanced CPU Offloading:</strong> For KV-cache pools at larger scales, we are deepening the native vLLM integration for CPU offloading. This will allow for massive cache pools that intelligently tier between GPU VRAM and cheaper CPU RAM, with the scheduler making latency-aware decisions.</li>
<li class=""><strong>KV-Cache-Fusion for RAG:</strong> As noted earlier, RAG workloads present a unique challenge because retrieved documents can appear in different orders, breaking simple prefix patterns. The next frontier is position-independent KV-fusion, a technique enabling flexible and powerful cache reuse across varied RAG queries. This will come hand-in-hand with <strong>large-scale storage offloading</strong>.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="conclusion"><strong>Conclusion</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#conclusion" class="hash-link" aria-label="Direct link to conclusion" title="Direct link to conclusion" translate="no">​</a></h2>
<p>The journey of llm-d reflects a broader shift in how we think about LLM inference - not as a set of stateless function calls, but as a dynamic, stateful orchestration problem. The benchmark data is clear: <strong>prefix-cache aware scheduling is not just an optimization; it is essential for production performance and cost efficiency.</strong></p>
<p>By moving from AI-blind routing to a precise, KV-cache aware strategy, <strong>we unlock order-of-magnitude improvements in latency and throughput on the same hardware</strong>. The well-lit path of precise prefix-cache awareness offers a tested, benchmarked solution to make your distributed deployments dramatically more efficient.</p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>Choosing the Right Strategy</div><div class="admonitionContent_BuS1"><p>The optimal scheduler depends on the complexity of the workload. Below is a hierarchy of supported strategies, where each level addresses the limitations of the one before it.</p><ul>
<li class="">
<p><strong>1. Random/Round-Robin Scheduling</strong>: this simple approach works well for symmetric workloads where all requests have similar computational costs and minimal cache reuse.</p>
</li>
<li class="">
<p><strong>2. Load-Aware Scheduling</strong>: the necessary next step for asymmetric workloads. By routing requests based on Pod serving capacity, it prevents overload and improves resource utilization.</p>
</li>
<li class="">
<p><strong>3. Approximate Prefix-Cache Scheduling</strong>: this strategy introduces cache-awareness for workloads with context reuse patterns.</p>
<ul>
<li class=""><strong>Limitation:</strong> The estimations can become unreliable at high scale or with dynamic workloads, leading to suboptimal routing - as seen in our benchmarks.</li>
</ul>
</li>
<li class="">
<p><strong>4. Precise Prefix-Cache Aware Scheduling</strong>: the most effective strategy for production environments with tight SLOs and for dynamic, high-scale workloads where maximizing the cache-hit ratio is a primary performance driver.</p>
</li>
</ul></div></div>
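<p>As an illustration, the hierarchy above can be condensed into a toy scorer. The following is a minimal Python sketch with made-up pod fields and weights, not the llm-d scheduler API: level 1 ignores all state, level 2 penalizes load, and levels 3-4 add a prefix-hit term (precise and approximate scheduling differ only in how that hit rate is obtained).</p>

```python
import random

# Hypothetical pod state; field names and values are illustrative,
# not the llm-d or Inference Gateway API.
pods = [
    {"name": "pod-0", "queue_depth": 1, "kv_util": 0.40, "prefix_hit": 0.90},
    {"name": "pod-1", "queue_depth": 0, "kv_util": 0.10, "prefix_hit": 0.00},
    {"name": "pod-2", "queue_depth": 5, "kv_util": 0.85, "prefix_hit": 0.10},
]

def random_pick(pods):                      # Level 1: load- and cache-blind
    return random.choice(pods)["name"]

def load_score(p):                          # Level 2: prefer lightly loaded pods
    return -(p["queue_depth"] + p["kv_util"])

def combined_score(p, w_prefix=2.0, w_load=1.0):
    # Levels 3-4: add cache-awareness. Precise scheduling differs from
    # approximate only in how prefix_hit is computed (exact KV-block
    # index lookup vs. an estimate).
    return w_prefix * p["prefix_hit"] + w_load * load_score(p)

best = max(pods, key=combined_score)
print(best["name"])  # pod-0: its cache hit outweighs its modest load
```

<p>With these (arbitrary) weights, the cache-hit pod wins despite not being the least loaded - the trade-off that levels 3 and 4 are designed to make.</p>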
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="get-involved-with-llm-d"><strong>Get Involved with llm-d</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#get-involved-with-llm-d" class="hash-link" aria-label="Direct link to get-involved-with-llm-d" title="Direct link to get-involved-with-llm-d" translate="no">​</a></h2>
<p>The llm-d project thrives on community contributions, and there are many ways to get involved:</p>
<ul>
<li class="">Explore the llm-d Community Quickstart Guide → <a class="" href="https://llm-d.ai/docs/community">Start here</a> to learn more about getting involved in the llm-d project.</li>
<li class="">Join our Slack → <a class="" href="https://llm-d.ai/slack">Get your invite</a> and connect with maintainers and contributors</li>
<li class="">Explore the code → Browse our <a href="https://github.com/llm-d" target="_blank" rel="noopener noreferrer" class="">GitHub organization</a> and find issues that interest you</li>
<li class="">Attend meetings → All meetings are open! Add our <a class="" href="https://llm-d.ai/docs/community#public-meeting-calendar">public calendar</a> and join discussions</li>
</ul>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="appendix"><strong>Appendix</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#appendix" class="hash-link" aria-label="Direct link to appendix" title="Direct link to appendix" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="a1-benchmark-setup-details"><strong>A.1: Benchmark Setup Details</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#a1-benchmark-setup-details" class="hash-link" aria-label="Direct link to a1-benchmark-setup-details" title="Direct link to a1-benchmark-setup-details" translate="no">​</a></h3>
<ul>
<li class=""><strong>Model</strong>: <code>Qwen/Qwen-32B</code>.</li>
<li class=""><strong>Hardware</strong>: A cluster of <strong>8</strong> vLLM pods, each running on <strong>2 NVIDIA H100 GPUs</strong> (16 total).<!-- -->
<ul>
<li class="">Each instance holds a 307,328-token KV-cache.</li>
</ul>
</li>
<li class=""><strong>Schedulers Compared</strong>:<!-- -->
<ul>
<li class=""><strong><code>random-scheduling</code></strong>: A naive scheduler, acting as the control group.</li>
<li class=""><strong><code>load-scheduling</code></strong>: A scheduler aware only of load scorers: vLLM queue depth + KV-cache utilization.</li>
<li class=""><strong><code>approximate-scheduling</code></strong>: The baseline intelligent scheduler extending load-scheduling with the approximate prefix-cache scorer.</li>
<li class=""><strong><code>precise-scheduling</code></strong>: The advanced well-lit path described in this post.</li>
</ul>
</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="a2-workload-details---real-world-b2b-saas-scenario"><strong>A.2: Workload Details - Real-World B2B SaaS Scenario</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#a2-workload-details---real-world-b2b-saas-scenario" class="hash-link" aria-label="Direct link to a2-workload-details---real-world-b2b-saas-scenario" title="Direct link to a2-workload-details---real-world-b2b-saas-scenario" translate="no">​</a></h3>
<p>The benchmark was designed to simulate a high-value, multi-tenant B2B application under a heavy, sustained load. Imagine a platform providing a specialized AI assistant to a large number of enterprise customers.</p>
<ul>
<li class="">There are <strong>150 distinct enterprise customers</strong> (groups) using the platform simultaneously.</li>
<li class="">Each customer has a unique and substantial context of <strong>6,000 tokens</strong>. This can be thought of as their company's internal knowledge base or a detailed set of instructions, forming the valuable <strong>shared prefix</strong>.</li>
<li class="">For each customer, <strong>5 employees</strong> are concurrently interacting with the assistant, submitting their own unique questions of <strong>1,200 tokens</strong> each.</li>
<li class="">The system is stressed by a continuous Poisson arrival process of requests, ramping up from 3 QPS to a demanding <strong>60 QPS</strong> to simulate peak business hours.</li>
</ul>
<p>For this workload, in an ideal state, caching the shared prefixes for all active customers requires <strong>~73%</strong> of the cluster's total KV-cache capacity. This is nearly <strong>six times larger</strong> than any single pod's individual capacity (~12.5%). This makes it impossible for one replica to handle the load and forces the scheduler to intelligently distribute the prefixes across the entire cluster.</p>
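<p>The quoted figures can be reproduced with a quick back-of-the-envelope calculation, assuming the cached working set counts each customer's shared prefix plus the five concurrent 1,200-token questions:</p>

```python
# Illustrative arithmetic only; numbers are taken from the workload
# description above.
customers = 150
prefix_tokens = 6_000            # shared per-customer context
employees = 5
question_tokens = 1_200          # unique per-employee question

pods = 8
pod_kv_capacity = 307_328        # tokens of KV-cache per pod

# Tokens resident in cache when every customer's prefix and all
# concurrent questions are held simultaneously.
working_set = customers * (prefix_tokens + employees * question_tokens)
cluster_capacity = pods * pod_kv_capacity

print(f"working set: {working_set:,} tokens")                        # 1,800,000
print(f"fraction of cluster: {working_set / cluster_capacity:.0%}")  # 73%
print(f"vs. one pod: {working_set / pod_kv_capacity:.1f}x")          # 5.9x
```

<p>The working set is ~73% of the cluster's total KV-cache and nearly six times any single pod's capacity, matching the figures above.</p>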
<p>This benchmark, therefore, tests the scheduler's ability to efficiently manage the disaggregated KV-cache. In a real-world scenario, if the total cache demand were to exceed the cluster's capacity, an autoscaling system would be responsible for spinning up more replicas to maintain SLOs. Here, we focus on <strong>maximizing the performance of the existing hardware</strong> - a task where cache-blind configurations create massive queues and high latency.</p>
<p>The tools and specifics of the experiment are captured in this <a href="https://github.com/llm-d/llm-d-kv-cache/blob/main/benchmarking/73-capacity/README.md" target="_blank" rel="noopener noreferrer" class="">llm-d-kv-cache benchmarking report</a>.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="a3-indexing-scale-analysis"><strong>A.3: Indexing Scale Analysis</strong><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see#a3-indexing-scale-analysis" class="hash-link" aria-label="Direct link to a3-indexing-scale-analysis" title="Direct link to a3-indexing-scale-analysis" translate="no">​</a></h3>
<p>The overhead for this global bookkeeping is that of storing KV-block hashes, and not the massive KV tensors themselves. Let's consider this vLLM example: <a href="https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3.html#installing-vllm" target="_blank" rel="noopener noreferrer" class="">Running DeepSeek-R1 with FP8 on 8xH200</a>.</p>
<p>A full KV-cache for the <strong>DeepSeek R1</strong> model running with FP8 on <strong>8x</strong> <strong>NVIDIA H200 GPUs</strong>, with a total of <strong>45.7 GB * 8 = 365 GB VRAM</strong> dedicated to the KV-cache pool, would be composed of tens of thousands of individual memory blocks. Each block, representing <strong>128 tokens</strong>, consumes about <strong>8.6 MB</strong> of VRAM. However, the metadata needed to track each of these blocks in the global index is just a single 64-bit hash - <strong>8 bytes</strong>. The overhead of the data structure that organizes these hashes is trivial when implemented efficiently.</p>
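<p>The arithmetic behind these figures is easy to reproduce (illustrative numbers taken directly from the paragraph above):</p>

```python
# Back-of-the-envelope index-size estimate; illustrative arithmetic only.
kv_pool_bytes = 365 * 1024**3        # ~365 GB of KV-cache across 8x H200
block_tokens = 128
block_bytes = 8.6 * 1024**2          # ~8.6 MB of VRAM per 128-token block
hash_bytes = 8                       # one 64-bit hash per block in the index

num_blocks = kv_pool_bytes / block_bytes
index_bytes = num_blocks * hash_bytes

print(f"blocks tracked: {num_blocks:,.0f}")
print(f"index size: {index_bytes / 1024:,.0f} KB")   # ~340 KB
print(f"data-to-metadata ratio: {kv_pool_bytes / index_bytes:,.0f}:1")
```

<p>The index footprint works out to a few hundred kilobytes against hundreds of gigabytes of tracked VRAM - a ratio of over a million to one.</p>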
<p>This means that managing the entire <strong>365 GB</strong> cache pool requires only about <strong>339 KB</strong> of memory for the scheduler's index - a data-to-metadata ratio of over <strong>1,000,000-to-1</strong>. The index's memory footprint is several orders of magnitude smaller than the GPU VRAM it tracks, making this a highly efficient, low-overhead solution.</p>]]></content>
        <author>
            <name>Maroon Ayoub</name>
            <uri>https://www.linkedin.com/in/v-maroon/</uri>
        </author>
        <author>
            <name>Danny Harnik</name>
            <uri>https://www.linkedin.com/in/danny-harnik-19a95436/</uri>
        </author>
        <author>
            <name>Tyler Smith</name>
            <uri>https://www.linkedin.com/in/tyler-michael-smith-017b28102/</uri>
        </author>
        <author>
            <name>Kellen Swain</name>
            <uri>https://www.linkedin.com/in/kellen-swain/</uri>
        </author>
        <author>
            <name>Xining Wang</name>
        </author>
        <author>
            <name>Hang Yin</name>
        </author>
        <author>
            <name>Kay Yan</name>
            <uri>https://www.linkedin.com/in/yankay/</uri>
        </author>
        <category label="blog posts" term="blog posts"/>
        <category label="Updates" term="Updates"/>
        <category label="llm-d release news" term="llm-d release news"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Intelligent Inference Scheduling with llm-d]]></title>
        <id>https://llm-d.ai/blog/intelligent-inference-scheduling-with-llm-d</id>
        <link href="https://llm-d.ai/blog/intelligent-inference-scheduling-with-llm-d"/>
        <updated>2025-09-03T09:00:00.000Z</updated>
        <summary type="html"><![CDATA[Learn how llm-d's intelligent inference scheduling uses prefix-aware, load-balanced routing to maximize LLM throughput and minimize latency on Kubernetes.]]></summary>
        <content type="html"><![CDATA[<p>The llm-d project lays out clear, “well-lit” paths for anyone to adopt the leading inference optimizations within their existing deployment framework - Kubernetes. These are tested approaches designed to make complex deployments easier and more efficient. In this post, we explore the first of these paths: <strong>intelligent inference scheduling</strong>. Unlike basic round-robin load balancing, this method takes the unique demands of LLMs into account, leading to better performance across the board: higher throughput, lower latency, and efficient use of resources.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-intelligent-inference-is-needed-for-llm-inference">Why Intelligent Inference Is Needed for LLM Inference<a href="https://llm-d.ai/blog/intelligent-inference-scheduling-with-llm-d#why-intelligent-inference-is-needed-for-llm-inference" class="hash-link" aria-label="Direct link to Why Intelligent Inference Is Needed for LLM Inference" title="Direct link to Why Intelligent Inference Is Needed for LLM Inference" translate="no">​</a></h3>
<p>Deploying large language models (LLMs) on Kubernetes has become the norm, but LLM inference workloads behave very differently from standard microservices. Traditional patterns like uniform replicas paired with round-robin load balancing assume each request uses the same amount of resources and finishes in roughly the same time. In contrast, LLM requests can vary wildly in token count and compute needs, making simple load-spread strategies prone to bottlenecks and imbalanced traffic.</p>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/inference-scheduling/image01.webp" alt="Intelligent inference scheduling diagram" style="width:75%;height:auto"></div>
<p>LLM inference pipelines also consist of two distinct phases - a compute-bound prefill stage and a memory-bound decode stage - that have fundamentally different resource profiles. Without specialization, every replica must handle both phases, leading to wasted GPU cycles or memory bandwidth. At the same time, many LLM use cases involve multi-turn chats or agentic flows where cached prefix computations dramatically speed up response times if the request is routed back to the same instance.</p>
<p>On top of these challenges, LLM endpoints often serve a spectrum of quality-of-service needs. Interactive tasks like code completion demand millisecond-level latency, chat agents can tolerate a few seconds, and batch jobs might take minutes or more. Satisfying tight latency SLOs for expensive inference calls can be prohibitively costly if every pod is treated identically.<br>
<!-- -->To address these unique demands, an intelligent inference scheduler that understands both the shape of incoming requests and the real-time state of your cluster can boost throughput, slash tail latencies, and maximize GPU resource utilization.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="recap-inference-serving-in-kubernetes-the-gateway-api-and-inference-gateway-extension">Recap: Inference Serving in Kubernetes, the Gateway API and Inference Gateway Extension<a href="https://llm-d.ai/blog/intelligent-inference-scheduling-with-llm-d#recap-inference-serving-in-kubernetes-the-gateway-api-and-inference-gateway-extension" class="hash-link" aria-label="Direct link to Recap: Inference Serving in Kubernetes, the Gateway API and Inference Gateway Extension" title="Direct link to Recap: Inference Serving in Kubernetes, the Gateway API and Inference Gateway Extension" translate="no">​</a></h3>
<p>Kubernetes Services paired with Deployments and standard load balancing distribute traffic evenly across identical replicas. That model works well for stateless microservices with uniform, short-lived requests. But as we saw earlier, LLM inference calls vary wildly in compute intensity, benefit from stateful routing (e.g., prefix caches), and demand tight tail-latency control - none of which vanilla load balancing handles well.</p>
<p>The Gateway API modernizes Kubernetes networking by offering a CRD-based, L7 routing framework that replaces and extends traditional Ingress. It gives you fine-grained route definitions, pluggable data planes, and native compatibility with multi-cluster or cross-team routing policies. Yet on its own, the Gateway API lacks any notion of LLM inference serving based on inference-specific characteristics and metrics.</p>
<p>To bridge that gap, the Gateway API Inference Extension project introduces the Inference Gateway (IGW). IGW reuses Gateway API‘s core primitives but adds new CRDs - most notably <strong>InferencePool</strong> - to represent collections of model-serving pods. InferencePools can carry additional metadata such as base model, accelerator type, and runtime capabilities. Gateways then invoke a pluggable <strong>EndpointPicker (EPP)</strong> to perform “smart” load balancing, leveraging Envoy’s External Processing (ext-proc) to steer traffic to the right inference endpoint.</p>
<p>The default EPP in IGW follows a structured scheduling cycle for each incoming request:</p>
<ul>
<li class=""><strong>Endpoint discovery:</strong> Enumerate all InferencePool pods and gather their metadata (waiting queue state, loaded models, cache contents, etc.).</li>
<li class=""><strong>Filtering:</strong> Exclude pods that can’t serve the request due to overload, incompatible resources, or memory pressure.</li>
<li class=""><strong>Scoring:</strong> Assign each remaining pod a score via extensible scorers - evaluating factors like queue depth, session affinity, prefix cache hits, and custom SLO indicators.</li>
<li class=""><strong>Selection:</strong> Pick appropriate endpoints, with built-in tie-breaking and fallback logic.</li>
</ul>
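<p>The four-step cycle above can be sketched schematically. This is an illustrative Python toy, not the actual EPP implementation or IGW API; the pod fields, overload threshold, and scorer weights are all assumptions:</p>

```python
# Schematic sketch of the EPP scheduling cycle: discover -> filter -> score -> select.
def schedule(request, pods):
    # 1. Endpoint discovery: `pods` is the enumerated InferencePool state.
    # 2. Filtering: drop pods that cannot serve the request
    #    (wrong model, or overloaded past an assumed queue threshold).
    candidates = [p for p in pods
                  if request["model"] in p["models"] and p["queue_depth"] < 10]
    if not candidates:
        return None
    # 3. Scoring: each scorer contributes; weights are operator-configurable.
    def score(p):
        return (2.0 * p["prefix_hit"]      # prefix-cache scorer
                - 1.0 * p["queue_depth"]   # queue-depth scorer
                - 1.0 * p["kv_util"])      # KV-utilization scorer
    # 4. Selection: highest score wins; ties broken by iteration order.
    return max(candidates, key=score)["name"]

pods = [
    {"name": "pod-a", "models": {"llama-3"}, "queue_depth": 1,
     "kv_util": 0.3, "prefix_hit": 0.8},
    {"name": "pod-b", "models": {"llama-3"}, "queue_depth": 12,
     "kv_util": 0.9, "prefix_hit": 0.9},
]
print(schedule({"model": "llama-3"}, pods))  # pod-a (pod-b filtered on queue depth)
```

<p>Note how filtering removes the better cache match here: a pod past its overload threshold never reaches the scoring step at all.</p>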
<p>Building on IGW’s foundation, <strong>llm-d</strong> <strong>augments the EPP with more advanced scheduling capabilities</strong>. It introduces scorers that optimize for KV cache locality (boosting prefix-cache hit rates) and orchestrates multiple scheduling passes to disaggregate prefill and decode phases onto specialized pod variants. The result is a fully LLM-aware scheduler that drives higher throughput, lower tail latencies, and finer resource efficiency across the board.</p>
<div style="text-align:center;margin:20px 0"><img src="https://llm-d.ai/img/blogs/inference-scheduling/image02.webp" alt="Diagram" style="width:75%;height:auto"></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="intelligent-inference-scheduling-with-llm-d">Intelligent Inference Scheduling with llm-d<a href="https://llm-d.ai/blog/intelligent-inference-scheduling-with-llm-d#intelligent-inference-scheduling-with-llm-d" class="hash-link" aria-label="Direct link to Intelligent Inference Scheduling with llm-d" title="Direct link to Intelligent Inference Scheduling with llm-d" translate="no">​</a></h3>
<p>A key differentiator of llm-d is the ability to plug in configurable, AI-aware scorers into the inference gateway scheduling pipeline. These scorers go beyond generic load balancing by factoring in LLM-specific workload characteristics such as token count variability, compute/memory phase differences, and KV-cache locality - when deciding where each request should run.</p>
<p>LLM workloads are not uniform. Some use cases - like multi-turn conversations, RAG pipelines, or agentic flows - naturally lead to <strong>high prefix reuse</strong>, where requests repeatedly share large portions of the prompt. Others - like diverse batch inference jobs or single-shot completions - exhibit <strong>low prefix sharing</strong>, where cache hits are rare and every request is essentially unique.</p>
<p>Because of this diversity, llm-d’s pluggable, AI-aware scorers allow operators to tailor scheduling strategies to workload profiles. We evaluated two configurations:</p>
<ul>
<li class=""><strong>Prefix-only scorer</strong> – routes to maximize KV-cache hits.</li>
<li class=""><strong>Prefix + Load scorer</strong> – adds dynamic load-awareness while still exploiting cache opportunities.</li>
</ul>
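<p>A toy simulation illustrates the difference between the two configurations. Everything here - the scorer weights, pod fields, and tie-breaking - is an illustrative assumption rather than llm-d's implementation: with prefix scoring alone, every request sharing a hot prefix lands on the same replica, while adding a load penalty lets traffic spill over once that replica saturates.</p>

```python
# Toy comparison of prefix-only vs. prefix+load routing.
def route(request_prefix, pods, use_load=True):
    def score(p):
        hit = 1.0 if request_prefix in p["cached_prefixes"] else 0.0
        load_penalty = p["inflight"] if use_load else 0
        return 3.0 * hit - load_penalty
    best = max(pods, key=score)
    best["inflight"] += 1          # the request is dispatched there
    return best["name"]

def run(use_load):
    # Four pods; only pod-0 has tenant-A's prefix cached.
    pods = [{"name": f"pod-{i}",
             "cached_prefixes": {"tenant-A"} if i == 0 else set(),
             "inflight": 0} for i in range(4)]
    # Twelve requests, all sharing tenant-A's prefix.
    picks = [route("tenant-A", pods, use_load) for _ in range(12)]
    return picks.count("pod-0")

print(run(use_load=False))  # prefix-only: all 12 requests pile onto pod-0
print(run(use_load=True))   # prefix+load: spills over once pod-0's score drops
```

<p>This is the mechanism behind the benchmark results below: prefix affinity is valuable, but without a load term it concentrates traffic until the favored replica becomes the bottleneck.</p>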
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-ai-aware-scorers-win"><strong>Why AI-Aware Scorers Win</strong><a href="https://llm-d.ai/blog/intelligent-inference-scheduling-with-llm-d#why-ai-aware-scorers-win" class="hash-link" aria-label="Direct link to why-ai-aware-scorers-win" title="Direct link to why-ai-aware-scorers-win" translate="no">​</a></h4>
<p>The following benchmarks show how performance evolves as cache opportunities range from abundant to minimal, and they illustrate an important point: <strong>the optimal scheduling strategy depends on the workload profile</strong>.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="high-prefix-sharing-workload"><strong>High Prefix Sharing Workload</strong><a href="https://llm-d.ai/blog/intelligent-inference-scheduling-with-llm-d#high-prefix-sharing-workload" class="hash-link" aria-label="Direct link to high-prefix-sharing-workload" title="Direct link to high-prefix-sharing-workload" translate="no">​</a></h4>
<p>When cache locality is abundant, the results are dramatic:</p>
<ul>
<li class="">
<p><strong>Success rate:</strong> The prefix-only scorer frequently overloaded replicas, succeeding in only ~55% of requests, while Prefix + Load maintained 100% success across all QPS levels.</p>
</li>
<li class="">
<p><strong>Time to First Token (TTFT):</strong> Prefix + Load kept TTFT consistently near-zero, while Prefix-only degraded rapidly, exceeding 140s at high QPS.</p>
</li>
<li class="">
<p><strong>Inter-Token Latency (ITL):</strong> Prefix + Load achieved ITL of ~30ms, versus ~160ms with Prefix-only — more than 5× improvement in responsiveness.</p>
</li>
<li class="">
<p><strong>Throughput:</strong> Prefix + Load scaled linearly with QPS, reaching ~60k tokens/sec at 20 QPS. Prefix-only flatlined near 2k–3k tokens/sec.</p>
</li>
</ul>
<div style="margin:20px 0"><div style="margin-bottom:20px"><img src="https://llm-d.ai/img/blogs/inference-scheduling/image03.webp" alt="Throughput vs Request Rate" style="width:100%;height:auto"><p style="text-align:center;font-size:0.9em;margin-top:8px"><em>Throughput vs Request Rate</em></p></div><div style="display:grid;grid-template-columns:1fr 1fr 1fr;gap:15px;align-items:start"><div style="display:flex;flex-direction:column;justify-content:center;height:100%"><img src="https://llm-d.ai/img/blogs/inference-scheduling/image04.webp" alt="Success Rate" style="width:100%;height:auto"><p style="text-align:center;font-size:0.85em;margin-top:6px"><em>Success Rate</em></p></div><div style="display:flex;flex-direction:column;justify-content:center;height:100%"><img src="https://llm-d.ai/img/blogs/inference-scheduling/image05.webp" alt="TTFT and QPS" style="width:100%;height:auto"><p style="text-align:center;font-size:0.85em;margin-top:6px"><em>TTFT and QPS</em></p></div><div style="display:flex;flex-direction:column;justify-content:center;height:100%"><img src="https://llm-d.ai/img/blogs/inference-scheduling/image06.webp" alt="Intertoken Latency" style="width:100%;height:auto"><p style="text-align:center;font-size:0.85em;margin-top:6px"><em>Intertoken Latency</em></p></div></div></div>
<p>In workloads with heavy prefix reuse, prefix-aware scheduling combined with load-awareness is essential to avoid bottlenecks and maximize GPU utilization. By combining prefix scoring with load awareness, llm-d achieves <strong>100% request success, lower latencies, and linear throughput scaling</strong> — the essence of intelligent, AI-aware scheduling.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="low-prefix-sharing-workload"><strong>Low Prefix Sharing Workload</strong><a href="https://llm-d.ai/blog/intelligent-inference-scheduling-with-llm-d#low-prefix-sharing-workload" class="hash-link" aria-label="Direct link to low-prefix-sharing-workload" title="Direct link to low-prefix-sharing-workload" translate="no">​</a></h4>
<p>When cache hits are rare, prefix-awareness provides little benefit, and both scorers perform similarly:</p>
<p><strong>Throughput:</strong>  Both scorers perform <strong>nearly identically</strong>, scaling linearly with QPS. Output throughput reaches ~400 tokens/sec and total throughput ~60k tokens/sec at 20 QPS for both strategies.</p>
<p><strong>Latency:</strong></p>
<ul>
<li class="">
<p><strong>Time to First Token (TTFT):</strong> Both remain stable in the <strong>300–380 ms range</strong> as load increases. Small variations exist, but neither scorer shows a clear advantage.</p>
</li>
<li class="">
<p><strong>Normalized time per token:</strong> Flat around <strong>0.65 ms/token</strong>, with both scorers tightly overlapping across QPS levels.</p>
</li>
<li class="">
<p><strong>Inter-Token Latency (ITL):</strong> Increases linearly with load, from ~25 ms at 2 QPS to ~50 ms at 20 QPS — again, no significant gap between scorers.</p>
</li>
</ul>
<p><strong>Reliability:</strong> Both scorers achieve <strong>100% success rate</strong> across the full load range, confirming that load balancing alone is sufficient when prefix reuse is low.</p>
<p>Under low prefix sharing workloads, the benefits of prefix-aware routing naturally diminish. In this case, adding load-awareness or prefix-awareness makes little difference - both strategies scale smoothly and meet latency targets.</p>
<p><img decoding="async" loading="lazy" alt="Latency vs request rate" src="https://llm-d.ai/assets/images/image07-64242d3915269e30596065e232cf7d62.webp" width="1999" height="458" class="img_ev3q">
<img decoding="async" loading="lazy" alt="Throughput vs Request rate" src="https://llm-d.ai/assets/images/image08-5c33a7ebd58e8e1db2326fb77bf12aab.webp" width="1668" height="588" class="img_ev3q"></p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="takeaway"><strong>Takeaway</strong><a href="https://llm-d.ai/blog/intelligent-inference-scheduling-with-llm-d#takeaway" class="hash-link" aria-label="Direct link to takeaway" title="Direct link to takeaway" translate="no">​</a></h3>
<p>These benchmarks illustrate why <strong>configurable scorers matter in llm-d</strong>.</p>
<ul>
<li class="">
<p>In <strong>prefix-heavy workloads</strong>, Prefix + Load scoring ensures cache hits are exploited without overloading replicas — yielding linear throughput scaling, low latencies, and high success rates.</p>
</li>
<li class="">
<p>In <strong>prefix-light workloads</strong>, simple load balancing suffices, and the system avoids unnecessary complexity.</p>
</li>
</ul>
<p>This adaptability means operators can choose (or combine) scorers based on workload characteristics, achieving the best <strong>token-per-dollar efficiency</strong> while consistently meeting latency and throughput SLOs.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="looking-ahead-roadmap-and-future-plans">Looking Ahead: Roadmap and Future Plans<a href="https://llm-d.ai/blog/intelligent-inference-scheduling-with-llm-d#looking-ahead-roadmap-and-future-plans" class="hash-link" aria-label="Direct link to Looking Ahead: Roadmap and Future Plans" title="Direct link to Looking Ahead: Roadmap and Future Plans" translate="no">​</a></h3>
<p>The IGW and <code>llm-d</code> projects are evolving rapidly, with several exciting directions on the horizon:</p>
<ul>
<li class=""><strong>Dynamic Scheduling Goals</strong>: Support for runtime reconfiguration of scheduling strategies based on workload type, latency targets, or user-defined policies.</li>
<li class=""><strong>Multi-Model Awareness</strong>: Enhanced routing logic that accounts for model compatibility, adapter stacking, and ensemble inference. (next blog)</li>
<li class=""><strong>Plugin Ecosystem</strong>: A curated set of reusable plugins for common LLM use cases, contributed by the community. We’re considering supporting out-of-process plugins, written in any language, to allow researchers to experiment with new scheduling algorithms and ideas - let us know if you have an idea we can help enable!</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="closing-thoughts">Closing Thoughts<a href="https://llm-d.ai/blog/intelligent-inference-scheduling-with-llm-d#closing-thoughts" class="hash-link" aria-label="Direct link to Closing Thoughts" title="Direct link to Closing Thoughts" translate="no">​</a></h3>
<p>The journey of llm-d reflects a broader shift in how we think about LLM inference - not just as a stateless function call, but as a dynamic, resource-aware orchestration problem. By building on IGW and pushing its boundaries, llm-d offers a flexible, extensible foundation for intelligent scheduling at scale.<br>
<!-- -->Whether you're running a single model or a fleet of fine-tuned variants, the goal is the same: <strong>maximize performance, minimize latency, and make smarter use of available compute</strong>.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="get-involved-with-llm-d">Get Involved with llm-d<a href="https://llm-d.ai/blog/intelligent-inference-scheduling-with-llm-d#get-involved-with-llm-d" class="hash-link" aria-label="Direct link to Get Involved with llm-d" title="Direct link to Get Involved with llm-d" translate="no">​</a></h3>
<p>The llm-d project thrives on community contributions, and there are many ways to get involved:</p>
<ul>
<li class=""><strong>Explore the llm-d Community Quickstart Guide</strong> → <a href="https://llm-d.ai/docs/community" target="_blank" rel="noopener noreferrer" class="">Start here</a> to learn more about getting involved in the llm-d project.</li>
<li class=""><strong>Join our Slack</strong> → <a href="https://llm-d.ai/slack" target="_blank" rel="noopener noreferrer" class="">Get your invite</a> and connect with maintainers and contributors</li>
<li class=""><strong>Explore the code</strong> → Browse our <a href="https://github.com/llm-d" target="_blank" rel="noopener noreferrer" class="">GitHub organization</a> and find issues that interest you</li>
<li class=""><strong>Attend meetings</strong> → All meetings are open! Add our <a href="https://llm-d.ai/docs/community#public-meeting-calendar" target="_blank" rel="noopener noreferrer" class="">public calendar</a> and join discussions</li>
</ul>]]></content>
        <author>
            <name>Nili Guy</name>
            <uri>https://www.linkedin.com/in/nilig/</uri>
        </author>
        <author>
            <name>Vita Bortnikov</name>
            <uri>https://www.linkedin.com/in/vita-bortnikov/</uri>
        </author>
        <author>
            <name>Etai Lev Ran</name>
            <uri>https://www.linkedin.com/in/elevran/</uri>
        </author>
        <author>
            <name>Robert Shaw</name>
            <email>robshaw@redhat.com</email>
            <uri>https://github.com/robertgshaw2-redhat</uri>
        </author>
        <author>
            <name>Clayton Coleman</name>
            <email>claytoncoleman@google.com</email>
            <uri>https://github.com/smarterclayton</uri>
        </author>
        <category label="blog posts" term="blog posts"/>
        <category label="Updates" term="Updates"/>
        <category label="llm-d release news" term="llm-d release news"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[llm-d 0.2: Our first well-lit paths (mind the tree roots!)]]></title>
        <id>https://llm-d.ai/blog/llm-d-v0.2-our-first-well-lit-paths</id>
        <link href="https://llm-d.ai/blog/llm-d-v0.2-our-first-well-lit-paths"/>
        <updated>2025-07-29T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[llm-d v0.2 introduces well-lit paths for Kubernetes LLM deployment: intelligent scheduling, P/D disaggregation, and MoE support with vLLM optimizations.]]></summary>
        <content type="html"><![CDATA[<p>Our <a href="https://github.com/llm-d/llm-d/releases/tag/v0.2.0" target="_blank" rel="noopener noreferrer" class="">0.2 release</a> delivers progress against our three well-lit paths to accelerate deploying large scale inference on Kubernetes - better load balancing, lower latency with disaggregation, and native vLLM support for very large Mixture of Expert models like DeepSeek-R1.</p>
<p>We’ve also enhanced our deployment and benchmarking tooling, incorporating lessons from real-world infrastructure deployments and addressing key antipatterns. This release gives llm-d users, contributors, researchers, and operators clearer guides for efficient use in tested, reproducible scenarios.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="new-paths-to-production">New Paths to Production<a href="https://llm-d.ai/blog/llm-d-v0.2-our-first-well-lit-paths#new-paths-to-production" class="hash-link" aria-label="Direct link to New Paths to Production" title="Direct link to New Paths to Production" translate="no">​</a></h2>
<p>In this release we've focused on delivering a clear and reproducible set of scenarios that teams can depend on and tested end-to-end on real hardware and models.</p>
<p>Our deployments have been tested and benchmarked on recent GPUs, such as H200 nodes, with models like Llama-3, Llama-4, and DeepSeek-R1. We provide deployment guidance and performance analysis to help teams understand when P/D separation is most beneficial and where tradeoffs emerge.</p>
<p>We’ve defined and improved three well-lit paths that form the foundation of this release:</p>
<ul>
<li class=""><a href="https://github.com/llm-d/llm-d/tree/main/guides/inference-scheduling" target="_blank" rel="noopener noreferrer" class=""><strong>Intelligent inference scheduling over any vLLM deployment</strong></a>: support for precise prefix-cache aware routing with no additional infrastructure, out-of-the-box load-aware scheduling for better tail latency that “just works”, and a new configurable scheduling profile system that lets teams see immediate latency wins while still customizing scheduling behavior for their workloads and infrastructure.</li>
<li class=""><a href="https://github.com/llm-d/llm-d/tree/main/guides/pd-disaggregation" target="_blank" rel="noopener noreferrer" class=""><strong>P/D disaggregation</strong>:</a> support for separating prefill and decode workloads to improve latency and GPU utilization for long-context scenarios.</li>
<li class=""><a href="https://github.com/llm-d/llm-d/tree/main/guides/wide-ep-lws" target="_blank" rel="noopener noreferrer" class=""><strong>Wide expert parallelism for DeepSeek R1 (EP/DP)</strong></a>: support for large-scale multi-node deployments using expert and data parallelism patterns for MoE models. This includes optimized deployments leveraging NIXL+UCX for inter-node communication, with fixes and improvements to reduce latency, and demonstrates the use of LeaderWorkerSet for Kubernetes-native inference orchestration.</li>
</ul>
<p>All of these scenarios are reproducible: we provide reference hardware specs, workloads, and benchmarking harness support, so others can evaluate, reproduce, and extend these benchmarks easily. This also reflects improvements to our deployment tooling and benchmarking framework, a new "machinery" that allows users to set up, test, and analyze these scenarios consistently.</p>
<p>While this is our first release and there are still some rough edges, our goal is to continue to refine and widen these paths to speed adoption. Please <a href="https://github.com/llm-d/llm-d/issues/146" target="_blank" rel="noopener noreferrer" class="">provide feedback</a> about where to go next!</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="key-enabling-changes-and-technical-milestones">Key enabling changes and technical milestones<a href="https://llm-d.ai/blog/llm-d-v0.2-our-first-well-lit-paths#key-enabling-changes-and-technical-milestones" class="hash-link" aria-label="Direct link to Key enabling changes and technical milestones" title="Direct link to Key enabling changes and technical milestones" translate="no">​</a></h2>
<p>llm-d 0.2 builds on the progress of our Special Interest Groups (SIGs), delivering the following key features:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="modular-deployer-refactor">Modular deployer refactor<a href="https://llm-d.ai/blog/llm-d-v0.2-our-first-well-lit-paths#modular-deployer-refactor" class="hash-link" aria-label="Direct link to Modular deployer refactor" title="Direct link to Modular deployer refactor" translate="no">​</a></h3>
<p>We've refactored the deployer into a Helm-first, modular structure, splitting charts for infrastructure, model services, and the inference gateway. These charts are now central to our documentation, complete with clear prerequisites for Kubernetes versions, networking, and GPU hardware. This refactor not only makes it easier for both first-time and advanced users to deploy llm-d but also positions us to work directly with end-user environments, where modularity and flexibility are essential in production.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="pd-disaggregation-and-dpep-for-moe-deployments">P/D disaggregation and DP/EP for MoE deployments<a href="https://llm-d.ai/blog/llm-d-v0.2-our-first-well-lit-paths#pd-disaggregation-and-dpep-for-moe-deployments" class="hash-link" aria-label="Direct link to P/D disaggregation and DP/EP for MoE deployments" title="Direct link to P/D disaggregation and DP/EP for MoE deployments" translate="no">​</a></h3>
<p>The path for Prefill/Decode (P/D) disaggregation and multi-node DP/EP MoE deployments is now more clearly defined and tested. This work integrates and optimizes key <a href="https://github.com/vllm-project/vllm/releases/tag/v0.10.0" target="_blank" rel="noopener noreferrer" class="">vLLM 0.10.0</a> kernel improvements, including DeepGEMM and CUTLASS for expert-parallel compute, PPLX and DeepEP kernels, and intra- and inter-node communication fixes and optimizations for multi-node scenarios. We now include:</p>
<ul>
<li class="">Kubernetes-native deployment recipes now support API servers per DP rank for one-pod-per-rank placement, enhancing scalability and control</li>
<li class="">Helm charts are updated to support LeaderWorkerSet (LWS) for multi-node setups and direct one-pod-per-DP-rank deployments</li>
<li class="">Optimized intra-node communication by enabling DeepEP to use cuda_ipc efficiently</li>
<li class="">Enhanced NIXL+UCX performance, with fixes and optimizations that significantly reduce inter-node communication overhead, particularly for long context workloads</li>
</ul>
<p>These validated scenarios are backed by benchmark baselines and example deployments via our quickstarts, offering clearer guidance on what works well today. As part of the "well-lit path," we have also identified limitations, including known edge cases around response sizes and failure modes, where more work is required.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="inference-scheduler-extensibility">Inference scheduler extensibility<a href="https://llm-d.ai/blog/llm-d-v0.2-our-first-well-lit-paths#inference-scheduler-extensibility" class="hash-link" aria-label="Direct link to Inference scheduler extensibility" title="Direct link to Inference scheduler extensibility" translate="no">​</a></h3>
<p>The llm-d-inference-scheduler is now far more extensible and aligned with the latest upstream Inference Gateway code base. It’s fully configurable and supports flexible, label selector–based filtering to enable a variety of model server topologies, including LWS-based deployments. We’ve improved the user experience for prefix-aware scheduling by allowing a simple configuration switch between the current estimated prefix tracking at the gateway and a new precise prefix cache–based tracking that reads KV events directly from vLLM for better hit rates.</p>
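<p>To illustrate the idea behind prefix cache–based routing, here is a toy sketch (not the llm-d-inference-scheduler API; the pod names and block size are invented for the example): the gateway scores each candidate pod by how many leading blocks of the prompt are already present in that pod's KV-cache index, and routes to the best match.</p>

```python
# Illustrative sketch of prefix-cache-aware routing. BLOCK, pod names, and the
# cache index structure are assumptions for the example, not llm-d internals.
BLOCK = 16  # tokens per KV-cache block (assumed value)

def cached_prefix_blocks(prompt_tokens, pod_cached_blocks):
    """Count leading prompt blocks already present in a pod's cache index."""
    hits = 0
    for i in range(0, len(prompt_tokens), BLOCK):
        block = tuple(prompt_tokens[i:i + BLOCK])
        if block in pod_cached_blocks:
            hits += 1
        else:
            break  # a prefix match must be contiguous from the start
    return hits

def pick_pod(prompt_tokens, cache_index):
    """Route to the pod with the longest cached prefix (ties: first wins)."""
    return max(cache_index,
               key=lambda pod: cached_prefix_blocks(prompt_tokens, cache_index[pod]))

prompt = list(range(48))  # 3 blocks of 16 tokens
cache_index = {
    "pod-a": {tuple(range(16))},                       # 1 block cached
    "pod-b": {tuple(range(16)), tuple(range(16, 32))}, # 2 contiguous blocks cached
}
print(pick_pod(prompt, cache_index))  # pod-b
```

<p>The difference between "estimated" and "precise" tracking is then simply how <code>cache_index</code> is populated: approximated at the gateway, or rebuilt from KV events streamed directly from vLLM.</p>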
<p>Our Helm charts now support deployment of scheduler configurations out of the box, making it easier for researchers and operators to iterate on custom scheduling and routing strategies without modifying core components. Internally, testing and development workflows have been updated for improved velocity and quality and this release includes numerous bug fixes.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="improved-benchmarking-suite">Improved benchmarking suite<a href="https://llm-d.ai/blog/llm-d-v0.2-our-first-well-lit-paths#improved-benchmarking-suite" class="hash-link" aria-label="Direct link to Improved benchmarking suite" title="Direct link to Improved benchmarking suite" translate="no">​</a></h3>
<p>Our benchmarking suite has matured significantly. It now supports testing any pre-deployed llm-d workload, accommodates multiple load generators, and includes automated analysis and plot generation for easier performance data interpretation.</p>
<p>For this release, we ran sweeps to characterize throughput and scaling and to demonstrate the benefits of P/D disaggregation for long context workloads. Scenarios cover representative workload shapes (input/output ratios of 10:1 and 100:1) and explore various parallelism schemes and P/D disaggregation ratios. For each setup, we measured throughput scaling (tokens per second per user and tokens per second per GPU) across increasing concurrency levels. These results provide a direct comparison with and without P/D separation (load-aware scheduling only), highlighting where llm-d’s optimizations deliver significant benefits.</p>
<p><img decoding="async" loading="lazy" alt="v0.2-llama-scout-1" src="https://llm-d.ai/assets/images/v0.2-llama-scout-1-7a33d8bad0b3b42dce3615602bf8a45a.webp" width="1566" height="1168" class="img_ev3q"></p>
<p>Fig. 1: Pareto curve for Llama-Scout on dual 8×H200 IB nodes, comparing monolithic (4tp4) and P/D-disaggregated (4ptp2–2dtp4) topologies.</p>
<p>The chart above shows the standard Pareto curve for Llama-Scout on two 8×H200 nodes with InfiniBand networking, comparing the monolithic 4tp4 topology to a disaggregated 4ptp2-2dtp4 configuration (maintaining the same total GPU count). The X-axis measures the latency observed by each user and the Y-axis measures the total throughput per GPU. Each point on the chart represents a particular concurrency level.</p>
<p>While both configurations perform similarly at very low or very high user output speeds, the disaggregated setup delivers significantly higher per-GPU throughput at medium concurrency levels (particularly around 64-128 concurrent requests), where contention between prefill and decode phases tends to dominate. This validates that disaggregation doesn’t just increase throughput; it exposes saturation points and unlocks headroom that is otherwise lost to phase interference. These insights are critical to autoscaling, role assignment, and future predictor-driven and SLO-aware scheduling.</p>
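<p>The Pareto framing above can be made concrete with a few lines of code. This is an illustrative sketch with made-up numbers, not our benchmarking harness: each sweep point pairs the per-user latency observed at one concurrency level (lower is better) with the per-GPU throughput (higher is better), and the frontier keeps only the points that no other point dominates on both axes.</p>

```python
# Illustrative Pareto-frontier computation over benchmark sweep points.
# The numbers below are invented for the example.

def pareto_front(points):
    """Keep points not dominated by another (lower latency AND higher throughput)."""
    return [
        p for p in points
        if not any(q[0] <= p[0] and q[1] >= p[1] and q != p for q in points)
    ]

# (per-user latency, tokens/sec/GPU) at increasing concurrency levels
sweep = [(20, 50), (35, 120), (60, 180), (90, 190), (150, 185)]
print(sorted(pareto_front(sweep)))
# (150, 185) is dropped: (90, 190) is both faster for users and higher-throughput
```

<p>On a chart like Fig. 1, the saturation point is visible where the frontier bends: past it, extra concurrency buys little throughput while per-user latency keeps growing.</p>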
<p>These results follow a broader trend previously reported [<a href="https://developer.nvidia.com/blog/how-nvidia-gb200-nvl72-and-nvidia-dynamo-boost-inference-performance-for-moe-models/" target="_blank" rel="noopener noreferrer" class="">1</a>, <a href="https://arxiv.org/abs/2506.05508" target="_blank" rel="noopener noreferrer" class="">2</a>]: disaggregated serving consistently provides the greatest benefit under medium‑concurrency conditions, particularly for prefill‑heavy traffic and larger models. Our results confirm this trend, showing improved throughput and clearer saturation dynamics at mid‑throughput, strongly validating our architectural direction. By decoupling prefill and decode phases, we not only boost raw performance but also surface scaling limits that static, monolithic systems obscure. This lays the foundation for dynamic topology adaptation, predictor‑informed routing, and autoscaling strategies driven by the real‑time behavior of workloads. These are key priorities for our upcoming releases.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="image-improvements">Image improvements<a href="https://llm-d.ai/blog/llm-d-v0.2-our-first-well-lit-paths#image-improvements" class="hash-link" aria-label="Direct link to Image improvements" title="Direct link to Image improvements" translate="no">​</a></h3>
<p>Multi-arch support, smaller images, and hardened configurations ensure a reliable out-of-the-box experience.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-weve-learned-and-sharing-with-the-community">What we've learned and sharing with the community<a href="https://llm-d.ai/blog/llm-d-v0.2-our-first-well-lit-paths#what-weve-learned-and-sharing-with-the-community" class="hash-link" aria-label="Direct link to What we've learned and sharing with the community" title="Direct link to What we've learned and sharing with the community" translate="no">​</a></h2>
<p>Here are some key lessons we've learned so far in our progress with llm-d:</p>
<ul>
<li class=""><strong>Low-hanging fruit matters.</strong> Targeted optimizations, like reducing KV‑cache transfer overhead between prefill and decode workers and refining prefix‑aware scheduling, delivered significant gains in throughput and tail latency. These quick wins required minimal change but paved the way for the deeper architectural improvements planned in upcoming releases.</li>
<li class=""><strong>Using bleeding-edge libraries is hard.</strong> Many key libraries associated with distributed inference are immature. Through our applied experiments in our well-lit paths and in close collaboration with ecosystem partners, we have improved much of the key infrastructure the larger community relies on in real-world conditions.</li>
<li class=""><strong>Build on proven paths.</strong> This validates why llm-d exists: to help users avoid discovering these problems themselves, offering reproducible deployments, performance baselines, and extensibility. llm-d focuses on building these paths so our users don’t need to troubleshoot these complex challenges in isolation.</li>
<li class=""><strong>Community matters.</strong> Working closely with the NVIDIA Dynamo community, we've tackled NIXL/UCX performance overheads for long context workloads, leading to significant improvements and active upstream contributions.</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="our-survey">Our survey<a href="https://llm-d.ai/blog/llm-d-v0.2-our-first-well-lit-paths#our-survey" class="hash-link" aria-label="Direct link to Our survey" title="Direct link to Our survey" translate="no">​</a></h3>
<p>In our first <a href="https://llm-d.ai/blog/llm-d-community-update-june-2025" target="_blank" rel="noopener noreferrer" class="">community survey</a>, we invited users to share their deployment requirements and challenges to help shape the future of the llm-d project and better understand how teams are serving LLMs today. We heard from a broad and diverse group of platform engineers, business leaders, and data practitioners, reflecting a wide range of workloads, architectures, and operational maturity.</p>
<p>Conversational AI (82.9%) and real-time applications (56.1%) stood out as the most common workloads, with nearly half of respondents supporting 4-10 models concurrently. Hardware choices highlight a diverse landscape: 85% use NVIDIA GPUs, 29% run AMD GPUs, and 27% deploy on CPUs alone, a surprising signal. Model preferences show Llama (73%), Qwen (63%), and Mistral (56%) leading the way. Yet despite this activity, SLO maturity is still emerging: 46% report no formal SLOs, and 39% are still defining theirs, suggesting many teams are in exploratory or early production phases. What came through clearly is strong demand for operational features like routing, caching, observability, and flexibility, signaling that ease of use and manageability are top priorities as teams scale. See a full summary of <a href="https://docs.google.com/document/d/1vwcIsCxrql0yObBEf-ziEThHIWAw8cY0FG68iGt3sj8" target="_blank" rel="noopener noreferrer" class="">the survey responses here</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-you-get-today">What you get today<a href="https://llm-d.ai/blog/llm-d-v0.2-our-first-well-lit-paths#what-you-get-today" class="hash-link" aria-label="Direct link to What you get today" title="Direct link to What you get today" translate="no">​</a></h2>
<p>Today, <a href="https://github.com/llm-d/llm-d/releases/tag/v0.2.0" target="_blank" rel="noopener noreferrer" class="">llm-d 0.2</a> offers:</p>
<ul>
<li class="">Modular Helm charts and clear deployment workflows.</li>
<li class="">Verified support for P/D, DP/EP, pod-per-rank, and heterogeneous GPUs (H200, B200).</li>
<li class="">Reproducible performance baselines, now with MoE support.</li>
<li class="">New foundations for routing and scheduler extensibility.</li>
<li class="">A developer- and researcher-friendly platform with tested examples, and detailed guides on the way.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="a-growing-community">A growing community<a href="https://llm-d.ai/blog/llm-d-v0.2-our-first-well-lit-paths#a-growing-community" class="hash-link" aria-label="Direct link to A growing community" title="Direct link to A growing community" translate="no">​</a></h2>
<p>The best part of llm-d has been watching the community grow around it. We're thrilled that over 700 people have already joined our Slack channel, and the project has been starred over 1,400 times on GitHub. This isn't just about numbers; it’s about the active, collaborative spirit that drives the project forward.</p>
<p>Much of the work happens within our seven Special Interest Groups (SIGs), each focused on a key area:</p>
<ul>
<li class=""><strong>Inference Scheduler</strong> – Developing smarter routing and load‑balancing strategies, including KV‑cache‑aware scheduling.</li>
<li class=""><strong>P/D Disaggregation</strong> – Advancing phase‑separation strategies to improve resource‑utilization efficiency.</li>
<li class=""><strong>KV Disaggregation</strong> – Advancing and optimizing distributed KV‑cache management.</li>
<li class=""><strong>Installation</strong> – Streamlining deployment on Kubernetes, from single‑node setups to large multi‑node clusters.</li>
<li class=""><strong>Benchmarking</strong> – Building tools to automate performance validation and make scenarios easier to reproduce and extend.</li>
<li class=""><strong>Autoscaling</strong> – Adapting resources dynamically based on workload demands.</li>
<li class=""><strong>Observability</strong> – Providing deep visibility into system performance and health.</li>
</ul>
<p>We're also collaborating with other great open-source communities like vLLM, Dynamo, and LMCache. Every one of these groups is open, and we’d love for you to join in. Whether you want to contribute code, share ideas, or just listen in, you are welcome. You can find details for each SIG, including their leaders and meeting times, on <a href="https://llm-d.ai/docs/community/sigs" target="_blank" rel="noopener noreferrer" class="">our community page</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="whats-next">What's next:<a href="https://llm-d.ai/blog/llm-d-v0.2-our-first-well-lit-paths#whats-next" class="hash-link" aria-label="Direct link to What's next:" title="Direct link to What's next:" translate="no">​</a></h2>
<p>Looking ahead, our community is focusing on these key areas:</p>
<ul>
<li class=""><strong>Core optimizations</strong>
<ul>
<li class="">TCP-based request dispatch upstream</li>
<li class="">Disaggregation protocol refinements, including possible sidecar removal</li>
<li class="">CPU cache offloading to expand memory capacity</li>
<li class="">KV event awareness baked directly into routing decisions</li>
<li class="">SLO-driven scheduling architecture for predictable performance</li>
</ul>
</li>
<li class=""><strong>Benchmarking enhancements:</strong>
<ul>
<li class="">Expanded reproducibility guides</li>
<li class="">Complete performance validation for core scenarios</li>
</ul>
</li>
<li class=""><strong>Developer experience improvements:</strong>
<ul>
<li class="">Expanded examples for inference gateway and scheduler extensibility</li>
<li class="">Central Helm charts and expanded documentation</li>
</ul>
</li>
</ul>
<p>See our <a href="https://github.com/llm-d/llm-d/issues/146" target="_blank" rel="noopener noreferrer" class="">roadmap issue</a> for what's coming next, and make your voice heard!</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="join-the-llm-d-community"><strong>Join the llm-d community!</strong><a href="https://llm-d.ai/blog/llm-d-v0.2-our-first-well-lit-paths#join-the-llm-d-community" class="hash-link" aria-label="Direct link to join-the-llm-d-community" title="Direct link to join-the-llm-d-community" translate="no">​</a></h2>
<p>We welcome researchers exploring scheduling, autoscaling, and routing optimization challenges. Your contributions are invaluable!</p>
<p>Community engagement is key to our success:</p>
<ul>
<li class=""><a href="https://red.ht/llm-d-public-calendar" target="_blank" rel="noopener noreferrer" class=""><strong>Join our community calls</strong></a> (Wed 12:30pm ET)</li>
</ul>
<p>Contribute on <a href="https://github.com/llm-d" target="_blank" rel="noopener noreferrer" class="">GitHub</a>, join the SIGs, and build with us!</p>]]></content>
        <author>
            <name>Robert Shaw</name>
            <email>robshaw@redhat.com</email>
            <uri>https://github.com/robertgshaw2-redhat</uri>
        </author>
        <author>
            <name>Clayton Coleman</name>
            <email>claytoncoleman@google.com</email>
            <uri>https://github.com/smarterclayton</uri>
        </author>
        <author>
            <name>Carlos Costa</name>
            <email>chcost@us.ibm.com</email>
            <uri>https://github.com/chcost</uri>
        </author>
        <category label="Releases" term="Releases"/>
        <category label="Announcements" term="Announcements"/>
        <category label="llm-d release news" term="llm-d release news"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[llm-d Community Update - June 2025]]></title>
        <id>https://llm-d.ai/blog/llm-d-community-update-june-2025</id>
        <link href="https://llm-d.ai/blog/llm-d-community-update-june-2025"/>
        <updated>2025-06-25T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Help shape llm-d's future: Take our 5-minute community survey, subscribe to our YouTube channel, and access exclusive resources for LLM serving innovation.]]></summary>
        <content type="html"><![CDATA[<p>Hey everyone! We've been making great progress with the llm-d project, and I wanted to share some important updates and opportunities to get involved.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="help-shape-the-future-of-the-llm-d-project">Help Shape the Future of the llm-d Project<a href="https://llm-d.ai/blog/llm-d-community-update-june-2025#help-shape-the-future-of-the-llm-d-project" class="hash-link" aria-label="Direct link to Help Shape the Future of the llm-d Project" title="Direct link to Help Shape the Future of the llm-d Project" translate="no">​</a></h2>
<p>To guide the future development of the llm-d project, we need to understand the real-world challenges, configurations, and performance needs of our community. We've created a short survey to gather insight into how you serve Large Language Models, from the hardware you use to the features you need most.</p>
<p>This <strong>anonymous</strong>, vendor-agnostic survey will take approximately 5 minutes to complete. Your input will directly influence the project's roadmap and priorities. The aggregated results will be shared with the <a href="https://groups.google.com/g/llm-d-contributors" target="_blank" rel="noopener noreferrer" class="">llm-d-contributors</a> mailing list to benefit the entire community.</p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>Your Input Will Define Our Roadmap</div><div class="admonitionContent_BuS1"><p>We've created an <a href="https://docs.google.com/forms/d/e/1FAIpQLScENfBNrN1q8XpugCmDUjy8sSn1cMzi_V0mCWTRwzixcPAQkw/viewform" target="_blank" rel="noopener noreferrer" class="">llm-d Community Roadmap Survey</a> to gather information about your LLM workloads. We are looking to learn more about:</p><ul>
<li class=""><strong>Your Serving Environment:</strong> This includes the hardware you use now and anticipate using in a year (like NVIDIA GPUs, AMD GPUs, or CPUs), and whether you run on-premise, in the cloud, or on edge devices.</li>
<li class=""><strong>Your Model Strategy:</strong> Do you serve a few large models or many smaller ones, which model families (like Llama or Mistral) are most common, and how you utilize techniques like LoRA adapters.</li>
<li class=""><strong>Your Performance Requirements:</strong> Your real-world SLOs for latency and throughput and the biggest LLM serving challenges you face—from cost optimization to operational ease of use.</li>
<li class=""><strong>Your Future Needs:</strong> What single new feature you would prioritize for an LLM Model-as-a-Service to help guide our innovation.</li>
</ul></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="take-the-5-minute-survey"><a href="https://docs.google.com/forms/d/e/1FAIpQLScENfBNrN1q8XpugCmDUjy8sSn1cMzi_V0mCWTRwzixcPAQkw/viewform" target="_blank" rel="noopener noreferrer" class="">Take the 5-Minute Survey</a><a href="https://llm-d.ai/blog/llm-d-community-update-june-2025#take-the-5-minute-survey" class="hash-link" aria-label="Direct link to take-the-5-minute-survey" title="Direct link to take-the-5-minute-survey" translate="no">​</a></h3>
<p><strong>Your participation is invaluable.</strong> Please take a few minutes to complete the survey. We encourage you to share it with other users or proxy their needs in your response to ensure our direction reflects the community's diverse requirements.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="new-youtube-channel">New YouTube Channel<a href="https://llm-d.ai/blog/llm-d-community-update-june-2025#new-youtube-channel" class="hash-link" aria-label="Direct link to New YouTube Channel" title="Direct link to New YouTube Channel" translate="no">​</a></h2>
<p>We're excited to announce our new YouTube channel! We've been recording our SIG meetings and creating tutorial content to help you get started with llm-d.</p>
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>Subscribe to our YouTube</div><div class="admonitionContent_BuS1"><p>Subscribe to get notified when new videos are posted, presentations, demos, and SIG meeting recordings: <a href="https://youtube.com/@llm-d-project" target="_blank" rel="noopener noreferrer" class="">@llm-d-project</a></p></div></div>
<div style="position:relative;padding-bottom:56.25%;height:0;overflow:hidden;max-width:100%"><iframe style="position:absolute;top:0;left:0;width:100%;height:100%" src="https://www.youtube.com/embed/playlist?list=PLU-3MgIuzv8oIRX7o2NLrxi5-Jzx9PoLr" title="llm-d introduction" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="reminder-join-our-google-group-to-access-project-docs">Reminder: Join Our Google Group to Access Project Docs<a href="https://llm-d.ai/blog/llm-d-community-update-june-2025#reminder-join-our-google-group-to-access-project-docs" class="hash-link" aria-label="Direct link to Reminder: Join Our Google Group to Access Project Docs" title="Direct link to Reminder: Join Our Google Group to Access Project Docs" translate="no">​</a></h2>
<div class="theme-admonition theme-admonition-warning admonition_xJq3 alert alert--warning"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 16 16"><path fill-rule="evenodd" d="M8.893 1.5c-.183-.31-.52-.5-.887-.5s-.703.19-.886.5L.138 13.499a.98.98 0 0 0 0 1.001c.193.31.53.501.886.501h13.964c.367 0 .704-.19.877-.5a1.03 1.03 0 0 0 .01-1.002L8.893 1.5zm.133 11.497H6.987v-2.003h2.039v2.003zm0-3.004H6.987V5.987h2.039v4.006z"></path></svg></span>Join our Google Group</div><div class="admonitionContent_BuS1"><p>We use Google Groups to share architecture diagrams, SIG meeting notes, and other important project content. To get full access to all project resources, please join: <a href="https://groups.google.com/g/llm-d-contributors" target="_blank" rel="noopener noreferrer" class="">llm-d-contributors Google Group</a></p></div></div>
<p>The Google Group is where all the magic happens! Here you'll find:</p>
<ul>
<li class="">Detailed architecture diagrams and design documents</li>
<li class="">SIG meeting notes and recordings</li>
<li class="">Early access to new features and proposals</li>
<li class="">Direct communication with the core team and other contributors</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="get-involved">Get Involved<a href="https://llm-d.ai/blog/llm-d-community-update-june-2025#get-involved" class="hash-link" aria-label="Direct link to Get Involved" title="Direct link to Get Involved" translate="no">​</a></h2>
<p>There are many ways to contribute to llm-d:</p>
<ol>
<li class="">Join the <a href="https://groups.google.com/g/llm-d-contributors" target="_blank" rel="noopener noreferrer" class="">Google Group</a> for full access to project resources</li>
<li class="">Review the <a href="https://red.ht/llm-d-public-calendar" target="_blank" rel="noopener noreferrer" class="">project public calendar</a> and join an upcoming community meeting</li>
<li class="">Subscribe to our <a href="https://youtube.com/@llm-d-project" target="_blank" rel="noopener noreferrer" class="">YouTube channel</a> for tutorials and meeting recordings</li>
<li class="">Fill out the <a href="https://docs.google.com/forms/d/e/1FAIpQLScENfBNrN1q8XpugCmDUjy8sSn1cMzi_V0mCWTRwzixcPAQkw/viewform" target="_blank" rel="noopener noreferrer" class="">Use Case Survey</a> to help shape our benchmarks</li>
<li class="">Join our <a href="https://llm-d.ai/slack" target="_self">Slack workspace</a> for real-time discussions</li>
<li class="">Check out our <a href="https://llm-d.ai/docs/community/contribute" target="_blank" rel="noopener noreferrer" class="">Contributor Guidelines</a> to start contributing code</li>
</ol>
<p>We're looking forward to hearing from you and working together to make llm-d even better!</p>]]></content>
        <author>
            <name>Pete Cheslock</name>
            <email>pete.cheslock@redhat.com</email>
            <uri>https://github.com/petecheslock</uri>
        </author>
        <category label="Community" term="Community"/>
        <category label="Updates" term="Updates"/>
        <category label="SIG-Benchmarking" term="SIG-Benchmarking"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[llm-d Week 1 Project News Round-Up]]></title>
        <id>https://llm-d.ai/blog/llm-d-week-1-round-up</id>
        <link href="https://llm-d.ai/blog/llm-d-week-1-round-up"/>
        <updated>2025-06-03T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[llm-d hits 1000 GitHub stars! Week 1-2 round-up covers KVTransfer Protocol, InferenceModel API updates, and community resources for LLM inference developers.]]></summary>
        <content type="html"><![CDATA[<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="june-3-2025">June 3, 2025<a href="https://llm-d.ai/blog/llm-d-week-1-round-up#june-3-2025" class="hash-link" aria-label="Direct link to June 3, 2025" title="Direct link to June 3, 2025" translate="no">​</a></h2>
<h1><strong>llm-d Week 1+2 Project News Round-Up</strong></h1>
<p>Hey, the llm-d project team has been really busy since the launch on May 20.</p>
<p>We've hit 1,000 ⭐️s on <a href="https://github.com/llm-d/llm-d" target="_blank" rel="noopener noreferrer" class="">GitHub</a>!</p>
<p><img decoding="async" loading="lazy" alt="llm-d Star Chart" src="https://llm-d.ai/assets/images/star-history-202563-5bd50fe0c920335bc7e17657cb50bd1e.webp" width="3664" height="2616" class="img_ev3q"></p>
<p><strong>Here are some of the active design conversations:</strong></p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>Join our Google Group</div><div class="admonitionContent_BuS1"><p>We use Google Groups to share architecture diagrams and other content. Please join: <a href="https://groups.google.com/g/llm-d-contributors" target="_blank" rel="noopener noreferrer" class="">llm-d-contributors Google Group</a></p></div></div>
<ul>
<li class=""><a href="https://docs.google.com/document/d/1zBkToR9XWjvBYLxu15JeoGpq16nH5sFFensZP_3lJQU/view" target="_blank" rel="noopener noreferrer" class="">2025-06-01 [PUBLIC] llm-d KVTransfer Protocol</a></li>
<li class=""><a href="https://docs.google.com/document/d/1x6aI9pbTF5oOsaEQYc9n4pBBY3_AuEY2X51VKxmBSnU/view" target="_blank" rel="noopener noreferrer" class="">Revisiting The InferenceModel API</a></li>
<li class=""><a href="https://docs.google.com/document/d/1HA-2yNZpc1F4KhyeYA30shjZpYEDqGIJXqVgDVv3SWU/view" target="_blank" rel="noopener noreferrer" class="">ModelService: Declarative Inference Serving on llm-d</a></li>
<li class=""><a href="https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk/view" target="_blank" rel="noopener noreferrer" class="">[External] Standardizing Large Model Server Metrics in Kubernetes</a></li>
<li class=""><a href="https://docs.google.com/document/d/1Ia0oRGnkPS8anB4g-_XPGnxfmOTOeqjJNb32Hlo_Tp0" target="_blank" rel="noopener noreferrer" class="">[PUBLIC] LLM Workloads</a></li>
<li class="">[<a href="https://docs.google.com/document/d/1Giim_pIGKb4MDue9YFJBlsVkPsWomfVTIFuTmP3PZIo" target="_blank" rel="noopener noreferrer" class="">Public] LLM-D Inference Scheduler Roadmap</a></li>
</ul>
<p><strong>Project links for those wanting to track us or get involved:</strong></p>
<ul>
<li class=""><a href="https://www.youtube.com/live/Gr8jomztY2s?si=_FT4gGpWGU6MQLyl&amp;t=4994" target="_blank" rel="noopener noreferrer" class="">Project Launch</a></li>
<li class=""><a href="https://llm-d.ai/docs/community/contribute" target="_blank" rel="noopener noreferrer" class="">Contributor Guidelines</a></li>
<li class=""><a href="http://red.ht/llm-d-public-calendar" target="_blank" rel="noopener noreferrer" class="">Project Calendar</a>
<ul>
<li class="">All SIG meetings</li>
<li class="">Weekly Community Meeting, every Wednesday @ 12:30pm Eastern Time</li>
</ul>
</li>
<li class="">Join the <a href="http://red.ht/llm-d-google-group" target="_blank" rel="noopener noreferrer" class="">llm-d Google group</a> - this is where all engineering drafts are shared. You must join the group to have access to project google docs.</li>
<li class="">Join <a href="https://llm-d.ai/slack" target="_self">Slack</a></li>
<li class=""><a href="https://github.com/llm-d/llm-d/issues/26" target="_blank" rel="noopener noreferrer" class="">llm-d Roadmap for Q2</a></li>
</ul>
<p><strong>Where to follow us on Social Media</strong></p>
<ul>
<li class=""><a href="https://llm-d.ai/" target="_blank" rel="noopener noreferrer" class="">https://llm-d.ai</a></li>
<li class=""><a href="http://linkedin.com/company/llm-d" target="_blank" rel="noopener noreferrer" class="">LinkedIn</a></li>
<li class=""><a href="https://bsky.app/profile/llm-d.ai" target="_blank" rel="noopener noreferrer" class="">Bluesky - @llm-d.ai</a></li>
<li class=""><a href="https://twitter.com/_llm_d_" target="_blank" rel="noopener noreferrer" class="">@_llm_d_</a></li>
<li class=""><a href="https://www.reddit.com/r/llm_d/" target="_blank" rel="noopener noreferrer" class="">r/llm_d</a></li>
<li class=""><a href="https://www.youtube.com/@llm-d-project" target="_blank" rel="noopener noreferrer" class="">YouTube</a></li>
</ul>]]></content>
        <author>
            <name>Pete Cheslock</name>
            <email>pete.cheslock@redhat.com</email>
            <uri>https://github.com/petecheslock</uri>
        </author>
        <category label="News Releases" term="News Releases"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Announcing the llm-d community!]]></title>
        <id>https://llm-d.ai/blog/llm-d-announce</id>
        <link href="https://llm-d.ai/blog/llm-d-announce"/>
        <updated>2025-05-20T08:00:00.000Z</updated>
        <summary type="html"><![CDATA[Introducing llm-d: Kubernetes-native distributed LLM inference with KV-cache routing, disaggregated serving, and SOTA performance per dollar. Built on vLLM.]]></summary>
        <content type="html"><![CDATA[<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="announcing-the-llm-d-community">Announcing the llm-d community<a href="https://llm-d.ai/blog/llm-d-announce#announcing-the-llm-d-community" class="hash-link" aria-label="Direct link to Announcing the llm-d community" title="Direct link to Announcing the llm-d community" translate="no">​</a></h2>
<p>llm-d is a Kubernetes-native high-performance distributed LLM inference framework<br>
<!-- -->- a well-lit path for anyone to serve at scale, with the fastest time-to-value and competitive performance per dollar for most models across most hardware accelerators.</p>
<p>With llm-d, users can operationalize gen AI deployments with a modular, high-performance, end-to-end serving solution that leverages the latest distributed inference optimizations like KV-cache aware routing and disaggregated serving, co-designed and integrated with the Kubernetes operational tooling in <a href="https://github.com/kubernetes-sigs/gateway-api-inference-extension?tab=readme-ov-file" target="_blank" rel="noopener noreferrer" class="">Inference Gateway (IGW)</a>.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="llm-inference-goes-distributed">LLM Inference Goes Distributed<a href="https://llm-d.ai/blog/llm-d-announce#llm-inference-goes-distributed" class="hash-link" aria-label="Direct link to LLM Inference Goes Distributed" title="Direct link to LLM Inference Goes Distributed" translate="no">​</a></h3>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-standard-scale-out-falls-short">Why Standard Scale Out Falls Short<a href="https://llm-d.ai/blog/llm-d-announce#why-standard-scale-out-falls-short" class="hash-link" aria-label="Direct link to Why Standard Scale Out Falls Short" title="Direct link to Why Standard Scale Out Falls Short" translate="no">​</a></h4>
<p>Kubernetes typically scales out application workloads with uniform replicas and round-robin load balancing.</p>
<p><img decoding="async" loading="lazy" alt="Figure 1: Deploying a service to multiple vLLM instances" src="https://llm-d.ai/assets/images/image5_46-dd662cfda15dbf7174d0ea6e97d745c0.webp" width="4605" height="3122" class="img_ev3q"></p>
<p>This simple pattern is very effective for most request patterns, which have the following characteristics:</p>
<ul>
<li class="">Requests are short-lived and generally uniform in resource utilization</li>
<li class="">Requests have generally uniform latency service level objectives (SLOs)</li>
<li class="">Each replica can process each request equally well</li>
<li class="">Specializing variants and coordinating replicas to process a single request is not useful</li>
</ul>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="llm-serving-is-unique">LLM Serving Is Unique<a href="https://llm-d.ai/blog/llm-d-announce#llm-serving-is-unique" class="hash-link" aria-label="Direct link to LLM Serving Is Unique" title="Direct link to LLM Serving Is Unique" translate="no">​</a></h4>
<p>The LLM inference workload, however, is unique with slow, non-uniform, expensive requests. This means that typical scale-out and load-balancing patterns fall short of optimal performance.</p>
<p><img decoding="async" loading="lazy" alt="Figure 2: Comparison of modern HTTP requests" src="https://llm-d.ai/assets/images/image7_33-73f3854277241ca80527a1994eda0c92.webp" width="1768" height="726" class="img_ev3q"></p>
<p>Let's take a look at each one step-by-step:</p>
<p><em>A. Requests are expensive with significant variance in resource utilization.</em></p>
<ul>
<li class="">Each LLM inference request has a different "shape" to it, as measured by the number of input tokens and output tokens. There is significant variance in these parameters across requests and workloads.<!-- -->
<ul>
<li class="">RAG has long inputs - prompt and retrieved docs - and short generated outputs</li>
<li class="">Reasoning has a short or medium inputs and long generated outputs</li>
</ul>
</li>
</ul>
<p><img decoding="async" loading="lazy" alt="Figure 3: Comparing the RAG pattern and Thinking/Reasoning pattern with prefill and decode stages" src="https://llm-d.ai/assets/images/image2_4-78ce4211a3ce8a2db443dfff5e98ba72.webp" width="1999" height="741" class="img_ev3q"></p>
<ul>
<li class="">These differences in request times can lead to significant imbalances across instances, which are compounded as loaded instances get overwhelmed. Overloads lead to longer ITL (Inter-Token Latency), which leads to more load, which leads to more ITL.</li>
</ul>
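<p>To see why uniform load balancing struggles with non-uniform requests, here is a toy Python simulation (illustrative only, not llm-d code): with a mix of short and very long requests, round-robin can leave replicas with wildly different amounts of queued work, while routing each request to the currently least-loaded replica bounds the imbalance by a single request's duration.</p>

```python
import random

# Toy simulation: 1000 requests whose durations are mostly short but
# occasionally very long, dispatched to 4 replicas.
random.seed(0)
durations = [random.choice([0.05, 0.1, 5.0]) for _ in range(1000)]

def round_robin(durations, n=4):
    # Assign request i to replica i mod n, ignoring current load.
    load = [0.0] * n
    for i, d in enumerate(durations):
        load[i % n] += d
    return load

def least_loaded(durations, n=4):
    # Assign each request to the replica with the least accumulated work.
    load = [0.0] * n
    for d in durations:
        load[load.index(min(load))] += d
    return load

rr, ll = round_robin(durations), least_loaded(durations)
print("round-robin spread: ", max(rr) - min(rr))
print("least-loaded spread:", max(ll) - min(ll))
```

<p>With least-loaded routing the final spread between the busiest and idlest replica can never exceed the longest single request, while round-robin's spread grows with the variance of the workload.</p>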
<p><em>B. Routing to specific replicas with cached prior computation can achieve orders of magnitude better latency.</em></p>
<ul>
<li class="">Many common LLM workloads have "multi-turn" request patterns, where the same prompt is sent iteratively to the same instance.<!-- -->
<ul>
<li class="">Agentic (tool calls are iterative request flow)</li>
<li class="">Code completion task (requests reuse current codebase as context)</li>
</ul>
</li>
</ul>
<p><img decoding="async" loading="lazy" alt="The agentic pattern sequence" src="https://llm-d.ai/assets/images/image8_0-65addf6114f9b264ec4146f199c094da.webp" width="1999" height="722" class="img_ev3q"></p>
<ul>
<li class="">LLM inference servers like vLLM implement a method called "automatic prefix caching", which enables "skipping" a significant amount of prefill computation when there is a cache hit. If requests are routed to vLLM replicas that have the data in the cache, we skip computation. Increasing the likelihood of prefix cache hits with a larger cache size can dramatically improve tail latencies.</li>
</ul>
<p><img decoding="async" loading="lazy" alt="The prefix aching method" src="https://llm-d.ai/assets/images/image3-ffd77c48ca9dba8df7c8461f7e445bd1.webp" width="1999" height="1744" class="img_ev3q"></p>
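<p>The idea behind automatic prefix caching can be sketched in a few lines of Python (a simplification of vLLM's block-based design, not its actual implementation): fixed-size token blocks are hashed with the previous block's hash chained in, so a cached block is only reusable when the entire prefix up to it matches.</p>

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM uses fixed-size blocks)

def block_hashes(token_ids):
    """Hash each full block of tokens, chaining in the previous block's
    hash so a block matches only when the whole prefix matches."""
    hashes, prev = [], b""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, full, BLOCK_SIZE):
        block = token_ids[i:i + BLOCK_SIZE]
        prev = hashlib.sha256(prev + str(block).encode("utf-8")).digest()
        hashes.append(prev)
    return hashes

def cached_prefix_len(token_ids, cache):
    """Count leading tokens whose blocks are already in a replica's cache;
    prefill can skip computation for these tokens."""
    n = 0
    for h in block_hashes(token_ids):
        if h not in cache:
            break
        n += BLOCK_SIZE
    return n

# A replica that already served a shared 64-token system prompt has its
# blocks cached; a follow-up request reuses that prefix with a new suffix.
system_prompt = list(range(64))
cache = set(block_hashes(system_prompt))
request = system_prompt + list(range(100, 120))
print(cached_prefix_len(request, cache))  # 64 tokens of prefill skipped
```

<p>Routing the follow-up request to this replica skips prefill for the first 64 tokens; routing it anywhere else recomputes them from scratch.</p>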
<p><em>C. Specializing and coordinating replicas to process a single request can lead to more throughput per GPU.</em></p>
<ul>
<li class="">
<p>Inference is split into two phases – prefill and decode.  Prefill generates the first output token and runs in parallel over all the prompt tokens - this phase is compute bound.  Decode generates tokens one at a time by doing a full pass over the model, making this phase memory bandwidth-bound.</p>
</li>
<li class="">
<p>Standard LLM deployments perform the prefill and decode phases of inference within a single replica. Given that the prefill and decode phases have different resource requirements, co-locating them on the same replica leads to inefficient resource use, especially for long sequences.</p>
</li>
<li class="">
<p><strong>Disaggregation</strong> (e.g. <a href="https://arxiv.org/abs/2401.09670" target="_blank" rel="noopener noreferrer" class="">Distserve</a>) separates prefill and decode phases onto different variants, enabling independent optimization and scaling of each phase.</p>
<ul>
<li class="">
<p>Google <a href="https://cloud.google.com/blog/products/compute/whats-new-with-ai-hypercomputer" target="_blank" rel="noopener noreferrer" class="">leverages disaggregated serving on TPUs</a> to provide better first-token latency and simplify operational scaling.</p>
</li>
<li class="">
<p>DeepSeek released a <a href="https://github.com/deepseek-ai/open-infra-index/blob/main/202502OpenSourceWeek/day_6_one_more_thing_deepseekV3R1_inference_system_overview.md" target="_blank" rel="noopener noreferrer" class="">discussion of the design of their inference system</a>, which leverages aggressive disaggregation to achieve remarkable performance at scale.</p>
</li>
</ul>
</li>
</ul>
<p><img decoding="async" loading="lazy" alt="Disaggregation separates the prefill and decode phases" src="https://llm-d.ai/assets/images/image4_57-9bb38b156a80392da3b3d1ab6852ac04.webp" width="1999" height="1144" class="img_ev3q"></p>
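<p>A back-of-envelope calculation makes the asymmetry concrete. The numbers below (model size, peak FLOP/s, memory bandwidth) are assumed for illustration only, not measurements of any particular accelerator:</p>

```python
# Illustrative roofline-style estimate for a hypothetical 70B-parameter
# dense model on a hypothetical accelerator.
params = 70e9
flops_per_token = 2 * params          # ~2 FLOPs per parameter per token
peak_flops = 1e15                     # 1 PFLOP/s (assumed)
mem_bw = 3e12                         # 3 TB/s (assumed)
bytes_per_pass = params * 2           # fp16 weights read once per decode step

# Prefill: all prompt tokens processed in parallel -> compute dominates.
prompt_tokens = 8000
prefill_compute_s = prompt_tokens * flops_per_token / peak_flops

# Decode: one token per forward pass -> weight reads dominate.
decode_mem_s_per_token = bytes_per_pass / mem_bw

print(f"prefill: ~{prefill_compute_s:.2f}s of compute for {prompt_tokens} tokens")
print(f"decode:  ~{decode_mem_s_per_token * 1000:.0f}ms of memory traffic per token")
```

<p>Even with toy numbers, the shape of the result holds: prefill saturates compute while decode saturates memory bandwidth, which is why serving both phases on identically configured replicas wastes one resource or the other.</p>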
<p><em>D. Production deployments often have a range of quality of service (QoS) requirements.</em></p>
<ul>
<li class="">
<p>Use cases for a single LLM endpoint can have a wide variety of quality of service requirements. Consider the following examples:</p>
<ul>
<li class="">Latency is the most important factor: Code completion requests and search responses need to minimize latency to provide an "in the loop" experience. O(ms) latency tolerance.</li>
<li class="">Latency is important: Chat agent sessions and email drafting with interactive use cases. O(seconds) latency tolerance.</li>
<li class="">Latency tolerant: Video call and email summarization and "deep research" agents with daily or hourly usage patterns. O(minutes) latency tolerance.</li>
<li class="">Latency agnostic: Overnight batch processing workloads, meeting minute generation, and autonomous agents. O(hours) latency tolerance.</li>
</ul>
</li>
<li class="">
<p>Given the compute intensity (and, therefore, high costs) of LLMs, tight latency SLOs are substantially more expensive to achieve. This spectrum of latency requirements presents an opportunity to further optimize infrastructure efficiency – the more latency tolerant a workload is, the more we can optimize infrastructure efficiency amongst other workloads.</p>
</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-llm-d">Why llm-d?<a href="https://llm-d.ai/blog/llm-d-announce#why-llm-d" class="hash-link" aria-label="Direct link to Why llm-d?" title="Direct link to Why llm-d?" translate="no">​</a></h3>
<p>To exploit these characteristics and achieve optimal performance for LLM workloads, the inference serving landscape is rapidly transitioning towards distributed cluster-scale architectures. For instance, in its "Open Source Week", the DeepSeek team published the design of its <a href="https://github.com/deepseek-ai/open-infra-index/blob/main/202502OpenSourceWeek/day_6_one_more_thing_deepseekV3R1_inference_system_overview.md" target="_blank" rel="noopener noreferrer" class="">inference system</a>, which aggressively leverages disaggregation and KV caching to achieve remarkable performance per $ of compute.</p>
<p>However, for most GenAI innovators, ML platform teams, and IT operations groups, these benefits remain out of reach. Building and operating a complex, monolithic system is time-consuming and challenging, especially in the context of the rapid pace of innovation and enterprise deployments with tens or hundreds of models for divergent use cases. This complexity risks time to market, higher operational costs and sprawl, and difficulty adopting and experimenting.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="our-objective">Our Objective<a href="https://llm-d.ai/blog/llm-d-announce#our-objective" class="hash-link" aria-label="Direct link to Our Objective" title="Direct link to Our Objective" translate="no">​</a></h4>
<p>The objective of llm-d is to create a well-lit path for anyone to adopt the leading distributed inference optimizations <em>within their existing deployment framework</em> - Kubernetes.</p>
<p>To achieve this goal, we have the following design principles for the project:</p>
<ul>
<li class=""><strong>Operationalizability:</strong> modular and resilient architecture with native integration into Kubernetes via Inference Gateway API</li>
<li class=""><strong>Flexibility:</strong> cross-platform (active work to support NVIDIA, Google TPU, AMD, and Intel), with extensible implementations of key composable layers of the stack</li>
<li class=""><strong>Performance</strong>: leverage distributed optimizations like disaggregation and prefix-aware routing to achieve the highest tok/$ while meeting SLOs</li>
</ul>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="architecture">Architecture<a href="https://llm-d.ai/blog/llm-d-announce#architecture" class="hash-link" aria-label="Direct link to Architecture" title="Direct link to Architecture" translate="no">​</a></h4>
<p>To achieve this objective, we designed llm-d with a modular and layered architecture on top of industry-standard open-source technologies - vLLM, Kubernetes, and Inference Gateway.</p>
<ul>
<li class="">
<p><a href="https://docs.vllm.ai/en/latest/" target="_blank" rel="noopener noreferrer" class=""><strong>vLLM</strong>. vLLM</a> is the leading open-source LLM inference engine, supporting a wide range of models (including Llama and DeepSeek) and hardware accelerators (including NVIDIA GPU, Google TPU, AMD ) with high performance.</p>
</li>
<li class="">
<p><a href="https://kubernetes.io/docs/home/" target="_blank" rel="noopener noreferrer" class=""><strong>Kubernetes</strong></a> <strong>(K8s)</strong>. K8s is an open source container orchestration engine for automating deployment, scaling, and management of containerized applications. It is the industry standard for deploying and updating LLM inference engines across various hardware accelerators.</p>
</li>
<li class="">
<p><a href="https://gateway-api-inference-extension.sigs.k8s.io/" target="_blank" rel="noopener noreferrer" class=""><strong>Inference Gateway</strong></a> <strong>(IGW)</strong>. IGW is an official Kubernetes project that extends the <a href="https://gateway-api.sigs.k8s.io/" target="_blank" rel="noopener noreferrer" class="">Gateway API</a> (the next generation of Kubernetes Ingress and Load Balancing API) with inference-specific routing. IGW includes many important features like model routing, serving priority, and extensible scheduling logic for "smart" load balancing. IGW integrates with many different gateway implementations, such as Envoy, making it widely portable across Kubernetes clusters.</p>
</li>
</ul>
<p><img decoding="async" loading="lazy" src="https://llm-d.ai/assets/images/llm-d-arch-simplified-d41875ab8b1fcf94a1a42df44940ceae.svg" width="859" height="474" class="img_ev3q"></p>
<p>And our key new contributions:</p>
<ul>
<li class="">
<p><strong>vLLM Optimized Inference Scheduler</strong> - IGW defines a pattern for customizable "smart" load-balancing via the <a href="https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/004-endpoint-picker-protocol" target="_blank" rel="noopener noreferrer" class="">Endpoint Picker Protocol (EPP)</a>. Leveraging enhanced operational telemetry exposed by vLLM, the inference scheduler implements the filtering and scoring algorithms necessary to make "smart" scheduling decisions around disaggregated serving, prefix-cache-awareness, and load-awareness, validated to be used out-of-the-box by llm-d users. Advanced teams can also tweak or implement their own scorers and filterers to further customize for their use cases, while still benefiting from upcoming operational features in the inference gateway, like flow control and latency-aware balancing.</p>
<ul>
<li class="">For more details, see our Northstar: <a href="https://docs.google.com/document/d/1kE1LY8OVjiOgKVD9-9Po96HODbTIbgHp4qgvw06BCOc/edit?tab=t.0" target="_blank" rel="noopener noreferrer" class="">[PUBLIC] llm-d Scheduler Northstar</a></li>
</ul>
</li>
<li class="">
<p><strong>Disaggregated Serving with <a href="https://github.com/vllm-project/vllm" target="_blank" rel="noopener noreferrer" class="">vLLM</a> -</strong> llm-d leverages vLLM's recently enabled support for disaggregated serving via a pluggable KV Connector API to run prefill and decode on independent instances, using high-performance transport libraries like <a href="https://github.com/ai-dynamo/nixl" target="_blank" rel="noopener noreferrer" class="">NVIDIA's NIXL</a>.</p>
<p>In llm-d, we plan to support two "well-lit" paths for prefill/decode (P/D) disaggregation:</p>
<ul>
<li class="">Latency optimized implementation using fast interconnects (IB, RDMA, ICI)</li>
<li class="">Throughput optimized implementation using data center networking</li>
<li class="">For more details, see our Northstar:<a href="https://docs.google.com/document/d/1FNN5snmipaTxEA1FGEeSH7Z_kEqskouKD1XYhVyTHr8/edit?tab=t.0#heading=h.ycwld2oth1kj" target="_blank" rel="noopener noreferrer" class="">[PUBLIC] llm-d Disaggregated Serving Northstar</a></li>
</ul>
</li>
<li class="">
<p><strong>Disaggregated Prefix Caching with vLLM</strong> -  llm-d uses the same vLLM KV connector API used in disaggregated serving to provide a pluggable cache for previous calculations, including offloading KVs to host, remote storage, and systems like <a href="https://github.com/LMCache/LMCache" target="_blank" rel="noopener noreferrer" class="">LMCache</a>.</p>
<p>In llm-d, we plan to support two "well-lit" paths for KV cache disaggregation:</p>
<ul>
<li class="">Independent caching with basic offloading to host memory and disk, providing a zero operational cost mechanism that utilizes all system resources</li>
<li class="">Shared caching with KV transfer between instances and shared storage with global indexing, providing potential for higher performance at the cost of a more operationally complex system.</li>
<li class="">For more details, see our Northstar: <a href="https://docs.google.com/document/d/1d-jKVHpTJ_tkvy6Pfbl3q2FM59NpfnqPAh__Uz_bEZ8/edit?tab=t.0#heading=h.6qazyl873259" target="_blank" rel="noopener noreferrer" class="">[PUBLIC] llm-d Prefix Caching Northstar</a></li>
</ul>
</li>
<li class="">
<p><strong>Variant Autoscaling over Hardware, Workload, and Traffic</strong> - Accelerator hardware varies dramatically in compute, memory, and cost; workloads sharing the same models vary in their required quality of service; the distinct phases of LLM inference and large mixture-of-experts models vary in whether they are compute-, memory-, or network-bound; and incoming traffic varies over time and by workload. Today, all of these decisions are made at deployment time, and almost all deployers struggle to enable autoscaling that safely reduces their costs.</p>
<p>Drawing on extensive experience from end users and OSS collaborators like AIBrix, we plan to implement a traffic- and hardware-aware autoscaler that:</p>
<ul>
<li class="">Measures the capacity of each model server instance</li>
<li class="">Derive a load function that takes into account different request shapes and QoS</li>
<li class="">Using the recent traffic mix - QPS (Queries Per Second), QoS, and shape distribution - calculate the optimal mix of instances to handle prefill, decode, and latency-tolerant requests, and label each instance with a grouping</li>
<li class="">Report load metrics per grouping that allows Kubernetes horizontal pod autoscaling to match hardware in use to hardware needed without violating SLOs</li>
<li class="">For more details, see our Northstar: <a href="https://docs.google.com/document/d/1inTneLEZTv3rDEBB9KLOB9K6oMq8c3jkogARJqdt_58/edit?tab=t.0" target="_blank" rel="noopener noreferrer" class="">[PUBLIC] llm-d Autoscaling Northstar</a></li>
</ul>
</li>
</ul>
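<p>The instance-mix calculation described in the autoscaling bullets can be sketched as follows (toy capacities and a deliberately simplified load function; the planned autoscaler is considerably more sophisticated):</p>

```python
import math

# Measured capacity of one instance per role (requests/sec it can absorb),
# assumed numbers for illustration.
capacity = {"prefill": 20.0, "decode": 8.0}

# Recent traffic mix: QPS per request shape, with a per-role cost derived
# from the shape (long-input RAG stresses prefill; long-output reasoning
# stresses decode).
traffic = [
    {"qps": 30, "prefill_cost": 1.0, "decode_cost": 0.2},  # RAG-like
    {"qps": 10, "prefill_cost": 0.2, "decode_cost": 1.0},  # reasoning-like
]

def required_instances(traffic, capacity):
    # Aggregate demand per role, then size each instance grouping.
    demand = {"prefill": 0.0, "decode": 0.0}
    for t in traffic:
        demand["prefill"] += t["qps"] * t["prefill_cost"]
        demand["decode"] += t["qps"] * t["decode_cost"]
    return {role: math.ceil(demand[role] / capacity[role]) for role in demand}

print(required_instances(traffic, capacity))  # {'prefill': 2, 'decode': 2}
```

<p>Exposing the per-grouping demand as a load metric is what lets standard Kubernetes horizontal pod autoscaling converge each grouping toward the required count.</p>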
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="example-llm-d-features">Example llm-d Features<a href="https://llm-d.ai/blog/llm-d-announce#example-llm-d-features" class="hash-link" aria-label="Direct link to Example llm-d Features" title="Direct link to Example llm-d Features" translate="no">​</a></h4>
<p>llm-d integrates IGW and vLLM together, enabling a high performance distributed serving stack. Let's discuss some of the example features enabled by llm-d.</p>
<p><strong>Prefix and KV cache-aware routing</strong></p>
<p>The first key collaboration between IGW and vLLM in llm-d was developing prefix-cache aware routing to complement the existing KV cache utilization aware load balancing in IGW.</p>
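<p>Schematically, an endpoint picker combines per-endpoint signals into a single score and dispatches to the best replica. The sketch below is illustrative Python with made-up signal names and weights; the actual llm-d-inference-scheduler is written in Go, and its scorers and weights differ:</p>

```python
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str
    queue_depth: int        # requests waiting on the replica
    kv_utilization: float   # fraction of KV-cache memory in use
    prefix_hit: float       # estimated prefix-cache hit ratio

# Hypothetical weights chosen for illustration.
WEIGHTS = {"queue": 0.3, "kv": 0.3, "prefix": 0.4}

def score(ep: Endpoint) -> float:
    # Higher is better: penalize queued work and KV-cache pressure,
    # reward a likely prefix-cache hit.
    return (WEIGHTS["queue"] * 1.0 / (1 + ep.queue_depth)
            + WEIGHTS["kv"] * (1.0 - ep.kv_utilization)
            + WEIGHTS["prefix"] * ep.prefix_hit)

def pick(endpoints):
    return max(endpoints, key=score)

pods = [Endpoint("pod-a", 2, 0.9, 0.0),
        Endpoint("pod-b", 5, 0.4, 0.9),
        Endpoint("pod-c", 0, 0.2, 0.0)]
print(pick(pods).name)  # pod-b: prefix-cache hits outweigh its deeper queue
```

<p>The interesting trade-off is visible even in the toy: the idle replica (pod-c) loses to a busier one (pod-b) because skipping prefill via a cache hit saves more time than the extra queueing costs.</p>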
<p>We conducted a series of experiments to evaluate the performance of the <a href="https://github.com/llm-d/llm-d-inference-scheduler" target="_blank" rel="noopener noreferrer" class="">llm-d-inference-scheduler</a> with prefix-aware routing on 2 NVIDIA 8xH100 nodes using the <a href="https://github.com/LMCache/LMBenchmark/tree/main/synthetic-multi-round-qa" target="_blank" rel="noopener noreferrer" class="">LMbenchmark in a long-input/short-output configuration designed</a> to stress KV cache reuse and routing decision quality.</p>
<table><thead><tr><th style="text-align:left"></th><th style="text-align:left">Model</th><th style="text-align:left">Configuration</th><th style="text-align:left">ISL</th><th style="text-align:left">OSL</th><th style="text-align:left">Latency SLO</th></tr></thead><tbody><tr><td style="text-align:left"><strong>S1</strong></td><td style="text-align:left">Llama 4 Scout FP8</td><td style="text-align:left">TP2, 2 replicas</td><td style="text-align:left">20,000</td><td style="text-align:left">100</td><td style="text-align:left">None</td></tr><tr><td style="text-align:left"><strong>S2</strong></td><td style="text-align:left">Llama 4 Scout FP8</td><td style="text-align:left">TP2, 4 replicas</td><td style="text-align:left">12,000</td><td style="text-align:left">100</td><td style="text-align:left">P95 TTFT &lt;= 2s</td></tr><tr><td style="text-align:left"><strong>S3</strong></td><td style="text-align:left">Llama 3.1 70B FP16</td><td style="text-align:left">TP2, 4 replicas</td><td style="text-align:left">8,000</td><td style="text-align:left">100</td><td style="text-align:left">P95 TTFT &lt;= 2s</td></tr></tbody></table>
<p><img decoding="async" loading="lazy" src="https://llm-d.ai/assets/images/image1_116-5419e52e9870560d530d03e729516a25.webp" width="1999" height="1415" class="img_ev3q"></p>
<p><strong>Key Observations:</strong></p>
<ul>
<li class=""><strong>S1:</strong> At 4 QPS, llm-d achieves a mean TTFT approximately 3X lower than the baseline (lower is better).</li>
<li class=""><strong>S2:</strong> llm-d delivers ~50% higher QPS than the baseline while meeting SLO requirements (higher is better).</li>
<li class=""><strong>S3:</strong> llm-d sustains 2X the baseline QPS under SLO constraints (higher is better).</li>
</ul>
<p>These results show that llm-d's cache- and prefix-aware scheduling effectively reduces TTFT and increases QPS compared to the baseline, while consistently meeting SLA requirements.</p>
<p>Try it out with the <code>base.yaml</code> config in our <a href="https://github.com/llm-d/llm-d-deployer/tree/main/quickstart" target="_blank" rel="noopener noreferrer" class="">quickstart</a>. And as a customization example, see the <a href="https://github.com/llm-d/llm-d-inference-scheduler/blob/main/docs/create_new_filter.md" target="_blank" rel="noopener noreferrer" class="">template</a> for adding your own scheduler filter.</p>
<p><strong>P/D disaggregation</strong></p>
<p>We've completed an initial implementation of P/D disaggregation with vLLM and llm-d-inference-scheduler, which delivers promising speedups for prefill-heavy workloads (20:1 ISL | OSL). Our next focus is finalizing the implementation with heterogeneous TP and completing comprehensive benchmarks for disaggregated serving. Short-term priorities include enabling heterogeneous TP, scaling with high-performance P/D + EP&lt;&gt;DP for large scale MoEs, and DP-aware load balancing. We will follow up with a detailed performance blog in the coming weeks.</p>
<p>Try it out with the <code>pd-nixl.yaml</code> config in our <a href="https://github.com/llm-d/llm-d-deployer/tree/main/quickstart" target="_blank" rel="noopener noreferrer" class="">quickstart</a>.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="get-started-with-llm-d">Get started with llm-d<a href="https://llm-d.ai/blog/llm-d-announce#get-started-with-llm-d" class="hash-link" aria-label="Direct link to Get started with llm-d" title="Direct link to Get started with llm-d" translate="no">​</a></h3>
<p>llm-d brings together the performance of vLLM with the operationalizability of Kubernetes, creating a modular architecture for distributed LLM inference, targeting high performance on the latest models and agentic architectures.</p>
<p>We welcome AI engineers and researchers to join the llm-d community and contribute:</p>
<ul>
<li class="">Check out our repository on Github: <a href="https://github.com/llm-d/llm-d" target="_blank" rel="noopener noreferrer" class="">https://github.com/llm-d/llm-d</a></li>
<li class="">Join our developer slack: <a href="https://llm-d.ai/slack" target="_self">/slack</a></li>
<li class="">Try out our quick starts to deploy llm-d on your Kubernetes cluster: <a href="https://github.com/llm-d/llm-d-deployer/tree/main/quickstart" target="_blank" rel="noopener noreferrer" class="">https://github.com/llm-d/llm-d-deployer/tree/main/quickstart</a></li>
</ul>
<p>Please join us. The future of AI is open.</p>]]></content>
        <author>
            <name>Robert Shaw</name>
            <email>robshaw@redhat.com</email>
            <uri>https://github.com/robertgshaw2-redhat</uri>
        </author>
        <author>
            <name>Clayton Coleman</name>
            <email>claytoncoleman@google.com</email>
            <uri>https://github.com/smarterclayton</uri>
        </author>
        <author>
            <name>Carlos Costa</name>
            <email>chcost@us.ibm.com</email>
            <uri>https://github.com/chcost</uri>
        </author>
        <category label="Hello" term="Hello"/>
        <category label="Welcome!" term="Welcome!"/>
        <category label="llm-d release news" term="llm-d release news"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[llm-d Press Release]]></title>
        <id>https://llm-d.ai/blog/llm-d-press-release</id>
        <link href="https://llm-d.ai/blog/llm-d-press-release"/>
        <updated>2025-05-20T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Red Hat launches llm-d: Open source distributed AI inference platform backed by NVIDIA, Google Cloud, IBM. Scale generative AI with intelligent routing on Kubernetes.]]></summary>
        <content type="html"><![CDATA[<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="may-20-2025">May 20, 2025<a href="https://llm-d.ai/blog/llm-d-press-release#may-20-2025" class="hash-link" aria-label="Direct link to May 20, 2025" title="Direct link to May 20, 2025" translate="no">​</a></h2>
<h1><strong>Red Hat Launches the llm-d Community, Powering Distributed Gen AI Inference at Scale</strong></h1>
<p>Forged in collaboration with founding contributors CoreWeave, Google Cloud, IBM Research and NVIDIA and joined by industry leaders AMD, Cisco, Hugging Face, Intel, Lambda and Mistral AI and university supporters at the University of California, Berkeley, and the University of Chicago, the project aims to make production generative AI as omnipresent as Linux</p>
<p><strong>BOSTON – RED HAT SUMMIT – MAY 20, 2025 —</strong> Red Hat, the world's leading provider of open source solutions, today announced the launch of llm-d, a new open source project that answers the most crucial need of generative AI's (gen AI) future: Inference at scale. Tapping breakthrough inference technologies for gen AI at scale, llm-d is powered by a native Kubernetes architecture, vLLM-based distributed inference and intelligent AI-aware network routing, empowering robust, large language model (LLM) inference clouds to meet the most demanding production service-level objectives (SLOs).</p>
<p>While training remains vital, the true impact of gen AI hinges on more efficient and scalable inference - the engine that transforms AI models into actionable insights and user experiences. According to Gartner<sup><a href="https://llm-d.ai/blog/llm-d-press-release#user-content-fn-1-984a1d" id="user-content-fnref-1-984a1d" data-footnote-ref="true" aria-describedby="footnote-label" class="anchorTargetStickyNavbar_Vzrq">1</a></sup>, "By 2028, as the market matures, more than 80% of data center workload accelerators will be specifically deployed for inference as opposed to training use." This underscores that the future of gen AI lies in the ability to execute. The escalating resource demands of increasingly sophisticated and larger reasoning models limits the viability of centralized inference and threatens to bottleneck AI innovation with prohibitive costs and crippling latency.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="answering-the-need-for-scalable-gen-ai-inference-with-llm-d">Answering the need for scalable gen AI inference with llm-d<a href="https://llm-d.ai/blog/llm-d-press-release#answering-the-need-for-scalable-gen-ai-inference-with-llm-d" class="hash-link" aria-label="Direct link to Answering the need for scalable gen AI inference with llm-d" title="Direct link to Answering the need for scalable gen AI inference with llm-d" translate="no">​</a></h3>
<p>Red Hat and its industry partners are directly confronting this challenge with llm-d, a visionary project that amplifies the power of vLLM to transcend single-server limitations and unlock production at scale for AI inference. Using the proven orchestration prowess of Kubernetes, llm-d integrates advanced inference capabilities into existing enterprise IT infrastructures. This unified platform empowers IT teams to meet the diverse serving demands of business-critical workloads, all while deploying innovative techniques to maximize efficiency and dramatically minimize the total cost of ownership (TCO) associated with high-performance AI accelerators.</p>
<p>llm-d delivers a powerful suite of innovations, highlighted by:</p>
<ul>
<li class=""><strong>vLLM</strong>, which has quickly become the de facto standard open source inference server, providing day 0 model support for emerging frontier models, and support for a broad list of accelerators, now including Google Cloud Tensor Processing Units (TPUs).</li>
<li class=""><strong>Prefill and Decode Disaggregation</strong> to separate the input context and token generation phases of inference into discrete operations, which can then be distributed across multiple servers.</li>
<li class=""><strong>KV (key-value) Cache Offloading</strong>, based on LMCache, shifts the memory burden of the KV cache from GPU memory to more cost-efficient and abundant standard storage, like CPU memory or network storage.</li>
<li class=""><strong>Kubernetes-powered clusters and controllers</strong> for more efficient scheduling of compute and storage resources as workload demands fluctuate, while maintaining performance and reducing latency.</li>
<li class=""><strong>AI-Aware Network Routing</strong> for scheduling incoming requests to the servers and accelerators that are most likely to have hot caches of past inference calculations.</li>
<li class=""><strong>High-performance communication APIs</strong> for faster and more efficient data transfer between servers, with support for NVIDIA Inference Xfer Library (NIXL).</li>
</ul>
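<p>To make the AI-aware routing idea above concrete, the sketch below scores candidate replicas by how much of a request's prompt prefix is likely already in their KV cache, penalized by current load, and routes to the highest-scoring one. This is a hypothetical illustration only, not llm-d's actual scheduler: the <code>Replica</code> type, the block-hashing scheme, and the weights are all invented for the example.</p>

```python
# Hypothetical sketch of cache-aware ("AI-aware") request routing:
# prefer the replica whose KV cache most likely already holds the
# request's prompt prefix, penalized by current queue depth.
# Names and weights are illustrative, not llm-d's implementation.
from dataclasses import dataclass, field

@dataclass
class Replica:
    name: str
    queue_depth: int  # requests currently waiting on this server
    cached_prefixes: set = field(default_factory=set)  # hashes of cached prompt blocks

def prefix_hashes(prompt: str, block: int = 16) -> list:
    """Hash the prompt in fixed-size blocks, mirroring paged KV-cache keys."""
    return [hash(prompt[: i + block]) for i in range(0, len(prompt), block)]

def score(replica: Replica, prompt: str,
          w_cache: float = 1.0, w_load: float = 0.5) -> float:
    blocks = prefix_hashes(prompt)
    # Count the longest run of already-cached leading blocks: a longer
    # cached prefix means fewer prefill tokens to recompute on arrival.
    hits = 0
    for h in blocks:
        if h not in replica.cached_prefixes:
            break
        hits += 1
    cache_ratio = hits / len(blocks) if blocks else 0.0
    return w_cache * cache_ratio - w_load * replica.queue_depth

def route(replicas: list, prompt: str) -> Replica:
    """Send the request to the replica with the best cache/load trade-off."""
    return max(replicas, key=lambda r: score(r, prompt))
```

A real scheduler would track cache contents approximately (e.g. from server-reported block tables) and fold in many more signals, but the trade-off shown here, cache affinity versus load spreading, is the core of the technique.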
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="llm-d-backed-by-industry-leaders">llm-d: Backed by industry leaders<a href="https://llm-d.ai/blog/llm-d-press-release#llm-d-backed-by-industry-leaders" class="hash-link" aria-label="Direct link to llm-d: Backed by industry leaders" title="Direct link to llm-d: Backed by industry leaders" translate="no">​</a></h3>
<p>This new open source project has already garnered the support of a formidable coalition of leading gen AI model providers, AI accelerator pioneers, and premier AI cloud platforms. CoreWeave, Google Cloud, IBM Research and NVIDIA are founding contributors, with AMD, Cisco, Hugging Face, Intel, Lambda and Mistral AI as partners, underscoring the industry's deep collaboration to architect the future of large-scale LLM serving. The llm-d community is further joined by founding supporters at the Sky Computing Lab at the University of California, originators of vLLM, and the LMCache Lab at the University of Chicago, originators of <a href="https://github.com/LMCache/LMCache" target="_blank" rel="noopener noreferrer" class="">LMCache</a>.</p>
<p>Rooted in its unwavering commitment to open collaboration, Red Hat recognizes the critical importance of vibrant and accessible communities in the rapidly evolving landscape of gen AI inference. Red Hat will actively champion the growth of the llm-d community, fostering an inclusive environment for new members and fueling its continued evolution.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="red-hats-vision-any-model-any-accelerator-any-cloud">Red Hat's vision: Any model, any accelerator, any cloud.<a href="https://llm-d.ai/blog/llm-d-press-release#red-hats-vision-any-model-any-accelerator-any-cloud" class="hash-link" aria-label="Direct link to Red Hat's vision: Any model, any accelerator, any cloud." title="Direct link to Red Hat's vision: Any model, any accelerator, any cloud." translate="no">​</a></h3>
<p>The future of AI must be defined by limitless opportunity, not constrained by infrastructure silos. Red Hat sees a horizon where organizations can deploy any model, on any accelerator, across any cloud, delivering an exceptional, more consistent user experience without exorbitant costs. To unlock the true potential of gen AI investments, enterprises require a universal inference platform - a standard for more seamless, high-performance AI innovation, both today and in the years to come.</p>
<p>Just as Red Hat pioneered the open enterprise by transforming Linux into the bedrock of modern IT, the company is now poised to architect the future of AI inference. vLLM's potential is that of a linchpin for standardized gen AI inference, and Red Hat is committed to building a thriving ecosystem around not just the vLLM community but also llm-d for distributed inference at scale. The vision is clear: regardless of the AI model or the underlying accelerator or the deployment environment, Red Hat intends to make vLLM the definitive open standard for inference across the new hybrid cloud.</p>
<p><strong>Red Hat Summit</strong><br>
<!-- -->Join the Red Hat Summit keynotes to hear the latest from Red Hat executives, customers and partners:</p>
<ul>
<li class=""><a href="https://events.experiences.redhat.com/widget/redhat/sum25/SessionCatalog2025/session/1737554802676001HJ8q" target="_blank" rel="noopener noreferrer" class=""><strong>Modernized infrastructure meets enterprise-ready AI</strong></a> — Tuesday, May 20, 8-10 a.m. EDT (<a href="https://youtube.com/live/Gr8jomztY2s?feature=share" target="_blank" rel="noopener noreferrer" class="">YouTube</a>)</li>
<li class=""><a href="https://events.experiences.redhat.com/widget/redhat/sum25/SessionCatalog2025/session/1737554802763001Hr0T" target="_blank" rel="noopener noreferrer" class=""><strong>Hybrid cloud evolves to deliver enterprise innovation</strong></a> — Wednesday, May 21, 8-9:30 a.m. EDT (<a href="https://youtube.com/live/g0K0pJIKHBU?feature=share" target="_blank" rel="noopener noreferrer" class="">YouTube</a>)</li>
</ul>
<p><strong>Supporting Quotes</strong><br>
<em>Brian Stevens, senior vice president and AI CTO, Red Hat</em><br>
<!-- -->"The launch of the llm-d community, backed by a vanguard of AI leaders, marks a pivotal moment in addressing the need for scalable gen AI inference, a crucial obstacle that must be overcome to enable broader enterprise AI adoption. By tapping the innovation of vLLM and the proven capabilities of Kubernetes, llm-d paves the way for distributed, scalable and high-performing AI inference across the expanded hybrid cloud, supporting any model, any accelerator, on any cloud environment and helping realize a vision of limitless AI potential."</p>
<p><em>Ramine Roane, corporate vice president, AI Product Management, AMD</em><br>
<!-- -->"AMD is proud to be a founding member of the llm-d community, contributing our expertise in high-performance GPUs to advance AI inference for evolving enterprise AI needs. As organizations navigate the increasing complexity of generative AI to achieve greater scale and efficiency, AMD looks forward to meeting this industry demand through the llm-d project."</p>
<p><em>Shannon McFarland, vice president, Cisco Open Source Program Office &amp; Head of Cisco DevNet</em><br>
<!-- -->"The llm-d project is an exciting step forward for practical generative AI. llm-d empowers developers to programmatically integrate and scale generative AI inference, unlocking new levels of innovation and efficiency in the modern AI landscape. Cisco is proud to be part of the llm-d community, where we're working together to explore real-world use cases that help organizations apply AI more effectively and efficiently."</p>
<p><em>Chen Goldberg, senior vice president, Engineering, CoreWeave</em><br>
<!-- -->"CoreWeave is proud to be a founding contributor to the llm-d project and to deepen our long-standing commitment to open source AI. From our early partnership with EleutherAI to our ongoing work advancing inference at scale, we've consistently invested in making powerful AI infrastructure more accessible. We're excited to collaborate with an incredible group of partners and the broader developer community to build a flexible, high-performance inference engine that accelerates innovation and lays the groundwork for open, interoperable AI."</p>
<p><em>Mark Lohmeyer, vice president and general manager, AI &amp; Computing Infrastructure, Google Cloud</em><br>
<!-- -->"Efficient AI inference is paramount as organizations move to deploying AI at scale and deliver value for their users. As we enter this new age of inference, Google Cloud is proud to build upon our legacy of open source contributions as a founding contributor to the llm-d project. This new community will serve as a critical catalyst for distributed AI inference at scale, helping users realize enhanced workload efficiency with increased optionality for their infrastructure resources."</p>
<p><em>Jeff Boudier, Head of Product, Hugging Face</em><br>
<!-- -->"We believe every company should be able to build and run their own models. With vLLM leveraging the Hugging Face transformers library as the source of truth for model definitions, a wide diversity of models large and small is available to power text, audio, image and video AI applications. Eight million AI Builders use Hugging Face to collaborate on over two million AI models and datasets openly shared with the global community. We are excited to support the llm-d project to enable developers to take these applications to scale."</p>
<p><em>Priya Nagpurkar, vice president, Hybrid Cloud and AI Platform, IBM Research</em><br>
<!-- -->"At IBM, we believe the next phase of AI is about efficiency and scale. We're focused on unlocking value for enterprises through AI solutions they can deploy effectively. As a founding contributor to llm-d, IBM is proud to be a key part of building a differentiated hardware agnostic distributed AI inference platform. We're looking forward to continued contributions towards the growth and success of this community to transform the future of AI inference."</p>
<p><em>Bill Pearson, vice president, Data Center &amp; AI Software Solutions and Ecosystem, Intel</em><br>
<!-- -->"The launch of llm-d will serve as a key inflection point for the industry in driving AI transformation at scale, and Intel is excited to participate as a founding supporter.  Intel's involvement with llm-d is the latest milestone in our decades-long collaboration with Red Hat to empower enterprises with open source solutions that they can deploy anywhere, on their platform of choice. We look forward to further extending and building AI innovation through the llm-d community."</p>
<p><em>Eve Callicoat, senior staff engineer, ML Platform, Lambda</em><br>
<!-- -->"Inference is where the real-world value of AI is delivered, and llm-d represents a major leap forward. Lambda is proud to support a project that makes state-of-the-art inference accessible, efficient, and open."</p>
<p><em>Ujval Kapasi, vice president, Engineering AI Frameworks, NVIDIA</em><br>
<!-- -->"The llm-d project is an important addition to the open source AI ecosystem and reflects NVIDIA's support for collaboration to drive innovation in generative AI. Scalable, highly performant inference is key to the next wave of generative and agentic AI. We're working with Red Hat and other supporting partners to foster llm-d community engagement and industry adoption, helping accelerate llm-d with innovations from NVIDIA Dynamo such as NIXL."</p>
<p><em>Ion Stoica, Professor and Director of Sky Computing Lab, University of California, Berkeley</em><br>
<!-- -->"We are pleased to see Red Hat build upon the established success of vLLM, which originated in our lab to help address the speed and memory challenges that come with running large AI models. Open source projects like vLLM, and now llm-d anchored in vLLM, are at the frontier of AI innovation tackling the most demanding AI inference requirements and moving the needle for the industry at large."</p>
<p><em>Junchen Jiang, CS Professor, LMCache Lab, University of Chicago</em><br>
<!-- -->"Distributed KV cache optimizations, such as offloading, compression, and blending, have been a key focus of our lab, and we are excited to see llm-d leveraging LMCache as a core component to reduce time to first token as well as improve throughput, particularly in long-context inference."</p>
<p><strong>Additional Resources</strong></p>
<ul>
<li class="">Learn more about <a href="https://www.llm-d.ai/" target="_blank" rel="noopener noreferrer" class="">llm-d</a></li>
<li class="">Read more about <a href="https://www.redhat.com/en/topics/ai/what-is-vllm" target="_blank" rel="noopener noreferrer" class="">vLLM</a></li>
<li class="">Find out more about <a href="https://github.com/llm-d" target="_blank" rel="noopener noreferrer" class="">contributing to llm-d</a></li>
<li class="">Learn more about <a href="http://red.ht/I2Zk1e" target="_blank" rel="noopener noreferrer" class="">Red Hat Summit</a></li>
<li class="">See all of Red Hat's announcements this week in the <a href="https://red.ht/3QrRUAh" target="_blank" rel="noopener noreferrer" class="">Red Hat Summit newsroom</a></li>
<li class="">Follow <a href="https://twitter.com/redhatsummit" target="_blank" rel="noopener noreferrer" class="">@RedHatSummit</a> or <a href="https://twitter.com/hashtag/rhsummit" target="_blank" rel="noopener noreferrer" class="">#RHSummit</a> on X for event-specific updates</li>
</ul>
<p><strong>Connect with Red Hat</strong></p>
<ul>
<li class="">Learn more about <a href="http://red.ht/IOS5vm" target="_blank" rel="noopener noreferrer" class="">Red Hat</a></li>
<li class="">Get more news in the <a href="http://red.ht/1qeXuma" target="_blank" rel="noopener noreferrer" class="">Red Hat newsroom</a></li>
<li class="">Read the <a href="http://red.ht/1zzgkXp" target="_blank" rel="noopener noreferrer" class="">Red Hat blog</a></li>
<li class="">Follow <a href="https://red.ht/3Ghe0TT" target="_blank" rel="noopener noreferrer" class="">Red Hat on X</a></li>
<li class="">Follow <a href="https://red.ht/4iBsqwB" target="_blank" rel="noopener noreferrer" class="">Red Hat on Instagram</a></li>
<li class="">Follow <a href="https://red.ht/4hHewrv" target="_blank" rel="noopener noreferrer" class="">Red Hat on LinkedIn</a></li>
<li class="">Watch <a href="https://red.ht/44B8oxL" target="_blank" rel="noopener noreferrer" class="">Red Hat videos on YouTube</a></li>
</ul>
<p><strong>About Red Hat</strong><br>
<a href="https://www.redhat.com/en" target="_blank" rel="noopener noreferrer" class="">Red Hat</a> is the open hybrid cloud technology leader, delivering a trusted, consistent and comprehensive foundation for transformative IT innovation and AI applications. Its portfolio of cloud, developer, AI, Linux, automation and application platform technologies enables any application, anywhere—from the datacenter to the edge. As the world's leading provider of enterprise open source software solutions, Red Hat invests in open ecosystems and communities to solve tomorrow's IT challenges. Collaborating with partners and customers, Red Hat helps them build, connect, automate, secure and manage their IT environments, supported by consulting services and <a href="https://access.redhat.com/recognition" target="_blank" rel="noopener noreferrer" class="">award-winning</a> training and certification offerings.</p>
<p><strong>Forward-Looking Statements</strong><br>
<!-- -->Except for the historical information and discussions contained herein, statements contained in this press release may constitute forward-looking statements within the meaning of the Private Securities Litigation Reform Act of 1995. Forward-looking statements are based on the company's current assumptions regarding future business and financial performance. These statements involve a number of risks, uncertainties and other factors that could cause actual results to differ materially. Any forward-looking statement in this press release speaks only as of the date on which it is made. Except as required by law, the company assumes no obligation to update or revise any forward-looking statements.</p>
<p><strong>Media Contact:</strong><br>
<!-- -->John Terrill<br>
<!-- -->Red Hat<br>
<!-- -->+1-571-421-8132<br>
<a href="mailto:jterrill@redhat.com" target="_blank" rel="noopener noreferrer" class="">jterrill@redhat.com</a></p>
<p><em>###</em></p>
<p><em>Red Hat and the Red Hat logo are trademarks or registered trademarks of Red Hat, Inc. or its subsidiaries in the U.S. and other countries.</em></p>
<!-- -->
<section data-footnotes="true" class="footnotes"><h2 class="anchor anchorTargetStickyNavbar_Vzrq sr-only" id="footnote-label">Footnotes<a href="https://llm-d.ai/blog/llm-d-press-release#footnote-label" class="hash-link" aria-label="Direct link to Footnotes" title="Direct link to Footnotes" translate="no">​</a></h2>
<ol>
<li class="anchorTargetStickyNavbar_Vzrq" id="user-content-fn-1-984a1d">
<p>Forecast Analysis: AI Semiconductors, Worldwide, Alan Priestley, Gartner, 2 August 2024 - ID G00818912. GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally. <a href="https://llm-d.ai/blog/llm-d-press-release#user-content-fnref-1-984a1d" data-footnote-backref="" aria-label="Back to reference 1" class="data-footnote-backref">↩</a></p>
</li>
</ol>
</section>]]></content>
        <category label="News Releases" term="News Releases"/>
    </entry>
</feed>