Prefix-aware routing#
Optimize LLM inference with cache locality using prefix-aware request routing.
Warning
This API is in alpha and may change before becoming stable.
LLM inference can benefit significantly from cache locality optimization. When one replica processes multiple prompts that share a prefix, the engine can reuse previously computed KV-cache entries, reducing computation overhead and improving response times. This technique is known as Automatic Prefix Caching (APC) in vLLM.
The PrefixCacheAffinityRouter routes requests with similar prefixes to the same replicas, maximizing KV cache hit rates.
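For example, two prompts that share a few-shot prefix differ only in their final tokens. If both requests reach the same replica, APC can reuse the KV-cache entries computed for the shared portion:

```python
# Two prompts sharing a long few-shot prefix. Routed to the same
# replica, the second request reuses the KV cache computed for the
# shared prefix instead of recomputing it.
few_shot_prefix = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"
    "cheese => fromage\n"
)
prompt_a = few_shot_prefix + "peppermint =>"
prompt_b = few_shot_prefix + "plush giraffe =>"
```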
When to use prefix-aware routing#
Use prefix-aware routing when:
Your workload has many requests with shared prefixes (for example, same system prompts or few-shot examples)
You’re using vLLM with Automatic Prefix Caching enabled
Cache hit rate matters more to you than perfectly even load distribution (the router still falls back to load balancing when replicas become imbalanced)
How it works#
The PrefixCacheAffinityRouter implements a multi-tier routing strategy that balances cache locality with load distribution:
1. Load balance check#
First, the router evaluates whether the current load is balanced across replicas by comparing queue lengths. If the difference between the highest and lowest queue lengths is below the imbalanced_threshold, it proceeds with prefix cache-aware routing.
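The check reduces to comparing queue-length extremes. Here's a minimal sketch, assuming per-replica queue lengths are available as a list of integers; the function name is illustrative, not the router's actual code:

```python
# Minimal sketch of the load-balance check; illustrative only.
def is_load_balanced(queue_lengths: list[int], imbalanced_threshold: int = 10) -> bool:
    # Load counts as balanced when the spread between the busiest and
    # least-busy replica stays below the threshold.
    return max(queue_lengths) - min(queue_lengths) < imbalanced_threshold
```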
2. Prefix matching strategy#
When load is balanced, the router uses a prefix tree to find replicas that have previously processed similar input text (see the sketch after this list):
High match rate (≥10%): Routes to replicas with the highest prefix match rate for better cache hit rates
Low match rate (<10%): Falls back to the replicas with the lowest prefix cache utilization, so new prefixes spread to underused caches
No prefix data: Uses the default Power of Two Choices selection
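The following sketch condenses these tiers into one function. The plain dicts standing in for the router's prefix tree and utilization tracking, and the function name, are illustrative assumptions, not the router's internals:

```python
import random

# Simplified sketch of the tiered selection; not the router's actual code.
def select_replica(match_rates: dict[str, float],
                   cache_utilization: dict[str, float],
                   match_rate_threshold: float = 0.1) -> str:
    if match_rates and max(match_rates.values()) >= match_rate_threshold:
        # High match rate: route to the replica with the best prefix match.
        return max(match_rates, key=match_rates.get)
    if match_rates:
        # Low match rate: route to the replica with the emptiest prefix cache.
        return min(cache_utilization, key=cache_utilization.get)
    # No prefix data: fall back to a Power of Two Choices pick. Utilization
    # stands in for queue length here to keep the sketch self-contained.
    a, b = random.sample(sorted(cache_utilization), 2)
    return a if cache_utilization[a] <= cache_utilization[b] else b
```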
3. Imbalanced load fallback#
When load is imbalanced (queue length difference exceeds threshold), the router prioritizes load balancing over cache locality and falls back to the standard Power of Two Choices algorithm.
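Power of Two Choices itself is simple to sketch: sample two replicas at random and route to the one with the shorter queue. The function below is an illustrative stand-in, assuming per-replica queue lengths:

```python
import random

# Illustrative Power of Two Choices over per-replica queue lengths.
def power_of_two_choices(queue_lengths: dict[str, int]) -> str:
    a, b = random.sample(sorted(queue_lengths), 2)
    return a if queue_lengths[a] <= queue_lengths[b] else b
```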
Prefix tree management#
The router maintains a distributed prefix tree actor (see the sketch after this list) that:
Tracks input text prefixes processed by each replica
Supports automatic eviction of old entries to manage memory usage
Persists across router instances using Ray’s detached actor pattern
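A toy version of such an actor might look like the following. The class name, methods, and bookkeeping are hypothetical; the real actor maintains an actual prefix tree with eviction. The detached lifetime and get_if_exists options are standard Ray named-actor features:

```python
import ray

# Hypothetical prefix-tracking actor; the real implementation differs.
@ray.remote
class PrefixTracker:
    def __init__(self):
        # replica_id -> input-text prefixes that replica has processed
        self.prefixes: dict[str, list[str]] = {}

    def record(self, replica_id: str, text: str) -> None:
        self.prefixes.setdefault(replica_id, []).append(text)

# lifetime="detached" keeps the actor alive across router restarts;
# get_if_exists=True lets new router instances reattach by name.
tracker = PrefixTracker.options(
    name="prefix_tracker", lifetime="detached", get_if_exists=True
).remote()
```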
Deploy with prefix-aware routing#
The following example shows how to deploy an LLM with prefix-aware routing:
```python
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app
from ray.serve.llm.request_router import PrefixCacheAffinityRouter

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-7b",
        model_source="Qwen/Qwen2.5-7B-Instruct",
    ),
    # Enable APC in vLLM
    engine_kwargs=dict(
        enable_prefix_caching=True,
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=2,
            max_replicas=4,
        )
    ),
    accelerator_type="A10G",
)

# Configure prefix-aware router
app = build_openai_app({
    "llm_configs": [llm_config],
    "router_cls": PrefixCacheAffinityRouter,
    "router_config": {
        "imbalanced_threshold": 10,
        "match_rate_threshold": 0.1,
    },
})
serve.run(app)
```
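Once deployed, you can send requests through the OpenAI-compatible endpoint. The client snippet below assumes Serve's default address of http://localhost:8000; requests that share the same system prompt are the ones that benefit from prefix affinity:

```python
from openai import OpenAI

# Assumes the Serve OpenAI-compatible endpoint at the default address.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

shared_system_prompt = "You are a concise assistant for internal docs."
for question in ["What is APC?", "How does KV caching help latency?"]:
    # Both requests share the system-prompt prefix, so the router tends
    # to send them to the same replica, where vLLM reuses the KV cache.
    response = client.chat.completions.create(
        model="qwen-7b",
        messages=[
            {"role": "system", "content": shared_system_prompt},
            {"role": "user", "content": question},
        ],
    )
    print(response.choices[0].message.content)
```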
Configuration parameters#
The PrefixCacheAffinityRouter provides several configuration parameters to tune its behavior:
Core routing parameters#
imbalanced_threshold (default: 10): Queue length difference threshold for considering load balanced. Lower values prioritize load balancing over cache locality.
match_rate_threshold (default: 0.1): Minimum prefix match rate (0.0-1.0) required to use prefix cache-aware routing. Higher values require stronger prefix matches before routing for cache locality.
Memory management parameters#
do_eviction (default: False): Enable automatic eviction of old prefix tree entries to approximate the LLM engine's eviction policy.
eviction_threshold_chars (default: 400,000): Maximum number of characters in the prefix tree before eviction is triggered.
eviction_target_chars (default: 360,000): Target number of characters to reduce the prefix tree to during eviction.
eviction_interval_secs (default: 10): Interval in seconds between eviction checks when eviction is enabled.
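Put together, a router_config that turns on eviction might look like this (values mirror the documented defaults, with do_eviction switched on):

```python
# router_config with eviction enabled; all other values are the defaults.
router_config = {
    "imbalanced_threshold": 10,
    "match_rate_threshold": 0.1,
    "do_eviction": True,
    "eviction_threshold_chars": 400_000,
    "eviction_target_chars": 360_000,
    "eviction_interval_secs": 10,
}
```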
Best practices#
Enable vLLM APC: Make sure to set enable_prefix_caching=True in your engine_kwargs for the router to have any effect
Tune thresholds: Adjust imbalanced_threshold and match_rate_threshold based on your workload characteristics
Monitor cache hit rates: Track vLLM's cache hit metrics to verify the router is improving performance
Start conservative: Begin with default settings and tune incrementally based on observed behavior