PrefixCacheAffinityRouter for LLM inference optimization#
Warning
This API is in alpha and may change before becoming stable.
LLM inference can benefit significantly from cache locality optimization. When one replica processes multiple prompts that share a prefix, the engine can reuse previously computed KV-cache entries, reducing computation overhead and improving response times. This technique is known as Automatic Prefix Caching (APC) in vLLM. The PrefixCacheAffinityRouter is designed specifically for this use case.
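For example, chat requests that reuse the same long system prompt share a long prefix, so a replica that has already processed one of them can reuse its cached KV entries for the shared portion of the others. The prompt strings below are made up purely for illustration:

import os

# Illustrative prompts only: two requests that share a long system prompt.
system_prompt = "You are a helpful support assistant for Acme Corp. Always answer politely. "
prompt_a = system_prompt + "User: How do I reset my password?"
prompt_b = system_prompt + "User: How do I update my billing address?"

# The shared prefix is what a replica's KV cache can reuse if both prompts
# land on the same replica.
shared_prefix = os.path.commonprefix([prompt_a, prompt_b])
print(len(shared_prefix), "characters of reusable prefix")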
This guide covers:
Understanding the prefix cache-aware routing algorithm
Building the components of a prefix-aware router
Configuration parameters and their impact
How Ray Serve LLM prefix cache-aware routing works#
The PrefixCacheAffinityRouter implements a multi-tier routing strategy that balances cache locality with load distribution:
1. Load balance check#
First, it evaluates whether the current load is balanced across replicas by comparing queue lengths. If the difference between the highest and lowest queue lengths is below the imbalanced_threshold, it proceeds with prefix cache-aware routing.
2. Prefix matching strategy#
When load is balanced, the router uses a prefix tree to find replicas that have previously processed similar input text:
High match rate (≥10%): Routes to the replicas with the highest prefix match rate for better cache hit rates.
Low match rate (<10%): Falls back to the replicas with the lowest prefix cache utilization to spread cache usage more evenly.
No prefix data: Uses the default Power of Two Choices selection.
3. Imbalanced load fallback#
When load is imbalanced (queue length difference exceeds threshold), the router prioritizes load balancing over cache locality and falls back to the standard Power of Two Choices algorithm.
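The following is a simplified, self-contained sketch of this decision flow, not the actual implementation; the function names, per-replica bookkeeping, and the final tie-breaking rules are placeholders for illustration:

import random
from typing import Dict, List


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common character prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def choose_replica(
    text: str,
    queue_lens: Dict[str, int],           # replica ID -> current queue length
    seen_prefixes: Dict[str, List[str]],  # replica ID -> recently routed prompts
    imbalanced_threshold: int = 10,
    match_rate_threshold: float = 0.1,
) -> str:
    replicas = list(queue_lens)

    # 1. Load balance check: if queue lengths diverge too much, ignore cache
    #    locality and pick the shorter of two random queues (Power of Two Choices).
    if (
        len(replicas) > 1
        and max(queue_lens.values()) - min(queue_lens.values()) > imbalanced_threshold
    ):
        a, b = random.sample(replicas, 2)
        return a if queue_lens[a] <= queue_lens[b] else b

    # 2. Prefix matching: score each replica by the longest prefix it has
    #    already processed for this prompt.
    best_match = {
        r: max((shared_prefix_len(text, p) for p in seen_prefixes.get(r, [])), default=0)
        for r in replicas
    }
    top_replica = max(replicas, key=best_match.get)
    match_rate = best_match[top_replica] / max(len(text), 1)
    if match_rate >= match_rate_threshold:
        return top_replica

    # 3. Low or no match: send the request to the replica tracking the least
    #    prefix data to spread cache utilization.
    return min(replicas, key=lambda r: sum(len(p) for p in seen_prefixes.get(r, [])))


# Example: both replicas idle, "r1" has already seen a similar prompt, so the
# sketch routes to "r1" for cache locality.
print(choose_replica(
    "Hello, how do I reset my password?",
    {"r1": 0, "r2": 0},
    {"r1": ["Hello, how do I change my email?"], "r2": []},
))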
Prefix tree management#
The router maintains a distributed prefix tree actor that:
Tracks input text prefixes processed by each replica
Supports automatic eviction of old entries to manage memory usage
Persists across router instances using Ray’s detached actor pattern
Building prefix-aware router components#
This section breaks down the key components of PrefixCacheAffinityRouter and shows how they work together. For a more basic example, see Use Custom Algorithm for Request Routing.
Base RequestRouter foundation#
Like all custom routers in Ray Serve, the PrefixCacheAffinityRouter extends the base RequestRouter class. The two core methods that define router behavior are:
choose_replicas(): The main routing logic that selects which replicas should handle a request.
on_request_routed(): A callback that updates router state after a request is successfully routed.
For a detailed explanation of these methods and their parameters, see the Define simple uniform request router example in the custom request router guide.
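As a reference point, a minimal custom router built on these two methods might look like the sketch below. The import path and the return shape (a ranked list of replica lists) follow the custom request router guide; treat them as assumptions and verify against your Ray version:

import random
from typing import List, Optional

# Assumed import path from the custom request router guide.
from ray.serve.request_router import (
    PendingRequest,
    RequestRouter,
    RunningReplica,
)


class UniformRequestRouter(RequestRouter):
    """Minimal example router: picks one replica uniformly at random."""

    async def choose_replicas(
        self,
        candidate_replicas: List[RunningReplica],
        pending_request: Optional[PendingRequest] = None,
    ) -> List[List[RunningReplica]]:
        # Return a ranked list of replica groups; here, a single group with
        # one randomly chosen replica.
        return [[random.choice(candidate_replicas)]]

    def on_request_routed(self, pending_request, replica_id, result):
        # Hook for updating router state after a request is routed; a uniform
        # router keeps no state, so there's nothing to do.
        pass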
1. Load balance detection component#
The first component evaluates whether the current load is balanced across replicas:
# Check for imbalanced load.
highest_queue_len = 0
lowest_queue_len = float("inf")
not_in_cache: List[ReplicaID] = []
if self._use_replica_queue_len_cache:
    # Populate available queue lens from the cache.
    for r in candidate_replicas:
        queue_len = self._replica_queue_len_cache.get(r.replica_id)
        if queue_len is None or queue_len >= r.max_ongoing_requests:
            not_in_cache.append(r)
        else:
            highest_queue_len = max(highest_queue_len, queue_len)
            lowest_queue_len = min(lowest_queue_len, queue_len)
else:
    not_in_cache = candidate_replicas
if len(not_in_cache) > 0:
    for r, queue_len in await self._probe_queue_lens(
        not_in_cache,
        0,
    ):
        if queue_len is None:
            continue
        highest_queue_len = max(highest_queue_len, queue_len)
        lowest_queue_len = min(lowest_queue_len, queue_len)
is_imbalanced = (
    highest_queue_len - lowest_queue_len > self._imbalanced_threshold
)
This component prioritizes load balancing over cache locality when replicas become too imbalanced.
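As a concrete illustration with the default imbalanced_threshold of 10 (the replica names and queue lengths are made up), queue lengths of 3 and 9 count as balanced, while 2 and 15 trigger the Power of Two Choices fallback:

imbalanced_threshold = 10

def is_imbalanced(queue_lens):
    return max(queue_lens.values()) - min(queue_lens.values()) > imbalanced_threshold

assert not is_imbalanced({"replica_a": 3, "replica_b": 9})   # difference 6 <= 10
assert is_imbalanced({"replica_a": 2, "replica_b": 15})      # difference 13 > 10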
2. Prefix tree management component#
The prefix tree component is implemented as a detached Ray actor that manages prefix tracking across the Serve application. The actual tree structure uses a multi-tenant prefix tree (approximate radix tree).
This distributed architecture allows the prefix information to persist across router restarts and be shared among multiple router instances.
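The sketch below illustrates the general shape of such an actor: a shared, named, detached Ray actor that tracks which replica (tenant) has seen which text. The class, its flat storage, and the actor name are simplified stand-ins for illustration, not the actual radix-tree implementation:

from collections import defaultdict
from typing import Dict, List, Tuple

import ray


@ray.remote
class SimplePrefixTracker:
    """Illustrative stand-in for the prefix tree actor (not a radix tree)."""

    def __init__(self):
        # Tenant (replica ID string) -> texts it has processed.
        self._entries: Dict[str, List[str]] = defaultdict(list)

    def insert(self, text: str, tenant: str, timestamp: float) -> None:
        # The timestamp is unused in this sketch; the real actor uses it to
        # evict old entries.
        self._entries[tenant].append(text)

    def prefix_match(self, text: str, tenants: List[str]) -> Tuple[str, List[str]]:
        # Return the longest matched prefix and the tenants that achieve it.
        best_len, best_tenants = 0, []
        for tenant in tenants:
            for stored in self._entries.get(tenant, []):
                n = 0
                for a, b in zip(text, stored):
                    if a != b:
                        break
                    n += 1
                if n > best_len:
                    best_len, best_tenants = n, [tenant]
                elif n == best_len and n > 0 and tenant not in best_tenants:
                    best_tenants.append(tenant)
        return text[:best_len], best_tenants

    def get_smallest_tenants(self) -> List[str]:
        # Tenants tracking the least text, used for the low-match fallback.
        if not self._entries:
            return []
        sizes = {t: sum(len(s) for s in texts) for t, texts in self._entries.items()}
        smallest = min(sizes.values())
        return [t for t, size in sizes.items() if size == smallest]


# Named, detached actor: survives router restarts and can be shared across
# router instances.
tracker = SimplePrefixTracker.options(
    name="prefix_tree_actor", lifetime="detached", get_if_exists=True
).remote()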
3. Prefix matching logic component#
The core prefix matching component implements the routing decision logic in the _prefix_match_best_replicas method. When load is balanced, it performs prefix matching to find the best replica:
if not is_imbalanced:
    # Convert candidate replica IDs to strings for prefix matching.
    candidate_replica_ids_strings = [
        r.replica_id.to_full_id_str() for r in candidate_replicas
    ]
    (matched_text, matched_tenant_id_strings,) = ray.get(
        self._tree_actor.prefix_match.remote(
            input_text, candidate_replica_ids_strings
        )
    )
    match_rate = len(matched_text) / len(input_text)
    if match_rate < self._match_rate_threshold:
        smallest_tenants_id_strings = ray.get(
            self._tree_actor.get_smallest_tenants.remote()
        )
        if (
            smallest_tenants_id_strings is not None
            and len(smallest_tenants_id_strings) > 0
        ):
            chosen_replica_id_strings = smallest_tenants_id_strings
    else:
        if (
            matched_tenant_id_strings is not None
            and len(matched_tenant_id_strings) > 0
        ):
            chosen_replica_id_strings = matched_tenant_id_strings
This logic implements the three-tier strategy:
High match rate: Routes to the replicas with the highest prefix match when match_rate >= match_rate_threshold.
Low match rate: Falls back to the replicas with the smallest KV-cache usage when the match rate is below the threshold.
No match: Falls back to the default Power of Two Choices selection when _prefix_match_best_replicas returns to choose_replicas.
4. Integration with Power of Two Choices#
The prefix-aware router extends the proven Power of Two Choices algorithm, falling back to it when prefix-based routing would degenerate. PrefixCacheAffinityRouter integrates this component in the choose_replicas method:
# Get fallback replicas from PowerOfTwoChoicesRequestRouter.
fallback_replicas = await PowerOfTwoChoicesRequestRouter.choose_replicas(
    self,
    candidate_replicas=candidate_replicas,
    pending_request=pending_request,
)
if pending_request is None or not fallback_replicas:
    return fallback_replicas
5. State management and callbacks#
The router uses the on_request_routed() callback to update the prefix tree with routing decisions:
def on_request_routed(
    self,
    pending_request: PendingRequest,
    replica_id: ReplicaID,
    result: ReplicaResult,
):
    """Called when a request is routed to a replica.

    This is used as a callback to update the state of the request router
    after a response is generated.
    """
    # Right now this only inserts the prompt into the prefix tree, not the
    # response (streaming response makes things complicated).
    if (
        pending_request is not None
        and pending_request.args is not None
        and len(pending_request.args) > 0
    ):
        input_text = self._extract_text_from_request(pending_request)
        if input_text is not None:
            # Insert into prefix tree.
            ray.get(
                self._tree_actor.insert.remote(
                    input_text, replica_id.to_full_id_str(), time.time()
                )
            )
When a replica dies, the router uses the on_replica_actor_died callback to remove the replica’s entries from the shared prefix tree:
def on_replica_actor_died(self, replica_id: ReplicaID):
    """Drop replica from replica set so it's not considered for future requests."""
    super().on_replica_actor_died(replica_id)
    ray.get(self._tree_actor.remove_tenants.remote([replica_id.to_full_id_str()]))
Mixin components#
The PrefixCacheAffinityRouter
inherits from two mixins. For more details about these and other available mixins, see Utility mixins for request routing. The router uses these mixins to optimize the list of candidate replicas against which it calculates prefix cache hit rate.
The LocalityMixin provides locality-aware routing to optimize network latency by preferring replicas on the same node. The MultiplexMixin enables model multiplexing support by tracking which models are loaded on each replica and routing requests to replicas that already have the requested model in memory.
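A custom router composes these mixins through multiple inheritance, roughly as in the hypothetical sketch below. The class name, import path, and the comment about mixin helpers are assumptions based on the utility mixins guide; check the current API before relying on them:

from typing import List, Optional

# Assumed import paths; verify against your Ray version and the utility
# mixins guide.
from ray.serve.request_router import (
    LocalityMixin,
    MultiplexMixin,
    PendingRequest,
    RequestRouter,
    RunningReplica,
)


class MyPrefixAwareRouter(LocalityMixin, MultiplexMixin, RequestRouter):
    """Hypothetical router composing the same mixins as PrefixCacheAffinityRouter."""

    async def choose_replicas(
        self,
        candidate_replicas: List[RunningReplica],
        pending_request: Optional[PendingRequest] = None,
    ) -> List[List[RunningReplica]]:
        # The mixins contribute helpers that narrow `candidate_replicas` to
        # same-node replicas and to replicas that already hold the requested
        # multiplexed model (see the utility mixins guide for their exact APIs).
        # A real router applies them here before running prefix matching.
        return [list(candidate_replicas)]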
Configuration parameters#
The PrefixCacheAffinityRouter provides several configuration parameters to tune its behavior:
Core routing parameters#
imbalanced_threshold (default: 10): Queue length difference threshold for considering load balanced. Lower values prioritize load balancing over cache locality.
match_rate_threshold (default: 0.1): Minimum prefix match rate (0.0-1.0) required to use prefix cache-aware routing. Higher values require stronger prefix matches before routing for cache locality.
Memory management parameters#
do_eviction (default: False): Enable automatic eviction of old prefix tree entries to approximate the LLM engine’s eviction policy.
eviction_threshold_chars (default: 400,000): Maximum number of characters in the prefix tree before eviction is triggered.
eviction_target_chars (default: 360,000): Target number of characters to reduce the prefix tree to during eviction.
eviction_interval_secs (default: 10): Interval in seconds between eviction checks when eviction is enabled.
Deploying LLM applications with prefix cache-aware routing#
Deploy an LLM application using the prefix cache-aware request router as follows:
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app
from ray.serve.llm.request_router import PrefixCacheAffinityRouter

llm_config = LLMConfig(
    model_loading_config={
        "model_id": "qwen-0.5b",
        "model_source": "Qwen/Qwen2.5-0.5B-Instruct",
    },
    deployment_config={
        "autoscaling_config": {
            "min_replicas": 4,
            "max_replicas": 4,
        },
        "request_router_config": {
            "request_router_class": PrefixCacheAffinityRouter,
            "request_router_kwargs": {
                "imbalanced_threshold": 5,  # More aggressive load balancing
                "match_rate_threshold": 0.15,  # Require 15% match rate
                "do_eviction": True,  # Enable memory management
                "eviction_threshold_chars": 500_000,
                "eviction_target_chars": 400_000,
                "eviction_interval_secs": 30,
            },
        },
    },
    runtime_env={"env_vars": {"VLLM_USE_V1": "1"}},
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)
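Once the application is running, you can exercise prefix cache-aware routing from a separate process by sending requests that share a long prefix, such as a common system prompt. The client sketch below assumes the default Serve HTTP address and the OpenAI-compatible endpoint that build_openai_app exposes; the prompts themselves are made up for illustration:

from openai import OpenAI

# Assumes Serve is listening on the default address; adjust if needed.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

shared_system_prompt = (
    "You are a concise assistant for Acme Corp. Answer in at most two sentences."
)

# Requests that share this system prompt share a KV-cache prefix, so the
# router tends to send them to the same replica.
for question in ["How do I reset my password?", "How do I change my email?"]:
    response = client.chat.completions.create(
        model="qwen-0.5b",
        messages=[
            {"role": "system", "content": shared_system_prompt},
            {"role": "user", "content": question},
        ],
    )
    print(response.choices[0].message.content)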