PrefixCacheAffinityRouter for LLM inference optimization#

Warning

This API is in alpha and may change before becoming stable.

LLM inference can benefit significantly from cache locality optimization. When one replica processes multiple prompts that share a prefix, the engine can reuse previously computed KV-cache entries, reducing computation overhead and improving response times. This technique is known as Automatic Prefix Caching (APC) in vLLM. The PrefixCacheAffinityRouter is designed specifically for this use case.
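For example, requests that share the same long system prompt can reuse the KV-cache entries computed for that prefix when they're routed to the same replica. The prompts below are purely illustrative:

# Two requests whose prompts share a long prefix (the system prompt).
# If both land on the same replica with APC enabled, the engine computes
# the KV cache for the shared prefix only once.
system_prompt = "You are a helpful assistant. Answer concisely and cite sources."

prompt_a = system_prompt + "\nUser: What is the capital of France?"
prompt_b = system_prompt + "\nUser: Summarize the plot of Hamlet."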

This guide covers:

  • Understanding the prefix cache-aware routing algorithm

  • Building the components of a prefix-aware router

  • Configuration parameters and their impact

How Ray Serve LLM prefix cache-aware routing works#

The PrefixCacheAffinityRouter implements a multi-tier routing strategy that balances cache locality with load distribution:

1. Load balance check#

First, the router evaluates whether the current load is balanced across replicas by comparing queue lengths. If the difference between the highest and lowest queue lengths doesn't exceed the imbalanced_threshold, it proceeds with prefix cache-aware routing.
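In simplified form (the actual implementation appears in the load balance detection component later in this guide), the check compares the spread of replica queue lengths against the threshold:

# Illustrative numbers only; see the load balance detection component below
# for the real implementation.
queue_lens = [3, 7, 5, 12]      # ongoing requests per candidate replica
imbalanced_threshold = 10       # default value

is_imbalanced = max(queue_lens) - min(queue_lens) > imbalanced_threshold
# 12 - 3 = 9, which doesn't exceed 10, so the load counts as balanced and
# the router proceeds with prefix cache-aware routing.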

2. Prefix matching strategy#

When load is balanced, the router uses a prefix tree to find replicas that have previously processed similar input text:

  • High Match Rate (≥10%): Routes to replicas with the highest prefix match rate for better cache hit rates

  • Low Match Rate (<10%): Falls back to the replicas with the lowest prefix cache utilization, spreading new prefixes more evenly across replicas

  • No Prefix Data: Uses the default Power of Two Choices selection

3. Imbalanced load fallback#

When load is imbalanced (queue length difference exceeds threshold), the router prioritizes load balancing over cache locality and falls back to the standard Power of Two Choices algorithm.

Prefix tree management#

The router maintains a distributed prefix tree actor (sketched after this list) that:

  • Tracks input text prefixes processed by each replica

  • Supports automatic eviction of old entries to manage memory usage

  • Persists across router instances using Ray’s detached actor pattern
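A minimal sketch of the interface the router relies on is shown below. The class is illustrative, not the actual implementation: the real tree is a multi-tenant approximate radix tree that shares prefixes between entries and also supports eviction and a get_smallest_tenants query, but the insert, prefix_match, and remove_tenants methods correspond to the calls that appear in the code excerpts later in this guide.

import os
from typing import Dict, List, Optional, Tuple

import ray


@ray.remote
class SimplePrefixTreeActor:
    """Illustrative stand-in for the router's prefix tree actor.

    The real implementation is an approximate radix tree; this sketch just
    remembers the raw prompt strings seen by each replica (tenant).
    """

    def __init__(self) -> None:
        # Replica (tenant) ID string -> prompts previously routed to it.
        self._prompts: Dict[str, List[str]] = {}

    def insert(self, text: str, tenant_id: str, timestamp: float) -> None:
        # The timestamp drives eviction in the real tree; it's unused here.
        self._prompts.setdefault(tenant_id, []).append(text)

    def prefix_match(
        self, text: str, tenant_ids: List[str]
    ) -> Tuple[str, Optional[List[str]]]:
        # Return the longest matched prefix and the tenants that produced it.
        best, best_tenants = "", None
        for tenant_id in tenant_ids:
            for stored in self._prompts.get(tenant_id, []):
                shared = os.path.commonprefix([text, stored])
                if len(shared) > len(best):
                    best, best_tenants = shared, [tenant_id]
        return best, best_tenants

    def remove_tenants(self, tenant_ids: List[str]) -> None:
        for tenant_id in tenant_ids:
            self._prompts.pop(tenant_id, None)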

Building prefix-aware router components#

This section breaks down the key components of PrefixCacheAffinityRouter and shows how they work together. For a more basic example, see Use Custom Algorithm for Request Routing.

Base RequestRouter foundation#

Like all custom routers in Ray Serve, the PrefixCacheAffinityRouter extends the base RequestRouter class. The two core methods that define router behavior are:

  • choose_replicas(): The main routing logic that selects which replicas should handle a request

  • on_request_routed(): A callback that updates router state after a request is successfully routed

For a detailed explanation of these methods and their parameters, see the Define simple uniform request router example in the custom request router guide.
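Put together, a bare-bones custom router has roughly the following shape. This is a simplified sketch that assumes the ray.serve.request_router imports shown in that guide; it expresses no routing preference and only exists to show where the two hooks fit:

from typing import List, Optional

from ray.serve.request_router import (
    PendingRequest,
    ReplicaID,
    ReplicaResult,
    RequestRouter,
    RunningReplica,
)


class SkeletonRouter(RequestRouter):
    """Illustrative skeleton showing the two core RequestRouter hooks."""

    async def choose_replicas(
        self,
        candidate_replicas: List[RunningReplica],
        pending_request: Optional[PendingRequest] = None,
    ) -> List[List[RunningReplica]]:
        # Return ranked groups of replicas; Serve tries each group in order.
        # This sketch expresses no preference among the candidates.
        return [candidate_replicas]

    def on_request_routed(
        self,
        pending_request: PendingRequest,
        replica_id: ReplicaID,
        result: ReplicaResult,
    ):
        # Update router-side state (for example, a prefix tree) after the
        # request has been routed to `replica_id`.
        pass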

1. Load balance detection component#

The first component evaluates whether the current load is balanced across replicas:

prefix_aware_router.py#
                # Check for imbalanced load.
                highest_queue_len = 0
                lowest_queue_len = float("inf")
                not_in_cache: List[RunningReplica] = []
                if self._use_replica_queue_len_cache:
                    # Populate available queue lens from the cache.
                    for r in candidate_replicas:
                        queue_len = self._replica_queue_len_cache.get(r.replica_id)
                        if queue_len is None or queue_len >= r.max_ongoing_requests:
                            not_in_cache.append(r)
                        else:
                            highest_queue_len = max(highest_queue_len, queue_len)
                            lowest_queue_len = min(lowest_queue_len, queue_len)
                else:
                    not_in_cache = candidate_replicas
                if len(not_in_cache) > 0:
                    for r, queue_len in await self._probe_queue_lens(
                        not_in_cache,
                        0,
                    ):
                        if queue_len is None:
                            continue
                        highest_queue_len = max(highest_queue_len, queue_len)
                        lowest_queue_len = min(lowest_queue_len, queue_len)

                is_imbalanced = (
                    highest_queue_len - lowest_queue_len > self._imbalanced_threshold
                )

This component prioritizes load balancing over cache locality when replicas become too imbalanced.

2. Prefix tree management component#

The prefix tree component is implemented as a detached Ray actor that manages prefix tracking across the Serve application. The actual tree structure uses a multi-tenant prefix tree (approximate radix tree).

This distributed architecture allows the prefix information to persist across router restarts and be shared among multiple router instances.
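The detached-actor pattern behind that persistence looks roughly like the sketch below. The actor name and class are hypothetical, and the router creates and manages its own tree actor internally, so you don't set one up yourself:

import ray


@ray.remote
class PrefixTree:
    # Illustrative placeholder for the router's prefix tree actor.
    pass


# A named, detached actor outlives the process that created it, so a new
# router instance can reconnect to the same tree by name instead of
# starting over with an empty one.
tree_actor = PrefixTree.options(
    name="prefix_tree",        # Hypothetical name, for illustration only.
    lifetime="detached",
    get_if_exists=True,        # Reuse the actor if it already exists.
).remote()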

3. Prefix matching logic component#

The core prefix matching component implements the routing decision logic in the _prefix_match_best_replicas method. When load is balanced, it performs prefix matching to find the best replica:

prefix_aware_router.py#
                if not is_imbalanced:
                    # Convert candidate replica IDs to strings for prefix matching.
                    candidate_replica_ids_strings = [
                        r.replica_id.to_full_id_str() for r in candidate_replicas
                    ]
                    (matched_text, matched_tenant_id_strings,) = ray.get(
                        self._tree_actor.prefix_match.remote(
                            input_text, candidate_replica_ids_strings
                        )
                    )
                    match_rate = len(matched_text) / len(input_text)
                    if match_rate < self._match_rate_threshold:
                        smallest_tenants_id_strings = ray.get(
                            self._tree_actor.get_smallest_tenants.remote()
                        )
                        if (
                            smallest_tenants_id_strings is not None
                            and len(smallest_tenants_id_strings) > 0
                        ):
                            chosen_replica_id_strings = smallest_tenants_id_strings
                    else:
                        if (
                            matched_tenant_id_strings is not None
                            and len(matched_tenant_id_strings) > 0
                        ):
                            chosen_replica_id_strings = matched_tenant_id_strings

This logic implements the three-tier strategy:

  1. High match rate: Routes to replicas with the highest prefix match when match_rate >= match_rate_threshold

  2. Low match rate: Falls back to replicas with smallest KV-cache usage when match rate is below threshold

  3. No match: Falls back to the default Power of Two Choices selection when _prefix_match_best_replicas returns no prefix-based choice to choose_replicas

4. Integration with Power of Two choices#

The prefix-aware router extends the proven Power of Two Choices algorithm, falling back to it when load is imbalanced or when prefix matching yields no preferred replicas. PrefixCacheAffinityRouter integrates this component in the choose_replicas method:

prefix_aware_router.py#
        # Get fallback replicas from PowerOfTwoChoicesRequestRouter
        fallback_replicas = await PowerOfTwoChoicesRequestRouter.choose_replicas(
            self,
            candidate_replicas=candidate_replicas,
            pending_request=pending_request,
        )
        if pending_request is None or not fallback_replicas:
            return fallback_replicas

5. State management and callbacks#

The router uses the on_request_routed() callback to update the prefix tree with routing decisions:

prefix_aware_router.py#
    def on_request_routed(
        self,
        pending_request: PendingRequest,
        replica_id: ReplicaID,
        result: ReplicaResult,
    ):
        """Called when a request is routed to a replica.

        This is used as a callback to update the state of the request router
        after a response is generated.
        """
        # Only the prompt is inserted into the prefix tree, not the response
        # (streaming responses make inserting the generated text complicated).
        if (
            pending_request is not None
            and pending_request.args is not None
            and len(pending_request.args) > 0
        ):
            input_text = self._extract_text_from_request(pending_request)
            if input_text is not None:
                # Insert into prefix tree
                ray.get(
                    self._tree_actor.insert.remote(
                        input_text, replica_id.to_full_id_str(), time.time()
                    )
                )

When a replica dies, the router uses the on_replica_actor_died callback to remove the replica’s entries from the shared prefix tree:

prefix_aware_router.py#
    def on_replica_actor_died(self, replica_id: ReplicaID):
        """Drop replica from replica set so it's not considered for future requests."""
        super().on_replica_actor_died(replica_id)
        ray.get(self._tree_actor.remove_tenants.remote([replica_id.to_full_id_str()]))

Mixin components#

The PrefixCacheAffinityRouter inherits from two mixins. For more details about these and other available mixins, see Utility mixins for request routing. The router uses these mixins to narrow and rank the candidate replicas before computing prefix match rates against them.

The LocalityMixin provides locality-aware routing to optimize network latency by preferring replicas on the same node. The MultiplexMixin enables model multiplexing support by tracking which models are loaded on each replica and routing requests to replicas that already have the requested model in memory.
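In code, a router opts into this behavior by listing the mixins ahead of the base class. The sketch below only shows the class composition and assumes the mixins are importable from ray.serve.request_router as described in the utility mixins guide:

from typing import List, Optional

from ray.serve.request_router import (
    LocalityMixin,
    MultiplexMixin,
    PendingRequest,
    RequestRouter,
    RunningReplica,
)


class LocalityAndMultiplexAwareRouter(LocalityMixin, MultiplexMixin, RequestRouter):
    """Illustrative composition of the two mixins with a custom router."""

    async def choose_replicas(
        self,
        candidate_replicas: List[RunningReplica],
        pending_request: Optional[PendingRequest] = None,
    ) -> List[List[RunningReplica]]:
        # The mixins track locality and multiplex state through their own
        # callbacks; custom ranking logic over the candidates would go here.
        return [candidate_replicas]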

Configuration parameters#

The PrefixCacheAffinityRouter provides several configuration parameters to tune its behavior:

Core routing parameters#

  • imbalanced_threshold (default: 10): Queue length difference threshold for considering load balanced. Lower values prioritize load balancing over cache locality.

  • match_rate_threshold (default: 0.1): Minimum prefix match rate (0.0-1.0) required to use prefix cache-aware routing. Higher values require stronger prefix matches before routing for cache locality.

Memory management parameters#

  • do_eviction (default: False): Enable automatic eviction of old prefix tree entries to approximate the LLM engine’s eviction policy.

  • eviction_threshold_chars (default: 400,000): Maximum number of characters in the prefix tree before an eviction is triggered.

  • eviction_target_chars (default: 360,000): Target number of characters to reduce the prefix tree to during eviction.

  • eviction_interval_secs (default: 10): Interval in seconds between eviction checks when eviction is enabled.

Deploying LLM applications with prefix cache-aware routing#

Deploy an LLM application using the prefix cache-aware request router as follows:

prefix_aware_example.py#
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app
from ray.serve.llm.request_router import PrefixCacheAffinityRouter

llm_config = LLMConfig(
    model_loading_config={
        "model_id": "qwen-0.5b",
        "model_source": "Qwen/Qwen2.5-0.5B-Instruct",
    },
    deployment_config={
        "autoscaling_config": {
            "min_replicas": 4,
            "max_replicas": 4,
        },
        "request_router_config": {
            "request_router_class": PrefixCacheAffinityRouter,
            "request_router_kwargs": {
                "imbalanced_threshold": 5,  # More aggressive load balancing
                "match_rate_threshold": 0.15,  # Require 15% match rate
                "do_eviction": True,  # Enable memory management
                "eviction_threshold_chars": 500_000,
                "eviction_target_chars": 400_000,
                "eviction_interval_secs": 30,
            },
        },
    },
    runtime_env={"env_vars": {"VLLM_USE_V1": "1"}},
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)
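
Once the application is running, you can exercise the router by sending requests that share a long common prefix, such as the same system prompt. The sketch below uses the OpenAI client and assumes the default Serve HTTP address of http://localhost:8000:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

# Requests sharing this system prompt share a long prefix, so the router
# tries to send them to the same replica to reuse its KV cache.
system_prompt = "You are a helpful assistant. Answer concisely."

for question in ["What is Ray Serve?", "What is vLLM?"]:
    response = client.chat.completions.create(
        model="qwen-0.5b",  # Matches model_id in the LLMConfig above.
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    print(response.choices[0].message.content)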