Prefill/decode disaggregation#

Deploy LLMs with separated prefill and decode phases for better resource utilization and cost optimization.

Warning

This feature requires vLLM v1, which is the default engine. For legacy deployments using vLLM v0, upgrade to v1 first.

Prefill/decode disaggregation separates the prefill phase (processing input prompts) from the decode phase (generating tokens). This separation provides:

  • Independent optimization: You can optimize prefill separately from decode with different configurations.

  • Reduced interference: When prefill and decode share the same engine, long prefill operations can stall in-progress token generation and vice versa, causing latency spikes under bursty traffic. Disaggregation removes this contention.

  • Independent scaling: You can scale each phase independently based on demand.

  • Cost optimization: You can use different node types for different workloads, taking advantage of heterogeneous clusters.

vLLM provides several KV transfer backends for disaggregated serving:

  1. NIXLConnector: Network-based KV cache transfer using NVIDIA Inference Xfer Library (NIXL) with support for various backends such as UCX, libfabric, and EFA. Simple setup with minimal configuration.

  2. LMCacheConnectorV1: Advanced caching solution with support for various storage backends, including integration with NIXL. A minimal configuration sketch for each connector follows this list.
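
Both connectors are selected through the kv_transfer_config entry in engine_kwargs. The following minimal sketch shows the shape of that entry for each connector; the values mirror the full examples later on this page.

# Shape of engine_kwargs["kv_transfer_config"] for each connector.
# Values mirror the full examples later on this page.
nixl_kv_transfer_config = {
    "kv_connector": "NixlConnector",
    "kv_role": "kv_both",
}

lmcache_kv_transfer_config = {
    "kv_connector": "LMCacheConnectorV1",
    "kv_role": "kv_producer",  # "kv_consumer" on the decode side
    "kv_connector_extra_config": {
        "discard_partial_chunks": False,
        "lmcache_rpc_port": "producer1",
    },
}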

When to use prefill/decode disaggregation#

Consider this pattern when:

  • You have variable workload patterns with different resource needs for prefill vs decode.

  • You want to optimize costs by using different hardware for different phases.

  • Your application has high throughput requirements that benefit from decoupling prefill and decode.

Deploy with NIXLConnector#

NIXLConnector provides network-based KV cache transfer between prefill and decode servers with minimal configuration.

Prerequisites#

If you use ray-project/ray-llm Docker images, NIXL is already installed. Otherwise, install it:

uv pip install nixl

The NIXL wheel comes bundled with its supported backends (UCX, libfabric, EFA, and so on). These bundled binaries may not be the latest versions available for your hardware and network stack. If you need newer versions, build NIXL from source against the target backend library. See the NIXL installation guide for details.

Basic deployment#

The following example shows how to deploy with NIXLConnector:

from ray import serve
from ray.serve.llm import LLMConfig, build_pd_openai_app

# Configure prefill instance
prefill_config = LLMConfig(
    model_loading_config={
        "model_id": "meta-llama/Llama-3.1-8B-Instruct"
    },
    engine_kwargs={
        "kv_transfer_config": {
            "kv_connector": "NixlConnector",
            "kv_role": "kv_both",
        }
    }
)

# Configure decode instance
decode_config = LLMConfig(
    model_loading_config={
        "model_id": "meta-llama/Llama-3.1-8B-Instruct"
    },
    engine_kwargs={
        "kv_transfer_config": {
            "kv_connector": "NixlConnector",
            "kv_role": "kv_both",
        }
    }
)

pd_config = dict(
    prefill_config=prefill_config,
    decode_config=decode_config,
)

app = build_pd_openai_app(pd_config)
serve.run(app)
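
After the application starts, you can send requests to the OpenAI-compatible endpoint. The following is a minimal sketch that assumes the default Ray Serve HTTP address (http://localhost:8000) and the openai client package; the API key is a placeholder because the client requires a value even if your deployment doesn't enforce authentication.

from openai import OpenAI

# Point the OpenAI client at the Ray Serve endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What is prefill/decode disaggregation?"}],
    max_tokens=100,
)
print(response.choices[0].message.content)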

Production YAML configuration#

For production deployments, use a YAML configuration file:

# Example: Basic NIXLConnector configuration for prefill/decode disaggregation
# nixl_config.yaml

applications:
  - args:
      prefill_config:
        model_loading_config:
          model_id: meta-llama/Llama-3.1-8B-Instruct
        engine_kwargs:
          kv_transfer_config:
            kv_connector: NixlConnector
            kv_role: kv_producer
            engine_id: engine1
        deployment_config:
          autoscaling_config:
            min_replicas: 2
            max_replicas: 4
      
      decode_config:
        model_loading_config:
          model_id: meta-llama/Llama-3.1-8B-Instruct
        engine_kwargs:
          kv_transfer_config:
            kv_connector: NixlConnector
            kv_role: kv_consumer
            engine_id: engine2
        deployment_config:
          autoscaling_config:
            min_replicas: 6
            max_replicas: 10

    import_path: ray.serve.llm:build_pd_openai_app
    name: pd-disaggregation-nixl
    route_prefix: "/"

Deploy with:

serve deploy nixl_config.yaml

Configuration parameters#

  • kv_connector: Set to "NixlConnector" to use NIXL.

  • kv_role: Set to "kv_both" so an instance can act as either side of the transfer. You can also pin the roles explicitly, as in the production YAML above, with "kv_producer" for prefill and "kv_consumer" for decode.

  • engine_id: Optional identifier that distinguishes engine instances, as shown in the production YAML.

Deploy with LMCacheConnectorV1#

LMCacheConnectorV1 provides advanced caching with support for multiple storage backends.

Prerequisites#

Install LMCache:

uv pip install lmcache

Scenario 1: LMCache with NIXL backend#

This configuration uses LMCache with a NIXL-based storage backend for network communication.

The following is an example Ray Serve configuration for LMCache with NIXL:

# Example: LMCacheConnectorV1 with NIXL backend configuration

applications:
  - args:
      prefill_config:
        model_loading_config:
          model_id: meta-llama/Llama-3.1-8B-Instruct
        engine_kwargs:
          kv_transfer_config:
            kv_connector: LMCacheConnectorV1
            kv_role: kv_producer
            kv_connector_extra_config:
              discard_partial_chunks: false
              lmcache_rpc_port: producer1
        deployment_config:
          autoscaling_config:
            min_replicas: 2
            max_replicas: 2
        runtime_env:
          env_vars:
            LMCACHE_CONFIG_FILE: lmcache_prefiller.yaml
            LMCACHE_USE_EXPERIMENTAL: "True"

      decode_config:
        model_loading_config:
          model_id: meta-llama/Llama-3.1-8B-Instruct
        engine_kwargs:
          kv_transfer_config:
            kv_connector: LMCacheConnectorV1
            kv_role: kv_consumer
            kv_connector_extra_config:
              discard_partial_chunks: false
              lmcache_rpc_port: consumer1
        deployment_config:
          autoscaling_config:
            min_replicas: 6
            max_replicas: 6
        runtime_env:
          env_vars:
            LMCACHE_CONFIG_FILE: lmcache_decoder.yaml
            LMCACHE_USE_EXPERIMENTAL: "True"

    import_path: ray.serve.llm:build_pd_openai_app
    name: pd-disaggregation-lmcache-nixl
    route_prefix: "/"

Create the LMCache configuration for the prefill instance (lmcache_prefiller.yaml):

local_cpu: False
max_local_cpu_size: 0
max_local_disk_size: 0
remote_serde: NULL

enable_nixl: True
nixl_role: "sender"
nixl_receiver_host: "localhost"
nixl_receiver_port: 55555
nixl_buffer_size: 1073741824 # 1GB
nixl_buffer_device: "cuda"
nixl_enable_gc: True

Create the LMCache configuration for the decode instance (lmcache_decoder.yaml):

local_cpu: False
max_local_cpu_size: 0
max_local_disk_size: 0
remote_serde: NULL

enable_nixl: True
nixl_role: "receiver"
nixl_receiver_host: "localhost"
nixl_receiver_port: 55555
nixl_buffer_size: 1073741824 # 1GB
nixl_buffer_device: "cuda"
nixl_enable_gc: True

Note

The LMCACHE_CONFIG_FILE environment variable must point to an existing configuration file that’s accessible within the Ray Serve container or worker environment. Ensure these configuration files are properly mounted or available in your deployment environment.
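
If you build the application in Python instead of YAML, you can express the same scenario with LLMConfig directly. The following is a minimal sketch of the prefill side only, mirroring the YAML above; the decode side is identical except for kv_role, the lmcache_rpc_port value, and LMCACHE_CONFIG_FILE pointing to lmcache_decoder.yaml.

from ray.serve.llm import LLMConfig

# Prefill side of Scenario 1; the decode side mirrors this with
# kv_role="kv_consumer" and LMCACHE_CONFIG_FILE=lmcache_decoder.yaml.
prefill_config = LLMConfig(
    model_loading_config={"model_id": "meta-llama/Llama-3.1-8B-Instruct"},
    engine_kwargs={
        "kv_transfer_config": {
            "kv_connector": "LMCacheConnectorV1",
            "kv_role": "kv_producer",
            "kv_connector_extra_config": {
                "discard_partial_chunks": False,
                "lmcache_rpc_port": "producer1",
            },
        }
    },
    runtime_env={
        "env_vars": {
            # The file must be accessible from the Serve worker environment.
            "LMCACHE_CONFIG_FILE": "lmcache_prefiller.yaml",
            "LMCACHE_USE_EXPERIMENTAL": "True",
        }
    },
)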

Scenario 2: LMCache with Mooncake store backend#

This configuration uses LMCache with Mooncake store, a high-performance distributed storage system.

The following is an example Ray Serve configuration for LMCache with Mooncake:

# Example: LMCacheConnectorV1 with Mooncake store configuration

applications:
  - args:
      prefill_config:
        model_loading_config:
          model_id: meta-llama/Llama-3.1-8B-Instruct
        engine_kwargs:
          kv_transfer_config: &kv_transfer_config
            kv_connector: LMCacheConnectorV1
            kv_role: kv_both
        deployment_config:
          autoscaling_config:
            min_replicas: 2
            max_replicas: 2
        runtime_env: &runtime_env
          env_vars:
            LMCACHE_CONFIG_FILE: lmcache_mooncake.yaml
            LMCACHE_USE_EXPERIMENTAL: "True"

      decode_config:
        model_loading_config:
          model_id: meta-llama/Llama-3.1-8B-Instruct
        engine_kwargs:
          kv_transfer_config: *kv_transfer_config
        deployment_config:
          autoscaling_config:
            min_replicas: 1
            max_replicas: 1
        runtime_env: *runtime_env

    import_path: ray.serve.llm:build_pd_openai_app
    name: pd-disaggregation-lmcache-mooncake
    route_prefix: "/"

Create the LMCache configuration for Mooncake (lmcache_mooncake.yaml):

# LMCache configuration for Mooncake store backend
chunk_size: 256
local_device: "cpu"
remote_url: "mooncakestore://storage-server:49999/"
remote_serde: "naive"
pipelined_backend: false
local_cpu: false
max_local_cpu_size: 5
extra_config:
  local_hostname: "compute-node-001"
  metadata_server: "etcd://metadata-server:2379"
  protocol: "rdma"
  device_name: "rdma0"
  master_server_address: "storage-server:49999"
  global_segment_size: 3355443200  # 3.125 GB
  local_buffer_size: 1073741824    # 1 GB
  transfer_timeout: 1

Warning

For Mooncake deployments:

  • Ensure the etcd metadata server is running and accessible at the specified address.

  • Verify that you properly configured RDMA devices and storage servers and that they are accessible.

  • In containerized deployments, mount configuration files with appropriate read permissions (for example, chmod 644).

  • Ensure all referenced hostnames and IP addresses in configuration files are resolvable from the deployment environment.

Configuration parameters#

  • kv_connector: Set to "LMCacheConnectorV1".

  • kv_role: Set to "kv_producer" for prefill and "kv_consumer" for decode (Scenario 1), or "kv_both" for both instances when they share a remote store such as Mooncake (Scenario 2).

  • kv_connector_extra_config: Connector-specific options, such as lmcache_rpc_port and discard_partial_chunks.

  • kv_buffer_size: Size of the KV cache transfer buffer in bytes.

  • LMCACHE_CONFIG_FILE: Environment variable that specifies the path to the LMCache configuration file.

Test your deployment#

Before deploying with LMCacheConnectorV1 and the Mooncake store backend, start the required services:

# Start etcd server if not already running
docker run -d --name etcd-server \
  -p 2379:2379 -p 2380:2380 \
  quay.io/coreos/etcd:latest \
  etcd --listen-client-urls http://0.0.0.0:2379 \
       --advertise-client-urls http://localhost:2379

# For Mooncake backend, start the Mooncake master
# See https://docs.lmcache.ai/kv_cache/mooncake.html for details
mooncake_master --port 49999

Test with a chat completion request:

curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Explain the benefits of prefill/decode disaggregation"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'

Best practices#

  • Choose the right backend: Use NIXLConnector for simpler deployments. Use LMCacheConnectorV1 when you need advanced caching or multiple storage backends.

  • Monitor KV transfer overhead: Ensure that the benefits of disaggregation outweigh the network transfer costs. Monitor latency and throughput.

  • Scale independently: Take advantage of independent scaling by monitoring resource utilization for each phase separately.

  • Test with realistic workloads: Validate performance improvements with your actual traffic patterns before production deployment.

  • Ensure network connectivity: For NIXLConnector, verify that prefill and decode instances can communicate over the network.

  • Secure etcd access: For LMCacheConnectorV1, ensure your etcd server is properly secured and accessible only to authorized services.

Troubleshooting#

Prefill and decode instances can’t communicate#

  • Verify network connectivity between instances with sufficient bandwidth for KV transfer.

  • Check that your network supports the backend you’re using (such as RDMA for high-performance deployments).

  • For NIXLConnector, ensure NIXL is properly installed on all nodes.

  • Verify firewall rules and security groups allow communication between prefill and decode instances.

LMCache configuration not found#

  • Verify the LMCACHE_CONFIG_FILE environment variable points to an existing file.

  • Ensure the configuration file is accessible from the Ray Serve worker environment.

  • Check that the file has appropriate read permissions.

See also#