Prefill/Decode Disaggregation with KV Transfer Backends#

Overview#

Prefill/decode disaggregation is a technique that separates the prefill phase (processing input prompts) from the decode phase (generating tokens). This separation allows for:

  • Better resource utilization: Prefill operations can use high-memory, high-compute nodes while decode operations can use optimized inference nodes

  • Improved scalability: Each phase can be scaled independently based on demand

  • Cost optimization: Different node types can be used for different workloads

vLLM v1 supports two main KV transfer backends:

  1. NIXLConnector: Network-based KV cache transfer using NIXL (NVIDIA Inference Xfer Library). Simple setup with automatic network configuration.

  2. LMCacheConnectorV1: Advanced caching solution with support for various storage backends. Some backends (such as Mooncake store) require an etcd server for metadata coordination between prefill and decode instances.

Prerequisites#

Make sure that you are using vLLM v1 by setting the VLLM_USE_V1=1 environment variable.

For NixlConnector, make sure NIXL is installed. If you use the ray-project/ray-llm images, the dependency is already installed.

For LMCacheConnectorV1, also install LMCache:

pip install lmcache
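
To confirm that these optional dependencies are importable in the environment that runs the Serve replicas, you can run a quick check like the sketch below. It assumes the installed packages expose top-level modules named nixl and lmcache; adjust the names if your installation differs.

# Quick sanity check for the optional KV-transfer dependencies.
# Assumes the packages expose top-level modules named "nixl" and "lmcache".
import importlib.util

for module in ("vllm", "nixl", "lmcache"):
    if importlib.util.find_spec(module) is None:
        print(f"Missing dependency: {module}")
    else:
        print(f"Found: {module}")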

NIXLConnector Backend#

The NIXLConnector provides network-based KV cache transfer between prefill and decode servers using a side channel communication mechanism.

Basic Configuration#

from ray import serve
from ray.serve.llm import LLMConfig, build_pd_openai_app

# Prefill configuration
prefill_config = LLMConfig(
    model_loading_config={
        "model_id": "meta-llama/Llama-3.1-8B-Instruct"
    },
    engine_kwargs={
        "kv_transfer_config": {
            "kv_connector": "NixlConnector",
            "kv_role": "kv_both",
            "engine_id": "engine1"
        }
    }
)

# Decode configuration
decode_config = LLMConfig(
    model_loading_config={
        "model_id": "meta-llama/Llama-3.1-8B-Instruct"
    },
    engine_kwargs={
        "kv_transfer_config": {
            "kv_connector": "NixlConnector",
            "kv_role": "kv_both",
            "engine_id": "engine2"
        }
    }
)

pd_config = dict(
    prefill_config=prefill_config,
    decode_config=decode_config,
)

app = build_pd_openai_app(pd_config)
serve.run(app)

Complete YAML Configuration Example#

Here’s a complete configuration file for NIXLConnector:

# Example: Basic NIXLConnector configuration for prefill/decode disaggregation

applications:
  - args:
      prefill_config:
        model_loading_config:
          model_id: meta-llama/Llama-3.1-8B-Instruct
        engine_kwargs:
          kv_transfer_config:
            kv_connector: NixlConnector
            kv_role: kv_producer
            engine_id: engine1
        deployment_config:
          autoscaling_config:
            min_replicas: 2
            max_replicas: 4
      
      decode_config:
        model_loading_config:
          model_id: meta-llama/Llama-3.1-8B-Instruct
        engine_kwargs:
          kv_transfer_config:
            kv_connector: NixlConnector
            kv_role: kv_consumer
            engine_id: engine2
        deployment_config:
          autoscaling_config:
            min_replicas: 6
            max_replicas: 10

    import_path: ray.serve.llm:build_pd_openai_app
    name: pd-disaggregation-nixl
    route_prefix: "/"

LMCacheConnectorV1 Backend#

LMCacheConnectorV1 provides a more advanced caching solution with support for multiple storage backends and enhanced performance features.

Scenario 1: LMCache with NIXL Backend#

This configuration uses LMCache with a NIXL-based storage backend for network communication.

# Example: LMCacheConnectorV1 with NIXL backend configuration

applications:
  - args:
      prefill_config:
        model_loading_config:
          model_id: meta-llama/Llama-3.1-8B-Instruct
        engine_kwargs:
          kv_transfer_config:
            kv_connector: LMCacheConnectorV1
            kv_role: kv_producer
            kv_connector_extra_config:
              discard_partial_chunks: false
              lmcache_rpc_port: producer1
        deployment_config:
          autoscaling_config:
            min_replicas: 2
            max_replicas: 2
        runtime_env:
          env_vars:
            LMCACHE_CONFIG_FILE: lmcache_prefiller.yaml
            LMCACHE_USE_EXPERIMENTAL: "True"

      decode_config:
        model_loading_config:
          model_id: meta-llama/Llama-3.1-8B-Instruct
        engine_kwargs:
          kv_transfer_config:
            kv_connector: LMCacheConnectorV1
            kv_role: kv_consumer
            kv_connector_extra_config:
              discard_partial_chunks: false
              lmcache_rpc_port: consumer1
        deployment_config:
          autoscaling_config:
            min_replicas: 6
            max_replicas: 6
        runtime_env:
          env_vars:
            LMCACHE_CONFIG_FILE: lmcache_decoder.yaml
            LMCACHE_USE_EXPERIMENTAL: "True"

    import_path: ray.serve.llm:build_pd_openai_app
    name: pd-disaggregation-lmcache-nixl
    route_prefix: "/"

LMCache Configuration for NIXL Backend#

Create lmcache_prefiller.yaml:

local_cpu: False
max_local_cpu_size: 0
max_local_disk_size: 0
remote_serde: NULL

enable_nixl: True
nixl_role: "sender"
nixl_receiver_host: "localhost"
nixl_receiver_port: 55555
nixl_buffer_size: 1073741824 # 1GB
nixl_buffer_device: "cuda"
nixl_enable_gc: True

Create lmcache_decoder.yaml:

local_cpu: False
max_local_cpu_size: 0
max_local_disk_size: 0
remote_serde: NULL

enable_nixl: True
nixl_role: "receiver"
nixl_receiver_host: "localhost"
nixl_receiver_port: 55555
nixl_buffer_size: 1073741824 # 1GB
nixl_buffer_device: "cuda"
nixl_enable_gc: True

Important: The LMCACHE_CONFIG_FILE environment variable must point to an existing configuration file that is accessible within the Ray Serve container or worker environment. Ensure these configuration files are properly mounted or available in your deployment environment.
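
One way to satisfy this requirement is to ship the configuration files with the application through runtime_env. The sketch below is a minimal illustration; it assumes the YAML files live in a local lmcache_configs/ directory (a hypothetical path) and that LMCache resolves a relative LMCACHE_CONFIG_FILE against the worker's working directory. The same working_dir and env_vars keys can also be set in the YAML runtime_env blocks above.

# Minimal sketch: ship the LMCache config files to every replica via runtime_env.
# "lmcache_configs/" is a hypothetical local directory holding the YAML files.
from ray.serve.llm import LLMConfig

prefill_config = LLMConfig(
    model_loading_config={"model_id": "meta-llama/Llama-3.1-8B-Instruct"},
    engine_kwargs={
        "kv_transfer_config": {
            "kv_connector": "LMCacheConnectorV1",
            "kv_role": "kv_producer",
        }
    },
    runtime_env={
        # Upload the directory containing the LMCache YAML files ...
        "working_dir": "lmcache_configs",
        # ... and reference the file relative to that working directory.
        "env_vars": {
            "LMCACHE_CONFIG_FILE": "lmcache_prefiller.yaml",
            "LMCACHE_USE_EXPERIMENTAL": "True",
        },
    },
)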

Scenario 2: LMCache with Mooncake Store Backend#

This configuration uses LMCache with Mooncake store, a high-performance distributed storage system.

# Example: LMCacheConnectorV1 with Mooncake store configuration

applications:
  - args:
      prefill_config:
        model_loading_config:
          model_id: meta-llama/Llama-3.1-8B-Instruct
        engine_kwargs:
          kv_transfer_config: &kv_transfer_config
            kv_connector: LMCacheConnectorV1
            kv_role: kv_both
        deployment_config:
          autoscaling_config:
            min_replicas: 2
            max_replicas: 2
        runtime_env: &runtime_env
          env_vars:
            LMCACHE_CONFIG_FILE: lmcache_mooncake.yaml
            LMCACHE_USE_EXPERIMENTAL: "True"

      decode_config:
        model_loading_config:
          model_id: meta-llama/Llama-3.1-8B-Instruct
        engine_kwargs:
          kv_transfer_config: *kv_transfer_config
        deployment_config:
          autoscaling_config:
            min_replicas: 1
            max_replicas: 1
        runtime_env: *runtime_env

    import_path: ray.serve.llm:build_pd_openai_app
    name: pd-disaggregation-lmcache-mooncake
    route_prefix: "/"

LMCache Configuration for Mooncake Store#

Create lmcache_mooncake.yaml:

# LMCache configuration for Mooncake store backend
chunk_size: 256
local_device: "cpu"
remote_url: "mooncakestore://storage-server:49999/"
remote_serde: "naive"
pipelined_backend: false
local_cpu: false
max_local_cpu_size: 5
extra_config:
  local_hostname: "compute-node-001"
  metadata_server: "etcd://metadata-server:2379"
  protocol: "rdma"
  device_name: "rdma0"
  master_server_address: "storage-server:49999"
  global_segment_size: 3355443200  # 3.125 GB
  local_buffer_size: 1073741824    # 1 GB
  transfer_timeout: 1

Important Notes:

  • The LMCACHE_CONFIG_FILE environment variable must point to an existing configuration file that is accessible within the Ray Serve container or worker environment.

  • For the Mooncake store backend, ensure the etcd metadata server is running and accessible at the specified address (a basic connectivity check is sketched after this list).

  • Verify that RDMA devices and storage servers are properly configured and accessible.

  • In containerized deployments, mount configuration files with appropriate read permissions (e.g., chmod 644).

  • Ensure all referenced hostnames and IP addresses in configuration files are resolvable from the deployment environment.
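
Before deploying, it can help to verify basic reachability of the services and hostnames referenced in the configuration files. The sketch below checks DNS resolution and TCP connectivity from the node that will run the replicas; the host names and ports are the example values used above and should be replaced with your own.

# Basic reachability checks for the example endpoints used above.
# Replace the host/port values with the ones from your own configuration.
import socket

ENDPOINTS = {
    "etcd metadata server": ("metadata-server", 2379),
    "Mooncake master": ("storage-server", 49999),
}

for name, (host, port) in ENDPOINTS.items():
    try:
        socket.gethostbyname(host)  # DNS resolution
        with socket.create_connection((host, port), timeout=5):
            print(f"{name}: reachable at {host}:{port}")
    except OSError as exc:
        print(f"{name}: NOT reachable at {host}:{port} ({exc})")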

Deployment and Testing#

Deploy the Application#

  1. Start required services (for LMCacheConnectorV1):

    # Start etcd server if not already running
    docker run -d --name etcd-server \
      -p 2379:2379 -p 2380:2380 \
      quay.io/coreos/etcd:latest \
      etcd --listen-client-urls http://0.0.0.0:2379 \
           --advertise-client-urls http://localhost:2379
    
    # See https://docs.lmcache.ai/kv_cache/mooncake.html for more details.
    mooncake_master --port 49999
    
  2. Save your configuration to a YAML file (e.g., pd_config.yaml)

  3. Deploy using Ray Serve CLI:

    serve deploy pd_config.yaml
    

Test the Deployment#

Test with a simple request:

curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Explain the benefits of prefill/decode disaggregation"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'
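
Because the application exposes an OpenAI-compatible API, you can send the same request with the OpenAI Python client. This is a minimal sketch that assumes the default Serve HTTP address http://localhost:8000 and that no API key is enforced.

# Query the deployment through the OpenAI-compatible endpoint.
# Assumes the default Serve HTTP address and no API key enforcement.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "Explain the benefits of prefill/decode disaggregation"}
    ],
    max_tokens=100,
    temperature=0.7,
)
print(response.choices[0].message.content)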