Prefill/Decode Disaggregation with KV Transfer Backends#
Overview#
Prefill/decode disaggregation is a technique that separates the prefill phase (processing input prompts) from the decode phase (generating tokens). This separation allows for:
Better resource utilization: Prefill operations can use high-memory, high-compute nodes while decode operations can use optimized inference nodes
Improved scalability: Each phase can be scaled independently based on demand
Cost optimization: Different node types can be used for different workloads
vLLM v1 supports two main KV transfer backends:
NIXLConnector: Network-based KV cache transfer using NIXL (NVIDIA Inference Xfer Library). Simple setup with automatic network configuration.
LMCacheConnectorV1: Advanced caching solution with support for various storage backends. Requires an etcd server for metadata coordination between prefill and decode instances.
Prerequisites#
Make sure that you are using vLLM v1 by setting the VLLM_USE_V1=1 environment variable.
For NixlConnector, make sure NIXL is installed. If you use the ray-project/ray-llm images, the dependency is already included.
For LMCacheConnectorV1, also install LMCache:
pip install lmcache
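To confirm that the deployment environment actually has the backend you plan to use, a quick import check is often enough. This is a minimal sketch; it only verifies that the packages can be imported and assumes they are installed in the same environment the Ray Serve workers use:
# Minimal sanity check for the optional KV-transfer dependencies.
# Only the package for your chosen backend needs to succeed.
import importlib

for package in ("nixl", "lmcache"):
    try:
        importlib.import_module(package)
        print(f"{package}: OK")
    except ImportError as err:
        print(f"{package}: not available ({err})")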
NIXLConnector Backend#
The NIXLConnector provides network-based KV cache transfer between prefill and decode servers using a side channel communication mechanism.
Basic Configuration#
from ray import serve
from ray.serve.llm import LLMConfig, build_pd_openai_app

# Prefill configuration
prefill_config = LLMConfig(
    model_loading_config={
        "model_id": "meta-llama/Llama-3.1-8B-Instruct"
    },
    engine_kwargs={
        "kv_transfer_config": {
            "kv_connector": "NixlConnector",
            "kv_role": "kv_both",
            "engine_id": "engine1"
        }
    }
)

# Decode configuration
decode_config = LLMConfig(
    model_loading_config={
        "model_id": "meta-llama/Llama-3.1-8B-Instruct"
    },
    engine_kwargs={
        "kv_transfer_config": {
            "kv_connector": "NixlConnector",
            "kv_role": "kv_both",
            "engine_id": "engine2"
        }
    }
)

pd_config = dict(
    prefill_config=prefill_config,
    decode_config=decode_config,
)

app = build_pd_openai_app(pd_config)
serve.run(app)
Complete YAML Configuration Example#
Here’s a complete configuration file for NIXLConnector:
# Example: Basic NIXLConnector configuration for prefill/decode disaggregation
applications:
- args:
    prefill_config:
      model_loading_config:
        model_id: meta-llama/Llama-3.1-8B-Instruct
      engine_kwargs:
        kv_transfer_config:
          kv_connector: NixlConnector
          kv_role: kv_producer
          engine_id: engine1
      deployment_config:
        autoscaling_config:
          min_replicas: 2
          max_replicas: 4
    decode_config:
      model_loading_config:
        model_id: meta-llama/Llama-3.1-8B-Instruct
      engine_kwargs:
        kv_transfer_config:
          kv_connector: NixlConnector
          kv_role: kv_consumer
          engine_id: engine2
      deployment_config:
        autoscaling_config:
          min_replicas: 6
          max_replicas: 10
  import_path: ray.serve.llm:build_pd_openai_app
  name: pd-disaggregation-nixl
  route_prefix: "/"
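In this layout the prefill side uses kv_producer and the decode side kv_consumer. A small script can catch role or indentation mistakes before deploying; this sketch assumes the file is saved as pd_config.yaml (the filename is illustrative):
# Illustrative check that the prefill/decode kv_roles in the config pair up as expected.
# Assumes the YAML above is saved as "pd_config.yaml" (hypothetical filename).
import yaml

with open("pd_config.yaml") as f:
    config = yaml.safe_load(f)

args = config["applications"][0]["args"]
prefill_role = args["prefill_config"]["engine_kwargs"]["kv_transfer_config"]["kv_role"]
decode_role = args["decode_config"]["engine_kwargs"]["kv_transfer_config"]["kv_role"]
print(f"prefill kv_role: {prefill_role}, decode kv_role: {decode_role}")
assert (prefill_role, decode_role) == ("kv_producer", "kv_consumer")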
LMCacheConnectorV1 Backend#
LMCacheConnectorV1 provides a more advanced caching solution with support for multiple storage backends and enhanced performance features.
Scenario 1: LMCache with NIXL Backend#
This configuration uses LMCache with a NIXL-based storage backend for network communication.
# Example: LMCacheConnectorV1 with NIXL backend configuration
applications:
- args:
    prefill_config:
      model_loading_config:
        model_id: meta-llama/Llama-3.1-8B-Instruct
      engine_kwargs:
        kv_transfer_config:
          kv_connector: LMCacheConnectorV1
          kv_role: kv_producer
          kv_connector_extra_config:
            discard_partial_chunks: false
            lmcache_rpc_port: producer1
      deployment_config:
        autoscaling_config:
          min_replicas: 2
          max_replicas: 2
      runtime_env:
        env_vars:
          LMCACHE_CONFIG_FILE: lmcache_prefiller.yaml
          LMCACHE_USE_EXPERIMENTAL: "True"
    decode_config:
      model_loading_config:
        model_id: meta-llama/Llama-3.1-8B-Instruct
      engine_kwargs:
        kv_transfer_config:
          kv_connector: LMCacheConnectorV1
          kv_role: kv_consumer
          kv_connector_extra_config:
            discard_partial_chunks: false
            lmcache_rpc_port: consumer1
      deployment_config:
        autoscaling_config:
          min_replicas: 6
          max_replicas: 6
      runtime_env:
        env_vars:
          LMCACHE_CONFIG_FILE: lmcache_decoder.yaml
          LMCACHE_USE_EXPERIMENTAL: "True"
  import_path: ray.serve.llm:build_pd_openai_app
  name: pd-disaggregation-lmcache-nixl
  route_prefix: "/"
LMCache Configuration for NIXL Backend#
Create lmcache_prefiller.yaml:
local_cpu: False
max_local_cpu_size: 0
max_local_disk_size: 0
remote_serde: NULL
enable_nixl: True
nixl_role: "sender"
nixl_receiver_host: "localhost"
nixl_receiver_port: 55555
nixl_buffer_size: 1073741824 # 1GB
nixl_buffer_device: "cuda"
nixl_enable_gc: True
Create lmcache_decoder.yaml:
local_cpu: False
max_local_cpu_size: 0
max_local_disk_size: 0
remote_serde: NULL
enable_nixl: True
nixl_role: "receiver"
nixl_receiver_host: "localhost"
nixl_receiver_port: 55555
nixl_buffer_size: 1073741824 # 1GB
nixl_buffer_device: "cuda"
nixl_enable_gc: True
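The two files differ only in nixl_role. If you want to keep them in sync, one option is to generate both from shared settings; the following sketch mirrors the values shown above and is only a convenience, not part of the LMCache API:
# Sketch: write lmcache_prefiller.yaml and lmcache_decoder.yaml from shared settings.
# Values mirror the example configs above; adjust host, port, and buffer size as needed.
import yaml

common = {
    "local_cpu": False,
    "max_local_cpu_size": 0,
    "max_local_disk_size": 0,
    "remote_serde": None,
    "enable_nixl": True,
    "nixl_receiver_host": "localhost",
    "nixl_receiver_port": 55555,
    "nixl_buffer_size": 1073741824,  # 1GB
    "nixl_buffer_device": "cuda",
    "nixl_enable_gc": True,
}

for filename, role in [("lmcache_prefiller.yaml", "sender"), ("lmcache_decoder.yaml", "receiver")]:
    with open(filename, "w") as f:
        yaml.safe_dump({**common, "nixl_role": role}, f, sort_keys=False)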
Important: The LMCACHE_CONFIG_FILE environment variable must point to an existing configuration file that is accessible within the Ray Serve container or worker environment. Ensure these configuration files are properly mounted or available in your deployment environment.
Scenario 2: LMCache with Mooncake Store Backend#
This configuration uses LMCache with Mooncake store, a high-performance distributed storage system.
# Example: LMCacheConnectorV1 with Mooncake store configuration
applications:
- args:
    prefill_config:
      model_loading_config:
        model_id: meta-llama/Llama-3.1-8B-Instruct
      engine_kwargs:
        kv_transfer_config: &kv_transfer_config
          kv_connector: LMCacheConnectorV1
          kv_role: kv_both
      deployment_config:
        autoscaling_config:
          min_replicas: 2
          max_replicas: 2
      runtime_env: &runtime_env
        env_vars:
          LMCACHE_CONFIG_FILE: lmcache_mooncake.yaml
          LMCACHE_USE_EXPERIMENTAL: "True"
    decode_config:
      model_loading_config:
        model_id: meta-llama/Llama-3.1-8B-Instruct
      engine_kwargs:
        kv_transfer_config: *kv_transfer_config
      deployment_config:
        autoscaling_config:
          min_replicas: 1
          max_replicas: 1
      runtime_env: *runtime_env
  import_path: ray.serve.llm:build_pd_openai_app
  name: pd-disaggregation-lmcache-mooncake
  route_prefix: "/"
LMCache Configuration for Mooncake Store#
Create lmcache_mooncake.yaml:
# LMCache configuration for Mooncake store backend
chunk_size: 256
local_device: "cpu"
remote_url: "mooncakestore://storage-server:49999/"
remote_serde: "naive"
pipelined_backend: false
local_cpu: false
max_local_cpu_size: 5
extra_config:
  local_hostname: "compute-node-001"
  metadata_server: "etcd://metadata-server:2379"
  protocol: "rdma"
  device_name: "rdma0"
  master_server_address: "storage-server:49999"
  global_segment_size: 3355443200  # 3.125 GB
  local_buffer_size: 1073741824  # 1 GB
  transfer_timeout: 1
Important Notes:
The LMCACHE_CONFIG_FILE environment variable must point to an existing configuration file that is accessible within the Ray Serve container or worker environment.
For the Mooncake store backend, ensure the etcd metadata server is running and accessible at the specified address (a reachability sketch follows this list).
Verify that RDMA devices and storage servers are properly configured and accessible.
In containerized deployments, mount configuration files with appropriate read permissions (e.g., chmod 644).
Ensure all referenced hostnames and IP addresses in configuration files are resolvable from the deployment environment.
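A basic reachability check before deploying can surface networking problems early. The sketch below uses the example endpoints from lmcache_mooncake.yaml (metadata-server:2379 for etcd and storage-server:49999 for the Mooncake master); substitute your real hostnames and ports:
# Sketch: verify the etcd metadata server and Mooncake master are reachable.
# Hostnames and ports are taken from the example configuration above.
import socket

for host, port in [("metadata-server", 2379), ("storage-server", 49999)]:
    try:
        with socket.create_connection((host, port), timeout=5):
            print(f"{host}:{port} reachable")
    except OSError as err:
        print(f"{host}:{port} NOT reachable: {err}")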
Deployment and Testing#
Deploy the Application#
Start required services (for LMCacheConnectorV1):
# Start etcd server if not already running
docker run -d --name etcd-server \
  -p 2379:2379 -p 2380:2380 \
  quay.io/coreos/etcd:latest \
  etcd --listen-client-urls http://0.0.0.0:2379 \
  --advertise-client-urls http://localhost:2379

# See https://docs.lmcache.ai/kv_cache/mooncake.html for more details.
mooncake_master --port 49999
Save your configuration to a YAML file (e.g., pd_config.yaml).
Deploy using the Ray Serve CLI:
serve deploy pd_config.yaml
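Once the deploy command returns, you can confirm the application is healthy from Python with the ray.serve.status() API before sending traffic:
# Print the status of all Serve applications; look for RUNNING before testing.
from ray import serve

print(serve.status())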
Test the Deployment#
Test with a simple request:
curl -X POST "http://localhost:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "user", "content": "Explain the benefits of prefill/decode disaggregation"}
],
"max_tokens": 100,
"temperature": 0.7
}'
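Because the endpoint is OpenAI-compatible, you can also test it with the openai Python client. The base_url and placeholder api_key below assume a local deployment without authentication:
# Sketch: send the same request through the OpenAI Python client.
# base_url targets the local Serve endpoint; the api_key is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "Explain the benefits of prefill/decode disaggregation"}
    ],
    max_tokens=100,
    temperature=0.7,
)
print(response.choices[0].message.content)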