Data parallel attention#
Deploy LLMs with data parallel attention for increased throughput and better resource utilization, especially for sparse MoE (Mixture of Experts) models.
Data parallel attention creates multiple coordinated inference engine replicas that process requests in parallel. This pattern is most effective when combined with expert parallelism for sparse MoE models, where attention (QKV) layers are replicated across replicas while MoE experts are sharded. This separation provides:
Increased throughput: Process more concurrent requests by distributing them across multiple replicas.
Better resource utilization: Especially beneficial for sparse MoE models where not all experts are active for each request.
KV cache scalability: Add more KV cache capacity across replicas to handle larger batch sizes.
Expert saturation: Achieve higher effective batch sizes during decoding to better saturate MoE layers.
When to use data parallel attention#
Consider this pattern when:
Sparse MoE models with MLA: You’re serving models with Multi-head Latent Attention (MLA) where KV cache can’t be sharded along the head dimension. MLA reduces KV cache memory requirements, making data parallel replication more efficient.
High throughput requirements: You need to serve many concurrent requests and want to maximize throughput.
KV-cache limited: Adding more KV cache capacity increases throughput, and data parallel attention effectively increases KV cache capacity across replicas.
When not to use data parallel attention:
Low to medium throughput: If you can’t saturate the MoE layers, data parallel attention adds unnecessary complexity.
Non-MoE models: The main benefit is lifting effective batch size for saturating experts, which doesn’t apply to dense models.
Sufficient tensor parallelism: For models with GQA (Grouped Query Attention), use tensor parallelism (TP) first to shard the KV cache, up to TP_size <= num_kv_heads. Beyond that, TP requires KV cache replication, at which point data parallel attention becomes the better choice.
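The rule of thumb above can be sketched as a small helper. This is purely illustrative (the function name and return shape are hypothetical, not part of Ray Serve or vLLM): shard with TP up to the number of KV heads, then scale further with DP.

```python
def choose_parallelism(num_kv_heads: int, target_gpus: int) -> dict:
    """Illustrative heuristic: shard the KV cache with TP up to
    num_kv_heads, then add remaining scale with data parallelism."""
    tp = min(target_gpus, num_kv_heads)  # keep TP_size <= num_kv_heads
    dp = target_gpus // tp               # remaining scale through DP
    return {"tensor_parallel_size": tp, "data_parallel_size": dp}

# A GQA model with 8 KV heads on 16 GPUs: TP covers 8-way sharding,
# and data parallel attention provides the other factor of 2.
print(choose_parallelism(num_kv_heads=8, target_gpus=16))
```

Real deployments also weigh memory per GPU and interconnect topology, so treat this only as a starting point.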
Basic deployment#
The following example shows how to deploy with data parallel attention. Each data parallel deployment requires num_replicas * data_parallel_size * tensor_parallel_size GPUs.
from ray import serve
from ray.serve.llm import LLMConfig, build_dp_openai_app

# Configure the model with data parallel settings
config = LLMConfig(
    model_loading_config={
        "model_id": "microsoft/Phi-tiny-MoE-instruct"
    },
    deployment_config={
        "num_replicas": 2
    },
    engine_kwargs={
        "data_parallel_size": 2,  # Number of DP replicas
        "tensor_parallel_size": 1,  # TP size per replica
        # Reduced for CI compatibility
        "max_model_len": 1024,
        "max_num_seqs": 32,
    },
)

app = build_dp_openai_app({
    "llm_config": config
})

serve.run(app, blocking=True)
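Applying the formula above to this example, the GPU requirement works out as follows:

```python
# GPUs required = num_replicas * data_parallel_size * tensor_parallel_size
num_replicas = 2          # from deployment_config
data_parallel_size = 2    # from engine_kwargs
tensor_parallel_size = 1  # from engine_kwargs

total_gpus = num_replicas * data_parallel_size * tensor_parallel_size
print(total_gpus)  # 4
```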
Production YAML configuration#
For production deployments, use a declarative YAML configuration file:
applications:
- name: dp_llm_app
  route_prefix: /
  import_path: ray.serve.llm:build_dp_openai_app
  args:
    llm_config:
      model_loading_config:
        model_id: Qwen/Qwen2.5-0.5B-Instruct
      deployment_config:
        num_replicas: 2
      engine_kwargs:
        data_parallel_size: 4
        tensor_parallel_size: 2
Deploy with CLI:
serve deploy dp_config.yaml
Configuration parameters#
Required parameters#
data_parallel_size: Number of data parallel replicas within a data parallel group. Must be a positive integer and passed in via engine_kwargs.
Deployment configuration#
num_replicas: Can be set to any positive integer, unset (defaults to 1), or "auto" to enable autoscaling based on request queue length.
Note
Within a data parallel deployment, num_replicas under deployment_config refers to the number of data parallel groups. This translates to num_replicas * data_parallel_size data parallel replicas, which equals the number of Ray Serve replicas. Each data parallel replica runs a vLLM data parallel server.
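Applied to the YAML example above (num_replicas: 2, data_parallel_size: 4, tensor_parallel_size: 2), the accounting in this note looks like:

```python
num_replicas = 2          # data parallel groups
data_parallel_size = 4    # DP replicas per group
tensor_parallel_size = 2  # GPUs per DP replica

dp_replicas = num_replicas * data_parallel_size  # Ray Serve replicas
total_gpus = dp_replicas * tensor_parallel_size

print(dp_replicas, total_gpus)  # 8 16
```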
Understanding data parallel replica coordination#
In data parallel attention, all data parallel replicas within a data parallel group work together as a cohesive unit by leveraging Ray Serve’s gang scheduling capability:
Rank assignment: Each replica receives a unique rank (0 to data_parallel_size - 1) from Ray Serve's controller to start a vLLM data parallel server.
Request distribution: Ray Serve's request router distributes requests across replicas using load balancing.
Collective operations: Replicas coordinate for collective operations (e.g., all-reduce, dispatch and combine) required by the model.
Synchronization: All data parallel replicas in a data parallel group must be present and healthy. MoE layers use all-to-all collectives to route tokens to experts across DP ranks. If any data parallel replica is unavailable, these collectives hang and tokens can’t reach experts assigned to that rank.
Fault tolerance: If any data parallel replica in a data parallel group fails, the entire group becomes unavailable because the remaining replicas can't complete collective operations. The Ray Serve controller detects the failure and restarts the entire group; meanwhile, other data parallel groups keep serving requests without downtime if num_replicas > 1.
There’s no coordination overhead introduced by Ray Serve LLM:
Startup: Data parallel ranks are assigned when Ray Serve’s controller creates the data parallel replica.
Runtime: No coordination overhead during request processing.
For more details, see Data parallel attention.
Test your deployment#
Test with a chat completion request:
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer fake-key" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [
      {"role": "user", "content": "Explain data parallel attention"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'
You can also test programmatically:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="fake-key"
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[
        {"role": "user", "content": "Explain data parallel attention"}
    ],
    max_tokens=100
)

print(response.choices[0].message.content)
Combining with other patterns#
Data parallel + Prefill-decode disaggregation#
You can combine data parallel attention with prefill-decode disaggregation to scale both phases independently while using DP within each phase. This pattern is useful when you need high throughput for both prefill and decode phases.
The following example shows a complete, functional deployment:
from ray import serve
from ray.serve.llm import LLMConfig, build_dp_deployment
from ray.serve.llm.deployment import PDProxyServer
from ray.serve.llm.ingress import OpenAiIngress, make_fastapi_ingress

# Configure prefill with data parallel attention
prefill_config = LLMConfig(
    model_loading_config={
        "model_id": "microsoft/Phi-tiny-MoE-instruct"
    },
    engine_kwargs={
        "data_parallel_size": 2,  # 2 DP replicas for prefill
        "tensor_parallel_size": 1,
        "kv_transfer_config": {
            "kv_connector": "NixlConnector",
            "kv_role": "kv_both",
        },
        # Reduced for CI compatibility
        "max_model_len": 1024,
        "max_num_seqs": 32,
    },
)

# Configure decode with data parallel attention
decode_config = LLMConfig(
    model_loading_config={
        "model_id": "microsoft/Phi-tiny-MoE-instruct"
    },
    engine_kwargs={
        "data_parallel_size": 2,  # 2 DP replicas for decode (adjusted for 4 GPU limit)
        "tensor_parallel_size": 1,
        "kv_transfer_config": {
            "kv_connector": "NixlConnector",
            "kv_role": "kv_both",
        },
        # Reduced for CI compatibility
        "max_model_len": 1024,
        "max_num_seqs": 32,
    },
)

# Build prefill and decode deployments with DP
prefill_deployment = build_dp_deployment(prefill_config, name_prefix="Prefill:")
decode_deployment = build_dp_deployment(decode_config, name_prefix="Decode:")

# Create PDProxyServer to coordinate between prefill and decode
proxy_options = PDProxyServer.get_deployment_options(prefill_config, decode_config)
proxy_deployment = serve.deployment(PDProxyServer).options(**proxy_options).bind(
    prefill_server=prefill_deployment,
    decode_server=decode_deployment,
)

# Create OpenAI-compatible ingress
ingress_options = OpenAiIngress.get_deployment_options([prefill_config, decode_config])
ingress_cls = make_fastapi_ingress(OpenAiIngress)
ingress_deployment = serve.deployment(ingress_cls).options(**ingress_options).bind(
    llm_deployments=[proxy_deployment]
)

# Deploy the application
serve.run(ingress_deployment, blocking=True)
This configuration creates:
Prefill phase: 2 data parallel replicas for processing input prompts
Decode phase: 2 data parallel replicas for generating tokens
PDProxyServer: Coordinates requests between prefill and decode phases
OpenAI ingress: Provides OpenAI-compatible API endpoints
This allows you to:
Optimize prefill and decode phases independently based on workload characteristics
Use data parallel attention within each phase for increased throughput
Note
This example uses 4 GPUs total (2 for prefill, 2 for decode). Adjust the data_parallel_size values based on your available GPU resources.
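The GPU accounting for a disaggregated deployment like this can be spelled out with a small helper (the function is illustrative, not a Ray Serve API): each phase needs data_parallel_size * tensor_parallel_size GPUs, and the phases are summed because they run on disjoint GPUs.

```python
def pd_gpu_count(prefill_dp: int, prefill_tp: int,
                 decode_dp: int, decode_tp: int) -> int:
    """GPUs needed for a prefill-decode disaggregated DP deployment."""
    return prefill_dp * prefill_tp + decode_dp * decode_tp

print(pd_gpu_count(2, 1, 2, 1))  # 4, as in the example above
# Scaling the decode phase independently, e.g. 4 decode DP replicas:
print(pd_gpu_count(2, 1, 4, 1))  # 6
```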
Note
For this example to work, you need to have NIXL installed. See the Prefill/decode disaggregation guide for prerequisites and installation instructions.
See also#
Data parallel attention - Data parallel attention architecture details
Prefill/decode disaggregation - Prefill-decode disaggregation guide
Serving patterns - Overview of serving patterns
Quickstart examples - Basic LLM deployment examples