Serving patterns
Architecture documentation for distributed LLM serving patterns.
Overview
Ray Serve LLM supports several serving patterns that can be combined for complex deployment scenarios:
Data parallel attention: scale throughput by running multiple coordinated engine replicas, each holding its own copy of the attention layers and serving its own shard of the incoming requests.
Prefill-decode disaggregation: optimize resource utilization by separating prompt processing (prefill) from token generation (decode), so that each phase can be provisioned and scaled independently.
The patterns are composable: you can mix them to meet specific throughput, latency, and cost requirements.
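To make the two ideas concrete, here is a toy sketch in plain Python. It does not use the Ray Serve LLM API. The names `KVCache`, `prefill`, `decode`, and `route` are hypothetical and exist only for illustration, and round-robin routing is an assumed stand-in for whatever coordination a real deployment uses. The sketch shows requests being sharded across replicas, with each request passing through a separate prefill stage and decode stage.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class KVCache:
    """Hypothetical stand-in for the attention state that prefill produces."""
    prompt_tokens: list[str]


def prefill(prompt: str) -> KVCache:
    # Prefill stage: process the whole prompt once and build the cache.
    return KVCache(prompt_tokens=prompt.split())


def decode(cache: KVCache, max_new_tokens: int) -> list[str]:
    # Decode stage: generate tokens one at a time, starting from the cache.
    # A real engine samples from a model; this toy emits placeholder tokens.
    return [f"tok{i}" for i in range(max_new_tokens)]


def route(prompts: list[str], num_replicas: int) -> dict[int, list[str]]:
    # Data-parallel-style sharding: spread requests round-robin over replicas.
    # (Round-robin is an assumption made for this sketch.)
    assignment: dict[int, list[str]] = {r: [] for r in range(num_replicas)}
    for replica, prompt in zip(cycle(range(num_replicas)), prompts):
        assignment[replica].append(prompt)
    return assignment


prompts = ["hello world", "ray serve llm", "prefill then decode"]
shards = route(prompts, num_replicas=2)

# Composition: every replica runs prefill followed by decode on its shard.
outputs = {
    replica: [decode(prefill(p), max_new_tokens=2) for p in batch]
    for replica, batch in shards.items()
}
```

In a real disaggregated deployment the prefill and decode stages run on separate workers and the cache is transferred between them. Here both stages are ordinary function calls, so only the division of work is visible.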
These pages describe how each pattern works. For step-by-step configuration, see the matching how-to guides: Data parallel attention and Prefill/decode disaggregation.