Serving patterns
Architecture documentation for distributed LLM serving patterns.
Overview
Ray Serve LLM supports several serving patterns that can be combined for complex deployment scenarios:
Data parallel attention: Scale throughput by running multiple coordinated engine instances that shard requests across attention layers.
Prefill-decode disaggregation: Optimize resource utilization by separating prompt processing from token generation.
These patterns are composable: you can mix them to balance throughput, latency, and cost for a given deployment.
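As a rough illustration of how the two patterns can compose, the following is a minimal, framework-free sketch in plain Python. It does not use the Ray Serve LLM API; every class and name in it (`PrefillWorker`, `DecodeWorker`, `Router`, `PrefillResult`) is hypothetical and exists only for this example. The round-robin router is a simplified stand-in for distributing requests across replicas and does not model the coordination among data-parallel engine instances.

```python
from dataclasses import dataclass, field


@dataclass
class PrefillResult:
    """Hypothetical hand-off object; stands in for a transferred KV cache."""
    request_id: int
    prompt_tokens: list


class PrefillWorker:
    """Processes the whole prompt once (the prompt-processing stage)."""

    def run(self, request_id, prompt):
        # Toy tokenization: split on whitespace instead of using a real tokenizer.
        return PrefillResult(request_id, prompt.split())


@dataclass
class DecodeWorker:
    """Generates tokens from the state handed off by the prefill stage."""
    name: str
    served: list = field(default_factory=list)

    def run(self, state, max_tokens):
        self.served.append(state.request_id)
        # Toy generation: emit placeholder tokens instead of sampling from a model.
        return [f"tok{i}" for i in range(max_tokens)]


class Router:
    """Sends each request through prefill, then to a decode replica (round-robin)."""

    def __init__(self, prefill, decoders):
        self.prefill = prefill
        self.decoders = decoders
        self._next = 0

    def handle(self, request_id, prompt, max_tokens=3):
        state = self.prefill.run(request_id, prompt)
        decoder = self.decoders[self._next % len(self.decoders)]
        self._next += 1
        return decoder.name, decoder.run(state, max_tokens)


router = Router(PrefillWorker(), [DecodeWorker("decode-0"), DecodeWorker("decode-1")])
results = [router.handle(i, "hello world") for i in range(4)]
```

In this sketch the prefill and decode stages are separate objects, so each could be sized and placed independently, while the list of decode replicas shows where replica-level scaling would plug in.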