Serving patterns

Architecture documentation for distributed LLM serving patterns.

Overview

Ray Serve LLM supports several serving patterns that can be combined for complex deployment scenarios:

  • Data parallel attention: Scale throughput by running multiple coordinated engine instances that split incoming requests among themselves, each running its own copy of the attention layers.

  • Prefill-decode disaggregation: Optimize resource utilization by separating prompt processing from token generation.

These patterns are composable: you can mix them to meet specific requirements for throughput, latency, and cost.
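
To make the prefill-decode split concrete, here is a minimal, framework-free sketch in plain Python. It does not use the Ray Serve LLM API. Every name in it (PrefillResult, toy_next_token, prefill, decode, serve) is hypothetical, and the "model" is a deterministic stand-in rather than a real LLM. It only illustrates the handoff: one stage processes the whole prompt and builds a cache, and a second stage reuses that cache to generate tokens one at a time.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PrefillResult:
    """Hypothetical handoff object passed from the prefill stage to the decode stage."""

    kv_cache: list[str]  # stand-in for the real attention key/value cache
    first_token: str


def toy_next_token(context: list[str]) -> str:
    """Stand-in for a model forward pass: deterministic, not a real LLM."""
    return f"tok{len(context)}"


def prefill(prompt: str) -> PrefillResult:
    """Process the whole prompt in one pass and emit the first output token."""
    cache = prompt.split()
    return PrefillResult(kv_cache=cache, first_token=toy_next_token(cache))


def decode(state: PrefillResult, max_new_tokens: int) -> list[str]:
    """Generate tokens one at a time, reusing the cache built by prefill."""
    cache = list(state.kv_cache) + [state.first_token]
    output = [state.first_token]
    while len(output) < max_new_tokens:
        token = toy_next_token(cache)
        cache.append(token)
        output.append(token)
    return output


def serve(prompt: str, max_new_tokens: int = 3) -> list[str]:
    """Route one request through both stages, as a proxy in front of them might."""
    return decode(prefill(prompt), max_new_tokens)


if __name__ == "__main__":
    print(serve("hello world", max_new_tokens=3))
```

In a real deployment the two stages would typically run on separate replicas or devices, so the compute-heavy prompt pass and the iterative generation loop can be sized and scaled independently. Here they are ordinary functions in one process so that only the control flow is visible.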