Replica scheduling#
This guide explains how Ray Serve schedules deployment replicas across your cluster and the APIs and environment variables you can use to control placement behavior.
Quick reference: Choosing the right approach#
| Goal | Solution | Example |
|---|---|---|
| Multi-GPU inference with tensor parallelism | `placement_group_bundles` | vLLM with tensor parallelism |
| Target specific GPU types or zones | Custom resources in `ray_actor_options` | Schedule on A100 nodes only |
| Limit replicas per node for high availability | `max_replicas_per_node` | Max 2 replicas of each deployment per node |
| Reduce cloud costs by packing nodes | `RAY_SERVE_USE_PACK_SCHEDULING_STRATEGY` | Many small models sharing nodes |
| Reserve resources for worker actors | `placement_group_bundles` | Replica spawns Ray Data workers |
| Shard large embeddings across nodes | `placement_group_strategy="SPREAD"` | Recommendation model with distributed embedding table |
| Simple deployment, no special needs | Default (just `ray_actor_options`) | Single-GPU model |
How replica scheduling works#
When you deploy an application, Ray Serve’s deployment scheduler determines where to place each replica actor across the available nodes in your Ray cluster. The scheduler runs on the Serve Controller and makes batch scheduling decisions during each update cycle. For information on configuring CPU, GPU, and other resource requirements for your replicas, see Resource allocation.
┌──────────────────────────────────┐
│ serve.run(app) │
└────────────────┬─────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ Serve Controller │
│ ┌───────────────────────────────────────────────────────────────────────────┐ │
│ │ Deployment Scheduler │ │
│ │ │ │
│ │ 1. Check placement_group_bundles ──▶ PlacementGroupSchedulingStrategy │ │
│ │ 2. Check target node affinity ──▶ NodeAffinitySchedulingStrategy │ │
│ │ 3. Use default strategy ──▶ SPREAD (default) or PACK │ │
│ └───────────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────────┘
│
┌─────────────────────────────────┴─────────────────────────────────┐
│ │
▼ ▼
┌─────────────────────────────────────┐ ┌─────────────────────────────────────┐
│ SPREAD Strategy (default) │ │ PACK Strategy │
│ │ │ │
│ Distributes replicas across nodes │ │ Packs replicas onto fewer nodes │
│ for fault tolerance │ │ to minimize resource waste │
│ │ │ │
│ ┌─────────┐ ┌─────────┐ ┌───────┐ │ │ ┌─────────┐ ┌─────────┐ ┌───────┐ │
│ │ Node 1 │ │ Node 2 │ │Node 3 │ │ │ │ Node 1 │ │ Node 2 │ │Node 3 │ │
│ │ ┌─────┐ │ │ ┌─────┐ │ │┌─────┐│ │ │ │ ┌─────┐ │ │ │ │ │ │
│ │ │ R1 │ │ │ │ R2 │ │ ││ R3 ││ │ │ │ │ R1 │ │ │ idle │ │ idle │ │
│ │ └─────┘ │ │ └─────┘ │ │└─────┘│ │ │ │ │ R2 │ │ │ │ │ │ │
│ │ │ │ │ │ │ │ │ │ │ R3 │ │ │ │ │ │ │
│ └─────────┘ └─────────┘ └───────┘ │ │ └─────────┘ └─────────┘ └───────┘ │
│ │ │ ▲ ▲ │
│ ✓ High availability │ │ └───────────┘ │
│ ✓ Load balanced │ │ Can be released │
│ ✓ Reduced contention │ │ ✓ Fewer nodes = lower cloud costs │
└─────────────────────────────────────┘ └────────────────────────────────────┘
By default, Ray Serve uses a spread scheduling strategy that distributes replicas across nodes on a best-effort basis. This approach:

- Maximizes fault tolerance by avoiding concentration of replicas on a single node
- Balances load across the cluster
- Helps prevent resource contention between replicas
Scheduling priority#
When scheduling a replica, the scheduler evaluates strategies in the following priority order:
1. **Placement groups**: If you specify `placement_group_bundles`, the scheduler uses a `PlacementGroupSchedulingStrategy` to co-locate the replica with its required resources.
2. **Pack scheduling with node affinity**: If pack scheduling is enabled, the scheduler identifies the best available node by preferring non-idle nodes (nodes already running replicas) and using a best-fit algorithm to minimize resource fragmentation. It then uses a `NodeAffinitySchedulingStrategy` with soft constraints to schedule the replica on that node.
3. **Default strategy**: Falls back to `SPREAD` when pack scheduling isn't enabled.
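The decision flow can be pictured with a simplified sketch of this priority order. This is illustrative only, not the actual scheduler implementation; the Ray scheduling strategy classes are real, but the surrounding function and its arguments are assumptions:

```python
from ray.util.scheduling_strategies import (
    NodeAffinitySchedulingStrategy,
    PlacementGroupSchedulingStrategy,
)


def choose_scheduling_strategy(
    placement_group_bundles, pack_enabled, replica_pg=None, target_node_id=None
):
    """Illustrative sketch of the priority order described above."""
    if placement_group_bundles:
        # 1. Placement groups take precedence: co-locate the replica with
        #    the resources reserved for it.
        return PlacementGroupSchedulingStrategy(
            placement_group=replica_pg,
            placement_group_capture_child_tasks=True,
        )
    if pack_enabled and target_node_id is not None:
        # 2. Pack scheduling: soft affinity to the best-fit, non-idle node.
        return NodeAffinitySchedulingStrategy(node_id=target_node_id, soft=True)
    # 3. Default: spread replicas across nodes with best effort.
    return "SPREAD"
```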
Downscaling behavior#
When Ray Serve scales down a deployment, it selects which replicas to stop in the following order:

1. **Non-running replicas first**: Pending, launching, or recovering replicas are stopped before running replicas.
2. **Minimize node count**: Running replicas are stopped from nodes with the fewest total replicas across all deployments, helping to free up nodes faster. Among replicas on the same node, newer replicas are stopped before older ones.
3. **Head node protection**: Replicas on the head node have the lowest priority for removal because the head node can't be released. Among replicas on the head node, newer replicas are stopped before older ones.
Note
Running replicas on the head node isn’t recommended for production deployments. The head node runs critical cluster processes such as the GCS and Serve controller, and replica workloads can compete for resources.
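A common way to keep replica actors off the head node is to start it with zero logical CPUs and GPUs, so the scheduler has nothing to place there. This is a sketch; adjust the flags to your cluster setup:

```bash
# Advertise no schedulable CPUs or GPUs on the head node so Serve replicas
# land on worker nodes only.
ray start --head --num-cpus=0 --num-gpus=0
```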
APIs for controlling replica placement#
Ray Serve provides several options to control where replicas are scheduled. These parameters are configured through the @serve.deployment decorator. For the full API reference, see the deployment decorator documentation.
Limit replicas per node with max_replicas_per_node#
Use max_replicas_per_node to cap the number of replicas of a deployment that can run on a single node. This is useful when:
- You want to ensure high availability by spreading replicas across nodes
- You want to avoid resource contention between replicas of the same deployment
```python
from ray import serve


@serve.deployment(num_replicas=6, max_replicas_per_node=2, ray_actor_options={"num_cpus": 0.1})
class MyDeployment:
    def __call__(self, request):
        return "Hello!"


app = MyDeployment.bind()
```
In this example, if you have 6 replicas and max_replicas_per_node=2, Ray Serve requires at least 3 nodes to schedule all replicas.
Note
Valid values for max_replicas_per_node are None (default, no limit) or an integer. You can’t set max_replicas_per_node together with placement_group_bundles.
You can also specify this in a config file:
```yaml
applications:
  - name: my_app
    import_path: my_module:app
    deployments:
      - name: MyDeployment
        num_replicas: 6
        max_replicas_per_node: 2
```
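You can then apply the config with the Serve CLI; the filename here is an assumption:

```bash
serve deploy serve_config.yaml
```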
Reserve resources with placement groups#
For more details on placement group strategies, see the Ray Core placement groups documentation.
A placement group is a Ray primitive that reserves a group of resources (called bundles) across one or more nodes in your cluster. When you configure placement_group_bundles for a Ray Serve deployment, Ray creates a dedicated placement group for each replica, ensuring those resources are reserved and available for that replica’s use.
A bundle is a dictionary specifying resource requirements, such as {"CPU": 2, "GPU": 1}. When you define multiple bundles, you’re telling Ray to reserve multiple sets of resources that can be placed according to your chosen strategy.
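Conceptually, the placement group that Serve creates for each replica corresponds to what you could reserve directly with the Ray Core API. The following is a sketch of that equivalent reservation, not Serve's internal code:

```python
import ray
from ray.util.placement_group import placement_group

ray.init()

# Reserve two GPU bundles and one CPU bundle, all on the same node.
pg = placement_group(
    bundles=[{"GPU": 1}, {"GPU": 1}, {"CPU": 4}],
    strategy="STRICT_PACK",
)
ray.get(pg.ready())  # Blocks until every bundle has been reserved.
```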
What placement groups and bundles mean#
The following diagram illustrates how a deployment with placement_group_bundles=[{"GPU": 1}, {"GPU": 1}, {"CPU": 4}] and placement_group_strategy set to "STRICT_PACK" is scheduled:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Node (8 CPUs, 4 GPUs) │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Placement Group (per replica) │ │
│ │ │ │
│ │ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────────┐ │ │
│ │ │ Bundle 0 │ │ Bundle 1 │ │ Bundle 2 │ │ │
│ │ │ {"GPU": 1} │ │ {"GPU": 1} │ │ {"CPU": 4} │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ │ ┌─────────────┐ │ │ ┌─────────────┐ │ │ ┌─────────────────┐ │ │ │
│ │ │ │ Replica │ │ │ │ Worker │ │ │ │ Worker Tasks │ │ │ │
│ │ │ │ Actor │ │ │ │ Actor │ │ │ │ (preprocessing)│ │ │ │
│ │ │ │ (main GPU) │ │ │ │ (2nd GPU) │ │ │ │ │ │ │ │
│ │ │ └─────────────┘ │ │ └─────────────┘ │ │ └─────────────────┘ │ │ │
│ │ └─────────────────┘ └─────────────────┘ └─────────────────────┘ │ │
│ │ ▲ │ │
│ │ │ │ │
│ │ Replica runs in │ │
│ │ first bundle │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
With STRICT_PACK: All bundles guaranteed on same node
Consider a deployment with `placement_group_bundles=[{"GPU": 1}, {"GPU": 1}, {"CPU": 4}]`:

- Ray reserves 3 bundles of resources for each replica.
- The replica actor runs in the first bundle, so `ray_actor_options` must fit within it.
- The remaining bundles are available for worker actors and tasks spawned by the replica.
- All child actors and tasks are automatically scheduled within the placement group.
This is different from simply requesting resources in ray_actor_options. With ray_actor_options={"num_gpus": 2}, your replica actor gets 2 GPUs but you have no control over where additional worker processes run. With placement groups, you explicitly reserve resources for both the replica and its workers.
When to use placement groups#
| Scenario | Why placement groups help |
|---|---|
| Model parallelism | Tensor parallelism or pipeline parallelism requires multiple GPUs that must communicate efficiently. Use `STRICT_PACK` to keep all of a replica's GPUs on the same node. |
| Replica spawns workers | Your deployment creates Ray actors or tasks for parallel processing. Placement groups reserve resources for these workers. For example, a video processing service that spawns Ray tasks to decode frames in parallel, or a batch inference service using Ray Data to preprocess inputs before model inference. |
| Cross-node distribution | You need bundles spread across different nodes. Use the `SPREAD` or `STRICT_SPREAD` strategy. |
Don't use placement groups when:

- Your replica is self-contained and doesn't spawn additional actors or tasks.
- You only need simple resource requirements (use `ray_actor_options` instead).
- You want to use `max_replicas_per_node`. The combination of these two options isn't supported today.
Note
How max_replicas_per_node works: Ray Serve creates a synthetic custom resource for each deployment. Every node implicitly has 1.0 of this resource, and each replica requests 1.0 / max_replicas_per_node of it. For example, with max_replicas_per_node=3, each replica requests ~0.33 of the resource, so only 3 replicas can fit on a node before the resource is exhausted. This mechanism relies on Ray’s standard resource scheduling, which conflicts with placement group scheduling.
Configuring placement groups#
The following example reserves two CPU bundles for each replica and uses a strict pack strategy to keep them on the same node:
```python
from ray import serve


@serve.deployment(
    ray_actor_options={"num_cpus": 0.1},
    placement_group_bundles=[{"CPU": 0.1}, {"CPU": 0.1}],
    placement_group_strategy="STRICT_PACK",
)
class MultiCPUModel:
    def __call__(self, request):
        return "Processed with 2 CPUs"


multi_cpu_app = MultiCPUModel.bind()
```
The replica actor is scheduled in the first bundle, so the resources specified in ray_actor_options must be a subset of the first bundle’s resources. All actors and tasks created by the replica are scheduled in the placement group by default (placement_group_capture_child_tasks=True).
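The following sketch extends this pattern with a worker actor that the replica spawns in its constructor. Because child actors are captured by the replica's placement group, the worker lands in the reserved second bundle. The class names and resource amounts are illustrative:

```python
import ray
from ray import serve


@ray.remote(num_cpus=1)
class PreprocessWorker:
    def process(self, batch):
        return [x * 2 for x in batch]


@serve.deployment(
    ray_actor_options={"num_cpus": 1},
    # First bundle hosts the replica; the second is reserved for its worker.
    placement_group_bundles=[{"CPU": 1}, {"CPU": 1}],
    placement_group_strategy="STRICT_PACK",
)
class ModelWithWorker:
    def __init__(self):
        # Captured by the replica's placement group, so it runs in the
        # second bundle rather than landing on an arbitrary node.
        self.worker = PreprocessWorker.remote()

    async def __call__(self, request):
        return await self.worker.process.remote([1, 2, 3])


worker_app = ModelWithWorker.bind()
```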
Target nodes with custom resources#
You can use custom resources in ray_actor_options to target replicas to specific nodes. This is the recommended approach for controlling which nodes run your replicas.
Configure your deployment to require the specific resource:
```python
from ray import serve


# Schedule only on nodes with A100 GPUs
@serve.deployment(ray_actor_options={"resources": {"A100": 1}})
class A100Model:
    def __call__(self, request):
        return "Running on A100"


# Schedule only on nodes with T4 GPUs
@serve.deployment(ray_actor_options={"resources": {"T4": 1}})
class T4Model:
    def __call__(self, request):
        return "Running on T4"


a100_app = A100Model.bind()
t4_app = T4Model.bind()
```
Start your Ray nodes with custom resources that identify their capabilities. The following example, which continues the script above, uses ray.init to start a single local node that advertises both resources for demonstration:
```python
import ray

if __name__ == "__main__":
    # For demonstration, this local node advertises both custom resources.
    # In a real cluster, set resources per node, for example with
    # `ray start --resources='{"A100": 1}'`.
    ray.init(
        resources={
            "A100": 1,
            "T4": 1,
        }
    )
    serve.run(a100_app, name="a100", route_prefix="/a100")
    serve.run(t4_app, name="t4", route_prefix="/t4")
    serve.shutdown()
    ray.shutdown()
```
Custom resources offer several advantages for Ray Serve deployments:

- **Quantifiable**: You can request specific amounts (such as `{"A100": 2}` for 2 GPUs or `{"A100": 0.5}` to share a GPU between 2 replicas, as shown in the sketch after this list), while labels are binary (present or absent).
- **Autoscaler-aware**: The Ray autoscaler understands custom resources and can provision nodes with the required resources automatically.
- **Scheduling guarantees**: Replicas won't be scheduled until nodes with the required custom resources are available, preventing placement on incompatible nodes.
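As a sketch of the fractional case mentioned above, two replicas that each request half of a node's `A100` custom resource can share that node. Note that a custom resource is only a scheduling token; it doesn't by itself grant or isolate GPU access:

```python
from ray import serve


# Two replicas each request 0.5 of the custom "A100" resource, so both fit
# on a single node that advertises {"A100": 1}.
@serve.deployment(num_replicas=2, ray_actor_options={"resources": {"A100": 0.5}})
class SharedA100Model:
    def __call__(self, request):
        return "Sharing an A100 node"


shared_app = SharedA100Model.bind()
```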
Tip
Use descriptive resource names that reflect the node’s capabilities, such as GPU types, availability zones, or hardware generations.
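For example, you might tag nodes with resources that encode GPU type and availability zone at startup; the names and amounts here are assumptions:

```bash
# Head node with A100 GPUs in us-east-1a
ray start --head --resources='{"A100": 8, "zone_us_east_1a": 1}'

# Worker node with T4 GPUs in us-east-1b
# (replace <head-address> with your head node's address)
ray start --address=<head-address> --resources='{"T4": 4, "zone_us_east_1b": 1}'
```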
Environment variables#
These environment variables modify Ray Serve’s scheduling behavior. Set them before starting Ray.
RAY_SERVE_USE_PACK_SCHEDULING_STRATEGY#
Default: 0 (disabled)
When enabled, switches from spread scheduling to pack scheduling. Pack scheduling:
- Packs replicas onto fewer nodes to minimize resource fragmentation
- Sorts pending replicas by resource requirements (largest first)
- Prefers scheduling on nodes that already have replicas (non-idle nodes)
- Uses best-fit bin packing to find the optimal node for each replica
```bash
export RAY_SERVE_USE_PACK_SCHEDULING_STRATEGY=1
ray start --head
```
When to use pack scheduling: When you run many small deployments (such as 10 models that each need 0.5 CPUs), spread scheduling scatters them across nodes and wastes capacity. Pack scheduling fills nodes efficiently before using new ones. Because cloud providers bill per node-hour, packing replicas onto fewer nodes lets the autoscaler release idle nodes, directly reducing your bill.
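For example, with pack scheduling enabled before Ray starts, replicas of a small deployment like the following are consolidated onto as few nodes as possible; the sizes here are illustrative:

```python
from ray import serve


# Ten replicas that each need only half a CPU. With
# RAY_SERVE_USE_PACK_SCHEDULING_STRATEGY=1 set before starting Ray, these
# replicas fill existing nodes before using new ones, letting the
# autoscaler release idle nodes.
@serve.deployment(num_replicas=10, ray_actor_options={"num_cpus": 0.5})
class SmallModel:
    def __call__(self, request):
        return "ok"


app = SmallModel.bind()
```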
When to avoid pack scheduling: When high availability is critical and you want replicas spread across nodes for fault tolerance.
Note
Pack scheduling automatically falls back to spread scheduling when any deployment uses placement groups with PACK, SPREAD, or STRICT_SPREAD strategies. This happens because pack scheduling needs to predict where resources will be consumed to bin-pack effectively. With STRICT_PACK, all bundles are guaranteed to land on one node, making resource consumption predictable. With other strategies, bundles may spread across multiple nodes unpredictably, so the scheduler can’t accurately track available resources per node.
RAY_SERVE_HIGH_PRIORITY_CUSTOM_RESOURCES#
Default: empty
A comma-separated list of custom resource names that should be prioritized when sorting replicas for pack scheduling. Resources listed earlier have higher priority.
```bash
export RAY_SERVE_HIGH_PRIORITY_CUSTOM_RESOURCES="TPU,custom_accelerator"
ray start --head
```
When pack scheduling sorts replicas by resource requirements, the priority order is:

1. Custom resources in `RAY_SERVE_HIGH_PRIORITY_CUSTOM_RESOURCES` (in order)
2. GPU
3. CPU
4. Memory
5. Other custom resources
This ensures that replicas requiring high-priority resources are scheduled first, reducing the chance of resource fragmentation.
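The effect of this ordering can be pictured with a simplified sort key. This is illustrative only, not Serve's actual implementation:

```python
# Illustrative sort key: compare high-priority custom resources first,
# then GPU, CPU, and memory, so the largest high-priority requests are
# scheduled before everything else.
HIGH_PRIORITY = ["TPU", "custom_accelerator"]  # from the environment variable


def sort_key(required_resources: dict) -> tuple:
    return tuple(
        required_resources.get(name, 0)
        for name in HIGH_PRIORITY + ["GPU", "CPU", "memory"]
    )


pending = [{"CPU": 1}, {"TPU": 4, "CPU": 8}, {"GPU": 1, "CPU": 2}]
# Largest requirements first: the TPU replica is placed before the GPU replica.
pending.sort(key=sort_key, reverse=True)
```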
See also#
- Resource allocation for configuring CPU, GPU, and other resources
- Autoscaling for automatically adjusting replica count
- Ray placement groups for advanced resource co-location