Replica scheduling

This guide explains how Ray Serve schedules deployment replicas across your cluster and the APIs and environment variables you can use to control placement behavior.

Quick reference: Choosing the right approach

| Goal | Solution | Example |
| --- | --- | --- |
| Multi-GPU inference with tensor parallelism | placement_group_bundles + STRICT_PACK | vLLM with tensor_parallel_size=4 |
| Target specific GPU types or zones | label_selector in ray_actor_options | Schedule on A100 nodes only |
| Limit replicas per node for high availability | max_replicas_per_node | Max 2 replicas of each deployment per node |
| Reduce cloud costs by packing nodes | RAY_SERVE_USE_PACK_SCHEDULING_STRATEGY=1 | Many small models sharing nodes |
| Reserve resources for worker actors | placement_group_bundles | Replica spawns Ray Data workers |
| Shard large embeddings across nodes | placement_group_bundles + STRICT_SPREAD | Recommendation model with distributed embedding table |
| Simple deployment, no special needs | Default (just ray_actor_options) | Single-GPU model |

How replica scheduling works

When you deploy an application, Ray Serve’s deployment scheduler determines where to place each replica actor across the available nodes in your Ray cluster. The scheduler runs on the Serve Controller and makes batch scheduling decisions during each update cycle. For information on configuring CPU, GPU, and other resource requirements for your replicas, see Resource allocation.

                              ┌──────────────────────────────────┐
                              │        serve.run(app)            │
                              └────────────────┬─────────────────┘
                                               │
                                               ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                              Serve Controller                                   │
│  ┌───────────────────────────────────────────────────────────────────────────┐  │
│  │                        Deployment Scheduler                               │  │
│  │                                                                           │  │
│  │   1. Check placement_group_bundles  ──▶  PlacementGroupSchedulingStrategy │  │
│  │   2. Check target node affinity     ──▶  NodeAffinitySchedulingStrategy   │  │
│  │   3. Use default strategy           ──▶  SPREAD (default) or PACK         │  │
│  └───────────────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────────────┘
                                               │
             ┌─────────────────────────────────┴─────────────────────────────────┐
             │                                                                   │
             ▼                                                                   ▼
┌─────────────────────────────────────┐               ┌─────────────────────────────────────┐
│    SPREAD Strategy (default)        │               │           PACK Strategy             │
│                                     │               │                                     │
│  Distributes replicas across nodes  │               │   Packs replicas onto fewer nodes   │
│  for fault tolerance                │               │   to minimize resource waste        │
│                                     │               │                                     │
│  ┌─────────┐ ┌─────────┐ ┌───────┐  │               │  ┌─────────┐ ┌─────────┐ ┌───────┐  │
│  │ Node 1  │ │ Node 2  │ │Node 3 │  │               │  │ Node 1  │ │ Node 2  │ │Node 3 │  │
│  │ ┌─────┐ │ │ ┌─────┐ │ │┌─────┐│  │               │  │ ┌─────┐ │ │         │ │       │  │
│  │ │ R1  │ │ │ │ R2  │ │ ││ R3  ││  │               │  │ │ R1  │ │ │  idle   │ │ idle  │  │
│  │ └─────┘ │ │ └─────┘ │ │└─────┘│  │               │  │ │ R2  │ │ │         │ │       │  │
│  │         │ │         │ │       │  │               │  │ │ R3  │ │ │         │ │       │  │
│  └─────────┘ └─────────┘ └───────┘  │               │  └─────────┘ └─────────┘ └───────┘  │
│                                     │               │               ▲           ▲         │
│  ✓ High availability                │               │               └───────────┘         │
│  ✓ Load balanced                    │               │           Can be released           │
│  ✓ Reduced contention               │               │  ✓ Fewer nodes = lower cloud costs  │
└─────────────────────────────────────┘               └─────────────────────────────────────┘

By default, Ray Serve uses a spread scheduling strategy that distributes replicas across nodes with best effort. This approach:

  • Maximizes fault tolerance by avoiding concentration of replicas on a single node

  • Balances load across the cluster

  • Helps prevent resource contention between replicas

Scheduling priority

When scheduling a replica, the scheduler evaluates strategies in the following priority order:

  1. Placement groups: If you specify placement_group_bundles, the scheduler uses a PlacementGroupSchedulingStrategy to co-locate the replica with its required resources. If you specify placement_group_bundle_label_selector, the scheduler will only select nodes with the required labels for each bundle.

  2. Pack scheduling with node affinity: If pack scheduling is enabled, the scheduler identifies the best available node by preferring non-idle nodes (nodes already running replicas) and using a best-fit algorithm to minimize resource fragmentation. It then uses a NodeAffinitySchedulingStrategy with soft constraints to schedule the replica on that node.

  • With labels: If a label_selector is provided, the scheduler strictly filters candidate nodes to match the labels before selecting the best fit.

  • With fallback: If a fallback_strategy is provided, the scheduler first attempts to pack on nodes matching the labels. If no matching nodes are available, it retries using the next fallback option.

  3. Default strategy: Falls back to SPREAD when pack scheduling isn’t enabled.
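The priority order above can be sketched as a small decision function. This is a simplified illustration, not Ray Serve's scheduler code: the function name `pick_strategy` and its two parameters are hypothetical, and the returned strings only name the strategy that would be chosen.

```python
from typing import Optional


def pick_strategy(
    placement_group_bundles: Optional[list],
    pack_scheduling_enabled: bool,
) -> str:
    """Return the name of the strategy chosen for one replica (illustrative only)."""
    if placement_group_bundles:
        # 1. Placement groups take precedence over the other options.
        return "PlacementGroupSchedulingStrategy"
    if pack_scheduling_enabled:
        # 2. Pack scheduling targets the chosen node with soft node affinity.
        return "NodeAffinitySchedulingStrategy(soft=True)"
    # 3. Otherwise the default spread strategy applies.
    return "SPREAD"


print(pick_strategy([{"GPU": 1}], pack_scheduling_enabled=True))
print(pick_strategy(None, pack_scheduling_enabled=False))
```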

Downscaling behavior

When Ray Serve scales down a deployment, it selects which replicas to stop in the following order:

  1. Non-running replicas first: Pending, launching, or recovering replicas are stopped before running replicas.

  2. Minimize node count: Running replicas are stopped from nodes with the fewest total replicas across all deployments, helping to free up nodes faster. Among replicas on the same node, newer replicas are stopped before older ones.

  3. Head node protection: Replicas on the head node have the lowest priority for removal since the head node can’t be released. Among replicas on the head node, newer replicas are stopped before older ones.
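The following sketch models these three rules as a single sort key. It is a rough approximation under stated assumptions: the `Replica` dataclass and `choose_replicas_to_stop` are hypothetical names, and the per-node count only considers the replicas passed in, whereas the rule above counts replicas across all deployments.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    node_id: str
    running: bool
    start_time: float  # larger value means the replica started more recently


def choose_replicas_to_stop(replicas, num_to_stop, head_node_id):
    """Return the names of replicas to stop, highest removal priority first."""
    # Running replicas per node (stand-in for the count across all deployments).
    per_node = Counter(r.node_id for r in replicas if r.running)

    def removal_key(r):
        if not r.running:
            return (0, 0, "", 0.0)  # rule 1: non-running replicas go first
        rank = 2 if r.node_id == head_node_id else 1  # rule 3: head node last
        # rule 2: nodes with fewer replicas first, newer replicas first per node
        return (rank, per_node[r.node_id], r.node_id, -r.start_time)

    ordered = sorted(replicas, key=removal_key)
    return [r.name for r in ordered[:num_to_stop]]


replicas = [
    Replica("a", "n1", True, 1.0),
    Replica("b", "n1", True, 2.0),
    Replica("c", "n2", True, 3.0),
    Replica("d", "head", True, 4.0),
    Replica("e", "n2", False, 5.0),
]
print(choose_replicas_to_stop(replicas, 3, head_node_id="head"))
```

Here the pending replica is stopped first, then the only running replica on the least-loaded worker node, then the newest replica on the remaining worker node.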

Note

Running replicas on the head node isn’t recommended for production deployments. The head node runs critical cluster processes such as the GCS and Serve controller, and replica workloads can compete for resources.

APIs for controlling replica placement

Ray Serve provides several options to control where replicas are scheduled. These parameters are configured through the @serve.deployment decorator. For the full API reference, see the deployment decorator documentation.

Limit replicas per node with max_replicas_per_node

Use max_replicas_per_node to cap the number of replicas of a deployment that can run on a single node. This is useful when:

  • You want to ensure high availability by spreading replicas across nodes

  • You want to avoid resource contention between replicas of the same deployment

from ray import serve


@serve.deployment(num_replicas=6, max_replicas_per_node=2, ray_actor_options={"num_cpus": 0.1})
class MyDeployment:
    def __call__(self, request):
        return "Hello!"


app = MyDeployment.bind()

In this example, with 6 replicas and max_replicas_per_node=2, Ray Serve requires at least 3 nodes (6 / 2) to schedule all replicas.

Note

Valid values for max_replicas_per_node are None (default, no limit) or an integer. You can’t set max_replicas_per_node together with placement_group_bundles.

You can also specify this in a config file:

applications:
  - name: my_app
    import_path: my_module:app
    deployments:
      - name: MyDeployment
        num_replicas: 6
        max_replicas_per_node: 2

Reserve resources with placement groups

For more details on placement group strategies, see the Ray Core placement groups documentation.

A placement group is a Ray primitive that reserves a group of resources (called bundles) across one or more nodes in your cluster. When you configure placement_group_bundles for a Ray Serve deployment, Ray creates a dedicated placement group for each replica, ensuring those resources are reserved and available for that replica’s use.

A bundle is a dictionary specifying resource requirements, such as {"CPU": 2, "GPU": 1}. When you define multiple bundles, you’re telling Ray to reserve multiple sets of resources that can be placed according to your chosen strategy.

Controlling placement group location with label selectors

You can further refine where placement groups are scheduled using a placement_group_bundle_label_selector. This field defines a list of label selectors to apply per-bundle when scheduling the Serve deployment. This allows you to restrict the nodes where your bundles (and therefore your replicas) are placed based on Ray node labels. For more information on Ray label selectors, see Use labels to control scheduling.

@serve.deployment(
    ray_actor_options={"num_cpus": 0.1},
    placement_group_bundles=[{"CPU": 0.1, "GPU": 1}],
    placement_group_bundle_label_selector=[
        {"ray.io/accelerator-type": "A100"}
    ]
)
def PlacementGroupBundleLabelSelector(request):
    return "Running in PG on A100"

pg_label_app = PlacementGroupBundleLabelSelector.bind()

The placement_group_bundle_label_selector accepts a list of dictionaries.

  • Single selector: If you provide a list containing a single dictionary, that selector is applied to all bundles in placement_group_bundles.

  • Per-bundle selector: If you provide a list of multiple dictionaries, the length must match placement_group_bundles. The i-th selector applies to the i-th bundle.
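The two cases can be illustrated with a small helper that expands the selector list to one selector per bundle. This is a sketch of the rule as described above, not Ray Serve's validation code; `selectors_per_bundle` is a hypothetical name.

```python
def selectors_per_bundle(bundles, selectors):
    """Return one label selector per bundle, following the rule described above."""
    if len(selectors) == 1:
        # Single selector: applied to every bundle.
        return [dict(selectors[0]) for _ in bundles]
    if len(selectors) != len(bundles):
        raise ValueError(
            f"Expected 1 or {len(bundles)} selectors, got {len(selectors)}"
        )
    # Per-bundle selectors: the i-th selector applies to the i-th bundle.
    return [dict(s) for s in selectors]


bundles = [{"GPU": 1}, {"GPU": 1}, {"CPU": 4}]
a100 = {"ray.io/accelerator-type": "A100"}
print(selectors_per_bundle(bundles, [a100]))
```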

What placement groups and bundles mean

The following diagram illustrates how a deployment with placement_group_bundles=[{"GPU": 1}, {"GPU": 1}, {"CPU": 4}] and placement_group_strategy set to "STRICT_PACK" is scheduled:

┌─────────────────────────────────────────────────────────────────────────────┐
│                              Node (8 CPUs, 4 GPUs)                          │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                     Placement Group (per replica)                     │  │
│  │                                                                       │  │
│  │   ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────────┐   │  │
│  │   │   Bundle 0      │  │   Bundle 1      │  │     Bundle 2        │   │  │
│  │   │   {"GPU": 1}    │  │   {"GPU": 1}    │  │    {"CPU": 4}       │   │  │
│  │   │                 │  │                 │  │                     │   │  │
│  │   │ ┌─────────────┐ │  │ ┌─────────────┐ │  │ ┌─────────────────┐ │   │  │
│  │   │ │   Replica   │ │  │ │   Worker    │ │  │ │  Worker Tasks   │ │   │  │
│  │   │ │   Actor     │ │  │ │   Actor     │ │  │ │  (preprocessing)│ │   │  │
│  │   │ │  (main GPU) │ │  │ │ (2nd GPU)   │ │  │ │                 │ │   │  │
│  │   │ └─────────────┘ │  │ └─────────────┘ │  │ └─────────────────┘ │   │  │
│  │   └─────────────────┘  └─────────────────┘  └─────────────────────┘   │  │
│  │           ▲                                                           │  │
│  │           │                                                           │  │
│  │    Replica runs in                                                    │  │
│  │    first bundle                                                       │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────────┘

With STRICT_PACK: All bundles guaranteed on same node

Consider a deployment with placement_group_bundles=[{"GPU": 1}, {"GPU": 1}, {"CPU": 4}]:

  • Ray reserves 3 bundles of resources for each replica

  • The replica actor runs in the first bundle (so ray_actor_options must fit within it)

  • The remaining bundles are available for worker actors/tasks spawned by the replica

  • All child actors and tasks are automatically scheduled within the placement group

This is different from simply requesting resources in ray_actor_options. With ray_actor_options={"num_gpus": 2}, your replica actor gets 2 GPUs but you have no control over where additional worker processes run. With placement groups, you explicitly reserve resources for both the replica and its workers.

When to use placement groups

| Scenario | Why placement groups help |
| --- | --- |
| Model parallelism | Tensor parallelism or pipeline parallelism requires multiple GPUs that must communicate efficiently. Use STRICT_PACK to guarantee all GPUs are on the same node. For example, vLLM with tensor_parallel_size=4 and the Ray distributed executor backend spawns 4 Ray worker actors (one per GPU shard), all of which must be on the same node for efficient inter-GPU communication via NVLink/NVSwitch. |
| Replica spawns workers | Your deployment creates Ray actors or tasks for parallel processing. Placement groups reserve resources for these workers. For example, a video processing service that spawns Ray tasks to decode frames in parallel, or a batch inference service using Ray Data to preprocess inputs before model inference. |
| Cross-node distribution | You need bundles spread across different nodes. Use SPREAD or STRICT_SPREAD. For example, serving a model with a massive embedding table (such as a recommendation model with billions of item embeddings) that must be sharded across multiple nodes because it exceeds single-node memory. Each bundle holds one shard, and STRICT_SPREAD ensures each shard is on a separate node. |

Don’t use placement groups when:

  • Your replica is self-contained and doesn’t spawn additional actors/tasks

  • You only need simple resource requirements (use ray_actor_options instead)

  • You want to use max_replicas_per_node. Combining the two options isn’t currently supported.

Note

How max_replicas_per_node works: Ray Serve creates a synthetic custom resource for each deployment. Every node implicitly has 1.0 of this resource, and each replica requests 1.0 / max_replicas_per_node of it. For example, with max_replicas_per_node=3, each replica requests ~0.33 of the resource, so only 3 replicas can fit on a node before the resource is exhausted. This mechanism relies on Ray’s standard resource scheduling, which conflicts with placement group scheduling.
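The arithmetic behind this mechanism can be checked with a short model. It is a sketch only: the function names are hypothetical, and `Fraction` is used here to avoid floating-point rounding, which says nothing about how Ray represents resource quantities internally.

```python
from fractions import Fraction


def replicas_that_fit(max_replicas_per_node: int) -> int:
    """Count how many replicas fit on one node holding 1.0 of the synthetic resource."""
    remaining = Fraction(1)
    per_replica = Fraction(1, max_replicas_per_node)
    placed = 0
    while remaining >= per_replica:
        remaining -= per_replica
        placed += 1
    return placed


def min_nodes_required(num_replicas: int, max_replicas_per_node: int) -> int:
    """Smallest node count that can hold all replicas (ceiling division)."""
    return -(-num_replicas // max_replicas_per_node)


print(replicas_that_fit(3))
print(min_nodes_required(6, 2))
```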

Configuring placement groups

The following example reserves two bundles of 0.1 CPU each for every replica and uses the STRICT_PACK strategy to keep both bundles on the same node:

from ray import serve


@serve.deployment(
    ray_actor_options={"num_cpus": 0.1},
    placement_group_bundles=[{"CPU": 0.1}, {"CPU": 0.1}],
    placement_group_strategy="STRICT_PACK",
)
class MultiCPUModel:
    def __call__(self, request):
        return "Processed with 2 CPUs"


multi_cpu_app = MultiCPUModel.bind()

The replica actor is scheduled in the first bundle, so the resources specified in ray_actor_options must be a subset of the first bundle’s resources. All actors and tasks created by the replica are scheduled in the placement group by default (placement_group_capture_child_tasks=True).
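The subset requirement can be checked before deploying with a helper like the one below. This is a hedged sketch: `actor_options_fit_first_bundle` is a hypothetical function, the mapping from `num_cpus`, `num_gpus`, and `memory` to the bundle keys `CPU`, `GPU`, and `memory` is an assumption made for illustration, and only explicitly specified options are checked (any defaults Ray Serve applies are ignored).

```python
def actor_options_fit_first_bundle(ray_actor_options, bundles):
    """Return True if the explicitly requested actor resources fit in bundles[0]."""
    required = {}
    # Assumed mapping from actor option names to bundle resource keys.
    for option, resource in (("num_cpus", "CPU"), ("num_gpus", "GPU"), ("memory", "memory")):
        if option in ray_actor_options:
            required[resource] = ray_actor_options[option]
    # Custom resources are passed through unchanged.
    required.update(ray_actor_options.get("resources", {}))

    first_bundle = bundles[0]
    return all(first_bundle.get(name, 0) >= amount for name, amount in required.items())


print(actor_options_fit_first_bundle({"num_cpus": 0.1}, [{"CPU": 0.1}, {"CPU": 0.1}]))
print(actor_options_fit_first_bundle({"num_gpus": 1}, [{"CPU": 4}, {"GPU": 1}]))
```

The second call fails the check because the GPU lives in the second bundle, while the replica actor runs in the first.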

Target nodes with labels

You can use label selectors in ray_actor_options to target replicas to specific nodes. This is the recommended approach for controlling which nodes run your replicas.

Configure each deployment to require the labels of the nodes it should run on:

from ray import serve


# Schedule only on nodes with A100 GPUs
@serve.deployment(ray_actor_options={"label_selector": {"ray.io/accelerator-type": "A100"}})
class A100Model:
    def __call__(self, request):
        return "Running on A100"


# Schedule only on nodes with T4 GPUs
@serve.deployment(ray_actor_options={"label_selector": {"ray.io/accelerator-type": "T4"}})
class T4Model:
    def __call__(self, request):
        return "Running on T4"


a100_app = A100Model.bind()
t4_app = T4Model.bind()

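For these selectors to match, your Ray nodes must be started with labels that identify their capabilities. The following example starts a local Ray instance with such labels and then runs the A100 application: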

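import ray

# Continues the previous snippet, which imports serve and defines a100_app.
if __name__ == "__main__":
    # Start a local Ray instance whose node carries the labels selected above.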
    ray.init(
        labels={
            "ray.io/accelerator-type": "A100",
            "zone": "us-west-2b",
        },
        num_cpus=16,
        num_gpus=1,
        resources={"my_custom_resource": 10},
    )

    serve.run(a100_app, name="a100", route_prefix="/a100")

Soft constraints with fallback_strategy

By default, a label_selector acts as a hard constraint. If no node matches the selector, the replica remains pending indefinitely. You can relax this requirement by providing a fallback_strategy in ray_actor_options.

@serve.deployment(
    ray_actor_options={
        "label_selector": {"zone": "us-west-2a"},
        "fallback_strategy": [{"label_selector": {"zone": "us-west-2b"}}]
    }
)
class SoftAffinityDeployment:
    def __call__(self, request):
        return "Scheduling to a zone with soft constraints!"

soft_affinity_app = SoftAffinityDeployment.bind()

This allows you to express preferences. For example, when using PACK scheduling, the scheduler will attempt to find a node that matches the label_selector first. If no available node is found, the scheduler will retry scheduling using the rules defined in your fallback strategy.
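The retry order can be sketched as follows. This is a simplified model rather than the scheduler's implementation: `pick_node` and `matches` are hypothetical helpers, only equality selectors are handled, and node resources are ignored.

```python
def matches(node_labels, selector):
    """Return True if every key in the selector equals the node's label value."""
    return all(node_labels.get(key) == value for key, value in selector.items())


def pick_node(nodes, label_selector, fallback_strategy=()):
    """Try the primary selector first, then each fallback option in order."""
    attempts = [label_selector] + [f["label_selector"] for f in fallback_strategy]
    for selector in attempts:
        for node_id, labels in nodes.items():
            if matches(labels, selector):
                return node_id
    return None  # no node matches: the replica would stay pending


nodes = {"n1": {"zone": "us-west-2b"}, "n2": {"zone": "us-west-2c"}}
fallback = [{"label_selector": {"zone": "us-west-2b"}}]
print(pick_node(nodes, {"zone": "us-west-2a"}, fallback))
print(pick_node(nodes, {"zone": "us-west-2a"}))
```

With the fallback, the replica lands on the us-west-2b node. Without it, no node matches and the function returns None, which corresponds to the replica staying pending.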

Label selectors and fallback strategies offer several advantages for Ray Serve deployments:

  • Expressive placement constraints: Ray automatically detects and populates labels for node attributes such as ray.io/accelerator-type, and you can add custom labels at startup with the --labels flag. Selectors use a Kubernetes-like syntax that supports equality, negation (!), inclusion (in), and exclusion (!in), so you can precisely filter which nodes run your replicas.

  • Autoscaler-aware: The Ray autoscaler understands label selectors and can provision nodes with the required labels automatically.

  • Soft constraints: Unlike custom resources, which are strict requirements, label selectors can also appear in the fallback_strategy field. This lets you state a preferred placement while allowing the scheduler to use alternative nodes when the primary targets are unavailable, which prevents deployments from stalling.
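To make the four operators concrete, the sketch below evaluates a selector value against a node label. It rests on assumptions: the string forms `"value"`, `"!value"`, `"in(a,b)"`, and `"!in(a,b)"` are assumed here for illustration, `value_matches` is a hypothetical helper, and treating a missing label as "not equal" is a choice made for this example rather than documented behavior.

```python
def value_matches(actual, expr):
    """Evaluate one selector expression against a node's label value (or None)."""
    if expr.startswith("!in(") and expr.endswith(")"):
        options = {v.strip() for v in expr[4:-1].split(",")}
        return actual not in options  # exclusion
    if expr.startswith("in(") and expr.endswith(")"):
        options = {v.strip() for v in expr[3:-1].split(",")}
        return actual in options  # inclusion
    if expr.startswith("!"):
        return actual != expr[1:]  # negation
    return actual == expr  # equality


print(value_matches("A100", "in(A100, H100)"))
print(value_matches("T4", "!in(A100, H100)"))
print(value_matches("A100", "!A100"))
```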

Environment variables

These environment variables modify Ray Serve’s scheduling behavior. Set them before starting Ray.

RAY_SERVE_USE_PACK_SCHEDULING_STRATEGY

Default: 0 (disabled)

When enabled, switches from spread scheduling to pack scheduling. Pack scheduling:

  • Packs replicas onto fewer nodes to minimize resource fragmentation

  • Sorts pending replicas by resource requirements (largest first)

  • Prefers scheduling on nodes that already have replicas (non-idle nodes)

  • Uses best-fit bin packing to find the optimal node for each replica

export RAY_SERVE_USE_PACK_SCHEDULING_STRATEGY=1
ray start --head

When to use pack scheduling: When you run many small deployments (such as 10 models each needing 0.5 CPUs), spread scheduling scatters them across nodes, wasting capacity. Pack scheduling fills nodes efficiently before using new ones. Cloud providers bill per node-hour. Packing replicas onto fewer nodes allows idle nodes to be released by the autoscaler, directly reducing your bill.
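The "10 models at 0.5 CPUs each" scenario can be simulated with a small best-fit packer. This is an illustrative model under simplifying assumptions, not Ray Serve's algorithm: `pack_replicas` is a hypothetical function, only a single CPU dimension is tracked, and all nodes start idle.

```python
def pack_replicas(replica_cpus, node_capacity):
    """Assign each replica to a node: largest first, non-idle nodes first, best fit."""
    free = dict(node_capacity)
    used_nodes = set()
    placement = {}
    # Largest requests first, so big replicas are not stranded by small ones.
    for name, need in sorted(replica_cpus.items(), key=lambda kv: -kv[1]):
        candidates = [n for n, cap in free.items() if cap >= need]
        if not candidates:
            placement[name] = None  # would wait for the autoscaler
            continue
        # Prefer nodes already in use, then the tightest fit, then a stable name order.
        best = min(candidates, key=lambda n: (n not in used_nodes, free[n] - need, n))
        free[best] -= need
        used_nodes.add(best)
        placement[name] = best
    return placement


models = {f"m{i}": 0.5 for i in range(10)}
nodes = {"n1": 4.0, "n2": 4.0, "n3": 4.0}
placement = pack_replicas(models, nodes)
print(sorted(set(placement.values())))
```

In this run the ten replicas occupy two of the three nodes, which leaves the third node idle and eligible for release.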

When to avoid pack scheduling: Avoid it when high availability is critical and you want replicas spread across nodes.

Note

Pack scheduling automatically falls back to spread scheduling when any deployment uses placement groups with PACK, SPREAD, or STRICT_SPREAD strategies. This happens because pack scheduling needs to predict where resources will be consumed to bin-pack effectively. With STRICT_PACK, all bundles are guaranteed to land on one node, making resource consumption predictable. With other strategies, bundles may spread across multiple nodes unpredictably, so the scheduler can’t accurately track available resources per node.

RAY_SERVE_HIGH_PRIORITY_CUSTOM_RESOURCES

Default: empty

A comma-separated list of custom resource names that should be prioritized when sorting replicas for pack scheduling. Resources listed earlier have higher priority.

export RAY_SERVE_HIGH_PRIORITY_CUSTOM_RESOURCES="TPU,custom_accelerator"
ray start --head

When pack scheduling is enabled, the scheduler first filters the cluster to find nodes that match the label_selector (if specified). It then sorts the pending replicas by resource requirements to pack them efficiently. The priority order for sorting replicas is:

  1. Custom resources in RAY_SERVE_HIGH_PRIORITY_CUSTOM_RESOURCES (in order)

  2. GPU

  3. CPU

  4. Memory

  5. Other custom resources

This ensures that replicas requiring high-priority resources are scheduled first, reducing the chance of resource fragmentation.
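The sorting order can be expressed as a tuple-valued sort key. The sketch below is an approximation with hypothetical names (`sort_key`, `order_replicas`). Summing the remaining custom resources into one final key component is a simplification chosen for this example, since the text above does not say how they are compared.

```python
STANDARD = ("GPU", "CPU", "memory")


def sort_key(resources, high_priority):
    """Build a comparison key following the priority order listed above."""
    key = [resources.get(name, 0) for name in high_priority]  # 1. listed custom resources
    key += [resources.get(name, 0) for name in STANDARD]      # 2-4. GPU, CPU, memory
    others = [k for k in resources if k not in high_priority and k not in STANDARD]
    key.append(sum(resources[k] for k in others))             # 5. other custom resources
    return tuple(key)


def order_replicas(replicas, env_value):
    """Return replica names sorted from highest to lowest scheduling priority."""
    high_priority = [n.strip() for n in env_value.split(",") if n.strip()]
    return sorted(
        replicas, key=lambda name: sort_key(replicas[name], high_priority), reverse=True
    )


replicas = {
    "cpu_only": {"CPU": 4},
    "gpu": {"GPU": 1, "CPU": 1},
    "tpu": {"TPU": 1},
}
print(order_replicas(replicas, "TPU,custom_accelerator"))
print(order_replicas(replicas, ""))
```

With TPU listed as high priority, the TPU replica is sorted first. With an empty list, it falls behind the GPU and CPU replicas because TPU then counts as an "other" custom resource.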

See also