Ray Serve Autoscaling#

Each Ray Serve deployment has one replica by default. This means there is one worker process running the model and serving requests. When traffic to your deployment increases, the single replica can become overloaded. To maintain high performance of your service, you need to scale out your deployment.

Manual Scaling#

Before jumping into autoscaling, which is more complex, the other option to consider is manual scaling. You can increase the number of replicas by setting a higher value for num_replicas in the deployment options through an in-place update. By default, num_replicas is 1. Increasing the number of replicas horizontally scales out your deployment, improving throughput and latency at higher levels of traffic.

# Deploy with a single replica
deployments:
- name: Model
  num_replicas: 1

# Scale up to 10 replicas
deployments:
- name: Model
  num_replicas: 10
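
If you deploy in Python rather than through a config file, the same manual scaling can be expressed by redeploying with updated deployment options. The following is a minimal sketch; the deployment body is elided and the replica counts are illustrative.

from ray import serve


@serve.deployment(num_replicas=1)  # initially deployed with a single replica
class Model:
    async def __call__(self, request) -> str:
        ...


# Scale out by redeploying the application with a higher replica count.
serve.run(Model.options(num_replicas=10).bind())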

Autoscaling Basic Configuration#

Instead of setting a fixed number of replicas for a deployment and manually updating it, you can configure a deployment to autoscale based on incoming traffic. The Serve autoscaler reacts to traffic spikes by monitoring queue sizes and making scaling decisions to add or remove replicas. Turn on autoscaling for a deployment by setting num_replicas="auto". You can further configure it by tuning the autoscaling_config in deployment options.

The following config is used in the example in the next section.

- name: Model
  num_replicas: auto

Setting num_replicas="auto" is equivalent to the following deployment configuration.

- name: Model
  max_ongoing_requests: 5
  autoscaling_config:
    target_ongoing_requests: 2
    min_replicas: 1
    max_replicas: 100

Note

You can set num_replicas="auto" and override its default values (shown above) by specifying autoscaling_config, or you can omit num_replicas="auto" and fully configure autoscaling yourself.

Let’s dive into what each of these parameters does.

  • target_ongoing_requests (replaces the deprecated target_num_ongoing_requests_per_replica) is the average number of ongoing requests per replica that the Serve autoscaler tries to ensure. You can adjust it based on your request processing length (the longer the requests, the smaller this number should be) as well as your latency objective (the shorter you want your latency to be, the smaller this number should be). See the back-of-envelope sizing sketch after this list for how these quantities interact.

  • max_ongoing_requests (replaces the deprecated max_concurrent_queries) is the maximum number of ongoing requests allowed for a replica. Note this parameter is not part of the autoscaling config because it’s relevant to all deployments, but it’s important to set it relative to the target value if you turn on autoscaling for your deployment.

  • min_replicas is the minimum number of replicas for the deployment. Set this to 0 if there are long periods of no traffic and some extra tail latency during upscale is acceptable. Otherwise, set this to what you think you need for low traffic.

  • max_replicas is the maximum number of replicas for the deployment. Set this to ~20% higher than what you think you need for peak traffic.
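
As a rough illustration of how these guidelines combine, the following back-of-envelope sketch estimates per-replica throughput and a max_replicas value. All numbers are assumptions for illustration, not measurements or recommendations.

import math

# Back-of-envelope sizing sketch; every number here is an assumption.
avg_request_latency_s = 0.2      # assumed average time to process one request
target_ongoing_requests = 2      # the num_replicas="auto" default
expected_peak_qps = 500          # assumed peak traffic

# At steady state, one replica sustains roughly
# target_ongoing_requests / avg_request_latency_s requests per second.
per_replica_qps = target_ongoing_requests / avg_request_latency_s   # ~10 QPS

replicas_at_peak = math.ceil(expected_peak_qps / per_replica_qps)   # ~50 replicas
max_replicas = math.ceil(replicas_at_peak * 1.2)                    # ~20% headroom -> 60

print(per_replica_qps, replicas_at_peak, max_replicas)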

These guidelines are a great starting point. If you decide to further tune your autoscaling config for your application, see Advanced Ray Serve Autoscaling.
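
If you do override the defaults, you can keep num_replicas="auto" and pass a partial autoscaling_config, as described in the note above. The following is a minimal sketch; the specific values are illustrative only, not recommendations.

from ray import serve


@serve.deployment(
    num_replicas="auto",              # keep the default autoscaling behavior...
    max_ongoing_requests=10,          # ...but raise the per-replica request cap
    autoscaling_config={
        "target_ongoing_requests": 5, # aim for ~5 ongoing requests per replica
        "min_replicas": 2,            # keep two replicas around during low traffic
        "max_replicas": 60,           # ~20% above the estimated peak requirement
    },
)
class Model:
    async def __call__(self, request) -> str:
        ...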

Basic example#

This example is a synchronous workload that runs ResNet50. The application code and its autoscaling configuration are below. Alternatively, you can specify the autoscaling config through a YAML file, which is shown after the application code.

import requests
from io import BytesIO

from PIL import Image
import starlette.requests
import torch
from torchvision import transforms
import torchvision.models as models
from torchvision.models import ResNet50_Weights

from ray import serve


@serve.deployment(
    ray_actor_options={"num_cpus": 1},
    num_replicas="auto",
)
class Model:
    def __init__(self):
        self.resnet50 = (
            models.resnet50(weights=ResNet50_Weights.DEFAULT).eval().to("cpu")
        )
        self.preprocess = transforms.Compose(
            [
                transforms.Resize(256),
                transforms.CenterCrop(224),
                transforms.ToTensor(),
                transforms.Normalize(
                    mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
                ),
            ]
        )
        resp = requests.get(
            "https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt"
        )
        self.categories = resp.content.decode("utf-8").split("\n")

    async def __call__(self, request: starlette.requests.Request) -> str:
        uri = (await request.json())["uri"]
        image_bytes = requests.get(uri).content
        image = Image.open(BytesIO(image_bytes)).convert("RGB")

        # Batch size is 1
        input_tensor = torch.cat([self.preprocess(image).unsqueeze(0)]).to("cpu")
        with torch.no_grad():
            output = self.resnet50(input_tensor)
            sm_output = torch.nn.functional.softmax(output[0], dim=0)
        ind = torch.argmax(sm_output)
        return self.categories[ind]


app = Model.bind()

The equivalent configuration specified through a YAML config file:

applications:
  - name: default
    import_path: resnet:app
    deployments:
    - name: Model
      num_replicas: auto
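
Once the application is running, you can send it a request like the following. This sketch assumes Serve's default HTTP address (localhost:8000) and uses a placeholder image URL.

import requests

# Assumes the app is reachable at Serve's default HTTP address.
resp = requests.post(
    "http://localhost:8000/",
    json={"uri": "https://example.com/some-image.jpg"},  # placeholder image URL
)
print(resp.text)  # predicted ImageNet category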

This example uses Locust to run a load test against the application. The Locust load test runs a changing number of “users” that ping the ResNet50 service, each with a constant wait time of 0: each user repeatedly sends a request, waits for the response, then immediately sends the next one. The number of users running over time is shown in the following graph:

[Graph: number of Locust users over time]
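
For reference, a minimal locustfile sketch of this load pattern is shown below. The base URL and image URI are assumptions; adjust them for your setup.

# locustfile.py
from locust import HttpUser, constant, task

IMAGE_URI = "https://example.com/some-image.jpg"  # placeholder image URL


class ResNetUser(HttpUser):
    wait_time = constant(0)  # no wait: send the next request as soon as one returns

    @task
    def classify(self):
        # Matches the deployment's __call__, which reads {"uri": ...} from the JSON body.
        self.client.post("/", json={"uri": IMAGE_URI})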

The results of the load test are as follows:

Replicas

[Graph: number of autoscaled replicas over time]

QPS

[Graph: queries per second over time]

P50 Latency

[Graph: P50 latency over time]

Notice the following:

  • Each Locust user constantly sends a single request and waits for a response. As a result, the number of autoscaled replicas is roughly half the number of Locust users over time as Serve attempts to satisfy the target_ongoing_requests=2 setting.

  • The throughput of the system increases with the number of users and replicas.

  • The latency briefly spikes when traffic increases, but otherwise stays relatively steady.

Ray Serve Autoscaler vs Ray Autoscaler#

The Ray Serve Autoscaler is an application-level autoscaler that sits on top of the Ray Autoscaler. Concretely, this means that the Ray Serve autoscaler asks Ray to start a number of replica actors based on request demand. If the Ray Autoscaler determines there aren’t enough available resources (for example, CPUs or GPUs) to place these actors, it requests more Ray nodes, and the underlying cloud provider responds by adding them. Similarly, when Ray Serve scales down and terminates replica actors, it attempts to make as many nodes idle as possible so the Ray Autoscaler can remove them. To learn more about the architecture underlying Ray Serve autoscaling, see Ray Serve Autoscaling Architecture.