Advanced Topics, Configurations, & FAQ

Ray Serve has a number of knobs and tools you can tune for your particular workload. This page covers all of Ray Serve's advanced options and topics aside from the fundamentals of Deploying Ray Serve. For a more hands-on take, please check out the Serve Tutorials.

There are a number of things you'll likely want to do with your serving application, including scaling out, splitting traffic, and batching input for better performance. To do all of this, you will create a BackendConfig, a configuration object that you'll use to set the properties of a particular backend.

Scaling Out

To scale out a backend to multiple workers, simply configure the number of replicas.

config = {"num_replicas": 10}
serve.create_backend("my_scaled_endpoint_backend", handle_request, config=config)

# scale it back down...
config = {"num_replicas": 2}
serve.update_backend_config("my_scaled_endpoint_backend", config)

This will scale up or down the number of workers that can accept requests.

Using Resources (CPUs, GPUs)

To assign hardware resources per worker, you can pass resource requirements to ray_actor_options. To learn about the options you can pass in, take a look at the Resources with Actors guide.

For example, to create a backend where each replica uses a single GPU, you can do the following:

config = {"num_gpus": 1}
serve.create_backend("my_gpu_backend", handle_request, ray_actor_options=config)

Configuring Parallelism with OMP_NUM_THREADS

Deep learning frameworks like PyTorch and TensorFlow often use multithreading when performing inference. The number of CPUs they use is controlled by the OMP_NUM_THREADS environment variable. To avoid contention, Ray sets OMP_NUM_THREADS=1 by default because Ray workers and actors use a single CPU by default. If you do want to enable this parallelism in your Serve backend, set OMP_NUM_THREADS to the desired value either when starting Ray or in your function/class definition:

OMP_NUM_THREADS=12 ray start --head
OMP_NUM_THREADS=12 ray start --address=$HEAD_NODE_ADDRESS

import os

class MyBackend:
    def __init__(self, parallelism):
        os.environ["OMP_NUM_THREADS"] = str(parallelism)
        # Download model weights, initialize model, etc.

serve.create_backend("parallel_backend", MyBackend, 12)

Batching to Improve Performance

You can also have Ray Serve batch requests for better performance. To use this feature, you need to:

  1. Set max_batch_size in the config dictionary.

  2. Modify your backend implementation to accept a list of requests and return a list of responses instead of handling a single request.

class BatchingExample:
    def __init__(self):
        self.count = 0

    def __call__(self, requests):
        responses = []
        for request in requests:
            self.count += 1
            responses.append(self.count)
        return responses

config = {"max_batch_size": 5}
serve.create_backend("counter1", BatchingExample, config=config)
serve.create_endpoint("counter1", backend="counter1", route="/increment")

Please take a look at Batching Tutorial for a deep dive.
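
To see why the signature change in step 2 matters, here is a minimal, hypothetical sketch of what a batching layer does: it hands the backend a list of up to max_batch_size requests at once, and the backend returns one response per request. This is only an illustration of the contract, not Serve's internals:

```python
def process_in_batches(requests, backend, max_batch_size):
    """Feed `requests` to a list-in/list-out backend in chunks of max_batch_size."""
    responses = []
    for i in range(0, len(requests), max_batch_size):
        batch = requests[i:i + max_batch_size]
        responses.extend(backend(batch))  # backend returns one response per request
    return responses

# A toy backend that handles a whole batch at once.
double_all = lambda batch: [x * 2 for x in batch]

# With max_batch_size=5, seven requests result in two backend calls:
# one batch of 5, then one batch of 2.
print(process_in_batches(list(range(7)), double_all, max_batch_size=5))
# [0, 2, 4, 6, 8, 10, 12]
```

The key point is that the backend's output list must line up one-to-one with its input list, which is why step 2 above requires returning a list of responses.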

Splitting Traffic Between Backends

At times it may be useful to expose a single endpoint that is served by multiple backends. You can do this by splitting the traffic for an endpoint between backends using set_traffic. When calling set_traffic, you provide a dictionary of backend name to a float value that will be used to randomly route that portion of traffic (out of a total of 1.0) to the given backend. For example, here we split traffic 50/50 between two backends:

serve.create_backend("backend1", MyClass1)
serve.create_backend("backend2", MyClass2)

serve.create_endpoint("fifty-fifty", backend="backend1", route="/fifty")
serve.set_traffic("fifty-fifty", {"backend1": 0.5, "backend2": 0.5})

Each request is routed randomly between the backends in the traffic dictionary according to the provided weights. Please see Session Affinity for details on how to ensure that clients or users are consistently mapped to the same backend.
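
Under the hood, a traffic split amounts to a weighted random choice over backends. The following is a minimal, hypothetical sketch of that routing logic in plain Python (not Serve's actual router):

```python
import random

def pick_backend(traffic_dict, rng=random):
    """Randomly pick a backend name according to the given weights.

    `traffic_dict` maps backend name -> fraction of traffic (summing to 1.0).
    """
    names = list(traffic_dict)
    weights = [traffic_dict[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]

# With a 50/50 split, each backend receives roughly half the requests.
random.seed(0)  # deterministic for the example
counts = {"backend1": 0, "backend2": 0}
for _ in range(10_000):
    counts[pick_backend({"backend1": 0.5, "backend2": 0.5})] += 1
print(counts)
```

Because each request is routed independently, the observed split only approaches the configured weights over many requests.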

Canary Deployments

set_traffic can be used to implement canary deployments, where one backend serves the majority of traffic, while a small fraction is routed to a second backend. This is especially useful for “canary testing” a new model on a small percentage of users, while the tried and true old model serves the majority. Once you are satisfied with the new model, you can reroute all traffic to it and remove the old model:

serve.create_backend("default_backend", MyClass)

# Initially, set all traffic to be served by the "default" backend.
serve.create_endpoint("canary_endpoint", backend="default_backend", route="/canary-test")

# Add a second backend and route 1% of the traffic to it.
serve.create_backend("new_backend", MyNewClass)
serve.set_traffic("canary_endpoint", {"default_backend": 0.99, "new_backend": 0.01})

# Add a third backend that serves another 1% of the traffic.
serve.create_backend("new_backend2", MyNewClass2)
serve.set_traffic("canary_endpoint", {"default_backend": 0.98, "new_backend": 0.01, "new_backend2": 0.01})

# Route all traffic to the new, better backend.
serve.set_traffic("canary_endpoint", {"new_backend": 1.0})

# Or, if not so succesful, revert to the "default" backend for all traffic.
serve.set_traffic("canary_endpoint", {"default_backend": 1.0})

Incremental Rollout

set_traffic can also be used to implement incremental rollout. Here, we want to replace an existing backend with a new implementation by gradually increasing the proportion of traffic that it serves. In the example below, we do this repeatedly in one script, but in practice this would likely happen over time across multiple scripts.

serve.create_backend("existing_backend", MyClass)

# Initially, all traffic is served by the existing backend.
serve.create_endpoint("incremental_endpoint", backend="existing_backend", route="/incremental")

# Then we can slowly increase the proportion of traffic served by the new backend.
serve.create_backend("new_backend", MyNewClass)
serve.set_traffic("incremental_endpoint", {"existing_backend": 0.9, "new_backend": 0.1})
serve.set_traffic("incremental_endpoint", {"existing_backend": 0.8, "new_backend": 0.2})
serve.set_traffic("incremental_endpoint", {"existing_backend": 0.5, "new_backend": 0.5})
serve.set_traffic("incremental_endpoint", {"new_backend": 1.0})

# At any time, we can roll back to the existing backend.
serve.set_traffic("incremental_endpoint", {"existing_backend": 1.0})

Session Affinity

Splitting traffic randomly among backends for each request is general and simple, but it can be an issue when you want to ensure that a given user or client is consistently served by the same backend. To address this, Serve lets you specify a “shard key” for each request that will deterministically map to a backend. In practice, this should be something that uniquely identifies the entity you want to consistently map, like a client ID or session ID. The shard key can be specified either via the X-SERVE-SHARD-KEY HTTP header or via handle.options(shard_key="key").


Note that the mapping from shard key to backend may change when you update the traffic policy for an endpoint.

# Specifying the shard key via an HTTP header.
requests.get("", headers={"X-SERVE-SHARD-KEY": session_id})

# Specifying the shard key in a call made via serve handle.
handle = serve.get_handle("api_endpoint")
handle.options(shard_key=session_id).remote(args)
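
Conceptually, a shard key gives you consistent routing: the same key always maps to the same backend (until the traffic policy changes). A hypothetical sketch of such a deterministic mapping, not Serve's actual implementation:

```python
import hashlib

def backend_for_shard_key(shard_key, backends):
    """Deterministically map a shard key to one of the given backends."""
    digest = hashlib.md5(shard_key.encode()).hexdigest()
    return backends[int(digest, 16) % len(backends)]

backends = ["backend1", "backend2"]
# The same session ID maps to the same backend on every call.
assert backend_for_shard_key("session-42", backends) == \
    backend_for_shard_key("session-42", backends)
print(backend_for_shard_key("session-42", backends))
```

This is why a client ID or session ID makes a good shard key: any stable identifier hashes to the same backend on every request.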

Shadow Testing

Sometimes when deploying a new backend, you may want to test it out without affecting the results seen by users. You can do this with shadow_traffic, which allows you to duplicate requests to multiple backends for testing while still having them served by the set of backends specified via set_traffic. Metrics about these requests are recorded as usual so you can use them to validate model performance. This is demonstrated in the example below, where we create an endpoint serviced by a single backend but shadow traffic to two other backends for testing.

serve.create_backend("existing_backend", MyClass)

# All traffic is served by the existing backend.
serve.create_endpoint("shadowed_endpoint", backend="existing_backend", route="/shadow")

# Create two new backends that we want to test.
serve.create_backend("new_backend_1", MyNewClass)
serve.create_backend("new_backend_2", MyNewClass)

# Shadow traffic to the two new backends. This does not influence the result
# of requests to the endpoint, but a proportion of requests are
# *additionally* sent to these backends.

# Send 50% of all queries to the endpoint new_backend_1.
serve.shadow_traffic("shadowed_endpoint", "new_backend_1", 0.5)
# Send 10% of all queries to the endpoint new_backend_2.
serve.shadow_traffic("shadowed_endpoint", "new_backend_2", 0.1)

# Stop shadowing traffic to the backends.
serve.shadow_traffic("shadowed_endpoint", "new_backend_1", 0)
serve.shadow_traffic("shadowed_endpoint", "new_backend_2", 0)

Composing Multiple Models

Ray Serve supports composing individually scalable models into a single model out of the box. For instance, you can combine multiple models to perform stacking or ensembling.

To define a higher-level composed model you need to do three things:

  1. Define your underlying models (the ones that you will compose together) as Ray Serve backends.

  2. Define your composed model, using the handles of the underlying models (see the example below).

  3. Define an endpoint representing this composed model and query it!

In order to avoid synchronous execution in the composed model (which would make calls to the underlying models very slow), you'll need to make the method asynchronous by using async def. You'll see this in the example below.

That’s it. Let’s take a look at an example:

from random import random
import requests
import ray
from ray import serve


# Our pipeline will be structured as follows:
# - Input comes in, the composed model sends it to model_one
# - model_one outputs a random number between 0 and 1, if the value is
#   greater than 0.5, then the data is sent to model_two
# - otherwise, the data is returned to the user.

# Let's define two models that just print out the data they received.

def model_one(_unused_flask_request, data=None):
    print("Model 1 called with data ", data)
    return random()

def model_two(_unused_flask_request, data=None):
    print("Model 2 called with data ", data)
    return data

class ComposedModel:
    def __init__(self):
        self.model_one = serve.get_handle("model_one")
        self.model_two = serve.get_handle("model_two")

    # This method can be called concurrently!
    async def __call__(self, flask_request):
        data = flask_request.data

        score = await self.model_one.remote(data=data)
        if score > 0.5:
            result = await self.model_two.remote(data=data)
            result = {"model_used": 2, "score": score}
        else:
            result = {"model_used": 1, "score": score}

        return result

serve.create_backend("model_one", model_one)
serve.create_endpoint("model_one", backend="model_one")

serve.create_backend("model_two", model_two)
serve.create_endpoint("model_two", backend="model_two")

# max_concurrent_queries is optional. By default, if you pass in an async
# function, Ray Serve sets the limit to a high number.
    "composed_backend", ComposedModel, config={"max_concurrent_queries": 10})
    "composed", backend="composed_backend", route="/composed")

for _ in range(5):
    resp = requests.get("http://127.0.0.1:8000/composed", data="hey!")
    print(resp.json())
# Output
# {'model_used': 2, 'score': 0.6250189863595503}
# {'model_used': 1, 'score': 0.03146855349621436}
# {'model_used': 2, 'score': 0.6916977560006987}
# {'model_used': 2, 'score': 0.8169693450866928}
# {'model_used': 2, 'score': 0.9540681979573862}


Metric Monitoring

Ray Serve exposes system metrics, such as the number of requests, through the serve.stat Python API and the /-/metrics HTTP endpoint. By default, it uses a custom structured format for easy parsing and debugging.

Via Python:

  [..., {
        "info": {
            "name": "num_http_requests",
            "route": "/-/routes",
            "type": "MetricType.COUNTER"
        },
        "value": 1
    }, {
        "info": {
            "name": "num_http_requests",
            "route": "/echo",
            "type": "MetricType.COUNTER"
        },
        "value": 10
    }, ...]


curl http://localhost:8000/-/metrics
# Returns the same output as above in JSON format.
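
Because the default format is structured data, it is easy to post-process in Python. For example, assuming the list-of-dicts shape shown above, you could total the HTTP request counters per route (a sketch against that assumed shape, not a Serve API):

```python
def requests_per_route(metrics):
    """Sum 'num_http_requests' counter values, keyed by route."""
    totals = {}
    for item in metrics:
        info = item["info"]
        if info["name"] == "num_http_requests":
            totals[info["route"]] = totals.get(info["route"], 0) + item["value"]
    return totals

# Sample data in the structured format shown above.
metrics = [
    {"info": {"name": "num_http_requests", "route": "/-/routes",
              "type": "MetricType.COUNTER"}, "value": 1},
    {"info": {"name": "num_http_requests", "route": "/echo",
              "type": "MetricType.COUNTER"}, "value": 10},
]
print(requests_per_route(metrics))  # {'/-/routes': 1, '/echo': 10}
```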

You can also access the metrics in Prometheus format by setting the metric_exporter option in serve.init:

from ray.serve.metric import PrometheusExporter
serve.init(metric_exporter=PrometheusExporter)

curl http://localhost:8000/-/metrics

# HELP backend_request_counter_total Number of queries that have been processed in this replica
# TYPE backend_request_counter_total counter
backend_request_counter_total{backend="echo:v1"} 5.0
backend_request_counter_total{backend="echo:v2"} 5.0

The metric exporter is extensible, and you can customize it for your own metrics infrastructure. We are gathering feedback and welcome contributions! Feel free to submit a GitHub issue or chat with us in the #serve channel in the community Slack.

Here’s a simple example of a dummy exporter that writes metrics to a file:

import json
import time

import requests

from ray import serve
from ray.serve.metric.exporter import ExporterInterface

class FileExporter(ExporterInterface):
    def __init__(self):
        self.file = open("/tmp/serve_metrics.log", "w")

    def export(self, metric_metadata, metric_batch):
        for metric_item in metric_batch:
            data = metric_metadata[metric_item.key].__dict__
            data["labels"] = metric_item.labels
            data["values"] = metric_item.value
            self.file.write(json.dumps(data))
            self.file.write("\n")
        self.file.flush()

    def inspect_metrics(self):
        return "Metric is located at /tmp/serve_metrics.log"


def echo(flask_request):
    return "hello " + flask_request.args.get("name", "serve!")

serve.create_backend("hello", echo)
serve.create_endpoint("hello", backend="hello", route="/hello")

for _ in range(5):
    requests.get("http://127.0.0.1:8000/hello")

print("Retrieving metrics from file...")
with open("/tmp/serve_metrics.log") as metric_log:
    for line in metric_log:
        print(line)

# Retrieving metrics from file...
# {"name": "backend_worker_starts",
#  "type": 1,
#  "description": "The number of time this replica workers ...",
#  "label_names": ["replica_tag"],
#  "default_labels": {"backend": "hello"}, "
#  labels": {"replica_tag": "hello#XwzPQn"},
#  "values": 1
# }
# ...

Ray Serve FAQ

How do I deploy Serve?

See Deploying Ray Serve for information about how to deploy Serve.

How do I delete backends and endpoints?

To delete a backend, you can use serve.delete_backend. Note that the backend must not be in use by any endpoints in order to be deleted. Once a backend is deleted, its tag can be reused.


To delete an endpoint, you can use serve.delete_endpoint. Note that the endpoint will no longer work and will return a 404 when queried. Once an endpoint is deleted, its tag can be reused.


How do I call an endpoint from Python code?

Use serve.get_handle to get a “handle” to the endpoint, then use the handle to send requests to it:

handle = serve.get_handle("api_endpoint")
handle.remote(args)

How do I call a method on my backend class besides __call__?

To call a method via HTTP, use the header field X-SERVE-CALL-METHOD.
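
For example, assuming a backend exposed at a hypothetical /counter route, a request targeting other_method would carry the header like this (the request is only constructed here, not sent):

```python
import requests

# Build (but don't send) a request asking Serve to invoke `other_method`
# instead of __call__.
req = requests.Request(
    "GET",
    "http://127.0.0.1:8000/counter",  # hypothetical route for this backend
    headers={"X-SERVE-CALL-METHOD": "other_method"},
).prepare()
print(req.headers["X-SERVE-CALL-METHOD"])  # other_method
```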

To call a method via Python, do the following:

class StatefulProcessor:
    def __init__(self):
        self.count = 1

    def __call__(self, request):
        return {"current": self.count}

    def other_method(self, inc):
        self.count += inc
        return True

handle = serve.get_handle("backend_name")
# Select the method to invoke via handle.options.
handle.options(method_name="other_method").remote(5)