Advanced Topics and Configurations

Ray Serve has a number of knobs and tools for you to tune for your particular workload. All Ray Serve advanced options and topics are covered on this page aside from the fundamentals of Deploying Ray Serve. For a more hands on take, please check out the Serve Tutorials.

There are a number of things you’ll likely want to do with your serving application including scaling out, splitting traffic, or batching input for better performance. To do all of this, you will create a BackendConfig, a configuration object that you’ll use to set the properties of a particular backend.

Scaling Out

To scale out a backend to many instances, simply configure the number of replicas.

config = {"num_replicas": 10}
client.create_backend("my_scaled_endpoint_backend", handle_request, config=config)

# scale it back down...
config = {"num_replicas": 2}
client.update_backend_config("my_scaled_endpoint_backend", config)

This will scale up or down the number of replicas that can accept requests.

Resource Management (CPUs, GPUs)

To assign hardware resources per replica, you can pass resource requirements to ray_actor_options. By default, each replica requires one CPU. To learn about options to pass in, take a look at Resources with Actor guide.

For example, to create a backend where each replica uses a single GPU, you can do the following:

config = {"num_gpus": 1}
client.create_backend("my_gpu_backend", handle_request, ray_actor_options=config)

Fractional Resources

The resources specified in ray_actor_options can also be fractional. This allows you to flexibly share resources between replicas. For example, if you have two models and each doesn’t fully saturate a GPU, you might want to have them share a GPU by allocating 0.5 GPUs each. The same could be done to multiplex over CPUs.

half_gpu_config = {"num_gpus": 0.5}
client.create_backend("my_gpu_backend_1", handle_request, ray_actor_options=half_gpu_config)
client.create_backend("my_gpu_backend_2", handle_request, ray_actor_options=half_gpu_config)

Configuring Parallelism with OMP_NUM_THREADS

Deep learning models like PyTorch and Tensorflow often use multithreading when performing inference. The number of CPUs they use is controlled by the OMP_NUM_THREADS environment variable. To avoid contention, Ray sets OMP_NUM_THREADS=1 by default because Ray workers and actors use a single CPU by default. If you do want to enable this parallelism in your Serve backend, just set OMP_NUM_THREADS to the desired value either when starting Ray or in your function/class definition:

OMP_NUM_THREADS=12 ray start --head
OMP_NUM_THREADS=12 ray start --address=$HEAD_NODE_ADDRESS
class MyBackend:
    def __init__(self, parallelism):
        os.environ["OMP_NUM_THREADS"] = parallelism
        # Download model weights, initialize model, etc.

client.create_backend("parallel_backend", MyBackend, 12)


Some other libraries may not respect OMP_NUM_THREADS and have their own way to configure parallelism. For example, if you’re using OpenCV, you’ll need to manually set the number of threads using cv2.setNumThreads(num_threads) (set to 0 to disable multi-threading). You can check the configuration using cv2.getNumThreads() and cv2.getNumberOfCPUs().

Batching to Improve Performance

You can also have Ray Serve batch requests for performance. In order to do use this feature, you need to: 1. Set the max_batch_size in the config dictionary. 2. Modify your backend implementation to accept a list of requests and return a list of responses instead of handling a single request.

class BatchingExample:
    def __init__(self):
        self.count = 0

    def __call__(self, requests):
        responses = []
            for request in requests:
        return responses

config = {"max_batch_size": 5}
client.create_backend("counter1", BatchingExample, config=config)
client.create_endpoint("counter1", backend="counter1", route="/increment")

Please take a look at Batching Tutorial for a deep dive.

Splitting Traffic Between Backends

At times it may be useful to expose a single endpoint that is served by multiple backends. You can do this by splitting the traffic for an endpoint between backends using client.set_traffic. When calling client.set_traffic, you provide a dictionary of backend name to a float value that will be used to randomly route that portion of traffic (out of a total of 1.0) to the given backend. For example, here we split traffic 50/50 between two backends:

client.create_backend("backend1", MyClass1)
client.create_backend("backend2", MyClass2)

client.create_endpoint("fifty-fifty", backend="backend1", route="/fifty")
client.set_traffic("fifty-fifty", {"backend1": 0.5, "backend2": 0.5})

Each request is routed randomly between the backends in the traffic dictionary according to the provided weights. Please see Session Affinity for details on how to ensure that clients or users are consistently mapped to the same backend.

Canary Deployments

client.set_traffic can be used to implement canary deployments, where one backend serves the majority of traffic, while a small fraction is routed to a second backend. This is especially useful for “canary testing” a new model on a small percentage of users, while the tried and true old model serves the majority. Once you are satisfied with the new model, you can reroute all traffic to it and remove the old model:

client.create_backend("default_backend", MyClass)

# Initially, set all traffic to be served by the "default" backend.
client.create_endpoint("canary_endpoint", backend="default_backend", route="/canary-test")

# Add a second backend and route 1% of the traffic to it.
client.create_backend("new_backend", MyNewClass)
client.set_traffic("canary_endpoint", {"default_backend": 0.99, "new_backend": 0.01})

# Add a third backend that serves another 1% of the traffic.
client.create_backend("new_backend2", MyNewClass2)
client.set_traffic("canary_endpoint", {"default_backend": 0.98, "new_backend": 0.01, "new_backend2": 0.01})

# Route all traffic to the new, better backend.
client.set_traffic("canary_endpoint", {"new_backend": 1.0})

# Or, if not so succesful, revert to the "default" backend for all traffic.
client.set_traffic("canary_endpoint", {"default_backend": 1.0})

Incremental Rollout

client.set_traffic can also be used to implement incremental rollout. Here, we want to replace an existing backend with a new implementation by gradually increasing the proportion of traffic that it serves. In the example below, we do this repeatedly in one script, but in practice this would likely happen over time across multiple scripts.

client.create_backend("existing_backend", MyClass)

# Initially, all traffic is served by the existing backend.
client.create_endpoint("incremental_endpoint", backend="existing_backend", route="/incremental")

# Then we can slowly increase the proportion of traffic served by the new backend.
client.create_backend("new_backend", MyNewClass)
client.set_traffic("incremental_endpoint", {"existing_backend": 0.9, "new_backend": 0.1})
client.set_traffic("incremental_endpoint", {"existing_backend": 0.8, "new_backend": 0.2})
client.set_traffic("incremental_endpoint", {"existing_backend": 0.5, "new_backend": 0.5})
client.set_traffic("incremental_endpoint", {"new_backend": 1.0})

# At any time, we can roll back to the existing backend.
client.set_traffic("incremental_endpoint", {"existing_backend": 1.0})

Session Affinity

Splitting traffic randomly among backends for each request is is general and simple, but it can be an issue when you want to ensure that a given user or client is served by the same backend repeatedly. To address this, a “shard key” can be specified for each request that will deterministically map to a backend. In practice, this should be something that uniquely identifies the entity that you want to consistently map, like a client ID or session ID. The shard key can either be specified via the X-SERVE-SHARD-KEY HTTP header or handle.options(shard_key="key").


The mapping from shard key to backend may change when you update the traffic policy for an endpoint.

# Specifying the shard key via an HTTP header.
requests.get("", headers={"X-SERVE-SHARD-KEY": session_id})

# Specifying the shard key in a call made via serve handle.
handle = client.get_handle("api_endpoint")

Shadow Testing

Sometimes when deploying a new backend, you may want to test it out without affecting the results seen by users. You can do this with client.shadow_traffic, which allows you to duplicate requests to multiple backends for testing while still having them served by the set of backends specified via client.set_traffic. Metrics about these requests are recorded as usual so you can use them to validate model performance. This is demonstrated in the example below, where we create an endpoint serviced by a single backend but shadow traffic to two other backends for testing.

client.create_backend("existing_backend", MyClass)

# All traffic is served by the existing backend.
client.create_endpoint("shadowed_endpoint", backend="existing_backend", route="/shadow")

# Create two new backends that we want to test.
client.create_backend("new_backend_1", MyNewClass)
client.create_backend("new_backend_2", MyNewClass)

# Shadow traffic to the two new backends. This does not influence the result
# of requests to the endpoint, but a proportion of requests are
# *additionally* sent to these backends.

# Send 50% of all queries to the endpoint new_backend_1.
client.shadow_traffic("shadowed_endpoint", "new_backend_1", 0.5)
# Send 10% of all queries to the endpoint new_backend_2.
client.shadow_traffic("shadowed_endpoint", "new_backend_2", 0.1)

# Stop shadowing traffic to the backends.
client.shadow_traffic("shadowed_endpoint", "new_backend_1", 0)
client.shadow_traffic("shadowed_endpoint", "new_backend_2", 0)

Composing Multiple Models

Ray Serve supports composing individually scalable models into a single model out of the box. For instance, you can combine multiple models to perform stacking or ensembles.

To define a higher-level composed model you need to do three things:

  1. Define your underlying models (the ones that you will compose together) as Ray Serve backends

  2. Define your composed model, using the handles of the underlying models (see the example below).

  3. Define an endpoint representing this composed model and query it!

In order to avoid synchronous execution in the composed model (e.g., it’s very slow to make calls to the composed model), you’ll need to make the function asynchronous by using an async def. You’ll see this in the example below.

That’s it. Let’s take a look at an example:

from random import random
import requests
import ray
from ray import serve

client = serve.start()

# Our pipeline will be structured as follows:
# - Input comes in, the composed model sends it to model_one
# - model_one outputs a random number between 0 and 1, if the value is
#   greater than 0.5, then the data is sent to model_two
# - otherwise, the data is returned to the user.

# Let's define two models that just print out the data they received.

def model_one(request):
    print("Model 1 called with data ", request.query_params.get("data"))
    return random()

def model_two(request):
    print("Model 2 called with data ", request.query_params.get("data"))
    return request.query_params.get("data")

class ComposedModel:
    def __init__(self):
        client = serve.connect()
        self.model_one = client.get_handle("model_one")
        self.model_two = client.get_handle("model_two")

    # This method can be called concurrently!
    async def __call__(self, starlette_request):
        data = await starlette_request.body()

        score = await self.model_one.remote(data=data)
        if score > 0.5:
            result = await self.model_two.remote(data=data)
            result = {"model_used": 2, "score": score}
            result = {"model_used": 1, "score": score}

        return result

client.create_backend("model_one", model_one)
client.create_endpoint("model_one", backend="model_one")

client.create_backend("model_two", model_two)
client.create_endpoint("model_two", backend="model_two")

# max_concurrent_queries is optional. By default, if you pass in an async
# function, Ray Serve sets the limit to a high number.
    "composed_backend", ComposedModel, config={"max_concurrent_queries": 10})
    "composed", backend="composed_backend", route="/composed")

for _ in range(5):
    resp = requests.get("", data="hey!")
# Output
# {'model_used': 2, 'score': 0.6250189863595503}
# {'model_used': 1, 'score': 0.03146855349621436}
# {'model_used': 2, 'score': 0.6916977560006987}
# {'model_used': 2, 'score': 0.8169693450866928}
# {'model_used': 2, 'score': 0.9540681979573862}

Sync and Async Handles

Ray Serve offers two types of ServeHandle. You can use the client.get_handle(..., sync=True|False) flag to toggle between them.

  • When you set sync=True (the default), a synchronous handle is returned. Calling handle.remote() should return a Ray ObjectRef.

  • When you set sync=False, an asyncio based handle is returned. You need to Call it with await handle.remote() to return a Ray ObjectRef. To use await, you have to run client.get_handle and handle.remote in Python asyncio event loop.

The async handle has performance advantage because it uses asyncio directly; as compared to the sync handle, which talks to an asyncio event loop in a thread. To learn more about the reasoning behind these, checkout our architecture documentation.


Ray Serve exposes important system metrics like the number of successful and errored requests through the Ray metrics monitoring infrastructure. By default, the metrics are exposed in Prometheus format on each node.

The following metrics are exposed by Ray Serve:




The number of queries that have been processed in this replica.


The number of exceptions that have occurred in the backend.


The number of times this replica has been restarted due to failure.


The latency for queries in the replica’s queue waiting to be processed or batched.


The latency for queries to be processed.


The current number of queries queued in the backend replicas.


The current number of queries being processed.


The number of HTTP requests processed.


The number of requests processed by the router.

To see this in action, run ray start --head --metrics-export-port=8080 in your terminal, and then run the following script:

import ray
from ray import serve

import time

client = serve.start()

def f(request):

client.create_backend("f", f)
client.create_endpoint("f", backend="f")

handle = client.get_handle("f")
while (True):

In your web browser, navigate to localhost:8080. In the output there, you can search for serve_ to locate the metrics above. The metrics are updated once every ten seconds, and you will need to refresh the page to see the new values.

For example, after running the script for some time and refreshing localhost:8080 you might see something that looks like:

ray_serve_backend_processing_latency_ms_count{...,backend="f",...} 99.0
ray_serve_backend_processing_latency_ms_sum{...,backend="f",...} 99279.30498123169

which indicates that the average processing latency is just over one second, as expected.

You can even define a custom metric to use in your backend, and tag it with the current backend or replica. Here’s an example:

class MyBackendClass:
    def __init__(self):
        self.my_counter = metrics.Count(
            description=("The number of excellent requests to this backend."),
            tag_keys=("backend", ))
            "backend": serve.get_current_backend_tag()

    def __call__(self, request):
        if "excellent" in request.query_params:

See the Ray Monitoring documentation for more details, including instructions for scraping these metrics using Prometheus.

Reconfiguring Backends (Experimental)

Suppose you want to update a parameter in your model without creating a whole new backend. You can do this by writing a reconfigure method for the class underlying your backend. At runtime, you can then pass in your new parameters by setting the user_config field of BackendConfig.

The following simple example will make the usage clear:

import requests
import random

import ray
from ray import serve
from ray.serve import BackendConfig

client = serve.start()

class Threshold:
    def __init__(self):
        # self.model won't be changed by reconfigure.
        self.model = random.Random()  # Imagine this is some heavyweight model.

    def reconfigure(self, config):
        # This will be called when the class is created and when
        # the user_config field of BackendConfig is updated.
        self.threshold = config["threshold"]

    def __call__(self, request):
        return self.model.random() > self.threshold

backend_config = BackendConfig(user_config={"threshold": 0.01})
client.create_backend("threshold", Threshold, config=backend_config)
client.create_endpoint("threshold", backend="threshold", route="/threshold")
print(requests.get("").text)  # true, probably

backend_config = BackendConfig(user_config={"threshold": 0.99})
client.update_backend_config("threshold", backend_config)
print(requests.get("").text)  # false, probably

The reconfigure method is called when the class is created if user_config is set. In particular, it’s also called when new replicas are created in the future, in case you decide to scale up your backend later. The reconfigure method is also called each time user_config is updated via client.update_backend_config.

Dependency Management

Ray Serve supports serving backends with different (possibly conflicting) python dependencies. For example, you can simultaneously serve one backend that uses legacy Tensorflow 1 and another backend that uses Tensorflow 2.

Currently this is supported using conda. You must have a conda environment set up for each set of dependencies you want to isolate. If using a multi-node cluster, the conda configuration must be identical across all nodes.

Here’s an example script. For it to work, first create a conda environment named ray-tf1 with Ray Serve and Tensorflow 1 installed, and another named ray-tf2 with Ray Serve and Tensorflow 2. The Ray and python versions must be the same in both environments. To specify an environment for a backend to use, simply pass the environment name in to client.create_backend as shown below.

import requests
from ray import serve
from ray.serve import CondaEnv
import tensorflow as tf

client = serve.start()

def tf_version(request):
    return ("Tensorflow " + tf.__version__)

client.create_backend("tf1", tf_version, env=CondaEnv("ray-tf1"))
client.create_endpoint("tf1", backend="tf1", route="/tf1")
client.create_backend("tf2", tf_version, env=CondaEnv("ray-tf2"))
client.create_endpoint("tf2", backend="tf2", route="/tf2")

print(requests.get("").text)  # Tensorflow 1.15.0
print(requests.get("").text)  # Tensorflow 2.3.0


If the argument env is omitted, backends will be started in the same conda environment as the caller of client.create_backend by default.

The dependencies required in the backend may be different than the dependencies installed in the driver program (the one running Serve API calls). In this case, you can use an ImportedBackend to specify a backend based on a class that is installed in the Python environment that the workers will run in. Example:

import requests

from ray import serve
from ray.serve.backends import ImportedBackend

client = serve.start()

# Include your class as input to the ImportedBackend constructor.
backend_class = ImportedBackend("ray.serve.utils.MockImportedBackend")
client.create_backend("imported", backend_class, "input_arg")
client.create_endpoint("imported", backend="imported", route="/imported")


Configuring HTTP Server Locations

By default, Ray Serve starts only one HTTP on the head node of the Ray cluster. You can configure this behavior using the http_options={"location": ...} flag in serve.start:

  • “HeadOnly”: start one HTTP server on the head node. Serve assumes the head node is the node you executed serve.start on. This is the default.

  • “EveryNode”: start one HTTP server per node.

  • “NoServer” or None: disable HTTP server.


Using the “EveryNode” option, you can point a cloud load balancer to the instance group of Ray cluster to achieve high availability of Serve’s HTTP proxies.