Advanced Topics and Configurations

Ray Serve has a number of knobs and tools for you to tune for your particular workload. This page covers all of Ray Serve’s advanced options and topics; the fundamentals are covered in Deploying Ray Serve. For a more hands-on take, please check out the Serve Tutorials.

There are a number of things you’ll likely want to do with your serving application including scaling out, splitting traffic, or batching input for better performance. To do all of this, you will create a BackendConfig, a configuration object that you’ll use to set the properties of a particular backend.

Scaling Out

To scale out a backend to many instances, simply configure the number of replicas.

config = {"num_replicas": 10}
client.create_backend("my_scaled_endpoint_backend", handle_request, config=config)

# scale it back down...
config = {"num_replicas": 2}
client.update_backend_config("my_scaled_endpoint_backend", config)

This will scale up or down the number of replicas that can accept requests.

Using Resources (CPUs, GPUs)

To assign hardware resources per replica, pass resource requirements to ray_actor_options. By default, each replica requires one CPU. To learn about the options you can pass in, take a look at the Resources with Actor guide.

For example, to create a backend where each replica uses a single GPU, you can do the following:

config = {"num_gpus": 1}
client.create_backend("my_gpu_backend", handle_request, ray_actor_options=config)

Fractional Resources

The resources specified in ray_actor_options can also be fractional. This allows you to flexibly share resources between replicas. For example, if you have two models and each doesn’t fully saturate a GPU, you might want to have them share a GPU by allocating 0.5 GPUs each. The same could be done to multiplex over CPUs.

half_gpu_config = {"num_gpus": 0.5}
client.create_backend("my_gpu_backend_1", handle_request, ray_actor_options=half_gpu_config)
client.create_backend("my_gpu_backend_2", handle_request, ray_actor_options=half_gpu_config)
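As a rough aid for choosing a fractional value, the arithmetic can be sketched in plain Python. This is only an illustration and does not call Ray; the helper name replicas_per_device is hypothetical, and it assumes each replica must fit entirely on one device.

```python
import math
from fractions import Fraction


def replicas_per_device(per_replica: float) -> int:
    """Return how many replicas fit on one device if each requests `per_replica` of it."""
    if per_replica <= 0:
        raise ValueError("per_replica must be positive")
    # Fraction(str(...)) keeps decimal inputs such as 0.1 exact,
    # so the division below is not affected by float rounding.
    share = Fraction(str(per_replica))
    return math.floor(Fraction(1) / share)


print(replicas_per_device(0.5))   # two replicas can share one GPU
print(replicas_per_device(0.25))  # four replicas can share one GPU
```

A request such as 0.6 leaves room for only one replica per device, so part of the device stays unused.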

Configuring Parallelism with OMP_NUM_THREADS

Deep learning frameworks like PyTorch and TensorFlow often use multithreading when performing inference. The number of CPUs they use is controlled by the OMP_NUM_THREADS environment variable. To avoid contention, Ray sets OMP_NUM_THREADS=1 by default because Ray workers and actors use a single CPU by default. If you do want to enable this parallelism in your Serve backend, set OMP_NUM_THREADS to the desired value either when starting Ray or in your function/class definition:

OMP_NUM_THREADS=12 ray start --head
OMP_NUM_THREADS=12 ray start --address=$HEAD_NODE_ADDRESS

import os

class MyBackend:
    def __init__(self, parallelism):
        # Environment variables must be strings, so convert the integer.
        os.environ["OMP_NUM_THREADS"] = str(parallelism)
        # Download model weights, initialize model, etc.

client.create_backend("parallel_backend", MyBackend, 12)
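The environment-variable step can be tried on its own, without Ray. The sketch below uses a hypothetical helper, set_omp_threads, to show that the value has to be stored as a string. It assumes the variable is set before the numerical library initializes its thread pool, which is typically when the value is read.

```python
import os


def set_omp_threads(parallelism: int) -> str:
    """Set OMP_NUM_THREADS for the current process and return the stored value."""
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    # os.environ only accepts strings; assigning an int raises TypeError.
    os.environ["OMP_NUM_THREADS"] = str(parallelism)
    return os.environ["OMP_NUM_THREADS"]


print(set_omp_threads(12))
```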

Batching to improve performance

You can also have Ray Serve batch requests for performance. To use this feature, you need to:

  1. Set the max_batch_size in the config dictionary.

  2. Modify your backend implementation to accept a list of requests and return a list of responses instead of handling a single request.

class BatchingExample:
    def __init__(self):
        self.count = 0

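    def __call__(self, requests):
        # Receives a list of requests and must return one response per request.
        responses = []
        for request in requests:
            # Placeholder handling: count each request and return the running total.
            self.count += 1
            responses.append(self.count)
        return responses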

config = {"max_batch_size": 5}
client.create_backend("counter1", BatchingExample, config=config)
client.create_endpoint("counter1", backend="counter1", route="/increment")

Please take a look at Batching Tutorial for a deep dive.
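To make the list-in, list-out contract concrete, here is a minimal sketch in plain Python. It is not Ray Serve's batching implementation; run_in_batches and double_all are hypothetical names, and the sketch only shows how requests are grouped into lists of at most max_batch_size items and how the handler returns one response per request.

```python
from typing import Callable, Iterable, List


def run_in_batches(requests: Iterable, handler: Callable[[List], List],
                   max_batch_size: int) -> List:
    """Feed requests to `handler` in lists of at most `max_batch_size` items."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    pending = list(requests)
    responses = []
    for start in range(0, len(pending), max_batch_size):
        batch = pending[start:start + max_batch_size]
        batch_responses = handler(batch)
        # A batched handler must return exactly one response per request.
        if len(batch_responses) != len(batch):
            raise ValueError("handler must return one response per request")
        responses.extend(batch_responses)
    return responses


def double_all(batch: List[int]) -> List[int]:
    """Example batched handler: processes the whole list in one call."""
    return [2 * value for value in batch]


print(run_in_batches(range(7), double_all, max_batch_size=5))
```

With seven requests and a batch size of five, the handler is called twice: once with five items and once with two.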

Splitting Traffic Between Backends

At times it may be useful to expose a single endpoint that is served by multiple backends. You can do this by splitting the traffic for an endpoint between backends using client.set_traffic. When calling client.set_traffic, you provide a dictionary of backend name to a float value that will be used to randomly route that portion of traffic (out of a total of 1.0) to the given backend. For example, here we split traffic 50/50 between two backends:

client.create_backend("backend1", MyClass1)
client.create_backend("backend2", MyClass2)

client.create_endpoint("fifty-fifty", backend="backend1", route="/fifty")
client.set_traffic("fifty-fifty", {"backend1": 0.5, "backend2": 0.5})

Each request is routed randomly between the backends in the traffic dictionary according to the provided weights. Please see Session Affinity for details on how to ensure that clients or users are consistently mapped to the same backend.
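The routing behavior described above can be approximated with a short weighted-choice sketch. This is an illustration of the idea, not Ray Serve's router; pick_backend is a hypothetical helper that takes the same kind of dictionary passed to client.set_traffic.

```python
import random
from collections import Counter


def pick_backend(traffic: dict, rng: random.Random) -> str:
    """Pick one backend name at random, proportionally to its weight."""
    backends = list(traffic)
    weights = list(traffic.values())
    return rng.choices(backends, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so repeated runs give the same sequence
traffic = {"backend1": 0.5, "backend2": 0.5}
counts = Counter(pick_backend(traffic, rng) for _ in range(10_000))
print(counts)
```

Over many requests the counts approach the configured proportions, but any individual request can land on either backend.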

Canary Deployments

client.set_traffic can be used to implement canary deployments, where one backend serves the majority of traffic, while a small fraction is routed to a second backend. This is especially useful for “canary testing” a new model on a small percentage of users, while the tried and true old model serves the majority. Once you are satisfied with the new model, you can reroute all traffic to it and remove the old model:

client.create_backend("default_backend", MyClass)

# Initially, set all traffic to be served by the "default" backend.
client.create_endpoint("canary_endpoint", backend="default_backend", route="/canary-test")

# Add a second backend and route 1% of the traffic to it.
client.create_backend("new_backend", MyNewClass)
client.set_traffic("canary_endpoint", {"default_backend": 0.99, "new_backend": 0.01})

# Add a third backend that serves another 1% of the traffic.
client.create_backend("new_backend2", MyNewClass2)
client.set_traffic("canary_endpoint", {"default_backend": 0.98, "new_backend": 0.01, "new_backend2": 0.01})

# Route all traffic to the new, better backend.
client.set_traffic("canary_endpoint", {"new_backend": 1.0})

# Or, if not so successful, revert to the "default" backend for all traffic.
client.set_traffic("canary_endpoint", {"default_backend": 1.0})
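Because the weights are fractions out of a total of 1.0, it can help to check a traffic dictionary before applying it. The sketch below is a hypothetical pre-check, traffic_is_valid, written in plain Python; it is not part of the Ray Serve API.

```python
import math


def traffic_is_valid(traffic: dict) -> bool:
    """Return True if the weights are non-negative and sum to 1.0."""
    if not traffic:
        return False
    if any(weight < 0 for weight in traffic.values()):
        return False
    # isclose tolerates float rounding, e.g. when summing 0.98 + 0.01 + 0.01.
    return math.isclose(sum(traffic.values()), 1.0, abs_tol=1e-9)


print(traffic_is_valid({"default_backend": 0.99, "new_backend": 0.01}))
print(traffic_is_valid({"default_backend": 0.99, "new_backend": 0.1}))
```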

Incremental Rollout

client.set_traffic can also be used to implement incremental rollout. Here, we want to replace an existing backend with a new implementation by gradually increasing the proportion of traffic that it serves. In the example below, we do this repeatedly in one script, but in practice this would likely happen over time across multiple scripts.

client.create_backend("existing_backend", MyClass)

# Initially, all traffic is served by the existing backend.
client.create_endpoint("incremental_endpoint", backend="existing_backend", route="/incremental")

# Then we can slowly increase the proportion of traffic served by the new backend.
client.create_backend("new_backend", MyNewClass)
client.set_traffic("incremental_endpoint", {"existing_backend": 0.9, "new_backend": 0.1})
client.set_traffic("incremental_endpoint", {"existing_backend": 0.8, "new_backend": 0.2})
client.set_traffic("incremental_endpoint", {"existing_backend": 0.5, "new_backend": 0.5})
client.set_traffic("incremental_endpoint", {"new_backend": 1.0})

# At any time, we can roll back to the existing backend.
client.set_traffic("incremental_endpoint", {"existing_backend": 1.0})
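The sequence of traffic dictionaries for a rollout can be generated rather than typed by hand. The following is a minimal sketch; rollout_steps is a hypothetical helper, and each dictionary it returns is meant to be passed to client.set_traffic at whatever pace suits your deployment.

```python
from typing import Dict, Iterable, List


def rollout_steps(old: str, new: str,
                  new_fractions: Iterable[float]) -> List[Dict[str, float]]:
    """Build one traffic dictionary per step, shifting traffic from `old` to `new`."""
    steps = []
    for fraction in new_fractions:
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("each fraction must be between 0 and 1")
        # Round the remainder so float noise does not leak into the weights.
        weights = {old: round(1.0 - fraction, 10), new: fraction}
        # Drop backends that no longer receive any traffic.
        steps.append({name: w for name, w in weights.items() if w > 0})
    return steps


for step in rollout_steps("existing_backend", "new_backend", [0.1, 0.2, 0.5, 1.0]):
    print(step)
```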

Session Affinity

Splitting traffic randomly among backends for each request is general and simple, but it can be an issue when you want to ensure that a given user or client is served by the same backend repeatedly. To address this, a “shard key” can be specified for each request that will deterministically map to a backend. In practice, this should be something that uniquely identifies the entity that you want to consistently map, like a client ID or session ID. The shard key can be specified either via the X-SERVE-SHARD-KEY HTTP header or via handle.options(shard_key="key").


The mapping from shard key to backend may change when you update the traffic policy for an endpoint.

# Specifying the shard key via an HTTP header.
requests.get("", headers={"X-SERVE-SHARD-KEY": session_id})

# Specifying the shard key in a call made via serve handle.
handle = client.get_handle("api_endpoint")
handle.options(shard_key=session_id).remote()  # pass request arguments to remote() as needed
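One way to picture a deterministic mapping from shard key to backend is to hash the key to a point in [0, 1) and find which weight interval it falls into. The sketch below is an assumption about how such a mapping could work, not Ray Serve's actual algorithm; backend_for_key is a hypothetical name.

```python
import hashlib


def backend_for_key(shard_key: str, traffic: dict) -> str:
    """Map a shard key to a backend, deterministically and in proportion to the weights."""
    if not traffic:
        raise ValueError("traffic must name at least one backend")
    digest = hashlib.sha256(shard_key.encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1).
    point = int.from_bytes(digest[:8], "big") / 2**64
    names = sorted(traffic)  # fixed order so the intervals are stable
    cumulative = 0.0
    for name in names:
        cumulative += traffic[name]
        if point < cumulative:
            return name
    return names[-1]  # guards against float rounding in the last interval


traffic = {"backend1": 0.5, "backend2": 0.5}
print(backend_for_key("session-123", traffic))
print(backend_for_key("session-123", traffic))  # same key, same backend
```

In a scheme like this, changing the weights moves the interval boundaries, so a key can map to a different backend after the traffic policy is updated.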

Shadow Testing

Sometimes when deploying a new backend, you may want to test it out without affecting the results seen by users. You can do this with client.shadow_traffic, which allows you to duplicate requests to multiple backends for testing while still having them served by the set of backends specified via client.set_traffic. Metrics about these requests are recorded as usual so you can use them to validate model performance. This is demonstrated in the example below, where we create an endpoint serviced by a single backend but shadow traffic to two other backends for testing.

client.create_backend("existing_backend", MyClass)

# All traffic is served by the existing backend.
client.create_endpoint("shadowed_endpoint", backend="existing_backend", route="/shadow")

# Create two new backends that we want to test.
client.create_backend("new_backend_1", MyNewClass)
client.create_backend("new_backend_2", MyNewClass)

# Shadow traffic to the two new backends. This does not influence the result
# of requests to the endpoint, but a proportion of requests are
# *additionally* sent to these backends.

# Additionally send 50% of the endpoint's queries to new_backend_1.
client.shadow_traffic("shadowed_endpoint", "new_backend_1", 0.5)
# Additionally send 10% of the endpoint's queries to new_backend_2.
client.shadow_traffic("shadowed_endpoint", "new_backend_2", 0.1)

# Stop shadowing traffic to the backends.
client.shadow_traffic("shadowed_endpoint", "new_backend_1", 0)
client.shadow_traffic("shadowed_endpoint", "new_backend_2", 0)
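The key property of shadowing is that the caller only ever sees the primary backend's result. A minimal sketch of that behavior in plain Python follows; route_with_shadow is a hypothetical function, not Ray Serve's router, and the backends are ordinary callables.

```python
import random
from typing import Callable, List, Tuple


def route_with_shadow(request, primary: Callable,
                      shadows: List[Tuple[Callable, float]],
                      rng: random.Random):
    """Serve `request` from `primary`; also send it to each shadow with its proportion."""
    result = primary(request)
    for shadow, proportion in shadows:
        if rng.random() < proportion:
            shadow(request)  # the shadow's return value is discarded
    return result


seen = []  # records what the shadow backend received
result = route_with_shadow(
    "req-1",
    primary=lambda r: f"served {r}",
    shadows=[(seen.append, 1.0)],
    rng=random.Random(0),
)
print(result, seen)
```

Setting a proportion to 0 means the shadow is never called, which mirrors how shadowing is stopped in the example above.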

Composing Multiple Models

Ray Serve supports composing individually scalable models into a single model out of the box. For instance, you can combine multiple models to perform stacking or ensembles.

To define a higher-level composed model you need to do three things:

  1. Define your underlying models (the ones that you will compose together) as Ray Serve backends

  2. Define your composed model, using the handles of the underlying models (see the example below).

  3. Define an endpoint representing this composed model and query it!

To keep the composed model from blocking on each call to an underlying model (which makes calls to the composed model very slow), define its handler with async def and await the handle calls. You’ll see this in the example below.

That’s it. Let’s take a look at an example:

from random import random
import requests
import ray
from ray import serve

client = serve.start()

# Our pipeline will be structured as follows:
# - Input comes in, the composed model sends it to model_one
# - model_one outputs a random number between 0 and 1, if the value is
#   greater than 0.5, then the data is sent to model_two
# - otherwise, the data is returned to the user.

# Let's define two models that just print out the data they received.

def model_one(request):
    print("Model 1 called with data ", request.args.get("data"))
    return random()

def model_two(request):
    print("Model 2 called with data ", request.args.get("data"))
    return request.args.get("data")

class ComposedModel:
    def __init__(self):
        client = serve.connect()
        self.model_one = client.get_handle("model_one")
        self.model_two = client.get_handle("model_two")

    # This method can be called concurrently!
    async def __call__(self, flask_request):
        # Assumed: the payload is the raw body of the incoming request.
        data = flask_request.data

        score = await self.model_one.remote(data=data)
        if score > 0.5:
            result = await self.model_two.remote(data=data)
            result = {"model_used": 2, "score": score}
        else:
            result = {"model_used": 1, "score": score}

        return result

client.create_backend("model_one", model_one)
client.create_endpoint("model_one", backend="model_one")

client.create_backend("model_two", model_two)
client.create_endpoint("model_two", backend="model_two")

# max_concurrent_queries is optional. By default, if you pass in an async
# function, Ray Serve sets the limit to a high number.
client.create_backend(
    "composed_backend", ComposedModel, config={"max_concurrent_queries": 10})
client.create_endpoint(
    "composed", backend="composed_backend", route="/composed")

for _ in range(5):
    resp = requests.get("", data="hey!")
    print(resp.json())
# Output
# {'model_used': 2, 'score': 0.6250189863595503}
# {'model_used': 1, 'score': 0.03146855349621436}
# {'model_used': 2, 'score': 0.6916977560006987}
# {'model_used': 2, 'score': 0.8169693450866928}
# {'model_used': 2, 'score': 0.9540681979573862}
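The control flow of the composed model can also be tried without a running cluster. The sketch below replaces the Serve handles with plain coroutines, so it only illustrates the async branching logic; model_one, model_two, and composed here are local stand-ins, not Ray Serve objects.

```python
import asyncio
import random


async def model_one(data):
    """Stand-in for the first model: returns a random score in [0, 1)."""
    await asyncio.sleep(0)  # yield control, as a real remote call would
    return random.random()


async def model_two(data):
    """Stand-in for the second model: echoes its input."""
    await asyncio.sleep(0)
    return data


async def composed(data):
    score = await model_one(data)
    if score > 0.5:
        await model_two(data)
        return {"model_used": 2, "score": score}
    return {"model_used": 1, "score": score}


async def main():
    # Several composed calls run concurrently on one event loop.
    return await asyncio.gather(*(composed("hey!") for _ in range(5)))


results = asyncio.run(main())
for item in results:
    print(item)
```

Because each await yields to the event loop, the five calls interleave instead of running one after another.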


Monitoring

Ray Serve exposes important system metrics like the number of successful and errored requests through the Ray metrics monitoring infrastructure. By default, the metrics are exposed in Prometheus format on each node. See the Ray Monitoring documentation for more information.

Reconfiguring Backends (Experimental)

Suppose you want to update a parameter in your model without creating a whole new backend. You can do this by writing a reconfigure method for the class underlying your backend. At runtime, you can then pass in your new parameters by setting the user_config field of BackendConfig.

The following simple example will make the usage clear:

import requests
import random

import ray
from ray import serve
from ray.serve import BackendConfig

client = serve.start()

class Threshold:
    def __init__(self):
        # self.model won't be changed by reconfigure.
        self.model = random.Random()  # Imagine this is some heavyweight model.

    def reconfigure(self, config):
        # This will be called when the class is created and when
        # the user_config field of BackendConfig is updated.
        self.threshold = config["threshold"]

    def __call__(self, request):
        return self.model.random() > self.threshold

backend_config = BackendConfig(user_config={"threshold": 0.01})
client.create_backend("threshold", Threshold, config=backend_config)
client.create_endpoint("threshold", backend="threshold", route="/threshold")
print(requests.get("").text)  # true, probably

backend_config = BackendConfig(user_config={"threshold": 0.99})
client.update_backend_config("threshold", backend_config)
print(requests.get("").text)  # false, probably

The reconfigure method is called when the class is created if user_config is set. In particular, it’s also called when new replicas are created in the future, in case you decide to scale up your backend later. The reconfigure method is also called each time user_config is updated via client.update_backend_config.
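The three points at which reconfigure runs (creation, later scale-up, and config updates) can be illustrated with a toy harness. ReplicaSet below is a hypothetical stand-in for a backend's replicas and is not Ray Serve code; it only mimics the call order described above.

```python
class Threshold:
    def __init__(self):
        self.reconfigure_calls = 0

    def reconfigure(self, config):
        self.threshold = config["threshold"]
        self.reconfigure_calls += 1


class ReplicaSet:
    """Toy model of a backend: creates replicas and forwards user_config to them."""

    def __init__(self, cls, num_replicas, user_config=None):
        self.cls = cls
        self.user_config = user_config
        self.replicas = []
        self.scale(num_replicas)

    def scale(self, num_replicas):
        # New replicas receive the current user_config right after creation.
        while len(self.replicas) < num_replicas:
            replica = self.cls()
            if self.user_config is not None:
                replica.reconfigure(self.user_config)
            self.replicas.append(replica)
        del self.replicas[num_replicas:]

    def update_user_config(self, user_config):
        # Existing replicas are reconfigured in place, not recreated.
        self.user_config = user_config
        for replica in self.replicas:
            replica.reconfigure(user_config)


backend = ReplicaSet(Threshold, num_replicas=2, user_config={"threshold": 0.01})
backend.update_user_config({"threshold": 0.99})
backend.scale(3)
print([replica.threshold for replica in backend.replicas])
```

The replica added by the final scale call starts directly with the latest configuration, while the two original replicas were reconfigured twice.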

Dependency Management

Ray Serve supports serving backends with different (possibly conflicting) Python dependencies. For example, you can simultaneously serve one backend that uses legacy TensorFlow 1 and another backend that uses TensorFlow 2.

Currently this is supported using conda. You must have a conda environment set up for each set of dependencies you want to isolate. If using a multi-node cluster, the conda configuration must be identical across all nodes.

Here’s an example script. For it to work, first create a conda environment named ray-tf1 with Ray Serve and TensorFlow 1 installed, and another named ray-tf2 with Ray Serve and TensorFlow 2. The Ray and Python versions must be the same in both environments. To specify an environment for a backend to use, pass the environment name to client.create_backend as shown below. Be sure to run the script in an activated conda environment (it does not have to be ray-tf1 or ray-tf2).

import requests
import ray
from ray import serve
from ray.serve import CondaEnv
import tensorflow as tf

client = serve.start()

def tf_version(request):
    return ("Tensorflow " + tf.__version__)

client.create_backend("tf1", tf_version, env=CondaEnv("ray-tf1"))
client.create_endpoint("tf1", backend="tf1", route="/tf1")
client.create_backend("tf2", tf_version, env=CondaEnv("ray-tf2"))
client.create_endpoint("tf2", backend="tf2", route="/tf2")

print(requests.get("").text)  # Tensorflow 1.15.0
print(requests.get("").text)  # Tensorflow 2.3.0

Alternatively, you may omit the argument env and call client.create_backend from a script running in the conda environment you want the backend to run in.
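When debugging environment mismatches, it can be useful to print which conda environment and Python version a process is actually running in. The sketch below is a hypothetical diagnostic, describe_environment, that relies on the CONDA_DEFAULT_ENV variable set by conda activate; it does not use any Ray Serve API.

```python
import os
import platform


def describe_environment() -> dict:
    """Report the active conda environment (if any) and the Python version."""
    return {
        # None when no conda environment is active in this process.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        "python": platform.python_version(),
    }


print(describe_environment())
```

Running this in each conda environment is a quick way to confirm that the Python versions match.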