Advanced Topics, Configurations, & FAQ

Ray Serve has a number of knobs and tools that you can tune for your particular workload. This page covers all of Ray Serve's advanced options and topics; the fundamentals are covered in Deploying Ray Serve. For a more hands-on take, please check out the Serve Tutorials.

There are a number of things you’ll likely want to do with your serving application including scaling out, splitting traffic, or batching input for better performance. To do all of this, you will create a BackendConfig, a configuration object that you’ll use to set the properties of a particular backend.

Scaling Out

To scale out a backend to multiple workers, simply configure the number of replicas.

config = {"num_replicas": 10}
client.create_backend("my_scaled_endpoint_backend", handle_request, config=config)

# scale it back down...
config = {"num_replicas": 2}
client.update_backend_config("my_scaled_endpoint_backend", config)

This will scale up or down the number of workers that can accept requests.

Using Resources (CPUs, GPUs)

To assign hardware resources per worker, pass resource requirements to ray_actor_options. To learn about the options you can pass in, take a look at the Resources with Actor guide.

For example, to create a backend where each replica uses a single GPU, you can do the following:

config = {"num_gpus": 1}
client.create_backend("my_gpu_backend", handle_request, ray_actor_options=config)

Configuring Parallelism with OMP_NUM_THREADS

Deep learning frameworks like PyTorch and TensorFlow often use multithreading when performing inference. The number of CPUs they use is controlled by the OMP_NUM_THREADS environment variable. To avoid contention, Ray sets OMP_NUM_THREADS=1 by default because Ray workers and actors use a single CPU by default. If you do want to enable this parallelism in your Serve backend, set OMP_NUM_THREADS to the desired value either when starting Ray or in your function/class definition:

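# Shell: set the variable when starting Ray on the head node and on each worker node.
OMP_NUM_THREADS=12 ray start --head
OMP_NUM_THREADS=12 ray start --address=$HEAD_NODE_ADDRESS

# Python: alternatively, set the variable in your backend's constructor.
import os

class MyBackend:
    def __init__(self, parallelism):
        # Environment variable values must be strings.
        os.environ["OMP_NUM_THREADS"] = str(parallelism)
        # Download model weights, initialize model, etc.

client.create_backend("parallel_backend", MyBackend, 12)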
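Environment variable values must be strings, so an integer thread count has to be converted before it is stored. The following is a minimal sketch of that step in isolation; `set_omp_threads` is a hypothetical helper name used for illustration and is not part of Ray or Ray Serve.

```python
import os


def set_omp_threads(parallelism: int) -> str:
    """Store the thread count in OMP_NUM_THREADS and return the stored value.

    Hypothetical helper for illustration; not part of Ray or Ray Serve.
    """
    if parallelism < 1:
        raise ValueError("parallelism must be a positive integer")
    # os.environ only accepts str values; assigning an int raises TypeError.
    os.environ["OMP_NUM_THREADS"] = str(parallelism)
    return os.environ["OMP_NUM_THREADS"]


stored = set_omp_threads(12)
```

In a backend class, a call like this would typically go at the top of `__init__`, before the model framework is initialized.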

Batching to improve performance

You can also have Ray Serve batch requests for performance. To use this feature, you need to:

  1. Set the max_batch_size in the config dictionary.

  2. Modify your backend implementation to accept a list of requests and return a list of responses instead of handling a single request.

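class BatchingExample:
    def __init__(self):
        self.count = 0

    # Depending on your Serve version, this method may also need to be
    # marked as batch-accepting; see the Batching Tutorial.
    def __call__(self, requests):
        # `requests` is a list; return one response per request, in order.
        responses = []
        for request in requests:
            self.count += 1
            responses.append(self.count)
        return responses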

config = {"max_batch_size": 5}
client.create_backend("counter1", BatchingExample, config=config)
client.create_endpoint("counter1", backend="counter1", route="/increment")

Please take a look at Batching Tutorial for a deep dive.
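To make the batch contract concrete, here is a small pure-Python sketch, independent of Ray, that feeds a list of requests to a batch handler in chunks of at most `max_batch_size`. Serve forms batches from requests that arrive concurrently; this sketch only chunks a list, and `run_in_batches` and `upper_handler` are hypothetical names. What it shows is the contract itself: the handler receives a list and must return a list of the same length.

```python
from typing import Callable, List

batch_sizes: List[int] = []  # records the size of every batch the handler saw


def run_in_batches(requests: List[str],
                   handler: Callable[[List[str]], List[str]],
                   max_batch_size: int) -> List[str]:
    """Call `handler` on consecutive chunks of at most `max_batch_size` items."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    responses: List[str] = []
    for start in range(0, len(requests), max_batch_size):
        batch = requests[start:start + max_batch_size]
        batch_responses = handler(batch)
        # A batch handler must return exactly one response per request.
        if len(batch_responses) != len(batch):
            raise ValueError("handler must return one response per request")
        responses.extend(batch_responses)
    return responses


def upper_handler(batch: List[str]) -> List[str]:
    """Toy batch handler: upper-cases every request in the batch."""
    batch_sizes.append(len(batch))
    return [item.upper() for item in batch]


results = run_in_batches(list("abcdefg"), upper_handler, max_batch_size=5)
```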

Splitting Traffic Between Backends

At times it may be useful to expose a single endpoint that is served by multiple backends. You can do this by splitting the traffic for an endpoint between backends using client.set_traffic. When calling client.set_traffic, you provide a dictionary that maps each backend name to a float: the portion of traffic (out of a total of 1.0) that is randomly routed to that backend. For example, here we split traffic 50/50 between two backends:

client.create_backend("backend1", MyClass1)
client.create_backend("backend2", MyClass2)

client.create_endpoint("fifty-fifty", backend="backend1", route="/fifty")
client.set_traffic("fifty-fifty", {"backend1": 0.5, "backend2": 0.5})

Each request is routed randomly between the backends in the traffic dictionary according to the provided weights. Please see Session Affinity for details on how to ensure that clients or users are consistently mapped to the same backend.
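The effect of weighted random routing can be sketched without Ray. The snippet below is an illustration of the idea only, not Serve's router: `pick_backend` is a hypothetical helper that draws one backend per request using the traffic dictionary as weights.

```python
import random
from typing import Dict


def pick_backend(traffic: Dict[str, float], rng: random.Random) -> str:
    """Pick one backend at random, weighted by its share of traffic."""
    backends = list(traffic)
    weights = [traffic[name] for name in backends]
    return rng.choices(backends, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so the simulation is repeatable
traffic = {"backend1": 0.5, "backend2": 0.5}
counts = {name: 0 for name in traffic}
for _ in range(10_000):
    counts[pick_backend(traffic, rng)] += 1
```

Over many requests, each backend's share of `counts` approaches its weight, but any individual request can land on either backend.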

Canary Deployments

client.set_traffic can be used to implement canary deployments, where one backend serves the majority of traffic, while a small fraction is routed to a second backend. This is especially useful for “canary testing” a new model on a small percentage of users, while the tried and true old model serves the majority. Once you are satisfied with the new model, you can reroute all traffic to it and remove the old model:

client.create_backend("default_backend", MyClass)

# Initially, set all traffic to be served by the "default" backend.
client.create_endpoint("canary_endpoint", backend="default_backend", route="/canary-test")

# Add a second backend and route 1% of the traffic to it.
client.create_backend("new_backend", MyNewClass)
client.set_traffic("canary_endpoint", {"default_backend": 0.99, "new_backend": 0.01})

# Add a third backend that serves another 1% of the traffic.
client.create_backend("new_backend2", MyNewClass2)
client.set_traffic("canary_endpoint", {"default_backend": 0.98, "new_backend": 0.01, "new_backend2": 0.01})

# Route all traffic to the new, better backend.
client.set_traffic("canary_endpoint", {"new_backend": 1.0})

# Or, if not so successful, revert to the "default" backend for all traffic.
client.set_traffic("canary_endpoint", {"default_backend": 1.0})
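Because the weights are fractions of a total of 1.0, it can help to check a traffic dictionary before applying it. The sketch below is one way to do that; `validate_traffic` is a hypothetical helper, and whether Serve performs a similar check itself is not covered here.

```python
import math
from typing import Dict


def validate_traffic(traffic: Dict[str, float]) -> None:
    """Raise ValueError unless the weights are non-negative and sum to 1.0."""
    if not traffic:
        raise ValueError("traffic dictionary must not be empty")
    for name, weight in traffic.items():
        if weight < 0:
            raise ValueError(f"weight for {name!r} must be non-negative")
    total = sum(traffic.values())
    # Compare with a tolerance because sums of floats are rarely exact.
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"weights must sum to 1.0, got {total}")


# The three-way canary split from the example above passes the check.
validate_traffic(
    {"default_backend": 0.98, "new_backend": 0.01, "new_backend2": 0.01})
```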

Incremental Rollout

client.set_traffic can also be used to implement incremental rollout. Here, we want to replace an existing backend with a new implementation by gradually increasing the proportion of traffic that it serves. In the example below, we do this repeatedly in one script, but in practice this would likely happen over time across multiple scripts.

client.create_backend("existing_backend", MyClass)

# Initially, all traffic is served by the existing backend.
client.create_endpoint("incremental_endpoint", backend="existing_backend", route="/incremental")

# Then we can slowly increase the proportion of traffic served by the new backend.
client.create_backend("new_backend", MyNewClass)
client.set_traffic("incremental_endpoint", {"existing_backend": 0.9, "new_backend": 0.1})
client.set_traffic("incremental_endpoint", {"existing_backend": 0.8, "new_backend": 0.2})
client.set_traffic("incremental_endpoint", {"existing_backend": 0.5, "new_backend": 0.5})
client.set_traffic("incremental_endpoint", {"new_backend": 1.0})

# At any time, we can roll back to the existing backend.
client.set_traffic("incremental_endpoint", {"existing_backend": 1.0})
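The sequence of traffic dictionaries used in a rollout can be generated rather than written out by hand. This is a sketch under the assumption that the rollout involves exactly two backends; `rollout_steps` is a hypothetical helper, and each dictionary it returns would be passed to client.set_traffic in turn.

```python
from typing import Dict, Iterable, List


def rollout_steps(old: str, new: str,
                  new_fractions: Iterable[float]) -> List[Dict[str, float]]:
    """Build one traffic dictionary per rollout step."""
    steps: List[Dict[str, float]] = []
    for fraction in new_fractions:
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("each fraction must be between 0.0 and 1.0")
        if fraction == 1.0:
            steps.append({new: 1.0})   # rollout finished
        elif fraction == 0.0:
            steps.append({old: 1.0})   # fully rolled back
        else:
            # Round to hide float noise such as 0.7999999999999999.
            steps.append({old: round(1.0 - fraction, 6), new: fraction})
    return steps


steps = rollout_steps("existing_backend", "new_backend", [0.1, 0.2, 0.5, 1.0])
```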

Session Affinity

Splitting traffic randomly among backends for each request is general and simple, but it can be a problem when you want to ensure that a given user or client is served by the same backend repeatedly. To address this, Serve lets you specify a “shard key” for each request that deterministically maps to a backend. In practice, this should be something that uniquely identifies the entity that you want to consistently map, like a client ID or session ID. The shard key can be specified either via the X-SERVE-SHARD-KEY HTTP header or via handle.options(shard_key="key").


The mapping from shard key to backend may change when you update the traffic policy for an endpoint.

# Specifying the shard key via an HTTP header.
requests.get("", headers={"X-SERVE-SHARD-KEY": session_id})

# Specifying the shard key in a call made via serve handle.
handle = client.get_handle("api_endpoint")
handle.options(shard_key=session_id).remote(args)
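One way to picture a deterministic mapping from shard key to backend is to hash the key to a point in [0, 1) and find which backend's slice of the traffic weights contains that point. The sketch below illustrates that idea only; it is not Serve's actual algorithm, and `backend_for_shard_key` is a hypothetical name. It also shows why the mapping can change when the traffic policy changes: the slices move.

```python
import hashlib
from typing import Dict


def backend_for_shard_key(shard_key: str, traffic: Dict[str, float]) -> str:
    """Deterministically map a shard key to a backend using traffic weights."""
    if not traffic:
        raise ValueError("traffic dictionary must not be empty")
    digest = hashlib.sha256(shard_key.encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1).
    point = int.from_bytes(digest[:8], "big") / 2**64
    names = sorted(traffic)  # fixed order so the slices are stable
    cumulative = 0.0
    for name in names:
        cumulative += traffic[name]
        if point < cumulative:
            return name
    return names[-1]  # guard against floating-point rounding at the top end


traffic = {"backend1": 0.5, "backend2": 0.5}
first = backend_for_shard_key("session-42", traffic)
second = backend_for_shard_key("session-42", traffic)
```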

Shadow Testing

Sometimes when deploying a new backend, you may want to test it out without affecting the results seen by users. You can do this with client.shadow_traffic, which allows you to duplicate requests to multiple backends for testing while still having them served by the set of backends specified via client.set_traffic. Metrics about these requests are recorded as usual so you can use them to validate model performance. This is demonstrated in the example below, where we create an endpoint serviced by a single backend but shadow traffic to two other backends for testing.

client.create_backend("existing_backend", MyClass)

# All traffic is served by the existing backend.
client.create_endpoint("shadowed_endpoint", backend="existing_backend", route="/shadow")

# Create two new backends that we want to test.
client.create_backend("new_backend_1", MyNewClass)
client.create_backend("new_backend_2", MyNewClass)

# Shadow traffic to the two new backends. This does not influence the result
# of requests to the endpoint, but a proportion of requests are
# *additionally* sent to these backends.

# Additionally send 50% of the endpoint's queries to new_backend_1.
client.shadow_traffic("shadowed_endpoint", "new_backend_1", 0.5)
# Additionally send 10% of the endpoint's queries to new_backend_2.
client.shadow_traffic("shadowed_endpoint", "new_backend_2", 0.1)

# Stop shadowing traffic to the backends.
client.shadow_traffic("shadowed_endpoint", "new_backend_1", 0)
client.shadow_traffic("shadowed_endpoint", "new_backend_2", 0)
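The semantics of shadowing can be sketched in plain Python: the primary backend always produces the response, and each shadow backend additionally receives a copy of the request with its configured probability, with its result discarded. This is a simplified model, not Serve's implementation, and `handle_with_shadows` is a hypothetical name.

```python
import random
from typing import Any, Callable, Dict, List, Tuple

Backend = Callable[[Any], Any]


def handle_with_shadows(request: Any,
                        primary: Backend,
                        shadows: Dict[str, Tuple[Backend, float]],
                        rng: random.Random) -> Tuple[Any, List[str]]:
    """Serve `request` from `primary` and duplicate it to some shadows."""
    result = primary(request)  # only this result reaches the caller
    shadowed: List[str] = []
    for name, (backend, proportion) in shadows.items():
        if rng.random() < proportion:
            backend(request)  # the shadow's response is discarded
            shadowed.append(name)
    return result, shadowed


calls: List[str] = []  # records which shadow backends were invoked
shadows = {
    "always": (lambda req: calls.append("always"), 1.0),
    "never": (lambda req: calls.append("never"), 0.0),
}
result, shadowed = handle_with_shadows(
    "hey!", lambda req: req.upper(), shadows, random.Random(0))
```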

Composing Multiple Models

Ray Serve supports composing individually scalable models into a single model out of the box. For instance, you can combine multiple models to perform stacking or ensembles.

To define a higher-level composed model you need to do three things:

  1. Define your underlying models (the ones that you will compose together) as Ray Serve backends

  2. Define your composed model, using the handles of the underlying models (see the example below).

  3. Define an endpoint representing this composed model and query it!

To keep the composed model from executing synchronously (which makes calls to the composed model very slow), you’ll need to make its handler asynchronous by defining it with async def. You’ll see this in the example below.

That’s it. Let’s take a look at an example:

from random import random
import requests
import ray
from ray import serve

client = serve.start()

# Our pipeline will be structured as follows:
# - Input comes in, the composed model sends it to model_one
# - model_one outputs a random number between 0 and 1, if the value is
#   greater than 0.5, then the data is sent to model_two
# - otherwise, the data is returned to the user.

# Let's define two models that just print out the data they received.

def model_one(request):
    print("Model 1 called with data ", request.args.get("data"))
    return random()

def model_two(request):
    print("Model 2 called with data ", request.args.get("data"))
    return request.args.get("data")

class ComposedModel:
    def __init__(self):
        client = serve.connect()
        self.model_one = client.get_handle("model_one")
        self.model_two = client.get_handle("model_two")

    # This method can be called concurrently!
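    async def __call__(self, flask_request):
        # Request body; assumes the Flask-style `data` attribute.
        data = flask_request.data

        score = await self.model_one.remote(data=data)
        if score > 0.5:
            # The score is high enough, so the data is also sent to model_two.
            result = await self.model_two.remote(data=data)
            result = {"model_used": 2, "score": score}
        else:
            result = {"model_used": 1, "score": score}

        return result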

client.create_backend("model_one", model_one)
client.create_endpoint("model_one", backend="model_one")

client.create_backend("model_two", model_two)
client.create_endpoint("model_two", backend="model_two")

# max_concurrent_queries is optional. By default, if you pass in an async
# function, Ray Serve sets the limit to a high number.
client.create_backend(
    "composed_backend", ComposedModel, config={"max_concurrent_queries": 10})
client.create_endpoint(
    "composed", backend="composed_backend", route="/composed")

for _ in range(5):
    resp = requests.get("", data="hey!")
    print(resp.json())
# Output
# {'model_used': 2, 'score': 0.6250189863595503}
# {'model_used': 1, 'score': 0.03146855349621436}
# {'model_used': 2, 'score': 0.6916977560006987}
# {'model_used': 2, 'score': 0.8169693450866928}
# {'model_used': 2, 'score': 0.9540681979573862}
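The reason for async def is easier to see in a version that runs without Ray. The sketch below mirrors the control flow of the example using plain asyncio coroutines in place of Serve handles (an assumption made only so the code is self-contained). Because the composed function awaits its sub-models instead of blocking, several calls can be in flight at once.

```python
import asyncio
import random


async def model_one(data: str) -> float:
    """Stand-in for the first model: returns a random score in [0, 1)."""
    await asyncio.sleep(0)  # yield control, as a remote call would
    return random.random()


async def model_two(data: str) -> str:
    """Stand-in for the second model: echoes the data back."""
    await asyncio.sleep(0)
    return data


async def composed(data: str) -> dict:
    """Call model_one, and call model_two only when the score exceeds 0.5."""
    score = await model_one(data)
    if score > 0.5:
        await model_two(data)
        return {"model_used": 2, "score": score}
    return {"model_used": 1, "score": score}


async def main() -> list:
    # Five composed calls run concurrently on one event loop.
    return await asyncio.gather(*(composed("hey!") for _ in range(5)))


results = asyncio.run(main())
```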


Monitoring

Ray Serve exposes important system metrics like the number of successful and errored requests through the Ray metrics monitoring infrastructure. By default, the metrics are exposed in Prometheus format on each node. See the Ray Monitoring documentation for more information.

Ray Serve FAQ

How do I deploy Serve?

See Deploying Ray Serve for information about how to deploy Serve.

How do I delete backends and endpoints?

To delete a backend, you can use client.delete_backend. Note that the backend must not be in use by any endpoint in order to be deleted. Once a backend is deleted, its tag can be reused.

To delete an endpoint, you can use client.delete_endpoint. Note that the endpoint will no longer work and will return a 404 when queried. Once an endpoint is deleted, its tag can be reused.
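The ordering constraint (remove the endpoints that use a backend before removing the backend) can be modeled with a toy in-memory registry. This is an illustration only: the `Registry` class and its use of ValueError are assumptions and do not reflect Serve's internals or its error types.

```python
from typing import Dict, Set


class Registry:
    """Toy model of backends and the endpoints that route to them."""

    def __init__(self) -> None:
        self.backends: Set[str] = set()
        self.endpoints: Dict[str, str] = {}  # endpoint tag -> backend tag

    def create_backend(self, tag: str) -> None:
        if tag in self.backends:
            raise ValueError(f"backend {tag!r} already exists")
        self.backends.add(tag)

    def create_endpoint(self, tag: str, backend: str) -> None:
        if backend not in self.backends:
            raise ValueError(f"unknown backend {backend!r}")
        self.endpoints[tag] = backend

    def delete_endpoint(self, tag: str) -> None:
        del self.endpoints[tag]

    def delete_backend(self, tag: str) -> None:
        users = [ep for ep, be in self.endpoints.items() if be == tag]
        if users:
            # Mirrors the rule that a backend in use cannot be deleted.
            raise ValueError(f"backend {tag!r} is still used by {users}")
        self.backends.remove(tag)


registry = Registry()
registry.create_backend("my_backend")
registry.create_endpoint("my_endpoint", backend="my_backend")
registry.delete_endpoint("my_endpoint")  # endpoint first...
registry.delete_backend("my_backend")    # ...then the backend
registry.create_backend("my_backend")    # the tag can be reused
```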


How do I call an endpoint from Python code?

Use client.get_handle to get a handle to the endpoint, then use handle.remote to send requests to that endpoint. This returns a Ray ObjectRef whose result can be waited for or retrieved using ray.wait or ray.get, respectively.

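handle = client.get_handle("api_endpoint")
# handle.remote returns an ObjectRef; ray.get blocks until the result is ready.
result = ray.get(handle.remote(request))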

How do I call a method on my backend class besides __call__?

To call a method via HTTP use the header field X-SERVE-CALL-METHOD.

To call a method via Python, use handle.options:

class StatefulProcessor:
    def __init__(self):
        self.count = 1

    def __call__(self, request):
        return {"current": self.count}

    def other_method(self, inc):
        self.count += inc
        return True

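handle = client.get_handle("backend_name")

# Call other_method instead of __call__. Passing the method name as
# method_name= is an assumption; check handle.options in your Serve version.
ray.get(handle.options(method_name="other_method").remote(inc=10))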
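Conceptually, selecting a method by name comes down to looking the method up on the backend instance and calling it. The sketch below shows that dispatch in plain Python with a copy of the class above. `call_method` is a hypothetical helper, not a Serve API, and it does not describe how Serve performs the lookup internally.

```python
from typing import Any


class StatefulProcessor:
    def __init__(self) -> None:
        self.count = 1

    def __call__(self, request: Any) -> dict:
        return {"current": self.count}

    def other_method(self, inc: int) -> bool:
        self.count += inc
        return True


def call_method(backend: Any, *args: Any,
                method_name: str = "__call__", **kwargs: Any) -> Any:
    """Look up `method_name` on the backend instance and call it."""
    method = getattr(backend, method_name, None)
    if not callable(method):
        raise AttributeError(f"backend has no callable {method_name!r}")
    return method(*args, **kwargs)


processor = StatefulProcessor()
before = call_method(processor, None)                              # __call__
changed = call_method(processor, method_name="other_method", inc=10)
after = call_method(processor, None)
```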

How do I enable CORS and other HTTP features?

Serve supports arbitrary Starlette middlewares and custom middlewares in Starlette format. The example below shows how to enable Cross-Origin Resource Sharing (CORS). You can follow the same pattern for other Starlette middlewares.


Serve does not list Starlette as one of its dependencies. To use this feature, install it first:

pip install starlette

from ray import serve
from starlette.middleware import Middleware
from starlette.middleware.cors import CORSMiddleware

# The http_middlewares keyword is an assumption; check serve.start in your Serve version.
client = serve.start(
    http_middlewares=[
        Middleware(
            CORSMiddleware, allow_origins=["*"], allow_methods=["*"])
    ])

How do ServeHandle and ServeRequest work?

Ray Serve enables you to query models both from HTTP and Python. This feature enables seamless model composition. You can get a ServeHandle corresponding to an endpoint, similar to how you can reach an endpoint through HTTP via a specific route. When you issue a request to an endpoint through ServeHandle, the request goes through the same code path as an HTTP request would: choosing backends through traffic policies, finding the next available replica, and batching requests together.

When the request arrives in the model, you can access the data similarly to how you would with an HTTP request. Here are some examples of how ServeRequest mirrors Flask.Request:



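HTTP                                    ServeHandle                     Request (Flask.Request and ServeRequest)
requests.get(..., headers={...})                                        request.headers
requests.get(..., json={...})                                           request.json
requests.get(..., form={...})                                           request.form
requests.get(..., params={"a":"b"})     handle.remote(a="b")            request.args
requests.get(..., data="long string")   handle.remote("long string")    request.data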




You might have noticed that the last row of the table shows that ServeRequest supports passing Python objects through the handle. This is not possible over HTTP. If you need to tell whether a request originated from Python or from HTTP, you can do an isinstance check:

import flask

if isinstance(request, flask.Request):
    print("Request coming from web!")
elif isinstance(request, ServeRequest):
    print("Request coming from Python!")