Core APIs

Deploying a Backend

Backends define the implementation of your business logic or models that will handle incoming requests. To support seamless scalability, a backend can have many replicas, which are individual processes running in the Ray cluster to handle requests. To define a backend, you must first define the “handler”, that is, the business logic you’d like to respond with. The handler should take a Starlette Request object as input and return any JSON-serializable object as output. For a more customizable response type, the handler may return a Starlette Response object.

A backend is defined using client.create_backend, and the implementation can be defined as either a function or a class. Use a function when your response is stateless and a class when you might need to maintain some state (like a model). When using a class, you can specify arguments to be passed to the constructor in client.create_backend, shown below.

A backend consists of a number of replicas, which are individual copies of the function or class that are started in separate Ray Workers (processes).

def handle_request(starlette_request):
    return "hello world"

class RequestHandler:
    # Take the message to return as an argument to the constructor.
    def __init__(self, msg):
        self.msg = msg

    def __call__(self, starlette_request):
        return self.msg

client.create_backend("simple_backend", handle_request)
# Pass in the message that the backend will return as an argument.
# If we call this backend, it will respond with "hello, world!".
client.create_backend("simple_backend_class", RequestHandler, "hello, world!")
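
Handlers are not limited to constant responses. As a minimal sketch, the function below reads a query parameter from the incoming request and returns a dictionary, which is JSON-serializable. The handler name `greet`, the `name` parameter, and the backend name in the closing comment are hypothetical; `query_params` is the Starlette Request attribute that exposes the parsed query string.

```python
def greet(starlette_request):
    # query_params behaves like a read-only mapping of the URL query string.
    name = starlette_request.query_params.get("name", "world")
    # A dict of strings is JSON-serializable, so it can be returned directly.
    return {"greeting": f"hello {name}"}

# Hypothetical registration, using the same call as the examples above:
# client.create_backend("greeting_backend", greet)
```

Because the handler only relies on `query_params`, it can be exercised in a unit test with a simple stand-in object before it is deployed.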

We can also list all available backends and delete them to reclaim resources. Note that a backend cannot be deleted while it is in use by an endpoint, because traffic to that endpoint could then no longer be handled.

>>> client.list_backends()
{
    'simple_backend': {'accepts_batches': False, 'num_replicas': 1, 'max_batch_size': None},
    'simple_backend_class': {'accepts_batches': False, 'num_replicas': 1, 'max_batch_size': None},
}
>>> client.delete_backend("simple_backend")
>>> client.list_backends()
{
    'simple_backend_class': {'accepts_batches': False, 'num_replicas': 1, 'max_batch_size': None},
}

Exposing a Backend

While backends define the implementation of your request handling logic, endpoints allow you to expose them via HTTP. Endpoints are “logical” and can have one or multiple backends that serve requests to them. To create an endpoint, specify a name for the endpoint, the name of a backend to handle requests to the endpoint, and the route and methods where it will be accessible. By default, endpoints are serviced only by the backend provided to client.create_endpoint, but in some cases you may want to specify multiple backends for an endpoint, e.g., for A/B testing or incremental rollout. For information on how to do this, please see Splitting Traffic.

client.create_endpoint("simple_endpoint", backend="simple_backend", route="/simple", methods=["GET"])

Once created, the endpoint is exposed by the HTTP server and handles requests using the specified backend. We can query the endpoint to verify that it’s working.

import requests
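
A minimal sketch of such a query follows. It assumes the Serve HTTP server is reachable at 127.0.0.1 on port 8000; that address is an assumption, so substitute the host and port of your own deployment. The helper names `endpoint_url` and `query_endpoint` are hypothetical, and the sketch uses the standard library's urllib so that it has no extra dependencies; `requests.get(url).text` would serve the same purpose.

```python
from urllib.request import urlopen

def endpoint_url(route, host="127.0.0.1", port=8000):
    # Assumed address of the Serve HTTP server; adjust for your deployment.
    return f"http://{host}:{port}{route}"

def query_endpoint(route):
    # Send a GET request and return the response body as text.
    with urlopen(endpoint_url(route)) as response:
        return response.read().decode("utf-8")

# With Serve running and the endpoint created as above:
# print(query_endpoint("/simple"))
```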

We can also query the endpoint using the ServeHandle interface.

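import ray

handle = client.get_handle("simple_endpoint")
# handle.remote() returns a Ray ObjectRef; ray.get blocks until the result is ready.
print(ray.get(handle.remote()))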

To view all of the existing endpoints that have been created, use client.list_endpoints.

>>> client.list_endpoints()
{'simple_endpoint': {'route': '/simple', 'methods': ['GET'], 'traffic': {}}}

You can also delete an endpoint using client.delete_endpoint. Endpoints and backends are independent, so deleting an endpoint will not delete its backends. However, an endpoint must be deleted in order to delete the backends that serve its traffic.


Configuring a Backend

There are a number of things you’ll likely want to do with your serving application including scaling out, splitting traffic, or batching input for better performance. To do all of this, you will create a BackendConfig, a configuration object that you’ll use to set the properties of a particular backend.

The BackendConfig for a running backend can be updated using client.update_backend_config.

Scaling Out

To scale out a backend to many processes, simply configure the number of replicas.

config = {"num_replicas": 10}
client.create_backend("my_scaled_endpoint_backend", handle_request, config=config)

# scale it back down...
config = {"num_replicas": 2}
client.update_backend_config("my_scaled_endpoint_backend", config)

This will scale up or down the number of replicas that can accept requests.

Resource Management (CPUs, GPUs)

To assign hardware resources per replica, pass resource requirements to ray_actor_options. By default, each replica requires one CPU. To learn about the options you can pass in, take a look at the Resources with Actor guide.

For example, to create a backend where each replica uses a single GPU, you can do the following:

config = {"num_gpus": 1}
client.create_backend("my_gpu_backend", handle_request, ray_actor_options=config)

Fractional Resources

The resources specified in ray_actor_options can also be fractional. This allows you to flexibly share resources between replicas. For example, if you have two models and each doesn’t fully saturate a GPU, you might want to have them share a GPU by allocating 0.5 GPUs each. The same could be done to multiplex over CPUs.

half_gpu_config = {"num_gpus": 0.5}
client.create_backend("my_gpu_backend_1", handle_request, ray_actor_options=half_gpu_config)
client.create_backend("my_gpu_backend_2", handle_request, ray_actor_options=half_gpu_config)

Configuring Parallelism with OMP_NUM_THREADS

Deep learning frameworks like PyTorch and TensorFlow often use multithreading when performing inference. The number of CPUs they use is controlled by the OMP_NUM_THREADS environment variable. To avoid contention, Ray sets OMP_NUM_THREADS=1 by default because Ray workers and actors use a single CPU by default. If you do want to enable this parallelism in your Serve backend, set OMP_NUM_THREADS to the desired value either when starting Ray or in your function/class definition:

OMP_NUM_THREADS=12 ray start --head
OMP_NUM_THREADS=12 ray start --address=$HEAD_NODE_ADDRESS

import os

class MyBackend:
    def __init__(self, parallelism):
        # Environment variables must be strings, so convert the integer.
        os.environ["OMP_NUM_THREADS"] = str(parallelism)
        # Download model weights, initialize model, etc.

client.create_backend("parallel_backend", MyBackend, 12)
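
Environment variables are strings, so an integer thread count has to be converted before it is assigned to os.environ. The helper below is a sketch of that step; `set_omp_threads` is a hypothetical name, and the setting typically needs to be in place before the numerical library initializes its thread pool.

```python
import os

def set_omp_threads(num_threads):
    # os.environ only accepts strings, so convert the integer explicitly.
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")
    os.environ["OMP_NUM_THREADS"] = str(num_threads)
    return os.environ["OMP_NUM_THREADS"]

print(set_omp_threads(12))  # 12
```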


Some other libraries may not respect OMP_NUM_THREADS and have their own way to configure parallelism. For example, if you’re using OpenCV, you’ll need to manually set the number of threads using cv2.setNumThreads(num_threads) (set to 0 to disable multi-threading). You can check the configuration using cv2.getNumThreads() and cv2.getNumberOfCPUs().

Batching to Improve Performance

You can also have Ray Serve batch requests for performance. To use this feature, you need to:
1. Set the max_batch_size in the config dictionary.
2. Modify your backend implementation to accept a list of requests and return a list of responses instead of handling a single request.

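class BatchingExample:
    def __init__(self):
        self.count = 0

    # Depending on your Ray Serve version, a batched __call__ may also need
    # to be marked explicitly (for example with a decorator such as
    # serve.accept_batch); check the API reference for your release.
    def __call__(self, requests):
        responses = []
        for request in requests:
            # Illustrative body: count each request and return the running total.
            self.count += 1
            responses.append(self.count)
        return responses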

config = {"max_batch_size": 5}
client.create_backend("counter1", BatchingExample, config=config)
client.create_endpoint("counter1", backend="counter1", route="/increment")

Please take a look at the Batching Tutorial for a deep dive.
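
The contract of a batched handler can be illustrated without Ray. In the sketch below, `make_batches` and `echo_batch` are hypothetical names; `make_batches` only imitates the effect of max_batch_size by grouping queued items, and it is not how Ray Serve forms batches internally.

```python
def make_batches(items, max_batch_size):
    # Group items into consecutive lists of at most max_batch_size elements.
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    return [items[i:i + max_batch_size]
            for i in range(0, len(items), max_batch_size)]

def echo_batch(requests):
    # A batched handler receives a list and returns one response per
    # request, in the same order.
    return [f"processed {request}" for request in requests]

batches = make_batches(list(range(7)), max_batch_size=5)
print([len(batch) for batch in batches])  # [5, 2]
```

The key property is that the handler's output list has the same length and order as its input list, so that each response can be matched back to its request.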

User Configuration (Experimental)

Suppose you want to update a parameter in your model without creating a whole new backend. You can do this by writing a reconfigure method for the class underlying your backend. At runtime, you can then pass in your new parameters by setting the user_config field of BackendConfig.

The following simple example will make the usage clear:

import requests
import random

import ray
from ray import serve
from ray.serve import BackendConfig

client = serve.start()

class Threshold:
    def __init__(self):
        # self.model won't be changed by reconfigure.
        self.model = random.Random()  # Imagine this is some heavyweight model.

    def reconfigure(self, config):
        # This will be called when the class is created and when
        # the user_config field of BackendConfig is updated.
        self.threshold = config["threshold"]

    def __call__(self, request):
        return self.model.random() > self.threshold

backend_config = BackendConfig(user_config={"threshold": 0.01})
client.create_backend("threshold", Threshold, config=backend_config)
client.create_endpoint("threshold", backend="threshold", route="/threshold")
print(requests.get("").text)  # true, probably

backend_config = BackendConfig(user_config={"threshold": 0.99})
client.update_backend_config("threshold", backend_config)
print(requests.get("").text)  # false, probably

The reconfigure method is called when the class is created if user_config is set. In particular, it’s also called when new replicas are created in the future if you scale up your backend later. The reconfigure method is also called each time user_config is updated via client.update_backend_config.
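
The call sequence described above can be imitated in plain Python. This sketch is a toy stand-in, not Ray Serve: `ToyReplica` is a hypothetical class that calls reconfigure once at creation (when a user_config is given) and again on each update, and the `reconfigure_calls` counter exists only to make the sequence visible.

```python
class Threshold:
    def __init__(self):
        self.reconfigure_calls = 0

    def reconfigure(self, config):
        self.reconfigure_calls += 1
        self.threshold = config["threshold"]

class ToyReplica:
    # Hypothetical stand-in for a Serve replica.
    def __init__(self, cls, user_config=None):
        self.instance = cls()
        if user_config is not None:
            # Called on creation when user_config is set.
            self.instance.reconfigure(user_config)

    def update_user_config(self, user_config):
        # Called again each time user_config is updated.
        self.instance.reconfigure(user_config)

replica = ToyReplica(Threshold, user_config={"threshold": 0.01})
replica.update_user_config({"threshold": 0.99})
print(replica.instance.threshold, replica.instance.reconfigure_calls)  # 0.99 2
```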

Dependency Management

Ray Serve supports serving backends with different (possibly conflicting) Python dependencies. For example, you can simultaneously serve one backend that uses legacy TensorFlow 1 and another backend that uses TensorFlow 2.

Currently this is supported using conda. You must have a conda environment set up for each set of dependencies you want to isolate. If using a multi-node cluster, the conda configuration must be identical across all nodes.

Here’s an example script. For it to work, first create a conda environment named ray-tf1 with Ray Serve and TensorFlow 1 installed, and another named ray-tf2 with Ray Serve and TensorFlow 2. The Ray and Python versions must be the same in both environments. To specify an environment for a backend to use, pass the environment name to client.create_backend as shown below.

import requests
from ray import serve
from ray.serve import CondaEnv
import tensorflow as tf

client = serve.start()

def tf_version(request):
    return ("Tensorflow " + tf.__version__)

client.create_backend("tf1", tf_version, env=CondaEnv("ray-tf1"))
client.create_endpoint("tf1", backend="tf1", route="/tf1")
client.create_backend("tf2", tf_version, env=CondaEnv("ray-tf2"))
client.create_endpoint("tf2", backend="tf2", route="/tf2")

print(requests.get("").text)  # Tensorflow 1.15.0
print(requests.get("").text)  # Tensorflow 2.3.0


If the argument env is omitted, backends will be started in the same conda environment as the caller of client.create_backend by default.

The dependencies required in the backend may be different than the dependencies installed in the driver program (the one running Serve API calls). In this case, you can pass the backend in as an import path that will be imported in the Python environment in the workers, but not the driver. Example:

import requests

from ray import serve

client = serve.start()

# Pass the import path of your class as a string instead of the class itself.
import_path = "ray.serve.utils.MockImportedBackend"
client.create_backend("imported", import_path, "input_arg")
client.create_endpoint("imported", backend="imported", route="/imported")
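
An import path is a dotted string naming a module followed by an attribute of that module. The sketch below shows one common way to resolve such a string with the standard library; `resolve_import_path` is a hypothetical helper and is not necessarily how Ray Serve performs the import in its workers.

```python
import importlib

def resolve_import_path(import_path):
    # Split "package.module.Name" into the module path and the attribute name.
    module_name, _, attr_name = import_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.attribute', got {import_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)

ordered_dict_cls = resolve_import_path("collections.OrderedDict")
print(ordered_dict_cls.__name__)  # OrderedDict
```

Because the string is only resolved where the function runs, the process that passes it along does not need the target module installed.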