Ray Serve

Ray Serve is a serving library that exposes Python functions and classes to HTTP. It has built-in support for flexible traffic policies, which means you can easily split incoming traffic across multiple implementations.


Ray Serve is under development and its API may be revised in future Ray releases. If you encounter any bugs, please file an issue on GitHub.

With Ray Serve, you can deploy your services at any scale.


Full example of ray.serve module

import time

import requests

import ray
import ray.serve as serve
from ray.serve.utils import pformat_color_json

# initialize ray serve system.
# blocking=True will wait for HTTP server to be ready to serve request.
serve.init(blocking=True)

# an endpoint is associated with an http URL.
serve.create_endpoint("my_endpoint", "/echo")

# a backend can be a function or class.
# it can be made to be invoked from web as well as python.
def echo_v1(flask_request, response="hello from python!"):
    if serve.context.web:
        response = flask_request.url
    return response

serve.create_backend(echo_v1, "echo:v1")
backend_config_v1 = serve.get_backend_config("echo:v1")

# We can link an endpoint to a backend, the means all the traffic
# goes to my_endpoint will now goes to echo:v1 backend.
serve.link("my_endpoint", "echo:v1")

# The service will be reachable from http.
print(requests.get("http://127.0.0.1:8000/echo", timeout=0.5).text)


# as well as from within the ray system, via a handle.
print(ray.get(serve.get_handle("my_endpoint").remote(response="hello")))

# We can also add a new backend and split the traffic.
def echo_v2(flask_request):
    # magic, only from web.
    return "something new"

serve.create_backend(echo_v2, "echo:v2")
backend_config_v2 = serve.get_backend_config("echo:v2")

# The two backends will now split the traffic 50%-50%.
serve.split("my_endpoint", {"echo:v1": 0.5, "echo:v2": 0.5})

# Observe requests are now split between two backends.
for _ in range(10):
    print(requests.get("http://127.0.0.1:8000/echo", timeout=0.5).text)
    time.sleep(0.2)

# You can also change number of replicas
# for each backend independently.
backend_config_v1.num_replicas = 2
serve.set_backend_config("echo:v1", backend_config_v1)
backend_config_v2.num_replicas = 2
serve.set_backend_config("echo:v2", backend_config_v2)

# As well as retrieving relevant system metrics.
print(serve.stat())


class ray.serve.RoutePolicy[source]

A class for registering the backend selection policy. Add a name and the corresponding class; Serve will then support the added policy, which can be selected by name through the queueing_policy argument of the serve.init method.

ray.serve.init(kv_store_connector=None, kv_store_path=None, blocking=False, start_server=True, http_host='', http_port=8000, ray_init_kwargs={'num_cpus': 8, 'object_store_memory': 100000000}, gc_window_seconds=3600, queueing_policy=<RoutePolicy.Random: <ray.serve.policy.ActorClass(RandomPolicyQueueActor) object>>, policy_kwargs={})[source]

Initialize a serve cluster.

If the serve cluster has already been initialized, this function will just return.

Calling ray.init before serve.init is optional. When there is no Ray cluster initialized, serve will call ray.init with the object_store_memory requirement.

  • kv_store_connector (callable) – Function of (namespace) => TableObject. We will use a SQLite connector that stores to /tmp by default.

  • kv_store_path (str, path) – Path to the SQLite table.

  • blocking (bool) – If true, the function will wait for the HTTP server to be healthy and other components to be ready before returning.

  • start_server (bool) – If true, serve.init starts http server. (Default: True)

  • http_host (str) – Host for HTTP server. Default to “”.

  • http_port (int) – Port for HTTP server. Default to 8000.

  • ray_init_kwargs (dict) – Arguments passed to ray.init if there is no existing Ray connection. Defaults to {"object_store_memory": int(1e8)} for performance stability reasons.

  • gc_window_seconds (int) – How long will we keep the metric data in memory. Data older than the gc_window will be deleted. The default is 3600 seconds, which is 1 hour.

  • queueing_policy (RoutePolicy) – Define the queueing policy for selecting the backend for a service. (Default: RoutePolicy.Random)

  • policy_kwargs – Arguments required to instantiate a queueing policy
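Putting the parameters together, initialization might look like the following sketch. This is a configuration example only, not a canonical setup: blocking=True overrides the default to wait for readiness, and the remaining values simply restate the documented defaults.

```python
import ray.serve as serve

# blocking=True waits for the HTTP server and other components to be
# ready before returning; the other keyword values shown here mirror
# the defaults from the parameter list above.
serve.init(
    blocking=True,
    http_host="",
    http_port=8000,
    gc_window_seconds=3600,
    queueing_policy=serve.RoutePolicy.Random,
)
```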

ray.serve.create_backend(func_or_class, backend_tag, *actor_init_args, backend_config=None)[source]

Create a backend using func_or_class and assign backend_tag.

  • func_or_class (callable, class) – a function or a class implementing the __call__ protocol.

  • backend_tag (str) – a unique tag assigned to this backend. It will be used to associate this backend with endpoints in the traffic policy.

  • backend_config (BackendConfig) – An object defining backend properties for starting a backend.

  • *actor_init_args (optional) – the argument to pass to the class initialization method.

ray.serve.create_endpoint(endpoint_name, route=None, methods=['GET'])[source]

Create a service endpoint given a route expression.

  • endpoint_name (str) – A name to associate to the endpoint. It will be used as key to set traffic policy.

  • route (str) – A string beginning with "/". The HTTP server will use the string to match the request path.

  • blocking (bool) – If true, the function will wait for service to be registered before returning

ray.serve.link(endpoint_name, backend_tag)[source]

Associate a service endpoint with a backend tag. All traffic going to the endpoint will go to that backend.


>>> serve.link("service-name", "backend:v1")

Note: This is equivalent to

>>> serve.split("service-name", {"backend:v1": 1.0})

ray.serve.split(endpoint_name, traffic_policy_dictionary)[source]

Associate a service endpoint with traffic policy.


>>> serve.split("service-name", {
        "backend:v1": 0.5,
        "backend:v2": 0.5,
    })

  • endpoint_name (str) – A registered service endpoint.

  • traffic_policy_dictionary (dict) – a dictionary mapping backend names to their traffic weights. The weights must sum to 1.
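The weight semantics can be modeled in plain Python. The helpers below are purely illustrative (validate_traffic_policy and pick_backend are not part of the serve API); they sketch the weight check serve.split implies and, conceptually, the per-request weighted choice the default random routing policy makes:

```python
import random

def validate_traffic_policy(traffic_policy_dictionary):
    # serve.split requires the backend weights to sum to 1.
    total = sum(traffic_policy_dictionary.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1, got {}".format(total))

def pick_backend(traffic_policy_dictionary):
    # Weighted random choice over backends: conceptually what the
    # default RoutePolicy.Random queueing policy does per request.
    backends = list(traffic_policy_dictionary)
    weights = [traffic_policy_dictionary[b] for b in backends]
    return random.choices(backends, weights=weights, k=1)[0]

policy = {"backend:v1": 0.5, "backend:v2": 0.5}
validate_traffic_policy(policy)
chosen = pick_backend(policy)  # either backend, with equal probability
```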

ray.serve.get_handle(endpoint_name, relative_slo_ms=None, absolute_slo_ms=None, missing_ok=False)[source]

Retrieve a RayServeHandle for a service endpoint, to invoke it from Python.

  • endpoint_name (str) – A registered service endpoint.

  • relative_slo_ms (float) – Specify relative deadline in milliseconds for queries fired using this handle. (Default: None)

  • absolute_slo_ms (float) – Specify absolute deadline in milliseconds for queries fired using this handle. (Default: None)

  • missing_ok (bool) – If true, skip the check for the endpoint's existence. This can be useful when the endpoint has not been registered yet.
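The two SLO parameters can be reduced to a single absolute deadline. The helper below is a hypothetical sketch mirroring the parameter semantics above, not serve internals, and it assumes the two styles are mutually exclusive:

```python
import time

def deadline_from_slo(relative_slo_ms=None, absolute_slo_ms=None, now_ms=None):
    # Hypothetical helper: turn either SLO style into one absolute
    # deadline in milliseconds. Assumes at most one style is given.
    if relative_slo_ms is not None and absolute_slo_ms is not None:
        raise ValueError("specify at most one of relative/absolute SLO")
    if now_ms is None:
        now_ms = time.time() * 1000.0
    if relative_slo_ms is not None:
        return now_ms + relative_slo_ms
    return absolute_slo_ms  # may be None, meaning no deadline

# A query fired at t=0 with a 100 ms relative SLO:
deadline = deadline_from_slo(relative_slo_ms=100.0, now_ms=0.0)  # -> 100.0
```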



ray.serve.stat(percentiles=[50, 90, 95], agg_windows_seconds=[10, 60, 300, 600, 3600])[source]

Retrieve metric statistics about the ray serve system.

  • percentiles (List[int]) – The percentiles for aggregation operations. Default is 50th, 90th, 95th percentile.

  • agg_windows_seconds (List[int]) – The aggregation windows in seconds. The longest aggregation window must be shorter than or equal to gc_window_seconds.
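To illustrate what a percentile over an aggregation window means, here is a small nearest-rank percentile helper. It is an illustrative stand-in, not serve's actual metric aggregation code:

```python
import math

def percentile(samples, p):
    # Nearest-rank percentile over one aggregation window.
    if not samples:
        raise ValueError("no samples in window")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[rank - 1]

# Latencies (ms) observed in a 10-second window:
window = [12, 15, 11, 40, 13, 90, 14, 16, 12, 18]
p50, p90, p95 = (percentile(window, p) for p in (50, 90, 95))
```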

ray.serve.set_backend_config(backend_tag, backend_config)[source]

Set a backend configuration for a backend tag.

  • backend_tag (str) – A registered backend.

  • backend_config (BackendConfig) – Desired backend configuration.


ray.serve.get_backend_config(backend_tag)[source]

Get the backend configuration for a backend tag.


backend_tag (str) – A registered backend.


ray.serve.accept_batch(f)[source]

Annotation to mark a serving function as batch-accepting.

This annotation marks a function that expects all of its arguments to be passed in as a list.


>>> @serve.accept_batch
    def serving_func(flask_request):
        assert isinstance(flask_request, list)
>>> @serve.accept_batch
    class ServingActor:
        def __call__(self, *, python_arg=None):
            assert isinstance(python_arg, list)
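The contract accept_batch implies can be shown without serve at all: the wrapped callable receives its arguments as lists and is expected to return one result per batched request. serving_func below is an illustrative stand-in, not part of the serve API:

```python
def serving_func(requests_batch):
    # Under a batch-accepting backend, every argument arrives as a
    # list; return a list with one result per batched request.
    assert isinstance(requests_batch, list)
    return [len(request) for request in requests_batch]

results = serving_func(["a", "bb", "ccc"])  # one result per request
```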
class ray.serve.route(url_route)[source]

Convenient method to create a backend and link it to a service.

When called, the following will happen:

  • An endpoint is created with the same name as the function

  • A backend is created that instantiates the function

  • The endpoint and backend are linked together

  • The handle is returned

>>> @serve.route("/path")
    def my_handler(flask_request):
        ...