Ray Serve (Experimental)

Ray Serve is a serving library that exposes Python functions and classes over HTTP. It has built-in support for flexible traffic policies, which means you can easily split incoming traffic across multiple implementations.

With Ray Serve, you can deploy your services at any scale.


Ray Serve is Python 3 only.


Full example of ray.serve module

import time

import requests

import ray
import ray.experimental.serve as serve
from ray.experimental.serve.utils import pformat_color_json

# Initialize the Ray Serve system.
# blocking=True waits for the HTTP server to be ready to serve requests.
serve.init(blocking=True)

# an endpoint is associated with an http URL.
serve.create_endpoint("my_endpoint", "/echo")

# A backend can be a function or a class.
# It can be invoked from the web as well as from Python.
def echo_v1(flask_request, response="hello from python!"):
    if serve.context.web:
        response = flask_request.url
    return response

serve.create_backend(echo_v1, "echo:v1")

# Link an endpoint to a backend. All traffic sent to my_endpoint
# now goes to the echo:v1 backend.
serve.link("my_endpoint", "echo:v1")

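# The service is now reachable over HTTP. The address below assumes the
# server is listening on localhost at the default port 8000.
print(requests.get("http://127.0.0.1:8000/echo").text)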

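# It is also reachable from within the Ray system through a handle.
print(ray.get(serve.get_handle("my_endpoint").remote(response="hello")))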

# We can also add a new backend and split the traffic.
def echo_v2(flask_request):
    # magic, only from web.
    return "something new"

serve.create_backend(echo_v2, "echo:v2")

# The two backends now split the traffic 50%-50%.
serve.split("my_endpoint", {"echo:v1": 0.5, "echo:v2": 0.5})

# Observe that requests are now split between the two backends.
for _ in range(10):
    print(requests.get("http://127.0.0.1:8000/echo").text)
    time.sleep(0.5)

# You can also scale each backend independently.
serve.scale("echo:v1", 2)
serve.scale("echo:v2", 2)

# You can also retrieve relevant system metrics.
print(pformat_color_json(serve.stat()))


ray.experimental.serve.init(kv_store_connector=None, kv_store_path='/tmp/ray_serve.db', blocking=False, http_host='', http_port=8000, ray_init_kwargs={'object_store_memory': 100000000}, gc_window_seconds=3600)

Initialize a serve cluster.

If the serve cluster has already been initialized, this function simply returns.

Calling ray.init before serve.init is optional. If no Ray cluster has been initialized, serve calls ray.init with the arguments in ray_init_kwargs, which by default set an object_store_memory requirement.

Parameters:
  • kv_store_connector (callable) – A function of (namespace) => TableObject. By default, a SQLite connector that stores data under /tmp is used.
  • kv_store_path (str, path) – Path to the SQLite table.
  • blocking (bool) – If true, the function waits for the HTTP server to be healthy and for the other components to be ready before returning.
  • http_host (str) – Host for the HTTP server. Defaults to "".
  • http_port (int) – Port for the HTTP server. Defaults to 8000.
  • ray_init_kwargs (dict) – Arguments passed to ray.init if there is no existing Ray connection. Defaults to {"object_store_memory": int(1e8)} for performance stability.
  • gc_window_seconds (int) – How long metric data is kept in memory. Data older than the window is deleted. The default is 3600 seconds (1 hour).
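
As a rough illustration of what gc_window_seconds controls, the sketch below keeps timestamped metric samples in memory and drops those older than the window. MetricBuffer and its methods are hypothetical names invented for this example; they are not part of ray.experimental.serve, and the real implementation may differ.

```python
import time
from collections import deque


class MetricBuffer:
    """Hypothetical in-memory metric store with a garbage-collection window."""

    def __init__(self, gc_window_seconds=3600):
        self.gc_window_seconds = gc_window_seconds
        self._samples = deque()  # (timestamp, value) pairs, oldest first

    def record(self, value, now=None):
        # `now` can be passed explicitly to make the example deterministic.
        now = time.time() if now is None else now
        self._samples.append((now, value))
        self._collect(now)

    def _collect(self, now):
        # Drop every sample that is older than the window.
        cutoff = now - self.gc_window_seconds
        while self._samples and self._samples[0][0] < cutoff:
            self._samples.popleft()

    def values(self):
        return [value for _, value in self._samples]


buffer = MetricBuffer(gc_window_seconds=3600)
buffer.record(0.12, now=0)
buffer.record(0.30, now=1800)
buffer.record(0.25, now=4000)  # the sample from t=0 is now outside the window
remaining = buffer.values()
```

A larger window keeps more history available for aggregation at the cost of more memory.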

ray.experimental.serve.create_backend(func_or_class, backend_tag, *actor_init_args)

Create a backend using func_or_class and assign backend_tag.

Parameters:
  • func_or_class (callable, class) – A function, or a class that implements the __call__ protocol.
  • backend_tag (str) – A unique tag assigned to this backend. It is used to associate services in a traffic policy.
  • *actor_init_args (optional) – The arguments to pass to the class initialization method.
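
The full example above only registers functions, so here is a minimal sketch of a class-based backend. Greeter and its greeting argument are hypothetical, and the __call__ signature is assumed to mirror echo_v1 (a request object first, then keyword arguments). The create_backend call is left as a comment so the snippet runs without a Ray cluster.

```python
class Greeter:
    """A class-based backend: instances are callable through __call__."""

    def __init__(self, greeting):
        # Assumed to receive the *actor_init_args given to create_backend.
        self.greeting = greeting

    def __call__(self, flask_request, name="world"):
        return "{}, {}!".format(self.greeting, name)


# With a running serve cluster, registration would follow the documented
# signature create_backend(func_or_class, backend_tag, *actor_init_args):
# serve.create_backend(Greeter, "greeter:v1", "Hello")

# Local check of the callable protocol, no cluster needed.
backend = Greeter("Hello")
result = backend(None, name="serve")
```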

ray.experimental.serve.create_endpoint(endpoint_name, route, blocking=True)

Create a service endpoint for the given route.

Parameters:
  • endpoint_name (str) – A name to associate with the endpoint. It is used as the key when setting a traffic policy.
  • route (str) – A string beginning with "/". The HTTP server uses this string to match the request path.
  • blocking (bool) – If true, the function waits for the service to be registered before returning.
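
To make the relationship between route and endpoint_name concrete, the following sketch models a route registry. RouteTable is a hypothetical helper, and exact-path matching is an assumption made for illustration; the source only says that the HTTP server uses the route string to match the path.

```python
class RouteTable:
    """Hypothetical route registry; exact-path matching is an assumption."""

    def __init__(self):
        self._routes = {}

    def register(self, endpoint_name, route):
        # Routes must begin with "/", as required by create_endpoint.
        if not route.startswith("/"):
            raise ValueError("route must begin with '/': {!r}".format(route))
        self._routes[route] = endpoint_name

    def match(self, path):
        # Returns the endpoint name, or None when no route matches.
        return self._routes.get(path)


table = RouteTable()
table.register("my_endpoint", "/echo")
matched = table.match("/echo")
unmatched = table.match("/missing")
```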

ray.experimental.serve.link(endpoint_name, backend_tag)

Associate a service endpoint with a backend tag.


>>> serve.link("service-name", "backend:v1")

Note: This is equivalent to

>>> serve.split("service-name", {"backend:v1": 1.0})

ray.experimental.serve.split(endpoint_name, traffic_policy_dictionary)

Associate a service endpoint with a traffic policy.


>>> serve.split("service-name", {
    "backend:v1": 0.5,
    "backend:v2": 0.5
})

Parameters:
  • endpoint_name (str) – A registered service endpoint.
  • traffic_policy_dictionary (dict) – A dictionary that maps backend names to their traffic weights. The weights must sum to 1.
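
The sketch below shows one way a traffic policy of this shape could be validated and applied: each request is routed to a backend with probability equal to its weight. validate_policy and choose_backend are hypothetical helpers written for this illustration; the source does not describe how Ray Serve selects a backend internally.

```python
import random
from collections import Counter


def validate_policy(traffic_policy):
    """Check that the weights sum to 1, within floating-point tolerance."""
    total = sum(traffic_policy.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1, got {}".format(total))


def choose_backend(traffic_policy, rng=random):
    """Pick one backend tag with probability equal to its weight."""
    tags = list(traffic_policy)
    weights = [traffic_policy[tag] for tag in tags]
    return rng.choices(tags, weights=weights, k=1)[0]


policy = {"backend:v1": 0.5, "backend:v2": 0.5}
validate_policy(policy)

rng = random.Random(0)  # seeded so repeated runs give the same counts
counts = Counter(choose_backend(policy, rng) for _ in range(1000))

# A policy with a single weight of 1.0 corresponds to what link() sets up.
always_v1 = choose_backend({"backend:v1": 1.0}, rng)
```

With equal weights, the two counts should each land near 500 out of 1000 requests.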

ray.experimental.serve.get_handle(endpoint_name)

Retrieve the RayServeHandle for a service endpoint so that it can be invoked from Python.

Parameters: endpoint_name (str) – A registered service endpoint.

ray.experimental.serve.stat(percentiles=[50, 90, 95], agg_windows_seconds=[10, 60, 300, 600, 3600])

Retrieve metric statistics about the Ray Serve system.

Parameters:
  • percentiles (List[int]) – The percentiles for aggregation operations. The defaults are the 50th, 90th, and 95th percentiles.
  • agg_windows_seconds (List[int]) – The aggregation windows in seconds. The longest aggregation window must be shorter than or equal to gc_window_seconds.
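
To illustrate how percentiles and agg_windows_seconds interact, here is a minimal sketch that computes nearest-rank percentiles over several trailing windows and rejects a window longer than the garbage-collection window. The percentile and aggregate helpers, the sample layout, and the shape of the returned dictionary are all assumptions made for this example; the source does not specify the output format of serve.stat.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list (p between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def aggregate(samples, now, percentiles, agg_windows_seconds,
              gc_window_seconds=3600):
    """samples is a list of (timestamp, value). Returns {window: {p: value}}."""
    if max(agg_windows_seconds) > gc_window_seconds:
        raise ValueError("longest aggregation window exceeds gc_window_seconds")
    result = {}
    for window in agg_windows_seconds:
        # Keep only the samples that fall inside this trailing window.
        recent = [v for t, v in samples if now - t <= window]
        result[window] = (
            {p: percentile(recent, p) for p in percentiles} if recent else {}
        )
    return result


# One sample per second; each value equals its timestamp for easy checking.
samples = [(t, float(t)) for t in range(1, 101)]
stats = aggregate(samples, now=100, percentiles=[50, 90, 95],
                  agg_windows_seconds=[10, 60])
```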

ray.experimental.serve.scale(backend_tag, num_replicas)

Set the number of replicas for backend_tag.

Parameters:
  • backend_tag (str) – A registered backend.
  • num_replicas (int) – Desired number of replicas.