Ray Serve (Experimental)

Ray Serve is a serving library that exposes Python functions and classes over HTTP. It has built-in support for flexible traffic policies, so you can easily split incoming traffic across multiple implementations.

With Ray Serve, you can deploy your services at any scale.


Ray Serve is Python 3 only.


Full example of the ray.experimental.serve module

import ray
import ray.experimental.serve as serve
from ray.experimental.serve.utils import pformat_color_json
import requests
import time

# Initialize the Ray Serve system.
# blocking=True waits for the HTTP server to be ready to serve requests.
serve.init(blocking=True)

# an endpoint is associated with an http URL.
serve.create_endpoint("my_endpoint", "/echo")

# a backend can be a function or class.
# it can be made to be invoked from web as well as python.
def echo_v1(flask_request, response="hello from python!"):
    if serve.context.web:
        response = flask_request.url
    return response

serve.create_backend(echo_v1, "echo:v1")

# Link the endpoint to a backend: all traffic sent to my_endpoint
# now goes to the echo:v1 backend.
serve.link("my_endpoint", "echo:v1")

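# The service is reachable over HTTP.
# (The address is an assumption: it presumes the HTTP server listens on
# 127.0.0.1:8000; adjust it to your deployment.)
print(requests.get("http://127.0.0.1:8000/echo").text)

# It is also reachable from within the Ray system through a handle.
print(ray.get(serve.get_handle("my_endpoint").remote(response="hello")))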

# We can also add a new backend and split the traffic.
def echo_v2(flask_request):
    # magic, only from web.
    return "something new"

serve.create_backend(echo_v2, "echo:v2")

# The two backends now split the traffic 50/50.
serve.split("my_endpoint", {"echo:v1": 0.5, "echo:v2": 0.5})

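# Observe that requests are now split between the two backends.
# (The address is an assumption; adjust it to your deployment.)
for _ in range(10):
    print(requests.get("http://127.0.0.1:8000/echo").text)
    time.sleep(0.5)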

# You can also scale each backend independently.
serve.scale("echo:v1", 2)
serve.scale("echo:v2", 2)

# You can also retrieve relevant system metrics.
print(pformat_color_json(serve.stat()))


ray.experimental.serve.init(blocking=False, object_store_memory=100000000, gc_window_seconds=3600)

Initialize a serve cluster.

Calling ray.init before serve.init is optional. If no Ray cluster has been initialized, serve calls ray.init itself and passes along the object_store_memory requirement.

  • blocking (bool) – If true, the function waits for the HTTP server to be healthy and the other components to be ready before it returns.
  • object_store_memory (int) – Allocated shared memory size in bytes. The default is 100000000 bytes (100 MB), kept low for latency stability.
  • gc_window_seconds (int) – How long metric data is kept in memory. Data older than this window is deleted. The default is 3600 seconds (1 hour).

ray.experimental.serve.create_backend(func_or_class, backend_tag, *actor_init_args)

Create a backend using func_or_class and assign backend_tag.

  • func_or_class (callable, class) – A function, or a class that implements the __call__ protocol.
  • backend_tag (str) – A unique tag assigned to this backend. It is used to refer to the backend in traffic policies.
  • *actor_init_args (optional) – The arguments to pass to the class initialization method.
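
As an illustration of the class form, here is a minimal sketch of a callable class whose constructor takes one argument. The class name Greeter, its greeting parameter, and the "greeter:v1" tag are hypothetical, and the local instantiation only stands in for whatever Ray Serve does with *actor_init_args internally.

```python
class Greeter:
    """Hypothetical class-based backend: configured once, then called per request."""

    def __init__(self, greeting):
        # Receives the values that would be passed as *actor_init_args.
        self.greeting = greeting

    def __call__(self, flask_request, name="world"):
        # flask_request is unused here, so the object also works without a web request.
        return "{}, {}!".format(self.greeting, name)


# Local stand-in for constructing the class with its init arguments.
init_args = ("hello",)
backend = Greeter(*init_args)
result = backend(None, name="ray")
print(result)

# With Ray Serve running, registration would look like this (not executed here):
# serve.create_backend(Greeter, "greeter:v1", "hello")
```

Keeping configuration in __init__ and per-request logic in __call__ lets the same class be registered several times under different tags with different arguments.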

ray.experimental.serve.create_endpoint(endpoint_name, route_expression, blocking=True)

Create a service endpoint given route_expression.

  • endpoint_name (str) – A name for the endpoint. It is used as the key when setting a traffic policy.
  • route_expression (str) – A string beginning with “/”. The HTTP server uses this string to match the request path.
  • blocking (bool) – If true, the function waits for the service to be registered before returning.

ray.experimental.serve.link(endpoint_name, backend_tag)

Associate a service endpoint with a backend tag.


>>> serve.link("service-name", "backend:v1")

Note: This is equivalent to

>>> serve.split("service-name", {"backend:v1": 1.0})

ray.experimental.serve.split(endpoint_name, traffic_policy_dictionary)

Associate a service endpoint with a traffic policy.


>>> serve.split("service-name", {
    "backend:v1": 0.5,
    "backend:v2": 0.5
})

  • endpoint_name (str) – A registered service endpoint.
  • traffic_policy_dictionary (dict) – A dictionary mapping backend tags to their traffic weights. The weights must sum to 1.
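
To make the weight semantics concrete, the following sketch validates a policy dictionary and routes simulated requests by weighted random choice. It is a standalone simulation, not Ray Serve's router: validate_policy and pick_backend are hypothetical helpers, and weighted random selection is an assumed strategy.

```python
import math
import random


def validate_policy(policy):
    """Check that weights are non-negative and sum to 1 (within float tolerance)."""
    if not policy:
        raise ValueError("traffic policy must not be empty")
    if any(weight < 0 for weight in policy.values()):
        raise ValueError("weights must be non-negative")
    if not math.isclose(sum(policy.values()), 1.0, abs_tol=1e-8):
        raise ValueError("weights must sum to 1")


def pick_backend(policy, rng):
    """Choose one backend tag with probability equal to its weight."""
    tags = list(policy)
    weights = [policy[tag] for tag in tags]
    return rng.choices(tags, weights=weights, k=1)[0]


policy = {"backend:v1": 0.5, "backend:v2": 0.5}
validate_policy(policy)

rng = random.Random(0)  # seeded so the run is repeatable
counts = {tag: 0 for tag in policy}
for _ in range(10000):
    counts[pick_backend(policy, rng)] += 1
print(counts)
```

With equal weights, each backend should receive roughly half of the simulated requests. A policy of {"backend:v1": 1.0} sends everything to one backend, which matches the stated equivalence between link and split.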

ray.experimental.serve.rollback(endpoint_name)

Roll back a traffic policy decision.

  • endpoint_name (str) – A registered service endpoint.
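
One way to picture a rollback is as a per-endpoint history of policies in which the most recent entry is discarded and the previous one becomes active again. The sketch below assumes that reading; the PolicyHistory class is hypothetical and says nothing definitive about how Ray Serve stores its policies.

```python
class PolicyHistory:
    """Hypothetical per-endpoint record of traffic policies, newest last."""

    def __init__(self):
        self._history = {}

    def split(self, endpoint_name, policy):
        # Store a copy so later changes to the caller's dict have no effect.
        self._history.setdefault(endpoint_name, []).append(dict(policy))

    def current(self, endpoint_name):
        return self._history[endpoint_name][-1]

    def rollback(self, endpoint_name):
        policies = self._history.get(endpoint_name, [])
        if len(policies) < 2:
            raise ValueError("no earlier policy to roll back to")
        policies.pop()  # discard the latest decision
        return policies[-1]  # the previous policy is active again


history = PolicyHistory()
history.split("service-name", {"backend:v1": 1.0})
history.split("service-name", {"backend:v1": 0.5, "backend:v2": 0.5})
restored = history.rollback("service-name")
print(restored)
```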

ray.experimental.serve.get_handle(endpoint_name)

Retrieve a RayServeHandle for a service endpoint, so that it can be invoked from Python.

  • endpoint_name (str) – A registered service endpoint.

ray.experimental.serve.stat(percentiles=[50, 90, 95], agg_windows_seconds=[10, 60, 300, 600, 3600])

Retrieve metric statistics about the Ray Serve system.

  • percentiles (List[int]) – The percentiles for aggregation operations. The default is the 50th, 90th, and 95th percentiles.
  • agg_windows_seconds (List[int]) – The aggregation windows in seconds. The longest aggregation window must be shorter than or equal to gc_window_seconds.
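
The relationship between percentiles, aggregation windows, and the GC window can be sketched in plain Python. This is an illustration under assumptions, not the library's metric code: the aggregate and percentile helpers, the (timestamp, value) sample layout, and the nearest-rank percentile method are all hypothetical choices.

```python
def percentile(sorted_values, p):
    """Nearest-rank percentile of a sorted, non-empty list (p is an int in 1..100)."""
    # Integer ceiling of p * n / 100, avoiding float rounding.
    rank = max(1, -(-p * len(sorted_values) // 100))
    return sorted_values[rank - 1]


def aggregate(samples, now, percentiles, agg_windows_seconds, gc_window_seconds):
    """samples is a list of (timestamp, value); returns {window: {percentile: value}}."""
    if max(agg_windows_seconds) > gc_window_seconds:
        # Data older than the GC window is gone, so a longer window cannot be answered.
        raise ValueError("longest aggregation window exceeds gc_window_seconds")
    result = {}
    for window in agg_windows_seconds:
        recent = sorted(value for ts, value in samples if now - ts <= window)
        result[window] = {p: percentile(recent, p) for p in percentiles} if recent else {}
    return result


# Hypothetical data: one sample per second, whose value equals its timestamp.
samples = [(t, float(t)) for t in range(1, 101)]
stats = aggregate(samples, now=100, percentiles=[50, 90, 95],
                  agg_windows_seconds=[10, 60], gc_window_seconds=3600)
print(stats)
```

The check at the top of aggregate mirrors the stated constraint: a window longer than the retention period would ask for data that has already been deleted.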

ray.experimental.serve.scale(backend_tag, num_replicas)

Set the number of replicas for backend_tag.

  • backend_tag (str) – A registered backend.
  • num_replicas (int) – The desired number of replicas.