Key Concepts


Deployments are the central concept in Ray Serve. They allow you to define and update your business logic or models that will handle incoming requests as well as how this is exposed over HTTP or in Python.

A deployment is defined using @serve.deployment on a Python class (or function for simple use cases). You can specify arguments to be passed to the constructor when you call Deployment.deploy(), shown below.

A deployment consists of a number of replicas, which are individual copies of the function or class that are started in separate Ray Actors (processes).

class MyFirstDeployment:
  # Take the message to return as an argument to the constructor.
  def __init__(self, msg):
      self.msg = msg

  def __call__(self, request):
      return self.msg

  def other_method(self, arg):
      return self.msg

MyFirstDeployment.deploy("Hello world!")

Deployments can be exposed in two ways: over HTTP or in Python via the ServeHandle API. By default, HTTP requests will be forwarded to the __call__ method of the class (or the function) and a Starlette Request object will be the sole argument. You can also define a deployment that wraps a FastAPI app for more flexible handling of HTTP requests. See FastAPI HTTP Deployments for details.

To serve multiple deployments defined by the same class, use the name option:


You can also list all available deployments and dynamically get references to them:

>> serve.list_deployments()
{'A': Deployment(name=A,version=None,route_prefix=/A)}
{'MyFirstDeployment': Deployment(name=MyFirstDeployment,version=None,route_prefix=/MyFirstDeployment}

# Returns the same object as the original MyFirstDeployment object.
# This can be used to redeploy, get a handle, etc.
deployment = serve.get_deployment("MyFirstDeployment")

HTTP Ingress

By default, deployments are exposed over HTTP at http://localhost:8000/<deployment_name>. The HTTP path that the deployment is available at can be changed using the route_prefix option. All requests to /{route_prefix} and any subpaths will be routed to the deployment (using a longest-prefix match for overlapping route prefixes).

Here’s an example:

@serve.deployment(name="http_deployment", route_prefix="/api")
class HTTPDeployment:
  def __call__(self, request):
      return "Hello world!"

After creating the deployment, it is now exposed by the HTTP server and handles requests using the specified class. We can query the model to verify that it’s working.

import requests


We can also query the deployment using the ServeHandle interface.

# To get a handle from the same script, use the Deployment object directly:
handle = HTTPDeployment.get_handle()

# To get a handle from a different script, reference it by name:
handle = serve.get_deployment("http_deployment").get_handle()


As noted above, there are two ways to expose deployments. The first is by using the ServeHandle interface. This method allows you to access deployments within a Python script or code, making it convenient for a Python developer. And the second is by using the HTTP request, allowing access to deployments via a web client application.


Let’s look at a simple end-to-end example using both ways to expose and access deployments. Your output may vary due to random nature of how the prediction is computed; however, the example illustrates two things:

  1. how to expose and use deployments and 2) how to use replicas, to which requests are sent. Note that each pid is a separate replica associated with each deployment name, rep-1 and rep-2 respectively.

# This brief example shows how to create, deploy, and expose access to
# deployment models, using the simple Ray Serve deployment APIs.
# Once deployed, you can access deployment via two methods:
# ServerHandle API and HTTP
import os
from random import random

import requests
import starlette
from starlette.requests import Request
import ray
from ray import serve

# A simple example model stored in a pickled format at an accessible path
# that can be reloaded and deserialized into a model instance. Once deployed
# in Ray Serve, we can use it for prediction. The prediction is a fake condition,
# based on threshold of weight greater than 0.5.

class Model:
    def __init__(self, path):
        self.path = path

    def predict(self, data):
        return random() + data if data > 0.5 else data

class Deployment:
    # Take in a path to load your desired model
    def __init__(self, path: str) -> None:
        self.path = path
        self.model = Model(path)
        # Get the pid on which this deployment is running on = os.getpid()

    # Deployments are callable. Here we simply return a prediction from
    # our request
    def __call__(self, starlette_request) -> str:
        # Request came via an HTTP
        if isinstance(starlette_request, starlette.requests.Request):
            data = starlette_request.query_params['data']
            # Request came via a ServerHandle API method call.
            data = starlette_request
        pred = self.model.predict(float(data))
        return f"(pid: {}); path: {self.path}; data: {float(data):.3f}; prediction: {pred:.3f}"

if __name__ == '__main__':

    # Start a Ray Serve instance. This will automatically start
    # or connect to an existing Ray cluster.

    # Create two distinct deployments of the same class as
    # two replicas. Associate each deployment with a unique 'name'.
    # This name can be used as to fetch its respective serve handle.
    # See code below for method 1.
    Deployment.options(name="rep-1", num_replicas=2).deploy("/model/rep-1.pkl")
    Deployment.options(name="rep-2", num_replicas=2).deploy("/model/rep-2.pkl")

    # Get the current list of deployments

    print("ServerHandle API responses: " + "--" * 5)

    # Method 1) Access each deployment using the ServerHandle API
    for _ in range(2):
        for d_name in ["rep-1", "rep-2"]:
            # Get handle to the each deployment and invoke its method.
            # Which replica the request is dispatched to is determined
            # by the Router actor.
            handle = serve.get_deployment(d_name).get_handle()
            print(f"handle name : {d_name}")
            print(f"prediction  : {ray.get(handle.remote(random()))}")
            print("-" * 2)

    print("HTTP responses: " + "--" * 5)

    # Method 2) Access deployment via HTTP Request
    for _ in range(2):
        for d_name in ["rep-1", "rep-2"]:
            # Send HTTP request along with data payload
            url = f"{d_name}"
            print(f"handle name : {d_name}")
            print(f"prediction  : {requests.get(url, params= {'data': random()}).text}")

# Output:
# {'rep-1': Deployment(name=rep-1,version=None,route_prefix=/rep-1),
# 'rep-2': Deployment(name=rep-2,version=None,route_prefix=/rep-2)}
# ServerHandle API responses: ----------
# handle name : rep-1
# prediction  : (pid: 62636); path: /model/rep-1.pkl; data: 0.600; prediction: 1.292
# --
# handle name : rep-2
# prediction  : (pid: 62635); path: /model/rep-2.pkl; data: 0.075; prediction: 0.075
# --
# handle name : rep-1
# prediction  : (pid: 62634); path: /model/rep-1.pkl; data: 0.186; prediction: 0.186
# --
# handle name : rep-2
# prediction  : (pid: 62637); path: /model/rep-2.pkl; data: 0.751; prediction: 1.444
# --
# HTTP responses: ----------
# handle name : rep-1
# prediction  : (pid: 62636); path: /model/rep-1.pkl; data: 0.582; prediction: 1.481
# handle name : rep-2
# prediction  : (pid: 62637); path: /model/rep-2.pkl; data: 0.778; prediction: 1.678
# handle name : rep-1
# prediction  : (pid: 62634); path: /model/rep-1.pkl; data: 0.139; prediction: 0.139
# handle name : rep-2
# prediction  : (pid: 62635); path: /model/rep-2.pkl; data: 0.569; prediction: 1.262

Deployment Graph

Building on top of the Deployment concept, Ray Serve provides a first-class API for composing models into a graph structure.

Here’s a simple example combining a preprocess function and model.

import ray
from ray import serve
from ray.serve.dag import InputNode
from ray.serve.drivers import DAGDriver

def preprocess(inp: int):
    return inp + 1

class Model:
    def __init__(self, increment: int):
        self.increment = increment

    def predict(self, inp: int):
        return inp + self.increment

with InputNode() as inp:
    model = Model.bind(increment=2)
    output = model.predict.bind(preprocess.bind(inp))
    serve_dag = DAGDriver.bind(output)

handle =
assert ray.get(handle.predict.remote(1)) == 4