Serve: Scalable and Programmable Serving


Get in touch with us if you’re using or considering using Ray Serve.


Ray Serve is a scalable model serving library for building online inference APIs. Serve is framework-agnostic, so you can use a single toolkit to serve everything from deep learning models built with frameworks like PyTorch, TensorFlow, and Keras, to Scikit-Learn models, to arbitrary Python business logic.

Serve is particularly well suited for model composition, enabling you to build a complex inference service consisting of multiple ML models and business logic all in Python code.

Serve is built on top of Ray, so it scales to many machines and offers flexible scheduling support, such as fractional GPUs. This lets deployments share resources, so you can serve many machine learning models at low cost.
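To make the fractional GPU idea concrete, the arithmetic can be sketched in plain Python. In Serve, a fractional share is typically requested per replica through the deployment's `ray_actor_options` (for example `{"num_gpus": 0.5}`). The helper below is hypothetical and is not part of Ray. It only counts how many replicas would fit, under the assumption that each replica's share must come from a single GPU.

```python
from fractions import Fraction


def replicas_that_fit(total_gpus: int, gpus_per_replica: str) -> int:
    """Count replicas that fit when each reserves a fraction of one GPU.

    Hypothetical helper, not a Ray API. Assumption: a replica's share
    must come from a single GPU, so leftover capacity on different
    GPUs cannot be combined.
    """
    share = Fraction(gpus_per_replica)  # exact arithmetic, no float rounding
    if not 0 < share <= 1:
        raise ValueError("gpus_per_replica must be in (0, 1]")
    replicas_per_gpu = 1 // share  # whole replicas that fit on one GPU
    return total_gpus * replicas_per_gpu


print(replicas_that_fit(total_gpus=2, gpus_per_replica="0.25"))  # 8
print(replicas_that_fit(total_gpus=1, gpus_per_replica="0.4"))   # 2
```

With a share of 0.4, only two replicas fit on one GPU and the remaining 0.2 stays idle, which is why shares that divide 1 evenly tend to pack best.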


Install Ray Serve and its dependencies:

pip install "ray[serve]"

In this quick-start example we will define a simple “hello world” deployment, deploy it behind HTTP locally, and query it.

import requests
from starlette.requests import Request
from typing import Dict

from ray import serve

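# 1: Define a Ray Serve deployment.
# route_prefix="/" exposes the deployment at the root URL queried in step 3.
@serve.deployment(route_prefix="/")
class MyModelDeployment:
    def __init__(self, msg: str):
        # Initialize model state: could be very large neural net weights.
        self._msg = msg

    def __call__(self, request: Request) -> Dict:
        return {"result": self._msg}

# 2: Deploy the model.
serve.run(MyModelDeployment.bind(msg="Hello world!"))

# 3: Query the deployment and print the result.
print(requests.get("http://localhost:8000/").json())
# {'result': 'Hello world!'}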

The examples that follow cover model composition, FastAPI integration, and serving a Hugging Face Transformers model.

In this example, we demonstrate how you can use Serve’s model composition API to express a complex computation graph and deploy it as a Serve application.

import requests
from ray import serve
from ray.serve.drivers import DAGDriver
from ray.serve.dag import InputNode
from ray.serve.http_adapters import json_request

# 1: Define the models in our composition graph.
@serve.deployment
class Adder:
    def __init__(self, increment: int):
        self.increment = increment

    def predict(self, inp: int):
        return self.increment + inp

@serve.deployment
def combine_average(*input_values) -> dict:
    return {"result": sum(input_values) / len(input_values)}

# 2: Define the model composition graph and deploy it.
with InputNode() as input_node:
    adder_1 = Adder.bind(increment=1)
    adder_2 = Adder.bind(increment=2)
    dag = combine_average.bind(
        adder_1.predict.bind(input_node), adder_2.predict.bind(input_node)
    )

# DAGDriver routes HTTP requests into the graph; json_request parses the JSON body.
serve.run(DAGDriver.bind(dag, http_adapter=json_request))

# 3: Query the deployment and print the result.
print(requests.post("http://localhost:8000/", json=100).json())
# {"result": 101.5}
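To see where the value 101.5 comes from, the same computation can be traced in plain Python without Serve. This is only a sketch of the graph's arithmetic: the classes mirror the example above, while `run_graph` is a hypothetical stand-in for the deployed graph and involves no HTTP or scheduling.

```python
class Adder:
    """Plain-Python mirror of the Adder deployment above."""

    def __init__(self, increment: int):
        self.increment = increment

    def predict(self, inp: int) -> int:
        return self.increment + inp


def combine_average(*input_values) -> dict:
    """Average the outputs of the upstream nodes."""
    return {"result": sum(input_values) / len(input_values)}


def run_graph(inp: int) -> dict:
    # Hypothetical stand-in for the deployed graph: the same input is
    # fanned out to both adders and their outputs are averaged.
    adder_1 = Adder(increment=1)
    adder_2 = Adder(increment=2)
    return combine_average(adder_1.predict(inp), adder_2.predict(inp))


output = run_graph(100)
print(output)  # {'result': 101.5}
```

The input 100 becomes 101 and 102 in the two adders, and their average is 101.5.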

In this example we will use Serve’s FastAPI integration to make use of more advanced HTTP functionality.

import requests
from fastapi import FastAPI
from ray import serve

# 1: Define a FastAPI app and wrap it in a deployment with a route handler.
app = FastAPI()

@serve.deployment(route_prefix="/")
@serve.ingress(app)
class FastAPIDeployment:
    # FastAPI will automatically parse the HTTP request for us.
    @app.get("/hello")
    def say_hello(self, name: str) -> str:
        return f"Hello {name}!"

# 2: Deploy the deployment.
serve.run(FastAPIDeployment.bind())

# 3: Query the deployment and print the result.
print(requests.get("http://localhost:8000/hello", params={"name": "Theodore"}).json())
# "Hello Theodore!"

To run this example, install the following: pip install transformers

In this example we will serve a pre-trained Hugging Face transformers model using Ray Serve. The model we’ll use is a sentiment analysis model: it takes a text string as input and returns whether the text is “POSITIVE” or “NEGATIVE”.

import requests
from starlette.requests import Request
from typing import Dict

from transformers import pipeline

from ray import serve

# 1: Wrap the pretrained sentiment analysis model in a Serve deployment.
@serve.deployment(route_prefix="/")
class SentimentAnalysisDeployment:
    def __init__(self):
        self._model = pipeline("sentiment-analysis")

    def __call__(self, request: Request) -> Dict:
        # The pipeline returns a list with one result per input text.
        return self._model(request.query_params["text"])[0]

# 2: Deploy the deployment.
serve.run(SentimentAnalysisDeployment.bind())

# 3: Query the deployment and print the result.
print(
    requests.get(
        "http://localhost:8000/", params={"text": "Ray Serve is great!"}
    ).json()
)
# {'label': 'POSITIVE', 'score': 0.9998476505279541}
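The deployment above reads its input from `request.query_params["text"]`, so the text travels in the URL's query string. The standard-library sketch below shows one way such a query string is encoded and then decoded again. It illustrates the round trip only and does not claim to reproduce exactly what `requests` or Starlette do internally.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Client side: encode the parameter into a query string.
query = urlencode({"text": "Ray Serve is great!"})
url = f"http://localhost:8000/?{query}"
print(url)  # http://localhost:8000/?text=Ray+Serve+is+great%21

# Server side: decode the query string back into the original text.
parsed = parse_qs(urlsplit(url).query)
text = parsed["text"][0]
print(text)  # Ray Serve is great!
```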

Why choose Serve?

How can Serve help me as a…

How does Serve compare to …

We believe Serve is unique because it gives you end-to-end control over your ML application while delivering scalability and high performance. To match Serve’s features with other tools, you would need to glue together multiple frameworks like TensorFlow Serving and SageMaker, or even roll your own micro-batching component to improve throughput.
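To give a sense of what rolling your own micro-batching component involves, here is a minimal asyncio sketch. It is not Serve's implementation: `MicroBatcher`, `double_all`, and every parameter name are hypothetical. Concurrent calls are collected until either a size limit is reached or a short timer expires, and the whole batch is then processed in one call.

```python
import asyncio
from typing import Callable, List, Optional, Tuple


class MicroBatcher:
    """Hypothetical helper that groups concurrent calls into batches."""

    def __init__(self, batch_fn: Callable[[list], list],
                 max_batch_size: int = 4, wait_s: float = 0.01):
        self._batch_fn = batch_fn
        self._max_batch_size = max_batch_size
        self._wait_s = wait_s
        self._pending: List[Tuple[object, asyncio.Future]] = []
        self._timer: Optional[asyncio.Task] = None

    async def submit(self, item):
        future = asyncio.get_running_loop().create_future()
        self._pending.append((item, future))
        if len(self._pending) >= self._max_batch_size:
            self._flush()  # batch is full: process it right away
        elif self._timer is None:
            # Start a timer so a partial batch is not left waiting forever.
            self._timer = asyncio.create_task(self._flush_later())
        return await future

    async def _flush_later(self):
        await asyncio.sleep(self._wait_s)
        self._timer = None
        self._flush()

    def _flush(self):
        batch, self._pending = self._pending, []
        if not batch:
            return
        results = self._batch_fn([item for item, _ in batch])
        for (_, future), result in zip(batch, results):
            future.set_result(result)


batch_sizes = []  # records how many items each batch call received


def double_all(batch: list) -> list:
    batch_sizes.append(len(batch))
    return [x * 2 for x in batch]


async def main():
    batcher = MicroBatcher(double_all, max_batch_size=4, wait_s=0.01)
    return await asyncio.gather(*(batcher.submit(i) for i in range(6)))


results = asyncio.run(main())
print(results)      # [0, 2, 4, 6, 8, 10]
print(batch_sizes)  # [4, 2]
```

Six concurrent requests are handled with two calls to the batch function instead of six. A production component would also need error propagation, cancellation, and backpressure, which this sketch leaves out.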

Learn More

Check out Getting Started and Key Concepts, or head over to the Examples to get started building your Ray Serve applications.

Getting Started

Start with our quick-start tutorials for deploying a single model locally and for converting an existing model into a Ray Serve deployment.

Key Concepts

Understand the key concepts behind Ray Serve. Learn about Deployments, how to query them, and the Deployment Graph API for composing models into a graph structure.

User Guides

Learn best practices for common patterns like scaling and resource allocation and model composition. Learn how to develop Serve applications locally and go to production.


Examples

Follow the tutorials to learn how to integrate Ray Serve with TensorFlow, Scikit-Learn, and RLlib.

API Reference

Get more in-depth information about the Ray Serve API.

Serve Architecture

Understand how each component in Ray Serve works.

For more, see the following blog posts about Ray Serve: