Ray Serve: Scalable and Programmable Serving#

Ray Serve is a scalable model serving library for building online inference APIs. Serve is framework-agnostic, so you can use a single toolkit to serve everything from deep learning models built with frameworks like PyTorch, TensorFlow, and Keras, to Scikit-Learn models, to arbitrary Python business logic. It has several features and performance optimizations for serving Large Language Models such as response streaming, dynamic request batching, multi-node/multi-GPU serving, etc.

Ray Serve is particularly well suited for model composition and multi-model serving, enabling you to build a complex inference service consisting of multiple ML models and business logic all in Python code.

Ray Serve is built on top of Ray, so it easily scales to many machines and offers flexible scheduling support such as fractional GPUs so you can share resources and serve many machine learning models at low cost.

Quickstart#

Install Ray Serve and its dependencies:

pip install "ray[serve]"

Define a simple “hello world” application, run it locally, and query it over HTTP.

import requests
from starlette.requests import Request
from typing import Dict

from ray import serve


# 1: Define a Ray Serve application.
@serve.deployment
class MyModelDeployment:
    def __init__(self, msg: str):
        # Initialize model state: could be very large neural net weights.
        self._msg = msg

    def __call__(self, request: Request) -> Dict:
        return {"result": self._msg}


app = MyModelDeployment.bind(msg="Hello world!")

# 2: Deploy the application locally.
serve.run(app, route_prefix="/")

# 3: Query the application and print the result.
print(requests.get("http://localhost:8000/").json())
# {'result': 'Hello world!'}

More examples#

Model composition

Use Serve’s model composition API to combine multiple deployments into a single application.

import requests
import starlette
from typing import Dict
from ray import serve
from ray.serve.handle import DeploymentHandle


# 1. Define the models in our composition graph and an ingress that calls them.
@serve.deployment
class Adder:
    def __init__(self, increment: int):
        self.increment = increment

    def add(self, inp: int):
        return self.increment + inp


@serve.deployment
class Combiner:
    def average(self, *inputs) -> float:
        return sum(inputs) / len(inputs)


@serve.deployment
class Ingress:
    def __init__(
        self,
        adder1: DeploymentHandle,
        adder2: DeploymentHandle,
        combiner: DeploymentHandle,
    ):
        self._adder1 = adder1
        self._adder2 = adder2
        self._combiner = combiner

    async def __call__(self, request: starlette.requests.Request) -> Dict[str, float]:
        input_json = await request.json()
        final_result = await self._combiner.average.remote(
            self._adder1.add.remote(input_json["val"]),
            self._adder2.add.remote(input_json["val"]),
        )
        return {"result": final_result}


# 2. Build the application consisting of the models and ingress.
app = Ingress.bind(Adder.bind(increment=1), Adder.bind(increment=2), Combiner.bind())
serve.run(app)

# 3: Query the application and print the result.
print(requests.post("http://localhost:8000/", json={"val": 100.0}).json())
# {"result": 101.5}

FastAPI integration

Use Serve’s FastAPI integration to elegantly handle HTTP parsing and validation.

import requests
from fastapi import FastAPI
from ray import serve

# 1: Define a FastAPI app and wrap it in a deployment with a route handler.
app = FastAPI()


@serve.deployment
@serve.ingress(app)
class FastAPIDeployment:
    # FastAPI will automatically parse the HTTP request for us.
    @app.get("/hello")
    def say_hello(self, name: str) -> str:
        return f"Hello {name}!"


# 2: Deploy the deployment.
serve.run(FastAPIDeployment.bind(), route_prefix="/")

# 3: Query the deployment and print the result.
print(requests.get("http://localhost:8000/hello", params={"name": "Theodore"}).json())
# "Hello Theodore!"

Hugging Face Transformers model

To run this example, install the following: pip install transformers

Serve a pre-trained Hugging Face Transformers model using Ray Serve. The model we’ll use is a sentiment analysis model: it will take a text string as input and return if the text was “POSITIVE” or “NEGATIVE.”

import requests
from starlette.requests import Request
from typing import Dict

from transformers import pipeline

from ray import serve


# 1: Wrap the pretrained sentiment analysis model in a Serve deployment.
@serve.deployment
class SentimentAnalysisDeployment:
    def __init__(self):
        self._model = pipeline("sentiment-analysis")

    def __call__(self, request: Request) -> Dict:
        return self._model(request.query_params["text"])[0]


# 2: Deploy the deployment.
serve.run(SentimentAnalysisDeployment.bind(), route_prefix="/")

# 3: Query the deployment and print the result.
print(
    requests.get(
        "http://localhost:8000/", params={"text": "Ray Serve is great!"}
    ).json()
)
# {'label': 'POSITIVE', 'score': 0.9998476505279541}

Why choose Serve?#

How can Serve help me as a…#

How does Serve compare to …#

We truly believe Serve is unique as it gives you end-to-end control over your ML application while delivering scalability and high performance. To achieve Serve’s feature offerings with other tools, you would need to glue together multiple frameworks like Tensorflow Serving and SageMaker, or even roll your own micro-batching component to improve throughput.

Learn More#

Check out Getting Started and Key Concepts, or head over to the Examples to get started building your Ray Serve applications.

Getting Started

Start with our quick start tutorials for deploying a single model locally and how to convert an existing model into a Ray Serve deployment.

Get Started with Ray Serve

Key Concepts

Understand the key concepts behind Ray Serve. Learn about Deployments, how to query them, and using DeploymentHandles to compose multiple models and business logic together.

Learn Key Concepts

Examples

Follow the tutorials to learn how to integrate Ray Serve with TensorFlow, and Scikit-Learn.

Serve Examples

API Reference

Get more in-depth information about the Ray Serve API.

Read the API Reference

Ray Serve Office Hours

Join our bi-weekly community office hours for anyone working with Ray Serve. Bring questions, issues you’re experiencing, or ideas you want to sanity check with the Ray Serve team.

For more, see the following blog posts about Ray Serve:

Serving ML Models in Production: Common Patterns by Simon Mo, Edward Oakes, and Michael Galarnyk
The Simplest Way to Serve your NLP Model in Production with Pure Python by Edward Oakes and Bill Chambers
Machine Learning Serving is Broken by Simon Mo
How to Scale Up Your FastAPI Application Using Ray Serve by Archit Kulkarni