Serve: Scalable and Programmable Serving

Tip

Get in touch with us if you’re using or considering using Ray Serve.

Chat with Ray Serve users and developers on our forum.


Ray Serve is a scalable model serving library for building online inference APIs. Serve is framework-agnostic, so you can use a single toolkit to serve everything from deep learning models built with frameworks like PyTorch, TensorFlow, and Keras, to scikit-learn models, to arbitrary Python business logic.

Serve is particularly well suited for Model Composition, enabling you to build a complex inference service consisting of multiple ML models and business logic all in Python code.
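
As a rough sketch of what composition can look like, the example below fans a request out to two models and combines their outputs. The deployment names and logic are purely illustrative, and the sketch assumes the deployment handle API; it is not part of the quickstarts that follow.

from ray import serve


@serve.deployment
class ModelA:
    def __call__(self, data: str) -> str:
        # Placeholder for real model inference.
        return f"A({data})"


@serve.deployment
class ModelB:
    def __call__(self, data: str) -> str:
        # Placeholder for real model inference.
        return f"B({data})"


@serve.deployment(route_prefix="/composed")
class ComposedModel:
    def __init__(self):
        # Handles let one deployment call other deployments from Python.
        self._a = ModelA.get_handle(sync=False)
        self._b = ModelB.get_handle(sync=False)

    async def __call__(self, request):
        data = request.query_params["data"]
        # Fan out to both models, then combine their outputs.
        ref_a = await self._a.remote(data)
        ref_b = await self._b.remote(data)
        return {"a": await ref_a, "b": await ref_b}


serve.start()
ModelA.deploy()
ModelB.deploy()
ComposedModel.deploy()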

Serve is built on top of Ray, so it easily scales to many machines and offers flexible scheduling features, such as fractional GPUs, that let you share resources and serve many machine learning models at low cost.
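
For instance, fractional resource requests are expressed through ray_actor_options. The sketch below (with a hypothetical deployment name) runs two replicas that each reserve half a GPU, so both can be packed onto a single physical GPU.

from ray import serve


# Two replicas, each reserving half a GPU, so a single physical GPU
# can host both replicas.
@serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 0.5})
class HalfGPUModel:
    def __call__(self, request):
        return "This replica shares a GPU with another replica."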

Install Ray Serve and its dependencies:

pip install "ray[serve]"

To run this example, install the following: pip install "ray[serve]"

In this quickstart example, we will define a simple “hello world” deployment, serve it over HTTP locally, and query it.

import requests
from ray import serve


# 1: Define a Ray Serve deployment.
@serve.deployment(route_prefix="/")
class MyModelDeployment:
    def __init__(self, msg: str):
        # Initialize model state: could be very large neural net weights.
        self._msg = msg

    def __call__(self, request):
        return {"result": self._msg}


# 2: Deploy the model.
serve.start()
MyModelDeployment.deploy(msg="Hello world!")

# 3: Query the deployment and print the result.
print(requests.get("http://localhost:8000/").json())
# {'result': 'Hello world!'}

To run this example, install the following: pip install "ray[serve]"

In this example, we will use Serve’s FastAPI integration to take advantage of more advanced HTTP functionality.

import requests
from fastapi import FastAPI
from ray import serve

# 1: Define a FastAPI app and wrap it in a deployment with a route handler.
app = FastAPI()


@serve.deployment(route_prefix="/")
@serve.ingress(app)
class FastAPIDeployment:
    # FastAPI will automatically parse the HTTP request for us.
    @app.get("/hello")
    def say_hello(self, name: str) -> str:
        return f"Hello {name}!"


# 2: Deploy the deployment.
serve.start()
FastAPIDeployment.deploy()

# 3: Query the deployment and print the result.
print(requests.get("http://localhost:8000/hello", params={"name": "Theodore"}).json())
# "Hello Theodore!"

To run this example, install the following: pip install "ray[serve]" transformers

In this example, we will serve a pre-trained Hugging Face Transformers model using Ray Serve. The model we’ll use is a sentiment analysis model: it takes a text string as input and returns whether the sentiment is “POSITIVE” or “NEGATIVE.”

import requests
from transformers import pipeline
from ray import serve


# 1: Wrap the pretrained sentiment analysis model in a Serve deployment.
@serve.deployment(route_prefix="/")
class SentimentAnalysisDeployment:
    def __init__(self):
        self._model = pipeline("sentiment-analysis")

    def __call__(self, request):
        return self._model(request.query_params["text"])[0]


# 2: Deploy the deployment.
serve.start()
SentimentAnalysisDeployment.deploy()

# 3: Query the deployment and print the result.
print(
    requests.get(
        "http://localhost:8000/", params={"text": "Ray Serve is great!"}
    ).json()
)
# {'label': 'POSITIVE', 'score': 0.9998476505279541}


Learn More

Check out Getting Started and Key Concepts, look at the Ray Serve FAQ, or head over to the Examples to get started building your Ray Serve applications.

For more, see the following blog posts about Ray Serve: