Ray Serve: Scalable and Programmable Serving

Ray Serve is an easy-to-use scalable model serving library built on Ray. Ray Serve is:

  • Framework-agnostic: Use a single toolkit to serve everything from deep learning models built with frameworks like PyTorch, TensorFlow, and Keras, to scikit-learn models, to arbitrary Python business logic.

  • Python-first: Configure your model serving with pure Python code—no more YAML or JSON configs.

Since Ray Serve is built on Ray, it allows you to easily scale to many machines, both in your datacenter and in the cloud.

Ray Serve can be used in two primary ways to deploy your models at scale:

  1. Have Python functions and classes automatically placed behind HTTP endpoints.

  2. Call them from within your existing Python web server using the Python-native ServeHandle API (see the sketch below).
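
For example, here is a minimal sketch of the second option: a Flask route that forwards requests to a running Serve endpoint through a ServeHandle. It is not taken from the docs; the endpoint name "my_endpoint" matches the quickstart below, and the ray.init(address="auto") connection step is an assumption that depends on how your Ray cluster and Serve instance are deployed.

import ray
from flask import Flask
from ray import serve

app = Flask(__name__)

# Assumes Ray and Serve are already running and that an endpoint named
# "my_endpoint" has been created elsewhere (see the quickstart below).
ray.init(address="auto")
handle = serve.get_handle("my_endpoint")


@app.route("/hello/<name>")
def hello(name):
    # handle.remote() returns a Ray ObjectRef; ray.get() waits for the result.
    return ray.get(handle.remote(name=name))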

Tip

Chat with Ray Serve users and developers on our forum!

Note

Starting with Ray version 1.2.0, Ray Serve backends take in a Starlette Request object instead of a Flask Request object. See the migration guide for details.
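
As a rough illustration of what that change means inside a backend (a sketch, not an excerpt from the migration guide), the Flask-style accessors map onto their Starlette equivalents:

def my_backend(request):
    # In Ray >= 1.2, `request` is a starlette.requests.Request.
    # Flask's request.args becomes request.query_params; headers are
    # still available via request.headers.
    name = request.query_params.get("name", "anonymous")
    user_agent = request.headers.get("user-agent", "")
    return "hello " + name + " (via " + user_agent + ")"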

Ray Serve Quickstart

Ray Serve supports Python versions 3.6 through 3.8. To install Ray Serve, run the following command:

pip install "ray[serve]"

Now you can serve a function…

import ray
from ray import serve
import requests

ray.init(num_cpus=4)
serve.start()


def say_hello(request):
    return "hello " + request.query_params["name"] + "!"


# Form a backend from our function and connect it to an endpoint.
serve.create_backend("my_backend", say_hello)
serve.create_endpoint("my_endpoint", backend="my_backend", route="/hello")

# Query our endpoint in two different ways: from HTTP and from Python.
print(requests.get("http://127.0.0.1:8000/hello?name=serve").text)
# > hello serve!
print(ray.get(serve.get_handle("my_endpoint").remote(name="serve")))
# > hello serve!

…or serve a stateful class.

import ray
from ray import serve
import requests

ray.init(num_cpus=4)
serve.start()


class Counter:
    def __init__(self):
        self.count = 0

    def __call__(self, request):
        self.count += 1
        return {"count": self.count}


# Form a backend from our class and connect it to an endpoint.
serve.create_backend("my_backend", Counter)
serve.create_endpoint("my_endpoint", backend="my_backend", route="/counter")

# Query our endpoint in two different ways: from HTTP and from Python.
print(requests.get("http://127.0.0.1:8000/counter").json())
# > {"count": 1}
print(ray.get(serve.get_handle("my_endpoint").remote()))
# > {"count": 2}

See Core APIs for more exhaustive coverage of Ray Serve and its core concepts: backends and endpoints. For a high-level view of the architecture underlying Ray Serve, see Serve Architecture.

Why Ray Serve?

There are generally two ways of serving machine learning applications, both with serious limitations: you can use a traditional web server—your own Flask app—or you can use a cloud-hosted solution.

The first approach is easy to get started with, but it’s hard to scale each component. The second approach often means vendor lock-in (SageMaker), framework-specific tooling (TensorFlow Serving), and a general lack of flexibility.

Ray Serve solves these problems by giving you a simple web server (and the ability to use your own) while still handling the complex routing, scaling, and testing logic necessary for production deployments.

Beyond scaling out your backends with multiple replicas, Ray Serve also lets you do things like split traffic between backends (for example, to run an A/B test) and batch requests for higher throughput.
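
Here is a minimal sketch of the replica scaling and traffic splitting just mentioned, written in the same API style as the quickstart above. The config={"num_replicas": 2} format and the serve.set_traffic call are assumptions about this generation of the API; see Core APIs for the exact signatures in your version.

import ray
from ray import serve

ray.init()
serve.start()


def say_hello(request):
    return "hello " + request.query_params.get("name", "world") + "!"


# Two versions of the same model, with the first scaled out to two
# replicas (the config dict format is an assumption).
serve.create_backend("model_v1", say_hello, config={"num_replicas": 2})
serve.create_backend("model_v2", say_hello)

# Route the endpoint to model_v1 first, then split traffic 90/10 between
# the two backends (serve.set_traffic is an assumed call; check Core APIs
# for the exact name in your version).
serve.create_endpoint("my_endpoint", backend="model_v1", route="/hello")
serve.set_traffic("my_endpoint", {"model_v1": 0.9, "model_v2": 0.1})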

For more on the motivation behind Ray Serve, check out these meetup slides and this blog post.

When should I use Ray Serve?

Ray Serve is a simple (but flexible) tool for deploying, operating, and monitoring Python-based machine learning models. Ray Serve excels when you need to scale out to serve models in production, whether because of large-scale batch processing requirements or because you serve several models behind different endpoints and need to run A/B tests or control traffic between them.

If you plan on running on multiple machines, Ray Serve will serve you well.

What’s next?

Check out the End-to-End Tutorial and Core APIs, look at the Ray Serve FAQ, or head over to the Advanced Tutorials to get started building your Ray Serve applications.

For more, see the following blog posts about Ray Serve: