Ray Serve: Scalable and Programmable Serving


Ray Serve is a scalable model-serving library built on Ray.

For users, Ray Serve is:

  • Framework Agnostic: Use the same toolkit to serve everything from deep learning models built with frameworks like PyTorch, Tensorflow, and Keras, to Scikit-Learn models, to arbitrary business logic.

  • Python First: Configure your model serving with pure Python code - no more YAML or JSON configs.

As a library, Ray Serve enables:

  • Splitting Traffic Between Backends with zero downtime, by decoupling routing logic from response handling logic.

  • Batching to improve performance is built in to help you meet your performance objectives. You can also use your model for batch and online processing.

  • Because Serve is a library, it’s esay to integrate it with other tools in your environment, such as CI/CD.

Since Serve is built on Ray, it also allows you to scale to many machines, in your datacenter or in cloud environments, and it allows you to leverage all of the other Ray frameworks.


If you want to try out Serve, join our community slack and discuss in the #serve channel.


Ray Serve supports Python versions 3.5 and higher. To install Ray Serve:

pip install "ray[serve]"

Ray Serve in 90 Seconds

Serve a function by defining a function, an endpoint, and a backend (in this case a stateless function) then connecting the two by setting traffic from the endpoint to the backend.

from ray import serve
import requests

client = serve.start()

def echo(flask_request):
    return "hello " + flask_request.args.get("name", "serve!")

client.create_backend("hello", echo)
client.create_endpoint("hello", backend="hello", route="/hello")

# > "hello serve!"

Serve a stateful class by defining a class (Counter), creating an endpoint and a backend, then connecting the two by setting traffic from the endpoint to the backend.

from ray import serve
import requests

client = serve.start()

class Counter:
    def __init__(self):
        self.count = 0

    def __call__(self, flask_request):
        return {"current_counter": self.count}

client.create_backend("counter", Counter)
client.create_endpoint("counter", backend="counter", route="/counter")

# > {"current_counter": self.count}

See Key Concepts for more exhaustive coverage about Ray Serve and its core concepts.

Why Ray Serve?

There are generally two ways of serving machine learning applications, both with serious limitations: you can build using a traditional webserver - your own Flask app or you can use a cloud hosted solution.

The first approach is easy to get started with, but it’s hard to scale each component. The second approach requires vendor lock-in (SageMaker), framework specific tooling (TFServing), and a general lack of flexibility.

Ray Serve solves these problems by giving a user the ability to leverage the simplicity of deployment of a simple webserver but handles the complex routing, scaling, and testing logic necessary for production deployments.

For more on the motivation behind Ray Serve, check out these meetup slides.

When should I use Ray Serve?

Ray Serve is a simple (but flexible) tool for deploying, operating, and monitoring Python based machine learning models. Ray Serve excels when scaling out to serve models in production is a necessity. This might be because of large scale batch processing requirements or because you’re going to serve a number of models behind different endpoints and may need to run A/B tests or control traffic between different models.

If you plan on running on multiple machines, Ray Serve will serve you well.

What’s next?

Check out the Key Concepts, learn more about Advanced Topics, Configurations, & FAQ, look at the Ray Serve FAQ, or head over to the Tutorials to get started building your Ray Serve Applications.