Serving ML Models

This section should help you:

  • batch requests to optimize performance

  • serve multiple models by composing deployments

Request Batching

Ray Serve can batch requests to improve throughput, which matters most for ML models that run on GPUs. To use this feature, you need to do two things:

  1. Use async def for your request handling logic to process queries concurrently.

  2. Use the @serve.batch decorator to batch individual queries that come into the replica. The method/function that’s decorated should handle a list of requests and return a list of the same length.

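from ray import serve


@serve.deployment
class BatchingExample:
    def __init__(self):
        self.count = 0

    # Serve collects concurrent calls and passes them in as one list.
    @serve.batch
    async def handle_batch(self, requests):
        responses = []
        for request in requests:
            # Placeholder logic (the original loop body is missing):
            # count each request and respond with the running total.
            self.count += 1
            responses.append(self.count)

        return responses

    # Each caller passes one request and gets one response back.
    async def __call__(self, request):
        return await self.handle_batch(request)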
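
To build intuition for what batching does, here is a minimal, Ray-free sketch that uses only asyncio. It is not Ray Serve's implementation; MicroBatcher, double_all, max_batch_size, and wait_timeout_s are hypothetical names chosen for illustration. Each caller submits a single item, and a background task groups whatever arrived within a short window into one list call, then hands each caller its own result:

```python
import asyncio


class MicroBatcher:
    """Groups concurrent single-item calls into one list call (illustrative only)."""

    def __init__(self, batch_fn, max_batch_size=4, wait_timeout_s=0.01):
        self.batch_fn = batch_fn  # async callable: list of inputs -> list of outputs
        self.max_batch_size = max_batch_size
        self.wait_timeout_s = wait_timeout_s
        self.queue = None
        self.worker = None

    async def submit(self, item):
        # Lazily create the queue and worker inside the running event loop.
        if self.worker is None:
            self.queue = asyncio.Queue()
            self.worker = asyncio.create_task(self._run())
        future = asyncio.get_running_loop().create_future()
        self.queue.put_nowait((item, future))
        return await future

    async def _run(self):
        while True:
            # Wait for the first request, then give others a short window to arrive.
            batch = [await self.queue.get()]
            await asyncio.sleep(self.wait_timeout_s)
            while len(batch) < self.max_batch_size and not self.queue.empty():
                batch.append(self.queue.get_nowait())
            try:
                results = await self.batch_fn([item for item, _ in batch])
                for (_, future), result in zip(batch, results):
                    future.set_result(result)
            except Exception as exc:
                # Propagate a batch failure to every caller still waiting.
                for _, future in batch:
                    if not future.done():
                        future.set_exception(exc)


batch_sizes = []  # records how many items each batch call received


async def double_all(items):
    # Handles a whole list at once and returns a list of the same length.
    batch_sizes.append(len(items))
    return [2 * x for x in items]


async def main():
    batcher = MicroBatcher(double_all)
    # Eight concurrent callers, each submitting a single item.
    return await asyncio.gather(*(batcher.submit(i) for i in range(8)))


results = asyncio.run(main())
print(results)      # one result per caller, in submission order
print(batch_sizes)  # sizes of the batches that were formed
```

The sketch always waits for the full window before flushing, which keeps the code short; a production batcher would typically flush as soon as the batch is full.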


Please take a look at the Batching Tutorial for a deep dive.

Model Composition


Serve recently added an experimental first-class API for model composition (pipelines). Please take a look at the Pipeline API and try it out!

Ray Serve supports composing individually scalable models into a single model out of the box. For instance, you can combine multiple models to perform stacking or ensembles.

To define a higher-level composed model you need to do three things:

  1. Define your underlying models (the ones that you will compose together) as Ray Serve deployments.

  2. Define your composed model, using the handles of the underlying models (see the example below).

  3. Define a deployment representing this composed model and query it!

Calls from the composed model to the underlying models should not block. If they do, the composed model handles one request at a time and becomes very slow. To avoid this, define the composed model's request handler with async def and await each call to an underlying model. You’ll see this in the example below.

That’s it. Let’s take a look at an example:

from random import random
import requests
import ray
from ray import serve

# Start Ray and a Serve instance before deploying anything.
ray.init()
serve.start()


# Our pipeline will be structured as follows:
# - Input comes in, the composed model sends it to model_one
# - model_one outputs a random number between 0 and 1, if the value is
#   greater than 0.5, then the data is sent to model_two
# - otherwise, the data is returned to the user.

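# Let's define two models that just print out the data they received.
# Each must be a Serve deployment so that ComposedModel can get a handle to it.

@serve.deployment
def model_one(data):
    print("Model 1 called with data ", data)
    return random()


@serve.deployment
def model_two(data):
    print("Model 2 called with data ", data)
    return data


# Deploy both models before ComposedModel requests their handles.
model_one.deploy()
model_two.deploy()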


# max_concurrent_queries is optional. By default, if you pass in an async
# function, Ray Serve sets the limit to a high number.
@serve.deployment(max_concurrent_queries=10, route_prefix="/composed")
class ComposedModel:
    def __init__(self):
        self.model_one = model_one.get_handle()
        self.model_two = model_two.get_handle()

    # This method can be called concurrently!
    async def __call__(self, starlette_request):
        data = await starlette_request.body()

        score = await self.model_one.remote(data=data)
        if score > 0.5:
            result = await self.model_two.remote(data=data)
            result = {"model_used": 2, "score": score}
        else:
            result = {"model_used": 1, "score": score}

        return result


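ComposedModel.deploy()

# Assumes Serve's default HTTP address, 127.0.0.1:8000.
for _ in range(5):
    resp = requests.get("http://127.0.0.1:8000/composed", data="hey!")
    print(resp.json())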
# Output
# {'model_used': 2, 'score': 0.6250189863595503}
# {'model_used': 1, 'score': 0.03146855349621436}
# {'model_used': 2, 'score': 0.6916977560006987}
# {'model_used': 2, 'score': 0.8169693450866928}
# {'model_used': 2, 'score': 0.9540681979573862}
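
The example above routes conditionally from one model to the next. For the ensemble case mentioned earlier, a fan-out pattern is a natural fit: start every model call before awaiting any of them so the calls overlap. Below is a Ray-free sketch using plain asyncio; model_a, model_b, and ensemble are hypothetical stand-ins, and the averaging rule is only one possible way to combine outputs. In a Serve deployment, each coroutine call would instead be an awaited call on a deployment handle, as in the composed model above.

```python
import asyncio


async def model_a(x):
    await asyncio.sleep(0.05)  # stand-in for inference latency
    return x + 1.0


async def model_b(x):
    await asyncio.sleep(0.05)  # stand-in for inference latency
    return x * 3.0


async def ensemble(x):
    # gather() schedules both calls at once, so their waits overlap
    # instead of running one after the other.
    a, b = await asyncio.gather(model_a(x), model_b(x))
    return {"prediction": (a + b) / 2, "members": [a, b]}


result = asyncio.run(ensemble(2.0))
print(result)  # {'prediction': 4.5, 'members': [3.0, 6.0]}
```

Awaiting the two calls one after the other would give the same result but take roughly the sum of both latencies, which is the slowdown that async def plus concurrent awaits is meant to avoid.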

Integration with Model Registries

Ray Serve is flexible. If you can load your model as a Python function or class, then you can scale it up and serve it with Ray Serve.

For example, if you are using the MLflow Model Registry to manage your models, the following wrapper class will allow you to load a model using its MLflow Model URI:

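import io

import pandas as pd
import mlflow.pyfunc
from ray import serve


@serve.deployment
class MLflowDeployment:
    def __init__(self, model_uri):
        self.model = mlflow.pyfunc.load_model(model_uri=model_uri)

    async def __call__(self, request):
        csv_bytes = await request.body()  # The body contains just raw CSV text.
        # read_csv expects a path or file-like object, so wrap the bytes.
        df = pd.read_csv(io.BytesIO(csv_bytes))
        return self.model.predict(df)


# Model Registry URIs use the "models:/" scheme.
model_uri = "models:/my_registered_model/Production"
MLflowDeployment.deploy(model_uri)

To serve multiple different MLflow models in the same program, use the name option:

# "my_mlflow_model_1" is an illustrative deployment name.
MLflowDeployment.options(name="my_mlflow_model_1").deploy(model_uri)
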

The above approach will work for any model registry, not just MLflow. Namely, load the model from the registry in __init__, and forward the request to the model in __call__.
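
The load-in-__init__, forward-in-__call__ pattern can be sketched without any particular registry. The block below is a minimal illustration under stated assumptions: InMemoryRegistry, RegistryBackedModel, build_scaler, and the "registry://scaler/v1" URI are all hypothetical and do not correspond to any real registry API. With a real registry, only the load call inside __init__ would change.

```python
class InMemoryRegistry:
    """Hypothetical registry: maps a URI string to a factory that builds a model."""

    def __init__(self):
        self._factories = {}

    def register(self, uri, factory):
        self._factories[uri] = factory

    def load(self, uri):
        return self._factories[uri]()


class RegistryBackedModel:
    def __init__(self, registry, model_uri):
        # Load once, when the wrapper is constructed.
        self.model = registry.load(model_uri)

    def __call__(self, payload):
        # Every request is forwarded to the already-loaded model.
        return self.model(payload)


load_count = 0  # counts how many times the model is built


def build_scaler():
    global load_count
    load_count += 1
    return lambda values: [2 * v for v in values]


registry = InMemoryRegistry()
registry.register("registry://scaler/v1", build_scaler)

served = RegistryBackedModel(registry, "registry://scaler/v1")
first = served([1, 2, 3])
second = served([10])
print(first, second, load_count)  # [2, 4, 6] [20] 1
```

Loading in the constructor means the potentially expensive fetch happens once per wrapper instance rather than once per request.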

For an even more hands-off and seamless integration with MLflow, check out the Ray Serve MLflow deployment plugin. A full tutorial is available here.

Framework-Specific Tutorials

Ray Serve seamlessly integrates with popular Python ML libraries. Below are tutorials with some of these frameworks to help get you started.