Integration with Existing Web Servers

In this guide, you will learn how to use Ray Serve to scale up your existing web application. The key feature of Ray Serve that makes this possible is the Python-native ServeHandle API, which allows you to keep using the same Python web server while offloading your heavy computation to Ray Serve.

We give two examples, one using a FastAPI web server and another using an AIOHTTP web server, but the same approach will work with any Python web server.
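The pattern behind both examples can be sketched without Ray installed. The following is a minimal stand-in, not the Ray Serve API: `ToyHandle`, `heavy_model`, and `run_demo` are hypothetical names, and a thread pool takes the place of Ray Serve. It only illustrates the shape of the code, where a web handler awaits `handle.remote(...)` while the heavy function runs elsewhere.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def heavy_model(text: str) -> str:
    """Placeholder for an expensive computation such as model inference."""
    return text.upper()


class ToyHandle:
    """Hypothetical stand-in for a ServeHandle; not part of Ray Serve."""

    def __init__(self, func, max_workers: int = 2):
        self._func = func
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def remote(self, *args):
        # Return an awaitable so the caller's event loop stays free
        # while the function runs in a worker thread.
        loop = asyncio.get_running_loop()
        return loop.run_in_executor(self._pool, self._func, *args)

    def shutdown(self):
        self._pool.shutdown(wait=True)


async def handle_request(handle: ToyHandle, query: str) -> str:
    # The web handler only awaits the result; it does no heavy work itself.
    return await handle.remote(query)


def run_demo(query: str) -> str:
    handle = ToyHandle(heavy_model)
    try:
        return asyncio.run(handle_request(handle, query))
    finally:
        handle.shutdown()
```

In the real examples, Ray Serve plays the role of the thread pool and can additionally spread the work across processes and machines.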

Scaling Up a FastAPI Application

For this example, you must have either PyTorch or TensorFlow installed, as well as Hugging Face Transformers and FastAPI. For example:

pip install "ray[serve]" tensorflow transformers fastapi

Here’s a simple FastAPI web server. It uses Hugging Face Transformers to auto-generate text based on a short initial input using OpenAI’s GPT-2 model.

from fastapi import FastAPI
from transformers import pipeline  # A simple API for NLP tasks.

app = FastAPI()

nlp_model = pipeline("text-generation", model="gpt2")  # Load the model.

# The function below handles GET requests to the URL `/generate`.
@app.get("/generate")
def generate(query: str):
    return nlp_model(query, max_length=50)  # Generate up to 50 tokens, prompt included.

To scale this up, we define a Ray Serve deployment containing our text model and call it from Python using a ServeHandle:

import ray
from ray import serve

from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()

# Define our deployment.
@serve.deployment
class GPT2:
    def __init__(self):
        self.nlp_model = pipeline("text-generation", model="gpt2")

    async def __call__(self, request):
        return self.nlp_model(await request.body(), max_length=50)

@app.on_event("startup")  # Code to be run when the server starts.
async def startup_event():
    ray.init(address="auto")  # Connect to the running Ray cluster.
    serve.start(http_host=None)  # Start the Ray Serve instance.

    # Deploy our GPT2 Deployment.
    GPT2.deploy()

@app.get("/generate")
async def generate(query: str):
    # Get a handle to our deployment so we can query it in Python.
    handle = GPT2.get_handle()
    return await handle.remote(query)

To run this example, save it as main.py, and then in the same directory, run the following commands to start a local Ray cluster on your machine and run the FastAPI application:

ray start --head
uvicorn main:app

Now you can query your web server, for example by running the following in another terminal:

curl "http://127.0.0.1:8000/generate?query=Hello%20friend%2C%20how"

The terminal should then print the generated text:

[{"generated_text":"Hello friend, how's your morning?\n\nSven: Thank you.\n\nMRS. MELISSA: I feel like it really has done to you.\n\nMRS. MELISSA: The only thing I"}]
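The same query can be issued from Python. This is a minimal sketch using only the standard library; it assumes the server listens on uvicorn's default address 127.0.0.1:8000 and exposes the `/generate` route with a `query` parameter. The helper names `build_generate_url`, `extract_text`, and `fetch_generated_text` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8000"  # Assumed: uvicorn's default host and port.


def build_generate_url(query: str, base_url: str = BASE_URL) -> str:
    """Build the GET URL for the /generate route, URL-encoding the query."""
    return f"{base_url}/generate?{urlencode({'query': query})}"


def extract_text(response_body: str) -> str:
    """Pull the generated text out of a response shaped like the one above."""
    results = json.loads(response_body)
    return results[0]["generated_text"]


def fetch_generated_text(query: str) -> str:
    """Send the request; requires the FastAPI server to be running."""
    with urlopen(build_generate_url(query)) as response:
        return extract_text(response.read().decode("utf-8"))
```

With the server running, `fetch_generated_text("Hello friend, how")` would return the generated string rather than the raw JSON list.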

To clean up the Ray cluster, run ray stop in the terminal.


Ray Serve creates as many replicas of your model as the deployment option num_replicas specifies and places them across multiple CPU cores and, provided you have started a multi-node Ray cluster, multiple machines. Your throughput increases roughly in proportion to the number of replicas.
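As a rough illustration of why more replicas raise throughput, the following back-of-the-envelope sketch models each replica as serving one request at a time with a fixed latency. Both functions are hypothetical helpers, and round-robin dispatch is an assumption made for the illustration, not a description of how Ray Serve routes requests.

```python
import math


def estimated_batch_seconds(num_requests: int, num_replicas: int,
                            seconds_per_request: float) -> float:
    """Idealized time to finish a batch when each replica serves one request at a time."""
    if num_replicas < 1:
        raise ValueError("num_replicas must be at least 1")
    # Requests are processed in "waves" of at most num_replicas at once.
    waves = math.ceil(num_requests / num_replicas)
    return waves * seconds_per_request


def assign_round_robin(num_requests: int, num_replicas: int) -> list:
    """Count how many requests each replica receives under round-robin dispatch."""
    counts = [0] * num_replicas
    for i in range(num_requests):
        counts[i % num_replicas] += 1
    return counts
```

Under these assumptions, doubling the replicas from 1 to 2 halves the time for 8 requests at 0.5 seconds each. Real speedups depend on available cores, model size, and scheduling overhead.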

Scaling Up an AIOHTTP Application

In this section, we’ll integrate Ray Serve with an AIOHTTP web server run using Gunicorn. You’ll need to install AIOHTTP and Gunicorn with the command pip install aiohttp gunicorn.

First, here is the script that deploys Ray Serve:

# File name:
import ray
from ray import serve

# Connect to the running Ray cluster.
ray.init(address="auto")

# Start a detached Ray Serve instance.  It will persist after the script exits.
serve.start(http_host=None, detached=True)

# Set up a deployment with the desired number of replicas. This could also be
# a stateful class (e.g., if we had an expensive model to set up).
@serve.deployment(name="my_model", num_replicas=2)
async def my_model(request):
    data = await request.body()
    return f"Model received data: {data}"

# Deploy the model so it can be fetched by name from other scripts.
my_model.deploy()

Next is the script that defines the AIOHTTP server:

# File name: aiohttp_app.py
from aiohttp import web

import ray
from ray import serve

# Connect to the running Ray cluster.
ray.init(address="auto")

# Fetch the ServeHandle to query our model.
my_handle = serve.get_deployment("my_model").get_handle()

# Define our AIOHTTP request handler.
async def handle_request(request):
    # Offload the computation to our Ray Serve deployment.
    result = await my_handle.remote("dummy input")
    return web.Response(text=result)

# Set up an HTTP endpoint.
app = web.Application()
app.add_routes([web.get("/dummy-model", handle_request)])

if __name__ == "__main__":
    web.run_app(app)

Here’s how to run this example:

  1. Run ray start --head to start a local Ray cluster in the background.

  2. In the directory where the example files are saved, run the first script above with python to deploy our Ray Serve endpoint.

  3. Run gunicorn aiohttp_app:app --worker-class aiohttp.GunicornWebWorker to start the AIOHTTP app using Gunicorn.

  4. To test out the server, run curl localhost:8000/dummy-model. This should output Model received data: dummy input.

  5. For cleanup, you can press Ctrl-C to stop the Gunicorn server, and run ray stop to stop the background Ray cluster.