Calling Deployments via HTTP and Python

This section should help you:

  • understand how deployments can be called in two ways: from HTTP and from Python

  • integrate Ray Serve with an existing web server

Calling Deployments via HTTP

Basic Example

As shown in the Ray Serve Quickstart, when you create a deployment, it is exposed over HTTP by default at /{deployment_name}. You can change the route by specifying the route_prefix argument to the @serve.deployment decorator.

from ray import serve

@serve.deployment(route_prefix="/counter")
class Counter:
    def __call__(self, request):
        pass

When you make a request to the Serve HTTP server at /counter, it will forward the request to the deployment’s __call__ method and provide a Starlette Request object as the sole argument. The __call__ method can return any JSON-serializable object or a Starlette Response object (e.g., to return a custom status code).
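As a minimal sketch of the request flow described above, the class below fills in a __call__ body that reads a query parameter from the request and returns a JSON-serializable dict. The step query parameter and the deploy_counter helper are illustrative assumptions, not part of Ray Serve, and the helper assumes a running Ray cluster with Serve already started.

```python
class Counter:
    """Keeps a running count that each HTTP request advances."""

    def __init__(self):
        self.count = 0

    def __call__(self, request):
        # `request` is expected to be a Starlette Request; only its
        # `query_params` mapping is used here. `step` is a made-up parameter.
        step = int(request.query_params.get("step", 1))
        self.count += step
        # A dict is JSON-serializable, so it can be returned directly.
        return {"count": self.count}


def deploy_counter():
    """Hypothetical helper: expose Counter at /counter on a running Serve instance."""
    # Imported here so the class above can be exercised without a cluster.
    from ray import serve

    serve.deployment(route_prefix="/counter")(Counter).deploy()
```

With the deployment running, a request to /counter?step=2 would reach __call__ with "2" available under request.query_params.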

Below, we discuss some advanced features for customizing Ray Serve’s HTTP functionality.

FastAPI HTTP Deployments

If you want to define more complex HTTP handling logic, Serve integrates with FastAPI. This allows you to define a Serve deployment using the @serve.ingress decorator that wraps a FastAPI app with its full range of features. The most basic example is shown below. For details on everything FastAPI offers, such as variable routes, automatic type validation, and dependency injection (e.g., for database connections), see the FastAPI documentation.

import ray

from fastapi import FastAPI
from ray import serve

app = FastAPI()
ray.init(address="auto", namespace="summarizer")
serve.start(detached=True)

@serve.deployment(route_prefix="/hello")
@serve.ingress(app)
class MyFastAPIDeployment:
    @app.get("/")
    def root(self):
        return "Hello, world!"

MyFastAPIDeployment.deploy()

Now if you send a request to /hello, this will be routed to the root method of our deployment. We can also easily leverage FastAPI to define multiple routes with different HTTP methods:

import ray

from fastapi import FastAPI
from ray import serve

app = FastAPI()
ray.init(address="auto", namespace="summarizer")
serve.start(detached=True)

@serve.deployment(route_prefix="/hello")
@serve.ingress(app)
class MyFastAPIDeployment:
    @app.get("/")
    def root(self):
        return "Hello, world!"

    @app.post("/{subpath}")
    def hello_subpath(self, subpath: str):
        return f"Hello from {subpath}!"

MyFastAPIDeployment.deploy()

You can also pass in an existing FastAPI app to a deployment to serve it as-is:

import ray

from fastapi import FastAPI
from ray import serve

app = FastAPI()
ray.init(address="auto", namespace="summarizer")
serve.start(detached=True)

@app.get("/")
def f():
    return "Hello from the root!"

# ... add more routes, routers, etc. to `app` ...

@serve.deployment(route_prefix="/")
@serve.ingress(app)
class FastAPIWrapper:
    pass

FastAPIWrapper.deploy()

This is useful for scaling out an existing FastAPI app with no modifications necessary. Existing middlewares, automatic OpenAPI documentation generation, and other advanced FastAPI features should work as-is. You can also combine routes defined this way with routes defined on the deployment:

import ray

from fastapi import FastAPI
from ray import serve

app = FastAPI()
ray.init(address="auto", namespace="summarizer")
serve.start(detached=True)

@app.get("/")
def f():
    return "Hello from the root!"

@serve.deployment(route_prefix="/api1")
@serve.ingress(app)
class FastAPIWrapper1:
    @app.get("/subpath")
    def method(self):
        return "Hello 1!"

@serve.deployment(route_prefix="/api2")
@serve.ingress(app)
class FastAPIWrapper2:
    @app.get("/subpath")
    def method(self):
        return "Hello 2!"

FastAPIWrapper1.deploy()
FastAPIWrapper2.deploy()

In this example, requests to both /api1 and /api2 would return Hello from the root! while a request to /api1/subpath would return Hello 1! and a request to /api2/subpath would return Hello 2!.

To try it out, save a code snippet in a local Python file (e.g., main.py) and, in the same directory, run the following commands to start a local Ray cluster on your machine and execute the script.

ray start --head
python main.py

HTTP Adapters

Ray Serve provides a suite of adapters to convert HTTP requests to ML inputs like numpy arrays. You can use them with the Ray AI Runtime (AIR) model wrapper feature to deploy pre-trained models in a single step. Alternatively, you can import them directly and use them in your FastAPI app.

For example, we provide a simple adapter for n-dimensional arrays.

With model wrappers, you can specify the adapter via the input_schema field.

from ray import serve
from ray.serve.http_adapters import json_to_ndarray
from ray.serve.model_wrappers import ModelWrapperDeployment

ModelWrapperDeployment.options(name="my_model").deploy(
    my_ray_air_predictor,
    my_ray_air_checkpoint,
    input_schema=json_to_ndarray
)

You can also bring the adapter to your own FastAPI app using Depends. The input schema will automatically be part of the generated OpenAPI schema with FastAPI.

from fastapi import FastAPI, Depends
from ray.serve.http_adapters import json_to_ndarray

app = FastAPI()

@app.post("/endpoint")
async def endpoint(np_array = Depends(json_to_ndarray)):
    ...

The json_to_ndarray adapter expects input with the following schema:

pydantic model ray.serve.http_adapters.NdArray

Schema for numeric array input.

JSON schema:
{
   "title": "NdArray",
   "description": "Schema for numeric array input.",
   "type": "object",
   "properties": {
      "array": {
         "title": "Array",
         "description": "The array content as a nested list. You can pass in 1D to 4D array as nested list, or flatten them. When you flatten the array, you can use the `shape` parameter to perform reshaping.",
         "anyOf": [
            {
               "type": "array",
               "items": {
                  "type": "number"
               }
            },
            {
               "type": "array",
               "items": {
                  "type": "array",
                  "items": {
                     "type": "number"
                  }
               }
            },
            {
               "type": "array",
               "items": {
                  "type": "array",
                  "items": {
                     "type": "array",
                     "items": {
                        "type": "number"
                     }
                  }
               }
            },
            {
               "type": "array",
               "items": {
                  "type": "array",
                  "items": {
                     "type": "array",
                     "items": {
                        "type": "array",
                        "items": {
                           "type": "number"
                        }
                     }
                  }
               }
            }
         ]
      },
      "shape": {
         "title": "Shape",
         "description": "The shape of the array. If present, the array will be reshaped.",
         "type": "array",
         "items": {
            "type": "integer"
         }
      },
      "dtype": {
         "title": "Dtype",
         "description": "The numpy dtype of the array. If present, the array will be cast by `astype`.",
         "type": "string"
      }
   },
   "required": [
      "array"
   ]
}

Fields
field array: Union[List[float], List[List[float]], List[List[List[float]]], List[List[List[List[float]]]]] [Required]

The array content as a nested list. You can pass in 1D to 4D array as nested list, or flatten them. When you flatten the array, you can use the shape parameter to perform reshaping.

field dtype: Optional[str] = None

The numpy dtype of the array. If present, the array will be cast by astype.

field shape: Optional[List[int]] = None

The shape of the array. If present, the array will be reshaped.
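To make the schema concrete, the sketch below builds two request bodies that describe the same 2x2 array, one nested and one flattened with shape and dtype, and converts both with NumPy. The to_ndarray helper is a hypothetical stand-in that simply follows the field descriptions above (reshape if shape is present, astype if dtype is present). It is not the library's implementation.

```python
import json

import numpy as np

# Two bodies that satisfy the NdArray schema and describe the same data.
nested_body = {"array": [[1, 2], [3, 4]]}
flat_body = {"array": [1, 2, 3, 4], "shape": [2, 2], "dtype": "float32"}


def to_ndarray(body: dict) -> np.ndarray:
    """Hypothetical stand-in for the adapter, following the field descriptions."""
    arr = np.array(body["array"])
    if body.get("shape") is not None:
        arr = arr.reshape(body["shape"])
    if body.get("dtype") is not None:
        arr = arr.astype(body["dtype"])
    return arr


# Either dict can be serialized and sent as the JSON body of a POST request.
request_payload = json.dumps(flat_body)
```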

The following adapters are available. Contributions of additional adapters are welcome!

ray.serve.http_adapters.json_to_ndarray(payload: ray.serve.http_adapters.NdArray) → numpy.ndarray

Accepts an NdArray JSON from an HTTP body and converts it to a numpy array.

ray.serve.http_adapters.image_to_ndarray(img: bytes = File(Ellipsis)) → numpy.ndarray

Accepts a PIL-readable file from an HTTP form and converts it to a numpy array.

Configuring HTTP Server Locations

By default, Ray Serve starts a single HTTP server on the head node of the Ray cluster. You can configure this behavior using the http_options={"location": ...} flag in serve.start:

  • “HeadOnly”: start one HTTP server on the head node. Serve assumes the head node is the node you executed serve.start on. This is the default.

  • “EveryNode”: start one HTTP server per node.

  • “NoServer” or None: disable HTTP server.

Note

Using the “EveryNode” option, you can point a cloud load balancer to the instance group of the Ray cluster to achieve high availability of Serve’s HTTP proxies.
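A minimal sketch of selecting the proxy location: the options dict below requests one HTTP server per node. The host, port, and namespace values are assumptions chosen for illustration (binding to all interfaces so that an external load balancer can reach each proxy), and start_serve is a hypothetical helper that expects an already running Ray cluster.

```python
# One HTTP proxy per node; host and port are illustrative values.
HTTP_OPTIONS = {
    "location": "EveryNode",
    "host": "0.0.0.0",
    "port": 8000,
}


def start_serve():
    """Hypothetical helper: start Serve with the options above."""
    # Imported here so the options can be inspected without a cluster.
    import ray
    from ray import serve

    # Connect to the existing cluster; the namespace name is an assumption.
    ray.init(address="auto", namespace="serve")
    return serve.start(detached=True, http_options=HTTP_OPTIONS)
```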

Enabling CORS and other HTTP middlewares

Serve supports arbitrary Starlette middlewares and custom middlewares in Starlette format. The example below shows how to enable Cross-Origin Resource Sharing (CORS). You can follow the same pattern for other Starlette middlewares.

from ray import serve
from starlette.middleware import Middleware
from starlette.middleware.cors import CORSMiddleware

client = serve.start(
    http_options={"middlewares": [
        Middleware(
            CORSMiddleware, allow_origins=["*"], allow_methods=["*"])
    ]})

ServeHandle: Calling Deployments from Python

Ray Serve enables you to query models both from HTTP and Python. This feature enables seamless model composition. You can get a ServeHandle corresponding to a deployment, similar to how you can reach a deployment through HTTP via a specific route. When you issue a request to a deployment through a ServeHandle, the request is load balanced across available replicas in the same way an HTTP request is.

To call a Ray Serve deployment from Python, use Deployment.get_handle to get a handle to the deployment, then use handle.remote to send requests to that deployment. These requests can carry ordinary args and kwargs, which are passed directly to the method. The call returns a Ray ObjectRef whose result can be waited for with ray.wait or retrieved with ray.get.

import ray
from ray import serve

@serve.deployment
class Deployment:
    def method1(self, arg):
        return f"Method1: {arg}"

    def __call__(self, arg):
        return f"__call__: {arg}"

Deployment.deploy()

handle = Deployment.get_handle()
ray.get(handle.remote("hi")) # Defaults to calling the __call__ method.
ray.get(handle.method1.remote("hi")) # Call a different method.

If you want to use the same deployment to serve both HTTP and ServeHandle traffic, the recommended best practice is to define an internal method that the HTTP handling logic will call:

@serve.deployment(route_prefix="/api")
class Deployment:
    def say_hello(self, name: str):
        return f"Hello {name}!"

    def __call__(self, request):
        return self.say_hello(request.query_params["name"])

Deployment.deploy()

Now we can invoke the same logic from either HTTP or Python:

import requests

print(requests.get("http://localhost:8000/api?name=Alice").text)
# Hello Alice!

handle = Deployment.get_handle()
print(ray.get(handle.say_hello.remote("Alice")))
# Hello Alice!

Sync and Async Handles

Ray Serve offers two types of ServeHandle. You can use the Deployment.get_handle(..., sync=True|False) flag to toggle between them.

  • When you set sync=True (the default), a synchronous handle is returned. Calling handle.remote() returns a Ray ObjectRef.

  • When you set sync=False, an asyncio-based handle is returned. You need to call it with await handle.remote() to get a Ray ObjectRef. To use await, you must run Deployment.get_handle and handle.remote inside a Python asyncio event loop.

The async handle has a performance advantage because it uses asyncio directly, whereas the sync handle talks to an asyncio event loop running in a thread. To learn more about the reasoning behind this design, check out our architecture documentation.
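As a sketch of the async pattern: the coroutine below obtains an async handle, awaits handle.remote() to get the ObjectRef, and then awaits the ObjectRef itself for the result. The query_async name is a hypothetical helper, and the sketch assumes it is given an already deployed Serve deployment and is run inside an asyncio event loop.

```python
import asyncio


async def query_async(deployment, *args, **kwargs):
    """Hypothetical helper: call a deployment through its async handle."""
    # sync=False selects the asyncio-based handle.
    handle = deployment.get_handle(sync=False)
    # First await: the request is submitted and an ObjectRef comes back.
    ref = await handle.remote(*args, **kwargs)
    # Second await: wait for the result without blocking the event loop.
    return await ref
```

Inside another coroutine this could be used as result = await query_async(Deployment, "hi"), where Deployment is a Serve deployment that has already been deployed.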

Integrating with existing web servers

Ray Serve comes with its own HTTP server out of the box, but if you have an existing web application, you can still plug in Ray Serve to scale up your compute using the ServeHandle. For a tutorial with sample code, see Integration with Existing Web Servers.