Calling Endpoints via HTTP and ServeHandle


Ray Serve endpoints can be called in two ways: from HTTP and from Python. On this page we will show you both of these approaches and then give a tutorial on how to integrate Ray Serve with an existing web server.

Calling Endpoints via HTTP

As described in the End-to-End Tutorial, when you create a Ray Serve endpoint, to serve it over HTTP you just need to specify the route parameter to serve.create_endpoint:

serve.create_endpoint("my_endpoint", backend="my_backend", route="/counter")

Below, we discuss some advanced features for customizing Ray Serve’s HTTP functionality:

Configuring HTTP Server Locations

By default, Ray Serve starts a single HTTP server on the head node of the Ray cluster. You can configure this behavior using the http_options={"location": ...} flag in serve.start:

  • “HeadOnly”: start one HTTP server on the head node. Serve assumes the head node is the node you executed serve.start on. This is the default.

  • “EveryNode”: start one HTTP server per node.

  • “NoServer” or None: disable HTTP server.


Using the “EveryNode” option, you can point a cloud load balancer at the instance group of the Ray cluster to achieve high availability of Serve’s HTTP proxies.
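As a minimal sketch, starting Serve with a proxy on every node looks like this (assuming Ray has already been initialized and connected to the cluster):

```python
from ray import serve

# Start one HTTP proxy per node so an external load balancer can
# spread traffic across the whole Ray cluster.
serve.start(http_options={"location": "EveryNode"})
```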

Custom HTTP response status codes

You can return a Starlette Response object from your Ray Serve backend code:

from ray import serve
from starlette.responses import Response

def f(starlette_request):
    return Response('Hello, world!', status_code=123, media_type='text/plain')

serve.create_backend("hello", f)

Enabling CORS and other HTTP middlewares

Serve supports arbitrary Starlette middlewares and custom middlewares in Starlette format. The example below shows how to enable Cross-Origin Resource Sharing (CORS). You can follow the same pattern for other Starlette middlewares.

from starlette.middleware import Middleware
from starlette.middleware.cors import CORSMiddleware

client = serve.start(
    http_options={"middlewares": [
        Middleware(CORSMiddleware, allow_origins=["*"], allow_methods=["*"])
    ]})

ServeHandle: Calling Endpoints from Python

Ray Serve enables you to query models both from HTTP and Python. This feature enables seamless model composition. You can get a ServeHandle corresponding to an endpoint, similar to how you can reach an endpoint through HTTP via a specific route. When you issue a request to an endpoint through ServeHandle, the request goes through the same code path as an HTTP request would: choosing backends through traffic policies and load balancing across available replicas.

To call a Ray Serve endpoint from Python, use serve.get_handle to get a handle to the endpoint, then use handle.remote to send requests to that endpoint. This returns a Ray ObjectRef, whose result can be waited for or retrieved using ray.wait or ray.get, respectively.

handle = serve.get_handle("api_endpoint")
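Putting the pieces together, a minimal sketch looks like the following (assuming an endpoint named "api_endpoint" already exists and its backend accepts a string argument):

```python
import ray
from ray import serve

# Get a handle to an existing endpoint.
handle = serve.get_handle("api_endpoint")

# handle.remote() returns immediately with a Ray ObjectRef.
object_ref = handle.remote("some input")

# Block until the backend has produced a result.
result = ray.get(object_ref)
```

ray.wait([object_ref]) can be used instead of ray.get when you want to poll for completion without blocking on the full result.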

Accessing data from the request

When the request arrives in the model, you can access the data similarly to how you would with an HTTP request. Here are some examples of how Ray Serve’s built-in ServeRequest mirrors starlette.requests.Request:

Calling the endpoint                     Accessing the data in the backend
                                         (starlette.requests.Request and ServeRequest)
---------------------------------------  ---------------------------------------------
requests.get(..., headers={...})         request.headers
requests.get(..., json={...})            await request.json()
requests.get(..., form={...})            await request.form()
requests.get(..., params={"a": "b"})     request.query_params
requests.get(..., data="long string")    await request.body()
handle.remote("long string")             await request.body()
You might have noticed that the last row of the table shows that ServeRequest supports passing Python objects through the handle. This is not possible over HTTP. If you need to distinguish whether a request originated from Python or from HTTP, you can do an isinstance check:

import starlette.requests

if isinstance(request, starlette.requests.Request):
    print("Request coming from web!")
elif isinstance(request, ServeRequest):
    print("Request coming from Python!")


One special case is when you pass a web request to a handle. In this case, Serve will not wrap it in ServeRequest; you can process it directly as a starlette.requests.Request.
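For instance, a backend that forwards its incoming web request to another endpoint might look like the following sketch (the Forwarder class and the "other_endpoint" name are hypothetical):

```python
from ray import serve

class Forwarder:
    def __init__(self):
        # Handle to another, already-created endpoint.
        self.downstream = serve.get_handle("other_endpoint")

    async def __call__(self, request):
        # `request` is a starlette.requests.Request here; Serve passes
        # it through the handle unwrapped rather than wrapping it in
        # ServeRequest, so the downstream backend can process it directly.
        return await self.downstream.remote(request)
```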

Sync and Async Handles

Ray Serve offers two types of ServeHandle. You can use the serve.get_handle(..., sync=True|False) flag to toggle between them.

  • When you set sync=True (the default), a synchronous handle is returned. Calling handle.remote() returns a Ray ObjectRef.

  • When you set sync=False, an asyncio-based handle is returned. You need to call it with await handle.remote(), which returns a Ray ObjectRef. To use await, you have to run serve.get_handle and handle.remote inside a Python asyncio event loop.

The async handle has a performance advantage because it uses asyncio directly, whereas the sync handle talks to an asyncio event loop in a thread. To learn more about the reasoning behind this design, check out our architecture documentation.
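As a sketch, using the async variant looks like this (assuming an endpoint named "api_endpoint" and a running asyncio event loop):

```python
import asyncio
from ray import serve

async def query():
    # sync=False returns an asyncio-native handle.
    handle = serve.get_handle("api_endpoint", sync=False)
    # Awaiting remote() yields a Ray ObjectRef...
    object_ref = await handle.remote("some input")
    # ...and awaiting the ObjectRef yields the actual result.
    return await object_ref

# Run inside an event loop, e.g.:
# result = asyncio.get_event_loop().run_until_complete(query())
```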

Calling methods on a Serve backend besides __call__

By default, Ray Serve will serve the user-defined __call__ method of your class, but other methods of your class can be served as well.

To call a custom method via HTTP, pass in the method name in the header field X-SERVE-CALL-METHOD.
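For example, the header can be set like this (the /endpoint route and other_method name are placeholders):

```python
# The header Ray Serve inspects to dispatch the request to a method
# other than __call__:
headers = {"X-SERVE-CALL-METHOD": "other_method"}

# With the `requests` library, assuming the backend is served at
# route="/endpoint" on Serve's default HTTP address:
#   import requests
#   requests.get("http://127.0.0.1:8000/endpoint", headers=headers)
```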

To call a custom method via Python, use handle.options:

class StatefulProcessor:
    def __init__(self):
        self.count = 1

    def __call__(self, request):
        return {"current": self.count}

    def other_method(self, inc):
        self.count += inc
        return True

handle = serve.get_handle("endpoint_name")
handle.options(method_name="other_method").remote(5)

The call is the same as a regular query, except that a different method is called within the replica.

Integrating with existing web servers

Ray Serve comes with its own HTTP server out of the box, but if you have an existing web application, you can still plug in Ray Serve to scale up your backend computation.

Using ServeHandle makes this easy. For a tutorial with sample code, see Integration with Existing Web Servers.