Ray Serve FAQ

This page answers some common questions about Ray Serve. If you have more questions, feel free to ask them in the Discussion Board.

How do I deploy Serve?

See Deploying Ray Serve for information about how to deploy Serve.

How do I delete backends and endpoints?

To delete a backend, use client.delete_backend. Note that a backend must not be in use by any endpoints in order to be deleted. Once a backend is deleted, its tag can be reused.


To delete an endpoint, use client.delete_endpoint. The endpoint will no longer serve traffic and will return a 404 when queried. Once an endpoint is deleted, its tag can be reused.
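
Putting the two rules together, tearing down a model cleanly means removing the endpoint before the backend it routes to. A minimal sketch (the helper name and tags are hypothetical):

```python
def delete_in_order(client, endpoint_tag, backend_tag):
    # The endpoint must go first: a backend cannot be deleted while an
    # endpoint still routes traffic to it.
    client.delete_endpoint(endpoint_tag)  # queries to this endpoint now 404
    client.delete_backend(backend_tag)    # both tags are free to be reused

# With a live Serve client:
# delete_in_order(client, "api_endpoint", "api_backend")
```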


How do I call an endpoint from Python code?

Use client.get_handle to get a handle to the endpoint, then use handle.remote to send requests to that endpoint. This returns a Ray ObjectRef whose result can be waited for or retrieved using ray.wait or ray.get, respectively.

handle = client.get_handle("api_endpoint")
result = ray.get(handle.remote("input data"))

How do I call a method on my replica besides __call__?

To call a method via HTTP, use the header field X-SERVE-CALL-METHOD.
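
For example, assuming Serve is listening on 127.0.0.1:8000 and the endpoint is exposed at the route /endpoint_name (both hypothetical), the header can be attached to an ordinary HTTP request:

```python
def method_call_headers(method_name):
    # Serve reads the target replica method from this header field
    # instead of dispatching to __call__.
    return {"X-SERVE-CALL-METHOD": method_name}

# import requests
# requests.get("http://127.0.0.1:8000/endpoint_name",
#              headers=method_call_headers("other_method"))
```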

To call a method via Python, use handle.options:

class StatefulProcessor:
    def __init__(self):
        self.count = 1

    def __call__(self, request):
        return {"current": self.count}

    def other_method(self, inc):
        self.count += inc
        return True

handle = client.get_handle("endpoint_name")
handle.options(method_name="other_method").remote(5)

The call behaves the same as a regular query, except that a different method is invoked within the replica. Method calls are compatible with batching as well.

How do I enable CORS and other HTTP features?

Serve supports arbitrary Starlette middlewares and custom middlewares in Starlette format. The example below shows how to enable Cross-Origin Resource Sharing (CORS). You can follow the same pattern for other Starlette middlewares.


Serve does not list Starlette as one of its dependencies. To use this feature, first install it:

pip install starlette

from starlette.middleware import Middleware
from starlette.middleware.cors import CORSMiddleware

from ray import serve

client = serve.start(
    http_options={"middlewares": [
        Middleware(
            CORSMiddleware, allow_origins=["*"], allow_methods=["*"])
    ]})

How do ServeHandle and ServeRequest work?

Ray Serve enables you to query models both from HTTP and from Python. This feature enables seamless model composition. You can get a ServeHandle corresponding to an endpoint, similar to how you can reach an endpoint through HTTP via a specific route. When you issue a request to an endpoint through a ServeHandle, the request goes through the same code path as an HTTP request would: choosing backends through traffic policies, finding the next available replica, and batching requests together.

When the request arrives at the model, you can access the data similarly to how you would with an HTTP request. Here are some examples of how ServeRequest mirrors Starlette.Request:



(Starlette.Request and ServeRequest)

HTTP                                    ServeHandle                           Request API
requests.get(..., headers={...})        handle.options(http_headers={...})    request.headers
requests.get(..., json={...})           handle.remote({...})                  await request.json()
requests.get(..., form={...})           handle.remote({...})                  await request.form()
requests.get(..., params={"a":"b"})     handle.remote(a="b")                  request.query_params
requests.get(..., data="long string")   handle.remote("long string")          await request.body()

You might have noticed that the last row of the table shows that ServeRequest supports passing Python objects through the handle. This is not possible over HTTP. If you need to distinguish whether a request originated from Python or from HTTP, you can do an isinstance check:

import starlette.requests

if isinstance(request, starlette.requests.Request):
    print("Request coming from web!")
elif isinstance(request, ServeRequest):  # ServeRequest is provided by Ray Serve
    print("Request coming from Python!")


One special case is when you pass a web request to a handle. In this case, Serve will not wrap it in a ServeRequest; you can process it directly as a starlette.requests.Request.

How fast is Ray Serve?

We are continuously benchmarking Ray Serve. We can confidently say:

  • Ray Serve’s latency overhead is single-digit milliseconds, often just 1-2 milliseconds.

  • For throughput, Serve achieves about 3-4k queries per second on a single machine.

  • It is horizontally scalable so you can add more machines to increase the overall throughput.

You can check out our microbenchmark instructions to benchmark Serve on your own hardware.

Can I use asyncio along with Ray Serve?

Yes! You can make your servable methods async def and Serve will run them concurrently inside a Python asyncio event loop.
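
As a sketch, the backend below (the class name is hypothetical) awaits an async call inside __call__; since it is plain asyncio, the same coroutine can also be driven outside Serve:

```python
import asyncio

class AsyncBackend:
    async def __call__(self, request):
        # While this coroutine awaits I/O, the replica's event loop is free
        # to start handling other queued requests concurrently.
        await asyncio.sleep(0.01)  # stand-in for an async I/O call
        return "done"

# Outside Serve, the coroutine can be run directly:
result = asyncio.run(AsyncBackend()(None))  # → "done"
```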

Are there any other similar frameworks?

Yes and no. We truly believe Serve is unique in that it gives you end-to-end control over the API while delivering scalability and high performance. To achieve something like what Serve offers, you often need to glue together multiple frameworks like TensorFlow Serving and SageMaker, or even roll your own batching server.

How does Serve compare to TFServing, TorchServe, ONNXRuntime, and others?

Ray Serve is framework-agnostic: you can use any Python framework or library. We believe data scientists are not bound to a particular machine learning framework. They use the best tool available for the job.

Compared to these framework-specific solutions, Ray Serve doesn’t perform any optimizations to make your ML model run faster. However, you can still optimize the models yourself and run them in Ray Serve: for example, you can run a model compiled by PyTorch JIT.

How does Serve compare to AWS SageMaker, Azure ML, Google AI Platform?

Ray Serve brings the scalability and parallelism of these hosted offerings to your own infrastructure. You can use our cluster launcher to deploy Ray Serve to all major public clouds and K8s, as well as to bare-metal, on-premise machines.

Compared to these offerings, Ray Serve lacks a unified user interface and the functionality to manage the lifecycle of your models, visualize their performance, etc. Ray Serve focuses on just model serving and provides the primitives for you to build your own ML platform on top.

How does Serve compare to Seldon, KFServing, Cortex?

You can develop Ray Serve on your laptop, deploy it on a dev box, and scale it out to multiple machines or a K8s cluster without changing one line of code. It’s a lot easier to get started with when you don’t need to provision and manage a K8s cluster. When it’s time to deploy, you can use the Ray cluster launcher to transparently put your Ray Serve application in K8s.

Compared to these frameworks, which let you deploy ML models on K8s, Ray Serve lacks the ability to declaratively configure your ML application via YAML files. In Ray Serve, you configure everything in Python code.

How does Ray Serve scale to handle spikes in load?

You can easily scale your models just by changing the number of replicas in the BackendConfig. Ray Serve also has an experimental autoscaler that scales up your model replicas based on load. It is still improving, and we welcome any feedback! We also rely on the Ray cluster launcher for adding more machines.
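
Scaling up before an anticipated spike is then a one-line config change. A minimal sketch, assuming a Serve client exposing the update_backend_config call (the helper name and backend tag are hypothetical):

```python
def scale_backend(client, backend_tag, num_replicas):
    # Serve compares the new num_replicas against the running replica count
    # and starts or stops replica actors to match it.
    client.update_backend_config(backend_tag, {"num_replicas": num_replicas})

# With a live Serve client, e.g. before an anticipated traffic spike:
# scale_backend(client, "my_backend", 10)
```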

Is Ray Serve only for ML models?

Nope! Ray Serve can be used to build any type of Python microservices application. You can also use the full power of Ray within your Ray Serve programs, so it’s easy to run parallel computations within your backends.