Serve Architecture

This document provides an overview of how each component in Serve works.


High-Level View

Serve runs on Ray and utilizes Ray actors.

There are three kinds of actors that are created to make up a Serve instance:

  • Controller: A global actor unique to each Serve instance that manages the control plane. The Controller is responsible for creating, updating, and destroying other actors. Serve API calls like client.create_backend and client.create_endpoint make remote calls to the Controller (see the sketch after this list).

  • Router: There is one router per node. Each router is a Uvicorn HTTP server that accepts incoming requests, forwards them to replicas, and responds once they are completed.

  • Worker Replica: Worker replicas actually execute the code in response to a request. For example, they may contain an instantiation of an ML model. Each replica processes individual requests or batches of requests from the routers.
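
To make these roles concrete, here is a minimal sketch of the client API mentioned above; the backend and endpoint names are illustrative, and the exact shape of the request object varies across Serve versions.

    from ray import serve

    client = serve.start()  # starts the Controller and per-node routers

    # A backend holds the request-handling code (e.g. an ML model).
    def echo(request):
        return "hello world"

    # These calls are remote calls to the Controller actor.
    client.create_backend("echo_backend", echo)
    client.create_endpoint("echo_endpoint", backend="echo_backend", route="/echo")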

Lifetime of a Request

When an HTTP request is sent to the router, the following things happen:

  • The HTTP request is received and parsed.

  • The endpoint associated with the HTTP URL path is looked up.

  • One or more backends are selected to handle the request according to the traffic splitting and shadow testing rules (see the sketch after this list). The requests for each backend are placed on a queue.

  • For each request in a backend queue, an available replica is looked up and the request is sent to it. If there are no available replicas (each replica already has max_concurrent_queries requests outstanding), the request is left in the queue until an outstanding request finishes.
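
As a hedged sketch of the traffic splitting and shadow testing step, reusing the client from the earlier sketch; set_traffic and shadow_traffic are calls from the same client API generation, and the backend names are illustrative:

    def model_v1(request):
        return "v1"

    def model_v2(request):
        return "v2"

    def model_candidate(request):
        return "candidate"

    client.create_backend("model_v1", model_v1)
    client.create_backend("model_v2", model_v2)
    client.create_endpoint("predict", backend="model_v1", route="/predict")

    # Split traffic 90/10 between the two backends of this endpoint.
    client.set_traffic("predict", {"model_v1": 0.9, "model_v2": 0.1})

    # Shadow testing: the candidate receives a copy of half the requests,
    # but its responses are discarded rather than returned to callers.
    client.create_backend("model_candidate", model_candidate)
    client.shadow_traffic("predict", "model_candidate", 0.5)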

Each replica maintains a queue of requests and processes one batch of requests at a time. By default the batch size is 1; you can increase the batch size to increase throughput. If the handler (the function for the backend or __call__) is async, the replica will not wait for the handler to run; otherwise, the replica blocks until the handler returns.
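
A hedged sketch of a batched backend follows; the accept_batch decorator and the max_batch_size config key are assumptions based on the same client API generation and may differ between Serve versions.

    from ray import serve

    @serve.accept_batch  # the handler now receives a list of requests
    def batch_handler(requests):
        # Exactly one result must be returned per request in the batch.
        return ["processed"] * len(requests)

    client.create_backend(
        "batched_backend",
        batch_handler,
        config={"max_batch_size": 4},  # assumed config key; default batch size is 1
    )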


How does Serve handle fault tolerance?

Application errors, like exceptions in your model evaluation code, are caught and wrapped. A 500 status code is returned with the traceback information, and the replica continues to handle requests.
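
As an illustration (the failing backend below is hypothetical), an application error surfaces as an HTTP 500 while the replica stays alive:

    import requests

    def failing_model(request):
        raise ValueError("bad input")  # application error inside the replica

    client.create_backend("failing_backend", failing_model)
    client.create_endpoint("failing_endpoint", backend="failing_backend", route="/failing")

    resp = requests.get("http://127.0.0.1:8000/failing")  # default Serve port
    assert resp.status_code == 500  # body contains the traceback
    # The replica is still alive and will serve subsequent requests.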

Machine errors and faults are handled by Ray. Serve utilizes Ray's actor reconstruction capability: when a machine hosting any of the actors crashes, those actors are automatically restarted on another available machine. All data in the Controller (routing policies, backend configurations, etc.) is checkpointed to Ray. Transient data in the router and the replica (like network connections and internal request queues) is lost upon failure.
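
Serve creates and supervises its actors internally; the underlying Ray primitive it relies on looks roughly like this sketch (the actor below is a stand-in, not Serve's real Controller):

    import ray

    # max_restarts asks Ray to restart the actor after a crash, possibly
    # on another machine; -1 means unlimited restarts.
    @ray.remote(max_restarts=-1)
    class DurableActor:
        def __init__(self):
            self.state = {}  # would be restored from a checkpoint on restart

        def ping(self):
            return "alive"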

How does Serve ensure horizontal scalability and availability?

Serve starts one router per node. Each router binds to the same port. You should be able to reach Serve and send requests to any model via any of the servers.

This architecture ensures horizontal scalability for Serve. You can scale the router by adding more nodes and scale the model by increasing the number of replicas.
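
Both scaling knobs in one hedged sketch, assuming the update_backend_config call from the same client API and placeholder node addresses:

    import requests

    # Scale the model: ask the Controller for more replicas of a backend.
    client.update_backend_config("echo_backend", {"num_replicas": 10})

    # Scale ingress: every node's router binds the same port, so any
    # node address works (hosts below are placeholders).
    for host in ["10.0.0.1", "10.0.0.2"]:
        print(requests.get(f"http://{host}:8000/echo").text)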

How do ServeHandles work?

ServeHandles wrap a handle to the router actor on the same node. When a request is sent from one replica to another via a handle, it goes through the same data path as incoming HTTP requests. This enables the same backend selection and batching procedures to apply. ServeHandles are often used to implement model composition, as sketched below.
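
A minimal sketch of handle usage and composition; serve.connect and get_handle follow the client API used above, and the endpoint names are illustrative.

    import ray
    from ray import serve

    client = serve.connect()  # attach to a running Serve instance
    handle = client.get_handle("echo_endpoint")  # wraps the local router

    # Takes the same data path through the router as an HTTP request.
    result = ray.get(handle.remote())

    # Model composition: one backend calls another through a handle.
    class ComposedModel:
        def __init__(self):
            self.downstream = serve.connect().get_handle("echo_endpoint")

        async def __call__(self, request):
            # Awaiting the ObjectRef avoids blocking the replica.
            return await self.downstream.remote()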

What happens to large requests?

Serve utilizes Ray's shared-memory object store and in-process memory store. Small request objects are sent directly between actors via a network call. Larger request objects (100KiB+) are written to the distributed shared-memory store, and the replica can read them via a zero-copy read.
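
This mirrors Ray's general object-transfer behavior, which can be observed directly; the 100KiB threshold is the one stated above.

    import numpy as np
    import ray

    ray.init()

    # A large payload (~1 MiB) goes into the shared-memory object store.
    payload = np.zeros(1024 * 1024, dtype=np.uint8)
    ref = ray.put(payload)

    # Reading it back yields a zero-copy, read-only view over shared memory.
    view = ray.get(ref)
    assert not view.flags.writeable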