Serve API Reference
Start or Connect to a Cluster
ray.serve.start(detached: bool = False, http_host: Optional[str] = '127.0.0.1', http_port: int = 8000, http_middlewares: List[Any] = [], http_options: Union[dict, ray.serve.config.HTTPOptions, None] = None) → ray.serve.api.Client
Initialize a Serve instance.
By default, the instance is scoped to the lifetime of the returned Client object (or until the script exits). If detached is set to True, the instance instead persists until client.shutdown() is called, and clients can connect to it using serve.connect(). This is only relevant when connecting to a long-running Ray cluster (e.g., with address="auto").
- Parameters
detached (bool) – Whether or not the instance should be detached from this script.
http_host (Optional[str]) – Deprecated, use http_options instead.
http_port (int) – Deprecated, use http_options instead.
http_middlewares (list) – Deprecated, use http_options instead.
http_options (Optional[Dict, serve.HTTPOptions]) – Configuration options for the HTTP proxy. You can pass in a dictionary or an HTTPOptions object with the following fields:
host (str, None): Host for HTTP servers to listen on. Defaults to "127.0.0.1". To expose Serve publicly, you probably want to set this to "0.0.0.0".
port (int): Port for the HTTP server. Defaults to 8000.
middlewares (list): A list of Starlette middlewares that will be applied to the HTTP servers in the cluster.
location (str, serve.config.DeploymentMode): The deployment location of HTTP servers:
"HeadOnly": start one HTTP server on the head node. Serve assumes the head node is the node you executed serve.start on. This is the default.
"EveryNode": start one HTTP server per node.
"NoServer" or None: disable HTTP server.
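For illustration, the documented http_options fields and defaults can be modeled with a small helper. This is a sketch of the accepted fields only, not Serve's own validation code; the helper name is hypothetical.

```python
# Illustrative sketch of the http_options fields documented above.
# Not Serve's actual validation logic.
VALID_LOCATIONS = {"HeadOnly", "EveryNode", "NoServer", None}

def check_http_options(options: dict) -> dict:
    """Fill in the documented defaults and check the 'location' field."""
    opts = {
        "host": "127.0.0.1",
        "port": 8000,
        "middlewares": [],
        "location": "HeadOnly",
    }
    opts.update(options)
    if opts["location"] not in VALID_LOCATIONS:
        raise ValueError(f"Unknown location: {opts['location']!r}")
    return opts

# Exposing Serve publicly with one HTTP server per node:
public_opts = check_http_options({"host": "0.0.0.0", "location": "EveryNode"})
```

The resulting dictionary could then be passed as serve.start(http_options=...).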
ray.serve.connect() → ray.serve.api.Client
Connect to an existing Serve instance on this Ray cluster.
If calling from the driver program, the Serve instance on this Ray cluster must first have been initialized using serve.start(detached=True).
If called from within a backend, this will connect to the same Serve instance that the backend is running in.
Client API
class ray.serve.api.Client(controller: ray.actor.ActorHandle, controller_name: str, detached: bool = False)
shutdown() → None
Completely shut down the connected Serve instance.
Shuts down all processes and deletes all state associated with the instance.
create_endpoint(endpoint_name: str, *, backend: str = None, route: Optional[str] = None, methods: List[str] = ['GET']) → None
Create a service endpoint given route_expression.
- Parameters
endpoint_name (str) – A name to associate with the endpoint.
backend (str, required) – The backend that will serve requests to this endpoint. To change this or split traffic among backends, use serve.set_traffic.
route (str, optional) – A string beginning with "/". The HTTP server will use this string to match the request path.
methods (List[str], optional) – The HTTP methods that are valid for this endpoint.
delete_endpoint(endpoint: str) → None
Delete the given endpoint.
Does not delete any associated backends.
list_endpoints() → Dict[str, Dict[str, Any]]
Returns a dictionary of all registered endpoints.
The dictionary keys are endpoint names and values are dictionaries of the form: {"methods": List[str], "traffic": Dict[str, float]}.
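A return value might look like the following (the endpoint and backend names here are hypothetical):

```python
# Hypothetical example of the structure returned by list_endpoints().
endpoints = {
    "my_endpoint": {
        "methods": ["GET", "POST"],
        "traffic": {"backend:v1": 0.5, "backend:v2": 0.5},
    },
}

# The traffic policy maps backend tags to weights, which sum to 1.
total_weight = sum(endpoints["my_endpoint"]["traffic"].values())
```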
update_backend_config(backend_tag: str, config_options: Union[ray.serve.config.BackendConfig, Dict[str, Any]]) → None
Update a backend configuration for a backend tag.
Keys not specified in the passed config will be left unchanged.
- Parameters
backend_tag (str) – A registered backend.
config_options (dict, serve.BackendConfig) – Backend config options to update. Either a BackendConfig object or a dict mapping strings to values for the following supported options:
"num_replicas": number of processes to start up that will handle requests to this backend.
"max_batch_size": the maximum number of requests that will be processed in one batch by this backend.
"batch_wait_timeout": time in seconds that backend replicas will wait for a full batch of requests before processing a partial batch.
"max_concurrent_queries": the maximum number of queries that will be sent to a replica of this backend without receiving a response.
"user_config" (experimental): Arguments to pass to the reconfigure method of the backend. The reconfigure method is called if "user_config" is not None.
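The partial-update semantics ("keys not specified are left unchanged") can be sketched in plain Python. This is an illustration of the behavior, not Serve's implementation:

```python
# Illustration of update_backend_config's partial-update semantics.
# Not Serve's actual code.
current_config = {
    "num_replicas": 2,
    "max_batch_size": None,
    "batch_wait_timeout": 0,
    "max_concurrent_queries": None,
}

def apply_update(config: dict, update: dict) -> dict:
    """Return a new config in which only the given keys are overridden."""
    merged = dict(config)
    merged.update(update)
    return merged

# Only num_replicas changes; every other key keeps its previous value.
new_config = apply_update(current_config, {"num_replicas": 4})
```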
get_backend_config(backend_tag: str) → ray.serve.config.BackendConfig
Get the backend configuration for a backend tag.
- Parameters
backend_tag (str) – A registered backend.
create_backend(backend_tag: str, backend_def: Union[Callable, Type[Callable], str], *init_args: Any, ray_actor_options: Optional[Dict] = None, config: Union[ray.serve.config.BackendConfig, Dict[str, Any], None] = None, env: Optional[ray.serve.env.CondaEnv] = None) → None
Create a backend with the provided tag.
- Parameters
backend_tag (str) – a unique tag to identify this backend.
backend_def (callable, class, str) – a function or class implementing __call__ and returning a JSON-serializable object or a Starlette Response object. A string import path can also be provided (e.g., "my_module.MyClass"), in which case the underlying function or class will be imported dynamically in the worker replicas.
*init_args (optional) – the arguments to pass to the class initialization method. Not valid if backend_def is a function.
ray_actor_options (optional) – options to be passed into the @ray.remote decorator for the backend actor.
config (dict, serve.BackendConfig, optional) – configuration options for this backend. Either a BackendConfig, or a dictionary mapping strings to values for the following supported options:
"num_replicas": number of processes to start up that will handle requests to this backend.
"max_batch_size": the maximum number of requests that will be processed in one batch by this backend.
"batch_wait_timeout": time in seconds that backend replicas will wait for a full batch of requests before processing a partial batch.
"max_concurrent_queries": the maximum number of queries that will be sent to a replica of this backend without receiving a response.
"user_config" (experimental): Arguments to pass to the reconfigure method of the backend. The reconfigure method is called if "user_config" is not None.
env (serve.CondaEnv, optional) – conda environment to run this backend in. Requires the caller to be running in an activated conda environment (not necessarily env), and requires env to be an existing conda environment on all nodes. If env is not provided but conda is activated, the backend will run in the conda environment of the caller.
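A minimal backend sketch, showing the shape backend_def takes. The class and names are hypothetical, and the request argument is modeled here as a plain dict for illustration (in a real deployment it is the incoming request object):

```python
# Hypothetical backend: a class implementing __call__ that returns a
# JSON-serializable object, as described for backend_def above.
class GreetingBackend:
    def __init__(self, greeting: str):
        # *init_args passed to create_backend are forwarded here.
        self.greeting = greeting

    def __call__(self, request):
        # `request` is modeled as a dict for this standalone sketch.
        name = request.get("name", "world")
        return {"message": f"{self.greeting}, {name}!"}

# Registering it would look roughly like:
#   client.create_backend("greeter:v1", GreetingBackend, "Hello")
#   client.create_endpoint("greeter", backend="greeter:v1", route="/greet")

backend = GreetingBackend("Hello")
result = backend({"name": "Serve"})
```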
list_backends() → Dict[str, ray.serve.config.BackendConfig]
Returns a dictionary of all registered backends.
The dictionary maps backend tags to backend config objects.
delete_backend(backend_tag: str, force: bool = False) → None
Delete the given backend.
The backend must not currently be used by any endpoints.
- Parameters
backend_tag (str) – The backend tag to be deleted.
force (bool) – Whether or not to force the deletion without waiting for graceful shutdown. Defaults to False.
set_traffic(endpoint_name: str, traffic_policy_dictionary: Dict[str, float]) → None
Associate a service endpoint with a traffic policy.
Example:
>>> serve.set_traffic("service-name", {
...     "backend:v1": 0.5,
...     "backend:v2": 0.5,
... })
- Parameters
endpoint_name (str) – A registered service endpoint.
traffic_policy_dictionary (dict) – a dictionary mapping backend names to their traffic weights. The weights must sum to 1.
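The constraint on traffic_policy_dictionary can be checked up front with a small helper. The helper is illustrative and not part of the Serve API:

```python
import math

# Illustrative helper (not part of the Serve API): validate a traffic
# policy before passing it to set_traffic.
def validate_traffic_policy(policy: dict) -> None:
    if any(w < 0 for w in policy.values()):
        raise ValueError("Traffic weights must be non-negative.")
    if not math.isclose(sum(policy.values()), 1.0):
        raise ValueError("Traffic weights must sum to 1.")

validate_traffic_policy({"backend:v1": 0.5, "backend:v2": 0.5})
```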
shadow_traffic(endpoint_name: str, backend_tag: str, proportion: float) → None
Shadow traffic from an endpoint to a backend.
The specified proportion of requests will be duplicated and sent to the backend. Responses to the duplicated traffic will be ignored. The backend must not already be in use.
To stop shadowing traffic to a backend, call shadow_traffic with proportion equal to 0.
- Parameters
endpoint_name (str) – A registered service endpoint.
backend_tag (str) – A registered backend.
proportion (float) – The proportion of traffic from 0 to 1.
get_handle(endpoint_name: str, missing_ok: Optional[bool] = False, sync: bool = True) → Union[ray.serve.handle.RayServeHandle, ray.serve.handle.RayServeSyncHandle]
Retrieve a RayServeHandle for a service endpoint, to invoke it from Python.
- Parameters
endpoint_name (str) – A registered service endpoint.
missing_ok (bool) – If True, Serve won't check that the endpoint is registered. Defaults to False.
sync (bool) – If True, Serve will return a ServeHandle that works everywhere. Otherwise, Serve will return a ServeHandle that's only usable in an asyncio event loop.
- Returns
RayServeHandle
Backend Configuration
class ray.serve.BackendConfig
Configuration options for a backend, to be set by the user.
- Parameters
num_replicas (Optional[int]) – The number of processes to start up that will handle requests to this backend. Defaults to 0.
max_batch_size (Optional[int]) – The maximum number of requests that will be processed in one batch by this backend. Defaults to None (no maximum).
batch_wait_timeout (Optional[float]) – The time in seconds that backend replicas will wait for a full batch of requests before processing a partial batch. Defaults to 0.
max_concurrent_queries (Optional[int]) – The maximum number of queries that will be sent to a replica of this backend without receiving a response. Defaults to None (no maximum).
user_config (Optional[Any]) – Arguments to pass to the reconfigure method of the backend. The reconfigure method is called if user_config is not None.
experimental_graceful_shutdown_wait_loop_s (Optional[float]) – Duration that backend workers will wait until there is no more work to be done before shutting down. Defaults to 2s.
experimental_graceful_shutdown_timeout_s (Optional[float]) – Controller waits for this duration to forcefully kill the replica for shutdown. Defaults to 20s.
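How max_batch_size and batch_wait_timeout interact can be sketched with a simplified model. This is an illustration of the documented semantics, not Serve's actual scheduler:

```python
# Simplified model of Serve's batching parameters; not the real scheduler.
# Queued requests are dispatched in batches of at most max_batch_size; a
# partial batch is the one that waits up to batch_wait_timeout seconds
# before being processed anyway.
def form_batches(queue: list, max_batch_size: int) -> list:
    """Split a queue of requests into batches of at most max_batch_size."""
    return [queue[i:i + max_batch_size]
            for i in range(0, len(queue), max_batch_size)]

batches = form_batches(list(range(10)), max_batch_size=4)
# → batches of sizes 4, 4, and 2; the trailing partial batch is the one
# governed by batch_wait_timeout.
```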
ServeHandle API
class ray.serve.handle.RayServeHandle(router, endpoint_name, handle_options: Optional[ray.serve.handle.HandleOptions] = None)
A handle to a service endpoint.
Invoking this endpoint with .remote is equivalent to pinging an HTTP endpoint.
Example
>>> handle = serve_client.get_handle("my_endpoint")
>>> handle
RayServeSyncHandle(endpoint="my_endpoint")
>>> handle.remote(my_request_content)
ObjectRef(...)
>>> ray.get(handle.remote(...))  # result
>>> ray.get(handle.remote(let_it_crash_request))  # raises RayTaskError Exception

>>> async_handle = serve_client.get_handle("my_endpoint", sync=False)
>>> async_handle
RayServeHandle(endpoint="my_endpoint")
>>> await async_handle.remote(my_request_content)
ObjectRef(...)
>>> ray.get(await async_handle.remote(...))  # result
>>> ray.get(await async_handle.remote(let_it_crash_request))  # raises RayTaskError Exception
options(*, method_name: Union[str, ray.serve.handle.DEFAULT] = <DEFAULT.VALUE: 1>, shard_key: Union[str, ray.serve.handle.DEFAULT] = <DEFAULT.VALUE: 1>, http_method: Union[str, ray.serve.handle.DEFAULT] = <DEFAULT.VALUE: 1>, http_headers: Union[Dict[str, str], ray.serve.handle.DEFAULT] = <DEFAULT.VALUE: 1>)
Set options for this handle.
- Parameters
method_name (str) – The method to invoke on the backend.
http_method (str) – The HTTP method to use for the request.
shard_key (str) – A string to use to deterministically map this request to a backend if there are multiple for this endpoint.
async remote(request_data: Union[Dict, Any, None] = None, **kwargs)
Issue an asynchronous request to the endpoint.
Returns a Ray ObjectRef whose result can be waited for or retrieved using ray.wait or ray.get (or await object_ref), respectively.
- Returns
ray.ObjectRef
- Parameters
request_data (dict, Any) – If it's a dictionary, the data will be available in request.json() or request.form(). Otherwise, it will be available in request.body().
**kwargs – All keyword arguments will be available in request.query_params.
When calling from Python, the backend implementation will receive ServeRequest objects instead of Starlette requests.
class ray.serve.utils.ServeRequest(data, kwargs, headers, method)
The request object used when passing arguments via ServeHandle.
ServeRequest partially implements the API of Starlette Request. You only need to write your model serving code once; it can be queried by both HTTP and Python.
To use the full Starlette Request interface with ServeHandle, you may instead directly pass in a Starlette Request object to the ServeHandle.
property headers
The HTTP headers from handle.options(http_headers=...).
property method
The HTTP method data from handle.options(http_method=...).
property query_params
The keyword arguments from handle.remote(**kwargs).
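The mapping from handle arguments to these properties can be modeled with a minimal stand-in class. This is purely illustrative; the real ServeRequest lives in ray.serve.utils and implements part of the Starlette Request API:

```python
# Minimal stand-in for ServeRequest, showing how handle arguments map to
# the properties documented above. Not the real ray.serve.utils class.
class FakeServeRequest:
    def __init__(self, data, kwargs, headers, method):
        self._data = data        # request_data from handle.remote(...)
        self._kwargs = kwargs    # keyword args from handle.remote(**kwargs)
        self._headers = headers  # from handle.options(http_headers=...)
        self._method = method    # from handle.options(http_method=...)

    @property
    def headers(self):
        return self._headers

    @property
    def method(self):
        return self._method

    @property
    def query_params(self):
        return self._kwargs

# e.g. handle.options(http_method="POST").remote({"x": 1}, user_id="42")
req = FakeServeRequest(data={"x": 1}, kwargs={"user_id": "42"},
                       headers={"Content-Type": "application/json"},
                       method="POST")
```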
Batching Requests
ray.serve.accept_batch(f: Callable) → Callable
Annotation to mark that a serving function accepts batches of requests.
In order to accept batches of requests as input, the implementation must handle a list of requests being passed in rather than just a single request.
This must be set on any backend implementation that will have max_batch_size set to greater than 1.
Example:
>>> @serve.accept_batch
... def serving_func(requests):
...     assert isinstance(requests, list)
...     ...

>>> class ServingActor:
...     @serve.accept_batch
...     def __call__(self, requests):
...         assert isinstance(requests, list)