Serve API Reference

Core APIs

ray.serve.start(detached: bool = False, http_host: Optional[str] = '127.0.0.1', http_port: int = 8000, http_middlewares: List[Any] = [], http_options: Union[dict, ray.serve.config.HTTPOptions, None] = None, dedicated_cpu: bool = False) → ray.serve.api.Client[source]

Initialize a serve instance.

By default, the instance will be scoped to the lifetime of the returned Client object (or when the script exits). If detached is set to True, the instance will instead persist until serve.shutdown() is called. This is only relevant if connecting to a long-running Ray cluster (e.g., with ray.init(address="auto") or ray.init("ray://<remote_addr>")).

Parameters
  • detached (bool) – Whether or not the instance should be detached from this script. If set, the instance will live on the Ray cluster until it is explicitly stopped with serve.shutdown(). This should not be set in an anonymous Ray namespace because you will not be able to reconnect to the instance after the script exits.

  • http_host (Optional[str]) – Deprecated, use http_options instead.

  • http_port (int) – Deprecated, use http_options instead.

  • http_middlewares (list) – Deprecated, use http_options instead.

  • http_options (Optional[Dict, serve.HTTPOptions]) –

    Configuration options for the HTTP proxy. You can pass in a dictionary or an HTTPOptions object with the following fields:

    • host(str, None): Host for HTTP servers to listen on. Defaults to "127.0.0.1". To expose Serve publicly, you probably want to set this to "0.0.0.0".

    • port(int): Port for HTTP server. Defaults to 8000.

    • middlewares(list): A list of Starlette middlewares that will be applied to the HTTP servers in the cluster. Defaults to [].

    • location(str, serve.config.DeploymentMode): The deployment location of HTTP servers:

      • "HeadOnly": start one HTTP server on the head node. Serve assumes the head node is the node you executed serve.start on. This is the default.

      • "EveryNode": start one HTTP server per node.

      • "NoServer" or None: disable HTTP server.

    • num_cpus (int): The number of CPU cores to reserve for each internal Serve HTTP proxy actor. Defaults to 0.

  • dedicated_cpu (bool) – Whether to reserve a CPU core for the internal Serve controller actor. Defaults to False.
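
The http_options fields above can be collected into a plain dict before being passed to serve.start. A minimal sketch (the values shown are illustrative, not required defaults):

```python
# Hypothetical http_options dict mirroring the fields documented above.
http_options = {
    "host": "0.0.0.0",       # listen on all interfaces to expose Serve publicly
    "port": 8000,            # default HTTP port
    "middlewares": [],       # Starlette middlewares; none here
    "location": "HeadOnly",  # one HTTP server on the head node (the default)
    "num_cpus": 0,           # CPUs reserved per internal HTTP proxy actor
}
# This dict would then be passed as serve.start(http_options=http_options).
```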

PublicAPI: This API is stable across Ray releases.

ray.serve.deployment(_func_or_class: Optional[Callable] = None, name: Optional[str] = None, version: Optional[str] = None, prev_version: Optional[str] = None, num_replicas: Optional[int] = None, init_args: Optional[Tuple[Any]] = None, route_prefix: Optional[str] = None, ray_actor_options: Optional[Dict] = None, user_config: Optional[Any] = None, max_concurrent_queries: Optional[int] = None) → Callable[[Callable], ray.serve.api.Deployment][source]

Define a Serve deployment.

Parameters
  • name (Optional[str]) – Globally-unique name identifying this deployment. If not provided, the name of the class or function will be used.

  • version (Optional[str]) – Version of the deployment. This is used to indicate a code change for the deployment; when it is re-deployed with a version change, a rolling update of the replicas will be performed. If not provided, every deployment will be treated as a new version.

  • prev_version (Optional[str]) – Version of the existing deployment which is used as a precondition for the next deployment. If prev_version does not match the existing deployment’s version, the deployment will fail. If not provided, the deployment procedure will not check the existing deployment’s version.

  • num_replicas (Optional[int]) – The number of processes to start up that will handle requests to this backend. Defaults to 1.

  • init_args (Optional[Tuple]) – Arguments to be passed to the class constructor when starting up deployment replicas. These can also be passed when you call .deploy() on the returned Deployment.

  • route_prefix (Optional[str]) – Requests to paths under this HTTP path prefix will be routed to this deployment. Defaults to '/{name}'. Routing is done based on longest-prefix match, so if you have deployment A with a prefix of '/a' and deployment B with a prefix of '/a/b', requests to '/a', '/a/', and '/a/c' go to A and requests to '/a/b', '/a/b/', and '/a/b/c' go to B. Routes must not end with a '/' unless they're the root (just '/'), which acts as a catch-all.

  • ray_actor_options (dict) – Options to be passed to the Ray actor constructor such as resource requirements.

  • user_config (Optional[Any]) – [experimental] Config to pass to the reconfigure method of the backend. This can be updated dynamically without changing the version of the deployment and restarting its replicas. The user_config needs to be hashable to keep track of updates, so it must only contain hashable types, or hashable types nested in lists and dictionaries.

  • max_concurrent_queries (Optional[int]) – The maximum number of queries that will be sent to a replica of this backend without receiving a response. Defaults to 100.
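
The longest-prefix routing rule described for route_prefix can be sketched in plain Python. This is a hypothetical helper for illustration, not part of Serve's public API, and it assumes matching happens on path-segment boundaries:

```python
def matches(prefix: str, path: str) -> bool:
    # "/" is the root catch-all; otherwise the prefix must match the
    # whole path (with or without a trailing "/") or end at a "/" boundary.
    if prefix == "/":
        return True
    return path == prefix or path.rstrip("/") == prefix or path.startswith(prefix + "/")

def route(path: str, prefixes: list) -> str:
    # Longest-prefix match among all deployments whose prefix matches.
    matching = [p for p in prefixes if matches(p, path)]
    return max(matching, key=len) if matching else None
```

With prefixes '/a' and '/a/b', this reproduces the behavior described above: '/a/c' routes to '/a' while '/a/b/c' routes to '/a/b'.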

Example:

>>> @serve.deployment(name="deployment1", version="v1")
... class MyDeployment:
...     pass
>>> MyDeployment.deploy(*init_args)
>>> MyDeployment.options(num_replicas=2, init_args=init_args).deploy()

Returns

Deployment

PublicAPI: This API is stable across Ray releases.

ray.serve.list_deployments() → Dict[str, ray.serve.api.Deployment][source]

Returns a dictionary of all active deployments.

Dictionary maps deployment name to Deployment objects.

PublicAPI: This API is stable across Ray releases.

ray.serve.get_deployment(name: str) → ray.serve.api.Deployment[source]

Dynamically fetch a handle to a Deployment object.

This can be used to update and redeploy a deployment without access to the original definition.

Example:

>>> MyDeployment = serve.get_deployment("name")
>>> MyDeployment.options(num_replicas=10).deploy()

Parameters
  • name (str) – Name of the deployment. This must have already been deployed.

Returns

Deployment

PublicAPI: This API is stable across Ray releases.

ray.serve.shutdown() → None[source]

Completely shut down the connected Serve instance.

Shuts down all processes and deletes all state associated with the instance.

PublicAPI: This API is stable across Ray releases.

ray.serve.connect() → ray.serve.api.Client[source]

Connect to an existing Serve instance on this Ray cluster.

If calling from the driver program, the Serve instance on this Ray cluster must first have been initialized using serve.start(detached=True).

If called from within a backend, this will connect to the same Serve instance that the backend is running in.

DEPRECATED: This API is deprecated and may be removed in future Ray releases.

ray.serve.create_backend(backend_tag: str, backend_def: Union[Callable, Type[Callable], str], *init_args: Any, ray_actor_options: Optional[Dict] = None, config: Union[ray.serve.config.BackendConfig, Dict[str, Any], None] = None) → None[source]

Create a backend with the provided tag.

Parameters
  • backend_tag (str) – a unique tag used to identify this backend.

  • backend_def (callable, class, str) – a function or class implementing __call__ and returning a JSON-serializable object or a Starlette Response object. A string import path can also be provided (e.g., "my_module.MyClass"), in which case the underlying function or class will be imported dynamically in the worker replicas.

  • *init_args (optional) – the arguments to pass to the class initialization method. Not valid if backend_def is a function.

  • ray_actor_options (optional) – options to be passed into the @ray.remote decorator for the backend actor.

  • config (dict, serve.BackendConfig, optional) – configuration options for this backend. Either a BackendConfig object or a dictionary mapping strings to values for the following supported options:

    • "num_replicas": number of processes to start up that will handle requests to this backend.

    • "max_concurrent_queries": the maximum number of queries that will be sent to a replica of this backend without receiving a response.

    • "user_config" (experimental): arguments to pass to the reconfigure method of the backend. The reconfigure method is called if "user_config" is not None.

DEPRECATED: This API is deprecated and may be removed in future Ray releases.

ray.serve.list_backends() → Dict[str, ray.serve.config.BackendConfig][source]

Returns a dictionary of all registered backends.

Dictionary maps backend tags to backend config objects.

DEPRECATED: This API is deprecated and may be removed in future Ray releases.

ray.serve.delete_backend(backend_tag: str, force: bool = False) → None[source]

Delete the given backend.

The backend must not currently be used by any endpoints.

Parameters
  • backend_tag (str) – The backend tag to be deleted.

  • force (bool) – Whether or not to force the deletion, without waiting for graceful shutdown. Defaults to False.

DEPRECATED: This API is deprecated and may be removed in future Ray releases.

ray.serve.get_backend_config(backend_tag: str) → ray.serve.config.BackendConfig[source]

Get the backend configuration for a backend tag.

Parameters

backend_tag (str) – A registered backend.

DEPRECATED: This API is deprecated and may be removed in future Ray releases.

ray.serve.update_backend_config(backend_tag: str, config_options: Union[ray.serve.config.BackendConfig, Dict[str, Any]]) → None[source]

Update a backend configuration for a backend tag.

Keys not specified in the passed config_options will be left unchanged.

Parameters
  • backend_tag (str) – A registered backend.

  • config_options (dict, serve.BackendConfig) – Backend config options to update. Either a BackendConfig object or a dict mapping strings to values for the following supported options:

    • "num_replicas": number of processes to start up that will handle requests to this backend.

    • "max_concurrent_queries": the maximum number of queries that will be sent to a replica of this backend without receiving a response.

    • "user_config" (experimental): arguments to pass to the reconfigure method of the backend. The reconfigure method is called if "user_config" is not None.
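
The partial-update semantics ("keys not specified are left unchanged") amount to a dict merge. A minimal sketch with hypothetical config values:

```python
def merged_config(existing: dict, updates: dict) -> dict:
    """Sketch of update_backend_config's merge rule: keys absent from
    updates keep their existing values. Not Serve's actual code."""
    new = dict(existing)
    new.update(updates)
    return new

# Only num_replicas changes; max_concurrent_queries is left as-is.
result = merged_config(
    {"num_replicas": 2, "max_concurrent_queries": 100},
    {"num_replicas": 5},
)
```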

DEPRECATED: This API is deprecated and may be removed in future Ray releases.

ray.serve.create_endpoint(endpoint_name: str, *, backend: str = None, route: Optional[str] = None, methods: List[str] = ['GET']) → None[source]

Create a service endpoint for the given route.

Parameters
  • endpoint_name (str) – A name to associate with the endpoint.

  • backend (str, required) – The backend that will serve requests to this endpoint. To change this or split traffic among backends, use serve.set_traffic.

  • route (str, optional) – A string beginning with "/". The HTTP server will use this string to match the request path.

  • methods (List[str], optional) – The HTTP methods that are valid for this endpoint.

DEPRECATED: This API is deprecated and may be removed in future Ray releases.

ray.serve.list_endpoints() → Dict[str, Dict[str, Any]][source]

Returns a dictionary of all registered endpoints.

The dictionary keys are endpoint names and values are dictionaries of the form: {"methods": List[str], "traffic": Dict[str, float]}.

DEPRECATED: This API is deprecated and may be removed in future Ray releases.

ray.serve.delete_endpoint(endpoint: str) → None[source]

Delete the given endpoint.

Does not delete any associated backends.

DEPRECATED: This API is deprecated and may be removed in future Ray releases.

ray.serve.set_traffic(endpoint_name: str, traffic_policy_dictionary: Dict[str, float]) → None[source]

Associate a service endpoint with traffic policy.

Example:

>>> serve.set_traffic("service-name", {
...     "backend:v1": 0.5,
...     "backend:v2": 0.5,
... })

Parameters
  • endpoint_name (str) – A registered service endpoint.

  • traffic_policy_dictionary (dict) – a dictionary mapping backend names to their traffic weights. The weights must sum to 1.
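
The weighted split a traffic policy configures can be sketched as a cumulative-weight draw. This is an illustrative helper (not Serve's routing code); the backend names are the hypothetical ones from the example above:

```python
import random

def pick_backend(traffic_policy: dict, rng: random.Random) -> str:
    """Pick a backend with probability equal to its traffic weight."""
    total = sum(traffic_policy.values())
    assert abs(total - 1.0) < 1e-9, "weights must sum to 1"
    r = rng.random()
    cumulative = 0.0
    for backend, weight in traffic_policy.items():
        cumulative += weight
        if r < cumulative:
            return backend
    return backend  # guard against floating-point edge cases near r = 1.0
```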

DEPRECATED: This API is deprecated and may be removed in future Ray releases.

ray.serve.shadow_traffic(endpoint_name: str, backend_tag: str, proportion: float) → None[source]

Shadow traffic from an endpoint to a backend.

The specified proportion of requests will be duplicated and sent to the backend. Responses of the duplicated traffic will be ignored. The backend must not already be in use.

To stop shadowing traffic to a backend, call shadow_traffic with proportion equal to 0.

Parameters
  • endpoint_name (str) – A registered service endpoint.

  • backend_tag (str) – A registered backend.

  • proportion (float) – The proportion of traffic from 0 to 1.
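
The documented behavior (duplicate a given proportion of requests, ignore the duplicates' responses, proportion 0 disables shadowing) reduces to a per-request coin flip. A hedged sketch, not Serve's implementation:

```python
import random

def maybe_shadow(proportion: float, rng: random.Random) -> bool:
    """Decide whether to duplicate this request to the shadow backend.
    The duplicated request's response would be discarded by the caller."""
    assert 0.0 <= proportion <= 1.0
    return rng.random() < proportion
```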

DEPRECATED: This API is deprecated and may be removed in future Ray releases.

ray.serve.get_handle(endpoint_name: str, missing_ok: bool = False, sync: bool = True, _internal_use_serve_request: bool = True, _internal_pickled_http_request: bool = False) → Union[ray.serve.handle.RayServeHandle, ray.serve.handle.RayServeSyncHandle][source]

Retrieve RayServeHandle for service endpoint to invoke it from Python.

Parameters
  • endpoint_name (str) – A registered service endpoint.

  • missing_ok (bool) – If true, Serve won’t check that the endpoint is registered. False by default.

  • sync (bool) – If true, Serve will return a ServeHandle that works everywhere. Otherwise, Serve will return a ServeHandle that’s only usable in an asyncio event loop.

Returns

RayServeHandle

DEPRECATED: This API is deprecated and may be removed in future Ray releases.

Deployment API

class ray.serve.api.Deployment(func_or_class: Callable, name: str, config: ray.serve.config.BackendConfig, version: Optional[str] = None, prev_version: Optional[str] = None, init_args: Optional[Tuple[Any]] = None, route_prefix: Optional[str] = None, ray_actor_options: Optional[Dict] = None, _internal=False)[source]

PublicAPI: This API is stable across Ray releases.

deploy(*init_args, _blocking=True)[source]

Deploy or update this deployment.

Args:
    init_args (optional): args to pass to the class __init__ method. Not valid if this deployment wraps a function.

PublicAPI: This API is stable across Ray releases.

delete()[source]

Delete this deployment.

PublicAPI: This API is stable across Ray releases.

get_handle(sync: Optional[bool] = True) → Union[ray.serve.handle.RayServeHandle, ray.serve.handle.RayServeSyncHandle][source]

Get a ServeHandle to this deployment to invoke it from Python.

Args:
    sync (bool): If true, Serve will return a ServeHandle that works everywhere. Otherwise, Serve will return an asyncio-optimized ServeHandle that's only usable in an asyncio loop.

Returns:

ServeHandle

PublicAPI: This API is stable across Ray releases.

options(func_or_class: Optional[Callable] = None, name: Optional[str] = None, version: Optional[str] = None, prev_version: Optional[str] = None, init_args: Optional[Tuple[Any]] = None, route_prefix: Optional[str] = None, num_replicas: Optional[int] = None, ray_actor_options: Optional[Dict] = None, user_config: Optional[Any] = None, max_concurrent_queries: Optional[int] = None) → ray.serve.api.Deployment[source]

Return a copy of this deployment with updated options.

Only those options passed in will be updated, all others will remain unchanged from the existing deployment.

PublicAPI: This API is stable across Ray releases.

ServeHandle API

class ray.serve.handle.RayServeHandle(controller_handle: ray.actor.ActorHandle, endpoint_name: str, handle_options: Optional[ray.serve.handle.HandleOptions] = None, *, known_python_methods: List[str] = [], _router: Optional[ray.serve.router.EndpointRouter] = None, _internal_use_serve_request: Optional[bool] = True, _internal_pickled_http_request: bool = False)[source]

A handle to a service endpoint.

Invoking this endpoint with .remote is equivalent to pinging an HTTP endpoint.

Example

>>> handle = serve_client.get_handle("my_endpoint")
>>> handle
RayServeSyncHandle(endpoint="my_endpoint")
>>> handle.remote(my_request_content)
ObjectRef(...)
>>> ray.get(handle.remote(...))
# result
>>> ray.get(handle.remote(let_it_crash_request))
# raises RayTaskError Exception
>>> async_handle = serve_client.get_handle("my_endpoint", sync=False)
>>> async_handle
RayServeHandle(endpoint="my_endpoint")
>>> await async_handle.remote(my_request_content)
ObjectRef(...)
>>> ray.get(await async_handle.remote(...))
# result
>>> ray.get(await async_handle.remote(let_it_crash_request))
# raises RayTaskError Exception

options(*, method_name: Union[str, ray.serve.handle.DEFAULT] = <DEFAULT.VALUE: 1>, shard_key: Union[str, ray.serve.handle.DEFAULT] = <DEFAULT.VALUE: 1>, http_method: Union[str, ray.serve.handle.DEFAULT] = <DEFAULT.VALUE: 1>, http_headers: Union[Dict[str, str], ray.serve.handle.DEFAULT] = <DEFAULT.VALUE: 1>)[source]

Set options for this handle.

Parameters
  • method_name (str) – The method to invoke on the backend.

  • http_method (str) – The HTTP method to use for the request.

  • shard_key (str) – A string to use to deterministically map this request to a backend if there are multiple for this endpoint.
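
The deterministic mapping a shard_key enables can be sketched with a stable hash. This is an illustrative helper with hypothetical backend names, not Serve's actual hashing scheme:

```python
import hashlib

def shard_to_backend(shard_key: str, backends: list) -> str:
    """Deterministically map a shard_key to one of several backends.
    Uses md5 so the mapping is stable across processes (unlike
    Python's built-in hash(), which is salted per process)."""
    digest = hashlib.md5(shard_key.encode()).hexdigest()
    return backends[int(digest, 16) % len(backends)]
```

The same shard_key always lands on the same backend, which is what makes it useful for sticky routing such as per-user caching.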

async remote(*args, **kwargs)[source]

Issue an asynchronous request to the endpoint.

Returns a Ray ObjectRef whose results can be waited for or retrieved using ray.wait or ray.get (or await object_ref), respectively.

Returns

ray.ObjectRef

Parameters
  • request_data (dict, Any) – If it’s a dictionary, the data will be available in request.json() or request.form(). Otherwise, it will be available in request.body().

  • **kwargs – All keyword arguments will be available in request.query_params.
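
The request_data rule above (dicts become json/form data, everything else becomes the raw body) can be stated as a tiny dispatcher. A hypothetical sketch for illustration only:

```python
def request_data_location(request_data) -> str:
    """Where request_data surfaces inside the replica, per the docs above."""
    if isinstance(request_data, dict):
        return "request.json() / request.form()"
    return "request.body()"
```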

Batching Requests

ray.serve.batch(max_batch_size=10, batch_wait_timeout_s=0.0)[source]

Converts a function to asynchronously handle batches.

The function can be a standalone function or a class method. In both cases, the function must be async def and take a list of objects as its sole argument and return a list of the same length as a result.

When invoked, each caller passes a single object. These objects will be batched and the underlying function executed asynchronously once a batch of max_batch_size has accumulated or batch_wait_timeout_s has elapsed, whichever occurs first.

Example:

>>> @serve.batch(max_batch_size=50, batch_wait_timeout_s=0.5)
... async def handle_batch(batch: List[str]):
...     return [s.lower() for s in batch]
>>> async def handle_single(s: str):
...     return await handle_batch(s)  # Returns s.lower().

Parameters
  • max_batch_size (int) – the maximum batch size that will be executed in one call to the underlying function.

  • batch_wait_timeout_s (float) – the maximum duration to wait for max_batch_size elements before running the underlying function.
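
The flush-on-size-or-timeout semantics described above can be sketched with plain asyncio. This is a simplified, hypothetical helper to illustrate the behavior, not Serve's implementation (which also handles errors, cancellation, and class methods):

```python
import asyncio

class SimpleBatcher:
    """Queue single-object calls and flush them to the wrapped async
    function when max_batch_size calls have accumulated or
    batch_wait_timeout_s has elapsed, whichever occurs first."""

    def __init__(self, func, max_batch_size=10, batch_wait_timeout_s=0.0):
        self.func = func
        self.max_batch_size = max_batch_size
        self.batch_wait_timeout_s = batch_wait_timeout_s
        self.pending = []  # (item, future) pairs awaiting a flush
        self.timer = None  # task that flushes after the timeout

    async def __call__(self, item):
        fut = asyncio.get_running_loop().create_future()
        self.pending.append((item, fut))
        if len(self.pending) >= self.max_batch_size:
            await self._flush()
        elif self.timer is None:
            self.timer = asyncio.ensure_future(self._flush_after_timeout())
        return await fut

    async def _flush_after_timeout(self):
        await asyncio.sleep(self.batch_wait_timeout_s)
        await self._flush()

    async def _flush(self):
        batch, self.pending = self.pending, []
        timer, self.timer = self.timer, None
        if timer is not None and timer is not asyncio.current_task():
            timer.cancel()  # the size threshold won the race; stop the timer
        if not batch:
            return
        results = await self.func([item for item, _ in batch])
        for (_, fut), result in zip(batch, results):
            fut.set_result(result)
```

For example, with max_batch_size=2, two concurrent calls flush immediately as one batch, while a lone third call waits out the timeout before being executed on its own.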