ray.serve.handle.DeploymentHandle
- class ray.serve.handle.DeploymentHandle
A handle used to make requests to a deployment at runtime.
This is primarily used to compose multiple deployments within a single application. It can also be used to make calls to the ingress deployment of an application (e.g., for programmatic testing).
Example:
    import ray
    from ray import serve
    from ray.serve.handle import DeploymentHandle, DeploymentResponse

    @serve.deployment
    class Downstream:
        def say_hi(self, message: str):
            return f"Hello {message}!"

    @serve.deployment
    class Ingress:
        def __init__(self, handle: DeploymentHandle):
            self._downstream_handle = handle

        async def __call__(self, name: str) -> str:
            # Call the downstream deployment and await its result.
            response = self._downstream_handle.say_hi.remote(name)
            return await response

    app = Ingress.bind(Downstream.bind())
    handle: DeploymentHandle = serve.run(app)
    response = handle.remote("world")
    assert response.result() == "Hello world!"
- options(*, method_name: str | DEFAULT = DEFAULT.VALUE, multiplexed_model_id: str | DEFAULT = DEFAULT.VALUE, session_id: str | DEFAULT = DEFAULT.VALUE, stream: bool | DEFAULT = DEFAULT.VALUE, use_new_handle_api: bool | DEFAULT = DEFAULT.VALUE, _prefer_local_routing: bool | DEFAULT = DEFAULT.VALUE, _by_reference: bool | DEFAULT = DEFAULT.VALUE, request_serialization: str | DEFAULT = DEFAULT.VALUE, response_serialization: str | DEFAULT = DEFAULT.VALUE) → DeploymentHandle[T]
Set options for this handle and return an updated copy of it.
- Parameters:
method_name – The method name to call on the deployment.
multiplexed_model_id – The model ID to use for multiplexed model requests.
session_id – Session identifier used for honoring session stickiness.
stream – Whether to use streaming for the request.
use_new_handle_api – Whether to use the new handle API.
_prefer_local_routing – Whether to prefer local routing.
_by_reference – Whether to use by reference.
request_serialization – Serialization method for RPC requests. Available options: “cloudpickle”, “pickle”, “msgpack”, “orjson”. Defaults to “cloudpickle”.
response_serialization – Serialization method for RPC responses. Available options: “cloudpickle”, “pickle”, “msgpack”, “orjson”. Defaults to “cloudpickle”.
Example:
    response: DeploymentResponse = handle.options(
        method_name="other_method",
        multiplexed_model_id="model:v1",
    ).remote()
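The "return an updated copy" contract of options() can be pictured with a small stand-alone model. This is a sketch in plain Python, not Ray Serve's implementation: `HandleOptions` and `FakeHandle` are hypothetical names, only three of the options listed above are modeled, and the default values shown are assumptions made for the illustration.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class HandleOptions:
    """Hypothetical stand-in for a handle's option set (not a Ray class)."""
    method_name: str = "__call__"  # assumed default, for illustration only
    multiplexed_model_id: str = ""
    stream: bool = False


class FakeHandle:
    """Hypothetical handle that mimics the 'return an updated copy' contract."""

    def __init__(self, options: Optional[HandleOptions] = None):
        self._options = options or HandleOptions()

    def options(self, **overrides) -> "FakeHandle":
        # Only the overridden fields change; the original handle is untouched.
        return FakeHandle(replace(self._options, **overrides))


base = FakeHandle()
tuned = base.options(method_name="other_method", multiplexed_model_id="model:v1")

print(base._options.method_name)   # __call__
print(tuned._options.method_name)  # other_method
```

Because the original handle is never mutated in this model, one base handle can be shared and specialized per call site without the call sites affecting each other.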
- remote(*args, **kwargs) → DeploymentResponse[Any] | DeploymentResponseGenerator[Any]
Issue a remote call to a method of the deployment.
By default, the result is a DeploymentResponse that can be awaited to fetch the result of the call or passed to another .remote() call to compose multiple deployments.
If handle.options(stream=True) is set and a generator method is called, this returns a DeploymentResponseGenerator instead.
Example:
    # Fetch the result directly.
    response = handle.remote()
    result = await response

    # Pass the result to another handle call.
    composed_response = handle2.remote(handle1.remote())
    composed_result = await composed_response
- Parameters:
*args – Positional arguments to be serialized and passed to the remote method call.
**kwargs – Keyword arguments to be serialized and passed to the remote method call.
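The composition pattern described for remote() (passing one response as an argument to another call) can be modeled with plain asyncio. The following is only a sketch of the idea: `FakeResponse` and `FakeHandle` are hypothetical classes, not Ray Serve's, and the model assumes an upstream response is resolved to its value before the downstream method runs.

```python
import asyncio
from typing import Any, Callable, Coroutine


class FakeResponse:
    """Hypothetical awaitable wrapping a pending call (not a Ray class)."""

    def __init__(self, coro: Coroutine[Any, Any, Any]):
        # Start the call immediately; awaiting later just collects the result.
        self._task = asyncio.create_task(coro)

    def __await__(self):
        return self._task.__await__()


class FakeHandle:
    """Hypothetical handle around a single async function."""

    def __init__(self, fn: Callable[..., Coroutine[Any, Any, Any]]):
        self._fn = fn

    def remote(self, *args: Any) -> FakeResponse:
        async def call() -> Any:
            # Resolve any upstream responses before invoking the method.
            resolved = [await a if isinstance(a, FakeResponse) else a for a in args]
            return await self._fn(*resolved)

        return FakeResponse(call())


async def add_one(x: int) -> int:
    return x + 1


async def double(x: int) -> int:
    return x * 2


async def main() -> int:
    handle1 = FakeHandle(add_one)
    handle2 = FakeHandle(double)
    # Same shape as handle2.remote(handle1.remote()) in the example above.
    composed = handle2.remote(handle1.remote(3))
    return await composed


result = asyncio.run(main())
print(result)  # 8
```

In this model the caller never awaits the intermediate response itself; only the final composed response is awaited.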
- choose_replica(*args: Any, **kwargs: Any) → AsyncContextManager[ReplicaSelection, bool | None]
Execute the request router to select a replica without dispatching.
This method runs the full routing logic (load balancing, locality awareness, queue length probing, etc.) and returns an async context manager that yields a ReplicaSelection. A request slot is reserved on the selected replica for the subsequent dispatch. The reservation does not protect against the replica becoming unavailable before dispatch executes; see dispatch().
The context manager ensures proper cleanup:
- If dispatch() is called, the slot is consumed normally.
- If the context exits without dispatch (e.g., exception, early return), the slot is released.
The method name is determined at choose_replica time. Any method name on the handle passed to dispatch is ignored.
- Parameters:
*args – Positional arguments that may influence routing decisions.
**kwargs – Keyword arguments that may influence routing decisions.
- Returns:
AsyncContextManager[ReplicaSelection]. It must be used with async with.
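The reserve-then-release behavior described above can be sketched with an async context manager. This is a simplified model under stated assumptions, not Ray Serve's router: `FakeRouter` and `FakeSelection` are hypothetical, and "least reserved slots" stands in for the real routing logic.

```python
import asyncio
from contextlib import asynccontextmanager
from dataclasses import dataclass
from typing import AsyncIterator, Dict, List


@dataclass
class FakeSelection:
    """Hypothetical stand-in for a selection (not Ray's ReplicaSelection)."""
    replica_id: str
    dispatched: bool = False


class FakeRouter:
    """Hypothetical router tracking reserved slots per replica."""

    def __init__(self, replica_ids: List[str]):
        self.reserved: Dict[str, int] = {r: 0 for r in replica_ids}

    @asynccontextmanager
    async def choose_replica(self) -> AsyncIterator[FakeSelection]:
        # Pick the replica with the fewest reserved slots and reserve one.
        replica_id = min(self.reserved, key=self.reserved.get)
        self.reserved[replica_id] += 1
        selection = FakeSelection(replica_id)
        try:
            yield selection
        finally:
            # Release the slot only if nothing was dispatched.
            if not selection.dispatched:
                self.reserved[replica_id] -= 1

    def dispatch(self, selection: FakeSelection) -> None:
        # Dispatch consumes the reservation; the in-flight request now owns it.
        selection.dispatched = True


async def main():
    router = FakeRouter(["replica-a", "replica-b"])

    # Exit without dispatch: the reservation is released.
    async with router.choose_replica():
        pass
    after_skip = dict(router.reserved)

    # Dispatch inside the context: the slot stays held by the request.
    async with router.choose_replica() as selection:
        router.dispatch(selection)
    after_dispatch = dict(router.reserved)
    return after_skip, after_dispatch


after_skip, after_dispatch = asyncio.run(main())
print(after_skip)      # {'replica-a': 0, 'replica-b': 0}
print(after_dispatch)  # {'replica-a': 1, 'replica-b': 0}
```

Putting the release in a `finally` block is what makes the early-return and exception cases behave the same way in this model.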
- dispatch(selection: ReplicaSelection, *args: Any, **kwargs: Any) → DeploymentResponse[Any] | DeploymentResponseGenerator[Any]
Dispatch a request to a previously selected replica.
By default, the result is a DeploymentResponse that can be awaited to fetch the result of the call. As with .remote(), DeploymentResponse objects can be passed as arguments for deployment composition.
If handle.options(stream=True) is set and a generator method is called, this returns a DeploymentResponseGenerator instead. If the selected replica becomes unavailable before dispatch executes, ReplicaUnavailableError is propagated from the router dispatch path.
The returned response must be awaited before the choose_replica context exits. The router fires on_request_completed exactly once per dispatched request to decrement its queue-length cache. Exiting the context with an unawaited response fires it twice (once during context cleanup, then again when the deferred dispatch task eventually completes), leaving the cache under-counted.
- Parameters:
selection – The ReplicaSelection yielded by the choose_replica() context manager.
*args – The request arguments to send to the replica.
**kwargs – The request keyword arguments to send to the replica.
- Returns:
DeploymentResponse or DeploymentResponseGenerator (if streaming).
- Raises:
ValueError – If selection was created by a different DeploymentHandle.
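To see why a second on_request_completed matters, consider a toy queue-length cache. The class below is purely illustrative: `QueueLengthCache` and its methods are hypothetical names and do not reflect Ray Serve's internal data structures.

```python
from typing import Dict, List


class QueueLengthCache:
    """Hypothetical per-replica in-flight counter (not Ray's cache)."""

    def __init__(self, replica_ids: List[str]):
        self.in_flight: Dict[str, int] = {r: 0 for r in replica_ids}

    def on_request_sent(self, replica_id: str) -> None:
        self.in_flight[replica_id] += 1

    def on_request_completed(self, replica_id: str) -> None:
        self.in_flight[replica_id] -= 1

    def least_loaded(self) -> str:
        return min(self.in_flight, key=self.in_flight.get)


cache = QueueLengthCache(["replica-a", "replica-b"])

# Correct lifecycle on replica-b: one send, one completion.
cache.on_request_sent("replica-b")
cache.on_request_completed("replica-b")

# Faulty lifecycle on replica-a: one send, two completions.
cache.on_request_sent("replica-a")
cache.on_request_completed("replica-a")  # fired during context cleanup
cache.on_request_completed("replica-a")  # fired again when the deferred task ends

print(cache.in_flight)       # {'replica-a': -1, 'replica-b': 0}
print(cache.least_loaded())  # replica-a
```

In this toy model the under-counted replica looks less loaded than it is, so a least-loaded policy would keep favoring it.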
- broadcast(method_name: str, *args, **kwargs) → DeploymentBroadcastResponse
Call a method on all replicas of this deployment in parallel.
Unlike remote(), which routes the request to a single replica via load balancing, broadcast() fans the call out to every running replica concurrently.
This is useful for coordinated operations such as cache resets, configuration updates, or state synchronization across replicas.
Warning
broadcast() bypasses per-replica backpressure (max_queued_requests is not enforced). It is intended for infrequent control-plane operations such as cache invalidation, configuration reload, or state synchronization across replicas. Do not call it on the hot request path: doing so sends one request per replica on every call, with no rate limiting.
Example:
    handle = serve.get_deployment_handle("MyDeployment", "app")

    # Call reset_cache on every replica and collect results.
    response = handle.broadcast("reset_cache")
    results = response.results()

    # Pass arguments to the broadcast call.
    response = handle.broadcast("update_config", new_value=42)
    results = response.results()
- Parameters:
method_name – The name of the method to call on each replica.
*args – Positional arguments passed to the method.
**kwargs – Keyword arguments passed to the method.
- Returns:
A DeploymentBroadcastResponse that can be used to collect results from all replicas.
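The fan-out pattern behind broadcast() can be illustrated with asyncio.gather. This is a conceptual sketch only: `FakeReplica` and the free function `broadcast` are hypothetical, and real replicas are remote actors rather than local objects.

```python
import asyncio
from typing import Any, List


class FakeReplica:
    """Hypothetical local stand-in for a deployment replica."""

    def __init__(self, name: str):
        self.name = name
        self.cache = {"key": "value"}

    async def reset_cache(self) -> str:
        self.cache.clear()
        return f"{self.name}: cache cleared"


async def broadcast(
    replicas: List[FakeReplica], method_name: str, *args: Any, **kwargs: Any
) -> List[Any]:
    # One call per replica, all run concurrently; results keep replica order.
    calls = [getattr(replica, method_name)(*args, **kwargs) for replica in replicas]
    return await asyncio.gather(*calls)


replicas = [FakeReplica(f"replica-{i}") for i in range(3)]
results = asyncio.run(broadcast(replicas, "reset_cache"))
print(results)
# ['replica-0: cache cleared', 'replica-1: cache cleared', 'replica-2: cache cleared']
```

Note that the sketch issues one call per replica with no throttling, which mirrors the reason the warning above advises keeping broadcast() off the hot request path.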