ray.serve.handle.DeploymentHandle#

class ray.serve.handle.DeploymentHandle[source]#

A handle used to make requests to a deployment at runtime.

This is primarily used to compose multiple deployments within a single application. It can also be used to make calls to the ingress deployment of an application (e.g., for programmatic testing).

Example:

import ray
from ray import serve
from ray.serve.handle import DeploymentHandle, DeploymentResponse

@serve.deployment
class Downstream:
    def say_hi(self, message: str):
        return f"Hello {message}!"

@serve.deployment
class Ingress:
    def __init__(self, handle: DeploymentHandle):
        self._downstream_handle = handle

    async def __call__(self, name: str) -> str:
        response = self._downstream_handle.say_hi.remote(name)
        return await response

app = Ingress.bind(Downstream.bind())
handle: DeploymentHandle = serve.run(app)
response = handle.remote("world")
assert response.result() == "Hello world!"
options(*, method_name: str | DEFAULT = DEFAULT.VALUE, multiplexed_model_id: str | DEFAULT = DEFAULT.VALUE, session_id: str | DEFAULT = DEFAULT.VALUE, stream: bool | DEFAULT = DEFAULT.VALUE, use_new_handle_api: bool | DEFAULT = DEFAULT.VALUE, _prefer_local_routing: bool | DEFAULT = DEFAULT.VALUE, _by_reference: bool | DEFAULT = DEFAULT.VALUE, request_serialization: str | DEFAULT = DEFAULT.VALUE, response_serialization: str | DEFAULT = DEFAULT.VALUE) DeploymentHandle[T][source]#

Set options for this handle and return an updated copy of it.

Parameters:
  • method_name – The method name to call on the deployment.

  • multiplexed_model_id – The model ID to use for multiplexed model requests.

  • session_id – Session identifier used for honoring session stickiness.

  • stream – Whether to use streaming for the request.

  • use_new_handle_api – Whether to use the new handle API.

  • _prefer_local_routing – Whether to prefer routing to replicas on the same node as this handle.

  • _by_reference – Whether to pass request arguments by reference.

  • request_serialization – Serialization method for RPC requests. Available options: “cloudpickle”, “pickle”, “msgpack”, “orjson”. Defaults to “cloudpickle”.

  • response_serialization – Serialization method for RPC responses. Available options: “cloudpickle”, “pickle”, “msgpack”, “orjson”. Defaults to “cloudpickle”.

Example:

response: DeploymentResponse = handle.options(
    method_name="other_method",
    multiplexed_model_id="model:v1",
).remote()
remote(*args, **kwargs) DeploymentResponse[Any] | DeploymentResponseGenerator[Any][source]#

Issue a remote call to a method of the deployment.

By default, the result is a DeploymentResponse that can be awaited to fetch the result of the call or passed to another .remote() call to compose multiple deployments.

If handle.options(stream=True) is set and a generator method is called, this returns a DeploymentResponseGenerator instead.

Example:

# Fetch the result directly.
response = handle.remote()
result = await response

# Pass the result to another handle call.
composed_response = handle2.remote(handle1.remote())
composed_result = await composed_response
Parameters:
  • *args – Positional arguments to be serialized and passed to the remote method call.

  • **kwargs – Keyword arguments to be serialized and passed to the remote method call.

choose_replica(*args: Any, **kwargs: Any) AsyncContextManager[ReplicaSelection, bool | None][source]#

Execute the request router to select a replica without dispatching.

This method runs the full routing logic (load balancing, locality awareness, queue length probing, etc.) and returns an async context manager that yields a ReplicaSelection. A request slot is reserved on the selected replica, guaranteeing that dispatch will succeed.

The context manager ensures proper cleanup:

  • If dispatch() is called, the slot is consumed normally.

  • If the context exits without dispatch (e.g., exception, early return), the slot is released.

The method name is fixed when choose_replica() is called; any method name set on the handle passed to dispatch() is ignored.

Parameters:
  • *args – Positional arguments that may influence routing decisions.

  • **kwargs – Keyword arguments that may influence routing decisions.

Returns:

AsyncContextManager[ReplicaSelection] - must be used with async with.

dispatch(selection: ReplicaSelection, *args: Any, **kwargs: Any) DeploymentResponse[Any] | DeploymentResponseGenerator[Any][source]#

Dispatch a request to a previously selected replica.

By default, the result is a DeploymentResponse that can be awaited to fetch the result of the call. Like .remote(), DeploymentResponse objects can be passed as arguments for deployment composition.

If handle.options(stream=True) is set and a generator method is called, this returns a DeploymentResponseGenerator instead. If the selected replica becomes unavailable before dispatch executes, ReplicaUnavailableError is propagated from the router dispatch path.

The returned response must be awaited before the choose_replica context exits. The router fires on_request_completed exactly once per dispatched request to decrement its queue-length cache. If the context exits with an unawaited response, it can fire twice (once during context cleanup, then again when the deferred dispatch task eventually completes), leaving the cache under-counted.

Parameters:
  • selection – A ReplicaSelection from choose_replica() context manager.

  • *args – The request arguments to send to the replica.

  • **kwargs – The request keyword arguments to send to the replica.

Returns:

DeploymentResponse or DeploymentResponseGenerator (if streaming).

Raises:

ValueError – If selection was created by a different DeploymentHandle.

broadcast(method_name: str, *args, **kwargs) DeploymentBroadcastResponse[source]#

Call a method on all replicas of this deployment in parallel.

Unlike remote(), which routes the request to a single replica via load balancing, broadcast() fans the call out to every running replica concurrently.

This is useful for coordinated operations such as cache resets, configuration updates, or state synchronization across replicas.

Warning

broadcast() bypasses per-replica backpressure (max_queued_requests is not enforced). It is intended for infrequent control-plane operations such as cache invalidation, configuration reload, or state synchronization across replicas. Do not call it on the hot request path; doing so sends one request per replica on every call, with no rate limiting.

Example:

handle = serve.get_deployment_handle("MyDeployment", "app")

# Call reset_cache on every replica and collect results.
response = handle.broadcast("reset_cache")
results = response.results()

# Pass arguments to the broadcast call.
response = handle.broadcast("update_config", new_value=42)
results = response.results()
Parameters:
  • method_name – The name of the method to call on each replica.

  • *args – Positional arguments passed to the method.

  • **kwargs – Keyword arguments passed to the method.

Returns:

A DeploymentBroadcastResponse that can be used to collect results from all replicas.