ray.serve.handle.DeploymentHandle#

class ray.serve.handle.DeploymentHandle[source]#

A handle used to make requests to a deployment at runtime.

This is primarily used to compose multiple deployments within a single application. It can also be used to make calls to the ingress deployment of an application (e.g., for programmatic testing).

Example:

import ray
from ray import serve
from ray.serve.handle import DeploymentHandle, DeploymentResponse

@serve.deployment
class Downstream:
    def say_hi(self, message: str):
        return f"Hello {message}!"

@serve.deployment
class Ingress:
    def __init__(self, handle: DeploymentHandle):
        self._downstream_handle = handle

    async def __call__(self, name: str) -> str:
        response = self._downstream_handle.say_hi.remote(name)
        return await response

app = Ingress.bind(Downstream.bind())
handle: DeploymentHandle = serve.run(app)
response = handle.remote("world")
assert response.result() == "Hello world!"

options(*, method_name: str | DEFAULT = DEFAULT.VALUE, multiplexed_model_id: str | DEFAULT = DEFAULT.VALUE, stream: bool | DEFAULT = DEFAULT.VALUE, use_new_handle_api: bool | DEFAULT = DEFAULT.VALUE, _prefer_local_routing: bool | DEFAULT = DEFAULT.VALUE, _by_reference: bool | DEFAULT = DEFAULT.VALUE, request_serialization: str | DEFAULT = DEFAULT.VALUE, response_serialization: str | DEFAULT = DEFAULT.VALUE) → DeploymentHandle[T][source]#

Set options for this handle and return an updated copy of it.

Parameters:
  • method_name – The method name to call on the deployment.

  • multiplexed_model_id – The model ID to use for multiplexed model requests.

  • stream – Whether to use streaming for the request.

  • use_new_handle_api – Whether to use the new handle API.

  • _prefer_local_routing – Whether to prefer local routing.

  • _by_reference – Whether to pass arguments by reference.

  • request_serialization – Serialization method for RPC requests. Available options: “cloudpickle”, “pickle”, “msgpack”, “orjson”. Defaults to “cloudpickle”.

  • response_serialization – Serialization method for RPC responses. Available options: “cloudpickle”, “pickle”, “msgpack”, “orjson”. Defaults to “cloudpickle”.

Example:

response: DeploymentResponse = handle.options(
    method_name="other_method",
    multiplexed_model_id="model:v1",
).remote()

remote(*args, **kwargs) → DeploymentResponse[Any] | DeploymentResponseGenerator[Any][source]#

Issue a remote call to a method of the deployment.

By default, the result is a DeploymentResponse that can be awaited to fetch the result of the call or passed to another .remote() call to compose multiple deployments.

If handle.options(stream=True) is set and a generator method is called, this returns a DeploymentResponseGenerator instead.

Example:

# Fetch the result directly.
response = handle.remote()
result = await response

# Pass the result to another handle call.
composed_response = handle2.remote(handle1.remote())
composed_result = await composed_response

Parameters:
  • *args – Positional arguments to be serialized and passed to the remote method call.

  • **kwargs – Keyword arguments to be serialized and passed to the remote method call.

broadcast(method_name: str, *args, **kwargs) → DeploymentBroadcastResponse[source]#

Call a method on all replicas of this deployment in parallel.

Unlike remote(), which routes the request to a single replica via load balancing, broadcast() fans the call out to every running replica concurrently.

This is useful for coordinated operations such as cache resets, configuration updates, or state synchronization across replicas.

Warning

broadcast() bypasses per-replica backpressure (max_queued_requests is not enforced). It is intended for infrequent control-plane operations such as cache invalidation, configuration reload, or state synchronization across replicas. Do not call it on the hot request path: doing so sends one request per replica on every call, with no rate limiting.

Example:

handle = serve.get_deployment_handle("MyDeployment", "app")

# Call reset_cache on every replica and collect results.
response = handle.broadcast("reset_cache")
results = response.results()

# Pass arguments to the broadcast call.
response = handle.broadcast("update_config", new_value=42)
results = response.results()

Parameters:
  • method_name – The name of the method to call on each replica.

  • *args – Positional arguments passed to the method.

  • **kwargs – Keyword arguments passed to the method.

Returns:

A DeploymentBroadcastResponse that can be used to collect results from all replicas.