ray.serve.autoscaling_policy.async_inference_autoscaling_policy

ray.serve.autoscaling_policy.async_inference_autoscaling_policy(ctx: AutoscalingContext) -> Tuple[int | float, Dict[str, Any]]

Autoscaling policy for async inference workloads.

Scales replicas based on the total workload, which includes both HTTP requests and the async inference task queue length.
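A policy with this shape is a callable that takes a context and returns a replica target plus a metadata dict. The sketch below illustrates the "scale on total workload" idea only; `SimpleContext` and its fields (`running_requests`, `queue_length`, `target_per_replica`) are hypothetical stand-ins for illustration, not the actual attributes of Ray Serve's `AutoscalingContext`:

```python
import math
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass
class SimpleContext:
    # Hypothetical stand-in for Ray Serve's AutoscalingContext.
    running_requests: int      # in-flight HTTP requests
    queue_length: int          # pending async inference tasks
    target_per_replica: float  # desired concurrent work units per replica


def total_workload_policy(ctx: SimpleContext) -> Tuple[int, Dict[str, Any]]:
    # Total workload combines live HTTP requests with the async task queue,
    # mirroring the behavior described above.
    total = ctx.running_requests + ctx.queue_length
    target = max(1, math.ceil(total / ctx.target_per_replica))
    return target, {"total_workload": total}


target, info = total_workload_policy(
    SimpleContext(running_requests=10, queue_length=30, target_per_replica=8.0)
)
# target == 5: 40 total work units / 8 per replica
```

Returning the metadata dict alongside the target lets the autoscaler record why a decision was made, which matches the `Tuple[int | float, Dict[str, Any]]` return type in the signature.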

PublicAPI (beta): This API is in beta and may change before becoming stable.