ray.serve.autoscaling_policy.async_inference_autoscaling_policy#
- ray.serve.autoscaling_policy.async_inference_autoscaling_policy(ctx: AutoscalingContext) → Tuple[int | float, Dict[str, Any]][source]#
Autoscaling policy for async inference workloads.
Scales replicas based on the total workload, which includes both HTTP requests and the async inference task queue length.
PublicAPI (beta): This API is in beta and may change before becoming stable.
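To illustrate the shape of such a policy, here is a minimal self-contained sketch of a function with the documented signature: it takes a context and returns a target replica count plus a metadata dict. The `AutoscalingContext` stand-in below and its field names (`running_requests`, `queued_tasks`, `target_ongoing_requests`, `current_num_replicas`) are illustrative assumptions, not the actual Ray Serve attributes, and the scaling rule is a simplified guess at "total workload divided by per-replica target", not Ray's implementation.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple


# Hypothetical stand-in for ray.serve's AutoscalingContext; these field
# names are assumptions for illustration only.
@dataclass
class AutoscalingContext:
    running_requests: float       # in-flight HTTP requests across replicas
    queued_tasks: int             # async inference task queue length
    target_ongoing_requests: float  # desired workload units per replica
    current_num_replicas: int


def async_inference_policy_sketch(
    ctx: AutoscalingContext,
) -> Tuple[int, Dict[str, Any]]:
    """Scale on total workload: HTTP requests plus queued async tasks."""
    total_workload = ctx.running_requests + ctx.queued_tasks
    # One replica per `target_ongoing_requests` units of workload,
    # never scaling below a single replica.
    desired = max(1, round(total_workload / ctx.target_ongoing_requests))
    metadata = {"total_workload": total_workload}
    return desired, metadata


# Example: 8 in-flight requests and 12 queued tasks, with a target of 5
# workload units per replica, suggests 4 replicas.
ctx = AutoscalingContext(
    running_requests=8,
    queued_tasks=12,
    target_ongoing_requests=5,
    current_num_replicas=2,
)
print(async_inference_policy_sketch(ctx))  # → (4, {'total_workload': 20})
```

The returned metadata dict gives the policy a place to surface the inputs behind its decision, which is useful when debugging why a deployment scaled up or down.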