ray.serve.config.AutoscalingConfig#

pydantic model ray.serve.config.AutoscalingConfig[source]#

Config for the Serve Autoscaler.

This class configures how Ray Serve scales a deployment’s replicas up and down in response to traffic. The autoscaler periodically aggregates request metrics over a look-back window, compares them to target_ongoing_requests, and adjusts the replica count between min_replicas and max_replicas. upscale_delay_s and downscale_delay_s control how quickly the autoscaler reacts to traffic changes, while upscaling_factor and downscaling_factor dampen the magnitude of each scaling decision.

For an end-to-end guide, see the Serve autoscaling guide and the advanced autoscaling guide.
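As a rough illustration of how these fields fit together, the sketch below writes an autoscaling configuration as a plain dictionary keyed by the field names documented on this page. The specific values are illustrative assumptions, not recommendations.

```python
# Illustrative values only; tune them for your own workload.
autoscaling_config = {
    "min_replicas": 1,             # keep one replica warm at all times
    "max_replicas": 10,            # upper bound on the replica count
    "target_ongoing_requests": 2,  # desired in-flight requests per replica
    "upscale_delay_s": 30.0,       # wait before acting on an upscale decision
    "downscale_delay_s": 600.0,    # wait before acting on a downscale decision
}

# Sanity check mirroring the documented constraint max_replicas >= min_replicas.
assert autoscaling_config["max_replicas"] >= autoscaling_config["min_replicas"]

# With Ray Serve installed, a dictionary like this is typically passed as
#   @serve.deployment(autoscaling_config=autoscaling_config)
```

Only the fields that differ from their defaults need to appear; the defaults listed below apply to everything else.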

field aggregation_function: str | AggregationFunction = AggregationFunction.MEAN#

Function used to aggregate metrics over the look-back window (see look_back_period_s).

field downscale_delay_s: float = 600.0#

How long, in seconds, to wait before scaling down replicas to a value greater than 0.

Constraints:
  • ge = 0

field downscale_smoothing_factor: float | None = None#

[DEPRECATED] Please use downscaling_factor instead.

field downscale_to_zero_delay_s: float | None = None#

How long, in seconds, to wait before scaling down replicas from 1 to 0. If not set, the value of downscale_delay_s is used.
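The fallback between the two downscale delays can be sketched as follows. The helper name `effective_downscale_delay_s` is hypothetical and is not part of Ray Serve; it only restates the rule described above.

```python
from typing import Optional


def effective_downscale_delay_s(
    target_replicas: int,
    downscale_delay_s: float = 600.0,
    downscale_to_zero_delay_s: Optional[float] = None,
) -> float:
    """Pick the delay that applies to a proposed downscale (hypothetical helper)."""
    # The scale-to-zero delay only applies when the proposed target is 0
    # and the field has been set explicitly.
    if target_replicas == 0 and downscale_to_zero_delay_s is not None:
        return downscale_to_zero_delay_s
    # Otherwise fall back to the general downscale delay.
    return downscale_delay_s
```

For instance, with `downscale_to_zero_delay_s=120.0`, a proposed drop to 0 replicas would wait 120 seconds, while a drop to 3 replicas would still wait the full `downscale_delay_s`.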

field downscaling_factor: float | None = None#

Multiplicative “gain” factor applied to each downscaling decision to dampen its magnitude.

field initial_replicas: int | None = None#

The number of replicas started when the deployment is first deployed. If not set, defaults to the value of min_replicas.

field look_back_period_s: float = 30.0#

Time window, in seconds, over which metrics are aggregated.

Constraints:
  • gt = 0
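To make the look-back window concrete, here is a minimal sketch of aggregating timestamped metric samples with a mean, which corresponds to the default aggregation_function. The function name, the `(timestamp, value)` sample format, and the choice to return 0.0 for an empty window are all assumptions made for illustration, not Ray Serve internals.

```python
from statistics import mean
from typing import List, Tuple


def aggregate_window(
    samples: List[Tuple[float, float]],
    now: float,
    look_back_period_s: float = 30.0,
) -> float:
    """Mean of the values whose timestamp falls inside the look-back window."""
    if look_back_period_s <= 0:
        raise ValueError("look_back_period_s must be greater than 0")
    # Keep only samples recorded within the last look_back_period_s seconds.
    recent = [
        value for ts, value in samples if now - look_back_period_s <= ts <= now
    ]
    # Assumption: an empty window is treated as zero load.
    return mean(recent) if recent else 0.0


# (timestamp in seconds, ongoing requests) pairs; the first sample is too old.
samples = [(0.0, 8.0), (40.0, 2.0), (50.0, 4.0), (60.0, 6.0)]
windowed_mean = aggregate_window(samples, now=60.0)
```

A longer window smooths out short spikes at the cost of reacting more slowly to sustained changes in traffic.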

field max_replicas: int = 1#

The maximum number of replicas for the deployment. Must be greater than or equal to min_replicas. Ray Serve relies on the Ray Autoscaler to add cluster nodes when existing nodes lack the resources (CPUs, GPUs, etc.) needed to schedule additional replicas.

Constraints:
  • gt = 0

field metrics_interval_s: float = 10.0#

[DEPRECATED] How often, in seconds, to scrape for metrics. This field will be replaced by the environment variables RAY_SERVE_REPLICA_AUTOSCALING_METRIC_PUSH_INTERVAL_S and RAY_SERVE_HANDLE_AUTOSCALING_METRIC_PUSH_INTERVAL_S in a future release.

Constraints:
  • gt = 0

field min_replicas: int = 1#

The minimum number of replicas for the deployment. Set this to a positive value to keep replicas ready for traffic at all times, or set it to 0 to allow scaling to zero when there is no traffic. Scaling to zero reduces cost but introduces cold-start latency when traffic resumes.

Constraints:
  • ge = 0

field policy: AutoscalingPolicy [Optional]#

The autoscaling policy for the deployment.

field smoothing_factor: float = 1.0#

[DEPRECATED] Smoothing factor for autoscaling decisions.

Constraints:
  • gt = 0

field target_ongoing_requests: float | None = 2#

The target number of requests being processed and queued per replica. Serve scales the replica count up or down to keep each replica close to this value. Lower values reduce per-replica load and tail latency at the cost of running more replicas; higher values pack more traffic onto each replica. Defaults to 2.
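The relationship between load, target_ongoing_requests, and the replica bounds can be sketched with some simple arithmetic. This is a simplified model that ignores delays and gain factors; the function name `desired_replicas` is hypothetical, and the formula is an assumption rather than Ray Serve's exact policy.

```python
import math


def desired_replicas(
    total_ongoing_requests: float,
    target_ongoing_requests: float = 2.0,
    min_replicas: int = 1,
    max_replicas: int = 1,
) -> int:
    """Replica count that keeps per-replica load near the target (simplified)."""
    # Enough replicas so that each one handles at most the target load.
    raw = math.ceil(total_ongoing_requests / target_ongoing_requests)
    # Clamp the result into the configured [min_replicas, max_replicas] range.
    return max(min_replicas, min(max_replicas, raw))
```

Under this model, 7 ongoing requests with a target of 2 and bounds of 1 to 10 call for 4 replicas, while 25 ongoing requests are capped at the 10-replica maximum.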

field upscale_delay_s: float = 30.0#

How long, in seconds, to wait before scaling up replicas.

Constraints:
  • ge = 0

field upscale_smoothing_factor: float | None = None#

[DEPRECATED] Please use upscaling_factor instead.

field upscaling_factor: float | None = None#

Multiplicative “gain” factor applied to each upscaling decision to dampen its magnitude.
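One plausible reading of a multiplicative gain factor is sketched below: the factor scales how far the observed per-replica load deviates from the target before that deviation is applied to the current replica count. The function name `apply_gain` and the formula itself are assumptions for illustration and are not confirmed to match Ray Serve's implementation.

```python
import math


def apply_gain(
    current_replicas: int,
    ongoing_per_replica: float,
    target_ongoing_requests: float,
    factor: float = 1.0,
) -> int:
    """Scale the replica count by a damped load-to-target ratio (assumed formula)."""
    # Ratio > 1 means replicas are overloaded; ratio < 1 means they are underused.
    error_ratio = ongoing_per_replica / target_ongoing_requests
    # The factor shrinks (factor < 1) or amplifies (factor > 1) the deviation from 1.
    damped_ratio = 1 + (error_ratio - 1) * factor
    return math.ceil(current_replicas * damped_ratio)
```

With 4 replicas each seeing 6 ongoing requests against a target of 2, a factor of 1.0 proposes 12 replicas, while a factor of 0.5 proposes 8, so the deployment approaches the target more gradually.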

validator aggregation_function_valid  »  aggregation_function[source]#
classmethod default()[source]#
get_downscaling_factor() → float[source]#
get_target_ongoing_requests() → float[source]#
get_upscaling_factor() → float[source]#
validator look_back_period_s_valid  »  look_back_period_s[source]#
validator metrics_interval_s_deprecation_warning  »  metrics_interval_s[source]#
validator replicas_settings_valid  »  all fields[source]#