ray.train.FailureConfig
- class ray.train.FailureConfig(max_failures: int = 0, fail_fast: bool | str = 'DEPRECATED', controller_failure_limit: int = -1)
Bases: FailureConfig
Configuration related to failure handling of each training run.
- Parameters:
max_failures – Tries to recover a run from training worker errors up to this many times. Each recovery resumes from the latest checkpoint if one is present. Setting to -1 allows unlimited recovery retries. Setting to 0 disables retries. Defaults to 0.
controller_failure_limit – [DeveloperAPI] The maximum number of controller failures to tolerate. Setting to -1 will lead to infinite controller retries. Setting to 0 will disable controller retries. Defaults to -1.
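The retry semantics of max_failures described above can be illustrated with a minimal, Ray-free sketch. The helper `run_with_retries` and the stand-in `flaky_training` are hypothetical names invented for this illustration. They are not part of Ray and do not reflect how Ray implements recovery internally. The sketch only mirrors the documented meaning of the values 0, -1, and a positive count.

```python
from typing import Callable


def run_with_retries(attempt: Callable[[int], str], max_failures: int = 0) -> str:
    """Hypothetical helper: call `attempt` until it succeeds or the budget is spent.

    `max_failures` follows the documented meaning: 0 disables retries,
    -1 retries without limit, and N > 0 tolerates up to N failures.
    """
    failures = 0
    while True:
        try:
            # The attempt receives the number of failures seen so far.
            return attempt(failures)
        except RuntimeError:
            failures += 1
            # A budget of -1 never runs out; otherwise re-raise once
            # the number of failures exceeds the budget.
            if max_failures != -1 and failures > max_failures:
                raise


def flaky_training(failures_so_far: int) -> str:
    """Stand-in for a training run that fails twice, then succeeds."""
    if failures_so_far < 2:
        raise RuntimeError("simulated worker failure")
    return "finished"


# Two failures are tolerated, so the third attempt completes the run.
result = run_with_retries(flaky_training, max_failures=2)
print(result)
```

With the default of 0, the same stand-in run would surface its first error immediately. In Ray itself, the corresponding setting would be written as `FailureConfig(max_failures=2)` and is typically passed to a trainer through `RunConfig(failure_config=...)`.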
Methods
Attributes