class ray.air.FailureConfig(max_failures: int = 0, fail_fast: Union[bool, str] = False)

Bases: object

Configuration related to failure handling of each training/tuning run.

  • max_failures – Tries to recover a run at least this many times. Will recover from the latest checkpoint if present. Setting to -1 will lead to infinite recovery retries. Setting to 0 will disable retries. Defaults to 0.

  • fail_fast – Whether to fail upon the first error. Only used for Ray Tune; this does not apply to single training runs (e.g. with Trainer.fit()). If fail_fast="raise" is provided, Ray Tune will automatically raise the exception received by the Trainable. fail_fast="raise" can easily leak resources and should be used with caution (it is best used with ray.init(local_mode=True)).
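
The max_failures values described above can be illustrated with a small retry loop. This is a minimal sketch in plain Python that does not use Ray: run_with_retries and make_flaky_run are hypothetical names, and the loop is an assumption about how the documented values (0 disables retries, -1 retries indefinitely) might behave, not Ray's actual implementation. Checkpoint recovery is not modelled.

```python
from typing import Callable


def run_with_retries(run: Callable[[int], str], max_failures: int = 0) -> str:
    """Call run(attempt) until it succeeds or the retry budget is spent.

    Hypothetical helper, not part of Ray. max_failures=0 disables retries;
    max_failures=-1 retries indefinitely.
    """
    failures = 0
    attempt = 0
    while True:
        try:
            return run(attempt)
        except RuntimeError:
            failures += 1
            # Give up once the number of failures exceeds the budget.
            # A negative budget (-1) never triggers this branch.
            if max_failures >= 0 and failures > max_failures:
                raise
            attempt += 1


def make_flaky_run(fail_times: int) -> Callable[[int], str]:
    """Build a run that raises on its first `fail_times` attempts."""
    def flaky_run(attempt: int) -> str:
        if attempt < fail_times:
            raise RuntimeError(f"attempt {attempt} failed")
        return f"succeeded on attempt {attempt}"
    return flaky_run


if __name__ == "__main__":
    # Two failures fit within a budget of two retries.
    print(run_with_retries(make_flaky_run(2), max_failures=2))
```

In Ray itself, a FailureConfig instance is typically handed to a run through RunConfig(failure_config=...) rather than driving a loop like this directly.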

PublicAPI (beta): This API is in beta and may change before becoming stable.