ray.rllib.algorithms.algorithm_config.AlgorithmConfig.fault_tolerance

AlgorithmConfig.fault_tolerance(*, restart_failed_env_runners: bool | None = <ray.rllib.utils.from_config._NotProvided object>, ignore_env_runner_failures: bool | None = <ray.rllib.utils.from_config._NotProvided object>, max_num_env_runner_restarts: int | None = <ray.rllib.utils.from_config._NotProvided object>, delay_between_env_runner_restarts_s: float | None = <ray.rllib.utils.from_config._NotProvided object>, restart_failed_sub_environments: bool | None = <ray.rllib.utils.from_config._NotProvided object>, num_consecutive_env_runner_failures_tolerance: int | None = <ray.rllib.utils.from_config._NotProvided object>, env_runner_health_probe_timeout_s: float | None = <ray.rllib.utils.from_config._NotProvided object>, env_runner_restore_timeout_s: float | None = <ray.rllib.utils.from_config._NotProvided object>, recreate_failed_env_runners=-1, ignore_worker_failures=-1, recreate_failed_workers=-1, max_num_worker_restarts=-1, delay_between_worker_restarts_s=-1, num_consecutive_worker_failures_tolerance=-1, worker_health_probe_timeout_s=-1, worker_restore_timeout_s=-1)

Sets the config’s fault tolerance settings.

Parameters:
  • restart_failed_env_runners – Whether RLlib tries to restart a lost EnvRunner as an identical copy of the failed one. Set this to True when training on spot instances, which may be preempted at any time. A recreated EnvRunner differs from the failed one only in its self.recreated_worker=True property value and keeps the same worker_index as the original. If this setting is True, the value of ignore_env_runner_failures is ignored.

  • ignore_env_runner_failures – Whether to ignore EnvRunner failures and continue running with the remaining EnvRunners. This setting is ignored if restart_failed_env_runners=True.

  • max_num_env_runner_restarts – The maximum number of times any EnvRunner is allowed to be restarted (if restart_failed_env_runners is True).

  • delay_between_env_runner_restarts_s – The delay (in seconds) between two consecutive EnvRunner restarts (if restart_failed_env_runners is True).

  • restart_failed_sub_environments – If True and a sub-environment (within a vectorized env) raises an error during env stepping, the Sampler tries to restart the faulty sub-environment. The restart does not disturb the other, still intact sub-environments, and the EnvRunner does not crash.

  • num_consecutive_env_runner_failures_tolerance – The number of consecutive EnvRunner failures tolerated before the Algorithm crashes. This also applies to evaluation EnvRunners. Only useful if either ignore_env_runner_failures or restart_failed_env_runners is True. Note that sub-environment failures handled through restart_failed_sub_environments do NOT affect the EnvRunner itself: it throws no error, because the faulty sub-environment is silently restarted under the hood.

  • env_runner_health_probe_timeout_s – Maximum time, in seconds, to wait for EnvRunner health probe calls (EnvRunner.ping.remote()) to respond. Health pings are very cheap, but the health check is performed via a blocking ray.get(), so this value should not be too large.

  • env_runner_restore_timeout_s – Maximum time, in seconds, to wait for states to be restored on recovered EnvRunner actors. The default is 30 minutes.
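
The interplay between restart_failed_env_runners, ignore_env_runner_failures, and num_consecutive_env_runner_failures_tolerance can be pictured with a small stand-alone model. This is only an illustrative sketch, not RLlib's implementation: FailurePolicy and ConsecutiveFailureTracker are hypothetical names, the "raise" outcome when both flags are off is an assumption, and so is the exact boundary at which the tolerance counts as exceeded.

```python
from dataclasses import dataclass


@dataclass
class FailurePolicy:
    """Hypothetical stand-in for two of the flags documented above."""

    restart_failed_env_runners: bool = False
    ignore_env_runner_failures: bool = False

    def action_on_failure(self) -> str:
        # Restarting takes precedence: when it is enabled,
        # the ignore flag has no effect.
        if self.restart_failed_env_runners:
            return "restart"
        if self.ignore_env_runner_failures:
            return "ignore"
        # Assumption: with both flags off, a failure is not tolerated.
        return "raise"


class ConsecutiveFailureTracker:
    """Counts failures in a row; any success resets the count."""

    def __init__(self, tolerance: int):
        self.tolerance = tolerance
        self.consecutive = 0

    def record(self, ok: bool) -> bool:
        """Record one outcome and return True while still within tolerance.

        Assumption: the limit counts as exceeded once the number of
        consecutive failures is greater than `tolerance`.
        """
        self.consecutive = 0 if ok else self.consecutive + 1
        return self.consecutive <= self.tolerance
```

In this model, FailurePolicy(restart_failed_env_runners=True, ignore_env_runner_failures=True).action_on_failure() yields "restart", which mirrors the rule that the ignore flag is disregarded whenever restarts are enabled.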

Returns:

This updated AlgorithmConfig object.
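
Because the method returns the updated config, it can be chained with other config calls. The helper below is a minimal sketch of a setup aimed at preemptible machines. make_fault_tolerant is a hypothetical name, and every numeric value is an illustrative choice rather than a recommended or default setting. The helper only assumes that its argument exposes the fault_tolerance() method documented here.

```python
def make_fault_tolerant(config):
    """Apply restart-oriented fault tolerance settings to an RLlib config.

    `config` is expected to be an AlgorithmConfig (or a subclass).
    All numbers below are illustrative assumptions, not library defaults.
    """
    return config.fault_tolerance(
        # Recreate lost EnvRunners; this makes ignore_env_runner_failures moot.
        restart_failed_env_runners=True,
        max_num_env_runner_restarts=100,
        delay_between_env_runner_restarts_s=30.0,
        # Restart single faulty sub-environments without crashing the EnvRunner.
        restart_failed_sub_environments=True,
        num_consecutive_env_runner_failures_tolerance=10,
        # Kept short because the health check blocks on ray.get().
        env_runner_health_probe_timeout_s=15.0,
        env_runner_restore_timeout_s=600.0,
    )
```

With RLlib installed, this could be called as make_fault_tolerant(PPOConfig()), where PPOConfig comes from ray.rllib.algorithms.ppo. The result is the same config object, ready for further chained calls.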