ray.rllib.algorithms.algorithm_config.AlgorithmConfig.fault_tolerance

AlgorithmConfig.fault_tolerance(ignore_env_runner_failures: bool | None = <ray.rllib.utils.from_config._NotProvided object>, recreate_failed_env_runners: bool | None = <ray.rllib.utils.from_config._NotProvided object>, max_num_env_runner_restarts: int | None = <ray.rllib.utils.from_config._NotProvided object>, delay_between_env_runner_restarts_s: float | None = <ray.rllib.utils.from_config._NotProvided object>, restart_failed_sub_environments: bool | None = <ray.rllib.utils.from_config._NotProvided object>, num_consecutive_env_runner_failures_tolerance: int | None = <ray.rllib.utils.from_config._NotProvided object>, env_runner_health_probe_timeout_s: float | None = <ray.rllib.utils.from_config._NotProvided object>, env_runner_restore_timeout_s: float | None = <ray.rllib.utils.from_config._NotProvided object>, ignore_worker_failures=-1, recreate_failed_workers=-1, max_num_worker_restarts=-1, delay_between_worker_restarts_s=-1, num_consecutive_worker_failures_tolerance=-1, worker_health_probe_timeout_s=-1, worker_restore_timeout_s=-1)

Sets the config’s fault tolerance settings.

Parameters:
  • ignore_env_runner_failures – Whether to ignore any EnvRunner failures and continue running with the remaining EnvRunners. This setting is ignored if recreate_failed_env_runners=True.

  • recreate_failed_env_runners – Whether RLlib, upon an EnvRunner failure, tries to recreate the lost EnvRunner as an identical copy of the failed one. The new EnvRunner differs from the failed one only in its self.recreated_worker=True property value and keeps the same worker_index as the original. If True, the ignore_env_runner_failures setting is ignored.

  • max_num_env_runner_restarts – The maximum number of times any EnvRunner is allowed to be restarted (if recreate_failed_env_runners is True).

  • delay_between_env_runner_restarts_s – The delay (in seconds) between two consecutive EnvRunner restarts (if recreate_failed_env_runners is True).

  • restart_failed_sub_environments – If True and any sub-environment (within a vectorized env) throws an error during env stepping, the Sampler will try to restart the faulty sub-environment. This is done without disturbing the other (still intact) sub-environments and without the EnvRunner crashing.

  • num_consecutive_env_runner_failures_tolerance – The number of consecutive times an EnvRunner failure (also for evaluation) is tolerated before finally crashing the Algorithm. Only useful if either ignore_env_runner_failures or recreate_failed_env_runners is True. Note that for restart_failed_sub_environments and sub-environment failures, the EnvRunner itself is NOT affected and won’t throw any errors as the flawed sub-environment is silently restarted under the hood.

  • env_runner_health_probe_timeout_s – Max amount of time (in seconds) to wait for EnvRunner health probe calls (EnvRunner.ping.remote()) to respond. Health pings are very cheap, but the health check is performed via a blocking ray.get(), so the default value should not be too large.

  • env_runner_restore_timeout_s – Max amount of time (in seconds) to wait when restoring states on recovered EnvRunner actors. Default is 30 minutes.

Returns:

This updated AlgorithmConfig object.