ray.rllib.algorithms.algorithm_config.AlgorithmConfig.fault_tolerance

AlgorithmConfig.fault_tolerance(recreate_failed_workers: bool | None = <ray.rllib.utils.from_config._NotProvided object>, max_num_worker_restarts: int | None = <ray.rllib.utils.from_config._NotProvided object>, delay_between_worker_restarts_s: float | None = <ray.rllib.utils.from_config._NotProvided object>, restart_failed_sub_environments: bool | None = <ray.rllib.utils.from_config._NotProvided object>, num_consecutive_worker_failures_tolerance: int | None = <ray.rllib.utils.from_config._NotProvided object>, worker_health_probe_timeout_s: int = <ray.rllib.utils.from_config._NotProvided object>, worker_restore_timeout_s: int = <ray.rllib.utils.from_config._NotProvided object>)

Sets the config’s fault tolerance settings.

Parameters:
  • recreate_failed_workers – Whether RLlib tries to recreate a lost worker, upon a worker failure, as an identical copy of the failed one. The new worker differs from the failed one only in its self.recreated_worker=True property value and keeps the same worker_index as the original. If True, the ignore_worker_failures setting is ignored.

  • max_num_worker_restarts – The maximum number of times a worker is allowed to be restarted (if recreate_failed_workers is True).

  • delay_between_worker_restarts_s – The delay (in seconds) between two consecutive worker restarts (if recreate_failed_workers is True).

  • restart_failed_sub_environments – If True and any sub-environment (within a vectorized env) throws an error during env stepping, the Sampler tries to restart the faulty sub-environment. This happens without disturbing the other (still intact) sub-environments and without the EnvRunner crashing.

  • num_consecutive_worker_failures_tolerance – The number of consecutive times a rollout worker (or evaluation worker) failure is tolerated before finally crashing the Algorithm. Only useful if either ignore_worker_failures or recreate_failed_workers is True. Note that with restart_failed_sub_environments, a sub-environment failure does NOT affect the worker itself, which won’t throw any errors because the flawed sub-environment is silently restarted under the hood.

  • worker_health_probe_timeout_s – The maximum time (in seconds) to wait for health probe calls to finish. Health pings are very cheap, so the default is 1 minute.

  • worker_restore_timeout_s – The maximum time (in seconds) to wait for states to be restored on recovered worker actors. The default is 30 minutes.

Returns:

This updated AlgorithmConfig object.
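
Example:

Because the method returns the updated config object, it can be chained with other config calls. The following is a minimal sketch, assuming a Ray release whose signature matches the one documented above (other releases may name these parameters differently). PPOConfig, the CartPole-v1 environment, and all numeric values are illustrative choices, not recommended settings.

```python
from ray.rllib.algorithms.ppo import PPOConfig

# Illustrative values only; tune them for your own cluster and environment.
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .fault_tolerance(
        # Replace a failed worker with an identical copy (same worker_index).
        recreate_failed_workers=True,
        # Upper bound on how often a worker may be restarted.
        max_num_worker_restarts=10,
        # Wait 30 seconds between two consecutive restarts.
        delay_between_worker_restarts_s=30.0,
        # Restart a faulty sub-environment without crashing the worker.
        restart_failed_sub_environments=True,
        # Crash the Algorithm after 3 consecutive worker failures.
        num_consecutive_worker_failures_tolerance=3,
        # Timeouts in seconds: 1 minute for health probes, 30 minutes for restores.
        worker_health_probe_timeout_s=60,
        worker_restore_timeout_s=1800,
    )
)
```

Arguments that are not passed keep their current values in the config, so a call may set only the options that need to change.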