- AlgorithmConfig.fault_tolerance(recreate_failed_workers: Optional[bool] = <ray.rllib.utils.from_config._NotProvided object>, max_num_worker_restarts: Optional[int] = <ray.rllib.utils.from_config._NotProvided object>, delay_between_worker_restarts_s: Optional[float] = <ray.rllib.utils.from_config._NotProvided object>, restart_failed_sub_environments: Optional[bool] = <ray.rllib.utils.from_config._NotProvided object>, num_consecutive_worker_failures_tolerance: Optional[int] = <ray.rllib.utils.from_config._NotProvided object>, worker_health_probe_timeout_s: int = <ray.rllib.utils.from_config._NotProvided object>, worker_restore_timeout_s: int = <ray.rllib.utils.from_config._NotProvided object>)
Sets the config’s fault tolerance settings.
recreate_failed_workers – Whether, upon a worker failure, RLlib will try to recreate the lost worker as an identical copy of the failed one. The new worker will only differ from the failed one in its self.recreated_worker=True property value. It will have the same worker_index as the original one. If True, the ignore_worker_failures setting will be ignored.
max_num_worker_restarts – The maximum number of times a worker is allowed to be restarted (if recreate_failed_workers is True).
delay_between_worker_restarts_s – The delay (in seconds) between two consecutive worker restarts (if recreate_failed_workers is True).
restart_failed_sub_environments – If True and any sub-environment (within a vectorized env) throws an error during env stepping, the Sampler will try to restart the faulty sub-environment. This is done without disturbing the other (still intact) sub-environments and without the RolloutWorker crashing.
num_consecutive_worker_failures_tolerance – The number of consecutive times a rollout worker (or evaluation worker) failure is tolerated before finally crashing the Algorithm. Only useful if either ignore_worker_failures or recreate_failed_workers is True. Note that for restart_failed_sub_environments and sub-environment failures, the worker itself is NOT affected and won’t throw any errors, as the flawed sub-environment is silently restarted under the hood.
worker_health_probe_timeout_s – Max amount of time we should spend waiting for health probe calls to finish. Health pings are very cheap, so the default is 1 minute.
worker_restore_timeout_s – Max amount of time we should wait to restore states on recovered worker actors. Default is 30 mins.
Returns this updated AlgorithmConfig object.
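As a rough illustration of how the parameters above fit together, the following sketch collects them in a dict and applies them to a PPO config. It assumes a Ray version whose fault_tolerance() signature matches the one documented here (newer releases may use different parameter names). The names FAULT_TOLERANCE_SETTINGS and build_config are hypothetical helpers, and the values are illustrative rather than recommended; only the two timeouts mirror the defaults stated above.

```python
# Keyword arguments mirror the parameter names in the signature above.
# All values are illustrative assumptions, not recommendations.
FAULT_TOLERANCE_SETTINGS = {
    # Recreate a failed worker as an identical copy (same worker_index).
    "recreate_failed_workers": True,
    # Upper bound on restarts per worker (assumed value).
    "max_num_worker_restarts": 100,
    # Seconds to wait between two consecutive restarts (assumed value).
    "delay_between_worker_restarts_s": 30.0,
    # Restart a faulty sub-environment without crashing the worker.
    "restart_failed_sub_environments": True,
    # Consecutive worker failures tolerated before the Algorithm crashes
    # (assumed value).
    "num_consecutive_worker_failures_tolerance": 3,
    # 1 minute, matching the default described above.
    "worker_health_probe_timeout_s": 60,
    # 30 minutes, matching the default described above.
    "worker_restore_timeout_s": 1800,
}


def build_config():
    """Return a PPOConfig with the fault tolerance settings applied."""
    # Imported lazily so the settings dict can be inspected without Ray.
    from ray.rllib.algorithms.ppo import PPOConfig

    # fault_tolerance() returns the updated config object, so it can be
    # chained with other config calls.
    return PPOConfig().fault_tolerance(**FAULT_TOLERANCE_SETTINGS)
```

Calling build_config() requires Ray with RLlib installed. Keeping the settings in a plain dict is only a convenience for this sketch; passing the keyword arguments directly to fault_tolerance() should be equivalent.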