ray.rllib.algorithms.algorithm_config.AlgorithmConfig.fault_tolerance
- AlgorithmConfig.fault_tolerance(ignore_env_runner_failures: bool | None = <ray.rllib.utils.from_config._NotProvided object>, recreate_failed_env_runners: bool | None = <ray.rllib.utils.from_config._NotProvided object>, max_num_env_runner_restarts: int | None = <ray.rllib.utils.from_config._NotProvided object>, delay_between_env_runner_restarts_s: float | None = <ray.rllib.utils.from_config._NotProvided object>, restart_failed_sub_environments: bool | None = <ray.rllib.utils.from_config._NotProvided object>, num_consecutive_env_runner_failures_tolerance: int | None = <ray.rllib.utils.from_config._NotProvided object>, env_runner_health_probe_timeout_s: float | None = <ray.rllib.utils.from_config._NotProvided object>, env_runner_restore_timeout_s: float | None = <ray.rllib.utils.from_config._NotProvided object>, ignore_worker_failures=-1, recreate_failed_workers=-1, max_num_worker_restarts=-1, delay_between_worker_restarts_s=-1, num_consecutive_worker_failures_tolerance=-1, worker_health_probe_timeout_s=-1, worker_restore_timeout_s=-1)
Sets the config’s fault tolerance settings.
- Parameters:
ignore_env_runner_failures – Whether to ignore any EnvRunner failures and continue running with the remaining EnvRunners. This setting is ignored if recreate_failed_env_runners=True.
recreate_failed_env_runners – Whether, upon an EnvRunner failure, RLlib tries to recreate the lost EnvRunner as an identical copy of the failed one. The new EnvRunner differs from the failed one only in its self.recreated_worker=True property value. It has the same worker_index as the original one. If True, the ignore_env_runner_failures setting is ignored.
max_num_env_runner_restarts – The maximum number of times any EnvRunner is allowed to be restarted (if recreate_failed_env_runners is True).
delay_between_env_runner_restarts_s – The delay (in seconds) between two consecutive EnvRunner restarts (if recreate_failed_env_runners is True).
restart_failed_sub_environments – If True and any sub-environment (within a vectorized env) throws any error during env stepping, the Sampler tries to restart the faulty sub-environment. This is done without disturbing the other (still intact) sub-environments and without the EnvRunner crashing.
num_consecutive_env_runner_failures_tolerance – The number of consecutive times an EnvRunner failure (also for evaluation) is tolerated before finally crashing the Algorithm. Only useful if either ignore_env_runner_failures or recreate_failed_env_runners is True. Note that for restart_failed_sub_environments and sub-environment failures, the EnvRunner itself is NOT affected and won’t throw any errors, as the flawed sub-environment is silently restarted under the hood.
env_runner_health_probe_timeout_s – Maximum amount of time, in seconds, to spend waiting for EnvRunner health probe calls (EnvRunner.ping.remote()) to respond. Health pings are very cheap; however, the health check is performed via a blocking ray.get(), so the default value should not be too large.
env_runner_restore_timeout_s – Maximum amount of time to wait to restore states on recovered EnvRunner actors. Default is 30 mins.
- Returns:
This updated AlgorithmConfig object.
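The precedence between ignore_env_runner_failures and recreate_failed_env_runners described above can be illustrated with a small, library-free sketch. The class FaultToleranceSketch and its method action_on_failure are hypothetical names invented for this illustration and are not part of RLlib. The "raise" branch is an assumption inferred from the note that the failure tolerance is only useful when one of the two flags is True.

```python
from dataclasses import dataclass


@dataclass
class FaultToleranceSketch:
    """Hypothetical stand-in for two of the flags above; not an RLlib class."""

    ignore_env_runner_failures: bool = False
    recreate_failed_env_runners: bool = False

    def action_on_failure(self) -> str:
        """Return which documented behavior applies when an EnvRunner fails."""
        # Recreation takes precedence: the ignore flag is disregarded
        # whenever recreate_failed_env_runners is True.
        if self.recreate_failed_env_runners:
            return "recreate"
        # Otherwise, continue with the remaining EnvRunners if asked to.
        if self.ignore_env_runner_failures:
            return "ignore"
        # Assumption: with neither flag set, the failure is not tolerated.
        return "raise"


both = FaultToleranceSketch(
    ignore_env_runner_failures=True,
    recreate_failed_env_runners=True,
)
print(both.action_on_failure())  # recreate
```

Setting both flags therefore behaves the same as setting only recreate_failed_env_runners, which is why the two parameter descriptions each mention the other.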