ray.rllib.algorithms.algorithm_config.AlgorithmConfig.fault_tolerance
- AlgorithmConfig.fault_tolerance(*, restart_failed_env_runners: bool | None = <ray.rllib.utils.from_config._NotProvided object>, ignore_env_runner_failures: bool | None = <ray.rllib.utils.from_config._NotProvided object>, max_num_env_runner_restarts: int | None = <ray.rllib.utils.from_config._NotProvided object>, delay_between_env_runner_restarts_s: float | None = <ray.rllib.utils.from_config._NotProvided object>, restart_failed_sub_environments: bool | None = <ray.rllib.utils.from_config._NotProvided object>, num_consecutive_env_runner_failures_tolerance: int | None = <ray.rllib.utils.from_config._NotProvided object>, env_runner_health_probe_timeout_s: float | None = <ray.rllib.utils.from_config._NotProvided object>, env_runner_restore_timeout_s: float | None = <ray.rllib.utils.from_config._NotProvided object>, recreate_failed_env_runners=-1, ignore_worker_failures=-1, recreate_failed_workers=-1, max_num_worker_restarts=-1, delay_between_worker_restarts_s=-1, num_consecutive_worker_failures_tolerance=-1, worker_health_probe_timeout_s=-1, worker_restore_timeout_s=-1)
Sets the config’s fault tolerance settings.
- Parameters:
restart_failed_env_runners – Whether, upon an EnvRunner failure, RLlib tries to restart the lost EnvRunner(s) as identical copies of the failed one(s). Set this to True when training on spot instances that may be preempted at any time. The recreated EnvRunner(s) differ from the failed one(s) only in their self.recreated_worker=True property value and keep the same worker_index as the original(s). If this setting is True, the value of the ignore_env_runner_failures setting is ignored.
ignore_env_runner_failures – Whether to ignore any EnvRunner failures and continue running with the remaining EnvRunners. This setting is ignored if restart_failed_env_runners=True.
max_num_env_runner_restarts – The maximum number of times any EnvRunner is allowed to be restarted (if restart_failed_env_runners is True).
delay_between_env_runner_restarts_s – The delay (in seconds) between two consecutive EnvRunner restarts (if restart_failed_env_runners is True).
restart_failed_sub_environments – If True and any sub-environment (within a vectorized env) throws an error during env stepping, the Sampler tries to restart the faulty sub-environment. This is done without disturbing the other (still intact) sub-environments and without the EnvRunner crashing.
num_consecutive_env_runner_failures_tolerance – The number of consecutive times an EnvRunner failure (also for evaluation) is tolerated before finally crashing the Algorithm. Only useful if either ignore_env_runner_failures or restart_failed_env_runners is True. Note that with restart_failed_sub_environments and sub-environment failures, the EnvRunner itself is NOT affected and won’t throw any errors, as the flawed sub-environment is silently restarted under the hood.
env_runner_health_probe_timeout_s – Maximum amount of time, in seconds, to wait for EnvRunner health probe calls (EnvRunner.ping.remote()) to respond. Health pings are very cheap; however, the health check is performed via a blocking ray.get(), so this timeout should not be too large.
env_runner_restore_timeout_s – Maximum amount of time to wait when restoring states on recovered EnvRunner actors. Default is 30 minutes.
- Returns:
This updated AlgorithmConfig object.