Note

Ray 2.40 uses RLlib’s new API stack by default. The Ray team has mostly completed transitioning algorithms, example scripts, and documentation to the new code base.

If you’re still using the old API stack, see New API stack migration guide for details on how to migrate.

Algorithm Configuration API#

Constructor#

AlgorithmConfig

An RLlib AlgorithmConfig builds an RLlib Algorithm from a given configuration.

Builder methods#

build_algo

Builds an Algorithm from this AlgorithmConfig (or a copy thereof).

build_learner_group

Builds and returns a new LearnerGroup object based on settings in self.

build_learner

Builds and returns a new Learner object based on settings in self.
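
For illustration, a minimal sketch of the builder workflow, assuming the PPO algorithm and the Gymnasium env "CartPole-v1" (both are assumptions, not requirements of this API):

    from ray.rllib.algorithms.ppo import PPOConfig

    # Assemble a config through the chained configuration methods described below.
    config = (
        PPOConfig()
        .environment("CartPole-v1")
        .training(lr=0.0003)
    )

    # build_algo() constructs the Algorithm (EnvRunners, Learners, etc.)
    # from this config (or a copy thereof).
    algo = config.build_algo()
    print(algo.train())  # run one training iteration
    algo.stop()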

Properties#

is_multi_agent

Returns whether this config specifies a multi-agent setup.

is_offline

Returns whether this config is for offline RL.

learner_class

Returns the Learner sub-class that this Algorithm uses.

model_config

Defines the model configuration used.

rl_module_spec

total_train_batch_size
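
A short sketch of reading some of these properties (PPO and the concrete values are assumptions for illustration):

    from ray.rllib.algorithms.ppo import PPOConfig

    config = (
        PPOConfig()
        .environment("CartPole-v1")
        .training(train_batch_size_per_learner=2000)
        .learners(num_learners=2)
    )

    # Properties are read directly from the config object.
    print(config.is_multi_agent)          # -> False (no multi-agent setup configured)
    print(config.total_train_batch_size)  # -> 4000 (2 Learners x 2000 per Learner)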

Getter methods#

get_default_learner_class

Returns the Learner class to use for this algorithm.

get_default_rl_module_spec

Returns the RLModule spec to use for this algorithm.

get_evaluation_config_object

Creates a full AlgorithmConfig object from self.evaluation_config.

get_multi_rl_module_spec

Returns the MultiRLModuleSpec based on the given env/spaces.

get_multi_agent_setup

Compiles complete multi-agent config (dict) from the information in self.

get_rollout_fragment_length

Automatically infers a proper rollout_fragment_length setting if "auto".

Public methods#

copy

Creates a deep copy of this config and (un)freezes if necessary.

validate

Validates all values in this config.

freeze

Freezes this config object, such that no attributes can be set anymore.
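
A minimal sketch of copy, validate, and freeze (the PPOConfig import is an assumption):

    from ray.rllib.algorithms.ppo import PPOConfig

    config = PPOConfig().environment("CartPole-v1")

    # Deep-copy the config, validate all of its settings, then freeze the
    # copy so that no attributes can be set on it anymore.
    frozen_copy = config.copy()
    frozen_copy.validate()
    frozen_copy.freeze()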

Configuration methods#

Configuring the RL Environment#

AlgorithmConfig.environment(env: str | ~typing.Any | gymnasium.Env | None = <ray.rllib.utils.from_config._NotProvided object>, *, env_config: dict | None = <ray.rllib.utils.from_config._NotProvided object>, observation_space: gymnasium.spaces.Space | None = <ray.rllib.utils.from_config._NotProvided object>, action_space: gymnasium.spaces.Space | None = <ray.rllib.utils.from_config._NotProvided object>, render_env: bool | None = <ray.rllib.utils.from_config._NotProvided object>, clip_rewards: bool | float | None = <ray.rllib.utils.from_config._NotProvided object>, normalize_actions: bool | None = <ray.rllib.utils.from_config._NotProvided object>, clip_actions: bool | None = <ray.rllib.utils.from_config._NotProvided object>, disable_env_checking: bool | None = <ray.rllib.utils.from_config._NotProvided object>, is_atari: bool | None = <ray.rllib.utils.from_config._NotProvided object>, action_mask_key: str | None = <ray.rllib.utils.from_config._NotProvided object>, env_task_fn=-1) AlgorithmConfig[source]

Sets the config’s RL-environment settings.

Parameters:
  • env – The environment specifier. This can either be a tune-registered env, via tune.register_env([name], lambda env_ctx: [env object]), or a string specifier of an RLlib supported type. In the latter case, RLlib tries to interpret the specifier as either a Farama-Foundation gymnasium env, a PyBullet env, or a fully qualified classpath to an Env class, e.g. “ray.rllib.examples.envs.classes.random_env.RandomEnv”.

  • env_config – Arguments dict passed to the env creator as an EnvContext object (which is a dict plus the properties: num_env_runners, worker_index, vector_index, and remote).

  • observation_space – The observation space for the Policies of this Algorithm.

  • action_space – The action space for the Policies of this Algorithm.

  • render_env – If True, try to render the environment on the local worker or on worker 1 (if num_env_runners > 0). For vectorized envs, this usually means that only the first sub-environment is rendered. In order for this to work, your env has to implement the render() method which either: a) handles window generation and rendering itself (returning True) or b) returns a numpy uint8 image of shape [height x width x 3 (RGB)].

  • clip_rewards – Whether to clip rewards during Policy’s postprocessing. None (default): Clip for Atari only (r=sign(r)). True: r=sign(r): Fixed rewards -1.0, 1.0, or 0.0. False: Never clip. [float value]: Clip at -value and + value. Tuple[value1, value2]: Clip at value1 and value2.

  • normalize_actions – If True, RLlib learns entirely inside a normalized action space (0.0 centered with small stddev; only affecting Box components). RLlib unsquashes actions (and clips them, just in case) to the bounds of the env’s action space before sending actions back to the env.

  • clip_actions – If True, the RLlib default ModuleToEnv connector clips actions according to the env’s bounds (before sending them into the env.step() call).

  • disable_env_checking – Disable RLlib’s env checks after a gymnasium.Env instance has been constructed in an EnvRunner. Note that the checks include an env.reset() and env.step() (with a random action), which might tinker with your env’s logic and behavior and thus negatively influence sample collection and/or learning behavior.

  • is_atari – This config can be used to explicitly specify whether the env is an Atari env or not. If not specified, RLlib tries to auto-detect this.

  • action_mask_key – If the observation is a dictionary, expect the value under the key action_mask_key to contain a valid action mask (numpy.int8 array of zeros and ones). Defaults to “action_mask”.

Returns:

This updated AlgorithmConfig object.
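
A hedged usage sketch (the env ID "CartPole-v1" and PPO are assumptions; any tune-registered env or supported specifier works):

    from ray.rllib.algorithms.ppo import PPOConfig

    config = (
        PPOConfig()
        .environment(
            env="CartPole-v1",        # gymnasium ID or tune-registered env name
            env_config={},            # passed to the env creator as an EnvContext
            normalize_actions=True,   # learn in a normalized (Box) action space
            clip_rewards=None,        # None: clip for Atari only (r = sign(r))
        )
    )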

Configuring training behavior#

AlgorithmConfig.training(*, gamma: float | None = <ray.rllib.utils.from_config._NotProvided object>, lr: float | ~typing.List[~typing.List[int | float]] | ~typing.List[~typing.Tuple[int, int | float]] | None = <ray.rllib.utils.from_config._NotProvided object>, grad_clip: float | None = <ray.rllib.utils.from_config._NotProvided object>, grad_clip_by: str | None = <ray.rllib.utils.from_config._NotProvided object>, train_batch_size: int | None = <ray.rllib.utils.from_config._NotProvided object>, train_batch_size_per_learner: int | None = <ray.rllib.utils.from_config._NotProvided object>, num_epochs: int | None = <ray.rllib.utils.from_config._NotProvided object>, minibatch_size: int | None = <ray.rllib.utils.from_config._NotProvided object>, shuffle_batch_per_epoch: bool | None = <ray.rllib.utils.from_config._NotProvided object>, model: dict | None = <ray.rllib.utils.from_config._NotProvided object>, optimizer: dict | None = <ray.rllib.utils.from_config._NotProvided object>, learner_class: ~typing.Type[Learner] | None = <ray.rllib.utils.from_config._NotProvided object>, learner_connector: ~typing.Callable[[RLModule], ConnectorV2 | ~typing.List[ConnectorV2]] | None = <ray.rllib.utils.from_config._NotProvided object>, add_default_connectors_to_learner_pipeline: bool | None = <ray.rllib.utils.from_config._NotProvided object>, learner_config_dict: ~typing.Dict[str, ~typing.Any] | None = <ray.rllib.utils.from_config._NotProvided object>, num_sgd_iter=-1, max_requests_in_flight_per_sampler_worker=-1) AlgorithmConfig[source]

Sets the training related configuration.

Parameters:
  • gamma – Float specifying the discount factor of the Markov Decision process.

  • lr – The learning rate (float) or learning rate schedule in the format of [[timestep, lr-value], [timestep, lr-value], …] In case of a schedule, intermediary timesteps are assigned to linearly interpolated learning rate values. A schedule config’s first entry must start with timestep 0, i.e.: [[0, initial_value], […]]. Note: If you require a) more than one optimizer (per RLModule), b) optimizer types that are not Adam, c) a learning rate schedule that is not a linearly interpolated, piecewise schedule as described above, or d) specifying c’tor arguments of the optimizer that are not the learning rate (e.g. Adam’s epsilon), then you must override your Learner’s configure_optimizer_for_module() method and handle lr-scheduling yourself.

  • grad_clip – If None, no gradient clipping is applied. Otherwise, depending on the setting of grad_clip_by, the (float) value of grad_clip has the following effect: If grad_clip_by=value: Clips all computed gradients individually inside the interval [-grad_clip, +grad_clip]. If grad_clip_by=norm, computes the L2-norm of each weight/bias gradient tensor individually and then clips all gradients such that these L2-norms do not exceed grad_clip. The L2-norm of a tensor is computed via: sqrt(SUM(w0^2, w1^2, ..., wn^2)) where w[i] are the elements of the tensor (no matter what the shape of this tensor is). If grad_clip_by=global_norm, computes the square of the L2-norm of each weight/bias gradient tensor individually, sums up all these squared L2-norms across all given gradient tensors (e.g. the entire module to be updated), takes the square root of that overall sum, and then clips all gradients such that this global L2-norm does not exceed the given value. The global L2-norm over a list of tensors (e.g. W and V) is computed via: sqrt[SUM(w0^2, w1^2, ..., wn^2) + SUM(v0^2, v1^2, ..., vm^2)], where w[i] and v[j] are the elements of the tensors W and V (no matter what the shapes of these tensors are).

  • grad_clip_by – See grad_clip for the effect of this setting on gradient clipping. Allowed values are value, norm, and global_norm.

  • train_batch_size_per_learner – Train batch size per individual Learner worker. This setting only applies to the new API stack. The number of Learner workers can be set via config.learners(num_learners=...). The total effective batch size is then num_learners x train_batch_size_per_learner and you can access it with the property AlgorithmConfig.total_train_batch_size.

  • train_batch_size – Training batch size, if applicable. When on the new API stack, this setting should no longer be used. Instead, use train_batch_size_per_learner (in combination with num_learners).

  • num_epochs – The number of complete passes over the entire train batch (per Learner). Each pass might be further split into n minibatches (if minibatch_size provided).

  • minibatch_size – The size of the minibatches into which the train batch is further split.

  • shuffle_batch_per_epoch – Whether to shuffle the train batch once per epoch. If the train batch has a time rank (axis=1), shuffling only takes place along the batch axis to not disturb any intact (episode) trajectories.

  • model – Arguments passed into the policy model. See models/catalog.py for a full list of the available model options. TODO: Provide ModelConfig objects instead of dicts.

  • optimizer – Arguments to pass to the policy optimizer. This setting is not used when enable_rl_module_and_learner=True.

  • learner_class – The Learner class to use for (distributed) updating of the RLModule. Only used when enable_rl_module_and_learner=True.

  • learner_connector – A callable taking an env observation space and an env action space as inputs and returning a learner ConnectorV2 (might be a pipeline) object.

  • add_default_connectors_to_learner_pipeline – If True (default), RLlib’s Learners automatically add the default Learner ConnectorV2 pieces to the LearnerPipeline. These automatically perform: a) adding observations from episodes to the train batch, if a user-provided connector piece hasn’t already done so; b) if the RLModule is stateful, adding a time rank to the train batch, zero-padding the data, and adding the correct state inputs, if a user-provided connector piece hasn’t already done so; c) adding all other information (actions, rewards, terminateds, etc.) to the train batch, if a user-provided connector piece hasn’t already done so. Set this to False only if you know exactly what you are doing. Note that this setting is only relevant if the new API stack is used (including the new EnvRunner classes).

  • learner_config_dict – A dict to insert any settings accessible from within the Learner instance. This should only be used in connection with custom Learner subclasses and in case the user doesn’t want to write an extra AlgorithmConfig subclass just to add a few settings to the base Algo’s own config class.

Returns:

This updated AlgorithmConfig object.
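
A sketch of typical training settings (the concrete values are illustrative assumptions, not defaults):

    from ray.rllib.algorithms.ppo import PPOConfig

    config = (
        PPOConfig()
        .environment("CartPole-v1")
        .training(
            gamma=0.99,
            lr=[[0, 3e-4], [1_000_000, 1e-5]],  # piecewise, linearly interpolated schedule
            grad_clip=40.0,
            grad_clip_by="global_norm",
            train_batch_size_per_learner=4000,
            num_epochs=10,       # passes over the train batch per Learner
            minibatch_size=512,  # minibatch size used within each epoch
        )
    )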

Configuring EnvRunnerGroup and EnvRunner actors#

AlgorithmConfig.env_runners(*, env_runner_cls: type | None = <ray.rllib.utils.from_config._NotProvided object>, num_env_runners: int | None = <ray.rllib.utils.from_config._NotProvided object>, num_envs_per_env_runner: int | None = <ray.rllib.utils.from_config._NotProvided object>, gym_env_vectorize_mode: str | None = <ray.rllib.utils.from_config._NotProvided object>, num_cpus_per_env_runner: int | None = <ray.rllib.utils.from_config._NotProvided object>, num_gpus_per_env_runner: int | float | None = <ray.rllib.utils.from_config._NotProvided object>, custom_resources_per_env_runner: dict | None = <ray.rllib.utils.from_config._NotProvided object>, validate_env_runners_after_construction: bool | None = <ray.rllib.utils.from_config._NotProvided object>, sample_timeout_s: float | None = <ray.rllib.utils.from_config._NotProvided object>, max_requests_in_flight_per_env_runner: int | None = <ray.rllib.utils.from_config._NotProvided object>, env_to_module_connector: ~typing.Callable[[~typing.Any | gymnasium.Env], ConnectorV2 | ~typing.List[ConnectorV2]] | None = <ray.rllib.utils.from_config._NotProvided object>, module_to_env_connector: ~typing.Callable[[~typing.Any | gymnasium.Env, RLModule], ConnectorV2 | ~typing.List[ConnectorV2]] | None = <ray.rllib.utils.from_config._NotProvided object>, add_default_connectors_to_env_to_module_pipeline: bool | None = <ray.rllib.utils.from_config._NotProvided object>, add_default_connectors_to_module_to_env_pipeline: bool | None = <ray.rllib.utils.from_config._NotProvided object>, episode_lookback_horizon: int | None = <ray.rllib.utils.from_config._NotProvided object>, use_worker_filter_stats: bool | None = <ray.rllib.utils.from_config._NotProvided object>, update_worker_filter_stats: bool | None = <ray.rllib.utils.from_config._NotProvided object>, compress_observations: bool | None = <ray.rllib.utils.from_config._NotProvided object>, rollout_fragment_length: int | str | None = <ray.rllib.utils.from_config._NotProvided object>, batch_mode: str | None = <ray.rllib.utils.from_config._NotProvided object>, explore: bool | None = <ray.rllib.utils.from_config._NotProvided object>, exploration_config: dict | None = <ray.rllib.utils.from_config._NotProvided object>, create_env_on_local_worker: bool | None = <ray.rllib.utils.from_config._NotProvided object>, sample_collector: ~typing.Type[~ray.rllib.evaluation.collectors.sample_collector.SampleCollector] | None = <ray.rllib.utils.from_config._NotProvided object>, remote_worker_envs: bool | None = <ray.rllib.utils.from_config._NotProvided object>, remote_env_batch_wait_ms: float | None = <ray.rllib.utils.from_config._NotProvided object>, preprocessor_pref: str | None = <ray.rllib.utils.from_config._NotProvided object>, observation_filter: str | None = <ray.rllib.utils.from_config._NotProvided object>, enable_tf1_exec_eagerly: bool | None = <ray.rllib.utils.from_config._NotProvided object>, sampler_perf_stats_ema_coef: float | None = <ray.rllib.utils.from_config._NotProvided object>, num_rollout_workers=-1, num_envs_per_worker=-1, validate_workers_after_construction=-1, ignore_worker_failures=-1, recreate_failed_workers=-1, restart_failed_sub_environments=-1, num_consecutive_worker_failures_tolerance=-1, worker_health_probe_timeout_s=-1, worker_restore_timeout_s=-1, synchronize_filter=-1, enable_connectors=-1) AlgorithmConfig[source]

Sets the EnvRunner (rollout worker) configuration.

Parameters:
  • env_runner_cls – The EnvRunner class to use for environment rollouts (data collection).

  • num_env_runners – Number of EnvRunner actors to create for parallel sampling. Setting this to 0 forces sampling to be done in the local EnvRunner (main process or the Algorithm’s actor when using Tune).

  • num_envs_per_env_runner – Number of environments to step through (vector-wise) per EnvRunner. This enables batching when computing actions through RLModule inference, which can improve performance for inference-bottlenecked workloads.

  • gym_env_vectorize_mode – The gymnasium vectorization mode for vector envs. Must be a gymnasium.envs.registration.VectorizeMode (enum) value. Default is SYNC. Set this to ASYNC to parallelize the individual sub environments within the vector. This can speed up your EnvRunners significantly when using heavier environments.

  • num_cpus_per_env_runner – Number of CPUs to allocate per EnvRunner.

  • num_gpus_per_env_runner – Number of GPUs to allocate per EnvRunner. This can be fractional. This is usually needed only if your env itself requires a GPU (i.e., it is a GPU-intensive video game), or model inference is unusually expensive.

  • custom_resources_per_env_runner – Any custom Ray resources to allocate per EnvRunner.

  • sample_timeout_s – The timeout in seconds for calling sample() on remote EnvRunner workers. Results (episode list) from workers that take longer than this time are discarded. Only used by algorithms that sample synchronously in turn with their update step (e.g., PPO or DQN). Not relevant for any algos that sample asynchronously, such as APPO or IMPALA.

  • max_requests_in_flight_per_env_runner – Max number of in-flight requests to each EnvRunner (actor). See the ray.rllib.utils.actor_manager.FaultTolerantActorManager class for more details. Tuning these values is important when running experiments with large sample batches, where there is the risk that the object store may fill up, causing spilling of objects to disk. This can cause any asynchronous requests to become very slow, making your experiment run slowly as well. You can inspect the object store during your experiment through a call to ray memory on your head node, and by using the Ray dashboard. If you’re seeing that the object store is filling up, turn down the number of remote requests in flight, enable compression, or increase the object store memory through, for example: ray.init(object_store_memory=10 * 1024 * 1024 * 1024)  # =10 GB

  • sample_collector – For the old API stack only. The SampleCollector class to be used to collect and retrieve environment-, model-, and sampler data. Override the SampleCollector base class to implement your own collection/buffering/retrieval logic.

  • create_env_on_local_worker – When num_env_runners > 0, the driver (local_worker; worker-idx=0) doesn’t need an environment, because it neither samples (done by remote_workers; worker_indices > 0) nor evaluates (done by evaluation workers; see below). Set this to True to create an environment on the local worker anyway.

  • env_to_module_connector – A callable taking an Env as input arg and returning an env-to-module ConnectorV2 (might be a pipeline) object.

  • module_to_env_connector – A callable taking an Env and an RLModule as input args and returning a module-to-env ConnectorV2 (might be a pipeline) object.

  • add_default_connectors_to_env_to_module_pipeline – If True (default), RLlib’s EnvRunners automatically add the default env-to-module ConnectorV2 pieces to the EnvToModulePipeline. These automatically perform adding observations and states (in case of stateful Module(s)), agent-to-module mapping, batching, and conversion to tensor data. Set this to False only if you know exactly what you are doing. Note that this setting is only relevant if the new API stack is used (including the new EnvRunner classes).

  • add_default_connectors_to_module_to_env_pipeline – If True (default), RLlib’s EnvRunners automatically add the default module-to-env ConnectorV2 pieces to the ModuleToEnvPipeline. These automatically perform removing the additional time-rank (if applicable, in case of stateful Module(s)), module-to-agent unmapping, un-batching (to lists), and conversion from tensor data to numpy. Set this to False only if you know exactly what you are doing. Note that this setting is only relevant if the new API stack is used (including the new EnvRunner classes).

  • episode_lookback_horizon – The amount of data (in timesteps) to keep from the preceding episode chunk when a new chunk (for the same episode) is generated to continue sampling at a later time. The larger this value, the more an env-to-module connector can look back in time and compile RLModule input data from this information. For example, if your custom env-to-module connector (and your custom RLModule) requires the previous 10 rewards as inputs, you must set this to at least 10.

  • use_worker_filter_stats – Whether to use the workers in the EnvRunnerGroup to update the central filters (held by the local worker). If False, stats from the workers aren’t used and are discarded.

  • update_worker_filter_stats – Whether to push filter updates from the central filters (held by the local worker) to the remote workers’ filters. Setting this to True might be useful within the evaluation config in order to disable the usage of evaluation trajectories for synching the central filter (used for training).

  • rollout_fragment_length – Divide episodes into fragments of this many steps each during sampling. Trajectories of this size are collected from EnvRunners and combined into a larger batch of train_batch_size for learning. For example, given rollout_fragment_length=100 and train_batch_size=1000: 1. RLlib collects 10 fragments of 100 steps each from rollout workers. 2. These fragments are concatenated and an epoch of SGD is performed. When using multiple envs per worker, the fragment size is multiplied by num_envs_per_env_runner, because steps are collected from multiple envs in parallel. For example, if num_envs_per_env_runner=5, then EnvRunners return experiences in chunks of 5*100 = 500 steps. The dataflow here can vary per algorithm. For example, PPO further divides the train batch into minibatches for multi-epoch SGD. Set rollout_fragment_length to “auto” to have RLlib compute an exact value to match the given batch size.

  • batch_mode – How to build individual batches with the EnvRunner(s). Batches coming from distributed EnvRunners are usually concat’d to form the train batch. Note that “steps” below can mean different things (either env- or agent-steps) and depends on the count_steps_by setting, adjustable via AlgorithmConfig.multi_agent(count_steps_by=..): 1) “truncate_episodes”: Each call to EnvRunner.sample() returns a batch of at most rollout_fragment_length * num_envs_per_env_runner in size. The batch is exactly rollout_fragment_length * num_envs in size if postprocessing does not change batch sizes. Episodes may be truncated in order to meet this size requirement. This mode guarantees evenly sized batches, but increases variance as the future return must now be estimated at truncation boundaries. 2) “complete_episodes”: Each call to EnvRunner.sample() returns a batch of at least rollout_fragment_length * num_envs_per_env_runner in size. Episodes aren’t truncated, but multiple episodes may be packed within one batch to meet the (minimum) batch size. Note that when num_envs_per_env_runner > 1, episode steps are buffered until the episode completes, and hence batches may contain significant amounts of off-policy data.

  • explore – Default exploration behavior, iff explore=None is passed into compute_action(s). Set to False for no exploration behavior (e.g., for evaluation).

  • exploration_config – A dict specifying the Exploration object’s config.

  • remote_worker_envs – If using num_envs_per_env_runner > 1, whether to create those new envs in remote processes instead of in the same worker. This adds overhead, but can make sense if your envs take a long time to step/reset (e.g., for StarCraft). Use this cautiously; overheads are significant.

  • remote_env_batch_wait_ms – The timeout that remote workers wait for when polling environments. 0 (continue as soon as at least one env is ready) is a reasonable default, but the optimal value could be obtained by measuring your environment step/reset and model inference performance.

  • validate_env_runners_after_construction – Whether to validate that each created remote EnvRunner is healthy after its construction process.

  • preprocessor_pref – Whether to use “rllib” or “deepmind” preprocessors by default. Set to None for using no preprocessor. In this case, the model has to handle possibly complex observations from the environment.

  • observation_filter – Element-wise observation filter, either “NoFilter” or “MeanStdFilter”.

  • compress_observations – Whether to LZ4 compress individual observations in the SampleBatches collected during rollouts.

  • enable_tf1_exec_eagerly – Explicitly tells the rollout worker to enable TF eager execution. This is useful for example when framework is “torch”, but a TF2 policy needs to be restored for evaluation or league-based purposes.

  • sampler_perf_stats_ema_coef – If specified, perf stats are reported as exponential moving averages (EMAs). This is the coefficient determining how much new data points contribute to the average. The default is None, which uses a simple global average instead. The EMA update rule is: updated = (1 - ema_coef) * old + ema_coef * new

Returns:

This updated AlgorithmConfig object.
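
A sketch of a common EnvRunner setup (values are illustrative assumptions):

    from ray.rllib.algorithms.ppo import PPOConfig

    config = (
        PPOConfig()
        .environment("CartPole-v1")
        .env_runners(
            num_env_runners=4,               # 4 remote EnvRunner actors
            num_envs_per_env_runner=2,       # 2 vectorized (sub-)envs per EnvRunner
            rollout_fragment_length="auto",  # let RLlib match the train batch size
            batch_mode="truncate_episodes",
            compress_observations=False,
        )
    )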

Configuring LearnerGroup and Learner actors#

AlgorithmConfig.learners(*, num_learners: int | None = <ray.rllib.utils.from_config._NotProvided object>, num_cpus_per_learner: int | float | None = <ray.rllib.utils.from_config._NotProvided object>, num_gpus_per_learner: int | float | None = <ray.rllib.utils.from_config._NotProvided object>, local_gpu_idx: int | None = <ray.rllib.utils.from_config._NotProvided object>, max_requests_in_flight_per_learner: int | None = <ray.rllib.utils.from_config._NotProvided object>)[source]

Sets LearnerGroup and Learner worker related configurations.

Parameters:
  • num_learners – Number of Learner workers used for updating the RLModule. A value of 0 means training takes place on a local Learner on main process CPUs or 1 GPU (determined by num_gpus_per_learner). For multi-gpu training, you have to set num_learners to > 1 and set num_gpus_per_learner accordingly (e.g., 4 GPUs total and model fits on 1 GPU: num_learners=4; num_gpus_per_learner=1 OR 4 GPUs total and model requires 2 GPUs: num_learners=2; num_gpus_per_learner=2).

  • num_cpus_per_learner – Number of CPUs allocated per Learner worker. Only necessary for custom processing pipeline inside each Learner requiring multiple CPU cores. Ignored if num_learners=0.

  • num_gpus_per_learner – Number of GPUs allocated per Learner worker. If num_learners=0, any value greater than 0 runs the training on a single GPU on the main process, while a value of 0 runs the training on main process CPUs. If num_gpus_per_learner is > 0, then you shouldn’t change num_cpus_per_learner (from its default value of 1).

  • local_gpu_idx – If num_gpus_per_learner > 0, and num_learners < 2, then RLlib uses this GPU index for training. This is an index into the available CUDA devices. For example if os.environ["CUDA_VISIBLE_DEVICES"] = "1" and local_gpu_idx=0, RLlib uses the GPU with ID=1 on the node.

  • max_requests_in_flight_per_learner – Max number of in-flight requests to each Learner (actor). You normally don’t have to tune this setting (default is 3), however, for asynchronous algorithms, this determines the “queue” size for incoming batches (or lists of episodes) into each Learner worker, thus also determining how much off-policy’ness is acceptable. The off-policy’ness is the difference between the number of updates a policy has undergone on the Learner vs the EnvRunners. See the ray.rllib.utils.actor_manager.FaultTolerantActorManager class for more details.

Returns:

This updated AlgorithmConfig object.
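
A sketch of a multi-GPU Learner setup (2 available GPUs are assumed here):

    from ray.rllib.algorithms.ppo import PPOConfig

    # Two remote Learner actors, one GPU each. With a per-Learner batch of
    # 4000, the total effective train batch size is 2 x 4000 = 8000.
    config = (
        PPOConfig()
        .environment("CartPole-v1")
        .training(train_batch_size_per_learner=4000)
        .learners(
            num_learners=2,
            num_gpus_per_learner=1,
        )
    )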

Configuring custom callbacks#

AlgorithmConfig.callbacks(callbacks_class: ~typing.Type[~ray.rllib.callbacks.callbacks.RLlibCallback] | ~typing.List[~typing.Type[~ray.rllib.callbacks.callbacks.RLlibCallback]] | None = <ray.rllib.utils.from_config._NotProvided object>, *, on_algorithm_init: ~typing.Callable | ~typing.List[~typing.Callable] | None = <ray.rllib.utils.from_config._NotProvided object>, on_train_result: ~typing.Callable | ~typing.List[~typing.Callable] | None = <ray.rllib.utils.from_config._NotProvided object>, on_evaluate_start: ~typing.Callable | ~typing.List[~typing.Callable] | None = <ray.rllib.utils.from_config._NotProvided object>, on_evaluate_end: ~typing.Callable | ~typing.List[~typing.Callable] | None = <ray.rllib.utils.from_config._NotProvided object>, on_env_runners_recreated: ~typing.Callable | ~typing.List[~typing.Callable] | None = <ray.rllib.utils.from_config._NotProvided object>, on_checkpoint_loaded: ~typing.Callable | ~typing.List[~typing.Callable] | None = <ray.rllib.utils.from_config._NotProvided object>, on_environment_created: ~typing.Callable | ~typing.List[~typing.Callable] | None = <ray.rllib.utils.from_config._NotProvided object>, on_episode_created: ~typing.Callable | ~typing.List[~typing.Callable] | None = <ray.rllib.utils.from_config._NotProvided object>, on_episode_start: ~typing.Callable | ~typing.List[~typing.Callable] | None = <ray.rllib.utils.from_config._NotProvided object>, on_episode_step: ~typing.Callable | ~typing.List[~typing.Callable] | None = <ray.rllib.utils.from_config._NotProvided object>, on_episode_end: ~typing.Callable | ~typing.List[~typing.Callable] | None = <ray.rllib.utils.from_config._NotProvided object>, on_sample_end: ~typing.Callable | ~typing.List[~typing.Callable] | None = <ray.rllib.utils.from_config._NotProvided object>) AlgorithmConfig[source]

Sets the callbacks configuration.

Parameters:
  • callbacks_class – RLlibCallback class, whose methods are called during various phases of training and RL environment sample collection. See the RLlibCallback class and examples/metrics/custom_metrics_and_callbacks.py for more information.

  • on_algorithm_init – A callable or a list of callables. If a list, RLlib calls the items in the same sequence. on_algorithm_init methods overridden in callbacks_class take precedence and are called first. See on_algorithm_init() for more information.

  • on_evaluate_start – A callable or a list of callables. If a list, RLlib calls the items in the same sequence. on_evaluate_start methods overridden in callbacks_class take precedence and are called first. See on_evaluate_start() for more information.

  • on_evaluate_end – A callable or a list of callables. If a list, RLlib calls the items in the same sequence. on_evaluate_end methods overridden in callbacks_class take precedence and are called first. See on_evaluate_end() for more information.

  • on_env_runners_recreated – A callable or a list of callables. If a list, RLlib calls the items in the same sequence. on_env_runners_recreated methods overridden in callbacks_class take precedence and are called first. See on_env_runners_recreated() for more information.

  • on_checkpoint_loaded – A callable or a list of callables. If a list, RLlib calls the items in the same sequence. on_checkpoint_loaded methods overridden in callbacks_class take precedence and are called first. See on_checkpoint_loaded() for more information.

  • on_environment_created – A callable or a list of callables. If a list, RLlib calls the items in the same sequence. on_environment_created methods overridden in callbacks_class take precedence and are called first. See on_environment_created() for more information.

  • on_episode_created – A callable or a list of callables. If a list, RLlib calls the items in the same sequence. on_episode_created methods overridden in callbacks_class take precedence and are called first. See on_episode_created() for more information.

  • on_episode_start – A callable or a list of callables. If a list, RLlib calls the items in the same sequence. on_episode_start methods overridden in callbacks_class take precedence and are called first. See on_episode_start() for more information.

  • on_episode_step – A callable or a list of callables. If a list, RLlib calls the items in the same sequence. on_episode_step methods overridden in callbacks_class take precedence and are called first. See on_episode_step() for more information.

  • on_episode_end – A callable or a list of callables. If a list, RLlib calls the items in the same sequence. on_episode_end methods overridden in callbacks_class take precedence and are called first. See on_episode_end() for more information.

  • on_sample_end – A callable or a list of callables. If a list, RLlib calls the items in the same sequence. on_sample_end methods overridden in callbacks_class take precedence and are called first. See on_sample_end() for more information.

Returns:

This updated AlgorithmConfig object.
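
A sketch of plugging in a custom RLlibCallback subclass (the callback class below is hypothetical, and the exact on_episode_end signature and episode.get_return() call are assumptions based on the new API stack):

    from ray.rllib.algorithms.ppo import PPOConfig
    from ray.rllib.callbacks.callbacks import RLlibCallback


    class EpisodeReturnLogger(RLlibCallback):
        # Hypothetical callback: print each finished episode's return.
        def on_episode_end(self, *, episode, **kwargs):
            print("episode return:", episode.get_return())


    config = (
        PPOConfig()
        .environment("CartPole-v1")
        .callbacks(EpisodeReturnLogger)
    )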

Configuring multi-agent specific settings#

AlgorithmConfig.multi_agent(*, policies: ~typing.Dict[str, PolicySpec] | ~typing.Collection[str] | None = <ray.rllib.utils.from_config._NotProvided object>, policy_map_capacity: int | None = <ray.rllib.utils.from_config._NotProvided object>, policy_mapping_fn: ~typing.Callable[[~typing.Any, EpisodeType], str] | None = <ray.rllib.utils.from_config._NotProvided object>, policies_to_train: ~typing.Collection[str] | ~typing.Callable[[str, SampleBatch | MultiAgentBatch | ~typing.Dict[str, ~typing.Any]], bool] | None = <ray.rllib.utils.from_config._NotProvided object>, policy_states_are_swappable: bool | None = <ray.rllib.utils.from_config._NotProvided object>, observation_fn: ~typing.Callable | None = <ray.rllib.utils.from_config._NotProvided object>, count_steps_by: str | None = <ray.rllib.utils.from_config._NotProvided object>, algorithm_config_overrides_per_module=-1, replay_mode=-1, policy_map_cache=-1) AlgorithmConfig[source]

Sets the config’s multi-agent settings.

Validates the new multi-agent settings and translates everything into a unified multi-agent setup format. For example, a policies list or set of IDs is properly converted into a dict mapping these IDs to PolicySpecs.

Parameters:
  • policies – Map of type MultiAgentPolicyConfigDict from policy ids to either 4-tuples of (policy_cls, obs_space, act_space, config) or PolicySpecs. These tuples or PolicySpecs define the class of the policy, the observation- and action spaces of the policies, and any extra config.

  • policy_map_capacity – Keep this many policies in the “policy_map” (before writing least-recently used ones to disk/S3).

  • policy_mapping_fn – Function mapping agent ids to policy ids. The signature is: (agent_id, episode, worker, **kwargs) -> PolicyID.

  • policies_to_train – Determines those policies that should be updated. Options are: None, for training all policies; an iterable of PolicyIDs that should be trained; or a callable, taking a PolicyID and a SampleBatch or MultiAgentBatch and returning a bool (indicating whether the given policy is trainable or not, given the particular batch). This allows you to have a policy trained only on certain data (e.g. when playing against a certain opponent).

  • policy_states_are_swappable – Whether all Policy objects in this map can be “swapped out” via a simple state = A.get_state(); B.set_state(state), where A and B are policy instances in this map. You should set this to True for significantly speeding up the PolicyMap’s cache lookup times, iff your policies all share the same neural network architecture and optimizer types. If True, the PolicyMap doesn’t have to garbage collect old, least recently used policies, but instead keeps them in memory and simply overrides their state with the state of the most recently accessed one. For example, in a league-based training setup, you might have 100s of the same policies in your map (playing against each other in various combinations), but all of them share the same state structure (are “swappable”).

  • observation_fn – Optional function that can be used to enhance the local agent observations to include more state. See rllib/evaluation/observation_function.py for more info.

  • count_steps_by – Which metric to use as the “batch size” when building a MultiAgentBatch. The two supported values are: “env_steps”: Count each time the env is “stepped” (no matter how many multi-agent actions are passed/how many multi-agent observations have been returned in the previous step). “agent_steps”: Count each individual agent step as one step.

Returns:

This updated AlgorithmConfig object.
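
A sketch of a two-policy setup (the env name "my_multi_agent_env" is a hypothetical tune-registered multi-agent env, and the mapping logic is purely illustrative):

    from ray.rllib.algorithms.ppo import PPOConfig

    config = (
        PPOConfig()
        .environment("my_multi_agent_env")
        .multi_agent(
            # A set of IDs is converted into a dict mapping IDs to PolicySpecs.
            policies={"policy_0", "policy_1"},
            # Map agents to one of the two policies based on their agent ID.
            policy_mapping_fn=lambda agent_id, episode, **kwargs: (
                "policy_0" if hash(agent_id) % 2 == 0 else "policy_1"
            ),
            policies_to_train=["policy_0"],  # only update policy_0
            count_steps_by="env_steps",
        )
    )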

Configuring offline RL specific settings#

AlgorithmConfig.offline_data(*, input_: str | ~typing.Callable[[~ray.rllib.offline.io_context.IOContext], ~ray.rllib.offline.input_reader.InputReader] | None = <ray.rllib.utils.from_config._NotProvided object>, offline_data_class: ~typing.Type | None = <ray.rllib.utils.from_config._NotProvided object>, input_read_method: str | ~typing.Callable | None = <ray.rllib.utils.from_config._NotProvided object>, input_read_method_kwargs: ~typing.Dict | None = <ray.rllib.utils.from_config._NotProvided object>, input_read_schema: ~typing.Dict[str, str] | None = <ray.rllib.utils.from_config._NotProvided object>, input_read_episodes: bool | None = <ray.rllib.utils.from_config._NotProvided object>, input_read_sample_batches: bool | None = <ray.rllib.utils.from_config._NotProvided object>, input_read_batch_size: int | None = <ray.rllib.utils.from_config._NotProvided object>, input_filesystem: str | None = <ray.rllib.utils.from_config._NotProvided object>, input_filesystem_kwargs: ~typing.Dict | None = <ray.rllib.utils.from_config._NotProvided object>, input_compress_columns: ~typing.List[str] | None = <ray.rllib.utils.from_config._NotProvided object>, materialize_data: bool | None = <ray.rllib.utils.from_config._NotProvided object>, materialize_mapped_data: bool | None = <ray.rllib.utils.from_config._NotProvided object>, map_batches_kwargs: ~typing.Dict | None = <ray.rllib.utils.from_config._NotProvided object>, iter_batches_kwargs: ~typing.Dict | None = <ray.rllib.utils.from_config._NotProvided object>, prelearner_class: ~typing.Type | None = <ray.rllib.utils.from_config._NotProvided object>, prelearner_buffer_class: ~typing.Type | None = <ray.rllib.utils.from_config._NotProvided object>, prelearner_buffer_kwargs: ~typing.Dict | None = <ray.rllib.utils.from_config._NotProvided object>, prelearner_module_synch_period: int | None = <ray.rllib.utils.from_config._NotProvided object>, dataset_num_iters_per_learner: int | None = <ray.rllib.utils.from_config._NotProvided object>, input_config: ~typing.Dict | None = <ray.rllib.utils.from_config._NotProvided object>, actions_in_input_normalized: bool | None = <ray.rllib.utils.from_config._NotProvided object>, postprocess_inputs: bool | None = <ray.rllib.utils.from_config._NotProvided object>, shuffle_buffer_size: int | None = <ray.rllib.utils.from_config._NotProvided object>, output: str | None = <ray.rllib.utils.from_config._NotProvided object>, output_config: ~typing.Dict | None = <ray.rllib.utils.from_config._NotProvided object>, output_compress_columns: ~typing.List[str] | None = <ray.rllib.utils.from_config._NotProvided object>, output_max_file_size: float | None = <ray.rllib.utils.from_config._NotProvided object>, output_max_rows_per_file: int | None = <ray.rllib.utils.from_config._NotProvided object>, output_write_remaining_data: bool | None = <ray.rllib.utils.from_config._NotProvided object>, output_write_method: str | None = <ray.rllib.utils.from_config._NotProvided object>, output_write_method_kwargs: ~typing.Dict | None = <ray.rllib.utils.from_config._NotProvided object>, output_filesystem: str | None = <ray.rllib.utils.from_config._NotProvided object>, output_filesystem_kwargs: ~typing.Dict | None = <ray.rllib.utils.from_config._NotProvided object>, output_write_episodes: bool | None = <ray.rllib.utils.from_config._NotProvided object>, offline_sampling: str | None = <ray.rllib.utils.from_config._NotProvided object>) AlgorithmConfig[source]

Sets the config’s offline data settings.

Parameters:
  • input – Specify how to generate experiences: - “sampler”: Generate experiences via online (env) simulation (default). - A local directory or file glob expression (e.g., “/tmp/*.json”). - A list of individual file paths/URIs (e.g., [“/tmp/1.json”, “s3://bucket/2.json”]). - A dict with string keys and sampling probabilities as values (e.g., {“sampler”: 0.4, “/tmp/*.json”: 0.4, “s3://bucket/expert.json”: 0.2}). - A callable that takes an IOContext object as its only arg and returns a ray.rllib.offline.InputReader. - A string key that indexes a callable registered via tune.registry.register_input.

  • offline_data_class – An optional OfflineData class that is used to define the offline data pipeline, including the dataset and the sampling methodology. Override the OfflineData class and pass your derived class here, if you need some primer transformations specific to your data or your loss. Usually overriding the OfflinePreLearner and using the resulting customization via prelearner_class suffices for most cases. The default is None which uses the base OfflineData defined in ray.rllib.offline.offline_data.OfflineData.

  • input_read_method – Read method for the ray.data.Dataset to read in the offline data from input_. The default is read_parquet for Parquet files. See https://docs.ray.io/en/latest/data/api/input_output.html for more info about available read methods in ray.data.

  • input_read_method_kwargs – Keyword args for input_read_method. These are passed by RLlib into the read method without checking. Use these keyword args together with map_batches_kwargs and iter_batches_kwargs to tune the performance of the data pipeline. It is strongly recommended to rely on Ray Data’s automatic read performance tuning.

  • input_read_schema – Table schema for converting offline data to episodes. This schema maps the offline data columns to ray.rllib.core.columns.Columns: {Columns.OBS: 'o_t', Columns.ACTIONS: 'a_t', ...}. Columns in the data set that aren’t mapped via this schema are sorted into the episodes’ extra_model_outputs. If no schema is passed in, the default schema ray.rllib.offline.offline_data.SCHEMA is used. If your dataset already contains the names in this schema, no input_read_schema is needed. The same applies if the data is in RLlib’s EpisodeType or its old SampleBatch format.

  • input_read_episodes – Whether the offline data is already stored in RLlib’s EpisodeType format, i.e. ray.rllib.env.SingleAgentEpisode (multi-agent is planned but not supported, yet). Reading episodes directly avoids additional transform steps and is usually faster and therefore the recommended format when your application remains fully inside of RLlib’s schema. The other format is a columnar format and is agnostic to the RL framework used. Use the latter format if you are unsure when you will use the data or in which RL framework. The default is False, i.e. to read columnar data. input_read_episodes and input_read_sample_batches can’t both be True at the same time. See also output_write_episodes to define the output data format when recording.

  • input_read_sample_batches – Whether the offline data is stored in RLlib’s old stack SampleBatch type. This is usually the case for older data recorded with RLlib in JSON line format. Reading in SampleBatch data needs extra transforms and might not concatenate episode chunks contained in different SampleBatches in the data. If possible, avoid reading SampleBatches and instead convert them in a controlled way into RLlib’s EpisodeType (i.e. SingleAgentEpisode). The default is False. input_read_episodes and input_read_sample_batches can’t both be True at the same time.

  • input_read_batch_size – Batch size to pull from the data set. This could differ from the train_batch_size_per_learner if a dataset holds EpisodeType (i.e., SingleAgentEpisode), SampleBatch, or any other data type that contains multiple timesteps in a single row of the dataset. In such cases, a single batch of size train_batch_size_per_learner potentially pulls a multiple of train_batch_size_per_learner timesteps from the offline dataset. The default is None, in which case train_batch_size_per_learner is used as the batch size to pull.

  • input_filesystem – A cloud filesystem to handle access to cloud storage when reading experiences. Can be either “gcs” for Google Cloud Storage, “s3” for AWS S3 buckets, “abs” for Azure Blob Storage, or any filesystem supported by PyArrow. In general the file path is sufficient for accessing data from public or local storage systems. See https://arrow.apache.org/docs/python/filesystems.html for details.

  • input_filesystem_kwargs – A dictionary holding the kwargs for the filesystem given by input_filesystem. See gcsfs.GCSFilesystem for GCS, pyarrow.fs.S3FileSystem, for S3, and ablfs.AzureBlobFilesystem for ABS filesystem arguments.

  • input_compress_columns – What input columns are compressed with LZ4 in the input data. This applies when data is stored in RLlib’s SingleAgentEpisode format (MultiAgentEpisode isn’t supported, yet). Note that providing rllib.core.columns.Columns.OBS also tries to decompress rllib.core.columns.Columns.NEXT_OBS.

  • materialize_data – Whether the raw data should be materialized in memory. This boosts performance, but requires enough memory to avoid an OOM, so make sure that your cluster has the resources available. For very large data you might want to switch to streaming mode by setting this to False (default). If your algorithm does not need the RLModule in the Learner connector pipeline or all (learner) connectors are stateless you should consider setting materialize_mapped_data to True instead (and set materialize_data to False). If your data does not fit into memory and your Learner connector pipeline requires an RLModule or is stateful, set both materialize_data and materialize_mapped_data to False.

  • materialize_mapped_data – Whether the data should be materialized after running it through the Learner connector pipeline (i.e. after running the OfflinePreLearner). This improves performance, but should only be used in case the (learner) connector pipeline does not require an RLModule and the (learner) connector pipeline is stateless. For example, MARWIL’s Learner connector pipeline requires the RLModule for value function predictions and training batches would become stale after some iterations, causing learning degradation or divergence. Also ensure that your cluster has enough memory available to avoid an OOM. If set to True, make sure that materialize_data is set to False to avoid materialization of two datasets. If your data does not fit into memory and your Learner connector pipeline requires an RLModule or is stateful, set both materialize_data and materialize_mapped_data to False.

  • map_batches_kwargs – Keyword args for the map_batches method. These are passed into the ray.data.Dataset.map_batches method when sampling without checking. If no arguments are passed in, the default arguments {'concurrency': max(2, num_learners), 'zero_copy_batch': True} are used. Use these keyword args together with input_read_method_kwargs and iter_batches_kwargs to tune the performance of the data pipeline.

  • iter_batches_kwargs – Keyword args for the iter_batches method. These are passed into the ray.data.Dataset.iter_batches method when sampling without checking. If no arguments are passed in, the default argument {'prefetch_batches': 2} is used. Use these keyword args together with input_read_method_kwargs and map_batches_kwargs to tune the performance of the data pipeline.

  • prelearner_class – An optional OfflinePreLearner class that is used to transform data batches in ray.data.map_batches used in the OfflineData class to transform data from columns to batches that can be used in the Learner.update...() methods. Override the OfflinePreLearner class and pass your derived class in here, if you need to make some further transformations specific for your data or loss. The default is None which uses the base OfflinePreLearner defined in ray.rllib.offline.offline_prelearner.

  • prelearner_buffer_class – An optional EpisodeReplayBuffer class that RLlib uses to buffer experiences when data is in EpisodeType or RLlib’s previous SampleBatch type format. In this case, a single data row may contain multiple timesteps and the buffer serves two purposes: (a) to store intermediate data in memory, and (b) to ensure that RLlib samples exactly train_batch_size_per_learner experiences per batch. The default is RLlib’s EpisodeReplayBuffer.

  • prelearner_buffer_kwargs – Optional keyword arguments for initializing the EpisodeReplayBuffer. In most cases this value is simply the capacity for the default buffer that RLlib uses (EpisodeReplayBuffer), but it may differ if the prelearner_buffer_class uses a custom buffer.

  • prelearner_module_synch_period – The period (number of batches converted) after which the RLModule held by the PreLearner should sync weights. The PreLearner is used to preprocess batches for the learners. The higher this value, the more off-policy the PreLearner’s module is. Values too small force the PreLearner to sync more frequently and thus might slow down the data pipeline. The default value chosen by the OfflinePreLearner is 10.

  • dataset_num_iters_per_learner – Number of updates to run in each learner during a single training iteration. If None, each learner runs a complete epoch over its data block (the dataset is partitioned into at least as many blocks as there are learners). The default is None. This value must be set to 1, if RLlib uses a single (local) learner.

  • input_config – Arguments that describe the settings for reading the input. If input is “sample”, this is the environment configuration, e.g. env_name and env_config, etc. See EnvContext for more info. If the input is “dataset”, this contains e.g. format, path.

  • actions_in_input_normalized – True, if the actions in a given offline “input” are already normalized (between -1.0 and 1.0). This is usually the case when the offline file has been generated by another RLlib algorithm (e.g. PPO or SAC), while “normalize_actions” was set to True.

  • postprocess_inputs – Whether to run postprocess_trajectory() on the trajectory fragments from offline inputs. Note that postprocessing is done using the current policy, not the behavior policy, which is typically undesirable for on-policy algorithms.

  • shuffle_buffer_size – If positive, input batches are shuffled via a sliding window buffer of this number of batches. Use this if the input data is not in random enough order. Input is delayed until the shuffle buffer is filled.

  • output – Specify where experiences should be saved: - None: don’t save any experiences - “logdir” to save to the agent log dir - a path/URI to save to a custom output directory (e.g., “s3://bckt/”) - a function that returns an rllib.offline.OutputWriter

  • output_config – Arguments accessible from the IOContext for configuring custom output.

  • output_compress_columns – What sample batch columns to LZ4 compress in the output data. Note that providing rllib.core.columns.Columns.OBS also compresses rllib.core.columns.Columns.NEXT_OBS.

  • output_max_file_size – Max output file size (in bytes) before rolling over to a new file.

  • output_max_rows_per_file – Max number of output rows before rolling over to a new file.

  • output_write_remaining_data – Determines whether any remaining data in the recording buffers should be stored to disk. It is only applicable if output_max_rows_per_file is defined. When sampling data, it is buffered until the threshold specified by output_max_rows_per_file is reached. Only complete multiples of output_max_rows_per_file are written to disk, while any leftover data remains in the buffers. If a recording session is stopped, residual data may still reside in these buffers. Setting output_write_remaining_data to True ensures this data is flushed to disk. By default, this attribute is set to False.

  • output_write_method – Write method for the ray.data.Dataset to write the offline data to output. The default is write_parquet for Parquet files. See https://docs.ray.io/en/latest/data/api/input_output.html for more info about available write methods in ray.data.

  • output_write_method_kwargs – kwargs for the output_write_method. These are passed into the write method without checking.

  • output_filesystem – A cloud filesystem to handle access to cloud storage when writing experiences. Should be either “gcs” for Google Cloud Storage, “s3” for AWS S3 buckets, or “abs” for Azure Blob Storage.

  • output_filesystem_kwargs – A dictionary holding the kwargs for the filesystem given by output_filesystem. See gcsfs.GCSFilesystem for GCS, pyarrow.fs.S3FileSystem, for S3, and ablfs.AzureBlobFilesystem for ABS filesystem arguments.

  • output_write_episodes – Whether RLlib should record data in its EpisodeType format (that is, SingleAgentEpisode objects). Use this format if you need RLlib to order data in time and directly group it by episodes, for example to train stateful modules, or if you plan to use the recordings exclusively in RLlib. Otherwise RLlib records data in tabular (columnar) format. Default is True.

  • offline_sampling – Whether sampling for the Algorithm happens via reading from offline data. If True, EnvRunners don’t limit the number of collected batches within the same sample() call based on the number of sub-environments within the worker (no sub-environments present).

Returns:

This updated AlgorithmConfig object.
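
A sketch of reading columnar Parquet recordings for offline training (the BC algorithm and the local path are assumptions):

    from ray.rllib.algorithms.bc import BCConfig

    config = (
        BCConfig()
        .environment("CartPole-v1")
        .offline_data(
            input_="/tmp/cartpole-offline",   # hypothetical directory of Parquet files
            input_read_method="read_parquet",
            input_read_episodes=False,        # data is in columnar format
            materialize_data=False,           # stream instead of loading all into memory
            dataset_num_iters_per_learner=1,  # required with a single (local) Learner
        )
    )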

Configuring evaluation settings#

AlgorithmConfig.evaluation(*, evaluation_interval: int | None = <ray.rllib.utils.from_config._NotProvided object>, evaluation_duration: int | str | None = <ray.rllib.utils.from_config._NotProvided object>, evaluation_duration_unit: str | None = <ray.rllib.utils.from_config._NotProvided object>, evaluation_sample_timeout_s: float | None = <ray.rllib.utils.from_config._NotProvided object>, evaluation_parallel_to_training: bool | None = <ray.rllib.utils.from_config._NotProvided object>, evaluation_force_reset_envs_before_iteration: bool | None = <ray.rllib.utils.from_config._NotProvided object>, evaluation_config: ~ray.rllib.algorithms.algorithm_config.AlgorithmConfig | dict | None = <ray.rllib.utils.from_config._NotProvided object>, off_policy_estimation_methods: ~typing.Dict | None = <ray.rllib.utils.from_config._NotProvided object>, ope_split_batch_by_episode: bool | None = <ray.rllib.utils.from_config._NotProvided object>, evaluation_num_env_runners: int | None = <ray.rllib.utils.from_config._NotProvided object>, custom_evaluation_function: ~typing.Callable | None = <ray.rllib.utils.from_config._NotProvided object>, always_attach_evaluation_results=-1, evaluation_num_workers=-1) AlgorithmConfig[source]

Sets the config’s evaluation settings.

Parameters:
  • evaluation_interval – Evaluate every evaluation_interval training iterations. The evaluation stats are reported under the “evaluation” metric key. Set to None (or 0) for no evaluation.

  • evaluation_duration – Duration for which to run evaluation each evaluation_interval. The unit for the duration can be set via evaluation_duration_unit to either “episodes” (default) or “timesteps”. If using multiple evaluation workers (EnvRunners), i.e. evaluation_num_env_runners > 1, the number of episodes/timesteps to run is split amongst these. A special value of “auto” can be used in case evaluation_parallel_to_training=True. This is the recommended way when trying to save as much time on evaluation as possible. The Algorithm then runs as many timesteps via the evaluation workers as possible, while not taking longer than the training step running in parallel and thus never wasting any idle time on either training or evaluation workers. When using this setting (evaluation_duration="auto"), it is strongly advised to set evaluation_interval=1 and evaluation_force_reset_envs_before_iteration=True at the same time.

  • evaluation_duration_unit – The unit, with which to count the evaluation duration. Either “episodes” (default) or “timesteps”. Note that this setting is ignored if evaluation_duration="auto".

  • evaluation_sample_timeout_s – The timeout (in seconds) for evaluation workers to sample a complete episode in the case your config settings are: evaluation_duration != auto and evaluation_duration_unit=episode. After this time, the user receives a warning and instructions on how to fix the issue.

  • evaluation_parallel_to_training – Whether to run evaluation in parallel to the Algorithm.training_step() call, using threading. Default=False. E.g. for evaluation_interval=1 -> In every call to Algorithm.train(), the Algorithm.training_step() and Algorithm.evaluate() calls run in parallel. Note that this setting - albeit extremely efficient b/c it wastes no extra time for evaluation - causes the evaluation results to lag one iteration behind the rest of the training results. This is important when picking a good checkpoint. For example, if iteration 42 reports a good evaluation episode_return_mean, be aware that these results were achieved on the weights trained in iteration 41, so you should probably pick the iteration 41 checkpoint instead.

  • evaluation_force_reset_envs_before_iteration – Whether all environments should be force-reset (even if they are not done yet) right before the evaluation step of the iteration begins. Setting this to True (default) makes sure that the evaluation results aren’t polluted with episode statistics that were actually (at least partially) achieved with an earlier set of weights. Note that this setting is only supported on the new API stack with EnvRunners and ConnectorV2 (config.enable_rl_module_and_learner=True AND config.enable_env_runner_and_connector_v2=True).

  • evaluation_config – Typical usage is to pass extra args to evaluation env creator and to disable exploration by computing deterministic actions. IMPORTANT NOTE: Policy gradient algorithms are able to find the optimal policy, even if this is a stochastic one. Setting “explore=False” here results in the evaluation workers not using this optimal policy!

  • off_policy_estimation_methods – Specify how to evaluate the current policy, along with any optional config parameters. This only has an effect when reading offline experiences (“input” is not “sampler”). Available keys: {ope_method_name: {“type”: ope_type, …}} where ope_method_name is a user-defined string to save the OPE results under, and ope_type can be any subclass of OffPolicyEstimator, e.g. ray.rllib.offline.estimators.is::ImportanceSampling or your own custom subclass, or the full class path to the subclass. You can also add additional config arguments to be passed to the OffPolicyEstimator in the dict, e.g. {“qreg_dr”: {“type”: DoublyRobust, “q_model_type”: “qreg”, “k”: 5}}

  • ope_split_batch_by_episode – Whether to use SampleBatch.split_by_episode() to split the input batch into episodes before estimating the OPE metrics. For bandits, set this to False to speed up OPE evaluation; splitting by episode isn’t necessary there because each record is already a single timestep. The default is True.

  • evaluation_num_env_runners – Number of parallel EnvRunners to use for evaluation. Note that this is set to zero by default, which means evaluation runs in the algorithm process (only if evaluation_interval is not 0 or None). Increasing this also increases the Ray resource usage of the algorithm, because the evaluation workers are created separately from the EnvRunners used to sample data for training.

  • custom_evaluation_function – Customize the evaluation method. This must be a function of signature (algo: Algorithm, eval_workers: EnvRunnerGroup) -> (metrics: dict, env_steps: int, agent_steps: int) (metrics: dict if enable_env_runner_and_connector_v2=True), where env_steps and agent_steps define the number of sampled steps during the evaluation iteration. See the Algorithm.evaluate() method to see the default implementation. The Algorithm guarantees all eval workers have the latest policy state before this function is called.

Returns:

This updated AlgorithmConfig object.
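
For illustration, here is a minimal sketch of the parallel-evaluation setup recommended above. PPO and the CartPole-v1 environment are just example choices, and the numeric values are placeholders:

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    # Evaluate every iteration, in parallel to training, for as many
    # timesteps as fit into the parallel training step ("auto").
    .evaluation(
        evaluation_interval=1,
        evaluation_duration="auto",
        evaluation_parallel_to_training=True,
        evaluation_num_env_runners=2,
        evaluation_force_reset_envs_before_iteration=True,
        # Evaluation-only overrides, e.g. deterministic actions.
        evaluation_config={"explore": False},
    )
)
algo = config.build_algo()
results = algo.train()  # "evaluation" results lag one iteration behind.
```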

Configuring deep learning framework settings#

AlgorithmConfig.framework(framework: str | None = <ray.rllib.utils.from_config._NotProvided object>, *, eager_tracing: bool | None = <ray.rllib.utils.from_config._NotProvided object>, eager_max_retraces: int | None = <ray.rllib.utils.from_config._NotProvided object>, tf_session_args: ~typing.Dict[str, ~typing.Any] | None = <ray.rllib.utils.from_config._NotProvided object>, local_tf_session_args: ~typing.Dict[str, ~typing.Any] | None = <ray.rllib.utils.from_config._NotProvided object>, torch_compile_learner: bool | None = <ray.rllib.utils.from_config._NotProvided object>, torch_compile_learner_what_to_compile: str | None = <ray.rllib.utils.from_config._NotProvided object>, torch_compile_learner_dynamo_mode: str | None = <ray.rllib.utils.from_config._NotProvided object>, torch_compile_learner_dynamo_backend: str | None = <ray.rllib.utils.from_config._NotProvided object>, torch_compile_worker: bool | None = <ray.rllib.utils.from_config._NotProvided object>, torch_compile_worker_dynamo_backend: str | None = <ray.rllib.utils.from_config._NotProvided object>, torch_compile_worker_dynamo_mode: str | None = <ray.rllib.utils.from_config._NotProvided object>, torch_ddp_kwargs: ~typing.Dict[str, ~typing.Any] | None = <ray.rllib.utils.from_config._NotProvided object>, torch_skip_nan_gradients: bool | None = <ray.rllib.utils.from_config._NotProvided object>) AlgorithmConfig[source]

Sets the config’s DL framework settings.

Parameters:
  • framework – torch: PyTorch; tf2: TensorFlow 2.x (eager execution or traced if eager_tracing=True); tf: TensorFlow (static-graph).

  • eager_tracing – Enable tracing in eager mode. This greatly improves performance (speedup ~2x), but makes it slightly harder to debug since Python code won’t be evaluated after the initial eager pass. Only possible if framework=tf2.

  • eager_max_retraces – Maximum number of tf.function re-traces before a runtime error is raised. This is to prevent unnoticed retraces of methods inside the ..._eager_traced Policy, which could slow down execution by a factor of 4, without the user noticing what the root cause for this slowdown could be. Only necessary for framework=tf2. Set to None to ignore the re-trace count and never throw an error.

  • tf_session_args – Configures TF for single-process operation by default.

  • local_tf_session_args – Override tf_session_args on the local worker (driver).

  • torch_compile_learner – If True, the forward_train methods of TorchRLModules on the Learner are compiled. If not specified, the default is to compile forward_train on the Learner.

  • torch_compile_learner_what_to_compile – A TorchCompileWhatToCompile mode specifying what to compile on the learner side if torch_compile_learner is True. See TorchCompileWhatToCompile for details and advice on its usage.

  • torch_compile_learner_dynamo_backend – The torch dynamo backend to use on the learner.

  • torch_compile_learner_dynamo_mode – The torch dynamo mode to use on the learner.

  • torch_compile_worker – If True, the forward_exploration and forward_inference methods of TorchRLModules on the workers are compiled. If not specified, the default is to not compile forward methods on the workers because retracing can be expensive.

  • torch_compile_worker_dynamo_backend – The torch dynamo backend to use on the workers.

  • torch_compile_worker_dynamo_mode – The torch dynamo mode to use on the workers.

  • torch_ddp_kwargs – The kwargs to pass into torch.nn.parallel.DistributedDataParallel when using num_learners > 1. This is specifically helpful when searching for unused parameters that are not used in the backward pass. This can give hints for errors in custom models where some parameters do not get touched in the backward pass although they should.

  • torch_skip_nan_gradients – Whether to entirely skip updates that contain NaN gradients. If True, the optimizer skips an update entirely whenever any of its gradients is NaN. This can help avoid biasing moving-average-based optimizers such as Adam, and can help in training phases where policy updates are highly unstable, for example during the early stages of training or with highly exploratory policies. In such phases many gradients might turn NaN, and setting them to zero could corrupt the optimizer’s internal state. The default is False, which turns NaN gradients into zeros. If many NaN gradients are encountered, consider (a) monitoring gradients by setting log_gradients in AlgorithmConfig to True, (b) using proper weight initialization (e.g. Xavier, Kaiming) via the model_config_dict in AlgorithmConfig.rl_module, and/or (c) gradient clipping via grad_clip in AlgorithmConfig.training.

Returns:

This updated AlgorithmConfig object.
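
A minimal sketch of how these framework settings compose. PPO, CartPole-v1, and the "inductor" dynamo backend are illustrative choices, not recommendations:

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    .framework(
        "torch",
        # Compile the Learner's forward_train pass with torch dynamo.
        torch_compile_learner=True,
        torch_compile_learner_dynamo_backend="inductor",
        # Skip optimizer updates whose gradients contain NaNs.
        torch_skip_nan_gradients=True,
    )
)
```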

Configuring reporting settings#

AlgorithmConfig.reporting(*, keep_per_episode_custom_metrics: bool | None = <ray.rllib.utils.from_config._NotProvided object>, metrics_episode_collection_timeout_s: float | None = <ray.rllib.utils.from_config._NotProvided object>, metrics_num_episodes_for_smoothing: int | None = <ray.rllib.utils.from_config._NotProvided object>, min_time_s_per_iteration: float | None = <ray.rllib.utils.from_config._NotProvided object>, min_train_timesteps_per_iteration: int | None = <ray.rllib.utils.from_config._NotProvided object>, min_sample_timesteps_per_iteration: int | None = <ray.rllib.utils.from_config._NotProvided object>, log_gradients: bool | None = <ray.rllib.utils.from_config._NotProvided object>) AlgorithmConfig[source]

Sets the config’s reporting settings.

Parameters:
  • keep_per_episode_custom_metrics – Store raw custom metrics without calculating max, min, and mean.

  • metrics_episode_collection_timeout_s – Wait for metric batches for at most this many seconds. Those that have not returned in time are collected in the next train iteration.

  • metrics_num_episodes_for_smoothing – Smooth rollout metrics over this many episodes, if possible. In case rollouts (sample collection) just started, there may be fewer than this many episodes in the buffer and we’ll compute metrics over this smaller number of available episodes. In case there are more than this many episodes collected in a single training iteration, use all of these episodes for metrics computation, meaning don’t ever cut any “excess” episodes. Set this to 1 to disable smoothing and to always report only the most recently collected episode’s return.

  • min_time_s_per_iteration – Minimum time (in sec) to accumulate within a single Algorithm.train() call. This value does not affect learning, only the number of times Algorithm.training_step() is called by Algorithm.train(). If, after one such step attempt, the time taken has not reached min_time_s_per_iteration, RLlib performs n more Algorithm.training_step() calls until the minimum time has been consumed. Set to 0 or None for no minimum time.

  • min_train_timesteps_per_iteration – Minimum training timesteps to accumulate within a single train() call. This value does not affect learning, only the number of times Algorithm.training_step() is called by Algorithm.train(). If, after one such step attempt, the training timestep count has not been reached, RLlib performs n more training_step() calls until the minimum timesteps have been executed. Set to 0 or None for no minimum timesteps.

  • min_sample_timesteps_per_iteration – Minimum env sampling timesteps to accumulate within a single train() call. This value does not affect learning, only the number of times Algorithm.training_step() is called by Algorithm.train(). If, after one such step attempt, the env sampling timestep count has not been reached, RLlib performs n more training_step() calls until the minimum timesteps have been executed. Set to 0 or None for no minimum timesteps.

  • log_gradients – Log gradients to results. If True, the global norm of the gradient dictionary for each optimizer is logged to results. The default is True.

Returns:

This updated AlgorithmConfig object.
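
A minimal sketch of a reporting setup; the algorithm, environment, and all numeric values are illustrative only:

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    .reporting(
        # Smooth episode metrics over (up to) the last 20 episodes.
        metrics_num_episodes_for_smoothing=20,
        # Keep calling training_step() until at least 4000 env steps
        # have been sampled within this train() call.
        min_sample_timesteps_per_iteration=4000,
        # No minimum wall-clock time per iteration.
        min_time_s_per_iteration=None,
        log_gradients=True,
    )
)
```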

Configuring checkpointing settings#

AlgorithmConfig.checkpointing(export_native_model_files: bool | None = <ray.rllib.utils.from_config._NotProvided object>, checkpoint_trainable_policies_only: bool | None = <ray.rllib.utils.from_config._NotProvided object>) AlgorithmConfig[source]

Sets the config’s checkpointing settings.

Parameters:
  • export_native_model_files – Whether checkpoints of an individual Policy or of the Algorithm also contain (tf or torch) native model files. These could be used to restore just the NN models from these files without requiring RLlib. These files are generated by calling the tf- or torch-built-in saving utility methods on the actual models.

  • checkpoint_trainable_policies_only – Whether to only add Policies to the Algorithm checkpoint (in sub-directory “policies/”) that are trainable according to the is_trainable_policy callable of the local worker.

Returns:

This updated AlgorithmConfig object.
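
A minimal sketch; the algorithm and environment are placeholders:

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    .checkpointing(
        # Also write native torch/tf model files into each checkpoint,
        # so the NN weights can be restored without RLlib.
        export_native_model_files=True,
        # Only include trainable policies in the "policies/" sub-directory.
        checkpoint_trainable_policies_only=True,
    )
)
```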

Configuring debugging settings#

AlgorithmConfig.debugging(*, logger_creator: ~typing.Callable[[], ~ray.tune.logger.logger.Logger] | None = <ray.rllib.utils.from_config._NotProvided object>, logger_config: dict | None = <ray.rllib.utils.from_config._NotProvided object>, log_level: str | None = <ray.rllib.utils.from_config._NotProvided object>, log_sys_usage: bool | None = <ray.rllib.utils.from_config._NotProvided object>, fake_sampler: bool | None = <ray.rllib.utils.from_config._NotProvided object>, seed: int | None = <ray.rllib.utils.from_config._NotProvided object>, _run_training_always_in_thread: bool | None = <ray.rllib.utils.from_config._NotProvided object>, _evaluation_parallel_to_training_wo_thread: bool | None = <ray.rllib.utils.from_config._NotProvided object>) AlgorithmConfig[source]

Sets the config’s debugging settings.

Parameters:
  • logger_creator – Callable that creates a ray.tune.Logger object. If unspecified, a default logger is created.

  • logger_config – Define logger-specific configuration to be used inside the Logger. The default value None allows overwriting with nested dicts.

  • log_level – Set the ray.rllib.* log level for the agent process and its workers. Should be one of DEBUG, INFO, WARN, or ERROR. The DEBUG level also periodically prints out summaries of relevant internal dataflow (this is also printed out once at startup at the INFO level).

  • log_sys_usage – Log system resource metrics to results. This requires psutil to be installed for sys stats, and gputil for GPU metrics.

  • fake_sampler – Use fake (infinite speed) sampler. For testing only.

  • seed – This argument, in conjunction with worker_index, sets the random seed of each worker, so that identically configured trials have identical results. This makes experiments reproducible.

  • _run_training_always_in_thread – Runs the n training_step() calls per iteration always in a separate thread (just as we would do with evaluation_parallel_to_training=True, but even without evaluation going on and even without evaluation workers being created in the Algorithm).

  • _evaluation_parallel_to_training_wo_thread – Only relevant if evaluation_parallel_to_training is True. Then, in order to achieve parallelism, RLlib doesn’t use a thread pool (as it usually does in this situation).

Returns:

This updated AlgorithmConfig object.
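
A minimal sketch of a debugging setup; algorithm, environment, and seed value are placeholders:

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    .debugging(
        log_level="INFO",
        # Combined with each worker's worker_index, this seeds the workers
        # so that identically configured trials produce identical results.
        seed=42,
        # Log system resource metrics (requires psutil; gputil for GPUs).
        log_sys_usage=True,
    )
)
```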

Configuring experimental settings#

AlgorithmConfig.experimental(*, _use_msgpack_checkpoints: bool | None = <ray.rllib.utils.from_config._NotProvided object>, _torch_grad_scaler_class: ~typing.Type | None = <ray.rllib.utils.from_config._NotProvided object>, _torch_lr_scheduler_classes: ~typing.List[~typing.Type] | ~typing.Dict[str, ~typing.List[~typing.Type]] | None = <ray.rllib.utils.from_config._NotProvided object>, _tf_policy_handles_more_than_one_loss: bool | None = <ray.rllib.utils.from_config._NotProvided object>, _disable_preprocessor_api: bool | None = <ray.rllib.utils.from_config._NotProvided object>, _disable_action_flattening: bool | None = <ray.rllib.utils.from_config._NotProvided object>, _disable_initialize_loss_from_dummy_batch: bool | None = <ray.rllib.utils.from_config._NotProvided object>) AlgorithmConfig[source]

Sets the config’s experimental settings.

Parameters:
  • _use_msgpack_checkpoints – Create state files in all checkpoints through msgpack rather than pickle.

  • _torch_grad_scaler_class – Class to use for torch loss scaling (and gradient unscaling). The class must implement the following methods to be compatible with a TorchLearner. These methods/APIs match exactly those of torch’s own torch.amp.GradScaler (see here for more details https://pytorch.org/docs/stable/amp.html#gradient-scaling): scale([loss]) to scale the loss by some factor. get_scale() to get the current scale factor value. step([optimizer]) to unscale the grads (divide by the scale factor) and step the given optimizer. update() to update the scaler after an optimizer step (for example to adjust the scale factor).

  • _torch_lr_scheduler_classes – A list of torch.lr_scheduler.LRScheduler (see here for more details https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate) classes or a dictionary mapping module IDs to such a list of respective scheduler classes. Multiple scheduler classes can be applied in sequence and are stepped in the same sequence as defined here. Note, most learning rate schedulers need arguments to be configured, that is, you might have to partially initialize the schedulers in the list(s) using functools.partial.

  • _tf_policy_handles_more_than_one_loss – Experimental flag. If True, TFPolicy handles more than one loss or optimizer. Set this to True, if you would like to return more than one loss term from your loss_fn and an equal number of optimizers from your optimizer_fn.

  • _disable_preprocessor_api – Experimental flag. If True, no (observation) preprocessor is created and observations arrive in the model as they are returned by the env.

  • _disable_action_flattening – Experimental flag. If True, RLlib doesn’t flatten the policy-computed actions into a single tensor (for storage in SampleCollectors/output files/etc.), but leaves (possibly nested) actions as-is. Disabling flattening affects: SampleCollectors, which have to store possibly nested action structs; models that take the previous action(s) as part of their input; and algorithms reading from offline files (including action information).

Returns:

This updated AlgorithmConfig object.
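
A minimal sketch combining two of these experimental flags; the algorithm, environment, and scheduler arguments are illustrative assumptions:

```python
import functools

import torch
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    .experimental(
        # Write msgpack instead of pickle state files into checkpoints.
        _use_msgpack_checkpoints=True,
        # Most LR schedulers need constructor arguments, so pass them
        # partially initialized via functools.partial.
        _torch_lr_scheduler_classes=[
            functools.partial(
                torch.optim.lr_scheduler.StepLR, step_size=1000, gamma=0.9
            ),
        ],
    )
)
```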