Note

From Ray 2.6.0 onwards, RLlib is adopting a new stack for training and model customization, gradually replacing the ModelV2 API and some convoluted parts of Policy API with the RLModule API. Click here for details.

Algorithms#

Tip

Check out the environments page to learn more about different environment types.

Available Algorithms - Overview#

| Algorithm | Frameworks | Discrete Actions | Continuous Actions | Multi-Agent | Model Support | Multi-GPU |
| --- | --- | --- | --- | --- | --- | --- |
| APPO | tf + torch | Yes +parametric | Yes | Yes | +RNN, +LSTM auto-wrapping, +Attention, +autoreg | tf + torch |
| BC | tf + torch | Yes +parametric | Yes | Yes | +RNN | torch |
| CQL | tf + torch | No | Yes | No | | tf + torch |
| DreamerV3 | tf | Yes | Yes | No | +RNN (GRU-based by default) | tf |
| DQN, Rainbow | tf + torch | Yes +parametric | No | Yes | | tf + torch |
| IMPALA | tf + torch | Yes +parametric | Yes | Yes | +RNN, +LSTM auto-wrapping, +Attention, +autoreg | tf + torch |
| MARWIL | tf + torch | Yes +parametric | Yes | Yes | +RNN | torch |
| PPO | tf + torch | Yes +parametric | Yes | Yes | +RNN, +LSTM auto-wrapping, +Attention, +autoreg | tf + torch |
| SAC | tf + torch | Yes | Yes | Yes | | torch |

Multi-Agent only Methods

| Algorithm | Frameworks | Discrete Actions | Continuous Actions | Multi-Agent | Model Support |
| --- | --- | --- | --- | --- | --- |
| Parameter Sharing | Depends on bootstrapped algorithm | | | | |
| Fully Independent Learning | Depends on bootstrapped algorithm | | | | |
| Shared Critic Methods | Depends on bootstrapped algorithm | | | | |

Offline#

Behavior Cloning (BC; derived from MARWIL implementation)#

pytorch tensorflow [paper] [implementation]

Our behavioral cloning implementation is directly derived from our MARWIL implementation, with the only difference being that the beta parameter is force-set to 0.0. This makes BC try to match the behavior policy that generated the offline data, disregarding any resulting rewards. BC requires the offline datasets API to be used.

Tuned examples: CartPole-v1

BC-specific configs (see also common configs):

class ray.rllib.algorithms.bc.bc.BCConfig(algo_class=None)[source]#

Defines a configuration class from which a new BC Algorithm can be built.

from ray.rllib.algorithms.bc import BCConfig
# Run this from the ray directory root.
config = BCConfig().training(lr=0.00001, gamma=0.99)
config = config.offline_data(
    input_="./rllib/tests/data/cartpole/large.json")

# Build an Algorithm object from the config and run 1 training iteration.
algo = config.build()
algo.train()
from ray.rllib.algorithms.bc import BCConfig
from ray import tune
config = BCConfig()
# Print out some default values.
print(config.beta)
# Update the config object.
config.training(
    lr=tune.grid_search([0.001, 0.0001]), beta=0.75
)
# Set the config object's data path.
# Run this from the ray directory root.
config.offline_data(
    input_="./rllib/tests/data/cartpole/large.json"
)
# Set the config object's env, used for evaluation.
config.environment(env="CartPole-v1")
# Use to_dict() to get the old-style python config dict
# when running with tune.
tune.Tuner(
    "BC",
    param_space=config.to_dict(),
).fit()
training(*, beta: float | None = <ray.rllib.utils.from_config._NotProvided object>, bc_logstd_coeff: float | None = <ray.rllib.utils.from_config._NotProvided object>, moving_average_sqd_adv_norm_update_rate: float | None = <ray.rllib.utils.from_config._NotProvided object>, moving_average_sqd_adv_norm_start: float | None = <ray.rllib.utils.from_config._NotProvided object>, vf_coeff: float | None = <ray.rllib.utils.from_config._NotProvided object>, grad_clip: float | None = <ray.rllib.utils.from_config._NotProvided object>, **kwargs) MARWILConfig#

Sets the training related configuration.

Parameters:
  • beta – Scaling of advantages in exponential terms. When beta is 0.0, MARWIL is reduced to behavior cloning (imitation learning); see bc.py algorithm in this same directory.

  • bc_logstd_coeff – A coefficient to encourage higher action distribution entropy for exploration.

  • moving_average_sqd_adv_norm_start – Starting value for the squared moving average advantage norm (c^2).

  • vf_coeff – Balancing value estimation loss and policy optimization loss.

  • moving_average_sqd_adv_norm_update_rate – Update rate for the squared moving average advantage norm (c^2).

  • grad_clip – If specified, clip the global norm of gradients by this amount.

Returns:

This updated AlgorithmConfig object.

Conservative Q-Learning (CQL)#

pytorch tensorflow [paper] [implementation]

In offline RL, the algorithm has no access to an environment, but can only sample from a fixed dataset of pre-collected state-action-reward tuples. In particular, CQL (Conservative Q-Learning) is an offline RL algorithm that mitigates the overestimation of Q-values outside the dataset distribution via conservative critic estimates. It does so by adding a simple Q regularizer loss to the standard Bellman update loss. This ensures that the critic does not output overly-optimistic Q-values. This conservative correction term can be added on top of any off-policy Q-learning algorithm (here, we provide this for SAC).
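
The conservative penalty itself is simple. Below is a minimal, schematic PyTorch sketch of that regularizer, not RLlib's actual implementation; q_net, obs, dataset_actions, sampled_actions, and the default min_q_weight value are illustrative placeholders:

import torch

def conservative_q_penalty(q_net, obs, dataset_actions, sampled_actions, min_q_weight=5.0):
    """Schematic CQL regularizer (added on top of the usual Bellman loss)."""
    # Q-values for candidate actions (e.g. sampled uniformly and/or from the
    # current policy): one [batch]-shaped tensor per candidate action.
    q_sampled = torch.stack([q_net(obs, a) for a in sampled_actions], dim=1)
    # Soft maximum over candidate actions: pushes down Q-values of actions
    # outside the dataset distribution.
    logsumexp_q = torch.logsumexp(q_sampled, dim=1)
    # Q-values of the actions that actually appear in the offline dataset:
    # these get pushed up.
    q_data = q_net(obs, dataset_actions)
    return min_q_weight * (logsumexp_q - q_data).mean()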

RLlib’s CQL is evaluated against the Behavior Cloning (BC) benchmark at 500K gradient steps over the dataset. The only difference between the BC- and CQL configs is the bc_iters parameter in CQL, indicating how many gradient steps we perform over the BC loss. CQL is evaluated on the D4RL benchmark, which has pre-collected offline datasets for many types of environments.

Tuned examples: HalfCheetah Random, Hopper Random

CQL-specific configs (see also common configs):

class ray.rllib.algorithms.cql.cql.CQLConfig(algo_class=None)[source]#

Defines a configuration class from which a CQL Algorithm can be built.

from ray.rllib.algorithms.cql import CQLConfig
config = CQLConfig().training(gamma=0.9, lr=0.01)
config = config.resources(num_gpus=0)
config = config.rollouts(num_rollout_workers=4)
print(config.to_dict())
# Build an Algorithm object from the config and run 1 training iteration.
algo = config.build(env="CartPole-v1")
algo.train()
training(*, bc_iters: int | None = <ray.rllib.utils.from_config._NotProvided object>, temperature: float | None = <ray.rllib.utils.from_config._NotProvided object>, num_actions: int | None = <ray.rllib.utils.from_config._NotProvided object>, lagrangian: bool | None = <ray.rllib.utils.from_config._NotProvided object>, lagrangian_thresh: float | None = <ray.rllib.utils.from_config._NotProvided object>, min_q_weight: float | None = <ray.rllib.utils.from_config._NotProvided object>, **kwargs) CQLConfig[source]#

Sets the training-related configuration.

Parameters:
  • bc_iters – Number of iterations with Behavior Cloning pretraining.

  • temperature – CQL loss temperature.

  • num_actions – Number of actions to sample for the CQL loss.

  • lagrangian – Whether to use the Lagrangian for Alpha Prime (in CQL loss).

  • lagrangian_thresh – Lagrangian threshold.

  • min_q_weight – Multiplier for the min-Q (conservative) term in the CQL loss.

Returns:

This updated AlgorithmConfig object.

Monotonic Advantage Re-Weighted Imitation Learning (MARWIL)#

pytorch tensorflow [paper] [implementation]

MARWIL is a hybrid imitation learning and policy gradient algorithm suitable for training on batched historical data. When the beta hyperparameter is set to zero, the MARWIL objective reduces to vanilla imitation learning (see BC). MARWIL requires the offline datasets API to be used.
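
To make the role of beta concrete, here is a minimal, schematic PyTorch sketch of MARWIL's exponentially advantage-weighted imitation term, not RLlib's actual loss code; logp_actions, advantages, and adv_norm are illustrative placeholders:

import torch

def marwil_policy_loss(logp_actions, advantages, adv_norm, beta):
    """Schematic MARWIL objective: weight the log-likelihood of dataset
    actions by exp(beta * normalized advantage). With beta = 0.0 all weights
    become 1.0 and the loss reduces to plain behavior cloning (BC)."""
    weights = torch.exp(beta * (advantages / adv_norm))
    # Maximize the weighted log-likelihood, i.e. minimize its negative.
    return -(weights.detach() * logp_actions).mean()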

Tuned examples: CartPole-v1

MARWIL-specific configs (see also common configs):

class ray.rllib.algorithms.marwil.marwil.MARWILConfig(algo_class=None)[source]#

Defines a configuration class from which a MARWIL Algorithm can be built.

Example

>>> from ray.rllib.algorithms.marwil import MARWILConfig
>>> # Run this from the ray directory root.
>>> config = MARWILConfig()  
>>> config = config.training(beta=1.0, lr=0.00001, gamma=0.99)  
>>> config = config.offline_data(  
...     input_=["./rllib/tests/data/cartpole/large.json"])
>>> print(config.to_dict()) 
...
>>> # Build an Algorithm object from the config and run 1 training iteration.
>>> algo = config.build()  
>>> algo.train() 

Example

>>> from ray.rllib.algorithms.marwil import MARWILConfig
>>> from ray import tune
>>> config = MARWILConfig()
>>> # Print out some default values.
>>> print(config.beta)  
>>> # Update the config object.
>>> config.training(lr=tune.grid_search(  
...     [0.001, 0.0001]), beta=0.75)
>>> # Set the config object's data path.
>>> # Run this from the ray directory root.
>>> config.offline_data( 
...     input_=["./rllib/tests/data/cartpole/large.json"])
>>> # Set the config object's env, used for evaluation.
>>> config.environment(env="CartPole-v1")  
>>> # Use to_dict() to get the old-style python config dict
>>> # when running with tune.
>>> tune.Tuner(  
...     "MARWIL",
...     param_space=config.to_dict(),
... ).fit()
training(*, beta: float | None = <ray.rllib.utils.from_config._NotProvided object>, bc_logstd_coeff: float | None = <ray.rllib.utils.from_config._NotProvided object>, moving_average_sqd_adv_norm_update_rate: float | None = <ray.rllib.utils.from_config._NotProvided object>, moving_average_sqd_adv_norm_start: float | None = <ray.rllib.utils.from_config._NotProvided object>, vf_coeff: float | None = <ray.rllib.utils.from_config._NotProvided object>, grad_clip: float | None = <ray.rllib.utils.from_config._NotProvided object>, **kwargs) MARWILConfig[source]#

Sets the training related configuration.

Parameters:
  • beta – Scaling of advantages in exponential terms. When beta is 0.0, MARWIL is reduced to behavior cloning (imitation learning); see bc.py algorithm in this same directory.

  • bc_logstd_coeff – A coefficient to encourage higher action distribution entropy for exploration.

  • moving_average_sqd_adv_norm_start – Starting value for the squared moving average advantage norm (c^2).

  • vf_coeff – Balancing value estimation loss and policy optimization loss.

  • moving_average_sqd_adv_norm_update_rate – Update rate for the squared moving average advantage norm (c^2).

  • grad_clip – If specified, clip the global norm of gradients by this amount.

Returns:

This updated AlgorithmConfig object.

Model-free On-policy RL#

Asynchronous Proximal Policy Optimization (APPO)#

pytorch tensorflow [paper] [implementation] We include an asynchronous variant of Proximal Policy Optimization (PPO) based on the IMPALA architecture. This is similar to IMPALA but using a surrogate policy loss with clipping. Compared to synchronous PPO, APPO is more efficient in wall-clock time due to its use of asynchronous sampling. Using a clipped loss also allows for multiple SGD passes, and therefore the potential for better sample efficiency compared to IMPALA. V-trace can also be enabled to correct for off-policy samples.

Tip

APPO is not always more efficient; it is often better to use standard PPO or IMPALA.

../_images/impala-arch.svg

APPO architecture (same as IMPALA)#

Tuned examples: PongNoFrameskip-v4

APPO-specific configs (see also common configs):

class ray.rllib.algorithms.appo.appo.APPOConfig(algo_class=None)[source]#

Defines a configuration class from which an APPO Algorithm can be built.

from ray.rllib.algorithms.appo import APPOConfig
config = APPOConfig().training(lr=0.01, grad_clip=30.0, train_batch_size=50)
config = config.resources(num_gpus=0)
config = config.rollouts(num_rollout_workers=1)
config = config.environment("CartPole-v1")

# Build an Algorithm object from the config and run 1 training iteration.
algo = config.build()
algo.train()
del algo
from ray.rllib.algorithms.appo import APPOConfig
from ray import air
from ray import tune

config = APPOConfig()
# Update the config object.
config = config.training(lr=tune.grid_search([0.001,]))
# Set the config object's env.
config = config.environment(env="CartPole-v1")
# Use to_dict() to get the old-style python config dict
# when running with tune.
tune.Tuner(
    "APPO",
    run_config=air.RunConfig(stop={"training_iteration": 1},
                             verbose=0),
    param_space=config.to_dict(),

).fit()
training(*, vtrace: bool | None = <ray.rllib.utils.from_config._NotProvided object>, use_critic: bool | None = <ray.rllib.utils.from_config._NotProvided object>, use_gae: bool | None = <ray.rllib.utils.from_config._NotProvided object>, lambda_: float | None = <ray.rllib.utils.from_config._NotProvided object>, clip_param: float | None = <ray.rllib.utils.from_config._NotProvided object>, use_kl_loss: bool | None = <ray.rllib.utils.from_config._NotProvided object>, kl_coeff: float | None = <ray.rllib.utils.from_config._NotProvided object>, kl_target: float | None = <ray.rllib.utils.from_config._NotProvided object>, tau: float | None = <ray.rllib.utils.from_config._NotProvided object>, target_update_frequency: int | None = <ray.rllib.utils.from_config._NotProvided object>, **kwargs) APPOConfig[source]#

Sets the training related configuration.

Parameters:
  • vtrace – Whether to use V-trace weighted advantages. If false, PPO GAE advantages will be used instead.

  • use_critic – Should use a critic as a baseline (otherwise don’t use value baseline; required for using GAE). Only applies if vtrace=False.

  • use_gae – If true, use the Generalized Advantage Estimator (GAE) with a value function, see https://arxiv.org/pdf/1506.02438.pdf. Only applies if vtrace=False.

  • lambda – GAE (lambda) parameter.

  • clip_param – PPO surrogate clipping parameter.

  • use_kl_loss – Whether to use the KL-term in the loss function.

  • kl_coeff – Coefficient for weighting the KL-loss term.

  • kl_target – Target term for the KL-term to reach (via adjusting the kl_coeff automatically).

  • tau – The factor by which to update the target policy network towards the current policy network. Can range between 0 and 1. e.g. updated_param = tau * current_param + (1 - tau) * target_param

  • target_update_frequency – The frequency with which to update the target policy and tune the KL loss coefficients that are used during training. After setting this parameter, the algorithm waits for at least target_update_frequency * minibatch_size * num_sgd_iter samples to be trained on by the learner group before updating the target networks and tuning the KL loss coefficients. NOTE: This parameter is only applicable when using the Learner API (_enable_new_api_stack=True).

Returns:

This updated AlgorithmConfig object.

Proximal Policy Optimization (PPO)#

pytorch tensorflow [paper] [implementation] PPO’s clipped objective supports multiple SGD passes over the same batch of experiences. RLlib’s multi-GPU optimizer pins that data in GPU memory to avoid unnecessary transfers from host memory, substantially improving performance over a naive implementation. PPO scales out using multiple workers for experience collection, and also to multiple GPUs for SGD.

Tip

If you need to scale out with GPUs on multiple nodes, consider using decentralized PPO.

../_images/ppo-arch.svg

PPO architecture#

Tuned examples: Unity3D Soccer (multi-agent: Strikers vs Goalie), Humanoid-v1, Hopper-v1, Pendulum-v1, PongDeterministic-v4, Walker2d-v1, HalfCheetah-v2, {BeamRider,Breakout,Qbert,SpaceInvaders}NoFrameskip-v4

Atari results: more details

| Atari env | RLlib PPO @10M | RLlib PPO @25M | Baselines PPO @10M |
| --- | --- | --- | --- |
| BeamRider | 2807 | 4480 | ~1800 |
| Breakout | 104 | 201 | ~250 |
| Qbert | 11085 | 14247 | ~14000 |
| SpaceInvaders | 671 | 944 | ~800 |

Scalability: more details

| MuJoCo env | RLlib PPO 16-workers @ 1h | Fan et al PPO 16-workers @ 1h |
| --- | --- | --- |
| HalfCheetah | 9664 | ~7700 |

../_images/ppo.png

RLlib’s multi-GPU PPO scales to multiple GPUs and hundreds of CPUs on solving the Humanoid-v1 task. Here we compare against a reference MPI-based implementation.#

PPO-specific configs (see also common configs):

class ray.rllib.algorithms.ppo.ppo.PPOConfig(algo_class=None)[source]#

Defines a configuration class from which a PPO Algorithm can be built.

from ray.rllib.algorithms.ppo import PPOConfig
config = PPOConfig()
config = config.training(gamma=0.9, lr=0.01, kl_coeff=0.3,
    train_batch_size=128)
config = config.resources(num_gpus=0)
config = config.rollouts(num_rollout_workers=1)

# Build an Algorithm object from the config and run 1 training iteration.
algo = config.build(env="CartPole-v1")
algo.train()
from ray.rllib.algorithms.ppo import PPOConfig
from ray import air
from ray import tune
config = PPOConfig()
# Print out some default values.
print(config.clip_param)
# Update the config object.
config.training(
    lr=tune.grid_search([0.001 ]), clip_param=0.2
)
# Set the config object's env.
config = config.environment(env="CartPole-v1")

# Use to_dict() to get the old-style python config dict
# when running with tune.
tune.Tuner(
    "PPO",
    run_config=air.RunConfig(stop={"training_iteration": 1}),
    param_space=config.to_dict(),
).fit()
training(*, lr_schedule: ~typing.List[~typing.List[int | float]] | None = <ray.rllib.utils.from_config._NotProvided object>, use_critic: bool | None = <ray.rllib.utils.from_config._NotProvided object>, use_gae: bool | None = <ray.rllib.utils.from_config._NotProvided object>, lambda_: float | None = <ray.rllib.utils.from_config._NotProvided object>, use_kl_loss: bool | None = <ray.rllib.utils.from_config._NotProvided object>, kl_coeff: float | None = <ray.rllib.utils.from_config._NotProvided object>, kl_target: float | None = <ray.rllib.utils.from_config._NotProvided object>, mini_batch_size_per_learner: int | None = <ray.rllib.utils.from_config._NotProvided object>, sgd_minibatch_size: int | None = <ray.rllib.utils.from_config._NotProvided object>, num_sgd_iter: int | None = <ray.rllib.utils.from_config._NotProvided object>, shuffle_sequences: bool | None = <ray.rllib.utils.from_config._NotProvided object>, vf_loss_coeff: float | None = <ray.rllib.utils.from_config._NotProvided object>, entropy_coeff: float | None = <ray.rllib.utils.from_config._NotProvided object>, entropy_coeff_schedule: ~typing.List[~typing.List[int | float]] | None = <ray.rllib.utils.from_config._NotProvided object>, clip_param: float | None = <ray.rllib.utils.from_config._NotProvided object>, vf_clip_param: float | None = <ray.rllib.utils.from_config._NotProvided object>, grad_clip: float | None = <ray.rllib.utils.from_config._NotProvided object>, vf_share_layers=-1, **kwargs) PPOConfig[source]#

Sets the training related configuration.

Parameters:
  • lr_schedule – Learning rate schedule. In the format of [[timestep, lr-value], [timestep, lr-value], …] Intermediary timesteps will be assigned to interpolated learning rate values. A schedule should normally start from timestep 0.

  • use_critic – Should use a critic as a baseline (otherwise don’t use value baseline; required for using GAE).

  • use_gae – If true, use the Generalized Advantage Estimator (GAE) with a value function, see https://arxiv.org/pdf/1506.02438.pdf.

  • lambda – The GAE (lambda) parameter.

  • use_kl_loss – Whether to use the KL-term in the loss function.

  • kl_coeff – Initial coefficient for KL divergence.

  • kl_target – Target value for KL divergence.

  • mini_batch_size_per_learner – Only use if new API stack is enabled. The mini batch size per Learner worker. This is the batch size that each Learner worker’s training batch (whose size is self.train_batch_size_per_learner) will be split into. For example, if the train batch size per Learner worker is 4000 and the mini batch size per Learner worker is 400, the train batch will be split into 10 equal sized chunks (or “mini batches”). Each such mini batch will be used for one SGD update. Overall, the train batch on each Learner worker will be traversed self.num_sgd_iter times. In the above example, if self.num_sgd_iter is 5, we will altogether perform 50 (10x5) SGD updates per Learner update step.

  • sgd_minibatch_size – Total SGD batch size across all devices for SGD. This defines the minibatch size within each epoch. Deprecated on the new API stack (use mini_batch_size_per_learner instead).

  • num_sgd_iter – Number of SGD iterations in each outer loop (i.e., number of epochs to execute per train batch).

  • shuffle_sequences – Whether to shuffle sequences in the batch when training (recommended).

  • vf_loss_coeff – Coefficient of the value function loss. IMPORTANT: you must tune this if you set vf_share_layers=True inside your model’s config.

  • entropy_coeff – Coefficient of the entropy regularizer.

  • entropy_coeff_schedule – Decay schedule for the entropy regularizer.

  • clip_param – The PPO clip parameter.

  • vf_clip_param – Clip param for the value function. Note that this is sensitive to the scale of the rewards. If your expected V is large, increase this.

  • grad_clip – If specified, clip the global norm of gradients by this amount.

Returns:

This updated AlgorithmConfig object.

Importance Weighted Actor-Learner Architecture (IMPALA)#

pytorch tensorflow [paper] [implementation] In IMPALA, a central learner runs SGD in a tight loop while asynchronously pulling sample batches from many actor processes. RLlib’s IMPALA implementation uses DeepMind’s reference V-trace code. Note that we do not provide a deep residual network out of the box, but one can be plugged in as a custom model. Multiple learner GPUs and experience replay are also supported.

../_images/impala-arch.svg

IMPALA architecture#

Tuned examples: PongNoFrameskip-v4, vectorized configuration, multi-gpu configuration, {BeamRider,Breakout,Qbert,SpaceInvaders}NoFrameskip-v4

Atari results @10M steps: more details

| Atari env | RLlib IMPALA 32-workers | Mnih et al A3C 16-workers |
| --- | --- | --- |
| BeamRider | 2071 | ~3000 |
| Breakout | 385 | ~150 |
| Qbert | 4068 | ~1000 |
| SpaceInvaders | 719 | ~600 |

Scalability:

| Atari env | RLlib IMPALA 32-workers @1 hour | Mnih et al A3C 16-workers @1 hour |
| --- | --- | --- |
| BeamRider | 3181 | ~1000 |
| Breakout | 538 | ~10 |
| Qbert | 10850 | ~500 |
| SpaceInvaders | 843 | ~300 |

../_images/impala.png

Multi-GPU IMPALA scales up to solve PongNoFrameskip-v4 in ~3 minutes using a pair of V100 GPUs and 128 CPU workers. The maximum training throughput reached is ~30k transitions per second (~120k environment frames per second).#

IMPALA-specific configs (see also common configs):

class ray.rllib.algorithms.impala.impala.ImpalaConfig(algo_class=None)[source]#

Defines a configuration class from which an Impala Algorithm can be built.

from ray.rllib.algorithms.impala import ImpalaConfig
config = ImpalaConfig()
config = config.training(lr=0.0003, train_batch_size=512)
config = config.resources(num_gpus=0)
config = config.rollouts(num_rollout_workers=1)
# Build an Algorithm object from the config and run 1 training iteration.
algo = config.build(env="CartPole-v1")
algo.train()
del algo
from ray.rllib.algorithms.impala import ImpalaConfig
from ray import air
from ray import tune
config = ImpalaConfig()

# Update the config object.
config = config.training(
    lr=tune.grid_search([0.0001, 0.0002]), grad_clip=20.0
)
config = config.resources(num_gpus=0)
config = config.rollouts(num_rollout_workers=1)
# Set the config object's env.
config = config.environment(env="CartPole-v1")
# Run with tune.
tune.Tuner(
    "IMPALA",
    param_space=config,
    run_config=air.RunConfig(stop={"training_iteration": 1}),
).fit()
training(*, vtrace: bool | None = <ray.rllib.utils.from_config._NotProvided object>, vtrace_clip_rho_threshold: float | None = <ray.rllib.utils.from_config._NotProvided object>, vtrace_clip_pg_rho_threshold: float | None = <ray.rllib.utils.from_config._NotProvided object>, gamma: float | None = <ray.rllib.utils.from_config._NotProvided object>, num_multi_gpu_tower_stacks: int | None = <ray.rllib.utils.from_config._NotProvided object>, minibatch_buffer_size: int | None = <ray.rllib.utils.from_config._NotProvided object>, minibatch_size: int | str | None = <ray.rllib.utils.from_config._NotProvided object>, num_sgd_iter: int | None = <ray.rllib.utils.from_config._NotProvided object>, replay_proportion: float | None = <ray.rllib.utils.from_config._NotProvided object>, replay_buffer_num_slots: int | None = <ray.rllib.utils.from_config._NotProvided object>, learner_queue_size: int | None = <ray.rllib.utils.from_config._NotProvided object>, learner_queue_timeout: float | None = <ray.rllib.utils.from_config._NotProvided object>, max_requests_in_flight_per_aggregator_worker: int | None = <ray.rllib.utils.from_config._NotProvided object>, timeout_s_sampler_manager: float | None = <ray.rllib.utils.from_config._NotProvided object>, timeout_s_aggregator_manager: float | None = <ray.rllib.utils.from_config._NotProvided object>, broadcast_interval: int | None = <ray.rllib.utils.from_config._NotProvided object>, num_aggregation_workers: int | None = <ray.rllib.utils.from_config._NotProvided object>, grad_clip: float | None = <ray.rllib.utils.from_config._NotProvided object>, opt_type: str | None = <ray.rllib.utils.from_config._NotProvided object>, lr_schedule: ~typing.List[~typing.List[int | float]] | None = <ray.rllib.utils.from_config._NotProvided object>, decay: float | None = <ray.rllib.utils.from_config._NotProvided object>, momentum: float | None = <ray.rllib.utils.from_config._NotProvided object>, epsilon: float | None = <ray.rllib.utils.from_config._NotProvided object>, vf_loss_coeff: float | None = <ray.rllib.utils.from_config._NotProvided object>, entropy_coeff: float | None = <ray.rllib.utils.from_config._NotProvided object>, entropy_coeff_schedule: ~typing.List[~typing.List[int | float]] | None = <ray.rllib.utils.from_config._NotProvided object>, _separate_vf_optimizer: bool | None = <ray.rllib.utils.from_config._NotProvided object>, _lr_vf: float | None = <ray.rllib.utils.from_config._NotProvided object>, after_train_step: ~typing.Callable[[dict], None] | None = <ray.rllib.utils.from_config._NotProvided object>, **kwargs) ImpalaConfig[source]#

Sets the training related configuration.

Parameters:
  • vtrace – V-trace params (see vtrace_tf/torch.py).

  • vtrace_clip_rho_threshold

  • vtrace_clip_pg_rho_threshold

  • gamma – Float specifying the discount factor of the Markov Decision process.

  • num_multi_gpu_tower_stacks – For each stack of multi-GPU towers, how many slots should we reserve for parallel data loading? Set this to >1 to load data into GPUs in parallel. This will increase GPU memory usage proportionally with the number of stacks. Example: 2 GPUs and num_multi_gpu_tower_stacks=3: - One tower stack consists of 2 GPUs, each with a copy of the model/graph. - Each of the stacks will create 3 slots for batch data on each of its GPUs, increasing memory requirements on each GPU by 3x. - This enables us to preload data into these stacks while another stack is performing gradient calculations.

  • minibatch_buffer_size – How many train batches should be retained for minibatching. This config only has an effect if num_sgd_iter > 1.

  • minibatch_size – The size of minibatches that are trained over during each SGD iteration. If “auto”, will use the same value as train_batch_size. Note that this setting only has an effect if _enable_new_api_stack=True and it must be a multiple of rollout_fragment_length or sequence_length and smaller than or equal to train_batch_size.

  • num_sgd_iter – Number of passes to make over each train batch.

  • replay_proportion – Set >0 to enable experience replay. Saved samples will be replayed with a p:1 proportion to new data samples.

  • replay_buffer_num_slots – Number of sample batches to store for replay. The number of transitions saved total will be (replay_buffer_num_slots * rollout_fragment_length).

  • learner_queue_size – Max queue size for train batches feeding into the learner.

  • learner_queue_timeout – Wait for train batches to be available in minibatch buffer queue this many seconds. This may need to be increased e.g. when training with a slow environment.

  • max_requests_in_flight_per_aggregator_worker – Level of queuing for replay aggregator operations (if using aggregator workers).

  • timeout_s_sampler_manager – The timeout for waiting for sampling results for workers – typically if this is too low, the manager won’t be able to retrieve ready sampling results.

  • timeout_s_aggregator_manager – The timeout for waiting for replay worker results – typically if this is too low, the manager won’t be able to retrieve ready replay requests.

  • broadcast_interval – Number of training step calls before weights are broadcast to rollout workers that are sampled during any iteration.

  • num_aggregation_workers – Use n (num_aggregation_workers) extra Actors for multi-level aggregation of the data produced by the m RolloutWorkers (num_workers). Note that n should be much smaller than m. This can make sense if ingesting >2GB/s of samples, or if the data requires decompression.

  • grad_clip – If specified, clip the global norm of gradients by this amount.

  • opt_type – Either “adam” or “rmsprop”.

  • lr_schedule – Learning rate schedule. In the format of [[timestep, lr-value], [timestep, lr-value], …] Intermediary timesteps will be assigned to interpolated learning rate values. A schedule should normally start from timestep 0.

  • decay – Decay setting for the RMSProp optimizer, in case opt_type=rmsprop.

  • momentum – Momentum setting for the RMSProp optimizer, in case opt_type=rmsprop.

  • epsilon – Epsilon setting for the RMSProp optimizer, in case opt_type=rmsprop.

  • vf_loss_coeff – Coefficient for the value function term in the loss function.

  • entropy_coeff – Coefficient for the entropy regularizer term in the loss function.

  • entropy_coeff_schedule – Decay schedule for the entropy regularizer.

  • _separate_vf_optimizer – Set this to true to have two separate optimizers optimize the policy and value networks. Only supported for some algorithms (APPO, IMPALA) on the old API stack.

  • _lr_vf – If _separate_vf_optimizer is True, define separate learning rate for the value network.

  • after_train_step – Callback for APPO to use to update KL, target network periodically. The input to the callback is the learner fetches dict.

Returns:

This updated AlgorithmConfig object.

Model-free Off-policy RL#

Deep Q Networks (DQN, Rainbow, Parametric DQN)#

pytorch tensorflow [paper] [implementation] DQN can be scaled by increasing the number of workers or using Ape-X. Memory usage is reduced by compressing samples in the replay buffer with LZ4. All of the DQN improvements evaluated in Rainbow are available, though not all are enabled by default. See also how to use parametric-actions in DQN.

../_images/dqn-arch.svg

DQN architecture#

Tuned examples: PongDeterministic-v4, Rainbow configuration, {BeamRider,Breakout,Qbert,SpaceInvaders}NoFrameskip-v4, with Dueling and Double-Q, with Distributional DQN.

Tip

Consider using Ape-X for faster training with similar timestep efficiency.

Hint

For a complete rainbow setup, make the following changes to the default DQN config: "n_step": [between 1 and 10], "noisy": True, "num_atoms": [more than 1], "v_min": -10.0, "v_max": 10.0 (set v_min and v_max according to your expected range of returns).
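
Expressed as code, the hint above could look roughly like the following sketch; the concrete choices (n_step=3, num_atoms=51, and CartPole-v1 as the environment) are illustrative, not tuned settings:

from ray.rllib.algorithms.dqn.dqn import DQNConfig

# Approximate a Rainbow setup by switching on the relevant DQN options.
config = (
    DQNConfig()
    .environment("CartPole-v1")
    .training(
        n_step=3,        # multi-step returns (choose between 1 and 10)
        noisy=True,      # noisy nets instead of epsilon-greedy exploration
        num_atoms=51,    # distributional Q-learning (more than 1 atom)
        v_min=-10.0,     # set v_min/v_max to your expected range of returns
        v_max=10.0,
    )
)
algo = config.build()
algo.train()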

Atari results @10M steps: more details

| Atari env | RLlib DQN | RLlib Dueling DDQN | RLlib Dist. DQN | Hessel et al. DQN |
| --- | --- | --- | --- | --- |
| BeamRider | 2869 | 1910 | 4447 | ~2000 |
| Breakout | 287 | 312 | 410 | ~150 |
| Qbert | 3921 | 7968 | 15780 | ~4000 |
| SpaceInvaders | 650 | 1001 | 1025 | ~500 |

DQN-specific configs (see also common configs):

class ray.rllib.algorithms.dqn.dqn.DQNConfig(algo_class=None)[source]#

Defines a configuration class from which a DQN Algorithm can be built.

from ray.rllib.algorithms.dqn.dqn import DQN, DQNConfig
config = DQNConfig()

replay_config = {
        "type": "MultiAgentPrioritizedReplayBuffer",
        "capacity": 60000,
        "prioritized_replay_alpha": 0.5,
        "prioritized_replay_beta": 0.5,
        "prioritized_replay_eps": 3e-6,
    }

config = config.training(replay_buffer_config=replay_config)
config = config.resources(num_gpus=0)
config = config.rollouts(num_rollout_workers=1)
config = config.environment("CartPole-v1")
algo = DQN(config=config)
algo.train()
del algo
from ray.rllib.algorithms.dqn.dqn import DQNConfig
from ray import air
from ray import tune
config = DQNConfig()
config = config.training(
    num_atoms=tune.grid_search([1,]))
config = config.environment(env="CartPole-v1")
tune.Tuner(
    "DQN",
    run_config=air.RunConfig(stop={"training_iteration":1}),
    param_space=config.to_dict()
).fit()
training(*, target_network_update_freq: int | None = <ray.rllib.utils.from_config._NotProvided object>, replay_buffer_config: dict | None = <ray.rllib.utils.from_config._NotProvided object>, store_buffer_in_checkpoints: bool | None = <ray.rllib.utils.from_config._NotProvided object>, lr_schedule: ~typing.List[~typing.List[int | float]] | None = <ray.rllib.utils.from_config._NotProvided object>, adam_epsilon: float | None = <ray.rllib.utils.from_config._NotProvided object>, grad_clip: int | None = <ray.rllib.utils.from_config._NotProvided object>, num_steps_sampled_before_learning_starts: int | None = <ray.rllib.utils.from_config._NotProvided object>, tau: float | None = <ray.rllib.utils.from_config._NotProvided object>, num_atoms: int | None = <ray.rllib.utils.from_config._NotProvided object>, v_min: float | None = <ray.rllib.utils.from_config._NotProvided object>, v_max: float | None = <ray.rllib.utils.from_config._NotProvided object>, noisy: bool | None = <ray.rllib.utils.from_config._NotProvided object>, sigma0: float | None = <ray.rllib.utils.from_config._NotProvided object>, dueling: bool | None = <ray.rllib.utils.from_config._NotProvided object>, hiddens: int | None = <ray.rllib.utils.from_config._NotProvided object>, double_q: bool | None = <ray.rllib.utils.from_config._NotProvided object>, n_step: int | None = <ray.rllib.utils.from_config._NotProvided object>, before_learn_on_batch: ~typing.Callable[[~typing.Type[~ray.rllib.policy.sample_batch.MultiAgentBatch], ~typing.List[~typing.Type[~ray.rllib.policy.policy.Policy]], ~typing.Type[int]], ~typing.Type[~ray.rllib.policy.sample_batch.MultiAgentBatch]] = <ray.rllib.utils.from_config._NotProvided object>, training_intensity: float | None = <ray.rllib.utils.from_config._NotProvided object>, td_error_loss_fn: str | None = <ray.rllib.utils.from_config._NotProvided object>, categorical_distribution_temperature: float | None = <ray.rllib.utils.from_config._NotProvided object>, **kwargs) DQNConfig[source]#

Sets the training related configuration.

Parameters:
  • target_network_update_freq – Update the target network every target_network_update_freq sample steps.

  • replay_buffer_config – Replay buffer config. Examples: {"_enable_replay_buffer_api": True, "type": "MultiAgentReplayBuffer", "capacity": 50000, "replay_sequence_length": 1} - OR - {"_enable_replay_buffer_api": True, "type": "MultiAgentPrioritizedReplayBuffer", "capacity": 50000, "prioritized_replay_alpha": 0.6, "prioritized_replay_beta": 0.4, "prioritized_replay_eps": 1e-6, "replay_sequence_length": 1}. Where: prioritized_replay_alpha controls the degree of prioritization in the buffer, i.e., when a buffer sample has a higher temporal-difference error, with how much more probability it should be drawn and used to update the parametrized Q-network; 0.0 corresponds to uniform probability, and setting it much above 1.0 may quickly make the sampling distribution heavily "pointy" with low entropy. prioritized_replay_beta controls the degree of importance sampling, which suppresses the influence of gradient updates from samples that have a higher probability of being sampled via the alpha parameter and the temporal-difference error. prioritized_replay_eps sets the baseline probability for sampling, so that when the temporal-difference error of a sample is zero, there is still a chance of drawing the sample.

  • store_buffer_in_checkpoints – Set this to True, if you want the contents of your buffer(s) to be stored in any saved checkpoints as well. Warnings will be created if: - This is True AND restoring from a checkpoint that contains no buffer data. - This is False AND restoring from a checkpoint that does contain buffer data.

  • lr_schedule – Learning rate schedule. In the format of [[timestep, value], [timestep, value], …]. A schedule should normally start from timestep 0.

  • adam_epsilon – Adam optimizer’s epsilon hyper parameter.

  • grad_clip – If not None, clip gradients during optimization at this value.

  • num_steps_sampled_before_learning_starts – Number of timesteps to collect from rollout workers before we start sampling from replay buffers for learning. Whether we count this in agent steps or environment steps depends on config.multi_agent(count_steps_by=..).

  • tau – Update the target by tau * policy + (1 - tau) * target_policy.

  • num_atoms – Number of atoms for representing the distribution of return. When this is greater than 1, distributional Q-learning is used.

  • v_min – Minimum value estimation

  • v_max – Maximum value estimation

  • noisy – Whether to use noisy network to aid exploration. This adds parametric noise to the model weights.

  • sigma0 – Control the initial parameter noise for noisy nets.

  • dueling – Whether to use dueling DQN.

  • hiddens – Dense-layer setup for each of the advantage branch and the value branch.

  • double_q – Whether to use double DQN.

  • n_step – N-step for Q-learning.

  • before_learn_on_batch – Callback to run before learning on a multi-agent batch of experiences.

  • training_intensity – The intensity with which to update the model (vs collecting samples from the env). If None, uses the "natural" value of: train_batch_size / (rollout_fragment_length x num_workers x num_envs_per_worker). If not None, will make sure that the ratio between timesteps inserted into and sampled from the buffer matches the given value. Example: training_intensity=1000.0, train_batch_size=250, rollout_fragment_length=1, num_workers=1 (or 0), num_envs_per_worker=1 -> natural value = 250 / 1 = 250.0 -> will make sure that the replay+train op is executed 4x as often as the rollout+insert op (4 * 250 = 1000). See rllib/algorithms/dqn/dqn.py::calculate_rr_weights for further details.

  • td_error_loss_fn – "huber" or "mse". The loss function for calculating TD error when num_atoms is 1. Note that if num_atoms is > 1, this parameter is simply ignored, and softmax cross-entropy loss will be used.

  • categorical_distribution_temperature – Set the temperature parameter used by Categorical action distribution. A valid temperature is in the range of [0, 1]. Note that this mostly affects evaluation since TD error uses argmax for return calculation.

Returns:

This updated AlgorithmConfig object.

Soft Actor Critic (SAC)#

pytorch tensorflow [original paper], [follow up paper], [discrete actions paper] [implementation]

../_images/dqn-arch.svg

SAC architecture (same as DQN)#

RLlib's soft actor-critic implementation is ported from the official SAC repo to better integrate with RLlib APIs. Note that SAC has two fields for configuring custom models, policy_model_config and q_model_config; the model field of the config is ignored.
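
For example, custom model settings could be passed as in the following sketch; the fcnet_hiddens sizes are arbitrary illustrative values:

from ray.rllib.algorithms.sac.sac import SACConfig

config = (
    SACConfig()
    .environment("Pendulum-v1")
    .training(
        # Model options for the Q-network(s); these override MODEL_DEFAULTS.
        q_model_config={"fcnet_hiddens": [256, 256]},
        # Model options for the policy network.
        policy_model_config={"fcnet_hiddens": [256, 256]},
    )
)
algo = config.build()
algo.train()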

Tuned examples (continuous actions): Pendulum-v1, HalfCheetah-v3, Tuned examples (discrete actions): CartPole-v1

MuJoCo results @3M steps: more details

| MuJoCo env | RLlib SAC | Haarnoja et al SAC |
| --- | --- | --- |
| HalfCheetah | 13000 | ~15000 |

SAC-specific configs (see also common configs):

class ray.rllib.algorithms.sac.sac.SACConfig(algo_class=None)[source]#

Defines a configuration class from which an SAC Algorithm can be built.

from ray.rllib.algorithms.sac.sac import SACConfig

config = SACConfig().training(gamma=0.9, lr=0.01, train_batch_size=32)
config = config.resources(num_gpus=0)
config = config.rollouts(num_rollout_workers=1)

# Build an Algorithm object from the config and run 1 training iteration.
algo = config.build(env="CartPole-v1")
algo.train()
training(*, twin_q: bool | None = <ray.rllib.utils.from_config._NotProvided object>, q_model_config: ~typing.Dict[str, ~typing.Any] | None = <ray.rllib.utils.from_config._NotProvided object>, policy_model_config: ~typing.Dict[str, ~typing.Any] | None = <ray.rllib.utils.from_config._NotProvided object>, tau: float | None = <ray.rllib.utils.from_config._NotProvided object>, initial_alpha: float | None = <ray.rllib.utils.from_config._NotProvided object>, target_entropy: str | float | None = <ray.rllib.utils.from_config._NotProvided object>, n_step: int | None = <ray.rllib.utils.from_config._NotProvided object>, store_buffer_in_checkpoints: bool | None = <ray.rllib.utils.from_config._NotProvided object>, replay_buffer_config: ~typing.Dict[str, ~typing.Any] | None = <ray.rllib.utils.from_config._NotProvided object>, training_intensity: float | None = <ray.rllib.utils.from_config._NotProvided object>, clip_actions: bool | None = <ray.rllib.utils.from_config._NotProvided object>, grad_clip: float | None = <ray.rllib.utils.from_config._NotProvided object>, optimization_config: ~typing.Dict[str, ~typing.Any] | None = <ray.rllib.utils.from_config._NotProvided object>, target_network_update_freq: int | None = <ray.rllib.utils.from_config._NotProvided object>, _deterministic_loss: bool | None = <ray.rllib.utils.from_config._NotProvided object>, _use_beta_distribution: bool | None = <ray.rllib.utils.from_config._NotProvided object>, num_steps_sampled_before_learning_starts: int | None = <ray.rllib.utils.from_config._NotProvided object>, **kwargs) SACConfig[source]#

Sets the training related configuration.

Parameters:
  • twin_q – Use two Q-networks (instead of one) for action-value estimation. Note: Each Q-network will have its own target network.

  • q_model_config – Model configs for the Q network(s). These will override MODEL_DEFAULTS. This is treated just as the top-level model dict in setting up the Q-network(s) (2 if twin_q=True). That means, you can do for different observation spaces: obs=Box(1D) -> Tuple(Box(1D) + Action) -> concat -> post_fcnet; obs=Box(3D) -> Tuple(Box(3D) + Action) -> vision-net -> concat w/ action -> post_fcnet; obs=Tuple(Box(1D), Box(3D)) -> Tuple(Box(1D), Box(3D), Action) -> vision-net -> concat w/ Box(1D) and action -> post_fcnet. You can also have SAC use your custom_model as Q-model(s) by simply specifying the custom_model sub-key in the dict below (just like you would do in the top-level model dict).

  • policy_model_config – Model options for the policy function (see q_model_config above for details). The difference to q_model_config above is that no action concat’ing is performed before the post_fcnet stack.

  • tau – Update the target by tau * policy + (1 - tau) * target_policy.

  • initial_alpha – Initial value to use for the entropy weight alpha.

  • target_entropy – Target entropy lower bound. If “auto”, will be set to -|A| (e.g. -2.0 for Discrete(2), -3.0 for Box(shape=(3,))). This is the inverse of reward scale, and will be optimized automatically.

  • n_step – N-step target updates. If >1, (s, a, r, s') tuples in trajectories will be postprocessed to become (s, a, [discounted sum of n rewards], s_{t+n}) tuples. An integer will be interpreted as a fixed n-step value. In case of a tuple, the n-step value will be drawn for each sample in the train batch from a uniform distribution over the interval defined by the 'n-step' tuple.

  • store_buffer_in_checkpoints – Set this to True, if you want the contents of your buffer(s) to be stored in any saved checkpoints as well. Warnings will be created if: - This is True AND restoring from a checkpoint that contains no buffer data. - This is False AND restoring from a checkpoint that does contain buffer data.

  • replay_buffer_config – Replay buffer config. Examples: {"_enable_replay_buffer_api": True, "type": "MultiAgentReplayBuffer", "capacity": 50000, "replay_batch_size": 32, "replay_sequence_length": 1} - OR - {"_enable_replay_buffer_api": True, "type": "MultiAgentPrioritizedReplayBuffer", "capacity": 50000, "prioritized_replay_alpha": 0.6, "prioritized_replay_beta": 0.4, "prioritized_replay_eps": 1e-6, "replay_sequence_length": 1}. Where: prioritized_replay_alpha controls the degree of prioritization in the buffer, i.e., when a buffer sample has a higher temporal-difference error, with how much more probability it should be drawn and used to update the parametrized Q-network; 0.0 corresponds to uniform probability, and setting it much above 1.0 may quickly make the sampling distribution heavily "pointy" with low entropy. prioritized_replay_beta controls the degree of importance sampling, which suppresses the influence of gradient updates from samples that have a higher probability of being sampled via the alpha parameter and the temporal-difference error. prioritized_replay_eps sets the baseline probability for sampling, so that when the temporal-difference error of a sample is zero, there is still a chance of drawing the sample.

  • training_intensity – The intensity with which to update the model (vs collecting samples from the env). If None, uses the "natural" value of: train_batch_size / (rollout_fragment_length x num_workers x num_envs_per_worker). If not None, will make sure that the ratio between timesteps inserted into and sampled from the buffer matches the given value. Example: training_intensity=1000.0, train_batch_size=250, rollout_fragment_length=1, num_workers=1 (or 0), num_envs_per_worker=1 -> natural value = 250 / 1 = 250.0 -> will make sure that the replay+train op is executed 4x as often as the rollout+insert op (4 * 250 = 1000). See rllib/algorithms/dqn/dqn.py::calculate_rr_weights for further details.

  • clip_actions – Whether to clip actions. If actions are already normalized, this should be set to False.

  • grad_clip – If not None, clip gradients during optimization at this value.

  • optimization_config – Config dict for optimization. Set the supported keys actor_learning_rate, critic_learning_rate, and entropy_learning_rate in here.

  • target_network_update_freq – Update the target network every target_network_update_freq steps.

  • _deterministic_loss – Whether the loss should be calculated deterministically (w/o the stochastic action sampling step). True only useful for continuous actions and for debugging.

  • _use_beta_distribution – Use a Beta-distribution instead of a SquashedGaussian for bounded, continuous action spaces (not recommended; for debugging only).

Returns:

This updated AlgorithmConfig object.

Model-based RL#

DreamerV3#

tensorflow [paper] [implementation]

DreamerV3 trains a world model in a supervised fashion using real environment interactions. The world model's objective is to correctly predict all aspects of the transition dynamics of the RL environment, which includes predicting the correct next observations, the received rewards, and a boolean episode-continuation flag. A "recurrent state space model" (RSSM) is used to alternate between training the world model (on actual env data) and training the critic and actor networks, both of which learn from "dreamed" trajectories produced by the world model.

DreamerV3 can be used in all types of environments, including those with image- or vector-based observations, continuous or discrete actions, and sparse or dense reward functions.

Tuned examples: Atari 100k, Atari 200M, DeepMind Control Suite
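
A minimal configuration sketch; the model_size and training_ratio training arguments are assumed here based on the tuned examples and are not documented on this page, so treat them as illustrative:

from ray.rllib.algorithms.dreamerv3 import DreamerV3Config

# Small ("XS") world model on CartPole; model_size and training_ratio are
# assumed DreamerV3Config.training() arguments (see the tuned examples).
config = (
    DreamerV3Config()
    .environment("CartPole-v1")
    .training(model_size="XS", training_ratio=1024)
)
algo = config.build()
algo.train()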

Pong-v5 results (1, 2, and 4 GPUs):

../_images/pong_1_2_and_4gpus.svg

Episode mean rewards for the Pong-v5 environment (with the “100k” setting, in which only 100k environment steps are allowed): Note that despite the stable sample efficiency - shown by the constant learning performance per env step - the wall time improves almost linearly as we go from 1 to 4 GPUs. Left: Episode reward over environment timesteps sampled. Right: Episode reward over wall-time.#

Atari 100k results (1 vs 4 GPUs):

../_images/atari100k_1_vs_4gpus.svg

Episode mean rewards for various Atari 100k tasks on 1 vs 4 GPUs. Left: Episode reward over environment timesteps sampled. Right: Episode reward over wall-time.#

DeepMind Control Suite (vision) results (1 vs 4 GPUs):

../_images/dmc_1_vs_4gpus.svg

Episode mean rewards for various DeepMind Control Suite (vision) tasks on 1 vs 4 GPUs. Left: Episode reward over environment timesteps sampled. Right: Episode reward over wall-time.#

Multi-agent#

Parameter Sharing#

[paper], [paper] and [instructions]. Parameter sharing refers to a class of methods that take a base single-agent method and use it to learn a single policy for all agents. This simple approach has been shown to achieve state-of-the-art performance in cooperative games, and is usually how you should start when tackling a multi-agent problem.

Tuned examples: PettingZoo, waterworld, rock-paper-scissors, multi-agent cartpole
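
A minimal sketch of parameter sharing with RLlib's multi-agent API, mapping every agent to one shared policy; the environment name is a hypothetical placeholder:

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    # Any registered multi-agent env; "my_multi_agent_env" is a placeholder.
    .environment("my_multi_agent_env")
    .multi_agent(
        # A single policy that all agents act with and train.
        policies={"shared_policy"},
        policy_mapping_fn=lambda agent_id, episode, worker, **kwargs: "shared_policy",
    )
)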

Shared Critic Methods#

[instructions] Shared critic methods are those in which all agents use a single, parameter-shared critic network (in some cases with access to more of the observation space than individual agents can see). Note that many specialized multi-agent algorithms such as MADDPG are mostly shared-critic forms of their single-agent counterpart (DDPG in the case of MADDPG).

Tuned examples: TwoStepGame

Fully Independent Learning#

[instructions] Fully independent learning involves a collection of agents learning independently of each other via single-agent methods. This typically works, but can be less effective than dedicated multi-agent RL methods, since independent learners do not account for the non-stationarity of the multi-agent environment.

Tuned examples: waterworld, multiagent-cartpole
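
For contrast with parameter sharing above, a minimal sketch of fully independent learning assigns each agent its own policy; the environment name and agent IDs are hypothetical placeholders:

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    # Placeholder multi-agent env with agent IDs "agent_0" and "agent_1".
    .environment("my_multi_agent_env")
    .multi_agent(
        # One independent policy per agent; no weights are shared.
        policies={"policy_0", "policy_1"},
        policy_mapping_fn=(
            lambda agent_id, episode, worker, **kwargs:
                "policy_0" if agent_id == "agent_0" else "policy_1"
        ),
    )
)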