.. include:: /_includes/rllib/we_are_hiring.rst .. _rllib-algorithms-doc: Algorithms ========== .. include:: /_includes/rllib/new_api_stack.rst The following table is an overview of all available algorithms in RLlib. Note that all algorithms support multi-GPU training on a single (GPU) node in `Ray (open-source) `__ (|multi_gpu|) as well as multi-GPU training on multi-node (GPU) clusters when using the `Anyscale platform `__ (|multi_node_multi_gpu|). +-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+ | **Algorithm** | **Single- and Multi-agent** | **Multi-GPU (multi-node)** | **Action Spaces** | +-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+ | **On-Policy** | +-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+ | :ref:`PPO (Proximal Policy Optimization) ` | |single_agent| |multi_agent| | |multi_gpu| |multi_node_multi_gpu| | |cont_actions| |discr_actions| | +-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+ | **Off-Policy** | +-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+ | :ref:`DQN/Rainbow (Deep Q Networks) ` | |single_agent| |multi_agent| | |multi_gpu| |multi_node_multi_gpu| | |discr_actions| | +-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+ | :ref:`SAC (Soft Actor Critic) ` | |single_agent| |multi_agent| | |multi_gpu| |multi_node_multi_gpu| | |cont_actions| | +-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+ | **High-throughput on- and off-policy** | +-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+ | :ref:`APPO (Asynchronous Proximal Policy Optimization) ` | |single_agent| |multi_agent| | |multi_gpu| |multi_node_multi_gpu| | |cont_actions| |discr_actions| | +-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+ | :ref:`IMPALA (Importance Weighted Actor-Learner Architecture) ` | |single_agent| |multi_agent| | |multi_gpu| |multi_node_multi_gpu| | |discr_actions| | +-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+ | **Model-based RL** | +-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+ | :ref:`DreamerV3 ` | |single_agent| | |multi_gpu| |multi_node_multi_gpu| | |cont_actions| |discr_actions| | +-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+ | **Offline RL and Imitation Learning** | +-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+ | :ref:`BC (Behavior Cloning) ` | |single_agent| | |multi_gpu| |multi_node_multi_gpu| | |cont_actions| |discr_actions| | +-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+ | :ref:`MARWIL (Monotonic Advantage Re-Weighted Imitation Learning) ` | |single_agent| | |multi_gpu| |multi_node_multi_gpu| | |cont_actions| |discr_actions| | +-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+ | **Algorithm Extensions and Plugins** | +-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+ | :ref:`Curiosity-driven Exploration by Self-supervised Prediction ` | |single_agent| | |multi_gpu| |multi_node_multi_gpu| | |cont_actions| |discr_actions| | +-----------------------------------------------------------------------------+------------------------------+------------------------------------+--------------------------------+
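All of the algorithms above share the same configuration workflow: create the algorithm's ``AlgorithmConfig`` subclass, chain the settings you need, then build and train the algorithm. The following minimal sketch shows this pattern with ``PPOConfig`` on CartPole; the hyperparameter values and worker counts are illustrative only, and exact config fields can vary slightly between RLlib versions.

.. code-block:: python

    from ray.rllib.algorithms.ppo import PPOConfig

    # Pick the environment, adjust training settings, and scale out
    # sample collection. All algorithm config classes follow this pattern.
    config = (
        PPOConfig()
        .environment("CartPole-v1")
        .training(lr=0.0003, gamma=0.99)
        .env_runners(num_env_runners=2)  # parallel sample collection
    )

    # Build the Algorithm instance and run a few training iterations.
    # Each `train()` call returns a dict with the latest results.
    algo = config.build()
    for _ in range(3):
        results = algo.train()

    algo.stop()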
On-policy ~~~~~~~~~ .. _ppo: Proximal Policy Optimization (PPO) ---------------------------------- `[paper] `__ `[implementation] `__ .. figure:: images/algos/ppo-architecture.svg :width: 750 :align: left **PPO architecture:** In a training iteration, PPO performs three major steps: 1. Sampling a set of episodes or episode fragments. 2. Converting these into a train batch and updating the model using a clipped objective and multiple SGD passes over this batch. 3. Syncing the weights from the Learners back to the EnvRunners. PPO scales out on both axes, supporting multiple EnvRunners for sample collection and multiple GPU- or CPU-based Learners for updating the model. **Tuned examples:** `Pong-v5 `__, `CartPole-v1 `__, `Pendulum-v1 `__. **PPO-specific configs** (see also :ref:`generic algorithm settings `): .. autoclass:: ray.rllib.algorithms.ppo.ppo.PPOConfig :members: training Off-Policy ~~~~~~~~~~ .. _dqn: Deep Q Networks (DQN, Rainbow, Parametric DQN) ---------------------------------------------- `[paper] `__ `[implementation] `__ .. figure:: images/algos/dqn-architecture.svg :width: 650 :align: left **DQN architecture:** DQN uses a replay buffer to temporarily store episode samples that RLlib collects from the environment. Throughout different training iterations, these episodes and episode fragments are re-sampled from the buffer and re-used for updating the model, before eventually being discarded when the buffer has reached capacity and new samples keep coming in (FIFO). This reuse of training data makes DQN very sample-efficient and off-policy. DQN scales out on both axes, supporting multiple EnvRunners for sample collection and multiple GPU- or CPU-based Learners for updating the model. All of the DQN improvements evaluated in `Rainbow `__ are available, though not all are enabled by default. See also how to use `parametric-actions in DQN `__.
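As a concrete illustration of DQN's off-policy setup, the following sketch enables the Rainbow-style options that the hint below describes: n-step returns, noisy exploration, and a distributional Q-head. The values shown are examples rather than tuned settings, and the exact keyword names of ``DQNConfig.training()`` may differ between RLlib versions.

.. code-block:: python

    from ray.rllib.algorithms.dqn import DQNConfig

    # Rainbow-style DQN: n-step returns, noisy nets for exploration,
    # double Q-learning, dueling heads, and a distributional Q-function
    # with `num_atoms` support points in the range [v_min, v_max].
    config = (
        DQNConfig()
        .environment("CartPole-v1")
        .training(
            n_step=3,
            noisy=True,
            double_q=True,
            dueling=True,
            num_atoms=51,
            v_min=-10.0,
            v_max=10.0,
        )
    )

    algo = config.build()
    print(algo.train())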
**Tuned examples:** `PongDeterministic-v4 `__, `Rainbow configuration `__, `{BeamRider,Breakout,Qbert,SpaceInvaders}NoFrameskip-v4 `__, `with Dueling and Double-Q `__, `with Distributional DQN `__. .. hint:: For a complete `rainbow `__ setup, make the following changes to the default DQN config: ``"n_step": [between 1 and 10], "noisy": True, "num_atoms": [more than 1], "v_min": -10.0, "v_max": 10.0`` (set ``v_min`` and ``v_max`` according to your expected range of returns). **DQN-specific configs** (see also :ref:`generic algorithm settings `): .. autoclass:: ray.rllib.algorithms.dqn.dqn.DQNConfig :members: training .. _sac: Soft Actor Critic (SAC) ------------------------ `[original paper] `__, `[follow up paper] `__, `[implementation] `__. .. figure:: images/algos/sac-architecture.svg :width: 750 :align: left **SAC architecture:** SAC uses a replay buffer to temporarily store episode samples that RLlib collects from the environment. Throughout different training iterations, these episodes and episode fragments are re-sampled from the buffer and re-used for updating the model, before eventually being discarded when the buffer has reached capacity and new samples keep coming in (FIFO). This reuse of training data makes SAC very sample-efficient and off-policy. SAC scales out on both axes, supporting multiple EnvRunners for sample collection and multiple GPU- or CPU-based Learners for updating the model. **Tuned examples:** `Pendulum-v1 `__, `HalfCheetah-v3 `__. **SAC-specific configs** (see also :ref:`generic algorithm settings `): .. autoclass:: ray.rllib.algorithms.sac.sac.SACConfig :members: training High-Throughput On- and Off-Policy ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. _appo: Asynchronous Proximal Policy Optimization (APPO) ------------------------------------------------ .. tip:: APPO was originally `published under the name "IMPACT" `__. RLlib's APPO exactly matches the algorithm described in the paper. `[paper] `__ `[implementation] `__ .. figure:: images/algos/appo-architecture.svg :width: 750 :align: left **APPO architecture:** APPO is an asynchronous variant of :ref:`Proximal Policy Optimization (PPO) ` based on the IMPALA architecture, but using a surrogate policy loss with clipping, allowing for multiple SGD passes per collected train batch. In a training iteration, APPO requests samples from all EnvRunners asynchronously, and the collected episode samples are returned to the main algorithm process as Ray references rather than actual objects available on the local process. APPO then passes these episode references to the Learners for asynchronous updates of the model. RLlib doesn't always sync back the weights to the EnvRunners right after a new model version is available. To account for the EnvRunners being off-policy, APPO uses a procedure called v-trace, `described in the IMPALA paper `__. APPO scales out on both axes, supporting multiple EnvRunners for sample collection and multiple GPU- or CPU-based Learners for updating the model. **Tuned examples:** `Pong-v5 `__, `HalfCheetah-v4 `__. **APPO-specific configs** (see also :ref:`generic algorithm settings `): .. autoclass:: ray.rllib.algorithms.appo.appo.APPOConfig :members: training
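To illustrate APPO's scale-out along both axes, the following sketch configures several EnvRunners and more than one Learner. The ``env_runners`` and ``learners`` settings are the generic scaling options of ``AlgorithmConfig``; the concrete worker, learner, and GPU counts are placeholders to adapt to your cluster.

.. code-block:: python

    from ray.rllib.algorithms.appo import APPOConfig

    config = (
        APPOConfig()
        .environment("CartPole-v1")  # stand-in for a real workload such as Pong
        # Scale the sampling axis: EnvRunners collect episodes asynchronously.
        .env_runners(num_env_runners=8)
        # Scale the learning axis: one or more Learner processes update the model.
        .learners(
            num_learners=2,
            num_gpus_per_learner=0,  # set to 1 to give each Learner its own GPU
        )
    )

    algo = config.build()
    print(algo.train())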
.. _impala: Importance Weighted Actor-Learner Architecture (IMPALA) ------------------------------------------------------- `[paper] `__ `[implementation] `__ .. figure:: images/algos/impala-architecture.svg :width: 750 :align: left **IMPALA architecture:** In a training iteration, IMPALA requests samples from all EnvRunners asynchronously, and the collected episodes are returned to the main algorithm process as Ray references rather than actual objects available on the local process. IMPALA then passes these episode references to the Learners for asynchronous updates of the model. RLlib doesn't always sync back the weights to the EnvRunners right after a new model version is available. To account for the EnvRunners being off-policy, IMPALA uses a procedure called v-trace, `described in the paper `__. IMPALA scales out on both axes, supporting multiple EnvRunners for sample collection and multiple GPU- or CPU-based Learners for updating the model. **Tuned examples:** `PongNoFrameskip-v4 `__, `vectorized configuration `__, `multi-gpu configuration `__, `{BeamRider,Breakout,Qbert,SpaceInvaders}NoFrameskip-v4 `__. .. figure:: images/impala.png :width: 650 Multi-GPU IMPALA scales up to solve PongNoFrameskip-v4 in ~3 minutes using a pair of V100 GPUs and 128 CPU workers. The maximum training throughput reached is ~30k transitions per second (~120k environment frames per second). **IMPALA-specific configs** (see also :ref:`generic algorithm settings `): .. autoclass:: ray.rllib.algorithms.impala.impala.IMPALAConfig :members: training Model-based RL ~~~~~~~~~~~~~~ .. _dreamerv3: DreamerV3 --------- `[paper] `__ `[implementation] `__ .. figure:: images/algos/dreamerv3-architecture.svg :width: 850 :align: left **DreamerV3 architecture:** DreamerV3 trains a recurrent world model in a supervised fashion, using real environment interactions sampled from a replay buffer. The world model's objective is to correctly predict the transition dynamics of the RL environment: next observation, reward, and a boolean continuation flag. DreamerV3 trains the actor- and critic-networks on synthesized trajectories only, which are "dreamed" by the world model. DreamerV3 scales out on both axes, supporting multiple EnvRunners for sample collection and multiple GPU- or CPU-based Learners for updating the model. It also works with different environment types, including those with image- or vector-based observations, continuous or discrete actions, and sparse or dense reward functions. **Tuned examples:** `Atari 100k `__, `Atari 200M `__, `DeepMind Control Suite `__. **Pong-v5 results (1, 2, and 4 GPUs)**: .. figure:: images/dreamerv3/pong_1_2_and_4gpus.svg Episode mean rewards for the Pong-v5 environment (with the "100k" setting, in which only 100k environment steps are allowed). Note that despite the stable sample efficiency, shown by the constant learning performance per environment step, the wall time improves almost linearly as you go from 1 to 4 GPUs. **Left**: Episode reward over environment timesteps sampled. **Right**: Episode reward over wall-time. **Atari 100k results (1 vs 4 GPUs)**: .. figure:: images/dreamerv3/atari100k_1_vs_4gpus.svg Episode mean rewards for various Atari 100k tasks on 1 vs 4 GPUs. **Left**: Episode reward over environment timesteps sampled. **Right**: Episode reward over wall-time. **DeepMind Control Suite (vision) results (1 vs 4 GPUs)**: .. figure:: images/dreamerv3/dmc_1_vs_4gpus.svg Episode mean rewards for various DeepMind Control Suite (vision) tasks on 1 vs 4 GPUs. **Left**: Episode reward over environment timesteps sampled. **Right**: Episode reward over wall-time.
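A minimal DreamerV3 configuration sketch follows. It assumes the ``model_size`` and ``training_ratio`` settings that RLlib's DreamerV3 tuned examples use; check your RLlib version's ``DreamerV3Config`` for the exact fields available.

.. code-block:: python

    from ray.rllib.algorithms.dreamerv3 import DreamerV3Config

    config = (
        DreamerV3Config()
        .environment("CartPole-v1")
        .training(
            model_size="S",       # size preset of the world model and actor/critic nets
            training_ratio=1024,  # roughly: trained (replayed) steps per sampled env step
        )
    )

    algo = config.build()
    print(algo.train())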
Offline RL and Imitation Learning ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. _bc: Behavior Cloning (BC) --------------------- `[paper] `__ `[implementation] `__ .. figure:: images/algos/bc-architecture.svg :width: 750 :align: left **BC architecture:** RLlib's behavior cloning (BC) uses Ray Data to tap into its parallel data processing capabilities. In one training iteration, BC reads episodes in parallel from offline files, for example `parquet `__, using the n DataWorkers. Connector pipelines then preprocess these episodes into train batches and send these as data iterators directly to the n Learners for updating the model. RLlib's BC implementation is directly derived from its `MARWIL`_ implementation, with the only difference being the ``beta`` parameter (set to 0.0). This makes BC try to match the behavior policy that generated the offline data, disregarding any resulting rewards. **Tuned examples:** `CartPole-v1 `__, `Pendulum-v1 `__. **BC-specific configs** (see also :ref:`generic algorithm settings `): .. autoclass:: ray.rllib.algorithms.bc.bc.BCConfig :members: training .. _cql: Conservative Q-Learning (CQL) ----------------------------- `[paper] `__ `[implementation] `__ .. figure:: images/algos/cql-architecture.svg :width: 750 :align: left **CQL architecture:** CQL (Conservative Q-Learning) is an offline RL algorithm that mitigates the overestimation of Q-values outside the dataset distribution through a conservative critic estimate. It adds a simple Q regularizer loss to the standard Bellman update loss, ensuring that the critic doesn't output overly optimistic Q-values. The `SACLearner` adds this conservative correction term to the TD-based Q-learning loss. **Tuned examples:** `Pendulum-v1 `__. **CQL-specific configs** (see also :ref:`generic algorithm settings `): .. autoclass:: ray.rllib.algorithms.cql.cql.CQLConfig :members: training .. _marwil: Monotonic Advantage Re-Weighted Imitation Learning (MARWIL) ----------------------------------------------------------- `[paper] `__ `[implementation] `__ .. figure:: images/algos/marwil-architecture.svg :width: 750 :align: left **MARWIL architecture:** MARWIL is a hybrid imitation learning and policy gradient algorithm suitable for training on batched historical data. When the ``beta`` hyperparameter is set to zero, the MARWIL objective reduces to plain imitation learning (see `BC`_). MARWIL uses Ray Data to tap into its parallel data processing capabilities. In one training iteration, MARWIL reads episodes in parallel from offline files, for example `parquet `__, using the n DataWorkers. Connector pipelines preprocess these episodes into train batches and send these as data iterators directly to the n Learners for updating the model. **Tuned examples:** `CartPole-v1 `__. **MARWIL-specific configs** (see also :ref:`generic algorithm settings `): .. autoclass:: ray.rllib.algorithms.marwil.marwil.MARWILConfig :members: training
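The role of ``beta`` translates directly into the config. The sketch below uses a placeholder path for the offline episode data read through ``offline_data()``; with ``beta=0.0`` the setup behaves like BC, while values greater than zero re-weight the cloned actions by their advantages. Treat the exact field names as version-dependent.

.. code-block:: python

    from ray.rllib.algorithms.marwil import MARWILConfig

    config = (
        MARWILConfig()
        .environment("CartPole-v1")
        # Placeholder: point this to your own recorded offline episodes.
        .offline_data(input_="/path/to/offline/episodes")
        # beta=0.0 reduces MARWIL to plain behavior cloning (BC);
        # beta > 0.0 additionally re-weights actions by their advantages.
        .training(beta=1.0)
    )

    algo = config.build()
    print(algo.train())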
Algorithm Extensions and Plugins ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. _icm: Curiosity-driven Exploration by Self-supervised Prediction ---------------------------------------------------------- `[paper] `__ `[implementation] `__ .. figure:: images/algos/curiosity-architecture.svg :width: 850 :align: left **Intrinsic Curiosity Model (ICM) architecture:** The main idea behind ICM is to train a world model (in parallel to the "main" policy) to predict the environment's dynamics. The loss of the world model is the intrinsic reward that the `ICMLearner` adds to the env's (extrinsic) reward. This ensures that in regions of the environment that are relatively unknown, where the world model predicts poorly what happens next, the intrinsic reward is large and the agent is motivated to explore these unknown regions. RLlib's curiosity implementation works with any of RLlib's algorithms. See these example implementations on top of `PPO and DQN `__. ICM uses the chosen Algorithm's `training_step()` as-is, but then executes the following additional steps during `LearnerGroup.update`: 1. Duplicate the train batch of the "main" policy and use it for a self-supervised update of the ICM. 2. Use the ICM to compute the intrinsic rewards and add them to the extrinsic (env) rewards. 3. Continue updating the "main" policy. **Tuned examples:** `12x12 FrozenLake-v1 `__. .. |single_agent| image:: images/sigils/single-agent.svg :class: inline-figure :width: 84 .. |multi_agent| image:: images/sigils/multi-agent.svg :class: inline-figure :width: 84 .. |multi_gpu| image:: images/sigils/multi-gpu.svg :class: inline-figure :width: 84 .. |multi_node_multi_gpu| image:: images/sigils/multi-node-multi-gpu.svg :class: inline-figure :width: 84 .. |discr_actions| image:: images/sigils/discr-actions.svg :class: inline-figure :width: 84 .. |cont_actions| image:: images/sigils/cont-actions.svg :class: inline-figure :width: 84
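To make the curiosity mechanism concrete, the following framework-agnostic sketch shows the reward combination that the ICM section above describes. The names ``forward_model`` and ``eta`` are illustrative only and don't refer to RLlib classes; RLlib's actual implementation lives in the ICM Learner linked above.

.. code-block:: python

    import numpy as np


    def intrinsic_reward(forward_model, phi_obs, action, phi_next_obs):
        """Returns the curiosity bonus: the forward model's prediction error.

        `phi_obs` and `phi_next_obs` are feature embeddings of the current and
        next observation; `forward_model` predicts the next embedding from the
        current embedding and the action taken.
        """
        predicted_next = forward_model(phi_obs, action)
        # A large prediction error means a poorly understood region of the
        # environment and therefore a large exploration bonus.
        return 0.5 * float(np.sum((predicted_next - phi_next_obs) ** 2))


    def combined_reward(extrinsic_reward, curiosity_bonus, eta=0.01):
        # The agent maximizes the env reward plus the scaled intrinsic reward.
        return extrinsic_reward + eta * curiosity_bonus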