.. include:: /_includes/rllib/we_are_hiring.rst

.. _rllib-key-concepts:

Key concepts
============

.. include:: /_includes/rllib/new_api_stack.rst

To give you a high-level understanding of how the library works, this page walks through the key concepts and general architecture of RLlib.

.. figure:: images/rllib_key_concepts.svg
    :width: 750
    :align: left

    **RLlib overview:** The central component of RLlib is the :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` class, acting as a runtime for executing your RL experiments. Your gateway into using an :ref:`Algorithm ` is the :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig` (cyan) class, allowing you to manage available configuration settings, for example learning rate or model architecture. Most :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` objects have :py:class:`~ray.rllib.env.env_runner.EnvRunner` actors (blue) to collect training samples from the :ref:`RL environment ` and :py:class:`~ray.rllib.core.learner.learner.Learner` actors (yellow) to compute gradients and update your :ref:`models `. The algorithm synchronizes model weights after an update.

.. _rllib-key-concepts-algorithms:

AlgorithmConfig and Algorithm
-----------------------------

.. todo (sven): Change the following link to the actual algorithm and algorithm-config page, once done.
   Right now, it's pointing to the algos-overview page, instead!

.. tip::

    The following is a quick overview of **RLlib AlgorithmConfigs and Algorithms**. See here for a :ref:`detailed description of the Algorithm class `.

The RLlib :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` class serves as a runtime for your RL experiments, bringing together all components required for learning an optimal solution to your :ref:`RL environment `. It exposes powerful Python APIs for controlling your experiment runs.

The gateways into using the various RLlib :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` types are the respective :py:class:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig` classes, allowing you to configure available settings in a checked and type-safe manner. For example, to configure a :py:class:`~ray.rllib.algorithms.ppo.ppo.PPO` ("Proximal Policy Optimization") algorithm instance, you use the :py:class:`~ray.rllib.algorithms.ppo.ppo.PPOConfig` class.

During its construction, the :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` first sets up its :py:class:`~ray.rllib.env.env_runner_group.EnvRunnerGroup`, containing ``n`` :py:class:`~ray.rllib.env.env_runner.EnvRunner` `actors `__, and its :py:class:`~ray.rllib.core.learner.learner_group.LearnerGroup`, containing ``m`` :py:class:`~ray.rllib.core.learner.learner.Learner` `actors `__. This way, you can scale up sample collection and training, respectively, from a single core to many thousands of cores in a cluster.

.. todo: Separate out our scaling guide into its own page in new PR

See this :ref:`scaling guide ` for more details.
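For example, you control ``n`` and ``m`` directly through the config. The following is a minimal sketch with arbitrary example values; the :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.env_runners` and :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.learners` methods used here are the relevant settings in recent RLlib versions:

.. code-block:: python

    from ray.rllib.algorithms.ppo import PPOConfig

    config = (
        PPOConfig()
        .environment("CartPole-v1")
        # Scale out sample collection: n EnvRunner actors.
        .env_runners(num_env_runners=4)
        # Scale out training: m Learner actors.
        .learners(num_learners=2)
    )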
You have two ways to interact with and run an :py:class:`~ray.rllib.algorithms.algorithm.Algorithm`:

- You can create and manage an instance of it directly through the Python API.
- Because the :py:class:`~ray.rllib.algorithms.algorithm.Algorithm` class is a subclass of the :ref:`Tune Trainable API `, you can use `Ray Tune `__ to more easily manage your experiment and tune hyperparameters.

The following examples demonstrate this on RLlib's :py:class:`~ray.rllib.algorithms.ppo.PPO` ("Proximal Policy Optimization") algorithm:

.. tab-set::

    .. tab-item:: Manage Algorithm instance directly

        .. testcode::

            from ray.rllib.algorithms.ppo import PPOConfig

            # Configure.
            config = (
                PPOConfig()
                .environment("CartPole-v1")
                .training(
                    train_batch_size_per_learner=2000,
                    lr=0.0004,
                )
            )

            # Build the Algorithm.
            algo = config.build()

            # Train for one iteration, which is 2000 timesteps (1 train batch).
            print(algo.train())

        .. testcode::
            :hide:

            algo.stop()

    .. tab-item:: Run Algorithm through Ray Tune

        .. testcode::

            from ray import tune
            from ray.rllib.algorithms.ppo import PPOConfig

            # Configure.
            config = (
                PPOConfig()
                .environment("CartPole-v1")
                .training(
                    train_batch_size_per_learner=2000,
                    lr=0.0004,
                )
            )

            # Train through Ray Tune.
            results = tune.Tuner(
                "PPO",
                param_space=config,
                # Train for 4000 timesteps (2 iterations).
                run_config=tune.RunConfig(stop={"num_env_steps_sampled_lifetime": 4000}),
            ).fit()

.. _rllib-key-concepts-environments:

RL environments
---------------

.. tip::

    The following is a quick overview of **RL environments**. See :ref:`here for a detailed description of how to use RL environments in RLlib `.

A reinforcement learning (RL) environment is a structured space, like a simulator or a controlled section of the real world, in which one or more agents interact and learn to achieve specific goals. The environment defines an observation space, which describes the structure and shape of the observations at each timestep; an action space, which defines the available actions at each timestep; a reward function; and the rules that govern environment transitions when applying actions.

.. figure:: images/envs/env_loop_concept.svg
    :width: 900
    :align: left

    A simple **RL environment** where an agent starts with an initial observation returned by the ``reset()`` method. The agent, possibly controlled by a neural network policy, sends actions, like ``right`` or ``jump``, to the environment's ``step()`` method, which returns a reward. Here, the reward values are +5 for reaching the goal and 0 otherwise. The environment also returns a boolean flag indicating whether the episode is complete.

Environments may vary in complexity, from simple tasks, like navigating a grid world, to highly intricate systems, like autonomous driving simulators, robotic control environments, or multi-agent games.

RLlib interacts with the environment by playing through many :ref:`episodes ` during a training iteration to collect data, such as the observations made, actions taken, rewards received, and ``done`` flags (see the preceding figure). It then converts this episode data into a train batch for model updating. The goal of these model updates is to change the agents' behaviors so that they maximize the sum of rewards received over the agents' lifetimes.
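The loop in the preceding figure corresponds to the ``reset()``/``step()`` API of `Gymnasium <https://gymnasium.farama.org>`__, which RLlib environments follow. Here is a minimal, RLlib-independent sketch of that loop, with random action selection standing in for an actual policy:

.. code-block:: python

    import gymnasium as gym

    env = gym.make("CartPole-v1")

    # Start an episode: `reset()` returns the initial observation and an info dict.
    obs, info = env.reset()
    terminated = truncated = False
    episode_return = 0.0

    while not (terminated or truncated):
        # A real agent would compute this action with its policy (RLModule).
        action = env.action_space.sample()
        # `step()` returns the next observation, the reward, and two "done" flags.
        obs, reward, terminated, truncated, info = env.step(action)
        episode_return += reward

    print(f"Episode finished with a total reward of {episode_return}.")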
.. _rllib-key-concepts-rl-modules:

RLModules
---------

.. tip::

    The following is a quick overview of **RLlib RLModules**. See :ref:`here for a detailed description of the RLModule class `.

`RLModules `__ are deep-learning framework-specific neural network wrappers. RLlib's :ref:`EnvRunners ` use them for computing actions when stepping through the :ref:`RL environment ` and RLlib's :ref:`Learners ` use :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` instances for computing losses and gradients before updating them.

.. figure:: images/rl_modules/rl_module_overview.svg
    :width: 750
    :align: left

    **RLModule overview**: *(left)* A minimal :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` contains a neural network and defines its forward logic for exploration, inference, and training. *(right)* In more complex setups, a :py:class:`~ray.rllib.core.rl_module.multi_rl_module.MultiRLModule` contains many submodules, each itself an :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` instance and identified by a ``ModuleID``, allowing you to implement arbitrarily complex multi-model and multi-agent algorithms.

In a nutshell, an :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` carries the neural network models and defines how to use them during the three phases of its RL lifecycle: **exploration**, for collecting training data; **inference**, for computing actions during evaluation or in production; and **training**, for computing the loss function inputs. You can choose to use :ref:`RLlib's built-in default models and configure these ` as needed, for example to change the number of layers or the activation functions, or :ref:`write your own custom models in PyTorch `, allowing you to implement any architecture and computation logic.

.. figure:: images/rl_modules/rl_module_in_env_runner.svg
    :width: 450
    :align: left

    **An RLModule inside an EnvRunner actor**: The :py:class:`~ray.rllib.env.env_runner.EnvRunner` operates on its own copy of an inference-only version of the :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule`, using it only to compute actions.

Each :py:class:`~ray.rllib.env.env_runner.EnvRunner` actor, managed by the :py:class:`~ray.rllib.env.env_runner_group.EnvRunnerGroup` of the Algorithm, has a copy of the user's :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule`. Also, each :py:class:`~ray.rllib.core.learner.learner.Learner` actor, managed by the :py:class:`~ray.rllib.core.learner.learner_group.LearnerGroup` of the Algorithm, has an :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` copy. The :py:class:`~ray.rllib.env.env_runner.EnvRunner` copy is normally the ``inference_only`` version, meaning that components not required for computing actions, for example a value function estimate, are missing to save memory.

.. figure:: images/rl_modules/rl_module_in_learner.svg
    :width: 400
    :align: left

    **An RLModule inside a Learner actor**: The :py:class:`~ray.rllib.core.learner.learner.Learner` operates on its own copy of an :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule`, computing the loss function inputs, the loss itself, and the model's gradients, then updating the :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` through the :py:class:`~ray.rllib.core.learner.learner.Learner`'s optimizers.

.. _rllib-key-concepts-episodes:

Episodes
--------

.. tip::

    The following is a quick overview of **Episodes**. See :ref:`here for a detailed description of the Episode classes `.

RLlib sends around all training data in the form of :ref:`Episodes `. The :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` class describes single-agent trajectories. The :py:class:`~ray.rllib.env.multi_agent_episode.MultiAgentEpisode` class contains several such single-agent episodes and describes when and how the individual agents stepped with respect to each other.

Both ``Episode`` classes store the entire trajectory data generated while stepping through an :ref:`RL environment `. This data includes the observations, info dicts, actions, rewards, termination signals, and any model computations along the way, like recurrent states, action logits, or action log probabilities.

.. tip::

    See here for `RLlib's standardized column names `__.
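To make this more concrete, the following sketch builds a tiny :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` by hand and reads the data back through its getter APIs. The method names, such as ``add_env_reset`` and ``add_env_step``, are those of recent RLlib versions, and all values are made up for illustration:

.. code-block:: python

    import numpy as np
    from ray.rllib.env.single_agent_episode import SingleAgentEpisode

    episode = SingleAgentEpisode()

    # Log the initial "reset" observation, then a few environment steps.
    episode.add_env_reset(observation=np.array([0.1, 0.2, 0.3, 0.4]), infos={})
    for t in range(3):
        episode.add_env_step(
            observation=np.random.random(4),
            action=1,
            reward=1.0,
            terminated=(t == 2),
        )

    print(len(episode))                  # 3 timesteps logged
    print(episode.get_return())          # Sum of rewards: 3.0
    print(episode.get_observations(-1))  # The most recent observation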
Note that episodes conveniently don't have to store any ``next obs`` information, because it always overlaps with the information under ``obs``. This design saves almost 50% of memory, because observations are often the largest piece in a trajectory. The same is true for the ``state_in`` and ``state_out`` information of stateful networks: RLlib only keeps the ``state_out`` key in the episodes.

Typically, RLlib generates episode chunks of size ``config.rollout_fragment_length`` through the :ref:`EnvRunner ` actors in the Algorithm's :ref:`EnvRunnerGroup `, and sends as many episode chunks to each :ref:`Learner ` actor as required to build one training batch of exactly size ``config.train_batch_size_per_learner``.

A typical :py:class:`~ray.rllib.env.single_agent_episode.SingleAgentEpisode` object roughly looks as follows:

.. code-block:: python

    # A SingleAgentEpisode of length 20 has roughly the following schematic structure.
    # Note that after these 20 steps, you have 20 actions and rewards, but 21 observations
    # and info dicts due to the initial "reset" observation/infos.
    episode = {
        'obs': np.ndarray((21, 4), dtype=float32),  # 21 due to additional reset obs
        'infos': [{}, {}, {}, {}, .., {}, {}],  # infos are always lists of dicts
        'actions': np.ndarray((20,), dtype=int64),  # Discrete(4) action space
        'rewards': np.ndarray((20,), dtype=float32),
        'extra_model_outputs': {
            'action_dist_inputs': np.ndarray((20, 4), dtype=float32),  # Discrete(4) action space
        },
        'is_terminated': False,  # <- single bool
        'is_truncated': True,  # <- single bool
    }

For complex observations, for example ``gym.spaces.Dict``, the episode holds all observations in a struct entirely analogous to the observation space, with NumPy arrays at the leaves of that dict. For example:

.. code-block:: python

    episode_w_complex_observations = {
        'obs': {
            "camera": np.ndarray((21, 64, 64, 3), dtype=float32),  # RGB images
            "sensors": {
                "front": np.ndarray((21, 15), dtype=float32),  # 1D tensors
                "rear": np.ndarray((21, 5), dtype=float32),  # another batch of 1D tensors
            },
        },
        ...

Because RLlib keeps all values in NumPy arrays, it can encode and transmit them efficiently across the network.

In `multi-agent mode `__, the :py:class:`~ray.rllib.env.env_runner_group.EnvRunnerGroup` produces :py:class:`~ray.rllib.env.multi_agent_episode.MultiAgentEpisode` instances.

.. note::

    The Ray team is working on a detailed description of the :py:class:`~ray.rllib.env.multi_agent_episode.MultiAgentEpisode` class.

.. _rllib-key-concepts-env-runners:

EnvRunner: Combining RL environment and RLModule
------------------------------------------------

Given the :ref:`RL environment ` and an :ref:`RLModule `, an :py:class:`~ray.rllib.env.env_runner.EnvRunner` produces lists of :ref:`Episodes `. It does so by executing a classic environment interaction loop.

Efficient sample collection can be burdensome to get right, especially when leveraging environment vectorization, stateful recurrent neural networks, or when operating in a multi-agent setting. RLlib provides two built-in :py:class:`~ray.rllib.env.env_runner.EnvRunner` classes, :py:class:`~ray.rllib.env.single_agent_env_runner.SingleAgentEnvRunner` and :py:class:`~ray.rllib.env.multi_agent_env_runner.MultiAgentEnvRunner`, which automatically handle these complexities. RLlib picks the correct type based on your configuration, in particular the ``config.environment()`` and ``config.multi_agent()`` settings.
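For illustration, the following sketch contrasts the two cases. The environment name ``"my_multi_agent_env"``, the policy IDs, and the mapping function are placeholders for whatever your own multi-agent setup registers and defines:

.. code-block:: python

    from ray.rllib.algorithms.ppo import PPOConfig

    # Single-agent setup: RLlib uses SingleAgentEnvRunner actors.
    single_agent_config = PPOConfig().environment("CartPole-v1")

    # Multi-agent setup: RLlib uses MultiAgentEnvRunner actors.
    # "my_multi_agent_env" stands in for a registered multi-agent environment.
    multi_agent_config = (
        PPOConfig()
        .environment("my_multi_agent_env")
        .multi_agent(
            policies={"policy_0", "policy_1"},
            # Maps agent IDs, for example 0 and 1, to the policy IDs above.
            policy_mapping_fn=lambda agent_id, episode, **kwargs: f"policy_{agent_id}",
        )
    )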
.. tip::

    Call the :py:meth:`~ray.rllib.algorithms.algorithm_config.AlgorithmConfig.is_multi_agent` method to find out whether your config is multi-agent.

RLlib bundles several :py:class:`~ray.rllib.env.env_runner.EnvRunner` actors through the :py:class:`~ray.rllib.env.env_runner_group.EnvRunnerGroup` API. You can also use an :py:class:`~ray.rllib.env.env_runner.EnvRunner` standalone to produce lists of Episodes by calling its :py:meth:`~ray.rllib.env.env_runner.EnvRunner.sample` method.

Here is an example of creating a set of remote :py:class:`~ray.rllib.env.env_runner.EnvRunner` actors and using them to gather experiences in parallel:

.. testcode::

    import tree  # pip install dm_tree
    import ray
    from ray.rllib.algorithms.ppo import PPOConfig
    from ray.rllib.env.single_agent_env_runner import SingleAgentEnvRunner

    # Configure the EnvRunners.
    config = (
        PPOConfig()
        .environment("Acrobot-v1")
        .env_runners(num_env_runners=2, num_envs_per_env_runner=1)
    )
    # Create the EnvRunner actors.
    env_runners = [
        ray.remote(SingleAgentEnvRunner).remote(config=config)
        for _ in range(config.num_env_runners)
    ]

    # Gather lists of `SingleAgentEpisode`s (each EnvRunner actor returns one
    # such list with exactly three episodes in it).
    episodes = ray.get([
        er.sample.remote(num_episodes=3)
        for er in env_runners
    ])
    # Two remote EnvRunners used.
    assert len(episodes) == 2
    # Each EnvRunner returns three episodes.
    assert all(len(eps_list) == 3 for eps_list in episodes)

    # Report the returns of all episodes collected.
    for episode in tree.flatten(episodes):
        print("R=", episode.get_return())

.. testcode::
    :hide:

    for er in env_runners:
        er.stop.remote()

.. _rllib-key-concepts-learners:

Learner: Combining RLModule, loss function and optimizer
---------------------------------------------------------

.. tip::

    The following is a quick overview of **RLlib Learners**. See :ref:`here for a detailed description of the Learner class `.

Given the :ref:`RLModule ` and one or more optimizers and loss functions, a :py:class:`~ray.rllib.core.learner.learner.Learner` computes losses and gradients, then updates the :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule`. The input data for such an update step comes in as a list of :ref:`episodes `, which either the Learner's own connector pipeline or an external one converts into the final train batch.

.. note::

    :py:class:`~ray.rllib.connectors.connector_v2.ConnectorV2` documentation is work in progress. The Ray team will link to the correct documentation page here once that work is complete.

:py:class:`~ray.rllib.core.learner.learner.Learner` instances are algorithm-specific, mostly due to the various loss functions used by different RL algorithms. RLlib always bundles several :py:class:`~ray.rllib.core.learner.learner.Learner` actors through the :py:class:`~ray.rllib.core.learner.learner_group.LearnerGroup` API, automatically applying distributed data parallelism (``DDP``) on the training data. You can also use a :py:class:`~ray.rllib.core.learner.learner.Learner` standalone to update your RLModule with a list of Episodes.

Here is an example of creating a remote :py:class:`~ray.rllib.core.learner.learner.Learner` actor and calling its :py:meth:`~ray.rllib.core.learner.learner.Learner.update` method:

.. testcode::

    import gymnasium as gym
    import ray
    from ray.rllib.algorithms.ppo import PPOConfig
    from ray.rllib.core.rl_module.default_model_config import DefaultModelConfig

    # Configure the Learner.
    config = (
        PPOConfig()
        .environment("Acrobot-v1")
        .training(lr=0.0001)
        .rl_module(model_config=DefaultModelConfig(fcnet_hiddens=[64, 32]))
    )

    # Get the Learner class.
    ppo_learner_class = config.get_default_learner_class()

    # Create the Learner actor.
    learner_actor = ray.remote(ppo_learner_class).remote(
        config=config,
        module_spec=config.get_multi_rl_module_spec(env=gym.make("Acrobot-v1")),
    )
    # Build the Learner.
    ray.get(learner_actor.build.remote())

    # Perform an update from the list of episodes we got from the `EnvRunners` above.
    learner_results = ray.get(learner_actor.update.remote(
        episodes=tree.flatten(episodes)
    ))
    print(learner_results["default_policy"]["policy_loss"])
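The returned results dict is keyed by ``ModuleID``; in this single-module setup, that's the ``default_policy`` entry, whose ``policy_loss`` the example prints. In a complete :py:class:`~ray.rllib.algorithms.algorithm.Algorithm`, the updated :py:class:`~ray.rllib.core.rl_module.rl_module.RLModule` weights of the :py:class:`~ray.rllib.core.learner.learner.Learner` actors are then synchronized back to the :py:class:`~ray.rllib.env.env_runner.EnvRunner` actors, closing the loop shown in the overview figure at the top of this page.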