Key concepts#

Note

Ray 2.40 uses RLlib’s new API stack by default. The Ray team has mostly completed transitioning algorithms, example scripts, and documentation to the new code base.

If you’re still using the old API stack, see New API stack migration guide for details on how to migrate.

This page gives you a high-level understanding of how the library works by walking you through RLlib's key concepts and general architecture.

../_images/rllib_key_concepts.svg

RLlib overview: The central component of RLlib is the Algorithm class, acting as a runtime for executing your RL experiments. Your gateway into using an Algorithm is the AlgorithmConfig class (cyan), which allows you to manage available configuration settings, for example the learning rate or the model architecture. Most Algorithm objects have EnvRunner actors (blue) to collect training samples from the RL environment and Learner actors (yellow) to compute gradients and update your models. The algorithm synchronizes model weights after an update.#

AlgorithmConfig and Algorithm#

Tip

The following is a quick overview of RLlib AlgorithmConfigs and Algorithms. See here for a detailed description of the Algorithm class.

The RLlib Algorithm class serves as a runtime for your RL experiments, bringing together all components required for learning an optimal solution to your RL environment. It exposes powerful Python APIs for controlling your experiment runs.

The gateways into using the various RLlib Algorithm types are the respective AlgorithmConfig classes, allowing you to configure available settings in a checked and type-safe manner. For example, to configure a PPO (“Proximal Policy Optimization”) algorithm instance, you use the PPOConfig class.

During its construction, the Algorithm first sets up its EnvRunnerGroup, containing n EnvRunner actors, and its LearnerGroup, containing m Learner actors. This way, you can scale up sample collection and training, respectively, from a single core to many thousands of cores in a cluster.

See this scaling guide for more details.
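For example, here is a minimal sketch of scaling both actor groups through the config, assuming the env_runners() and learners() methods of the new API stack; the actor and GPU counts are arbitrary example values, not recommendations:

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    # n EnvRunner actors collecting samples in parallel.
    .env_runners(num_env_runners=4)
    # m Learner actors updating the model, for example with one GPU each.
    .learners(num_learners=2, num_gpus_per_learner=1)
)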

You have two ways to interact with and run an Algorithm:

  • You can create and manage an instance of it directly through the Python API.

  • Because the Algorithm class is a subclass of the Tune Trainable API, you can use Ray Tune to more easily manage your experiment and tune hyperparameters.

The following examples demonstrate this on RLlib’s PPO (“Proximal Policy Optimization”) algorithm:

from ray.rllib.algorithms.ppo import PPOConfig

# Configure.
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .training(
        train_batch_size_per_learner=2000,
        lr=0.0004,
    )
)

# Build the Algorithm.
algo = config.build()

# Train for one iteration, which is 2000 timesteps (1 train batch).
print(algo.train())

# Alternatively, configure and run the same algorithm through Ray Tune:
from ray import train, tune
from ray.rllib.algorithms.ppo import PPOConfig

# Configure.
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .training(
        train_batch_size_per_learner=2000,
        lr=0.0004,
    )
)

# Train through Ray Tune.
results = tune.Tuner(
    "PPO",
    param_space=config,
    # Train for 4000 timesteps (2 iterations).
    run_config=train.RunConfig(stop={"num_env_steps_sampled_lifetime": 4000}),
).fit()

RL environments#

Tip

The following is a quick overview of RL environments. See here for a detailed description of how to use RL environments in RLlib.

A reinforcement learning (RL) environment is a structured space, like a simulator or a controlled section of the real world, in which one or more agents interact and learn to achieve specific goals. The environment defines an observation space, meaning the structure and shape of the tensors the agents observe at each timestep, an action space, which defines the actions available to the agents at each timestep, a reward function, and the rules that govern environment transitions when applying actions.

../_images/env_loop_concept.svg

A simple RL environment where an agent starts with an initial observation returned by the reset() method. The agent, possibly controlled by a neural network policy, sends actions, like right or jump, to the environment's step() method, which returns a reward. Here, the reward values are +5 for reaching the goal and 0 otherwise. The environment also returns a boolean flag indicating whether the episode is complete.#

Environments may vary in complexity, from simple tasks, like navigating a grid world, to highly intricate systems, like autonomous driving simulators, robotic control environments, or multi-agent games.

RLlib interacts with the environment by playing through many episodes during a training iteration to collect data, such as the observations made, actions taken, rewards received, and done flags (see the preceding figure). It then converts this episode data into a train batch for model updating. The goal of these model updates is to change the agents' behaviors so that they maximize the sum of rewards received over the agents' lifetimes.
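To make this loop concrete, here is a minimal sketch of the classic environment interaction loop that RLlib automates for you, using the Farama Gymnasium CartPole-v1 environment and a random action as a stand-in for a trained policy:

import gymnasium as gym

env = gym.make("CartPole-v1")

# Start a new episode and receive the initial observation.
obs, info = env.reset()
terminated = truncated = False
episode_return = 0.0

while not (terminated or truncated):
    # A trained policy (RLModule) would compute this action from `obs`;
    # a random sample stands in for it here.
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward

print(f"Episode return: {episode_return}")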

RLModules#

Tip

The following is a quick overview of RLlib RLModules. See here for a detailed description of the RLModule class.

RLModules are deep-learning framework-specific neural network containers. RLlib's EnvRunners use them to compute actions when stepping through the RL environment, and RLlib's Learners use them to compute losses and gradients when updating the models.

../_images/rl_module_overview.svg

RLModule overview: (left) A minimal RLModule contains a neural network and defines its forward exploration-, inference- and training logic. (right) In more complex setups, a MultiRLModule contains many submodules, each itself an RLModule instance and identified by a ModuleID, allowing you to implement arbitrarily complex multi-model and multi-agent algorithms.#

In a nutshell, an RLModule carries the neural network models and defines how to use them during the three phases of its RL lifecycle: exploration, for collecting training data; inference, when computing actions for evaluation or in production; and training, for computing the loss function inputs.

You can choose to use RLlib's built-in default models and configure these as needed, for example to change the number of layers or the activation functions, or write your own custom models in PyTorch, allowing you to implement any architecture and computation logic.
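For example, here is a possible sketch of tweaking the built-in default model through DefaultModelConfig; the layer sizes and activation are arbitrary example values:

from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.core.rl_module.default_model_config import DefaultModelConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    .rl_module(
        model_config=DefaultModelConfig(
            fcnet_hiddens=[256, 256],  # two hidden layers with 256 units each
            fcnet_activation="relu",   # activation function for the hidden layers
        )
    )
)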

../_images/rl_module_in_env_runner.svg

An RLModule inside an EnvRunner actor: The EnvRunner operates on its own copy of an inference-only version of the RLModule, using it only to compute actions.#

Each EnvRunner actor, managed by the EnvRunnerGroup of the Algorithm, has a copy of the user’s RLModule. Also, each Learner actor, managed by the LearnerGroup of the Algorithm has an RLModule copy.

The EnvRunner copy is normally the inference_only version, meaning that components not required for bare action computation, for example a value function estimate, are omitted to save memory.

../_images/rl_module_in_learner.svg

An RLModule inside a Learner actor: The Learner operates on its own copy of an RLModule, computing the loss function inputs, the loss itself, and the model’s gradients, then updating the RLModule through the Learner’s optimizers.#

Episodes#

Tip

The following is a quick overview of RLlib Episodes. See here for a detailed description of the Episode classes.

RLlib sends all training data around in the form of Episodes.

The SingleAgentEpisode class describes single-agent trajectories. The MultiAgentEpisode class contains several such single-agent episodes and describes the stepping times and patterns of the individual agents relative to each other.

Both Episode classes store the entire trajectory data generated while stepping through an RL environment. This data includes the observations, info dicts, actions, rewards, termination signals, and any model computations along the way, like recurrent states, action logits, or action log probabilities.

Tip

See here for RLlib’s standardized column names.

Note that episodes conveniently don’t have to store any next obs information as it always overlaps with the information under obs. This design saves almost 50% of memory, because observations are often the largest piece in a trajectory. The same is true for state_in and state_out information for stateful networks. RLlib only keeps the state_out key in the episodes.

Typically, RLlib generates episode chunks of size config.rollout_fragment_length through the EnvRunner actors in the Algorithm’s EnvRunnerGroup, and sends as many episode chunks to each Learner actor as required to build one training batch of exactly size config.train_batch_size_per_learner.
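As a rough sketch of how these two settings relate, for an on-policy algorithm like PPO the chunks collected per iteration typically add up to the train batch size; the concrete numbers here are arbitrary example values:

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    # 4 EnvRunners, each returning episode chunks of ~500 timesteps per sample call.
    .env_runners(num_env_runners=4, rollout_fragment_length=500)
    # One Learner building train batches of exactly 2000 timesteps from these chunks.
    .training(train_batch_size_per_learner=2000)
)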

A typical SingleAgentEpisode object roughly looks as follows:

# A SingleAgentEpisode of length 20 has roughly the following schematic structure.
# Note that after these 20 steps, you have 20 actions and rewards, but 21 observations and info dicts
# due to the initial "reset" observation/infos.
episode = {
    'obs': np.ndarray((21, 4), dtype=float32),  # 21 due to additional reset obs
    'infos': [{}, {}, {}, {}, .., {}, {}],  # infos are always lists of dicts
    'actions': np.ndarray((20,), dtype=int64),  # Discrete(4) action space
    'rewards': np.ndarray((20,), dtype=float32),
    'extra_model_outputs': {
        'action_dist_inputs': np.ndarray((20, 4), dtype=float32),  # Discrete(4) action space
    },
    'is_terminated': False,  # <- single bool
    'is_truncated': True,  # <- single bool
}
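For illustration, assuming episode is a SingleAgentEpisode like the length-20 one sketched above, you could access its data roughly as follows; note how the observation at index 1 serves both as the result of timestep 0 and as the input to timestep 1, which is why no separate next obs data is needed:

# Assuming `episode` is the length-20 SingleAgentEpisode sketched above.
first_obs = episode.get_observations(0)           # the reset observation
next_obs_of_step_0 = episode.get_observations(1)  # also the obs that starts timestep 1
first_action = episode.get_actions(0)
first_reward = episode.get_rewards(0)
print(episode.get_return())  # sum of all 20 rewards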

For complex observations, for example gym.spaces.Dict, the episode holds all observations in a struct entirely analogous to the observation space, with NumPy arrays at the leaves of that dict. For example:

episode_w_complex_observations = {
    'obs': {
        "camera": np.ndarray((21, 64, 64, 3), dtype=float32),  # RGB images
        "sensors": {
            "front": np.ndarray((21, 15), dtype=float32),  # 1D tensors
            "rear": np.ndarray((21, 5), dtype=float32),  # another batch of 1D tensors
        },
    },
    ...
}

Because RLlib keeps all values in NumPy arrays, it can encode and transmit them efficiently across the network.

In multi-agent mode, the EnvRunnerGroup produces MultiAgentEpisode instances.

Note

The Ray team is working on a detailed description of the MultiAgentEpisode class.

EnvRunner: Combining RL environment and RLModule#

Given the RL environment and an RLModule, an EnvRunner produces lists of Episodes.

It does so by executing a classic environment interaction loop. Efficient sample collection can be burdensome to get right, especially when leveraging environment vectorization, stateful recurrent neural networks, or when operating in a multi-agent setting.

RLlib provides two built-in EnvRunner classes, SingleAgentEnvRunner and MultiAgentEnvRunner, which automatically handle these complexities. RLlib picks the correct type based on your configuration, in particular the config.environment() and config.multi_agent() settings.

Tip

Call the is_multi_agent() method to find out whether your config is multi-agent or not.
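For example, a quick sketch of this check on a plain single-agent config, assuming is_multi_agent() is callable directly on the config as the tip above describes:

from ray.rllib.algorithms.ppo import PPOConfig

config = PPOConfig().environment("CartPole-v1")
print(config.is_multi_agent())  # -> False for this single-agent setup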

RLlib bundles several EnvRunner actors through the EnvRunnerGroup API.

You can also use an EnvRunner standalone to produce lists of Episodes by calling its sample() method.

Here is an example of creating a set of remote EnvRunner actors and using them to gather experiences in parallel:

import tree  # pip install dm_tree
import ray
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.env.single_agent_env_runner import SingleAgentEnvRunner

# Configure the EnvRunners.
config = (
    PPOConfig()
    .environment("Acrobot-v1")
    .env_runners(num_env_runners=2, num_envs_per_env_runner=1)
)
# Create the EnvRunner actors.
env_runners = [
    ray.remote(SingleAgentEnvRunner).remote(config=config)
    for _ in range(config.num_env_runners)
]

# Gather lists of `SingleAgentEpisode`s (each EnvRunner actor returns one
# such list with exactly three episodes in it).
episodes = ray.get([
    er.sample.remote(num_episodes=3)
    for er in env_runners
])
# Two remote EnvRunners used.
assert len(episodes) == 2
# Each EnvRunner returns three episodes
assert all(len(eps_list) == 3 for eps_list in episodes)

# Report the returns of all episodes collected
for episode in tree.flatten(episodes):
    print("R=", episode.get_return())

Learner: Combining RLModule, loss function and optimizer#

Tip

The following is a quick overview of RLlib Learners. See here for a detailed description of the Learner class.

Given the RLModule and one or more optimizers and loss functions, a Learner computes losses and gradients, then updates the RLModule.

The input data for such an update step comes in as a list of episodes, which either the Learner’s own connector pipeline or an external one converts into the final train batch.

Note

ConnectorV2 documentation is work in progress. The Ray team will link to the correct documentation page here once it has completed this work.

Learner instances are algorithm-specific, mostly due to the various loss functions used by different RL algorithms.

RLlib always bundles several Learner actors through the LearnerGroup API, automatically applying distributed data parallelism (DDP) on the training data. You can also use a Learner standalone to update your RLModule with a list of Episodes.

Here is an example of creating a remote Learner actor and calling its update_from_episodes() method.

import gymnasium as gym
import ray
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.core.rl_module.default_model_config import DefaultModelConfig

# Configure the Learner.
config = (
    PPOConfig()
    .environment("Acrobot-v1")
    .training(lr=0.0001)
    .rl_module(model_config=DefaultModelConfig(fcnet_hiddens=[64, 32]))
)
# Get the Learner class.
ppo_learner_class = config.get_default_learner_class()

# Create the Learner actor.
learner_actor = ray.remote(ppo_learner_class).remote(
    config=config,
    module_spec=config.get_multi_rl_module_spec(env=gym.make("Acrobot-v1")),
)
# Build the Learner.
ray.get(learner_actor.build.remote())

# Perform an update from the list of episodes we got from the `EnvRunners` above.
learner_results = ray.get(learner_actor.update_from_episodes.remote(
    episodes=tree.flatten(episodes)
))
print(learner_results["default_policy"]["policy_loss"])