Key concepts#
Note
Ray 2.40 uses RLlib’s new API stack by default. The Ray team has mostly completed transitioning algorithms, example scripts, and documentation to the new code base.
If you’re still using the old API stack, see New API stack migration guide for details on how to migrate.
This page introduces the key concepts and general architecture of RLlib to help you get a high-level understanding of how the library works.
AlgorithmConfig and Algorithm#
Tip
The following is a quick overview of RLlib AlgorithmConfigs and Algorithms. See here for a detailed description of the Algorithm class.
The RLlib Algorithm class serves as a runtime for your RL experiments, bringing together all components required for learning an optimal solution to your RL environment. It exposes powerful Python APIs for controlling your experiment runs.
The gateways into using the various RLlib Algorithm types are the respective AlgorithmConfig classes, which allow you to configure available settings in a checked and type-safe manner. For example, to configure a PPO ("Proximal Policy Optimization") algorithm instance, you use the PPOConfig class.
During its construction, the Algorithm first sets up its EnvRunnerGroup, containing n EnvRunner actors, and its LearnerGroup, containing m Learner actors. This way, you can scale up sample collection and training, respectively, from a single core to many thousands of cores in a cluster. See this scaling guide for more details.
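For example, here is a minimal sketch of scaling both axes through the config; the actor counts and GPU settings below are illustrative assumptions, not recommendations:
from ray.rllib.algorithms.ppo import PPOConfig

# Scale sampling and training independently (example values only).
config = (
    PPOConfig()
    .environment("CartPole-v1")
    # n EnvRunner actors collect samples in parallel.
    .env_runners(num_env_runners=4)
    # m Learner actors update the model; set GPUs per Learner to match your cluster.
    .learners(num_learners=2, num_gpus_per_learner=0)
)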
You have two ways to interact with and run an Algorithm:
You can create and manage an instance of it directly through the Python API.
Because the Algorithm class is a subclass of the Tune Trainable API, you can use Ray Tune to more easily manage your experiment and tune hyperparameters.
The following examples demonstrate this on RLlib's PPO ("Proximal Policy Optimization") algorithm:
from ray.rllib.algorithms.ppo import PPOConfig

# Configure.
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .training(
        train_batch_size_per_learner=2000,
        lr=0.0004,
    )
)

# Build the Algorithm.
algo = config.build()

# Train for one iteration, which is 2000 timesteps (1 train batch).
print(algo.train())
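When you manage the Algorithm instance directly as above, you can also persist its state and shut it down when done. A minimal sketch, assuming the new API stack's save_to_path() checkpointing API:
# Save the Algorithm's state to a checkpoint directory, then release its actors.
checkpoint_dir = algo.save_to_path()
print("Checkpoint written to", checkpoint_dir)
algo.stop()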
from ray import train, tune
from ray.rllib.algorithms.ppo import PPOConfig

# Configure.
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .training(
        train_batch_size_per_learner=2000,
        lr=0.0004,
    )
)

# Train through Ray Tune.
results = tune.Tuner(
    "PPO",
    param_space=config,
    # Train for 4000 timesteps (2 iterations).
    run_config=train.RunConfig(stop={"num_env_steps_sampled_lifetime": 4000}),
).fit()
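Once fit() returns, you can inspect the resulting ResultGrid, for example to pick the best trial. A minimal sketch; the metric key below is an assumption and depends on your Ray and RLlib versions:
# Pick the best trial from the ResultGrid returned by `fit()` above.
best_result = results.get_best_result(
    metric="env_runners/episode_return_mean",  # assumed metric path; adjust to your results
    mode="max",
)
print(best_result.metrics)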
RL environments#
Tip
The following is a quick overview of RL environments. See here for a detailed description of how to use RL environments in RLlib.
A reinforcement learning (RL) environment is a structured space, like a simulator or a controlled section of the real world, in which one or more agents interact and learn to achieve specific goals. The environment defines an observation space, which is the structure and shape of the tensors observable at each timestep; an action space, which defines the actions available to the agents at each timestep; a reward function; and the rules that govern environment transitions when applying actions.
Environments may vary in complexity, from simple tasks, like navigating a grid world, to highly intricate systems, like autonomous driving simulators, robotic control environments, or multi-agent games.
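For illustration, here is a minimal sketch of a simple custom environment written against the Farama gymnasium API that RLlib uses; the corridor task, its spaces, and the reward values are made-up assumptions for this example:
import gymnasium as gym
import numpy as np


class SimpleCorridor(gym.Env):
    """Toy example: walk right along a 1D corridor until reaching the goal."""

    def __init__(self, config=None):
        self.length = 10
        self.position = 0
        # Observation space: the agent's current position as a single float.
        self.observation_space = gym.spaces.Box(0.0, self.length, shape=(1,), dtype=np.float32)
        # Action space: 0 = move left, 1 = move right.
        self.action_space = gym.spaces.Discrete(2)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.position = 0
        return np.array([self.position], dtype=np.float32), {}

    def step(self, action):
        # Transition rule: move left or right, clipped to the corridor bounds.
        self.position = min(max(self.position + (1 if action == 1 else -1), 0), self.length)
        terminated = self.position >= self.length  # goal reached
        # Reward function: small step penalty, bonus upon reaching the goal.
        reward = 1.0 if terminated else -0.1
        return np.array([self.position], dtype=np.float32), reward, terminated, False, {}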
RLlib interacts with the environment by playing through many episodes during a training iteration to collect data, such as the observations made, actions taken, rewards received, and done flags (see preceding figure). It then converts this episode data into a train batch for model updating. The goal of these model updates is to change the agents' behaviors such that they lead to the maximum sum of received rewards over the agents' lifetimes.
RLModules#
Tip
The following is a quick overview of RLlib RLModules. See here for a detailed description of the RLModule class.
RLModules are deep-learning framework-specific neural network containers. RLlib's EnvRunners use them for computing actions when stepping through the RL environment, and RLlib's Learners use RLModule instances for computing losses and gradients before updating them.
In a nutshell, an RLModule carries the neural network models and defines how to use them during the three phases of its RL lifecycle: exploration, for collecting training data; inference, when computing actions for evaluation or in production; and training, for computing the loss function inputs.
You can choose to use RLlib's built-in default models and configure them as needed, for example to change the number of layers or the activation functions, or write your own custom models in PyTorch, which allows you to implement any architecture and computation logic.
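For example, a minimal sketch of configuring the built-in default model; the layer sizes and activation below are arbitrary example values:
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.core.rl_module.default_model_config import DefaultModelConfig

# Tweak the default model: two hidden layers of 256 units with ReLU activations.
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .rl_module(
        model_config=DefaultModelConfig(
            fcnet_hiddens=[256, 256],
            fcnet_activation="relu",
        )
    )
)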
Each EnvRunner actor, managed by the EnvRunnerGroup of the Algorithm, has a copy of the user's RLModule. Each Learner actor, managed by the LearnerGroup of the Algorithm, also has an RLModule copy.
The EnvRunner copy is normally in its inference_only version, meaning that components not required for bare action computation, for example a value function estimate, are omitted to save memory.
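You can also use an RLModule directly, outside of EnvRunner and Learner actors, to compute actions. The following is a sketch only; it assumes `algo` is the PPO Algorithm built in the CartPole example above and that the module's inference output contains the "action_dist_inputs" (logits) key, which holds for the default PPO module on a Discrete action space:
import numpy as np
import torch

# Get the RLModule held by the Algorithm and compute a greedy action
# for a (made-up) batch of one CartPole observation.
rl_module = algo.get_module("default_policy")
dummy_obs = np.array([[0.0, 0.0, 0.0, 0.0]], dtype=np.float32)
out = rl_module.forward_inference({"obs": torch.from_numpy(dummy_obs)})
action = int(torch.argmax(out["action_dist_inputs"], dim=-1)[0])
print(action)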
Episodes#
Tip
The following is a quick overview of RLlib Episodes. See here for a detailed description of the Episode classes.
RLlib sends all training data around in the form of Episodes.
The SingleAgentEpisode class describes single-agent trajectories. The MultiAgentEpisode class contains several such single-agent episodes and describes the stepping times and patterns of the individual agents with respect to each other.
Both Episode classes store the entire trajectory data generated while stepping through an RL environment. This data includes the observations, info dicts, actions, rewards, termination signals, and any model computations along the way, like recurrent states, action logits, or action log probabilities.
Tip
See here for RLlib’s standardized column names.
Note that episodes conveniently don't have to store any next obs information because it always overlaps with the information under obs. This design saves almost 50% of memory, because observations are often the largest piece in a trajectory. The same is true for the state_in and state_out information of stateful networks: RLlib only keeps the state_out key in the episodes.
Typically, RLlib generates episode chunks of size config.rollout_fragment_length through the EnvRunner actors in the Algorithm's EnvRunnerGroup, and sends as many episode chunks to each Learner actor as required to build one training batch of exactly the size config.train_batch_size_per_learner.
A typical SingleAgentEpisode
object roughly looks as follows:
# A SingleAgentEpisode of length 20 has roughly the following schematic structure.
# Note that after these 20 steps, you have 20 actions and rewards, but 21 observations and info dicts
# due to the initial "reset" observation/infos.
episode = {
    'obs': np.ndarray((21, 4), dtype=float32),  # 21 due to additional reset obs
    'infos': [{}, {}, {}, {}, .., {}, {}],  # infos are always lists of dicts
    'actions': np.ndarray((20,), dtype=int64),  # Discrete(4) action space
    'rewards': np.ndarray((20,), dtype=float32),
    'extra_model_outputs': {
        'action_dist_inputs': np.ndarray((20, 4), dtype=float32),  # Discrete(4) action space
    },
    'is_terminated': False,  # <- single bool
    'is_truncated': True,  # <- single bool
}
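Rather than indexing these raw fields yourself, you can access the collected data through the Episode getter APIs. A small sketch, assuming `episode` is a SingleAgentEpisode like the schematic one above; the getter names follow the new API stack:
# Access the episode's data through its getter methods.
print(len(episode))                      # number of env steps taken (20)
print(episode.get_return())              # sum of all received rewards
last_obs = episode.get_observations(-1)  # the most recent observation
all_rewards = episode.get_rewards()      # all 20 rewards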
For complex observations, for example gym.spaces.Dict, the episode holds all observations in a struct entirely analogous to the observation space, with NumPy arrays at the leaves of that dict. For example:
episode_w_complex_observations = {
    'obs': {
        "camera": np.ndarray((21, 64, 64, 3), dtype=float32),  # RGB images
        "sensors": {
            "front": np.ndarray((21, 15), dtype=float32),  # 1D tensors
            "rear": np.ndarray((21, 5), dtype=float32),  # another batch of 1D tensors
        },
    },
    ...
Because RLlib keeps all values in NumPy arrays, this allows for efficient encoding and transmission across the network.
In multi-agent mode, the EnvRunnerGroup produces MultiAgentEpisode instances.
Note
The Ray team is working on a detailed description of the MultiAgentEpisode class.
EnvRunner: Combining RL environment and RLModule#
Given the RL environment and an RLModule, an EnvRunner produces lists of Episodes. It does so by executing a classic environment interaction loop. Efficient sample collection can be burdensome to get right, especially when leveraging environment vectorization, stateful recurrent neural networks, or when operating in a multi-agent setting.
RLlib provides two built-in EnvRunner classes, SingleAgentEnvRunner and MultiAgentEnvRunner, that automatically handle these complexities. RLlib picks the correct type based on your configuration, in particular the config.environment() and config.multi_agent() settings.
Tip
Call the is_multi_agent() method to find out whether your config is multi-agent or not.
RLlib bundles several EnvRunner actors through the EnvRunnerGroup API. You can also use an EnvRunner standalone to produce lists of Episodes by calling its sample() method.
Here is an example of creating a set of remote EnvRunner actors and using them to gather experiences in parallel:
import tree  # pip install dm_tree
import ray
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.env.single_agent_env_runner import SingleAgentEnvRunner

# Configure the EnvRunners.
config = (
    PPOConfig()
    .environment("Acrobot-v1")
    .env_runners(num_env_runners=2, num_envs_per_env_runner=1)
)

# Create the EnvRunner actors.
env_runners = [
    ray.remote(SingleAgentEnvRunner).remote(config=config)
    for _ in range(config.num_env_runners)
]

# Gather lists of `SingleAgentEpisode`s (each EnvRunner actor returns one
# such list with exactly three episodes in it).
episodes = ray.get([
    er.sample.remote(num_episodes=3)
    for er in env_runners
])

# Two remote EnvRunners used.
assert len(episodes) == 2
# Each EnvRunner returns three episodes.
assert all(len(eps_list) == 3 for eps_list in episodes)

# Report the returns of all episodes collected.
for episode in tree.flatten(episodes):
    print("R=", episode.get_return())
Learner: Combining RLModule, loss function and optimizer#
Tip
The following is a quick overview of RLlib Learners. See here for a detailed description of the Learner class.
Given the RLModule and one or more optimizers and loss functions, a Learner computes losses and gradients, then updates the RLModule.
The input data for such an update step comes in as a list of episodes, which either the Learner's own connector pipeline or an external one converts into the final train batch.
Note
ConnectorV2 documentation is work in progress. The Ray team will link to the correct documentation page here once it has completed this work.
Learner instances are algorithm-specific, mostly due to the various loss functions used by different RL algorithms. RLlib always bundles several Learner actors through the LearnerGroup API, automatically applying distributed data parallelism (DDP) on the training data. You can also use a Learner standalone to update your RLModule with a list of Episodes.
Here is an example of creating a remote Learner actor and calling its update_from_episodes() method:
import gymnasium as gym
import ray
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.core.rl_module.default_model_config import DefaultModelConfig

# Configure the Learner.
config = (
    PPOConfig()
    .environment("Acrobot-v1")
    .training(lr=0.0001)
    .rl_module(model_config=DefaultModelConfig(fcnet_hiddens=[64, 32]))
)

# Get the Learner class.
ppo_learner_class = config.get_default_learner_class()

# Create the Learner actor.
learner_actor = ray.remote(ppo_learner_class).remote(
    config=config,
    module_spec=config.get_multi_rl_module_spec(env=gym.make("Acrobot-v1")),
)

# Build the Learner.
ray.get(learner_actor.build.remote())

# Perform an update from the list of episodes we got from the `EnvRunners` above.
learner_results = ray.get(learner_actor.update_from_episodes.remote(
    episodes=tree.flatten(episodes)
))
print(learner_results["default_policy"]["policy_loss"])