RLlib’s New API Stack#
Overview#
Starting in Ray 2.10, you can opt in to the alpha version of a “new API stack”, a fundamental overhaul from the ground up with respect to architecture, design principles, code base, and user-facing APIs. The following table shows which algorithms and setups are available so far.
| Feature/Algo (on new API stack) | PPO | SAC |
|---|---|---|
| Single Agent | Yes | Yes |
| Multi Agent | Yes | No |
| Fully-connected (MLP) | Yes | Yes |
| Image inputs (CNN) | Yes | No |
| RNN support (LSTM) | Yes | No |
| Complex inputs (flatten) | Yes | Yes |
Over the next couple of months, the Ray Team will continue to test, benchmark, bug-fix, and further polish these new APIs, and will roll out more algorithms that you can run on either stack. The goal is to reach a state where the new stack can completely replace the old one.
Keep in mind that due to its alpha nature, you might run into issues and encounter instabilities when using the new stack. Also, rest assured that you can continue using your custom classes and setups on the old API stack for the foreseeable future (beyond Ray 3.0).
What is the New API Stack?#
The new API stack is the result of re-writing from scratch RLlib’s core APIs and reducing its user-facing classes from more than a dozen critical ones down to only a handful of classes. During the design of these new interfaces from the ground up, the Ray Team strictly applied the following principles:
- Assume a simple mental model underlying the new APIs.
- Classes must be usable outside of RLlib (see the episode sketch further below).
- Separate concerns as much as possible. Try to answer: “WHAT should be done WHEN and by WHOM?”
- Offer fine-grained modularity, full interoperability, and frictionless pluggability of classes.
Applying the above principles, the Ray Team reduced the important must-know classes for the average RLlib user from seven on the old stack, to only four on the new stack. The core new API stack classes are:
- `RLModule` (replaces `ModelV2` and some of `Policy`)
- `Learner` (replaces `RolloutWorker` and some of `Policy`)
- `SingleAgentEpisode` and `MultiAgentEpisode` (replace `ViewRequirement`, `SampleCollector`, `Episode`, and `EpisodeV2`)
- `ConnectorV2` (replaces `Connector` and some of `RolloutWorker` and `Policy`)
The `AlgorithmConfig` and `Algorithm` APIs remain as-is. These are already established APIs on the old stack.
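To illustrate the “usable outside of RLlib” principle, here is a minimal sketch that fills a `SingleAgentEpisode` directly from a plain `gymnasium` loop, without any other RLlib machinery. The `add_env_reset`, `add_env_step`, and `get_return` calls are assumptions based on the Ray 2.10 episode API, not something this page prescribes.

```python
# Minimal sketch (assumes the SingleAgentEpisode methods named above).
import gymnasium as gym

from ray.rllib.env.single_agent_episode import SingleAgentEpisode

env = gym.make("CartPole-v1")
episode = SingleAgentEpisode()

# Record the reset observation.
obs, info = env.reset()
episode.add_env_reset(observation=obs, infos=info)

# Step the env with random actions and log every transition into the episode.
terminated = truncated = False
while not (terminated or truncated):
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    episode.add_env_step(
        observation=obs,
        action=action,
        reward=reward,
        terminated=terminated,
        truncated=truncated,
        infos=info,
    )

# The episode object now provides read/write access to the full trajectory.
print(len(episode), episode.get_return())
```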
Who should use the new API stack?#
Eventually, all users of RLlib should switch over to running experiments and developing their custom classes against the new API stack.
Right now, it’s only available for a few algorithms and setups (see the table above). However, if you use PPO (single- or multi-agent) or SAC (single-agent), you should give it a try.
The following section lists some compelling reasons to migrate to the new stack.
Note these indicators against using it at this early stage:
1) You’re using a custom `ModelV2` class and aren’t interested right now in moving it into the new `RLModule` API.
2) You’re using a custom `Policy` class (e.g., with a custom loss function) and aren’t interested right now in moving it into the new `Learner` API.
3) You’re using custom `Connector` classes and aren’t interested right now in moving them into the new `ConnectorV2` API.
If any of the above applies to you, don’t migrate for now and continue running with the old API stack. Migrate to the new stack whenever you’re ready to rewrite some small part of your code.
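For explicitness, here is a minimal sketch of staying on the old stack: both new-API-stack flags default to `False`, so leaving out the `api_stack()` call entirely (or setting both flags to `False`, as shown here) keeps your experiment on the old stack.

```python
from ray.rllib.algorithms.ppo import PPOConfig

# Both flags are False by default; setting them explicitly just documents
# the decision to stay on the old API stack for now.
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .api_stack(
        enable_rl_module_and_learner=False,
        enable_env_runner_and_connector_v2=False,
    )
)
```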
Comparison to the Old API Stack#
The following table compares features and design choices between the new and old API stacks:
| | New API Stack | Old API Stack |
|---|---|---|
| Reduced code complexity (for beginners and advanced users) | 5 user-facing classes | 8 user-facing classes |
| Classes are usable outside of RLlib | Yes | Partly |
| Separation-of-concerns design (e.g., during sampling, only the action must be computed) | Yes | No |
| Distributed/scalable sample collection | Yes | Yes |
| Full 360° read/write access to (multi-)agent trajectories | Yes | No |
| Multi-GPU and multi-node/multi-GPU | Yes | Yes & No |
| Support for shared (multi-agent) model components (e.g., communication channels, shared value functions, etc.) | Yes | No |
| Env vectorization with `gym.vector.Env` | Yes | No (RLlib’s own solution) |
How to Use the New API Stack?#
The new API stack is disabled by default for all algorithms. To activate it for PPO (single- and multi-agent) or SAC (single-agent only), change the following in your `AlgorithmConfig` object:
```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    # Switch both new API stack flags to True (both False by default). This enables:
    # a) the use of RLModule (replaces ModelV2) and Learner (replaces Policy), and
    # b) the correct EnvRunner (single-agent vs. multi-agent) to be picked
    #    automatically, together with ConnectorV2 support.
    .api_stack(
        enable_rl_module_and_learner=True,
        enable_env_runner_and_connector_v2=True,
    )
    .resources(
        num_cpus_for_main_process=1,
    )
    # This is a simple 1-CPU setup for learning. However, because the new stack
    # supports arbitrary scaling on the learner axis, feel free to set
    # `num_learners` to the number of available GPUs for multi-GPU training
    # (and `num_gpus_per_learner=1`).
    .learners(
        num_learners=0,  # <- in most cases, set this to the number of GPUs
        num_gpus_per_learner=0,  # <- set this to 1, if you have at least 1 GPU
    )
    # When using RLlib's default models (RLModules) AND the new EnvRunners, set
    # this flag in your model config. Having to set it won't be required in the
    # near future. It yields a small performance advantage because value function
    # predictions for PPO no longer need to happen on the sampler side (they are
    # now fully located on the learner side, which might have GPUs available).
    .training(model={"uses_new_env_runners": True})
)
```
```python
from ray.rllib.algorithms.ppo import PPOConfig  # noqa
from ray.rllib.examples.envs.classes.multi_agent import MultiAgentCartPole  # noqa

# A typical multi-agent setup (otherwise using the exact same parameters as
# before) looks like this.
config = (
    PPOConfig()
    .environment(MultiAgentCartPole, env_config={"num_agents": 2})
    # Switch both new API stack flags to True (both False by default). This enables:
    # a) the use of RLModule (replaces ModelV2) and Learner (replaces Policy), and
    # b) the correct EnvRunner (single-agent vs. multi-agent) to be picked
    #    automatically, together with ConnectorV2 support.
    .api_stack(
        enable_rl_module_and_learner=True,
        enable_env_runner_and_connector_v2=True,
    )
    .resources(
        num_cpus_for_main_process=1,
    )
    # This is a simple 1-CPU setup for learning. However, because the new stack
    # supports arbitrary scaling on the learner axis, feel free to set
    # `num_learners` to the number of available GPUs for multi-GPU training
    # (and `num_gpus_per_learner=1`).
    .learners(
        num_learners=0,  # <- in most cases, set this to the number of GPUs
        num_gpus_per_learner=0,  # <- set this to 1, if you have at least 1 GPU
    )
    # When using RLlib's default models (RLModules) AND the new EnvRunners, set
    # this flag in your model config. Having to set it won't be required in the
    # near future. It yields a small performance advantage because value function
    # predictions for PPO no longer need to happen on the sampler side (they are
    # now fully located on the learner side, which might have GPUs available).
    .training(model={"uses_new_env_runners": True})
    # Because you are in a multi-agent env, you have to set up the usual
    # multi-agent parameters:
    .multi_agent(
        policies={"p0", "p1"},
        # Map agent 0 to p0 and agent 1 to p1.
        policy_mapping_fn=lambda agent_id, episode, **kwargs: f"p{agent_id}",
    )
)
```
```python
from ray.rllib.algorithms.sac import SACConfig  # noqa

config = (
    SACConfig()
    .environment("Pendulum-v1")
    # Switch both new API stack flags to True (both False by default). This enables:
    # a) the use of RLModule (replaces ModelV2) and Learner (replaces Policy), and
    # b) the correct EnvRunner (single-agent vs. multi-agent) to be picked
    #    automatically, together with ConnectorV2 support.
    .api_stack(
        enable_rl_module_and_learner=True,
        enable_env_runner_and_connector_v2=True,
    )
    .resources(
        num_cpus_for_main_process=1,
    )
    # This is a simple 1-CPU setup for learning. However, because the new stack
    # supports arbitrary scaling on the learner axis, feel free to set
    # `num_learners` to the number of available GPUs for multi-GPU training
    # (and `num_gpus_per_learner=1`).
    .learners(
        num_learners=0,  # <- in most cases, set this to the number of GPUs
        num_gpus_per_learner=0,  # <- set this to 1, if you have at least 1 GPU
    )
    # When using RLlib's default models (RLModules) AND the new EnvRunners, set
    # this flag in your model config. Having to set it won't be required in the
    # near future. It yields a small performance advantage because value function
    # predictions no longer need to happen on the sampler side (they are now
    # fully located on the learner side, which might have GPUs available).
    .training(
        model={"uses_new_env_runners": True},
        replay_buffer_config={"type": "EpisodeReplayBuffer"},
    )
)
```
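Regardless of which of the configs above you choose, building and training then works the same way as on the old API stack. A minimal sketch:

```python
# Build the Algorithm from the config and run a few training iterations.
algo = config.build()

for i in range(3):
    result = algo.train()  # returns a dict of sampling/training metrics
    print(f"Finished training iteration {i}.")

# Release the algorithm's actors and other resources.
algo.stop()
```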