RLlib’s New API Stack#

Overview#

Starting in Ray 2.10, you can opt in to the alpha version of a “new API stack”, a ground-up overhaul of RLlib’s architecture, design principles, code base, and user-facing APIs. The following algorithms and setups are available.

| Feature/Algo (on new API stack) | PPO | SAC |
|---|---|---|
| Single Agent | Yes | Yes |
| Multi Agent | Yes | No |
| Fully-connected (MLP) | Yes | Yes |
| Image inputs (CNN) | Yes | No |
| RNN support (LSTM) | Yes | No |
| Complex inputs (flatten) | Yes | Yes |
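The support matrix above can also be encoded as a small lookup, for example to fail fast before building a config for an unsupported combination. The following is only a sketch: `NEW_STACK_SUPPORT` and `is_supported_on_new_stack` are hypothetical names, not part of RLlib, and the values are copied from the table as it stands for this alpha release.

```python
# Hypothetical helper (not part of RLlib): the support table above as a dict.
NEW_STACK_SUPPORT = {
    "PPO": {
        "single_agent": True,
        "multi_agent": True,
        "mlp": True,
        "cnn": True,
        "lstm": True,
        "complex_inputs": True,
    },
    "SAC": {
        "single_agent": True,
        "multi_agent": False,
        "mlp": True,
        "cnn": False,
        "lstm": False,
        "complex_inputs": True,
    },
}


def is_supported_on_new_stack(algo: str, *features: str) -> bool:
    """Return True only if `algo` supports all given features on the new stack."""
    support = NEW_STACK_SUPPORT.get(algo.upper())
    if support is None:
        # Algorithms not listed in the table are treated as unavailable.
        return False
    return all(support.get(feature, False) for feature in features)


print(is_supported_on_new_stack("PPO", "multi_agent", "lstm"))  # True
print(is_supported_on_new_stack("SAC", "multi_agent"))  # False
```

A check like this would need updating whenever the table changes, so treat it as a convenience rather than a source of truth.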

Over the next couple of months, the Ray Team will continue to test, benchmark, bug-fix, and polish these new APIs, and roll out more algorithms that you can run on either stack. The goal is to reach a state where the new stack can completely replace the old one.

Keep in mind that, due to its alpha nature, you might run into issues and instabilities when using the new stack. You can continue using your custom classes and setups on the old API stack for the foreseeable future (beyond Ray 3.0).

What is the New API Stack?#

The new API stack is the result of rewriting RLlib’s core APIs from scratch and reducing its user-facing classes from more than a dozen critical ones to only a handful. While designing these new interfaces, the Ray Team strictly applied the following principles:

  • Assume a simple mental model underlying the new APIs

  • Classes must be usable outside of RLlib

  • Separate concerns as much as possible. Try to answer: “WHAT should be done WHEN and by WHOM?”

  • Offer fine-grained modularity, full interoperability, and frictionless pluggability of classes

Applying the above principles, the Ray Team reduced the must-know classes for the average RLlib user from seven on the old stack to only four on the new stack. The core new API stack classes are:

  • RLModule

  • Learner

  • ConnectorV2

  • Episode

The AlgorithmConfig and Algorithm APIs remain as-is. These are already established APIs on the old stack.

Who should use the new API stack?#

Eventually, all users of RLlib should switch over to running experiments and developing their custom classes against the new API stack.

Right now, it’s only available for a few algorithms and setups (see table above). However, if you use PPO (single- or multi-agent) or SAC (single-agent), you should try it.

The following section lists some compelling reasons to migrate to the new stack.

Note these indicators against using it at this early stage:

  1) You’re using a custom ModelV2 class and aren’t interested right now in moving it into the new RLModule API.

  2) You’re using a custom Policy class (e.g., with a custom loss function) and aren’t interested right now in moving it into the new Learner API.

  3) You’re using custom Connector classes and aren’t interested right now in moving them into the new ConnectorV2 API.

If any of the above applies to you, don’t migrate for now, and continue running with the old API stack. Migrate to the new stack whenever you’re ready to re-write some small part of your code.

Comparison to the Old API Stack#

This table compares features and design choices between the new and old API stack:

| Feature | New API Stack | Old API Stack |
|---|---|---|
| Reduced code complexity (for beginners and advanced users) | 5 user-facing classes (AlgorithmConfig, RLModule, Learner, ConnectorV2, Episode) | 8 user-facing classes (AlgorithmConfig, ModelV2, Policy, build_policy, Connector, RolloutWorker, BaseEnv, ViewRequirement) |
| Classes are usable outside of RLlib | Yes | Partly |
| Separation-of-concerns design (e.g., during sampling, only action must be computed) | Yes | No |
| Distributed/scalable sample collection | Yes | Yes |
| Full 360° read/write access to (multi-)agent trajectories | Yes | No |
| Multi-GPU and multi-node/multi-GPU | Yes | Yes & No |
| Support for shared (multi-agent) model components (e.g., communication channels, shared value functions, etc.) | Yes | No |
| Env vectorization with gym.vector.Env | Yes | No (RLlib’s own solution) |

How to Use the New API Stack?#

The new API stack is disabled by default for all algorithms. To activate it for PPO (single- and multi-agent) or SAC (single-agent only), change the following in your AlgorithmConfig object:


from ray.rllib.algorithms.ppo import PPOConfig


config = (
    PPOConfig()
    .environment("CartPole-v1")
    # Switch both the new API stack flags to True (both False by default).
    # This enables the use of
    # a) RLModule (replaces ModelV2) and Learner (replaces Policy)
    # b) and automatically picks the correct EnvRunner (single-agent vs multi-agent)
    # and enables ConnectorV2 support.
    .api_stack(
        enable_rl_module_and_learner=True,
        enable_env_runner_and_connector_v2=True,
    )
    .resources(
        num_cpus_for_main_process=1,
    )
    # We are using a simple 1-CPU setup here for learning. However, as the new stack
    # supports arbitrary scaling on the learner axis, feel free to set
    # `num_learners` to the number of available GPUs for multi-GPU training (and
    # `num_gpus_per_learner=1`).
    .learners(
        num_learners=0,  # <- in most cases, set this value to the number of GPUs
        num_gpus_per_learner=0,  # <- set this to 1, if you have at least 1 GPU
    )
    # When using RLlib's default models (RLModules) AND the new EnvRunners, set this
    # flag in your model config. This setting will no longer be required in the near
    # future. It yields a small performance advantage because value function
    # predictions for PPO no longer happen on the sampler side (they are now fully
    # located on the learner side, which might have GPUs available).
    .training(model={"uses_new_env_runners": True})
)


from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.examples.envs.classes.multi_agent import MultiAgentCartPole


# A typical multi-agent setup (otherwise using the exact same parameters as before)
# looks like this.
config = (
    PPOConfig()
    .environment(MultiAgentCartPole, env_config={"num_agents": 2})
    # Switch both the new API stack flags to True (both False by default).
    # This enables the use of
    # a) RLModule (replaces ModelV2) and Learner (replaces Policy)
    # b) and automatically picks the correct EnvRunner (single-agent vs multi-agent)
    # and enables ConnectorV2 support.
    .api_stack(
        enable_rl_module_and_learner=True,
        enable_env_runner_and_connector_v2=True,
    )
    .resources(
        num_cpus_for_main_process=1,
    )
    # We are using a simple 1-CPU setup here for learning. However, as the new stack
    # supports arbitrary scaling on the learner axis, feel free to set
    # `num_learners` to the number of available GPUs for multi-GPU training (and
    # `num_gpus_per_learner=1`).
    .learners(
        num_learners=0,  # <- in most cases, set this value to the number of GPUs
        num_gpus_per_learner=0,  # <- set this to 1, if you have at least 1 GPU
    )
    # When using RLlib's default models (RLModules) AND the new EnvRunners, set this
    # flag in your model config. This setting will no longer be required in the near
    # future. It yields a small performance advantage because value function
    # predictions for PPO no longer happen on the sampler side (they are now fully
    # located on the learner side, which might have GPUs available).
    .training(model={"uses_new_env_runners": True})
    # Because you are in a multi-agent env, you have to set up the usual multi-agent
    # parameters:
    .multi_agent(
        policies={"p0", "p1"},
        # Map agent 0 to p0 and agent 1 to p1.
        policy_mapping_fn=lambda agent_id, episode, **kwargs: f"p{agent_id}",
    )
)
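
To make the mapping in `.multi_agent(...)` concrete, here is the same lambda evaluated outside of RLlib. This sketch assumes, as the comment in the config states, that the environment's agent IDs are the integers 0 and 1. The `episode` argument is unused by this particular function, so `None` stands in for it.

```python
# Same mapping function as in the config above, shown standalone.
policy_mapping_fn = lambda agent_id, episode, **kwargs: f"p{agent_id}"

# Assumption: the env uses integer agent IDs 0 and 1 (num_agents=2).
policies = {"p0", "p1"}

for agent_id in (0, 1):
    policy_id = policy_mapping_fn(agent_id, None)
    # Every agent must map to one of the configured policy IDs.
    assert policy_id in policies
    print(agent_id, "->", policy_id)
```

If the environment used different agent IDs (for example strings), the f-string would produce policy IDs that aren't in `policies`, so the mapping function and the `policies` set have to be changed together.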


from ray.rllib.algorithms.sac import SACConfig


config = (
    SACConfig()
    .environment("Pendulum-v1")
    # Switch both the new API stack flags to True (both False by default).
    # This enables the use of
    # a) RLModule (replaces ModelV2) and Learner (replaces Policy)
    # b) and automatically picks the correct EnvRunner (single-agent vs multi-agent)
    # and enables ConnectorV2 support.
    .api_stack(
        enable_rl_module_and_learner=True,
        enable_env_runner_and_connector_v2=True,
    )
    .resources(
        num_cpus_for_main_process=1,
    )
    # We are using a simple 1-CPU setup here for learning. However, as the new stack
    # supports arbitrary scaling on the learner axis, feel free to set
    # `num_learners` to the number of available GPUs for multi-GPU training (and
    # `num_gpus_per_learner=1`).
    .learners(
        num_learners=0,  # <- in most cases, set this value to the number of GPUs
        num_gpus_per_learner=0,  # <- set this to 1, if you have at least 1 GPU
    )
    # When using RLlib's default models (RLModules) AND the new EnvRunners, set this
    # flag in your model config. This setting will no longer be required in the near
    # future. It yields a small performance advantage because value function
    # predictions for PPO no longer happen on the sampler side (they are now fully
    # located on the learner side, which might have GPUs available).
    .training(
        model={"uses_new_env_runners": True},
        replay_buffer_config={"type": "EpisodeReplayBuffer"},
    )
)
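
The comments on `.learners(...)` in the configs above suggest setting `num_learners` to the number of GPUs and `num_gpus_per_learner` to 1 when at least one GPU is present. A minimal sketch of that rule follows, taking the comments literally. `learner_settings` is a hypothetical helper, not an RLlib function, and the GPU count is passed in by the caller rather than detected.

```python
def learner_settings(num_gpus: int) -> dict:
    """Hypothetical helper: derive `.learners()` kwargs from a GPU count."""
    if num_gpus < 0:
        raise ValueError("num_gpus must be >= 0")
    if num_gpus == 0:
        # CPU-only setup, matching the values used in the configs above.
        return {"num_learners": 0, "num_gpus_per_learner": 0}
    # One learner per GPU, each using exactly one GPU.
    return {"num_learners": num_gpus, "num_gpus_per_learner": 1}


print(learner_settings(0))  # {'num_learners': 0, 'num_gpus_per_learner': 0}
print(learner_settings(4))  # {'num_learners': 4, 'num_gpus_per_learner': 1}
```

The result could then be unpacked into a config, for example `config.learners(**learner_settings(2))`.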