RLlib: Industry-Grade, Scalable Reinforcement Learning#

Note

Ray 2.10.0 introduces the alpha stage of RLlib’s “new API stack”. The team is currently transitioning algorithms, example scripts, and documentation to the new code base throughout the subsequent minor releases leading up to Ray 3.0.

See here for more details on how to activate and use the new API stack.
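
For reference, here's a rough sketch of what opting a config into the new API stack looks like in code; the exact flags and their defaults may change across releases as the transition progresses, so see the linked page for authoritative details:

from ray.rllib.algorithms.ppo import PPOConfig

# Opt this config into the new API stack (RLModule/Learner plus
# EnvRunner/ConnectorV2 APIs).
config = PPOConfig().api_stack(
    enable_rl_module_and_learner=True,
    enable_env_runner_and_connector_v2=True,
)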

RLlib is an open source library for reinforcement learning (RL), offering support for production-level, highly scalable, and fault-tolerant RL workloads, while maintaining simple and unified APIs for a large variety of industry applications.

Whether training policies in a multi-agent setup, from historical offline data, or with externally connected simulators, RLlib offers simple solutions for each of these autonomous decision-making needs and enables you to start running your experiments within hours.

RLlib is used in production by industry leaders in many different verticals, such as gaming, robotics, finance, climate- and industrial control, manufacturing and logistics, automobile, and boat design.

RLlib in 60 seconds#

It only takes a few steps to get your first RLlib workload up and running on your laptop. Install RLlib and PyTorch, as shown below:

pip install "ray[rllib]" torch

Note

For installation on computers running Apple Silicon (such as M1), follow instructions here.

Note

To be able to run our Atari or MuJoCo examples, you also need to run: pip install "gymnasium[atari,accept-rom-license,mujoco]".

That's all! You can now start coding against RLlib. Here is an example of running the PPO Algorithm on the Taxi domain. You first create a config for the algorithm, which defines the RL environment (Taxi-v3) and any other needed settings and parameters.

Next, build the algorithm and train it for a total of 5 iterations. One training iteration includes parallel (distributed) sample collection by the EnvRunner actors, followed by loss calculation on the collected data, and a model update step.

At the end of your script, the trained Algorithm is evaluated:

from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.connectors.env_to_module import FlattenObservations

# 1. Configure the algorithm,
config = (
    PPOConfig()
    .environment("Taxi-v3")
    .env_runners(
        num_env_runners=2,
        # Observations are discrete (ints) -> We need to flatten (one-hot) them.
        env_to_module_connector=lambda env: FlattenObservations(),
    )
    .evaluation(evaluation_num_env_runners=1)
)
# 2. build the algorithm ..
algo = config.build()
# 3. .. train it ..
for _ in range(5):
    print(algo.train())
# 4. .. and evaluate it.
algo.evaluate()

You can use any Farama Foundation Gymnasium-registered environment with the env argument.

In config.env_runners(), you can specify, among many other things, the number of parallel EnvRunner actors that collect samples from the environment.

You can also tweak the NN architecture by adjusting RLlib's DefaultModelConfig (see the sketch below), and set up a separate config for the evaluation EnvRunner actors through the config.evaluation() method.
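
For example, here's a minimal sketch of changing the default network's hidden-layer sizes; it assumes the new API stack, and the exact import path of DefaultModelConfig may differ between Ray versions:

from ray.rllib.core.rl_module.default_model_config import DefaultModelConfig

# Use two hidden layers of 64 units each instead of the defaults.
config.rl_module(
    model_config=DefaultModelConfig(fcnet_hiddens=[64, 64]),
)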

See here if you want to learn more about the RLlib training APIs, and see here for a simple example of how to write an action-inference loop after training.
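
For a quick idea of what such an inference loop can look like, here's a minimal sketch continuing the Taxi example above. It assumes the new API stack; exact method and key names, such as get_module and action_dist_inputs, may vary between Ray versions:

import gymnasium as gym
import numpy as np
import torch

env = gym.make("Taxi-v3")
# Get the trained RLModule (the neural network) from the Algorithm.
rl_module = algo.get_module("default_policy")

obs, info = env.reset()
done = False
total_reward = 0.0
while not done:
    # Taxi observations are ints; one-hot them, just like the
    # FlattenObservations connector did during training.
    one_hot = np.zeros(env.observation_space.n, dtype=np.float32)
    one_hot[obs] = 1.0
    # Forward pass; the output holds the action-distribution logits.
    out = rl_module.forward_inference(
        {"obs": torch.from_numpy(one_hot).unsqueeze(0)}
    )
    # Act greedily by taking the argmax over the logits.
    action = int(torch.argmax(out["action_dist_inputs"][0]))
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
    total_reward += reward
print(f"Episode return: {total_reward}")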

If you want to get a quick preview of which algorithms and environments RLlib supports, click the dropdowns below:

RLlib Algorithms
RLlib Environments

Farama-Foundation Environments

gymnasium single_agent

pip install "gymnasium[atari,accept-rom-license,mujoco]"``
config.environment("CartPole-v1")  # Classic Control
config.environment("ale_py:ALE/Pong-v5")  # Atari
config.environment("Hopper-v5")  # MuJoCo

PettingZoo multi_agent

pip install "pettingzoo[all]"
from ray.tune.registry import register_env
from ray.rllib.env.wrappers.pettingzoo_env import PettingZooEnv
from pettingzoo.sisl import waterworld_v4
register_env("env", lambda _: PettingZooEnv(waterworld_v4.env()))
config.environment("env")

RLlib Multi-Agent

RLlib’s MultiAgentEnv API multi_agent

from ray.rllib.examples.envs.classes.multi_agent import MultiAgentCartPole
from ray import tune
tune.register_env("env", lambda cfg: MultiAgentCartPole(cfg))
config.environment("env", env_config={"num_agents": 2})
config.multi_agent(
    policies={"p0", "p1"},
    policy_mapping_fn=lambda aid, *a, **kw: f"p{aid}",
)

Why choose RLlib?#

Scalable and Fault-Tolerant

RLlib workloads scale along various axes (see the sketch after this list):

  • The number of EnvRunner actors to use. This is configurable through config.env_runners(num_env_runners=...) and allows you to scale the speed of your (simulator) data collection step. This EnvRunner axis is fully fault-tolerant, meaning you can train against custom environments that are unstable or frequently stall execution, and you can even place all your EnvRunner actors on spot machines.

  • The number of Learner actors to use for multi-GPU training. This is configurable through config.learners(num_learners=...), which you normally set to the number of GPUs available (make sure you then also set config.learners(num_gpus_per_learner=1)). If you don't have GPUs, you can use this setting for DDP-style learning on CPUs instead.
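
For example, here's a sketch of a config scaled along both axes; the resource numbers are purely illustrative:

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    # Collect samples with 4 parallel, fault-tolerant EnvRunner actors.
    .env_runners(num_env_runners=4)
    # Train with 2 Learner actors, each using one GPU (DDP-style).
    .learners(num_learners=2, num_gpus_per_learner=1)
)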

Multi-Agent Reinforcement Learning (MARL)

RLlib natively supports multi-agent reinforcement learning (MARL), allowing you to run any of the following complex setups.

  • Independent multi-agent learning (the default): Every agent collects data for updating its own policy network, interpreting other agents as part of the environment.

  • Collaborative training: Train a team of agents that either all share the same policy (shared parameters; see the sketch after this list) or in which some agents have their own policy networks. You can also share value functions between all or some members of the team, as you see fit, allowing global versus local objectives to be optimized.

  • Adversarial training: Have agents play against other agents in competitive environments. Use self-play or league-based self-play to train your agents through stages of ever-increasing difficulty.

  • Any combination of the above! Yes, you can train arbitrarily sized teams of agents playing against other teams, where the agents in each team may have individual sub-objectives, and where groups of neutral agents don't participate in any competition.
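
As a minimal sketch of the collaborative, shared-parameters case, you can map every agent to one shared policy so the whole team trains a single, common network. This continues the two-agent MultiAgentCartPole example from the dropdown above:

config.multi_agent(
    # One policy, shared by all agents in the team.
    policies={"shared"},
    # Map every agent ID to that one shared policy.
    policy_mapping_fn=lambda agent_id, episode, **kwargs: "shared",
)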

Offline RL and Behavior Cloning

Ray Data is integrated into RLlib, enabling large-scale data ingestion for offline RL and behavior cloning (BC) workloads.

See here for a basic tuned example of the behavior cloning algorithm, and here for how to pre-train a policy with BC and then fine-tune it with online PPO.
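
As a rough sketch of what such a BC setup can look like (the data path below is only a placeholder for your own pre-recorded episode files):

from ray.rllib.algorithms.bc import BCConfig

config = (
    BCConfig()
    .environment("CartPole-v1")
    # Read pre-recorded episodes from offline files through Ray Data.
    .offline_data(input_=["local:///tmp/cartpole-offline-data"])
)
algo = config.build()
algo.train()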

Support for External Env Clients

RLlib supports externally connecting RL environments by customizing the EnvRunner logic: instead of stepping RLlib-owned, internal gymnasium envs, the EnvRunners communicate with external, TCP-connected envs that act independently and may even perform their own action inference, for example through ONNX.

See here for an example of RLlib acting as a server, with external env TCP clients connecting to it.

Learn More#

RLlib Environments

Get started with environments supported by RLlib, such as the Farama Foundation's Gymnasium, PettingZoo, and many custom formats for vectorized and multi-agent environments.

RLlib Key Concepts

Learn more about the core concepts of RLlib, such as environments, algorithms and policies.

RLlib Algorithms

See the many available RL algorithms of RLlib for model-free and model-based RL, on-policy and off-policy training, multi-agent RL, and more.

Customizing RLlib#

RLlib provides powerful, yet easy-to-use APIs for customizing all aspects of your experimental and production training workflows. For example, you may code your own environments in Python using the Farama Foundation's gymnasium or DeepMind's OpenSpiel, provide custom PyTorch models, write your own optimizer setups and loss definitions, or define custom exploratory behavior.
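
For instance, here's a minimal sketch of a custom gymnasium environment that you can then train against with RLlib; the SimpleCorridor class and its dynamics are made up purely for illustration:

import gymnasium as gym
import numpy as np
from ray.tune.registry import register_env

class SimpleCorridor(gym.Env):
    """Walk right along a corridor until reaching the end."""

    def __init__(self, config=None):
        self.length = (config or {}).get("length", 10)
        self.position = 0
        self.action_space = gym.spaces.Discrete(2)  # 0: left, 1: right
        self.observation_space = gym.spaces.Box(
            0.0, self.length, (1,), np.float32
        )

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.position = 0
        return np.array([self.position], np.float32), {}

    def step(self, action):
        self.position = max(0, self.position + (1 if action == 1 else -1))
        terminated = self.position >= self.length
        reward = 1.0 if terminated else -0.1
        return (
            np.array([self.position], np.float32),
            reward,
            terminated,
            False,
            {},
        )

# Register the env under a name and point your config to it.
register_env("corridor", lambda cfg: SimpleCorridor(cfg))
config.environment("corridor", env_config={"length": 10})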

Figure: RLlib's API stack. Built on top of Ray, RLlib offers off-the-shelf, distributed and fault-tolerant algorithms and loss functions, PyTorch default models, multi-GPU training, and multi-agent support. You realize custom behavior by subclassing the existing abstractions and overriding certain methods in those subclasses.#