Note
Ray 2.10.0 introduces the alpha stage of RLlib’s “new API stack”. The team is currently transitioning algorithms, example scripts, and documentation to the new code base throughout the subsequent minor releases leading up to Ray 3.0.
See here for more details on how to activate and use the new API stack.
Examples#
This page contains an index of all the Python scripts in the examples folder of RLlib, demonstrating the different use cases and features of the library.
Note
RLlib is currently in a transition state from the old to the new API stack.
Some example scripts haven't been translated to the new stack yet and are tagged with the comment line # @OldAPIStack at the top.
Moving all example scripts over to the new stack is work in progress.
Note
If any (new API stack) example is broken, or if you'd like to add an example to this page, feel free to raise an issue on RLlib's GitHub repository.
Folder Structure#
The examples folder is structured into several sub-directories, all of which are described in detail below.
How to run an example script#
Most of the example scripts are self-executable, meaning you can cd into the respective directory and run the script as-is with python:
$ cd ray/rllib/examples/multi_agent
$ python multi_agent_pendulum.py --enable-new-api-stack --num-agents=2
Use the --help command line argument to have each script print out its supported command line options.
Most of the scripts share a common subset of generally applicable command line arguments, for example --num-env-runners (to scale the number of EnvRunner actors), --no-tune (to switch off running with Ray Tune), --wandb-key (to log to W&B), or --verbose (to control log chattiness).
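For example, to print all supported options of the multi-agent Pendulum script shown above:
$ python multi_agent_pendulum.py --help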
All example sub-folders#
Actions#
- Nested Action Spaces:
Sets up an environment with nested action spaces using custom (single- or multi-agent) configurations. This example demonstrates how RLlib manages complex action structures, such as multi-dimensional or hierarchical action spaces.
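As an aside, nested action spaces in gymnasium are built by composing Dict, Tuple, and primitive spaces. The following minimal sketch is illustrative only; the concrete nesting is an arbitrary assumption, not the layout used by the example script:
from gymnasium import spaces

# Hypothetical nested action space: a Dict holding a Box, a Discrete, and a Tuple.
nested_action_space = spaces.Dict({
    "steering_and_throttle": spaces.Box(low=-1.0, high=1.0, shape=(2,)),
    "gear": spaces.Discrete(3),
    "signals": spaces.Tuple((spaces.Discrete(2), spaces.Discrete(2))),
})

# Sampling returns a correspondingly nested structure (a dict of arrays, ints, and tuples).
sample = nested_action_space.sample()
print(sample)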
Checkpoints#
- Checkpoint by Custom Criteria:
Shows how to create checkpoints based on custom criteria, giving users control over when to save model snapshots during training.
- Continue Training From Checkpoint:
Illustrates resuming training from a saved checkpoint, useful for extending training sessions or recovering from interruptions (see the minimal sketch after this list).
- Restore 1 (out of N) Agents from Checkpoint:
Restores one specific agent from a multi-agent checkpoint, allowing selective loading for environments where only certain agents need to resume training.
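The basic checkpointing workflow behind these examples looks roughly like the following minimal sketch (not taken from the example scripts; the environment and algorithm choices are arbitrary, and the exact return type of save() varies across Ray versions):
from ray.rllib.algorithms.algorithm import Algorithm
from ray.rllib.algorithms.ppo import PPOConfig

# Build and briefly train an algorithm, then write a checkpoint.
algo = PPOConfig().environment("CartPole-v1").build()
algo.train()
save_result = algo.save()
# Recent Ray versions return a result object wrapping the checkpoint;
# older versions return the checkpoint path directly.
checkpoint_path = save_result.checkpoint.path
algo.stop()

# Restore the full algorithm state from the checkpoint and continue training.
restored_algo = Algorithm.from_checkpoint(checkpoint_path)
restored_algo.train()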
Connectors#
Note
RLlib's Connector API has been re-written from scratch for the new API stack.
Connector pieces and pipelines are now referred to as ConnectorV2 (as opposed to Connector, which continues to work only on the old API stack).
- Flatten and One-Hot Observations:
Demonstrates how to one-hot discrete observation spaces and/or flatten complex observations (Dict or Tuple), allowing RLlib to process arbitrary observation data as flattened (1D) vectors. Useful for environments with complex, discrete, or hierarchical observations. A minimal configuration sketch follows this list.
- Observation Frame-Stacking:
Implements frame stacking, where consecutive frames are stacked together to provide temporal context to the agent. This technique is common in environments with continuous state changes, like video frames in Atari games. Performing frame stacking through connectors is more efficient, because it avoids sending large, stacked observation tensors through the network between Ray actors.
- Mean/Std Filtering:
Adds mean and standard deviation normalization for observations (shift by the mean and divide by std-dev), improving learning stability by scaling observations to a normalized range. This can enhance performance in environments with highly variable state magnitudes.
- Prev-Actions, Prev-Rewards Connector:
Augments observations with previous actions and rewards, giving the agent a short-term memory of past events, which can improve decision-making in partially observable or sequentially dependent tasks.
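The following configuration sketch shows how a connector piece could be plugged into the env-to-module pipeline. It isn't one of the example scripts; the FlattenObservations import path, the env_to_module_connector hook, and the builder callable's signature are assumed to match a recent new-API-stack Ray version and may differ slightly in yours:
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.connectors.env_to_module import FlattenObservations

# Assumes the new API stack is enabled (the mechanism for enabling it differs
# between Ray versions). Flattening is most useful for Dict/Tuple observations.
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .env_runners(
        # The builder callable's exact signature varies across Ray versions,
        # hence the permissive defaults here.
        env_to_module_connector=lambda env=None, spaces=None, device=None: FlattenObservations(),
    )
)
algo = config.build()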
Curiosity#
- Count-Based Curiosity:
Implements count-based intrinsic motivation to encourage exploration of less visited states. Using curiosity is beneficial in sparse-reward environments where agents may struggle to find rewarding paths. However, count-based methods are only feasible for environments with small observation spaces (see the generic sketch after this list).
- Euclidean Distance-Based Curiosity:
Uses Euclidean distance between states and the initial state to measure novelty, encouraging exploration by rewarding the agent for reaching “far away” regions of the environment. Suitable for sparse-reward tasks, where diverse exploration is key to success.
- Intrinsic-Curiosity-Model (ICM) Based Curiosity:
Adds an Intrinsic Curiosity Model (ICM) that learns to predict the next state as well as the action in between two states to measure novelty. The higher the loss of the ICM, the higher the “novelty” and thus the intrinsic reward. Ideal for complex environments with large observation spaces where reward signals are sparse.
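As a library-agnostic illustration of the count-based idea (not the example script's implementation), an intrinsic bonus can be derived from per-state visit counts:
from collections import defaultdict
import numpy as np

class CountBasedBonus:
    """Toy count-based intrinsic reward: the bonus shrinks as a state is revisited."""

    def __init__(self, scale: float = 0.1):
        self.scale = scale
        self.counts = defaultdict(int)

    def __call__(self, obs) -> float:
        # Bin/hash the observation; only feasible for small observation spaces.
        key = tuple(np.asarray(obs).flatten().round(2).tolist())
        self.counts[key] += 1
        return self.scale / np.sqrt(self.counts[key])

# Usage: add the intrinsic bonus to the extrinsic reward during training.
bonus = CountBasedBonus()
total_reward = 1.0 + bonus([0.0, 3.0])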
Curriculum Learning#
- Curriculum Learning:
Demonstrates curriculum learning, where the environment difficulty increases as the agent improves. This approach enables gradual learning, allowing agents to master simpler tasks before progressing to more challenging ones, ideal for environments with hierarchical or staged difficulties. Also see the curriculum learning how-to in the documentation.
Environments#
- Custom Env Rendering Method:
Demonstrates how to add a custom render() method to a (custom) environment, allowing visualizations of agent interactions.
- Custom gymnasium Env:
Implements a custom gymnasium environment from scratch, showing how to define observation and action spaces, arbitrary reward functions, and step and reset logic (see the sketch after this list).
- Env Rendering and Recording:
Illustrates environment rendering and recording setups within RLlib, capturing visual outputs for later review (for example, on WandB), which is essential for tracking agent behavior in training.
- Env with Protobuf Observations:
Uses Protobuf for observations, demonstrating an advanced way of handling serialized data in environments. This approach is useful for integrating complex external data sources as observations.
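For orientation, a bare-bones custom gymnasium environment (an illustrative toy, not one of the example envs) looks roughly like this:
import gymnasium as gym
import numpy as np

class MoveToTarget(gym.Env):
    """Toy env: move a scalar position toward a fixed target at +3.0."""

    def __init__(self, config=None):
        # RLlib passes an env_config dict; a plain gymnasium env may ignore it.
        self.observation_space = gym.spaces.Box(-10.0, 10.0, shape=(1,), dtype=np.float32)
        self.action_space = gym.spaces.Discrete(2)  # 0: step left, 1: step right
        self.pos = 0.0

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.pos = 0.0
        return np.array([self.pos], dtype=np.float32), {}

    def step(self, action):
        self.pos += 0.5 if action == 1 else -0.5
        terminated = abs(self.pos - 3.0) < 0.25   # reached the target
        truncated = abs(self.pos) >= 10.0         # wandered off the grid
        reward = 1.0 if terminated else -0.01     # sparse goal plus small step cost
        return np.array([self.pos], dtype=np.float32), reward, terminated, truncated, {}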
Evaluation#
- Custom Evaluation:
Configures custom evaluation metrics for agent performance, allowing users to define specific success criteria beyond standard RLlib evaluation metrics.
- Evaluation Parallel to Training:
Runs evaluation episodes in parallel with training, reducing training time by offloading evaluation to separate processes. This is beneficial in scenarios where frequent evaluation is required without interrupting learning.
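A minimal configuration sketch of parallel evaluation follows; the keyword names are assumed to match a recent Ray version (older versions, for example, use evaluation_num_workers instead of evaluation_num_env_runners):
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    .evaluation(
        evaluation_interval=1,                 # evaluate after every training iteration
        evaluation_num_env_runners=1,          # dedicated evaluation EnvRunner(s)
        evaluation_duration=10,                # for example, 10 episodes per round
        evaluation_duration_unit="episodes",
        evaluation_parallel_to_training=True,  # overlap evaluation with training
    )
)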
Fault Tolerance#
- Crashing and Stalling Env:
Simulates an environment that randomly crashes and/or stalls, allowing users to test RLlib’s fault-tolerance mechanisms. This script is useful for evaluating how RLlib handles interruptions and recovers from unexpected failures during training.
GPU (for Training and Sampling)#
- Float16 Training and Inference:
Configures a setup for mixed-precision (float16) training and inference, optimizing performance by reducing memory usage and speeding up computation. This is especially useful for large-scale models on compatible GPUs.
- Fractional GPUs per Learner:
Demonstrates allocating fractional GPUs to individual learners, enabling finer resource allocation in multi-model setups. Useful for saving resources when training smaller models, many of which can fit on a single GPU (see the sketch after this list).
- Mixed Precision Training and Float16 Inference:
Uses mixed precision (float32 and float16) for training, while switching to float16 precision for inference, balancing stability during training with performance improvements during evaluation.
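A sketch of fractional GPU allocation on the new API stack; the learners() method and its argument names are assumed to match a recent Ray version (older versions expose similar settings through resources()):
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    # Two Learner actors, each reserving half a GPU, so both fit on one physical GPU.
    .learners(
        num_learners=2,
        num_gpus_per_learner=0.5,
    )
)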
Inference (of Models/Policies)#
- Policy Inference after Training:
Demonstrates performing inference with a trained policy, showing how to load a trained model and use it to make decisions in a simulated environment (see the sketch after this list).
- Policy Inference after Training (with ConnectorV2):
Runs inference with a trained (LSTM-based) policy using connectors, which preprocess observations and actions, allowing for more modular and flexible inference setups.
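For orientation, the simplest inference loop looks roughly like the sketch below. It uses the compute_single_action convenience API rather than querying the trained RLModule directly (which is what the example scripts demonstrate), and the checkpoint path is a placeholder:
import gymnasium as gym
from ray.rllib.algorithms.algorithm import Algorithm

# Restore a previously trained algorithm from a checkpoint (placeholder path).
algo = Algorithm.from_checkpoint("/tmp/my_ppo_checkpoint")

env = gym.make("CartPole-v1")
obs, info = env.reset()
terminated = truncated = False
while not (terminated or truncated):
    action = algo.compute_single_action(obs, explore=False)
    obs, reward, terminated, truncated, info = env.step(action)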
Learners#
- Custom Loss Function (simple):
Implements a custom loss function for training, demonstrating how users can define tailored loss objectives for specific environments or behaviors.
- Custom Torch Learning Rate Schedulers:
Adds learning rate scheduling to PPO, showing how to adjust the learning rate dynamically using PyTorch schedulers for improved training stability.
- Separate Learning Rate and Optimizer for Value-Function:
Configures a separate learning rate and a separate optimizer for the value function (vs the policy network), enabling differentiated training dynamics between policy and value estimation in RL algorithms.
Metrics#
- Logging Custom Metrics in EnvRunners:
Demonstrates adding custom metrics to EnvRunner actors, providing a way to track specific performance and environment indicators beyond the standard RLlib metrics.
Multi-Agent RL#
- Custom Heuristic Policy:
Demonstrates running a hybrid policy setup within the MultiAgentCartPole environment, where one agent follows a hand-coded random policy while another agent trains with PPO. This example highlights integrating static and dynamic policies, suitable for environments with a mix of fixed-strategy and adaptive agents.
- Different Spaces for Agents:
Configures agents with differing observation and action spaces within the same environment, showcasing RLlib’s support for heterogeneous agents with varying space requirements in a single multi-agent environment.
- Grouped Agents (Two-Step Game):
Implements a multi-agent, grouped setup within a two-step game environment (from the QMIX paper). N agents are grouped into M teams (N >= M) for which policies and rewards are shared. This example demonstrates RLlib’s ability to manage collective objectives and interactions among grouped agents.
- Multi-Agent CartPole:
Runs a multi-agent version of the CartPole environment with each agent independently learning to balance its pole. This example serves as a foundational test for multi-agent reinforcement learning scenarios in simple, independent tasks (see the configuration sketch after this list).
- Multi-Agent Pendulum:
Extends the classic Pendulum environment into a multi-agent setting, where multiple agents attempt to balance their respective pendulums. This example highlights RLlib’s support for environments with replicated dynamics but distinct agent policies.
- PettingZoo Independent Learning:
Integrates RLlib with PettingZoo to facilitate independent learning among multiple agents. Each agent independently optimizes its policy within a shared environment.
- PettingZoo Parameter Sharing:
Uses PettingZoo for an environment where all agents share a single policy.
- PettingZoo Shared Value Function:
Also using PettingZoo, this example explores shared value functions among agents. It demonstrates collaborative learning scenarios where agents collectively estimate a value function rather than individual policies.
- Rock-Paper-Scissors Heuristic vs Learned:
Simulates a rock-paper-scissors game with one heuristic-driven agent and one learning agent. It provides insights into performance when combining fixed and adaptive strategies in adversarial games.
- Rock-Paper-Scissors Learned vs Learned:
Sets up a rock-paper-scissors game where both agents are trained and therefore learn strategies against each other. Useful for evaluating performance in simple adversarial settings.
- Self-Play (League-Based) with OpenSpiel:
Uses OpenSpiel to demonstrate league-based self-play, where agents play against various (frozen or still-learning) versions of themselves to improve through competitive interaction.
- Self-Play with OpenSpiel:
Similar to the league-based self-play, but simpler. This script leverages OpenSpiel for two-player games, allowing agents to improve through direct self-play without building a complex, structured league.
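The basic multi-agent configuration pattern behind several of these examples looks roughly like the following sketch; the MultiAgentCartPole import path and the policy_mapping_fn signature are assumptions that may differ between Ray versions:
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.examples.envs.classes.multi_agent import MultiAgentCartPole

config = (
    PPOConfig()
    .environment(MultiAgentCartPole, env_config={"num_agents": 2})
    .multi_agent(
        # One policy per agent here; agents could also share a single policy.
        policies={"p0", "p1"},
        policy_mapping_fn=lambda agent_id, episode, **kwargs: f"p{agent_id}",
    )
)
algo = config.build()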
Offline RL#
- Train with Behavioral Cloning (BC), Finetune with PPO:
Combines behavioral cloning pre-training with PPO fine-tuning, providing a two-phase training strategy where imitation learning (offline) is followed by online reinforcement learning.
Ray Serve and RLlib#
- Serving RLlib Models with Ray Serve:
Integrates RLlib with Ray Serve, showcasing how to deploy trained RLModule instances as RESTful services. This setup is ideal for deploying models in production environments with API-based interactions.
Ray Tune and RLlib#
- Custom Experiment:
Configures a custom experiment with Ray Tune, demonstrating advanced options for custom training and evaluation phases.
- Custom Logger:
Shows how to implement a custom logger within Ray Tune, allowing users to define specific logging behaviors and outputs during training.
- Custom Progress Reporter:
Demonstrates a custom progress reporter in Ray Tune, which enables tracking and displaying specific training metrics or status updates in a customized format.
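For reference, running an RLlib config through Ray Tune typically follows this minimal pattern (a sketch; the stopping criterion is illustrative, and RunConfig has moved between ray.air, ray.train, and ray.tune across versions):
from ray import train, tune
from ray.rllib.algorithms.ppo import PPOConfig

config = PPOConfig().environment("CartPole-v1")

tuner = tune.Tuner(
    "PPO",
    param_space=config,
    # Stop after a fixed iteration budget (illustrative criterion).
    run_config=train.RunConfig(stop={"training_iteration": 10}),
)
results = tuner.fit()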
RLModules#
- Action Masking:
Implements an RLModule with action masking, where certain (disallowed) actions are masked based on parts of the observation dict, useful for environments with conditional action availability.
- Auto-Regressive Actions:
Configures an RL module that generates actions in an autoregressive manner, where the second component of an action depends on the previously sampled first component of the same action.
- Custom CNN-Based RLModule:
Demonstrates a custom CNN architecture realized as an RLModule, enabling convolutional feature extraction tailored to the environment's visual observations.
- Custom LSTM-Based RLModule:
Uses a custom LSTM within an RLModule, allowing for temporal sequence processing, beneficial for partially observable environments with sequential dependencies.
- Migrate ModelV2 to RLModule (new API stack) by config:
Shows how to migrate a ModelV2-based setup (old API stack) to the new API stack's RLModule, using an (old API stack) AlgorithmConfig instance.
- Migrate ModelV2 to RLModule (new API stack) by Policy Checkpoint:
Migrates a ModelV2 (old API stack) to the new API stack's RLModule by directly loading a policy checkpoint, enabling smooth transitions to the new API stack while preserving learned parameters.
- Pretrain Single-Agent Policy, then Train in Multi-Agent Env:
Demonstrates pretraining a single-agent model and transferring it to a multi-agent setting, useful for initializing multi-agent scenarios with pre-trained policies.
Tuned Examples#
The tuned examples folder contains Python config files that you can execute analogously to the other example scripts described here, to run tuned learning experiments for the different algorithms and environment types.
For example, see this tuned Atari example for PPO, which learns to solve the Pong environment in roughly 5 minutes. It can be run as follows on a single g5.24xlarge (or g6.24xlarge) machine with 4 GPUs and 96 CPUs:
$ cd ray/rllib/tuned_examples/ppo
$ python atari_ppo.py --env=ale_py:ALE/Pong-v5 --num-learners=4 --num-env-runners=95
Note that some of the files in this folder are used for RLlib’s daily or weekly release tests as well.
Community Examples#
Note
The community examples listed here all refer to the old API stack of RLlib.
- Arena AI:
A General Evaluation Platform and Building Toolkit for Single/Multi-Agent Intelligence with RLlib-generated baselines.
- The Emergence of Adversarial Communication in Multi-Agent Reinforcement Learning:
Using Graph Neural Networks and RLlib to train multiple cooperative and adversarial agents to solve the “cover the area”-problem, thereby learning how to best communicate (or - in the adversarial case - how to disturb communication) (code).
- Flatland:
A dense traffic simulating environment with RLlib-generated baselines.
- mobile-env:
An open, minimalist Gymnasium environment for autonomous coordination in wireless mobile networks. Includes an example notebook using Ray RLlib for multi-agent RL with mobile-env.
- Neural MMO:
A multiagent AI research environment inspired by Massively Multiplayer Online (MMO) role playing games – self-contained worlds featuring thousands of agents per persistent macrocosm, diverse skilling systems, local and global economies, complex emergent social structures, and ad-hoc high-stakes single and team based conflict.
- NeuroCuts:
Example of building packet classification trees using RLlib / multi-agent in a bandit-like setting.
- NeuroVectorizer:
Example of learning optimal LLVM vectorization compiler pragmas for loops in C and C++ codes using RLlib.
- Roboschool / SageMaker:
Example of training robotic control policies in SageMaker with RLlib.
- Sequential Social Dilemma Games:
Example of using the multi-agent API to model several social dilemma games.
- Simple custom environment for single RL with Ray and RLlib:
Create a custom environment and train a single RL agent using Ray 2.0 with Tune.
- StarCraft2:
Example of training in StarCraft2 maps with RLlib / multi-agent.
- Traffic Flow:
Example of optimizing mixed-autonomy traffic simulations with RLlib / multi-agent.
Blog Posts#
Note
The blog posts listed here all refer to the old API stack of RLlib.
- Attention Nets and More with RLlib’s Trajectory View API:
Blog describing RLlib’s new “trajectory view API” and how it enables implementations of GTrXL (attention net) architectures.
- Reinforcement Learning with RLlib in the Unity Game Engine:
How-To guide about connecting RLlib with the Unity3D game engine for running visual- and physics-based RL experiments.
- Lessons from Implementing 12 Deep RL Algorithms in TF and PyTorch:
Discussion on how the Ray Team ported 12 of RLlib’s algorithms from TensorFlow to PyTorch and the lessons learned.
- Scaling Multi-Agent Reinforcement Learning:
Blog post of a brief tutorial on multi-agent RL and its design in RLlib.
- Functional RL with Keras and TensorFlow Eager:
Exploration of a functional paradigm for implementing reinforcement learning (RL) algorithms.