Note
Ray 2.10.0 introduces the alpha stage of RLlib’s “new API stack”. The Ray Team plans to transition algorithms, example scripts, and documentation to the new code base, thereby incrementally replacing the “old API stack” (e.g., ModelV2, Policy, RolloutWorker) throughout the subsequent minor releases leading up to Ray 3.0.
Note, however, that so far only PPO (single- and multi-agent) and SAC (single-agent only) support the “new API stack”; by default, these algorithms still run with the old APIs. You can continue to use your existing custom (old stack) classes.
See here for more details on how to use the new API stack.
Examples#
This page contains an index of all the python scripts in the examples folder of RLlib, demonstrating the different use cases and features of the library.
Note
RLlib is currently in a transition state from “old API stack” to “new API stack”.
Some of the examples here haven’t been translated to the new stack yet and are tagged with the following comment line on top: # @OldAPIStack. Moving all example scripts over to the “new API stack” is work in progress and expected to be completed by the end of 2024.
Note
If any new-API-stack example is broken, or if you’d like to add an example to this page, feel free to raise an issue on RLlib’s GitHub repository.
Folder Structure#
The examples folder is structured into several sub-directories, the contents of all of which are described in detail below.
How to run an example script#
Most of the example scripts are self-executable, meaning you can just cd into the respective directory and run the script as-is with python:
$ cd ray/rllib/examples/multi_agent
$ python multi_agent_pendulum.py --enable-new-api-stack --num-agents=2
Use the --help command line argument to have each script print out its supported command line options.
Most of the scripts share a common subset of generally applicable command line arguments, for example --num-env-runners, --no-tune, or --wandb-key.
All sub-folders#
Algorithms#
- How to write a custom Algorithm.training_step() method combining on- and off-policy training:
Example of how to override the training_step() method to train two different policies in parallel (also using the multi-agent API).
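The following is a minimal, hedged sketch (old API stack) of what such an override can look like. The alternation logic, the policy IDs "policy_a" and "policy_b", and the helper choices are illustrative assumptions, not the linked example's exact code:

from ray.rllib.algorithms.ppo import PPO
from ray.rllib.execution.rollout_ops import synchronous_parallel_sample
from ray.rllib.execution.train_ops import train_one_step

class AlternatingTrainingPPO(PPO):
    # Sketch: alternate updates between two (assumed) policies.
    def training_step(self):
        # Collect a batch from all sampling workers.
        train_batch = synchronous_parallel_sample(worker_set=self.workers)
        # Update only one of the two (assumed) policies per iteration.
        policy_id = "policy_a" if self.iteration % 2 == 0 else "policy_b"
        results = train_one_step(self, train_batch, policies_to_train=[policy_id])
        # Push the updated weights back to the sampling workers.
        self.workers.sync_weights(policies=[policy_id])
        return results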
Checkpoints#
- How to extract a checkpoint from n Tune trials using one or more custom criteria:
Example of how to find a checkpoint after a Tuner.fit() run, using some custom-defined criteria.
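A minimal, hedged sketch of the general pattern (the stopping criterion and the metric name are illustrative assumptions; the metric shown is the old-API-stack name):

from ray import air, tune
from ray.rllib.algorithms.ppo import PPOConfig

config = PPOConfig().environment("CartPole-v1")
tuner = tune.Tuner(
    "PPO",
    param_space=config.to_dict(),
    run_config=air.RunConfig(stop={"training_iteration": 5}),
)
results = tuner.fit()
# Pick the checkpoint of the trial that reached the best mean return.
best_result = results.get_best_result(metric="episode_reward_mean", mode="max")
best_checkpoint = best_result.checkpoint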
Connectors#
Note
RLlib’s Connector API has been re-written from scratch for the new API stack. Connector pieces and pipelines are now referred to as ConnectorV2 (as opposed to Connector, which continues to work only on the old API stack).
- How to frame-stack Atari image observations:
An example of Atari framestacking done in a very efficient manner: not in the environment itself (as a gym.Wrapper), but by stacking the observations on-the-fly using EnvToModule and LearnerConnector pipelines. This method of framestacking is more efficient because it avoids having to send large observation tensors through the network (Ray).
- How to mean/std-filter observations:
An example of a ConnectorV2 that filters all observations from the environment using a plain mean/std-filter (shift by the mean and divide by the std-dev). This example demonstrates how a stateful ConnectorV2 class has its states (here, the means and standard deviations of the individual observation items) coming from the different EnvRunner instances a) merged into one common state and then b) broadcast back to the remote EnvRunner workers. A framework-agnostic sketch of this kind of filter follows this list.
- How to include previous actions and/or previous rewards in RLModule inputs:
An example of a ConnectorV2 that adds the n previous actions and/or the m previous rewards to the RLModule’s input dict (to perform its forward passes, both for inference and training).
- How to train with nested action spaces:
Learning in arbitrarily nested action spaces, using an env in which the action space equals the observation space (both are complex, nested Dicts) and the policy has to pick actions that closely match (or are identical) to the previously seen observations.
- How to train with nested observation spaces:
Learning in arbitrarily nested observation spaces (using a CartPole-v1 variant with a nested Dict observation space).
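The mean/std-filter example above boils down to the following framework-agnostic idea. This is a plain-numpy sketch of the running normalization such a connector performs, not the actual ConnectorV2 class (which additionally handles the merging and broadcasting of states across EnvRunners):

import numpy as np

class RunningMeanStdFilter:
    # Sketch: shift observations by a running mean, divide by a running std-dev.
    def __init__(self, shape):
        self.count = 0
        self.mean = np.zeros(shape, dtype=np.float64)
        self.m2 = np.zeros(shape, dtype=np.float64)  # sum of squared deviations

    def __call__(self, obs):
        obs = np.asarray(obs, dtype=np.float64)
        # Welford's online update of mean and variance.
        self.count += 1
        delta = obs - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (obs - self.mean)
        std = np.sqrt(self.m2 / max(self.count - 1, 1))
        # Normalize the current observation.
        return (obs - self.mean) / (std + 1e-8)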
Curriculum Learning#
- How to set up curriculum learning with the custom callbacks API:
Example of how to make the environment go through different levels of difficulty (from easy to harder to solve) and thus help the learning algorithm to cope with an otherwise unsolvable task. Also see the curriculum learning how-to from the documentation.
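A minimal, hedged sketch of the callback-based pattern (old API stack). The return threshold, the metric name, and the env's set_task() method are illustrative assumptions, not RLlib APIs:

from ray.rllib.algorithms.callbacks import DefaultCallbacks

class CurriculumCallback(DefaultCallbacks):
    # Sketch: bump the env's difficulty once a return threshold is reached.
    def on_train_result(self, *, algorithm, result, **kwargs):
        if result.get("episode_reward_mean", 0.0) > 200.0:  # assumed threshold
            # set_task() is a hypothetical method your env would define.
            algorithm.workers.foreach_worker(
                lambda worker: worker.foreach_env(lambda env: env.set_task(2))
            )

You would then plug this class into your config via config.callbacks(CurriculumCallback).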
Debugging#
- How to train an RLlib algorithm using a deterministic/reproducible setup:
Example showing how you can train an RLlib algorithm in a deterministic, reproducible fashion using seeding.
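The core of such a setup is seeding the config (a hedged sketch; the seed value and env are illustrative):

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    # Using the same seed across runs makes sampling and training reproducible.
    .debugging(seed=42)
)
algo = config.build()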
Environments#
- How to register a custom gymnasium environment:
Example showing how to write your own RL environment using gymnasium and register it, so you can train your algorithm against this env with RLlib. See the registration sketch after this list.
- How to set up rendering (and recording) of the environment trajectories during training with WandB:
Example showing how you can render and record episode trajectories of your gymnasium envs and log the videos to WandB.
- How to run a Unity3D multi-agent environment locally:
Example of how to set up an RLlib Algorithm against a locally running Unity3D editor instance to learn any Unity3D game (including support for multi-agent). Use this example to try things out and watch the game and the learning progress live in the editor. Given a compiled game, this example could also run in a distributed fashion with num_env_runners > 0. For a more heavy-weight, distributed, cloud-based example, see Unity3D client/server below.
- How to run with a Unity3D client/server setup:
Example of how to set up n distributed Unity3D (compiled) games in the cloud that function as data-collecting clients against a central RLlib policy server learning how to play the game. The n distributed clients could themselves be servers for external/human players, allowing control to be fully in the hands of the Unity entities instead of RLlib. Note: Uses Unity’s MLAgents SDK (>=1.0) and supports all provided MLAgents example games and multi-agent setups.
- How to run with a CartPole client/server setup:
Example of online serving of predictions for a simple CartPole policy.
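The registration pattern for the custom-env example above looks roughly as follows (a hedged sketch; the env itself is a trivial stand-in):

import gymnasium as gym
from ray.tune.registry import register_env
from ray.rllib.algorithms.ppo import PPOConfig

class MyEnv(gym.Env):
    # Trivial stand-in env: one-step episodes, reward 1.0 for picking action 1.
    def __init__(self, env_config=None):
        self.observation_space = gym.spaces.Box(-1.0, 1.0, (1,))
        self.action_space = gym.spaces.Discrete(2)

    def reset(self, *, seed=None, options=None):
        return self.observation_space.sample(), {}

    def step(self, action):
        obs = self.observation_space.sample()
        return obs, float(action == 1), True, False, {}

# Register the env under a string name, then refer to that name in the config.
register_env("my_env", lambda env_config: MyEnv(env_config))
config = PPOConfig().environment("my_env")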
Evaluation#
- How to run evaluation with a custom evaluation function:
Example of how to write a custom evaluation function that’s called instead of the default behavior, which runs n episodes with the evaluation worker set.
- How to run evaluation in parallel to training:
Example showing how the evaluation workers and the “normal” rollout workers can run (to some extent) in parallel to speed up training.
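The parallel-evaluation setup above boils down to a few config settings (a hedged sketch; the exact worker-count argument name differs between RLlib versions):

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    .evaluation(
        evaluation_interval=1,         # evaluate every training iteration
        evaluation_duration=10,        # e.g. 10 episodes per evaluation round
        evaluation_num_env_runners=1,  # dedicated eval workers (older versions: evaluation_num_workers)
        evaluation_parallel_to_training=True,
    )
)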
GPU (for Training and Sampling)#
- How to use fractional GPUs for training an RLModule:
If your model is small and easily fits on a single GPU, and you therefore want to train other models alongside it to save time and cost, this script shows you how to set up your RLlib config with a fractional number of GPUs on the learner (model training) side.
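A hedged sketch of the old-stack way of requesting a fractional GPU (the new API stack configures this on the Learner side instead):

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    # Request only half a GPU for this algorithm's training process, so that a
    # second small model can share the same physical GPU.
    .resources(num_gpus=0.5)
)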
Hierarchical Training#
- How to set up hierarchical training:
Example of hierarchical training using the multi-agent API.
Inference (of Models/Policies)#
- How to do inference with an already trained policy:
Example of how to perform inference (compute actions) on an already trained policy. A minimal restore-and-infer sketch follows this list.
- How to do inference with an already trained (LSTM) policy:
Example of how to perform inference (compute actions) on an already trained (LSTM) policy.
- How to do inference with an already trained (attention) policy:
Example of how to perform inference (compute actions) on an already trained (attention) policy.
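For the plain (non-recurrent) case, the core restore-and-infer pattern looks roughly like this (a hedged sketch; the checkpoint path is a placeholder):

import gymnasium as gym
from ray.rllib.algorithms.algorithm import Algorithm

# Restore the trained algorithm from a checkpoint (hypothetical path).
algo = Algorithm.from_checkpoint("/tmp/my_checkpoint")

env = gym.make("CartPole-v1")
obs, info = env.reset()
terminated = truncated = False
while not (terminated or truncated):
    action = algo.compute_single_action(obs, explore=False)
    obs, reward, terminated, truncated, info = env.step(action)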
Metrics#
- How to write your own custom metrics and callbacks in RLlib:
Example of how to output custom training metrics to TensorBoard.
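A minimal, hedged sketch of the old-stack callback pattern (the metric name is an illustrative assumption):

from ray.rllib.algorithms.callbacks import DefaultCallbacks

class EpisodeLenMetric(DefaultCallbacks):
    # Sketch: log a custom per-episode metric that shows up in TensorBoard.
    def on_episode_end(self, *, episode, **kwargs):
        # custom_metrics get aggregated over the episodes of one iteration.
        episode.custom_metrics["my_episode_len"] = episode.length

Plug the class into your config via config.callbacks(EpisodeLenMetric).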
Multi-Agent RL#
- How to set up independent multi-agent training:
Set up RLlib to run any algorithm in (independent) multi-agent mode against a multi-agent environment. A config sketch follows this list.
- How to set up shared-parameter multi-agent training:
Set up RLlib to run any algorithm in (shared-parameter) multi-agent mode against a multi-agent environment.
- How to compare a heuristic policy vs a trained one (and learned vs learned) on rock-paper-scissors:
Two examples of different heuristic and learned policies competing against each other in the rock-paper-scissors environment.
- How to use agent grouping in a multi-agent environment (two-step game):
Example on how to use agent grouping in a multi-agent environment (the two-step game from the QMIX paper).
- How to set up multi-agent training vs a PettingZoo environment:
Example on how to use RLlib to learn in PettingZoo multi-agent environments.
- How to hand-code a (heuristic) policy:
Example of running a custom hand-coded policy alongside trainable policies.
- How to train a single policy (weight sharing) controlling more than one agent:
Example of how to define weight-sharing layers between two different policies.
- How to write and set up a model with a centralized critic:
Example of customizing PPO to leverage a centralized value function.
- How to write and set up a model with a centralized critic in the env:
A simpler method of implementing a centralized critic by augmenting agent observations with global information.
- How to combine multiple algorithms into one using the multi-agent API:
Example of alternating training between DQN and PPO.
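The shared pattern behind the independent and shared-parameter setups above is the multi_agent() config block (a hedged sketch; the env name, policy IDs, and mapping function are illustrative assumptions):

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("my_multi_agent_env")  # assumed, previously registered env
    .multi_agent(
        # Two independent policies; for shared parameters, use a single ID.
        policies={"policy_0", "policy_1"},
        # Map agent IDs like "agent_0"/"agent_1" to one of the two policies.
        policy_mapping_fn=lambda agent_id, episode, **kwargs: (
            "policy_0" if str(agent_id).endswith("0") else "policy_1"
        ),
    )
)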
Offline RL#
- How to run an offline RL experiment with CQL:
Example showing how to run an offline RL training job using a historic-data JSON file. A config sketch follows this list.
- How to save experiences from an environment for offline RL:
Example of how to externally generate experience batches in RLlib-compatible format.
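The key piece of the offline setup is pointing the config at the recorded data (a hedged sketch; the file path is a placeholder):

from ray.rllib.algorithms.cql import CQLConfig

config = (
    CQLConfig()
    .environment("Pendulum-v1")
    # Read pre-recorded experiences instead of sampling from a live env
    # (placeholder path to an RLlib-format JSON file).
    .offline_data(input_="/tmp/pendulum_expert_data.json")
)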
Ray Serve and RLlib#
- How to use a trained RLlib algorithm with Ray Serve:
This script offers a simple workflow: 1) train a policy with RLlib, 2) create a new policy and restore its weights from the trained one, and 3) serve the new policy with Ray Serve.
Ray Tune and RLlib#
- How to define a custom progress reporter and use it with Ray Tune and RLlib:
Example of how to write your own progress reporter (for a multi-agent experiment) and use it with Ray Tune and RLlib. A sketch follows this list.
- How to define and plug your custom logger into Ray Tune and RLlib:
How to set up a custom Logger object in RLlib and use it with Ray Tune.
- How to run a custom Tune experiment:
How to run a custom Ray Tune experiment with RLlib, using custom training and evaluation phases.
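A hedged sketch of plugging a custom reporter into Ray Tune (the chosen metric columns are assumptions; old-API-stack metric names shown):

from ray import air, tune
from ray.tune import CLIReporter
from ray.rllib.algorithms.ppo import PPOConfig

config = PPOConfig().environment("CartPole-v1")
reporter = CLIReporter(
    # Only show these columns in the console output.
    metric_columns={"training_iteration": "iter", "episode_reward_mean": "return"},
)
tuner = tune.Tuner(
    "PPO",
    param_space=config.to_dict(),
    run_config=air.RunConfig(
        stop={"training_iteration": 5},
        progress_reporter=reporter,
    ),
)
tuner.fit()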
RLModules#
- How to configure an autoregressive action distribution:
Learning with an auto-regressive action distribution (for example, two action components, where the distribution of the second component depends on the actually sampled value of the first).
- How to train with parametric actions:
Example of how to handle variable-length or parametric action spaces.
- How to use the “Repeated” space of RLlib for variable-length observations:
How to use RLlib’s Repeated space to handle variable-length observations. A short space-definition sketch follows this list.
- How to write a custom Keras model:
Example of using a custom Keras model.
- How to register a custom model with supervised loss:
Example of defining and registering a custom model with a supervised loss.
- How to train with batch normalization:
Example of adding batch norm layers to a custom model.
- How to write a custom model with its custom API:
Shows how to define a custom Model API in RLlib, such that it can be used inside certain algorithms.
- How to write a model utilizing the “trajectory view API”:
An example on how a model can use the trajectory view API to specify its own input.
- How to wrap MobileNetV2 into your RLlib model:
Implementations of example models that wrap tf.keras.applications.mobilenet_v2.MobileNetV2 and torch.hub (mobilenet_v2).
- How to set up a Differentiable Neural Computer:
Example of DeepMind’s Differentiable Neural Computer for partially observable environments.
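A hedged sketch of defining such a variable-length observation space with the Repeated space (the max_len and item space are illustrative):

import gymnasium as gym
from ray.rllib.utils.spaces.repeated import Repeated

# Observations: a variable-length list (at most 10 entries) of 2D items,
# for example the positions of nearby objects around the agent.
observation_space = Repeated(
    gym.spaces.Box(low=-1.0, high=1.0, shape=(2,)),
    max_len=10,
)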
Tuned Examples#
The tuned examples folder contains Python config files (YAML for the old API stack) that can be executed analogously to all other example scripts described here, in order to run tuned learning experiments for the different algorithms and environment types.
For example, see this tuned Atari example for PPO, which learns to solve the Pong environment in roughly 5 minutes. It can be run as follows on a single g5.24xlarge (or g6.24xlarge) machine with 4 GPUs and 96 CPUs:
$ cd ray/rllib/tuned_examples/ppo
$ python atari_ppo.py --env ALE/Pong-v5 --num-gpus=4 --num-env-runners=95
Note that some of the files in this folder are used for RLlib’s daily or weekly release tests as well.
Community Examples#
- Arena AI:
A General Evaluation Platform and Building Toolkit for Single/Multi-Agent Intelligence with RLlib-generated baselines.
- The Emergence of Adversarial Communication in Multi-Agent Reinforcement Learning:
Using Graph Neural Networks and RLlib to train multiple cooperative and adversarial agents to solve the “cover the area” problem, thereby learning how to best communicate (or, in the adversarial case, how to disturb communication) (code).
- Flatland:
A dense traffic simulating environment with RLlib-generated baselines.
- mobile-env:
An open, minimalist Gymnasium environment for autonomous coordination in wireless mobile networks. Includes an example notebook using Ray RLlib for multi-agent RL with mobile-env.
- Neural MMO:
A multiagent AI research environment inspired by Massively Multiplayer Online (MMO) role playing games – self-contained worlds featuring thousands of agents per persistent macrocosm, diverse skilling systems, local and global economies, complex emergent social structures, and ad-hoc high-stakes single and team based conflict.
- NeuroCuts:
Example of building packet classification trees using RLlib / multi-agent in a bandit-like setting.
- NeuroVectorizer:
Example of learning optimal LLVM vectorization compiler pragmas for loops in C and C++ codes using RLlib.
- Roboschool / SageMaker:
Example of training robotic control policies in SageMaker with RLlib.
- Sequential Social Dilemma Games:
Example of using the multi-agent API to model several social dilemma games.
- Simple custom environment for single-agent RL with Ray and RLlib:
Create a custom environment and train a single-agent RL policy using Ray 2.0 with Tune.
- StarCraft2:
Example of training in StarCraft2 maps with RLlib / multi-agent.
- Traffic Flow:
Example of optimizing mixed-autonomy traffic simulations with RLlib / multi-agent.
Blog Posts#
- Attention Nets and More with RLlib’s Trajectory View API:
Blog describing RLlib’s new “trajectory view API” and how it enables implementations of GTrXL (attention net) architectures.
- Reinforcement Learning with RLlib in the Unity Game Engine:
How-To guide about connecting RLlib with the Unity3D game engine for running visual- and physics-based RL experiments.
- Lessons from Implementing 12 Deep RL Algorithms in TF and PyTorch:
Discussion on how the Ray Team ported 12 of RLlib’s algorithms from TensorFlow to PyTorch and the lessons learned.
- Scaling Multi-Agent Reinforcement Learning:
Blog post with a brief tutorial on multi-agent RL and its design in RLlib.
- Functional RL with Keras and TensorFlow Eager:
Exploration of a functional paradigm for implementing reinforcement learning (RL) algorithms.