Examples#

This page contains an index of all the python scripts in the examples folder of RLlib, demonstrating the different use cases and features of the library.

Note

RLlib is currently in a transition state from old- to new API stack. The Ray team has translated most of the example scripts to the new stack and tag those still on the old stack with this comment line on top: # @OldAPIStack. The moving of all example scripts over to the new stack is work in progress.

Note

If you find any new API stack example broken, or if you’d like to add an example to this page, create an issue in the RLlib GitHub repository.

Folder structure#

The examples folder has several sub-directories described in detail below.

How to run an example script#

Most of the example scripts are self-executable, meaning you can cd into the respective directory and run the script as-is with python:

$ cd ray/rllib/examples/multi_agent
$ python multi_agent_pendulum.py --num-agents=2

Use the --help command line argument to have each script print out its supported command line options.

Most of the scripts share a common subset of generally applicable command line arguments, for example --num-env-runners, to scale the number of EnvRunner actors, --no-tune, to switch off running with Ray Tune, --wandb-key, to log to WandB, or --verbose, to control log chattiness.

All example sub-folders#

Actions#

Auto-regressive actions:
Configures an RL module that generates actions in an autoregressive manner, where the second component of an action depends on the previously sampled first component of the same action.
Custom action distribution class:
Demonstrates how to write a custom action distribution class, taking an additional temperature parameter on top of a Categorical distribution, and how to configure this class inside your RLModule implementation. Further explains how to define different such classes for the different forward methods of your RLModule in case you need more granularity.
Nested Action Spaces:
Sets up an environment with nested action spaces using custom single- or multi-agent configurations. This example demonstrates how RLlib manages complex action structures, such as multi-dimensional or hierarchical action spaces.

Algorithms#

Custom implementation of the Model-Agnostic Meta-Learning (MAML) algorithm:
Shows how to stably train a model in an “infinite-task” environment, where each task corresponds to a sinusoidal function with randomly sampled amplitude and phase. Because each new task introduces a shift in data distribution, traditional learning algorithms would fail to generalize.
Custom “vanilla policy gradient” (VPG) algorithm:
Shows how to write a very simple policy gradient Algorithm from scratch, including a matching AlgorithmConfig, a matching Learner which defines the loss function, and the Algorithm’s training_step() implementation.
Custom algorithm with a global, shared data actor for sending manipulated rewards from EnvRunners to Learners:
Shows how to write a custom shared data actor accessible from any of the Algorithm’s other actors, like EnvRunner and Learner actors. The new actor stores manipulated rewards from sampled episodes under unique, per-episode keys and then serves this information to the Learner for adding these rewards to the train batch.

Checkpoints#

Checkpoint by custom criteria:
Shows how to create checkpoints based on custom criteria, giving users control over when to save model snapshots during training.
Continue training from checkpoint:
Illustrates resuming training from a saved checkpoint, useful for extending training sessions or recovering from interruptions.
Restore 1 out of N agents from checkpoint:
Restores one specific agent from a multi-agent checkpoint, allowing selective loading for environments where only certain agents need to resume training.

Connectors#

Note

RLlib’s Connector API has been re-written from scratch for the new API stack. Connector-pieces and -pipelines are now referred to as ConnectorV2 to distinguish against the Connector class, which only continue to work on the old API stack.

Flatten and one-hot observations:
Demonstrates how to one-hot discrete observation spaces and/or flatten complex observations, Dict or Tuple, allowing RLlib to process arbitrary observation data as flattened 1D vectors. Useful for environments with complex, discrete, or hierarchical observations.
Observation frame-stacking:
Implements frame stacking, where N consecutive frames stack together to provide temporal context to the agent. This technique is common in environments with continuous state changes, like video frames in Atari games. Using connectors for frame stacking is more efficient as it avoids having to send large observation tensors through ray remote calls.
Mean/Std filtering:
Adds mean and standard deviation normalization for observations, shifting by the mean and dividing by std-dev. This type of filtering can improve learning stability in environments with highly variable state magnitudes by scaling observations to a normalized range.
Multi-agent observation preprocessor enhancing non-Markovian observations to Markovian ones:
A multi-agent preprocessor enhances the per-agent observations of a multi-agent env, which by themselves are non-Markovian, partial observations and converts them into Markovian observations by adding information from the respective other agent. A policy can only be trained optimally through this additional information.
Prev-actions, prev-rewards connector:
Augments observations with previous actions and rewards, giving the agent a short-term memory of past events, which can improve decision-making in partially observable or sequentially dependent tasks.
Single-agent observation preprocessor:
A connector alters the CartPole-v1 environment observations from the Markovian 4-tuple (x-pos, angular-pos, x-velocity, angular-velocity) to a non-Markovian, simpler 2-tuple (only x-pos and angular-pos). The resulting problem can only be solved through a memory/stateful model, for example an LSTM.

Curiosity#

Count-based curiosity:
Implements count-based intrinsic motivation to encourage exploration of less visited states. Using curiosity is beneficial in sparse-reward environments where agents may struggle to find rewarding paths. However, count-based methods are only feasible for environments with small observation spaces.
Euclidian distance-based curiosity:
Uses Euclidean distance between states and the initial state to measure novelty, encouraging exploration by rewarding the agent for reaching “far away” regions of the environment. Suitable for sparse-reward tasks, where diverse exploration is key to success.
Intrinsic-curiosity-model (ICM) Based Curiosity:
Adds an Intrinsic Curiosity Model (ICM) that learns to predict the next state as well as the action in between two states to measure novelty. The higher the loss of the ICM, the higher the “novelty” and thus the intrinsic reward. Ideal for complex environments with large observation spaces where reward signals are sparse.

Curriculum learning#

Custom env rendering method:
Demonstrates curriculum learning, where the environment difficulty increases as the agent improves. This approach enables gradual learning, allowing agents to master simpler tasks before progressing to more challenging ones, ideal for environments with hierarchical or staged difficulties. Also see the curriculum learning how-to from the documentation.

Debugging#

Deterministic sampling and training:
Demonstrates the possibility to seed an experiment through the algorithm config. RLlib passes the seed through to all components that have a copy of the RL environment and the RLModule and thus makes sure these components behave deterministically. When using a seed, train results should become repeatable. Note that some algorithms, such as APPO which rely on asynchronous sampling in combination with Ray network communication always behave stochastically, no matter whether you set a seed or not.

Environments#

Async gym vectorization, parallelizing sub-environments:
Shows how the gym_env_vectorize_mode config setting can significantly speed up your :py:class`~ray.rllib.env.env_runner.EnvRunner` actors, if your RL environment is slow and you are using num_envs_per_env_runner > 1. The reason for the performance gain is that each sub-environment runs in its own process.
Custom env rendering method:
Demonstrates how to add a custom render() method to a (custom) environment, allowing visualizations of agent interactions.
Custom gymnasium env:
Implements a custom gymnasium environment from scratch, showing how to define observation and action spaces, arbitrary reward functions, as well as, step- and reset logic.
Env connecting to RLlib through a tcp client:
An external environment, running outside of RLlib and acting as a client, connects to RLlib as a server. The external env performs its own action inference using an ONNX model, sends collected data back to RLlib for training, and receives model updates from time to time from RLlib.
Env rendering and recording:
Illustrates environment rendering and recording setups within RLlib, capturing visual outputs for later review (ex. on WandB), which is essential for tracking agent behavior in training.
Env with protobuf observations:
Uses Protobuf for observations, demonstrating an advanced way of handling serialized data in environments. This approach is useful for integrating complex external data sources as observations.

Evaluation#

Custom evaluation:
Configures custom evaluation metrics for agent performance, allowing users to define specific success criteria beyond standard RLlib evaluation metrics.
Evaluation parallel to training:
Runs evaluation episodes in parallel with training, reducing training time by offloading evaluation to separate processes. This method is beneficial when you require frequent evaluation without interrupting learning.

Fault tolerance#

Crashing and stalling env:
Simulates an environment that randomly crashes or stalls, allowing users to test RLlib’s fault-tolerance mechanisms. This script is useful for evaluating how RLlib handles interruptions and recovers from unexpected failures during training.

GPUs for training and sampling#

Float16 training and inference:
Configures a setup for float16 training and inference, optimizing performance by reducing memory usage and speeding up computation. This is especially useful for large-scale models on compatible GPUs.
Fractional GPUs per Learner:
Demonstrates allocating fractional GPUs to individual learners, enabling finer resource allocation in multi-model setups. Useful for saving resources when training smaller models, many of which can fit on a single GPU.
Mixed precision training and float16 inference:
Uses mixed precision, float32 and float16, for training, while switching to float16 precision for inference, balancing stability during training with performance improvements during evaluation.
Using GPUs on EnvRunners:
Demos how EnvRunner instances, single- or multi-agent, can request GPUs through the config.env_runners(num_gpus_per_env_runner=..) setting.

Hierarchical training#

Hierarchical RL training:
Showcases a hierarchical RL setup inspired by automatic subgoal discovery and subpolicy specialization. A high-level policy selects subgoals and assigns one of three specialized low-level policies to achieve them within a time limit, encouraging specialization and efficient task-solving. The agent has to navigate a complex grid-world environment. The example highlights the advantages of hierarchical learning over flat approaches by demonstrating significantly improved learning performance in challenging, goal-oriented tasks.

Inference of models or policies#

Policy inference after training:
Demonstrates performing inference using a checkpointed RLModule or an ONNX runtime. First trains the RLModule, creates a checkpoint, then re-loads the module from this checkpoint or ONNX file, and computes actions in a simulated environment.
Policy inference after training, with ConnectorV2:
Runs inference with a trained, LSTM-based RLModule or an ONNX runtime. Two connector pipelines, env-to-module and module-to-env, preprocess observations and LSTM-states and postprocess model outputs into actions, allowing for very modular and flexible inference setups.

Learners#

Custom loss function, simple:
Implements a custom loss function for training, demonstrating how users can define tailored loss objectives for specific environments or behaviors.
Custom torch learning rate schedulers:
Adds learning rate scheduling to PPO, showing how to adjust the learning rate dynamically using PyTorch schedulers for improved training stability.
Separate learning rate and optimizer for value function:
Configures a separate learning rate and a separate optimizer for the value function vs the policy network, enabling differentiated training dynamics between policy and value estimation in RL algorithms.

Metrics#

Logging custom metrics in Algorithm.training_step:
Shows how to log custom metrics inside a custom Algorithm through overriding the :py:meth:`` method and making calls to the log_value() method of the MetricsLogger instance.
Logging custom metrics in EnvRunners:
Demonstrates adding custom metrics to EnvRunner actors, providing a way to track specific performance- and environment indicators beyond the standard RLlib metrics.

Multi-agent RL#

Custom heuristic policy:
Demonstrates running a hybrid policy setup within the MultiAgentCartPole environment, where one agent follows a hand-coded random policy while another agent trains with PPO. This example highlights integrating static and dynamic policies, suitable for environments with a mix of fixed-strategy and adaptive agents.
Different observation- and action spaces for different agents:
Configures agents with differing observation and action spaces within the same environment, showcasing RLlib’s support for heterogeneous agents with varying space requirements in a single multi-agent environment. Another example, which also makes use of connectors, and that covers the same topic, agents having different spaces, can be found here.
Grouped agents, two-step game:
Implements a multi-agent, grouped setup within a two-step game environment from the QMIX paper. N agents form M teams in total, where N >= M, and agents in each team share rewards and one policy. This example demonstrates RLlib’s ability to manage collective objectives and interactions among grouped agents.
Multi-agent CartPole:
Runs a multi-agent version of the CartPole environment with each agent independently learning to balance its pole. This example serves as a foundational test for multi-agent reinforcement learning scenarios in simple, independent tasks.
Multi-agent Pendulum:
Extends the classic Pendulum environment into a multi-agent setting, where multiple agents attempt to balance their respective pendulums. This example highlights RLlib’s support for environments with replicated dynamics but distinct agent policies.
PettingZoo independent learning:
Integrates RLlib with PettingZoo to facilitate independent learning among multiple agents. Each agent independently optimizes its policy within a shared environment.
PettingZoo parameter sharing:
Uses PettingZoo for an environment where all agents share a single policy.
PettingZoo shared value function:
Also using PettingZoo, this example explores shared value functions among agents. It demonstrates collaborative learning scenarios where agents collectively estimate a value function rather than individual policies.
Rock-paper-scissors heuristic vs learned:
Simulates a rock-paper-scissors game with one heuristic-driven agent and one learning agent. It provides insights into performance when combining fixed and adaptive strategies in adversarial games.
Rock-paper-scissors learned vs learned:
Sets up a rock-paper-scissors game where you train both agents to learn strategies on how to play against each other. Useful for evaluating performance in simple adversarial settings.
Self-play, league-based, with OpenSpiel:
Uses OpenSpiel to demonstrate league-based self-play, where agents play against various versions of themselves, frozen or in-training, to improve through competitive interaction.
Self-play with OpenSpiel:
Similar to the league-based self-play, but simpler. This script leverages OpenSpiel for two-player games, allowing agents to improve through direct self-play without building a complex, structured league.

Offline RL#

Train with behavioral cloning (BC), Finetune with PPO:
Combines behavioral cloning pre-training with PPO fine-tuning, providing a two-phase training strategy. Offline imitation learning as a first step followed by online reinforcement learning.

Ray Serve and RLlib#

Using Ray Serve with RLlib:
Integrates RLlib with Ray Serve, showcasing how to deploy trained RLModule instances as RESTful services. This setup is ideal for deploying models in production environments with API-based interactions.

Ray Tune and RLlib#

Custom experiment:
Configures a custom experiment with Ray Tune, demonstrating advanced options for custom training- and evaluation phases
Custom logger:
Shows how to implement a custom logger within Ray Tune, allowing users to define specific logging behaviors and outputs during training.
Custom progress reporter:
Demonstrates a custom progress reporter in Ray Tune, which enables tracking and displaying specific training metrics or status updates in a customized format.

RLModules#

Action masking:
Implements an RLModule with action masking, where certain disallowed actions are masked based on parts of the observation dict, useful for environments with conditional action availability.
Auto-regressive actions:
See here for more details.
Custom CNN-based RLModule:
Demonstrates a custom CNN architecture realized as an RLModule, enabling convolutional feature extraction tailored to the environment’s visual observations.
Custom LSTM-based RLModule:
Uses a custom LSTM within an RLModule, allowing for temporal sequence processing, beneficial for partially observable environments with sequential dependencies.
Migrate ModelV2 to RLModule by config:
Shows how to migrate a ModelV2-based setup (old API stack) to the new API stack’s RLModule, using an (old API stack) AlgorithmConfig instance.
Migrate ModelV2 to RLModule by Policy Checkpoint:
Migrates a ModelV2 (old API stack) to the new API stack’s RLModule by directly loading a policy checkpoint, enabling smooth transitions to the new API stack while preserving learned parameters.
Pretrain single-agent policy, then train in multi-agent Env:
Demonstrates pretraining a single-agent model and transferring it to a multi-agent setting, useful for initializing multi-agent scenarios with pre-trained policies.

Tuned examples#

The tuned examples folder contains python config files that you can execute analogously to all other example scripts described here to run tuned learning experiments for the different algorithms and environment types.

For example, see this tuned Atari example for PPO, which learns to solve the Pong environment in roughly 5 minutes. You can run it as follows on a single g5.24xlarge or g6.24xlarge machine with 4 GPUs and 96 CPUs:

$ cd ray/rllib/tuned_examples/ppo
$ python atari_ppo.py --env=ale_py:ALE/Pong-v5 --num-learners=4 --num-env-runners=95

Note that RLlib’s daily or weekly release tests use some of the files in this folder as well.

Community examples#

Note

The community examples listed here all refer to the old API stack of RLlib.

Arena AI:
A General Evaluation Platform and Building Toolkit for Single/Multi-Agent Intelligence with RLlib-generated baselines.
CARLA:
Example of training autonomous vehicles with RLlib and CARLA simulator.
The Emergence of Adversarial Communication in Multi-Agent Reinforcement Learning:
Using Graph Neural Networks and RLlib to train multiple cooperative and adversarial agents to solve the “cover the area”-problem, thereby learning how to best communicate or - in the adversarial case - how to disturb communication (code).
Flatland:
A dense traffic simulating environment with RLlib-generated baselines.
GFootball:
Example of setting up a multi-agent version of GFootball with RLlib.
mobile-env:
An open, minimalist Gymnasium environment for autonomous coordination in wireless mobile networks. Includes an example notebook using Ray RLlib for multi-agent RL with mobile-env.
Neural MMO:
A multiagent AI research environment inspired by Massively Multiplayer Online (MMO) role playing games – self-contained worlds featuring thousands of agents per persistent macrocosm, diverse skilling systems, local and global economies, complex emergent social structures, and ad-hoc high-stakes single and team based conflict.
NeuroCuts:
Example of building packet classification trees using RLlib / multi-agent in a bandit-like setting.
NeuroVectorizer:
Example of learning optimal LLVM vectorization compiler pragmas for loops in C and C++ codes using RLlib.
Roboschool / SageMaker:
Example of training robotic control policies in SageMaker with RLlib.
Sequential Social Dilemma Games:
Example of using the multi-agent API to model several social dilemma games.
Simple custom environment for single RL with Ray and RLlib:
Create a custom environment and train a single agent RL using Ray 2.0 with Tune.
StarCraft2:
Example of training in StarCraft2 maps with RLlib / multi-agent.
Traffic Flow:
Example of optimizing mixed-autonomy traffic simulations with RLlib / multi-agent.

Blog posts#

Note

The blog posts listed here all refer to the old API stack of RLlib.

Attention Nets and More with RLlib’s Trajectory View API:
Blog describing RLlib’s new “trajectory view API” and how it enables implementations of GTrXL attention net architectures.
Reinforcement Learning with RLlib in the Unity Game Engine:
How-To guide about connecting RLlib with the Unity3D game engine for running visual- and physics-based RL experiments.
Lessons from Implementing 12 Deep RL Algorithms in TF and PyTorch:
Discussion on how the Ray Team ported 12 of RLlib’s algorithms from TensorFlow to PyTorch and the lessons learned.
Scaling Multi-Agent Reinforcement Learning:
Blog post of a brief tutorial on multi-agent RL and its design in RLlib.
Functional RL with Keras and TensorFlow Eager:
Exploration of a functional paradigm for implementing reinforcement learning (RL) algorithms.