Note

Ray 2.10.0 introduces the alpha stage of RLlib’s “new API stack”. The team is currently transitioning algorithms, example scripts, and documentation to the new code base throughout the subsequent minor releases leading up to Ray 3.0.

See here for more details on how to activate and use the new API stack.

Examples#

This page contains an index of all the Python scripts in the examples folder of RLlib, demonstrating the different use cases and features of the library.

Note

RLlib is currently transitioning from the old to the new API stack. Some example scripts haven't been translated to the new stack yet and are tagged with the following comment line at the top: # @OldAPIStack. Moving all example scripts over to the new stack is a work in progress.

Note

If any (new API stack) example is broken, or if you'd like to add an example to this page, feel free to open an issue in RLlib's GitHub repository.

Folder Structure#

The examples folder is organized into several sub-directories, all of which are described in detail below.

How to run an example script#

Most of the example scripts are self-executable, meaning you can cd into the respective directory and run the script as-is with python:

$ cd ray/rllib/examples/multi_agent
$ python multi_agent_pendulum.py --enable-new-api-stack --num-agents=2

Use the --help command line argument to have each script print out its supported command line options.

Most of the scripts share a common subset of generally applicable command line arguments, for example --num-env-runners (to scale the number of EnvRunner actors), --no-tune (to switch off running with Ray Tune), --wandb-key (to log to W&B), or --verbose (to control log chattiness).
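
For example, a typical invocation combining several of these shared flags could look as follows (the exact set of supported flags varies per script, so check --help first):

$ python multi_agent_pendulum.py --help
$ python multi_agent_pendulum.py --enable-new-api-stack --num-agents=2 --num-env-runners=4 --no-tune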

All example sub-folders#

Actions#

  • Nested Action Spaces:

    Sets up an environment with nested action spaces using custom (single- or multi-agent) configurations. This example demonstrates how RLlib manages complex action structures, such as multi-dimensional or hierarchical action spaces.
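
For reference, this is what a nested action space can look like in plain gymnasium. The space layout below is made up for illustration and independent of the example script:

from gymnasium import spaces

# A nested (hierarchical) action space: a Dict containing a Tuple and a Discrete.
nested_action_space = spaces.Dict({
    "move": spaces.Tuple((
        spaces.Discrete(4),                         # direction
        spaces.Box(low=0.0, high=1.0, shape=(1,)),  # speed
    )),
    "tool": spaces.Discrete(3),                     # which tool to use
})

# Sampling returns a correspondingly nested structure.
sample = nested_action_space.sample()
print(sample)  # e.g. {'move': (2, array([0.53], dtype=float32)), 'tool': 1}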

Checkpoints#

Connectors#

Note

RLlib's Connector API has been rewritten from scratch for the new API stack. Connector pieces and pipelines are now referred to as ConnectorV2 (as opposed to Connector, which continues to work only on the old API stack).

  • Flatten and One-Hot Observations:

    Demonstrates how to one-hot encode discrete observations and/or flatten complex observations (Dict or Tuple), allowing RLlib to process arbitrary observation data as flat (1D) vectors. Useful for environments with complex, discrete, or hierarchical observations (see the flattening sketch after this list).

  • Observation Frame-Stacking:

    Implements frame stacking, where consecutive frames are stacked together to provide temporal context to the agent. This technique is common in environments with continuous state changes, like video frames in Atari games. Using connectors for frame stacking is more efficient because it avoids having to send large, stacked observation tensors over the network between Ray actors.

  • Mean/Std Filtering:

    Adds mean and standard deviation normalization for observations (shift by the mean and divide by std-dev), improving learning stability by scaling observations to a normalized range. This can enhance performance in environments with highly variable state magnitudes.

  • Prev-Actions, Prev-Rewards Connector:

    Augments observations with previous actions and rewards, giving the agent a short-term memory of past events, which can improve decision-making in partially observable or sequentially dependent tasks.
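
To illustrate the flattening and one-hot encoding described in the Flatten and One-Hot Observations example above, here is what gymnasium's built-in space utilities do with a complex observation. This is a plain-gymnasium sketch, independent of RLlib's ConnectorV2 classes; the space layout is made up for illustration:

from gymnasium import spaces

# A complex observation space: a Dict of a Discrete and a Box component.
obs_space = spaces.Dict({
    "status": spaces.Discrete(3),
    "sensors": spaces.Box(low=-1.0, high=1.0, shape=(2,)),
})

obs = obs_space.sample()

# `flatten` one-hot encodes the Discrete component and concatenates everything
# into a single 1D float vector.
flat_obs = spaces.flatten(obs_space, obs)
flat_space = spaces.flatten_space(obs_space)  # Box of shape (2 + 3,) = (5,)
print(flat_space.shape, flat_obs.shape)       # (5,) (5,)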

Curiosity#

  • Count-Based Curiosity:

    Implements count-based intrinsic motivation to encourage exploration of less-visited states. Using curiosity is beneficial in sparse-reward environments where agents may struggle to find rewarding paths. However, count-based methods are only feasible for environments with small observation spaces (see the sketch after this list).

  • Euclidean Distance-Based Curiosity:

    Uses Euclidean distance between states and the initial state to measure novelty, encouraging exploration by rewarding the agent for reaching “far away” regions of the environment. Suitable for sparse-reward tasks, where diverse exploration is key to success.

  • Intrinsic-Curiosity-Model (ICM) Based Curiosity:

    Adds an Intrinsic Curiosity Model (ICM) that learns to predict the next state as well as the action in between two states to measure novelty. The higher the loss of the ICM, the higher the “novelty” and thus the intrinsic reward. Ideal for complex environments with large observation spaces where reward signals are sparse.
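
As an RLlib-independent sketch of the count-based idea from the first example above: keep a visit count per (discretized) observation and add an intrinsic bonus that decays with that count. The class and the beta coefficient below are illustrative, not RLlib API:

from collections import defaultdict

import numpy as np

class CountBasedBonus:
    """Intrinsic reward r_int = beta / sqrt(N(s)); feasible only for small obs spaces."""

    def __init__(self, beta: float = 0.1):
        self.beta = beta
        self.counts = defaultdict(int)

    def __call__(self, obs) -> float:
        # Crude discretization so continuous observations become hashable keys.
        key = tuple(np.asarray(obs).round(2).ravel())
        self.counts[key] += 1
        return self.beta / np.sqrt(self.counts[key])

# Usage: add the bonus to the environment reward before training on it.
bonus = CountBasedBonus(beta=0.1)
shaped_reward = 0.0 + bonus(np.array([0.1, -0.3]))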

Curriculum Learning#

  • Curriculum Learning:

    Demonstrates curriculum learning, where the environment difficulty increases as the agent improves. This approach enables gradual learning, allowing agents to master simpler tasks before progressing to more challenging ones, ideal for environments with hierarchical or staged difficulties. Also see the curriculum learning how-to from the documentation.
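
In its simplest form, curriculum learning is a loop that raises the environment's difficulty once the agent clears a performance threshold. A generic sketch (the set_difficulty method and the threshold values are hypothetical, not RLlib's curriculum API):

class CurriculumWrapper:
    """Raise the env's difficulty whenever recent performance clears a threshold."""

    def __init__(self, env, thresholds=(50.0, 100.0, 150.0)):
        self.env = env
        self.thresholds = thresholds
        self.level = 0

    def maybe_advance(self, mean_return: float):
        # Move to the next difficulty level once the agent masters the current one.
        if self.level < len(self.thresholds) and mean_return >= self.thresholds[self.level]:
            self.level += 1
            self.env.set_difficulty(self.level)  # hypothetical env method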

Environments#

  • Custom Env Rendering Method:

    Demonstrates how to add a custom render() method to a (custom) environment, allowing visualizations of agent interactions.

  • Custom gymnasium Env:

    Implements a custom gymnasium environment from scratch, showing how to define observation and action spaces, arbitrary reward functions, as well as step and reset logic (a minimal standalone sketch follows this list).

  • Env connecting to RLlib through a TCP client:

    An external environment, running outside of RLlib and acting as a client, connects to RLlib as a server. The external env performs its own action inference using an ONNX model, sends collected data back to RLlib for training, and receives model updates from time to time from RLlib.

  • Env Rendering and Recording:

    Illustrates environment rendering and recording setups within RLlib, capturing visual outputs for later review (for example, on WandB), which is essential for tracking agent behavior in training.

  • Env with Protobuf Observations:

    Uses Protobuf for observations, demonstrating an advanced way of handling serialized data in environments. This approach is useful for integrating complex external data sources as observations.
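
For orientation, a minimal custom gymnasium environment (unrelated to the actual example scripts above) only needs observation/action spaces plus reset() and step() implementations:

import gymnasium as gym
import numpy as np
from gymnasium import spaces

class GuessTheTarget(gym.Env):
    """Toy env: move a scalar position toward a fixed target at 0.0."""

    def __init__(self, config=None):
        self.observation_space = spaces.Box(low=-10.0, high=10.0, shape=(1,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)  # 0: step left, 1: step right
        self._pos = 0.0

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self._pos = float(self.np_random.uniform(-5.0, 5.0))
        return np.array([self._pos], dtype=np.float32), {}

    def step(self, action):
        self._pos += 0.5 if action == 1 else -0.5
        terminated = abs(self._pos) < 0.25      # reached the target region
        reward = 1.0 if terminated else -0.01   # small per-step penalty
        return np.array([self._pos], dtype=np.float32), reward, terminated, False, {}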

Evaluation#

  • Custom Evaluation:

    Configures custom evaluation metrics for agent performance, allowing users to define specific success criteria beyond standard RLlib evaluation metrics.

  • Evaluation Parallel to Training:

    Runs evaluation episodes in parallel with training, reducing training time by offloading evaluation to separate processes. This is beneficial in scenarios where frequent evaluation is required without interrupting learning.

Fault Tolerance#

  • Crashing and Stalling Env:

    Simulates an environment that randomly crashes and/or stalls, allowing users to test RLlib’s fault-tolerance mechanisms. This script is useful for evaluating how RLlib handles interruptions and recovers from unexpected failures during training.

GPU (for Training and Sampling)#

  • Float16 Training and Inference:

    Configures a setup for mixed-precision (float16) training and inference, optimizing performance by reducing memory usage and speeding up computation. This is especially useful for large-scale models on compatible GPUs.

  • Fractional GPUs per Learner:

    Demonstrates allocating fractional GPUs to individual learners, enabling finer resource allocation in multi-model setups. Useful for saving resources when training smaller models, many of which can fit on a single GPU.

  • Mixed Precision Training and Float16 Inference:

    Uses mixed precision (float32 and float16) for training, while switching to float16 precision for inference, balancing stability during training with performance improvements during evaluation.
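
The general PyTorch mechanisms behind these three GPU examples look roughly as follows. This is a sketch with a stand-in model, not RLlib's actual Learner code, and it assumes a CUDA-capable GPU:

import torch

model = torch.nn.Linear(4, 2).to("cuda")  # hypothetical stand-in for an RLModule's network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()      # scales the loss to avoid float16 gradient underflow

def train_step(batch_obs, batch_targets):
    # Mixed precision: forward/backward run partly in float16, weights stay float32.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.mse_loss(model(batch_obs), batch_targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()

# Float16 inference: cast the whole model (and its inputs) to half precision.
inference_model = model.half().eval()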

Hierarchical Training#

  • Hierarchical RL Training:

    Showcases a hierarchical RL setup inspired by automatic subgoal discovery and subpolicy specialization. A high-level policy selects subgoals and assigns one of three specialized low-level policies to achieve them within a time limit, encouraging specialization and efficient task-solving. The agent has to navigate a complex grid-world environment. The example highlights the advantages of hierarchical learning over flat approaches by demonstrating significantly improved learning performance in challenging, goal-oriented tasks.

Inference (of Models/Policies)#

Learners#

  • Custom Loss Function (simple):

    Implements a custom loss function for training, demonstrating how users can define tailored loss objectives for specific environments or behaviors.

  • Custom Torch Learning Rate Schedulers:

    Adds learning rate scheduling to PPO, showing how to adjust the learning rate dynamically using PyTorch schedulers for improved training stability.

  • Separate Learning Rate and Optimizer for Value-Function:

    Configures a separate learning rate and a separate optimizer for the value function (vs the policy network), enabling differentiated training dynamics between policy and value estimation in RL algorithms.
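
Conceptually, the separate value-function optimizer boils down to partitioning the parameters into two groups, each with its own optimizer and learning rate. A plain-PyTorch sketch with hypothetical network definitions (the losses are assumed to come from separate forward passes):

import torch
import torch.nn as nn

# Hypothetical policy and value heads that share no parameters.
policy_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
value_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))

# Separate optimizers with different learning rates.
policy_optim = torch.optim.Adam(policy_net.parameters(), lr=3e-4)
value_optim = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def update(policy_loss, value_loss):
    policy_optim.zero_grad()
    policy_loss.backward()
    policy_optim.step()

    value_optim.zero_grad()
    value_loss.backward()
    value_optim.step()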

Metrics#

Multi-Agent RL#

  • Custom Heuristic Policy:

    Demonstrates running a hybrid policy setup within the MultiAgentCartPole environment, where one agent follows a hand-coded random policy while another agent trains with PPO. This example highlights integrating static and dynamic policies, suitable for environments with a mix of fixed-strategy and adaptive agents.

  • Different Spaces for Agents:

    Configures agents with differing observation and action spaces within the same environment, showcasing RLlib’s support for heterogeneous agents with varying space requirements in a single multi-agent environment.

  • Grouped Agents (Two-Step Game):

    Implements a multi-agent, grouped setup within a two-step game environment (from the QMIX paper). N agents are grouped into M teams (N >= M) for which policies and rewards are shared. This example demonstrates RLlib’s ability to manage collective objectives and interactions among grouped agents.

  • Multi-Agent CartPole:

    Runs a multi-agent version of the CartPole environment with each agent independently learning to balance its pole. This example serves as a foundational test for multi-agent reinforcement learning scenarios in simple, independent tasks.

  • Multi-Agent Pendulum:

    Extends the classic Pendulum environment into a multi-agent setting, where multiple agents attempt to balance their respective pendulums. This example highlights RLlib’s support for environments with replicated dynamics but distinct agent policies.

  • PettingZoo Independent Learning:

    Integrates RLlib with PettingZoo to facilitate independent learning among multiple agents. Each agent independently optimizes its policy within a shared environment.

  • PettingZoo Parameter Sharing:

    Uses PettingZoo for an environment where all agents share a single policy (see the parameter-sharing sketch after this list).

  • PettingZoo Shared Value Function:

    Also using PettingZoo, this example explores shared value functions among agents. It demonstrates collaborative learning scenarios where agents collectively estimate a value function rather than individual policies.

  • Rock-Paper-Scissors Heuristic vs Learned:

    Simulates a rock-paper-scissors game with one heuristic-driven agent and one learning agent. It provides insights into performance when combining fixed and adaptive strategies in adversarial games.

  • Rock-Paper-Scissors Learned vs Learned:

    Sets up a rock-paper-scissors game where both agents are trained and therefore learn strategies against each other. Useful for evaluating performance in simple adversarial settings.

  • Self-Play (League-Based) with OpenSpiel:

    Uses OpenSpiel to demonstrate league-based self-play, where agents play against various (frozen or still-learning) versions of themselves to improve through competitive interaction.

  • Self-Play with OpenSpiel:

    Similar to the league-based self-play, but simpler. This script leverages OpenSpiel for two-player games, allowing agents to improve through direct self-play without building a complex, structured league.
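
As a sketch of the policy-mapping idea behind the parameter-sharing example above: map every agent ID to one shared policy in the AlgorithmConfig. The environment name below is a hypothetical placeholder for an already registered multi-agent env:

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("my_multi_agent_env")  # hypothetical, pre-registered multi-agent env
    .multi_agent(
        policies={"shared_policy"},
        # Every agent ID maps to the same policy -> full parameter sharing.
        policy_mapping_fn=lambda agent_id, episode, **kwargs: "shared_policy",
    )
)
algo = config.build()  # then: algo.train()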

Offline RL#

Ray Serve and RLlib#

  • Ray Serve with RLlib:

    Integrates RLlib with Ray Serve, showcasing how to deploy trained RLModule instances as RESTful services. This setup is ideal for deploying models in production environments with API-based interactions.

Ray Tune and RLlib#

  • Custom Experiment:

    Configures a custom experiment with Ray Tune, demonstrating advanced options for custom training and evaluation phases.

  • Custom Logger:

    Shows how to implement a custom logger within Ray Tune, allowing users to define specific logging behaviors and outputs during training.

  • Custom Progress Reporter:

    Demonstrates a custom progress reporter in Ray Tune, which enables tracking and displaying specific training metrics or status updates in a customized format.

RLModules#

Tuned Examples#

The tuned examples folder contains Python config files that can be run in the same way as the other example scripts described here, launching tuned learning experiments for the different algorithms and environment types.

For example, see this tuned Atari example for PPO, which learns to solve the Pong environment in roughly 5 minutes. It can be run as follows on a single g5.24xlarge (or g6.24xlarge) machine with 4 GPUs and 96 CPUs:

$ cd ray/rllib/tuned_examples/ppo
$ python atari_ppo.py --env=ale_py:ALE/Pong-v5 --num-learners=4 --num-env-runners=95

Note that some of the files in this folder are used for RLlib’s daily or weekly release tests as well.

Community Examples#

Note

The community examples listed here all refer to the old API stack of RLlib.

  • Arena AI:

    A General Evaluation Platform and Building Toolkit for Single/Multi-Agent Intelligence with RLlib-generated baselines.

  • CARLA:

    Example of training autonomous vehicles with RLlib and CARLA simulator.

  • The Emergence of Adversarial Communication in Multi-Agent Reinforcement Learning:

    Using Graph Neural Networks and RLlib to train multiple cooperative and adversarial agents to solve the “cover the area” problem, thereby learning how best to communicate (or, in the adversarial case, how to disrupt communication) (code).

  • Flatland:

    A dense traffic simulating environment with RLlib-generated baselines.

  • GFootball:

    Example of setting up a multi-agent version of GFootball with RLlib.

  • mobile-env:

    An open, minimalist Gymnasium environment for autonomous coordination in wireless mobile networks. Includes an example notebook using Ray RLlib for multi-agent RL with mobile-env.

  • Neural MMO:

    A multiagent AI research environment inspired by Massively Multiplayer Online (MMO) role playing games – self-contained worlds featuring thousands of agents per persistent macrocosm, diverse skilling systems, local and global economies, complex emergent social structures, and ad-hoc high-stakes single and team based conflict.

  • NeuroCuts:

    Example of building packet classification trees using RLlib / multi-agent in a bandit-like setting.

  • NeuroVectorizer:

    Example of learning optimal LLVM vectorization compiler pragmas for loops in C and C++ codes using RLlib.

  • Roboschool / SageMaker:

    Example of training robotic control policies in SageMaker with RLlib.

  • Sequential Social Dilemma Games:

    Example of using the multi-agent API to model several social dilemma games.

  • Simple custom environment for single RL with Ray and RLlib:

    Create a custom environment and train a single RL agent using Ray 2.0 with Tune.

  • StarCraft2:

    Example of training in StarCraft2 maps with RLlib / multi-agent.

  • Traffic Flow:

    Example of optimizing mixed-autonomy traffic simulations with RLlib / multi-agent.

Blog Posts#

Note

The blog posts listed here all refer to the old API stack of RLlib.