Note

Ray 2.10.0 introduces the alpha stage of RLlib’s “new API stack”. The team is currently transitioning algorithms, example scripts, and documentation to the new code base throughout the subsequent minor releases leading up to Ray 3.0.

See here for more details on how to activate and use the new API stack.

Environments

Any environment type you provide to RLlib (e.g. a user-defined gym.Env class) is converted internally into the BaseEnv API, whose main methods are poll() and send_actions():

[Figure: env_classes_overview.svg, an overview of RLlib's environment classes and how they convert to BaseEnv]

The BaseEnv API allows RLlib to support:

  1. Vectorization of sub-environments (i.e. individual gym.Env instances, stacked to form a vector of envs) in order to batch the model forward passes that compute actions (see the config sketch after this list).

  2. External simulators requiring async execution (e.g. envs that run on separate machines and independently request actions from a policy server).

  3. Stepping through the individual sub-environments in parallel via pre-converting them into separate @ray.remote actors.

  4. Multi-agent RL via dicts mapping agent IDs to observations, rewards, etc.
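
The first and third capabilities are typically enabled through the algorithm's config rather than through the environment itself. Below is a minimal, hedged sketch (old API stack, Ray 2.x); it assumes the rollouts() parameters num_rollout_workers, num_envs_per_worker, and remote_worker_envs are available in your Ray version:

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    .rollouts(
        num_rollout_workers=2,
        # (1) Four sub-environments per rollout worker: RLlib steps them as
        # one vectorized BaseEnv, batching the action-computing forward passes.
        num_envs_per_worker=4,
        # (3) Set to True to turn each sub-environment into its own
        # @ray.remote actor, so the sub-environments step in parallel.
        remote_worker_envs=False,
    )
)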

For example, if you provide a custom gym.Env class to RLlib, it is auto-converted to a BaseEnv along the path shown in the figure above. Here is a simple example:

# __rllib-custom-gym-env-begin__
import gymnasium as gym
import numpy as np

import ray
from ray.rllib.algorithms.ppo import PPOConfig


class SimpleCorridor(gym.Env):
    """A 1D corridor: start at position 0 and walk right to reach end_pos."""

    def __init__(self, config):
        self.end_pos = config["corridor_length"]
        self.cur_pos = 0.0
        self.action_space = gym.spaces.Discrete(2)  # 0: left, 1: right
        self.observation_space = gym.spaces.Box(0.0, self.end_pos, shape=(1,))

    def reset(self, *, seed=None, options=None):
        self.cur_pos = 0.0
        return np.array([self.cur_pos]), {}

    def step(self, action):
        if action == 0 and self.cur_pos > 0.0:  # move left (towards start)
            self.cur_pos -= 1.0
        elif action == 1:  # move right (towards goal)
            self.cur_pos += 1.0
        if self.cur_pos >= self.end_pos:
            # Goal reached: terminate (rather than truncate) with reward +1.
            return np.array([self.cur_pos]), 1.0, True, False, {}
        else:
            # Small per-step penalty to encourage reaching the goal quickly.
            return np.array([self.cur_pos]), -0.1, False, False, {}


ray.init()

config = PPOConfig().environment(SimpleCorridor, env_config={"corridor_length": 5})
algo = config.build()

for _ in range(3):
    print(algo.train())
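
# A hedged usage sketch (default, old API stack in Ray 2.x): roll out the
# trained policy in a fresh SimpleCorridor instance via
# Algorithm.compute_single_action(). Exploration is on by default, so the
# sampled actions are stochastic.
env = SimpleCorridor({"corridor_length": 5})
obs, _ = env.reset()
terminated = truncated = False
total_reward = 0.0
while not (terminated or truncated):
    action = algo.compute_single_action(obs)
    obs, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
print(f"Rolled out one episode, total reward: {total_reward}")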

algo.stop()
# __rllib-custom-gym-env-end__

However, you may also conveniently sub-class any of the other supported RLlib-specific environment types. The automated paths from those env types (or callables returning instances of those types) to an RLlib BaseEnv are shown in the figure above.
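
For instance, the multi-agent case (point 4 in the list above) is expressed by sub-classing MultiAgentEnv, whose reset() and step() methods exchange dicts keyed by agent ID, plus the special "__all__" key that signals whether the whole episode is over. The following is a minimal, hedged sketch, not an official RLlib example; it assumes the gymnasium-style MultiAgentEnv API of recent Ray 2.x versions and re-uses the corridor idea for two independent agents:

import gymnasium as gym
import numpy as np

from ray.rllib.env.multi_agent_env import MultiAgentEnv


class TwoAgentCorridor(MultiAgentEnv):
    """Two agents walk independent corridors; all I/O is keyed by agent ID."""

    def __init__(self, config=None):
        super().__init__()
        config = config or {}
        self.end_pos = config.get("corridor_length", 5)
        self._agent_ids = {"agent_0", "agent_1"}
        self.observation_space = gym.spaces.Box(0.0, self.end_pos, shape=(1,))
        self.action_space = gym.spaces.Discrete(2)  # 0: left, 1: right
        self.pos = {}

    def reset(self, *, seed=None, options=None):
        self.pos = {aid: 0.0 for aid in self._agent_ids}
        obs = {aid: np.array([p]) for aid, p in self.pos.items()}
        return obs, {}

    def step(self, action_dict):
        obs, rewards, terminateds = {}, {}, {}
        for aid, action in action_dict.items():
            if action == 0 and self.pos[aid] > 0.0:  # move left (towards start)
                self.pos[aid] -= 1.0
            elif action == 1:  # move right (towards goal)
                self.pos[aid] += 1.0
            obs[aid] = np.array([self.pos[aid]])
            terminateds[aid] = self.pos[aid] >= self.end_pos
            rewards[aid] = 1.0 if terminateds[aid] else -0.1
        # "__all__" signals whether the episode is over for every agent.
        terminateds["__all__"] = all(terminateds.values())
        truncateds = {"__all__": False}
        return obs, rewards, terminateds, truncateds, {}

To actually train on such an env, you would additionally map agents to policies via AlgorithmConfig.multi_agent(), which is beyond the scope of this sketch.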

Environment API Reference