Note

Ray 2.10.0 introduces the alpha stage of RLlib’s “new API stack”. The Ray Team plans to transition algorithms, example scripts, and documentation to the new code base, thereby incrementally replacing the “old API stack” (e.g., ModelV2, Policy, RolloutWorker) throughout the subsequent minor releases leading up to Ray 3.0.

Note, however, that so far only PPO (single- and multi-agent) and SAC (single-agent only) support the “new API stack”, and both still run by default with the old APIs. You can continue to use your existing custom (old stack) classes.

See here for more details on how to use the new API stack.

ExternalEnv API#

ExternalEnv (Single-Agent Case)#

rllib.env.external_env.ExternalEnv#

class ray.rllib.env.external_env.ExternalEnv(action_space: gymnasium.Space, observation_space: gymnasium.Space, max_concurrent: int = None)[source]#

An environment that interfaces with external agents.

Unlike simulator envs, control is inverted: The environment queries the policy to obtain actions and in return logs observations and rewards for training. This is in contrast to gym.Env, where the algorithm drives the simulation through env.step() calls.

You can use ExternalEnv as the backend for policy serving (by serving HTTP requests in the run loop), for ingesting offline logs data (by reading offline transitions in the run loop), or other custom use cases not easily expressed through gym.Env.

ExternalEnv supports both on-policy actions (through self.get_action()), and off-policy actions (through self.log_action()).

This env is thread-safe, but individual episodes must be executed serially.

from ray.tune import register_env
from ray.rllib.algorithms.dqn import DQN

YourExternalEnv = ...

# Register the external env under a name RLlib can look up.
register_env("my_env", lambda config: YourExternalEnv(config))

# Train against the external env; its run() loop supplies the data.
algo = DQN(env="my_env")
while True:
    print(algo.train())
__init__(action_space: gymnasium.Space, observation_space: gymnasium.Space, max_concurrent: int = None)[source]#

Initializes an ExternalEnv instance.

Parameters:
  • action_space – Action space of the env.

  • observation_space – Observation space of the env.

run()[source]#

Override this to implement the run loop.

Your loop should continuously:
  1. Call self.start_episode(episode_id)

  2. Call self.[get|log]_action(episode_id, obs, [action]?)

  3. Call self.log_returns(episode_id, reward)

  4. Call self.end_episode(episode_id, obs)

  5. Wait if nothing to do.

Multiple episodes may be started at the same time.
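
For illustration, here is a minimal sketch of a possible run() implementation. The CartPoleServing class and its use of a local CartPole environment as a stand-in for the external client are assumptions made for this example, not part of the API:

import gymnasium as gym

from ray.rllib.env.external_env import ExternalEnv

class CartPoleServing(ExternalEnv):
    """Hypothetical example: a local CartPole stands in for the external client."""

    def __init__(self, config=None):
        env = gym.make("CartPole-v1")
        super().__init__(env.action_space, env.observation_space)
        self._env = env

    def run(self):
        while True:
            episode_id = self.start_episode()
            obs, _ = self._env.reset()
            terminated = truncated = False
            while not (terminated or truncated):
                # On-policy: ask RLlib for an action for the current observation.
                action = self.get_action(episode_id, obs)
                obs, reward, terminated, truncated, _ = self._env.step(action)
                # The reward is attributed to the action just taken.
                self.log_returns(episode_id, reward)
            self.end_episode(episode_id, obs)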

start_episode(episode_id: str | None = None, training_enabled: bool = True) str[source]#

Record the start of an episode.

Parameters:
  • episode_id – Unique string id for the episode or None for it to be auto-assigned and returned.

  • training_enabled – Whether to use experiences for this episode to improve the policy.

Returns:

Unique string id for the episode.

get_action(episode_id: str, observation: Any) Any[source]#

Record an observation and get the on-policy action.

Parameters:
  • episode_id – Episode id returned from start_episode().

  • observation – Current environment observation.

Returns:

Action from the env action space.

log_action(episode_id: str, observation: Any, action: Any) None[source]#

Record an observation and (off-policy) action taken.

Parameters:
  • episode_id – Episode id returned from start_episode().

  • observation – Current environment observation.

  • action – Action for the observation.
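
In the off-policy case, the action is chosen outside of RLlib and only recorded. A minimal sketch analogous to the on-policy example above; the random action stands in for whatever external controller actually picks the actions:

import gymnasium as gym

from ray.rllib.env.external_env import ExternalEnv

class OffPolicyLoggingEnv(ExternalEnv):
    """Hypothetical example: actions come from outside RLlib and are only logged."""

    def __init__(self, config=None):
        env = gym.make("CartPole-v1")
        super().__init__(env.action_space, env.observation_space)
        self._env = env

    def run(self):
        while True:
            episode_id = self.start_episode()
            obs, _ = self._env.reset()
            terminated = truncated = False
            while not (terminated or truncated):
                # Externally chosen action (random here, purely for illustration).
                action = self._env.action_space.sample()
                # Record the (obs, action) pair instead of querying RLlib.
                self.log_action(episode_id, obs, action)
                obs, reward, terminated, truncated, _ = self._env.step(action)
                self.log_returns(episode_id, reward)
            self.end_episode(episode_id, obs)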

log_returns(episode_id: str, reward: float, info: dict | None = None) None[source]#

Records returns (rewards and infos) from the environment.

The reward will be attributed to the previous action taken by the episode. Rewards accumulate until the next action. If no reward is logged before the next action, a reward of 0.0 is assumed.

Parameters:
  • episode_id – Episode id returned from start_episode().

  • reward – Reward from the environment.

  • info – Optional info dict.
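
For example, the following hypothetical fragment from inside a run() loop logs two partial rewards between actions; they accumulate into a single reward of 0.7 for the preceding action (the numbers and the info contents are made up for illustration):

# Inside run(), for the current observation `obs`:
action = self.get_action(episode_id, obs)
self.log_returns(episode_id, 0.5)
self.log_returns(episode_id, 0.2, info={"source": "bonus"})  # accumulates to 0.7
# The next get_action()/log_action() call starts a new reward bucket.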

end_episode(episode_id: str, observation: Any) None[source]#

Records the end of an episode.

Parameters:
  • episode_id – Episode id returned from start_episode().

  • observation – Current environment observation.

to_base_env(make_env: Callable[[int], Any | gymnasium.Env] | None = None, num_envs: int = 1, remote_envs: bool = False, remote_env_batch_wait_ms: int = 0, restart_failed_sub_environments: bool = False) BaseEnv[source]#

Converts an RLlib ExternalEnv into a BaseEnv object.

The resulting BaseEnv is always vectorized (contains n sub-environments) to support batched forward passes, where n may also be 1. BaseEnv also supports async execution via the poll and send_actions methods and thus supports external simulators.

Parameters:
  • make_env – A callable taking an int as input (the index of the sub-environment within the final vectorized BaseEnv) and returning one individual sub-environment.

  • num_envs – The number of sub-environments to create in the resulting (vectorized) BaseEnv. The already existing env will be one of these num_envs sub-environments.

  • remote_envs – Whether each sub-env should be a @ray.remote actor. You can set this behavior in your config via the remote_worker_envs=True option.

  • remote_env_batch_wait_ms – The wait time (in ms) to poll remote sub-environments for, if applicable. Only used if remote_envs is True.

  • restart_failed_sub_environments – If True, restart any sub-environment that raises an error during stepping, instead of propagating the error.

Returns:

The resulting BaseEnv object.
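
A rough sketch of the conversion, reusing the hypothetical CartPoleServing class from above. RLlib normally performs this conversion internally; it is shown here only to illustrate the async poll()/send_actions() interface, and the exact shape of poll()'s return value may differ between Ray versions:

# Wrap an existing ExternalEnv instance as a (vectorized) BaseEnv.
external_env = CartPoleServing()
base_env = external_env.to_base_env(num_envs=1)

# poll() returns per-(sub-env, agent) dicts; off_policy_actions contains any
# actions that were logged via log_action() rather than requested via get_action().
obs, rewards, terminateds, truncateds, infos, off_policy_actions = base_env.poll()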

ExternalMultiAgentEnv (Multi-Agent Case)#

rllib.env.external_multi_agent_env.ExternalMultiAgentEnv#

If your external environment needs to support multi-agent RL, you should instead sub-class ExternalMultiAgentEnv:

class ray.rllib.env.external_multi_agent_env.ExternalMultiAgentEnv(action_space: gymnasium.Space, observation_space: gymnasium.Space)[source]#

This is the multi-agent version of ExternalEnv.

__init__(action_space: gymnasium.Space, observation_space: gymnasium.Space)[source]#

Initializes an ExternalMultiAgentEnv instance.

Parameters:
  • action_space – Action space of the env.

  • observation_space – Observation space of the env.

run()[source]#

Override this to implement the multi-agent run loop.

Your loop should continuously:
  1. Call self.start_episode(episode_id)

  2. Call self.get_action(episode_id, obs_dict)

    -or- self.log_action(episode_id, obs_dict, action_dict)

  3. Call self.log_returns(episode_id, reward_dict)

  4. Call self.end_episode(episode_id, obs_dict)

  5. Wait if nothing to do.

Multiple episodes may be started at the same time.
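
For illustration, a minimal sketch of a possible multi-agent run() implementation. The TwoAgentServing class, the agent ids, the two local CartPole copies standing in for external agents, and the fixed episode length are assumptions made for this example:

import gymnasium as gym

from ray.rllib.env.external_multi_agent_env import ExternalMultiAgentEnv

class TwoAgentServing(ExternalMultiAgentEnv):
    """Hypothetical example: two local CartPole copies act as two external agents."""

    def __init__(self, config=None):
        env = gym.make("CartPole-v1")
        super().__init__(env.action_space, env.observation_space)
        self._envs = {"agent_0": env, "agent_1": gym.make("CartPole-v1")}

    def run(self):
        while True:
            episode_id = self.start_episode()
            obs_dict = {aid: env.reset()[0] for aid, env in self._envs.items()}
            for _ in range(200):  # fixed-length toy episode for simplicity
                # On-policy: one action per agent that supplied an observation.
                action_dict = self.get_action(episode_id, obs_dict)
                reward_dict = {}
                for aid, env in self._envs.items():
                    obs, rew, terminated, truncated, _ = env.step(action_dict[aid])
                    if terminated or truncated:
                        obs, _ = env.reset()  # keep the toy episode going
                    obs_dict[aid], reward_dict[aid] = obs, rew
                self.log_returns(episode_id, reward_dict)
            self.end_episode(episode_id, obs_dict)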

start_episode(episode_id: str | None = None, training_enabled: bool = True) str[source]#

Record the start of an episode.

Parameters:
  • episode_id – Unique string id for the episode or None for it to be auto-assigned and returned.

  • training_enabled – Whether to use experiences for this episode to improve the policy.

Returns:

Unique string id for the episode.

get_action(episode_id: str, observation_dict: Dict[Any, Any]) Dict[Any, Any][source]#

Record an observation and get the on-policy action.

The observation_dict is expected to contain observations for all agents acting in this episode step.

Parameters:
  • episode_id – Episode id returned from start_episode().

  • observation_dict – Current environment observation.

Returns:

Action dict from the env action space, keyed by agent id.

log_action(episode_id: str, observation_dict: Dict[Any, Any], action_dict: Dict[Any, Any]) None[source]#

Record an observation and (off-policy) action taken.

Parameters:
  • episode_id – Episode id returned from start_episode().

  • observation_dict – Current environment observation.

  • action_dict – Action for the observation.

log_returns(episode_id: str, reward_dict: Dict[Any, Any], info_dict: Dict[Any, Any] = None, multiagent_done_dict: Dict[Any, Any] = None) None[source]#

Record returns from the environment.

The reward will be attributed to the previous action taken by the episode. Rewards accumulate until the next action. If no reward is logged before the next action, a reward of 0.0 is assumed.

Parameters:
  • episode_id – Episode id returned from start_episode().

  • reward_dict – Reward from the environment agents.

  • info_dict – Optional info dict.

  • multiagent_done_dict – Optional done dict for agents.
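
A small hypothetical fragment from inside a multi-agent run() loop; the agent ids, reward values, and info contents are made up for illustration, and the __all__ key follows RLlib's usual multi-agent done-dict convention:

# Per-agent rewards and infos, plus a done dict marking agent_1 as finished
# while the episode as a whole continues.
self.log_returns(
    episode_id,
    reward_dict={"agent_0": 1.0, "agent_1": -0.5},
    info_dict={"agent_0": {}, "agent_1": {"reason": "crashed"}},
    multiagent_done_dict={"agent_0": False, "agent_1": True, "__all__": False},
)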

end_episode(episode_id: str, observation_dict: Dict[Any, Any]) None[source]#

Record the end of an episode.

Parameters:
  • episode_id – Episode id returned from start_episode().

  • observation_dict – Current environment observation.