ExternalEnv API

ExternalEnv (Single-Agent Case)

rllib.env.external_env.ExternalEnv

class ray.rllib.env.external_env.ExternalEnv(action_space: gym.Space, observation_space: gym.Space, max_concurrent: int = 100)[source]

An environment that interfaces with external agents.

Unlike simulator envs, control is inverted: The environment queries the policy to obtain actions and in return logs observations and rewards for training. This is in contrast to gym.Env, where the algorithm drives the simulation through env.step() calls.
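The inverted control flow described above can be pictured with a toy model built on two queues. This is only an illustration under stated assumptions, not RLlib's implementation: the names obs_queue, action_queue, external_application, and policy_server are hypothetical, and the "policy" simply doubles each observation.

```python
import queue
import threading

obs_queue = queue.Queue()     # environment -> policy
action_queue = queue.Queue()  # policy -> environment
received_actions = []


def external_application(observations):
    """Plays the role of the env's run loop: it asks for each action."""
    for obs in observations:
        obs_queue.put(obs)                    # hand the observation over
        action = action_queue.get(timeout=5)  # block until the policy answers
        received_actions.append(action)
    obs_queue.put(None)                       # sentinel: no more observations


def policy_server():
    """Plays the role of the algorithm: it only reacts to incoming queries."""
    while True:
        obs = obs_queue.get(timeout=5)
        if obs is None:
            break
        action_queue.put(obs * 2)             # hypothetical policy: double the observation


app_thread = threading.Thread(target=external_application, args=([1, 2, 3],))
app_thread.start()
policy_server()
app_thread.join()
print(received_actions)
```

The external application decides when a step happens. The policy side never calls anything like env.step() and only answers the queries it receives.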

You can use ExternalEnv as the backend for policy serving (by serving HTTP requests in the run loop), for ingesting offline log data (by reading offline transitions in the run loop), or for other custom use cases not easily expressed through gym.Env.

ExternalEnv supports both on-policy actions (through self.get_action()) and off-policy actions (through self.log_action()).

This env is thread-safe, but individual episodes must be executed serially.

Examples

>>> from ray.rllib.agents.dqn import DQNTrainer
>>> from ray.tune.registry import register_env
>>> # YourExternalEnv is a placeholder for your own ExternalEnv subclass.
>>> register_env("my_env", lambda config: YourExternalEnv(config))
>>> trainer = DQNTrainer(env="my_env")
>>> while True:
...     print(trainer.train())

__init__(action_space: gym.Space, observation_space: gym.Space, max_concurrent: int = 100)[source]

Initializes an ExternalEnv instance.

Parameters
  • action_space – Action space of the env.

  • observation_space – Observation space of the env.

  • max_concurrent – Max number of active episodes to allow at once. Exceeding this limit raises an error.

run()[source]

Override this to implement the run loop.

Your loop should continuously:
  1. Call self.start_episode(episode_id)

  2. Call self.get_action(episode_id, obs) or self.log_action(episode_id, obs, action)

  3. Call self.log_returns(episode_id, reward)

  4. Call self.end_episode(episode_id, obs)

  5. Wait if nothing to do.

Multiple episodes may be started at the same time.
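As a rough sketch of steps 1 to 4, the function below drives one episode through the documented methods. To keep it runnable without Ray, it uses two hypothetical stand-ins that are not part of RLlib: RecordingEnv, which only records the calls it receives and returns a placeholder action, and CountdownSim, a made-up external simulator. In a real subclass, the body of run_one_episode would sit inside run() and call the methods on self.

```python
import uuid


class RecordingEnv:
    """Hypothetical stand-in exposing the ExternalEnv episode methods.

    It only records calls so the loop can run without Ray installed.
    """

    def __init__(self):
        self.calls = []

    def start_episode(self, episode_id=None, training_enabled=True):
        episode_id = episode_id or uuid.uuid4().hex
        self.calls.append(("start_episode", episode_id))
        return episode_id

    def get_action(self, episode_id, observation):
        self.calls.append(("get_action", episode_id, observation))
        return 0  # placeholder action; a real env would query the policy

    def log_returns(self, episode_id, reward, info=None):
        self.calls.append(("log_returns", episode_id, reward))

    def end_episode(self, episode_id, observation):
        self.calls.append(("end_episode", episode_id, observation))


class CountdownSim:
    """Hypothetical external simulator: counts down from `start` to 0."""

    def __init__(self, start=3):
        self.start = start
        self.state = start

    def reset(self):
        self.state = self.start
        return self.state

    def step(self, action):
        self.state -= 1
        done = self.state == 0
        return self.state, 1.0, done  # (observation, reward, done)


def run_one_episode(env, sim):
    """One pass through steps 1-4 of the run loop."""
    episode_id = env.start_episode()               # step 1
    obs = sim.reset()
    done = False
    while not done:
        action = env.get_action(episode_id, obs)   # step 2 (on-policy)
        obs, reward, done = sim.step(action)
        env.log_returns(episode_id, reward)        # step 3
    env.end_episode(episode_id, obs)               # step 4
    return episode_id
```

Calling run_one_episode(RecordingEnv(), CountdownSim(start=3)) issues one start_episode call, three get_action/log_returns pairs, and one end_episode call with the final observation.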

start_episode(episode_id: Optional[str] = None, training_enabled: bool = True) → str[source]

Record the start of an episode.

Parameters
  • episode_id – Unique string id for the episode or None for it to be auto-assigned and returned.

  • training_enabled – Whether to use experiences for this episode to improve the policy.

Returns

Unique string id for the episode.

get_action(episode_id: str, observation: Any) → Any[source]

Record an observation and get the on-policy action.

Parameters
  • episode_id – Episode id returned from start_episode().

  • observation – Current environment observation.

Returns

Action from the env action space.

log_action(episode_id: str, observation: Any, action: Any) → None[source]

Record an observation and (off-policy) action taken.

Parameters
  • episode_id – Episode id returned from start_episode().

  • observation – Current environment observation.

  • action – Action for the observation.
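For the offline-ingestion use case, a run loop might replay logged transitions through log_action() instead of asking the policy for actions. The sketch below assumes a hypothetical log format, a list of (observation, action, reward) tuples plus a final observation, and uses unittest.mock.Mock as a stand-in for an ExternalEnv instance so that it runs without Ray.

```python
from unittest.mock import Mock


def replay_logged_episode(env, steps, final_obs):
    """Feed one logged episode to `env` through the off-policy methods.

    `steps` is a hypothetical log format: a list of
    (observation, action, reward) tuples in the order they occurred.
    """
    episode_id = env.start_episode()
    for obs, action, reward in steps:
        env.log_action(episode_id, obs, action)  # action was chosen elsewhere
        env.log_returns(episode_id, reward)      # reward for that action
    env.end_episode(episode_id, final_obs)
    return episode_id


# Stand-in for an ExternalEnv instance; it records every method call.
env = Mock()
env.start_episode.return_value = "episode-0"

logged = [([0.0, 1.0], 1, 0.5), ([0.1, 0.9], 0, -0.2)]
replay_logged_episode(env, logged, final_obs=[0.2, 0.8])
print([call[0] for call in env.method_calls])
```

The printed list shows the order in which the methods were called on the stand-in.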

log_returns(episode_id: str, reward: float, info: Optional[dict] = None) → None[source]

Records returns (rewards and infos) from the environment.

The reward will be attributed to the previous action taken by the episode. Rewards accumulate until the next action. If no reward is logged before the next action, a reward of 0.0 is assumed.

Parameters
  • episode_id – Episode id returned from start_episode().

  • reward – Reward from the environment.

  • info – Optional info dict.
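The attribution rule above can be illustrated with a small toy model. RewardAccumulator is a hypothetical class written for this example, not RLlib code. It credits every logged reward to the most recent action and assumes 0.0 when nothing was logged. Rewards logged before the first action are not modeled.

```python
class RewardAccumulator:
    """Toy model of the documented reward bookkeeping (not RLlib code)."""

    def __init__(self):
        self.pending = 0.0      # reward logged since the last action
        self.per_action = []    # final reward credited to each action
        self._has_action = False

    def on_action(self):
        # A new action closes the reward window of the previous action.
        if self._has_action:
            self.per_action.append(self.pending)
        self.pending = 0.0
        self._has_action = True

    def log_returns(self, reward):
        # Rewards accumulate until the next action (or the episode end).
        self.pending += reward

    def end_episode(self):
        # Ending the episode closes the window of the last action.
        if self._has_action:
            self.per_action.append(self.pending)
        self.pending = 0.0
        self._has_action = False


acc = RewardAccumulator()
acc.on_action()
acc.log_returns(1.0)
acc.log_returns(0.5)   # accumulates with the 1.0 above
acc.on_action()        # no reward logged for this action
acc.on_action()
acc.log_returns(2.0)
acc.end_episode()
print(acc.per_action)
```

In this run the first action is credited with 1.5, the second with the assumed 0.0, and the third with 2.0.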

end_episode(episode_id: str, observation: Any) → None[source]

Records the end of an episode.

Parameters
  • episode_id – Episode id returned from start_episode().

  • observation – Current environment observation.

to_base_env(make_env: Callable[[int], Any] = None, num_envs: int = 1, remote_envs: bool = False, remote_env_batch_wait_ms: int = 0) → ray.rllib.env.base_env.BaseEnv[source]

Converts an RLlib ExternalEnv into a BaseEnv object.

The resulting BaseEnv is always vectorized (contains n sub-environments) to support batched forward passes, where n may also be 1. BaseEnv also supports async execution via the poll and send_actions methods and thus supports external simulators.

Parameters
  • make_env – A callable taking an int as input (the index of the sub-environment within the final vectorized BaseEnv) and returning one individual sub-environment.

  • num_envs – The number of sub-environments to create in the resulting (vectorized) BaseEnv. The already existing env will be one of the num_envs.

  • remote_envs – Whether each sub-env should be a @ray.remote actor. You can set this behavior in your config via the remote_worker_envs=True option.

  • remote_env_batch_wait_ms – The wait time (in ms) to poll remote sub-environments for, if applicable. Only used if remote_envs is True.

Returns

The resulting BaseEnv object.

ExternalMultiAgentEnv (Multi-Agent Case)

rllib.env.external_multi_agent_env.ExternalMultiAgentEnv

If your external environment needs to support multi-agent RL, subclass ExternalMultiAgentEnv instead:

class ray.rllib.env.external_multi_agent_env.ExternalMultiAgentEnv(action_space: gym.Space, observation_space: gym.Space, max_concurrent: int = 100)[source]

This is the multi-agent version of ExternalEnv.

__init__(action_space: gym.Space, observation_space: gym.Space, max_concurrent: int = 100)[source]

Initializes an ExternalMultiAgentEnv instance.

Parameters
  • action_space – Action space of the env.

  • observation_space – Observation space of the env.

  • max_concurrent – Max number of active episodes to allow at once. Exceeding this limit raises an error.

run()[source]

Override this to implement the multi-agent run loop.

Your loop should continuously:
  1. Call self.start_episode(episode_id)

  2. Call self.get_action(episode_id, obs_dict)

    -or- self.log_action(episode_id, obs_dict, action_dict)

  3. Call self.log_returns(episode_id, reward_dict)

  4. Call self.end_episode(episode_id, obs_dict)

  5. Wait if nothing to do.

Multiple episodes may be started at the same time.
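The multi-agent loop follows the same steps, with dicts in place of single values. The sketch below is a minimal illustration under several assumptions: RecordingMultiAgentEnv is a hypothetical stand-in that records calls and returns a placeholder action of 0 for each agent in the observation dict, the agent ids "agent_0" and "agent_1" are invented, and the dynamics (observation equals the step count, reward 1.0 per agent) are made up.

```python
class RecordingMultiAgentEnv:
    """Hypothetical stand-in for the ExternalMultiAgentEnv episode methods."""

    def __init__(self):
        self.calls = []

    def start_episode(self, episode_id=None, training_enabled=True):
        episode_id = episode_id or "episode-0"
        self.calls.append(("start_episode", episode_id))
        return episode_id

    def get_action(self, episode_id, observation_dict):
        self.calls.append(("get_action", dict(observation_dict)))
        # Placeholder: one action (0) per agent present in the observation dict.
        return {agent_id: 0 for agent_id in observation_dict}

    def log_returns(self, episode_id, reward_dict, info_dict=None,
                    multiagent_done_dict=None):
        self.calls.append(("log_returns", dict(reward_dict)))

    def end_episode(self, episode_id, observation_dict):
        self.calls.append(("end_episode", dict(observation_dict)))


def run_two_agent_episode(env, num_steps=2):
    """One pass through steps 1-4 with dict-valued observations and rewards."""
    episode_id = env.start_episode()                       # step 1
    obs_dict = {"agent_0": 0, "agent_1": 0}
    for step in range(1, num_steps + 1):
        action_dict = env.get_action(episode_id, obs_dict)  # step 2
        # Made-up dynamics: next observation is the step count, reward is 1.0.
        obs_dict = {agent_id: step for agent_id in action_dict}
        reward_dict = {agent_id: 1.0 for agent_id in action_dict}
        env.log_returns(episode_id, reward_dict)            # step 3
    env.end_episode(episode_id, obs_dict)                   # step 4
    return episode_id


env = RecordingMultiAgentEnv()
run_two_agent_episode(env)
print([name for name, _ in env.calls])
```

Every dict passed to or returned by the stand-in is keyed by agent id, so each agent's observation, action, and reward can be matched up per step.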

start_episode(episode_id: Optional[str] = None, training_enabled: bool = True) → str[source]

Record the start of an episode.

Parameters
  • episode_id – Unique string id for the episode or None for it to be auto-assigned and returned.

  • training_enabled – Whether to use experiences for this episode to improve the policy.

Returns

Unique string id for the episode.

get_action(episode_id: str, observation_dict: Dict[Any, Any]) → Dict[Any, Any][source]

Record an observation and get the on-policy action.

observation_dict is expected to contain the observations of all agents acting in this episode step.

Parameters
  • episode_id – Episode id returned from start_episode().

  • observation_dict – Current environment observation.

Returns

Dict of actions from the env action space.

Return type

dict

log_action(episode_id: str, observation_dict: Dict[Any, Any], action_dict: Dict[Any, Any]) → None[source]

Record an observation and (off-policy) action taken.

Parameters
  • episode_id – Episode id returned from start_episode().

  • observation_dict – Current environment observation.

  • action_dict – Action for the observation.

log_returns(episode_id: str, reward_dict: Dict[Any, Any], info_dict: Dict[Any, Any] = None, multiagent_done_dict: Dict[Any, Any] = None) → None[source]

Record returns from the environment.

The reward will be attributed to the previous action taken by the episode. Rewards accumulate until the next action. If no reward is logged before the next action, a reward of 0.0 is assumed.

Parameters
  • episode_id – Episode id returned from start_episode().

  • reward_dict – Reward from the environment agents.

  • info_dict – Optional info dict.

  • multiagent_done_dict – Optional done dict for agents.

end_episode(episode_id: str, observation_dict: Dict[Any, Any]) → None[source]

Record the end of an episode.

Parameters
  • episode_id – Episode id returned from start_episode().

  • observation_dict – Current environment observation.