RLlib Package Reference

ray.rllib.policy

class ray.rllib.policy.Policy(observation_space: gym.spaces.Space, action_space: gym.spaces.Space, config: dict)[source]

An agent policy and loss, i.e., a TFPolicy or other subclass.

This object defines how to act in the environment, and also losses used to improve the policy based on its experiences. Note that both policy and loss are defined together for convenience, though the policy itself is logically separate.

All policies can directly extend Policy, however TensorFlow users may find TFPolicy simpler to implement. TFPolicy also enables RLlib to apply TensorFlow-specific optimizations such as fusing multiple policy graphs and multi-GPU support.

observation_space

Observation space of the policy.

Type

gym.Space

action_space

Action space of the policy.

Type

gym.Space

exploration

The exploration object to use for computing actions, or None.

Type

Exploration

abstract compute_actions(obs_batch: Union[List[Any], Any], state_batches: Optional[List[Any]] = None, prev_action_batch: Union[List[Any], Any] = None, prev_reward_batch: Union[List[Any], Any] = None, info_batch: Optional[Dict[str, list]] = None, episodes: Optional[List[MultiAgentEpisode]] = None, explore: Optional[bool] = None, timestep: Optional[int] = None, **kwargs) → Tuple[Any, List[Any], Dict[str, Any]][source]

Computes actions for the current policy.

Parameters
  • obs_batch (Union[List[TensorType], TensorType]) – Batch of observations.

  • state_batches (Optional[List[TensorType]]) – List of RNN state input batches, if any.

  • prev_action_batch (Union[List[TensorType], TensorType]) – Batch of previous action values.

  • prev_reward_batch (Union[List[TensorType], TensorType]) – Batch of previous rewards.

  • info_batch (Optional[Dict[str, list]]) – Batch of info objects.

  • episodes (Optional[List[MultiAgentEpisode]]) – List of MultiAgentEpisode, one for each obs in obs_batch. This provides access to all of the internal episode state, which may be useful for model-based or multiagent algorithms.

  • explore (Optional[bool]) – Whether to pick an exploitation or exploration action. Set to None (default) for using the value of self.config[“explore”].

  • timestep (Optional[int]) – The current (sampling) time step.

Keyword Arguments

kwargs – forward compatibility placeholder

Returns

actions (TensorType): Batch of output actions, with shape like [BATCH_SIZE, ACTION_SHAPE].

state_outs (List[TensorType]): List of RNN state output batches, if any, with shape like [STATE_SIZE, BATCH_SIZE].

info (List[dict]): Dictionary of extra feature batches, if any, with shape like {"f1": [BATCH_SIZE, ...], "f2": [BATCH_SIZE, ...]}.

Return type

Tuple
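
A minimal usage sketch of calling compute_actions on a policy obtained from a trainer; the trainer variable, the Trainer.get_policy() call, and the assumption of a non-recurrent model are illustrative only.

import numpy as np

# `trainer` is assumed to be an already-built RLlib Trainer (e.g. PPOTrainer).
policy = trainer.get_policy()

# Batch of 4 observations sampled from the policy's observation space.
obs_batch = np.stack([policy.observation_space.sample() for _ in range(4)])

# explore=False requests exploitation actions; state_outs is an empty list
# for non-recurrent models, and info holds extra per-batch outputs.
actions, state_outs, info = policy.compute_actions(obs_batch, explore=False)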

compute_single_action(obs: Any, state: Optional[List[Any]] = None, prev_action: Optional[Any] = None, prev_reward: Optional[Any] = None, info: dict = None, episode: Optional[MultiAgentEpisode] = None, clip_actions: bool = False, explore: Optional[bool] = None, timestep: Optional[int] = None, **kwargs) → Tuple[Any, List[Any], Dict[str, Any]][source]

Unbatched version of compute_actions.

Parameters
  • obs (TensorType) – Single observation.

  • state (Optional[List[TensorType]]) – List of RNN state inputs, if any.

  • prev_action (Optional[TensorType]) – Previous action value, if any.

  • prev_reward (Optional[TensorType]) – Previous reward, if any.

  • info (dict) – Info object, if any.

  • episode (Optional[MultiAgentEpisode]) – this provides access to all of the internal episode state, which may be useful for model-based or multi-agent algorithms.

  • clip_actions (bool) – Should actions be clipped?

  • explore (Optional[bool]) – Whether to pick an exploitation or exploration action (default: None -> use self.config[“explore”]).

  • timestep (Optional[int]) – The current (sampling) time step.

Keyword Arguments

kwargs – Forward compatibility.

Returns

  • actions (TensorType): Single action.

  • state_outs (List[TensorType]): List of RNN state outputs,

    if any.

  • info (dict): Dictionary of extra features, if any.

Return type

Tuple
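
A minimal rollout-loop sketch using compute_single_action against a gym env; the "CartPole-v0" env, the trainer variable, and the non-recurrent assumption are illustrative only.

import gym

env = gym.make("CartPole-v0")
policy = trainer.get_policy()  # `trainer` assumed to exist, as above

obs = env.reset()
done = False
total_reward = 0.0
while not done:
    # Returns (action, rnn_state_out, extra_info); rnn_state_out is an
    # empty list for non-recurrent models.
    action, _, _ = policy.compute_single_action(obs, explore=False)
    obs, reward, done, _ = env.step(action)
    total_reward += reward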

compute_actions_from_trajectories(trajectories: List[Trajectory], other_trajectories: Optional[Dict[Any, Trajectory]] = None, explore: bool = None, timestep: Optional[int] = None, **kwargs) → Tuple[Any, List[Any], Dict[str, Any]][source]

Computes actions for the current policy based on trajectory data.

Note: This is an experimental API method.

Only used so far by the Sampler iff _use_trajectory_view_api=True (also only supported for torch).

Parameters
  • trajectories (List[Trajectory]) – A List of Trajectory data used to create a view for the Model forward call.

  • other_trajectories (Optional[Dict[AgentID, Trajectory]]) – Optional dict mapping AgentIDs to Trajectory objects.

  • explore (bool) – Whether to pick an exploitation or exploration action (default: None -> use self.config[“explore”]).

  • timestep (Optional[int]) – The current (sampling) time step.

  • kwargs – forward compatibility placeholder

Returns

actions (TensorType): Batch of output actions, with shape like [BATCH_SIZE, ACTION_SHAPE].

state_outs (List[TensorType]): List of RNN state output batches, if any, with shape like [STATE_SIZE, BATCH_SIZE].

info (dict): Dictionary of extra feature batches, if any, with shape like {"f1": [BATCH_SIZE, ...], "f2": [BATCH_SIZE, ...]}.

Return type

Tuple

compute_log_likelihoods(actions: Union[List[Any], Any], obs_batch: Union[List[Any], Any], state_batches: Optional[List[Any]] = None, prev_action_batch: Union[List[Any], Any, None] = None, prev_reward_batch: Union[List[Any], Any, None] = None) → Any[source]

Computes the log-prob/likelihood for a given action and observation.

Parameters
  • actions (Union[List[TensorType], TensorType]) – Batch of actions, for which to retrieve the log-probs/likelihoods (given all other inputs: obs, states, ..).

  • obs_batch (Union[List[TensorType], TensorType]) – Batch of observations.

  • state_batches (Optional[List[TensorType]]) – List of RNN state input batches, if any.

  • prev_action_batch (Optional[Union[List[TensorType], TensorType]]) – Batch of previous action values.

  • prev_reward_batch (Optional[Union[List[TensorType], TensorType]]) – Batch of previous rewards.

Returns

Batch of log probs/likelihoods, with shape [BATCH_SIZE].

Return type

TensorType

postprocess_trajectory(sample_batch: ray.rllib.policy.sample_batch.SampleBatch, other_agent_batches: Optional[Dict[Any, Tuple[Policy, ray.rllib.policy.sample_batch.SampleBatch]]] = None, episode: Optional[MultiAgentEpisode] = None) → ray.rllib.policy.sample_batch.SampleBatch[source]

Implements algorithm-specific trajectory postprocessing.

This will be called on each trajectory fragment computed during policy evaluation. Each fragment is guaranteed to be only from one episode.

Parameters
  • sample_batch (SampleBatch) – batch of experiences for the policy, which will contain at most one episode trajectory.

  • other_agent_batches (dict) – In a multi-agent env, this contains a mapping of agent ids to (policy, agent_batch) tuples containing the policy and experiences of the other agents.

  • episode (Optional[MultiAgentEpisode]) – An optional multi-agent episode object to provide access to all of the internal episode state, which may be useful for model-based or multi-agent algorithms.

Returns

Postprocessed sample batch.

Return type

SampleBatch
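
A sketch of overriding postprocess_trajectory in a Policy subclass to append a custom column to each trajectory fragment; the "my_bonus" key and the reward-scaling logic are purely illustrative.

from ray.rllib.policy import Policy
from ray.rllib.policy.sample_batch import SampleBatch


class MyPolicy(Policy):
    def postprocess_trajectory(self,
                               sample_batch,
                               other_agent_batches=None,
                               episode=None):
        # sample_batch contains at most one episode fragment; add a
        # hypothetical bonus signal derived from the rewards.
        sample_batch["my_bonus"] = 0.1 * sample_batch[SampleBatch.REWARDS]
        return sample_batch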

learn_on_batch(samples: ray.rllib.policy.sample_batch.SampleBatch) → Dict[str, Any][source]

Fused compute gradients and apply gradients call.

Either this or the combination of compute/apply grads must be implemented by subclasses.

Parameters

samples (SampleBatch) – The SampleBatch object to learn from.

Returns

Dictionary of extra metadata from compute_gradients().

Return type

Dict[str, TensorType]

Examples

>>> sample_batch = ev.sample()
>>> ev.learn_on_batch(sample_batch)
compute_gradients(postprocessed_batch: ray.rllib.policy.sample_batch.SampleBatch) → Tuple[Union[List[Tuple[Any, Any]], List[Any]], Dict[str, Any]][source]

Computes gradients against a batch of experiences.

Either this or learn_on_batch() must be implemented by subclasses.

Parameters

postprocessed_batch (SampleBatch) – The SampleBatch object to use for calculating gradients.

Returns

  • List of gradient output values.

  • Extra policy-specific info values.

Return type

Tuple[ModelGradients, Dict[str, TensorType]]

apply_gradients(gradients: Union[List[Tuple[Any, Any]], List[Any]]) → None[source]

Applies previously computed gradients.

Either this or learn_on_batch() must be implemented by subclasses.

Parameters

gradients (ModelGradients) – The already calculated gradients to apply to this Policy.
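
When compute_gradients and apply_gradients are implemented, a learner can separate the two steps, e.g. to gather gradients from several policies before applying them. A sketch, assuming `policies` is a list of compatible Policy instances and `batches` their postprocessed SampleBatches.

# Compute gradients separately on each (hypothetical) policy copy.
grads_and_infos = [p.compute_gradients(b) for p, b in zip(policies, batches)]

# Apply the gradient lists to a central policy one after another
# (averaging them first is omitted for brevity).
central_policy = policies[0]
for grads, _info in grads_and_infos:
    central_policy.apply_gradients(grads)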

get_weights() → dict[source]

Returns model weights.

Returns

Serializable copy or view of model weights.

Return type

ModelWeights

set_weights(weights: dict) → None[source]

Sets model weights.

Parameters

weights (ModelWeights) – Serializable copy or view of model weights.
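
get_weights and set_weights are typically used together to synchronize a rollout copy of a policy with a learner copy. A minimal sketch, assuming `learner_policy` and `rollout_policy` are two Policy instances built from the same config.

# Serializable copy/view of the learner's current model weights.
weights = learner_policy.get_weights()

# Push them into the rollout copy; both models are assumed to have
# identical structure.
rollout_policy.set_weights(weights)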

get_exploration_info() → Dict[str, Any][source]

Returns the current exploration information of this policy.

This information depends on the policy’s Exploration object.

Returns

Serializable information on the self.exploration object.

Return type

Dict[str, TensorType]

is_recurrent() → bool[source]

Whether this Policy holds a recurrent Model.

Returns

True if this Policy has-a RNN-based Model.

Return type

bool

num_state_tensors() → int[source]

The number of internal states needed by the RNN-Model of the Policy.

Returns

The number of RNN internal states kept by this Policy’s Model.

Return type

int

get_initial_state() → List[Any][source]

Returns initial RNN state for the current policy.

Returns

Initial RNN state for the current policy.

Return type

List[TensorType]
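
For recurrent models, get_initial_state provides the RNN state to feed into the first compute_single_action call; the returned state is then passed back in on each subsequent step. A sketch, assuming `policy` is a recurrent Policy and `env` a gym-style env.

obs = env.reset()
state = policy.get_initial_state()  # list of per-step RNN state tensors
done = False
while not done:
    # The second return value is the updated RNN state for the next step.
    action, state, _ = policy.compute_single_action(obs, state=state)
    obs, reward, done, _ = env.step(action)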

get_state() → Union[Dict[str, Any], List[Any]][source]

Saves all local state.

Returns

Serialized local state.

Return type

Union[Dict[str, TensorType], List[TensorType]]

set_state(state: object) → None[source]

Restores all local state.

Parameters

state (obj) – Serialized local state.

on_global_var_update(global_vars: Dict[str, Any]) → None[source]

Called on an update to global vars.

Parameters

global_vars (Dict[str, TensorType]) – Global variables by str key, broadcast from the driver.

export_model(export_dir: str) → None[source]

Export Policy to local directory for serving.

Parameters

export_dir (str) – Local writable directory.

export_checkpoint(export_dir: str) → None[source]

Export Policy checkpoint to local directory.

Parameters

export_dir (str) – Local writable directory.

import_model_from_h5(import_file: str) → None[source]

Imports Policy from local file.

Parameters

import_file (str) – Local readable file.

class ray.rllib.policy.TorchPolicy(observation_space: gym.spaces.Space, action_space: gym.spaces.Space, config: dict, *, model: ray.rllib.models.modelv2.ModelV2, loss: Callable[[ray.rllib.policy.policy.Policy, ray.rllib.models.modelv2.ModelV2, type, ray.rllib.policy.sample_batch.SampleBatch], Any], action_distribution_class: ray.rllib.models.torch.torch_action_dist.TorchDistributionWrapper, action_sampler_fn: Callable[[Any, List[Any]], Tuple[Any, Any]] = None, action_distribution_fn: Optional[Callable[[ray.rllib.policy.policy.Policy, ray.rllib.models.modelv2.ModelV2, Any, Any, Any], Tuple[Any, type, List[Any]]]] = None, max_seq_len: int = 20, get_batch_divisibility_req: Optional[int] = None)[source]

Template for a PyTorch policy and loss to use with RLlib.

This is similar to TFPolicy, but for PyTorch.

observation_space

observation space of the policy.

Type

gym.Space

action_space

action space of the policy.

Type

gym.Space

config

config of the policy.

Type

dict

model

Torch model instance.

Type

TorchModel

dist_class

Torch action distribution class.

Type

type

compute_actions(obs_batch: Union[List[Any], Any], state_batches: Optional[List[Any]] = None, prev_action_batch: Union[List[Any], Any] = None, prev_reward_batch: Union[List[Any], Any] = None, info_batch: Optional[Dict[str, list]] = None, episodes: Optional[List[MultiAgentEpisode]] = None, explore: Optional[bool] = None, timestep: Optional[int] = None, **kwargs) → Tuple[Any, List[Any], Dict[str, Any]][source]

Computes actions for the current policy.

Parameters
  • obs_batch (Union[List[TensorType], TensorType]) – Batch of observations.

  • state_batches (Optional[List[TensorType]]) – List of RNN state input batches, if any.

  • prev_action_batch (Union[List[TensorType], TensorType]) – Batch of previous action values.

  • prev_reward_batch (Union[List[TensorType], TensorType]) – Batch of previous rewards.

  • info_batch (Optional[Dict[str, list]]) – Batch of info objects.

  • episodes (Optional[List[MultiAgentEpisode]]) – List of MultiAgentEpisode, one for each obs in obs_batch. This provides access to all of the internal episode state, which may be useful for model-based or multiagent algorithms.

  • explore (Optional[bool]) – Whether to pick an exploitation or exploration action. Set to None (default) for using the value of self.config[“explore”].

  • timestep (Optional[int]) – The current (sampling) time step.

Keyword Arguments

kwargs – forward compatibility placeholder

Returns

actions (TensorType): Batch of output actions, with shape like [BATCH_SIZE, ACTION_SHAPE].

state_outs (List[TensorType]): List of RNN state output batches, if any, with shape like [STATE_SIZE, BATCH_SIZE].

info (List[dict]): Dictionary of extra feature batches, if any, with shape like {"f1": [BATCH_SIZE, ...], "f2": [BATCH_SIZE, ...]}.

Return type

Tuple

compute_actions_from_trajectories(trajectories: List[Trajectory], other_trajectories: Optional[Dict[Any, Trajectory]] = None, explore: bool = None, timestep: Optional[int] = None, **kwargs) → Tuple[Any, List[Any], Dict[str, Any]][source]

Computes actions for the current policy based on trajectory data.

Note: This is an experimental API method.

Only used so far by the Sampler iff _use_trajectory_view_api=True (also only supported for torch).

Parameters
  • trajectories (List[Trajectory]) – A List of Trajectory data used to create a view for the Model forward call.

  • other_trajectories (Optional[Dict[AgentID, Trajectory]]) – Optional dict mapping AgentIDs to Trajectory objects.

  • explore (bool) – Whether to pick an exploitation or exploration action (default: None -> use self.config[“explore”]).

  • timestep (Optional[int]) – The current (sampling) time step.

  • kwargs – forward compatibility placeholder

Returns

actions (TensorType): Batch of output actions, with shape like [BATCH_SIZE, ACTION_SHAPE].

state_outs (List[TensorType]): List of RNN state output batches, if any, with shape like [STATE_SIZE, BATCH_SIZE].

info (dict): Dictionary of extra feature batches, if any, with shape like {"f1": [BATCH_SIZE, ...], "f2": [BATCH_SIZE, ...]}.

Return type

Tuple

compute_log_likelihoods(actions: Union[List[Any], Any], obs_batch: Union[List[Any], Any], state_batches: Optional[List[Any]] = None, prev_action_batch: Union[List[Any], Any, None] = None, prev_reward_batch: Union[List[Any], Any, None] = None) → Any[source]

Computes the log-prob/likelihood for a given action and observation.

Parameters
  • actions (Union[List[TensorType], TensorType]) – Batch of actions, for which to retrieve the log-probs/likelihoods (given all other inputs: obs, states, ..).

  • obs_batch (Union[List[TensorType], TensorType]) – Batch of observations.

  • state_batches (Optional[List[TensorType]]) – List of RNN state input batches, if any.

  • prev_action_batch (Optional[Union[List[TensorType], TensorType]]) – Batch of previous action values.

  • prev_reward_batch (Optional[Union[List[TensorType], TensorType]]) – Batch of previous rewards.

Returns

Batch of log probs/likelihoods, with shape [BATCH_SIZE].

Return type

TensorType

learn_on_batch(postprocessed_batch: ray.rllib.policy.sample_batch.SampleBatch) → Dict[str, Any][source]

Fused compute gradients and apply gradients call.

Either this or the combination of compute/apply grads must be implemented by subclasses.

Parameters

postprocessed_batch (SampleBatch) – The SampleBatch object to learn from.

Returns

Dictionary of extra metadata from compute_gradients().

Return type

Dict[str, TensorType]

Examples

>>> sample_batch = ev.sample()
>>> ev.learn_on_batch(sample_batch)
compute_gradients(postprocessed_batch: ray.rllib.policy.sample_batch.SampleBatch) → Union[List[Tuple[Any, Any]], List[Any]][source]

Computes gradients against a batch of experiences.

Either this or learn_on_batch() must be implemented by subclasses.

Parameters

postprocessed_batch (SampleBatch) – The SampleBatch object to use for calculating gradients.

Returns

  • List of gradient output values.

  • Extra policy-specific info values.

Return type

Tuple[ModelGradients, Dict[str, TensorType]]

apply_gradients(gradients: Union[List[Tuple[Any, Any]], List[Any]]) → None[source]

Applies previously computed gradients.

Either this or learn_on_batch() must be implemented by subclasses.

Parameters

gradients (ModelGradients) – The already calculated gradients to apply to this Policy.

get_weights() → dict[source]

Returns model weights.

Returns

Serializable copy or view of model weights.

Return type

ModelWeights

set_weights(weights: dict) → None[source]

Sets model weights.

Parameters

weights (ModelWeights) – Serializable copy or view of model weights.

is_recurrent() → bool[source]

Whether this Policy holds a recurrent Model.

Returns

True if this Policy has-a RNN-based Model.

Return type

bool

num_state_tensors() → int[source]

The number of internal states needed by the RNN-Model of the Policy.

Returns

The number of RNN internal states kept by this Policy’s Model.

Return type

int

get_initial_state() → List[Any][source]

Returns initial RNN state for the current policy.

Returns

Initial RNN state for the current policy.

Return type

List[TensorType]

get_state() → Union[Dict[str, Any], List[Any]][source]

Saves all local state.

Returns

Serialized local state.

Return type

Union[Dict[str, TensorType], List[TensorType]]

set_state(state: object) → None[source]

Restores all local state.

Parameters

state (obj) – Serialized local state.

extra_grad_process(optimizer: torch.optim.Optimizer, loss: Any)[source]

Called after each optimizer.zero_grad() + loss.backward() call.

Called for each self._optimizers/loss-value pair. Allows for gradient processing before optimizer.step() is called. E.g. for gradient clipping.

Parameters
  • optimizer (torch.optim.Optimizer) – A torch optimizer object.

  • loss (TensorType) – The loss tensor associated with the optimizer.

Returns

A dict with information on the gradient processing step.

Return type

Dict[str, TensorType]

extra_compute_grad_fetches() → Dict[str, any][source]

Extra values to fetch and return from compute_gradients().

Returns

Extra fetch dict to be added to the fetch dict of the compute_gradients call.

Return type

Dict[str, any]

extra_action_out(input_dict: Dict[str, Any], state_batches: List[Any], model: ray.rllib.models.torch.torch_modelv2.TorchModelV2, action_dist: ray.rllib.models.torch.torch_action_dist.TorchDistributionWrapper) → Dict[str, Any][source]

Returns dict of extra info to include in experience batch.

Parameters
  • input_dict (Dict[str, TensorType]) – Dict of model input tensors.

  • state_batches (List[TensorType]) – List of state tensors.

  • model (TorchModelV2) – Reference to the model object.

  • action_dist (TorchDistributionWrapper) – Torch action dist object to get log-probs (e.g. for already sampled actions).

Returns

Extra outputs to return in a compute_actions() call (3rd return value).

Return type

Dict[str, TensorType]

extra_grad_info(train_batch: ray.rllib.policy.sample_batch.SampleBatch) → Dict[str, Any][source]

Return dict of extra grad info.

Parameters

train_batch (SampleBatch) – The training batch for which to produce extra grad info.

Returns

The info dict carrying grad info per str key.

Return type

Dict[str, TensorType]

optimizer() → Union[List[torch.optim.Optimizer], torch.optim.Optimizer][source]

The custom local PyTorch optimizer(s) to use.

Returns

The local PyTorch optimizer(s) to use for this Policy.

Return type

Union[List[torch.optim.Optimizer], torch.optim.Optimizer]

export_model(export_dir: str) → None[source]

TODO(sven): implement for torch.

export_checkpoint(export_dir: str) → None[source]

TODO(sven): implement for torch.

import_model_from_h5(import_file: str) → None[source]

Imports weights into torch model.

class ray.rllib.policy.TFPolicy(observation_space: gym.spaces.Space, action_space: gym.spaces.Space, config: dict, sess: tf1.Session, obs_input: Any, sampled_action: Any, loss: Any, loss_inputs: List[Tuple[str, Any]], model: ray.rllib.models.modelv2.ModelV2 = None, sampled_action_logp: Optional[Any] = None, action_input: Optional[Any] = None, log_likelihood: Optional[Any] = None, dist_inputs: Optional[Any] = None, dist_class: Optional[type] = None, state_inputs: Optional[List[Any]] = None, state_outputs: Optional[List[Any]] = None, prev_action_input: Optional[Any] = None, prev_reward_input: Optional[Any] = None, seq_lens: Optional[Any] = None, max_seq_len: int = 20, batch_divisibility_req: int = 1, update_ops: List[Any] = None, explore: Optional[Any] = None, timestep: Optional[Any] = None)[source]

An agent policy and loss implemented in TensorFlow.

Do not sub-class this class directly (neither should you sub-class DynamicTFPolicy), but rather use rllib.policy.tf_policy_template.build_tf_policy to generate your custom tf (graph-mode or eager) Policy classes.

Extending this class enables RLlib to perform TensorFlow specific optimizations on the policy, e.g., parallelization across gpus or fusing multiple graphs together in the multi-agent setting.

Input tensors are typically shaped like [BATCH_SIZE, …].

observation_space

observation space of the policy.

Type

gym.Space

action_space

action space of the policy.

Type

gym.Space

model

RLlib model used for the policy.

Type

rllib.models.Model

Examples

>>> policy = TFPolicySubclass(
    sess, obs_input, sampled_action, loss, loss_inputs)
>>> print(policy.compute_actions([1, 0, 2]))
(array([0, 1, 1]), [], {})
>>> print(policy.postprocess_trajectory(SampleBatch({...})))
SampleBatch({"action": ..., "advantages": ..., ...})
variables()[source]

Return the list of all savable variables for this policy.

get_placeholder(name) → tf1.placeholder[source]

Returns the given action or loss input placeholder by name.

If the loss has not been initialized and a loss input placeholder is requested, an error is raised.

Parameters

name (str) – The name of the placeholder to return. One of SampleBatch.CUR_OBS|PREV_ACTION/REWARD or a valid key from self._loss_input_dict.

Returns

The placeholder under the given str key.

Return type

tf1.placeholder

get_session() → tf1.Session[source]

Returns a reference to the TF session for this policy.

loss_initialized() → bool[source]

Returns whether the loss function has been initialized.

compute_actions(obs_batch: Union[List[Any], Any], state_batches: Optional[List[Any]] = None, prev_action_batch: Union[List[Any], Any] = None, prev_reward_batch: Union[List[Any], Any] = None, info_batch: Optional[Dict[str, list]] = None, episodes: Optional[List[MultiAgentEpisode]] = None, explore: Optional[bool] = None, timestep: Optional[int] = None, **kwargs)[source]

Computes actions for the current policy.

Parameters
  • obs_batch (Union[List[TensorType], TensorType]) – Batch of observations.

  • state_batches (Optional[List[TensorType]]) – List of RNN state input batches, if any.

  • prev_action_batch (Union[List[TensorType], TensorType]) – Batch of previous action values.

  • prev_reward_batch (Union[List[TensorType], TensorType]) – Batch of previous rewards.

  • info_batch (Optional[Dict[str, list]]) – Batch of info objects.

  • episodes (Optional[List[MultiAgentEpisode]]) – List of MultiAgentEpisode, one for each obs in obs_batch. This provides access to all of the internal episode state, which may be useful for model-based or multiagent algorithms.

  • explore (Optional[bool]) – Whether to pick an exploitation or exploration action. Set to None (default) for using the value of self.config[“explore”].

  • timestep (Optional[int]) – The current (sampling) time step.

Keyword Arguments

kwargs – forward compatibility placeholder

Returns

actions (TensorType): Batch of output actions, with shape like [BATCH_SIZE, ACTION_SHAPE].

state_outs (List[TensorType]): List of RNN state output batches, if any, with shape like [STATE_SIZE, BATCH_SIZE].

info (List[dict]): Dictionary of extra feature batches, if any, with shape like {"f1": [BATCH_SIZE, ...], "f2": [BATCH_SIZE, ...]}.

Return type

Tuple

compute_log_likelihoods(actions: Union[List[Any], Any], obs_batch: Union[List[Any], Any], state_batches: Optional[List[Any]] = None, prev_action_batch: Union[List[Any], Any, None] = None, prev_reward_batch: Union[List[Any], Any, None] = None) → Any[source]

Computes the log-prob/likelihood for a given action and observation.

Parameters
  • actions (Union[List[TensorType], TensorType]) – Batch of actions, for which to retrieve the log-probs/likelihoods (given all other inputs: obs, states, ..).

  • obs_batch (Union[List[TensorType], TensorType]) – Batch of observations.

  • state_batches (Optional[List[TensorType]]) – List of RNN state input batches, if any.

  • prev_action_batch (Optional[Union[List[TensorType], TensorType]]) – Batch of previous action values.

  • prev_reward_batch (Optional[Union[List[TensorType], TensorType]]) – Batch of previous rewards.

Returns

Batch of log probs/likelihoods, with shape [BATCH_SIZE].

Return type

TensorType

learn_on_batch(postprocessed_batch: ray.rllib.policy.sample_batch.SampleBatch) → Dict[str, Any][source]

Fused compute gradients and apply gradients call.

Either this or the combination of compute/apply grads must be implemented by subclasses.

Parameters

postprocessed_batch (SampleBatch) – The SampleBatch object to learn from.

Returns

Dictionary of extra metadata from compute_gradients().

Return type

Dict[str, TensorType]

Examples

>>> sample_batch = ev.sample()
>>> ev.learn_on_batch(sample_batch)
compute_gradients(postprocessed_batch: ray.rllib.policy.sample_batch.SampleBatch) → Tuple[Union[List[Tuple[Any, Any]], List[Any]], Dict[str, Any]][source]

Computes gradients against a batch of experiences.

Either this or learn_on_batch() must be implemented by subclasses.

Parameters

postprocessed_batch (SampleBatch) – The SampleBatch object to use for calculating gradients.

Returns

  • List of gradient output values.

  • Extra policy-specific info values.

Return type

Tuple[ModelGradients, Dict[str, TensorType]]

apply_gradients(gradients: Union[List[Tuple[Any, Any]], List[Any]]) → None[source]

Applies previously computed gradients.

Either this or learn_on_batch() must be implemented by subclasses.

Parameters

gradients (ModelGradients) – The already calculated gradients to apply to this Policy.

get_exploration_info() → Dict[str, Any][source]

Returns the current exploration information of this policy.

This information depends on the policy’s Exploration object.

Returns

Serializable information on the self.exploration object.

Return type

Dict[str, TensorType]

get_weights() → Union[Dict[str, Any], List[Any]][source]

Returns model weights.

Returns

Serializable copy or view of model weights.

Return type

ModelWeights

set_weights(weights) → None[source]

Sets model weights.

Parameters

weights (ModelWeights) – Serializable copy or view of model weights.

get_state() → Union[Dict[str, Any], List[Any]][source]

Saves all local state.

Returns

Serialized local state.

Return type

Union[Dict[str, TensorType], List[TensorType]]

set_state(state) → None[source]

Restores all local state.

Parameters

state (obj) – Serialized local state.

export_model(export_dir: str) → None[source]

Export tensorflow graph to export_dir for serving.

export_checkpoint(export_dir: str, filename_prefix: str = 'model') → None[source]

Export tensorflow checkpoint to export_dir.

import_model_from_h5(import_file: str) → None[source]

Imports weights into tf model.

copy(existing_inputs: List[Tuple[str, tf1.placeholder]]) → ray.rllib.policy.tf_policy.TFPolicy[source]

Creates a copy of self using existing input placeholders.

Optional: Only required to work with the multi-GPU optimizer.

Parameters

existing_inputs (List[Tuple[str, tf1.placeholder]]) – List of tuples mapping names (str) to tf1.placeholders to re-use (share) with the returned copy of self.

Returns

A copy of self.

Return type

TFPolicy

is_recurrent() → bool[source]

Whether this Policy holds a recurrent Model.

Returns

True if this Policy has-a RNN-based Model.

Return type

bool

num_state_tensors() → int[source]

The number of internal states needed by the RNN-Model of the Policy.

Returns

The number of RNN internal states kept by this Policy’s Model.

Return type

int

extra_compute_action_feed_dict() → Dict[Any, Any][source]

Extra dict to pass to the compute actions session run.

Returns

A feed dict to be added to the feed_dict passed to the compute_actions session.run() call.

Return type

Dict[TensorType, TensorType]

extra_compute_action_fetches() → Dict[str, Any][source]

Extra values to fetch and return from compute_actions().

By default we return action probability/log-likelihood info and action distribution inputs (if present).

Returns

An extra fetch-dict to be passed to and returned from the compute_actions() call.

Return type

Dict[str, TensorType]

extra_compute_grad_feed_dict() → Dict[Any, Any][source]

Extra dict to pass to the compute gradients session run.

Returns

Extra feed_dict to be passed to the compute_gradients Session.run() call.

Return type

Dict[TensorType, TensorType]

extra_compute_grad_fetches() → Dict[str, any][source]

Extra values to fetch and return from compute_gradients().

Returns

Extra fetch dict to be added to the fetch dict of the compute_gradients Session.run() call.

Return type

Dict[str, any]

optimizer() → tf.keras.optimizers.Optimizer[source]

TF optimizer to use for policy optimization.

Returns

The local optimizer to use for this Policy's Model.

Return type

tf.keras.optimizers.Optimizer

gradients(optimizer: tf.keras.optimizers.Optimizer, loss: Any) → List[Tuple[Any, Any]][source]

Override this for a custom gradient computation behavior.

Returns

List of tuples with grad values and the grad-value's corresponding tf.variable in it.

Return type

List[Tuple[TensorType, TensorType]]

build_apply_op(optimizer: tf.keras.optimizers.Optimizer, grads_and_vars: List[Tuple[Any, Any]]) → tf.Operation[source]

Override this for a custom gradient apply computation behavior.

Parameters
  • optimizer (tf.keras.optimizers.Optimizer) – The local tf optimizer to use for applying the grads and vars.

  • grads_and_vars (List[Tuple[TensorType, TensorType]]) – List of tuples with grad values and the grad-value’s corresponding tf.variable in it.

ray.rllib.policy.build_torch_policy(name, *, loss_fn, get_default_config=None, stats_fn=None, postprocess_fn=None, extra_action_out_fn=None, extra_grad_process_fn=None, extra_learn_fetches_fn=None, optimizer_fn=None, validate_spaces=None, before_init=None, after_init=None, action_sampler_fn=None, action_distribution_fn=None, make_model=None, make_model_and_action_dist=None, apply_gradients_fn=None, mixins=None, get_batch_divisibility_req=None)[source]

Helper function for creating a torch policy class at runtime.

Parameters
  • name (str) – name of the policy (e.g., “PPOTorchPolicy”)

  • loss_fn (callable) – Callable that returns a loss tensor given the arguments (policy, model, dist_class, train_batch).

  • get_default_config (Optional[callable]) – Optional callable that returns the default config to merge with any overrides.

  • stats_fn (Optional[callable]) – Optional callable that returns a dict of values given the policy and batch input tensors.

  • postprocess_fn (Optional[callable]) – Optional experience postprocessing function that takes the same args as Policy.postprocess_trajectory().

  • extra_action_out_fn (Optional[callable]) – Optional callable that returns a dict of extra values to include in experiences.

  • extra_grad_process_fn (Optional[callable]) – Optional callable that is called after gradients are computed and returns processing info.

  • extra_learn_fetches_fn (func) – optional function that returns a dict of extra values to fetch from the policy after loss evaluation.

  • optimizer_fn (Optional[callable]) – Optional callable that returns a torch optimizer given the policy and config.

  • validate_spaces (Optional[callable]) – Optional callable that takes the Policy, observation_space, action_space, and config to check for correctness.

  • before_init (Optional[callable]) – Optional callable to run at the beginning of Policy.__init__ that takes the same arguments as the Policy constructor.

  • after_init (Optional[callable]) – Optional callable to run at the end of policy init that takes the same arguments as the policy constructor.

  • action_sampler_fn (Optional[callable]) – Optional callable returning a sampled action and its log-likelihood given some (obs and state) inputs.

  • action_distribution_fn (Optional[callable]) – A callable that takes the Policy, Model, the observation batch, an explore-flag, a timestep, and an is_training flag and returns a tuple of a) distribution inputs (parameters), b) a dist-class to generate an action distribution object from, and c) internal-state outputs (empty list if not applicable).

  • make_model (Optional[callable]) – Optional func that takes the same arguments as Policy.__init__ and returns a model instance. The distribution class will be determined automatically. Note: Only one of make_model or make_model_and_action_dist should be provided.

  • make_model_and_action_dist (Optional[callable]) – Optional func that takes the same arguments as Policy.__init__ and returns a tuple of model instance and torch action distribution class. Note: Only one of make_model or make_model_and_action_dist should be provided.

  • apply_gradients_fn (Optional[callable]) – Optional callable that takes a grads list and applies these to the Model’s parameters.

  • mixins (list) – list of any class mixins for the returned policy class. These mixins will be applied in order and will have higher precedence than the TorchPolicy class.

  • get_batch_divisibility_req (Optional[callable]) – Optional callable that returns the divisibility requirement for sample batches.

Returns

TorchPolicy child class constructed from the specified args.

Return type

type
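
A minimal sketch of building a custom torch policy class with build_torch_policy; the loss shown (a plain log-likelihood surrogate weighted by raw rewards, with no advantage estimation) is illustrative only and not a recommended algorithm.

from ray.rllib.policy import build_torch_policy
from ray.rllib.policy.sample_batch import SampleBatch


def my_loss_fn(policy, model, dist_class, train_batch):
    # Forward pass and action distribution for the training batch.
    logits, _ = model.from_batch(train_batch)
    action_dist = dist_class(logits, model)
    log_probs = action_dist.logp(train_batch[SampleBatch.ACTIONS])
    # Illustrative surrogate loss only.
    return -(log_probs * train_batch[SampleBatch.REWARDS]).mean()


MyTorchPolicy = build_torch_policy(
    name="MyTorchPolicy",
    loss_fn=my_loss_fn,
)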

ray.rllib.policy.build_tf_policy(name, *, loss_fn, get_default_config=None, postprocess_fn=None, stats_fn=None, optimizer_fn=None, gradients_fn=None, apply_gradients_fn=None, grad_stats_fn=None, extra_action_fetches_fn=None, extra_learn_fetches_fn=None, validate_spaces=None, before_init=None, before_loss_init=None, after_init=None, make_model=None, action_sampler_fn=None, action_distribution_fn=None, mixins=None, get_batch_divisibility_req=None, obs_include_prev_action_reward=True)[source]

Helper function for creating a dynamic tf policy at runtime.

Functions will be run in this order to initialize the policy:
  1. Placeholder setup: postprocess_fn

  2. Loss init: loss_fn, stats_fn

  3. Optimizer init: optimizer_fn, gradients_fn, apply_gradients_fn, grad_stats_fn

This means that you can e.g., depend on any policy attributes created in the running of loss_fn in later functions such as stats_fn.

In eager mode, the following functions will be run repeatedly on each eager execution: loss_fn, stats_fn, gradients_fn, apply_gradients_fn, and grad_stats_fn.

This means that these functions should not define any variables internally, otherwise they will fail in eager mode execution. Variables should only be created in make_model (if defined).

Parameters
  • name (str) – name of the policy (e.g., “PPOTFPolicy”)

  • loss_fn (func) – function that returns a loss tensor given the arguments (policy, model, dist_class, train_batch)

  • get_default_config (func) – optional function that returns the default config to merge with any overrides

  • postprocess_fn (func) – optional experience postprocessing function that takes the same args as Policy.postprocess_trajectory()

  • stats_fn (func) – optional function that returns a dict of TF fetches given the policy and batch input tensors

  • optimizer_fn (func) – optional function that returns a tf.Optimizer given the policy and config

  • gradients_fn (func) – optional function that returns a list of gradients given (policy, optimizer, loss). If not specified, this defaults to optimizer.compute_gradients(loss)

  • apply_gradients_fn (func) – optional function that returns an apply gradients op given (policy, optimizer, grads_and_vars)

  • grad_stats_fn (func) – optional function that returns a dict of TF fetches given the policy, batch input, and gradient tensors

  • extra_action_fetches_fn (func) – optional function that returns a dict of TF fetches given the policy object

  • extra_learn_fetches_fn (func) – optional function that returns a dict of extra values to fetch and return when learning on a batch

  • validate_spaces (Optional[callable]) – Optional callable that takes the Policy, observation_space, action_space, and config to check for correctness.

  • before_init (func) – optional function to run at the beginning of policy init that takes the same arguments as the policy constructor

  • before_loss_init (func) – optional function to run prior to loss init that takes the same arguments as the policy constructor

  • after_init (func) – optional function to run at the end of policy init that takes the same arguments as the policy constructor

  • make_model (func) – optional function that returns a ModelV2 object given (policy, obs_space, action_space, config). All policy variables should be created in this function. If not specified, a default model will be created.

  • action_sampler_fn (Optional[callable]) – A callable returning a sampled action and its log-likelihood given some (obs and state) inputs.

  • action_distribution_fn (Optional[callable]) – A callable returning distribution inputs (parameters), a dist-class to generate an action distribution object from, and internal-state outputs (or an empty list if not applicable).

  • mixins (list) – list of any class mixins for the returned policy class. These mixins will be applied in order and will have higher precedence than the DynamicTFPolicy class

  • get_batch_divisibility_req (func) – optional function that returns the divisibility requirement for sample batches

  • obs_include_prev_action_reward (bool) – whether to include the previous action and reward in the model input

Returns

a DynamicTFPolicy instance that uses the specified args
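
An analogous sketch with build_tf_policy; as above, the loss is a simplified illustrative surrogate rather than a real algorithm.

import tensorflow as tf

from ray.rllib.policy import build_tf_policy
from ray.rllib.policy.sample_batch import SampleBatch


def my_tf_loss_fn(policy, model, dist_class, train_batch):
    # Forward pass and action distribution for the training batch.
    logits, _ = model.from_batch(train_batch)
    action_dist = dist_class(logits, model)
    log_probs = action_dist.logp(train_batch[SampleBatch.ACTIONS])
    # Illustrative surrogate loss only.
    return -tf.reduce_mean(log_probs * train_batch[SampleBatch.REWARDS])


MyTFPolicy = build_tf_policy(
    name="MyTFPolicy",
    loss_fn=my_tf_loss_fn,
)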

ray.rllib.env

class ray.rllib.env.BaseEnv[source]

The lowest-level env interface used by RLlib for sampling.

BaseEnv models multiple agents executing asynchronously in multiple environments. A call to poll() returns observations from ready agents keyed by their environment and agent ids, and actions for those agents can be sent back via send_actions().

All other env types can be adapted to BaseEnv. RLlib handles these conversions internally in RolloutWorker, for example:

gym.Env => rllib.VectorEnv => rllib.BaseEnv rllib.MultiAgentEnv => rllib.BaseEnv rllib.ExternalEnv => rllib.BaseEnv

action_space

Action space. This must be defined for single-agent envs. Multi-agent envs can set this to None.

Type

gym.Space

observation_space

Observation space. This must be defined for single-agent envs. Multi-agent envs can set this to None.

Type

gym.Space

Examples

>>> env = MyBaseEnv()
>>> obs, rewards, dones, infos, off_policy_actions = env.poll()
>>> print(obs)
{
    "env_0": {
        "car_0": [2.4, 1.6],
        "car_1": [3.4, -3.2],
    },
    "env_1": {
        "car_0": [8.0, 4.1],
    },
    "env_2": {
        "car_0": [2.3, 3.3],
        "car_1": [1.4, -0.2],
        "car_3": [1.2, 0.1],
    },
}
>>> env.send_actions(
    actions={
        "env_0": {
            "car_0": 0,
            "car_1": 1,
        }, ...
    })
>>> obs, rewards, dones, infos, off_policy_actions = env.poll()
>>> print(obs)
{
    "env_0": {
        "car_0": [4.1, 1.7],
        "car_1": [3.2, -4.2],
    }, ...
}
>>> print(dones)
{
    "env_0": {
        "__all__": False,
        "car_0": False,
        "car_1": True,
    }, ...
}
static to_base_env(env: Any, make_env: Callable[[int], Any] = None, num_envs: int = 1, remote_envs: bool = False, remote_env_batch_wait_ms: int = 0) → ray.rllib.env.base_env.BaseEnv[source]

Wraps any env type as needed to expose the async interface.

poll() → Tuple[Dict[int, Dict[Any, Any]], Dict[int, Dict[Any, Any]], Dict[int, Dict[Any, Any]], Dict[int, Dict[Any, Any]], Dict[int, Dict[Any, Any]]][source]

Returns observations from ready agents.

The returns are two-level dicts mapping from env_id to a dict of agent_id to values. The number of agents and envs can vary over time.

Returns

  • obs (dict): New observations for each ready agent.

  • rewards (dict): Reward values for each ready agent. If the episode is just started, the value will be None.

  • dones (dict): Done values for each ready agent. The special key "__all__" is used to indicate env termination.

  • infos (dict): Info values for each ready agent.

  • off_policy_actions (dict): Agents may take off-policy actions. When that happens, there will be an entry in this dict that contains the taken action. There is no need to send_actions() for agents that have already chosen off-policy actions.

send_actions(action_dict: Dict[int, Dict[Any, Any]]) → None[source]

Called to send actions back to running agents in this env.

Actions should be sent for each ready agent that returned observations in the previous poll() call.

Parameters

action_dict (dict) – Action values keyed by env_id and agent_id.

try_reset(env_id: Optional[int] = None) → Optional[Dict[Any, Any]][source]

Attempt to reset the sub-env with the given id or all sub-envs.

If the environment does not support synchronous reset, None can be returned here.

Parameters

env_id (Optional[int]) – The sub-env ID if applicable. If None, reset the entire Env (i.e. all sub-envs).

Returns

Reset observations, or None if not supported.

Return type

obs (dict|None)

get_unwrapped() → List[Any][source]

Return a reference to the underlying gym envs, if any.

Returns

Underlying gym envs or [].

Return type

envs (list)

stop() → None[source]

Releases all resources used.

class ray.rllib.env.Unity3DEnv(file_name: str = None, worker_id: int = 0, base_port: int = 5004, seed: int = 0, no_graphics: bool = False, timeout_wait: int = 60, episode_horizon: int = 1000)[source]

A MultiAgentEnv representing a single Unity3D game instance.

For an example on how to use this Env with a running Unity3D editor or with a compiled game, see: rllib/examples/unity3d_env_local.py For an example on how to use it inside a Unity game client, which connects to an RLlib Policy server, see: rllib/examples/serving/unity3d_[client|server].py

Supports all Unity3D (MLAgents) examples, multi- or single-agent and gets converted automatically into an ExternalMultiAgentEnv, when used inside an RLlib PolicyClient for cloud/distributed training of Unity games.

step(action_dict: Dict[Any, Any]) → Tuple[Dict[Any, Any], Dict[Any, Any], Dict[Any, Any], Dict[Any, Any]][source]

Performs one multi-agent step through the game.

Parameters

action_dict (dict) – Multi-agent action dict. Keys are agent identifiers of the form [MLAgents behavior name, e.g. "Goalie?team=1"] + "_" + [agent index, a unique MLAgents-assigned index per agent].

Returns

  • obs: Multi-agent observation dict. Only those observations for which to get new actions are returned.

  • rewards: Rewards dict matching obs.

  • dones: Done dict with only an __all__ multi-agent entry in it. __all__=True, if episode is done for all agents.

  • infos: An (empty) info dict.

Return type

tuple

reset() → Dict[Any, Any][source]

Resets the entire Unity3D scene (a single multi-agent episode).

class ray.rllib.env.PettingZooEnv(env)[source]

An interface to the PettingZoo MARL environment library.

See: https://github.com/PettingZoo-Team/PettingZoo

Inherits from MultiAgentEnv and exposes a given AEC (actor-environment-cycle) game from the PettingZoo project via the MultiAgentEnv public API.

It reduces the class of AEC games to Partially Observable Markov (POM) games by imposing the following important restrictions onto an AEC environment:

  1. Each agent steps in order specified in agents list (unless they are done, in which case, they should be skipped).

  2. Agents act simultaneously (-> No hard-turn games like chess).

  3. All agents have the same action_spaces and observation_spaces. Note: If, within your AEC game, agents do not have homogeneous action / observation spaces, apply SuperSuit wrappers to add padding functionality: https://github.com/PettingZoo-Team/SuperSuit#built-in-multi-agent-only-functions

  4. Environments are positive sum games (-> Agents are expected to cooperate to maximize reward). This isn't a hard restriction; it's just that standard algorithms aren't expected to work well in highly competitive games.

Examples

>>> from pettingzoo.gamma import prison_v0
>>> env = PettingZooEnv(prison_v0.env())
>>> obs = env.reset()
>>> print(obs)
    {
        "0": [110, 119],
        "1": [105, 102],
        "2": [99, 95],
    }
>>> obs, rewards, dones, infos = env.step(
    action_dict={
        "0": 1, "1": 0, "2": 2,
    })
>>> print(rewards)
    {
        "0": 0,
        "1": 1,
        "2": 0,
    }
>>> print(dones)
    {
        "0": False,    # agent 0 is still running
        "1": True,     # agent 1 is done
        "__all__": False,  # the env is not done
    }
>>> print(infos)
    {
        "0": {},  # info for agent 0
        "1": {},  # info for agent 1
    }
reset()[source]

Resets the env and returns observations from ready agents.

Returns

New observations for each ready agent.

Return type

obs (dict)

step(action_dict)[source]

Executes input actions from RL agents and returns observations from environment agents.

The returns are dicts mapping from agent_id strings to values. The number of agents in the env can vary over time.

Returns

  • obs (dict): New observations for each ready agent.

  • rewards (dict): Reward values for each ready agent. If the episode is just started, the value will be None.

  • dones (dict): Done values for each ready agent. The special key "__all__" (required) is used to indicate env termination.

  • infos (dict): Optional info values for each agent id.

with_agent_groups(groups, obs_space=None, act_space=None)[source]

Convenience method for grouping together agents in this env.

An agent group is a list of agent ids that are mapped to a single logical agent. All agents of the group must act at the same time in the environment. The grouped agent exposes Tuple action and observation spaces that are the concatenated action and obs spaces of the individual agents.

The rewards of all the agents in a group are summed. The individual agent rewards are available under the “individual_rewards” key of the group info return.

Agent grouping is required to leverage algorithms such as Q-Mix.

This API is experimental.

Parameters
  • groups (dict) – Mapping from group id to a list of the agent ids of group members. If an agent id is not present in any group value, it will be left ungrouped.

  • obs_space (Space) – Optional observation space for the grouped env. Must be a tuple space.

  • act_space (Space) – Optional action space for the grouped env. Must be a tuple space.

Examples

>>> env = YourMultiAgentEnv(...)
>>> grouped_env = env.with_agent_groups(env, {
...   "group1": ["agent1", "agent2", "agent3"],
...   "group2": ["agent4", "agent5"],
... })
class ray.rllib.env.MultiAgentEnv[source]

An environment that hosts multiple independent agents.

Agents are identified by (string) agent ids. Note that these “agents” here are not to be confused with RLlib agents.

Examples

>>> env = MyMultiAgentEnv()
>>> obs = env.reset()
>>> print(obs)
{
    "car_0": [2.4, 1.6],
    "car_1": [3.4, -3.2],
    "traffic_light_1": [0, 3, 5, 1],
}
>>> obs, rewards, dones, infos = env.step(
...    action_dict={
...        "car_0": 1, "car_1": 0, "traffic_light_1": 2,
...    })
>>> print(rewards)
{
    "car_0": 3,
    "car_1": -1,
    "traffic_light_1": 0,
}
>>> print(dones)
{
    "car_0": False,    # car_0 is still running
    "car_1": True,     # car_1 is done
    "__all__": False,  # the env is not done
}
>>> print(infos)
{
    "car_0": {},  # info for car_0
    "car_1": {},  # info for car_1
}
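
A sketch of a two-agent MultiAgentEnv subclass; the spaces, the fixed 10-step horizon, and the constant reward are purely illustrative.

import gym

from ray.rllib.env import MultiAgentEnv


class TwoAgentEnv(MultiAgentEnv):
    def __init__(self, config=None):
        self.observation_space = gym.spaces.Box(-1.0, 1.0, (2,))
        self.action_space = gym.spaces.Discrete(2)
        self.steps = 0

    def reset(self):
        self.steps = 0
        return {"agent_0": self.observation_space.sample(),
                "agent_1": self.observation_space.sample()}

    def step(self, action_dict):
        self.steps += 1
        obs = {aid: self.observation_space.sample() for aid in action_dict}
        rewards = {aid: 1.0 for aid in action_dict}  # illustrative reward
        done = self.steps >= 10
        dones = {aid: done for aid in action_dict}
        dones["__all__"] = done
        infos = {aid: {} for aid in action_dict}
        return obs, rewards, dones, infos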
reset() → Dict[Any, Any][source]

Resets the env and returns observations from ready agents.

Returns

New observations for each ready agent.

Return type

obs (dict)

step(action_dict: Dict[Any, Any]) → Tuple[Dict[Any, Any], Dict[Any, Any], Dict[Any, Any], Dict[Any, Any]][source]

Returns observations from ready agents.

The returns are dicts mapping from agent_id strings to values. The number of agents in the env can vary over time.

Returns

  • obs (dict): New observations for each ready agent.

  • rewards (dict): Reward values for each ready agent. If the episode is just started, the value will be None.

  • dones (dict): Done values for each ready agent. The special key "__all__" (required) is used to indicate env termination.

  • infos (dict): Optional info values for each agent id.

with_agent_groups(groups: Dict[str, List[Any]], obs_space: gym.Space = None, act_space: gym.Space = None) → ray.rllib.env.multi_agent_env.MultiAgentEnv[source]

Convenience method for grouping together agents in this env.

An agent group is a list of agent ids that are mapped to a single logical agent. All agents of the group must act at the same time in the environment. The grouped agent exposes Tuple action and observation spaces that are the concatenated action and obs spaces of the individual agents.

The rewards of all the agents in a group are summed. The individual agent rewards are available under the “individual_rewards” key of the group info return.

Agent grouping is required to leverage algorithms such as Q-Mix.

This API is experimental.

Parameters
  • groups (dict) – Mapping from group id to a list of the agent ids of group members. If an agent id is not present in any group value, it will be left ungrouped.

  • obs_space (Space) – Optional observation space for the grouped env. Must be a tuple space.

  • act_space (Space) – Optional action space for the grouped env. Must be a tuple space.

Examples

>>> env = YourMultiAgentEnv(...)
>>> grouped_env = env.with_agent_groups(env, {
...   "group1": ["agent1", "agent2", "agent3"],
...   "group2": ["agent4", "agent5"],
... })
class ray.rllib.env.ExternalEnv(action_space: gym.Space, observation_space: gym.Space, max_concurrent: int = 100)[source]

An environment that interfaces with external agents.

Unlike simulator envs, control is inverted. The environment queries the policy to obtain actions and logs observations and rewards for training. This is in contrast to gym.Env, where the algorithm drives the simulation through env.step() calls.

You can use ExternalEnv as the backend for policy serving (by serving HTTP requests in the run loop), for ingesting offline logs data (by reading offline transitions in the run loop), or other custom use cases not easily expressed through gym.Env.

ExternalEnv supports both on-policy actions (through self.get_action()), and off-policy actions (through self.log_action()).

This env is thread-safe, but individual episodes must be executed serially.

action_space

Action space.

Type

gym.Space

observation_space

Observation space.

Type

gym.Space

Examples

>>> register_env("my_env", lambda config: YourExternalEnv(config))
>>> trainer = DQNTrainer(env="my_env")
>>> while True:
>>>     print(trainer.train())
run()[source]

Override this to implement the run loop.

Your loop should continuously:
  1. Call self.start_episode(episode_id)

  2. Call self.get_action(episode_id, obs)

    -or- self.log_action(episode_id, obs, action)

  3. Call self.log_returns(episode_id, reward)

  4. Call self.end_episode(episode_id, obs)

  5. Wait if nothing to do.

Multiple episodes may be started at the same time.
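
A sketch of a run() loop serving one episode at a time; `my_external_system` and its methods are hypothetical placeholders for whatever drives the external agent (an HTTP handler, a simulator client, a log reader, etc.).

from ray.rllib.env import ExternalEnv


class MyServingEnv(ExternalEnv):
    def run(self):
        while True:
            episode_id = self.start_episode()
            # Hypothetical external source of observations and rewards.
            obs = my_external_system.get_initial_observation()
            done = False
            while not done:
                action = self.get_action(episode_id, obs)
                obs, reward, done = my_external_system.apply(action)
                self.log_returns(episode_id, reward)
            self.end_episode(episode_id, obs)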

start_episode(episode_id: Optional[str] = None, training_enabled: bool = True) → str[source]

Record the start of an episode.

Parameters
  • episode_id (Optional[str]) – Unique string id for the episode or None for it to be auto-assigned and returned.

  • training_enabled (bool) – Whether to use experiences for this episode to improve the policy.

Returns

Unique string id for the episode.

Return type

episode_id (str)

get_action(episode_id: str, observation: Any) → Any[source]

Record an observation and get the on-policy action.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • observation (obj) – Current environment observation.

Returns

Action from the env action space.

Return type

action (obj)

log_action(episode_id: str, observation: Any, action: Any) → None[source]

Record an observation and (off-policy) action taken.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • observation (obj) – Current environment observation.

  • action (obj) – Action for the observation.

log_returns(episode_id: str, reward: float, info: dict = None) → None[source]

Record returns from the environment.

The reward will be attributed to the previous action taken by the episode. Rewards accumulate until the next action. If no reward is logged before the next action, a reward of 0.0 is assumed.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • reward (float) – Reward from the environment.

  • info (dict) – Optional info dict.

end_episode(episode_id: str, observation: Any) → None[source]

Record the end of an episode.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • observation (obj) – Current environment observation.

class ray.rllib.env.ExternalMultiAgentEnv(action_space: gym.Space, observation_space: gym.Space, max_concurrent: int = 100)[source]

This is the multi-agent version of ExternalEnv.

run()[source]

Override this to implement the multi-agent run loop.

Your loop should continuously:
  1. Call self.start_episode(episode_id)

  2. Call self.get_action(episode_id, obs_dict)

    -or- self.log_action(episode_id, obs_dict, action_dict)

  3. Call self.log_returns(episode_id, reward_dict)

  4. Call self.end_episode(episode_id, obs_dict)

  5. Wait if nothing to do.

Multiple episodes may be started at the same time.

start_episode(episode_id: Optional[str] = None, training_enabled: bool = True) → str[source]

Record the start of an episode.

Parameters
  • episode_id (Optional[str]) – Unique string id for the episode or None for it to be auto-assigned and returned.

  • training_enabled (bool) – Whether to use experiences for this episode to improve the policy.

Returns

Unique string id for the episode.

Return type

episode_id (str)

get_action(episode_id: str, observation_dict: Dict[Any, Any]) → Dict[Any, Any][source]

Record an observation and get the on-policy action. observation_dict is expected to contain the observation of all agents acting in this episode step.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • observation_dict (dict) – Current environment observation.

Returns

Action from the env action space.

Return type

action (dict)

log_action(episode_id: str, observation_dict: Dict[Any, Any], action_dict: Dict[Any, Any]) → None[source]

Record an observation and (off-policy) action taken.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • observation_dict (dict) – Current environment observation.

  • action_dict (dict) – Action for the observation.

log_returns(episode_id: str, reward_dict: Dict[Any, Any], info_dict: Dict[Any, Any] = None, multiagent_done_dict: Dict[Any, Any] = None) → None[source]

Record returns from the environment.

The reward will be attributed to the previous action taken by the episode. Rewards accumulate until the next action. If no reward is logged before the next action, a reward of 0.0 is assumed.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • reward_dict (dict) – Reward from the environment agents.

  • info_dict (dict) – Optional info dict.

  • multiagent_done_dict (dict) – Optional done dict for agents.

end_episode(episode_id: str, observation_dict: Dict[Any, Any]) → None[source]

Record the end of an episode.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • observation_dict (dict) – Current environment observation.

class ray.rllib.env.VectorEnv(observation_space: <Mock name='mock.Space' id='139809893919824'>, action_space: <Mock name='mock.Space' id='139809893919824'>, num_envs: int)[source]

An environment that supports batch evaluation using clones of sub-envs.

vector_reset() → List[Any][source]

Resets all sub-environments.

Returns

List of observations from each environment.

Return type

obs (List[any])

reset_at(index: int) → Any[source]

Resets a single environment.

Returns

Observations from the reset sub environment.

Return type

obs (obj)

vector_step(actions: List[Any]) → Tuple[List[Any], List[float], List[bool], List[dict]][source]

Performs a vectorized step on all sub environments using actions.

Parameters

actions (List[any]) – List of actions (one for each sub-env).

Returns

New observations for each sub-env. rewards (List[any]): Reward values for each sub-env. dones (List[any]): Done values for each sub-env. infos (List[any]): Info values for each sub-env.

Return type

obs (List[any])

get_unwrapped() → List[Any][source]

Returns the underlying sub environments.

Returns

List of all underlying sub environments.

Return type

List[Env]
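
A rough sketch of how a caller might drive these methods (assuming the action_space and num_envs constructor arguments are exposed as attributes, as in the base class):

>>> obs = vector_env.vector_reset()
>>> for _ in range(100):
...     actions = [vector_env.action_space.sample()
...                for _ in range(vector_env.num_envs)]
...     obs, rewards, dones, infos = vector_env.vector_step(actions)
...     for i, done in enumerate(dones):
...         if done:
...             obs[i] = vector_env.reset_at(i)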

class ray.rllib.env.EnvContext(env_config: dict, worker_index: int, vector_index: int = 0, remote: bool = False)[source]

Wraps env configurations to include extra rllib metadata.

These attributes can be used to parameterize environments per process. For example, one might use worker_index to control which data file an environment reads in on initialization.

RLlib auto-sets these attributes when constructing registered envs.

worker_index

When there are multiple workers created, this uniquely identifies the worker the env is created in.

Type

int

vector_index

When there are multiple envs per worker, this uniquely identifies the env index within the worker.

Type

int

remote

Whether environment should be remote or not.

Type

bool
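
For example, a registered env could use worker_index to shard its input data across workers. A sketch (the shard list and env body are placeholders):

>>> import gym
>>> DATA_FILES = ["shard_0.csv", "shard_1.csv", "shard_2.csv"]  # placeholder shards
>>> class FileBackedEnv(gym.Env):
...     def __init__(self, config):  # config is an EnvContext
...         # Pick a different data file for each rollout worker.
...         self.data_file = DATA_FILES[config.worker_index % len(DATA_FILES)]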

class ray.rllib.env.PolicyClient(address: str, inference_mode: str = 'local', update_interval: float = 10.0)[source]

REST client to interact with a RLlib policy server.

start_episode(episode_id: Optional[str] = None, training_enabled: bool = True) → str[source]

Record the start of one or more episode(s).

Parameters
  • episode_id (Optional[str]) – Unique string id for the episode or None for it to be auto-assigned.

  • training_enabled (bool) – Whether to use experiences for this episode to improve the policy.

Returns

Unique string id for the episode.

Return type

episode_id (str)

get_action(episode_id: str, observation: Union[Any, Dict[Any, Any]]) → Union[Any, Dict[Any, Any]][source]

Record an observation and get the on-policy action.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • observation (obj) – Current environment observation.

Returns

Action from the env action space.

Return type

action (obj)

log_action(episode_id: str, observation: Union[Any, Dict[Any, Any]], action: Union[Any, Dict[Any, Any]]) → None[source]

Record an observation and (off-policy) action taken.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • observation (obj) – Current environment observation.

  • action (obj) – Action for the observation.

log_returns(episode_id: str, reward: int, info: Union[dict, Dict[Any, Any]] = None, multiagent_done_dict: Optional[Dict[Any, Any]] = None) → None[source]

Record returns from the environment.

The reward will be attributed to the previous action taken by the episode. Rewards accumulate until the next action. If no reward is logged before the next action, a reward of 0.0 is assumed.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • reward (float) – Reward from the environment.

  • info (dict) – Extra info dict.

  • multiagent_done_dict (dict) – Multi-agent done information.

end_episode(episode_id: str, observation: Union[Any, Dict[Any, Any]]) → None[source]

Record the end of an episode.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • observation (obj) – Current environment observation.

update_policy_weights() → None[source]

Query the server for new policy weights, if local inference is enabled.
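
Putting the client methods together, a typical serving loop against a local simulator might look like the following sketch (the gym env and server address are placeholders; a PolicyServerInput-backed trainer is assumed to be listening):

>>> import gym
>>> env = gym.make("CartPole-v0")
>>> client = PolicyClient("localhost:9900", inference_mode="local")
>>> obs = env.reset()
>>> eps_id = client.start_episode()
>>> done = False
>>> while not done:
...     action = client.get_action(eps_id, obs)
...     obs, reward, done, info = env.step(action)
...     client.log_returns(eps_id, reward, info=info)
>>> client.end_episode(eps_id, obs)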

class ray.rllib.env.PolicyServerInput(ioctx, address, port)[source]

REST policy server that acts as an offline data source.

This launches a multi-threaded server that listens on the specified host and port to serve policy requests and forward experiences to RLlib. For high performance experience collection, it implements InputReader.

For an example, run examples/cartpole_server.py along with examples/cartpole_client.py --inference-mode=local|remote.

Examples

>>> pg = PGTrainer(
...     env="CartPole-v0", config={
...         "input": lambda ioctx:
...             PolicyServerInput(ioctx, addr, port),
...         "num_workers": 0,  # Run just 1 server, in the trainer.
...     })
>>> while True:
>>>     pg.train()
>>> client = PolicyClient("localhost:9900", inference_mode="local")
>>> eps_id = client.start_episode()
>>> action = client.get_action(eps_id, obs)
>>> ...
>>> client.log_returns(eps_id, reward)
>>> ...
>>> client.log_returns(eps_id, reward)
next()[source]

Return the next batch of experiences read.

Returns

SampleBatch or MultiAgentBatch read.

ray.rllib.evaluation

class ray.rllib.evaluation.MultiAgentEpisode(policies: Dict[str, ray.rllib.policy.policy.Policy], policy_mapping_fn: Callable[[Any], str], batch_builder_factory: Callable[[], MultiAgentSampleBatchBuilder], extra_batch_callback: Callable[[Union[SampleBatch, MultiAgentBatch]], None])[source]

Tracks the current state of a (possibly multi-agent) episode.

new_batch_builder

Create a new MultiAgentSampleBatchBuilder.

Type

func

add_extra_batch

Return a built MultiAgentBatch to the sampler.

Type

func

batch_builder

Batch builder for the current episode.

Type

obj

total_reward

Summed reward across all agents in this episode.

Type

float

length

Length of this episode.

Type

int

episode_id

Unique id identifying this trajectory.

Type

int

agent_rewards

Summed rewards broken down by agent.

Type

dict

custom_metrics

Dict where you can add custom metrics.

Type

dict

user_data

Dict that you can use for temporary storage.

Type

dict

Use case 1: Model-based rollouts in multi-agent:

A custom compute_actions() function in a policy can inspect the current episode state and perform a number of rollouts based on the policies and state of other agents in the environment.

Use case 2: Returning extra rollouts data.

The model rollouts can be returned back to the sampler by calling:

>>> batch = episode.new_batch_builder()
>>> for transition in transitions:  # pseudocode: one row per rollout step
       batch.add_values(...)  # see sampler for usage
>>> episode.extra_batches.add(batch.build_and_reset())
soft_reset() → None[source]

Clears rewards and metrics, but retains RNN and other state.

This is used to carry state across multiple logical episodes in the same env (i.e., if soft_horizon is set).

policy_for(agent_id: Any = 'agent0') → ray.rllib.policy.policy.Policy[source]

Returns the policy for the specified agent.

If the agent is new, the policy mapping fn will be called to bind the agent to a policy for the duration of the episode.

last_observation_for(agent_id: Any = 'agent0') → Any[source]

Returns the last observation for the specified agent.

last_raw_obs_for(agent_id: Any = 'agent0') → Any[source]

Returns the last un-preprocessed obs for the specified agent.

last_info_for(agent_id: Any = 'agent0') → dict[source]

Returns the last info for the specified agent.

last_action_for(agent_id: Any = 'agent0') → Any[source]

Returns the last action for the specified agent, or zeros.

prev_action_for(agent_id: Any = 'agent0') → Any[source]

Returns the previous action for the specified agent.

prev_reward_for(agent_id: Any = 'agent0') → float[source]

Returns the previous reward for the specified agent.

rnn_state_for(agent_id: Any = 'agent0') → List[Any][source]

Returns the last RNN state for the specified agent.

last_pi_info_for(agent_id: Any = 'agent0') → dict[source]

Returns the last info object for the specified agent.

class ray.rllib.evaluation.RolloutWorker(env_creator: Callable[[ray.rllib.env.env_context.EnvContext], Any], policy: type, policy_mapping_fn: Callable[[Any], str] = None, policies_to_train: List[str] = None, tf_session_creator: Callable[[], Any] = None, rollout_fragment_length: int = 100, batch_mode: str = 'truncate_episodes', episode_horizon: int = None, preprocessor_pref: str = 'deepmind', sample_async: bool = False, compress_observations: bool = False, num_envs: int = 1, observation_fn: ObservationFunction = None, observation_filter: str = 'NoFilter', clip_rewards: bool = None, clip_actions: bool = True, env_config: dict = None, model_config: dict = None, policy_config: dict = None, worker_index: int = 0, num_workers: int = 0, monitor_path: str = None, log_dir: str = None, log_level: str = None, callbacks: DefaultCallbacks = None, input_creator: Callable[[ray.rllib.offline.io_context.IOContext], ray.rllib.offline.input_reader.InputReader] = <function RolloutWorker.<lambda>>, input_evaluation: List[str] = frozenset({}), output_creator: Callable[[ray.rllib.offline.io_context.IOContext], ray.rllib.offline.output_writer.OutputWriter] = <function RolloutWorker.<lambda>>, remote_worker_envs: bool = False, remote_env_batch_wait_ms: int = 0, soft_horizon: bool = False, no_done_at_end: bool = False, seed: int = None, extra_python_environs: dict = None, fake_sampler: bool = False)[source]

Common experience collection class.

This class wraps a policy instance and an environment class to collect experiences from the environment. You can create many replicas of this class as Ray actors to scale RL training.

This class supports vectorized and multi-agent policy evaluation (e.g., VectorEnv, MultiAgentEnv, etc.)

Examples

>>> # Create a rollout worker and use it to collect experiences.
>>> worker = RolloutWorker(
...   env_creator=lambda _: gym.make("CartPole-v0"),
...   policy=PGTFPolicy)
>>> print(worker.sample())
SampleBatch({
    "obs": [[...]], "actions": [[...]], "rewards": [[...]],
    "dones": [[...]], "new_obs": [[...]]})
>>> # Creating a multi-agent rollout worker
>>> worker = RolloutWorker(
...   env_creator=lambda _: MultiAgentTrafficGrid(num_cars=25),
...   policies={
...       # Use an ensemble of two policies for car agents
...       "car_policy1":
...         (PGTFPolicy, Box(...), Discrete(...), {"gamma": 0.99}),
...       "car_policy2":
...         (PGTFPolicy, Box(...), Discrete(...), {"gamma": 0.95}),
...       # Use a single shared policy for all traffic lights
...       "traffic_light_policy":
...         (PGTFPolicy, Box(...), Discrete(...), {}),
...   },
...   policy_mapping_fn=lambda agent_id:
...     random.choice(["car_policy1", "car_policy2"])
...     if agent_id.startswith("car_") else "traffic_light_policy")
>>> print(worker.sample())
MultiAgentBatch({
    "car_policy1": SampleBatch(...),
    "car_policy2": SampleBatch(...),
    "traffic_light_policy": SampleBatch(...)})
sample() → Union[SampleBatch, MultiAgentBatch][source]

Returns a batch of experience sampled from this worker.

This method must be implemented by subclasses.

Returns

A columnar batch of experiences (e.g., tensors).

Return type

SampleBatchType

Examples

>>> print(worker.sample())
SampleBatch({"obs": [1, 2, 3], "action": [0, 1, 0], ...})
sample_with_count() → Tuple[Union[SampleBatch, MultiAgentBatch], int][source]

Same as sample() but returns the count as a separate future.

get_weights(policies: List[str] = None) -> (<class 'dict'>, <class 'dict'>)[source]

Returns the model weights of this worker.

Returns

weights that can be set on another worker. info: dictionary of extra metadata.

Return type

object

Examples

>>> weights = worker.get_weights()
set_weights(weights: dict, global_vars: dict = None) → None[source]

Sets the model weights of this worker.

Examples

>>> weights = worker.get_weights()
>>> worker.set_weights(weights)
compute_gradients(samples: Union[SampleBatch, MultiAgentBatch]) → Tuple[Union[List[Tuple[Any, Any]], List[Any]], dict][source]

Returns a gradient computed w.r.t the specified samples.

Returns

A list of gradients that can be applied on a compatible worker. In the multi-agent case, returns a dict of gradients keyed by policy ids. An info dictionary of extra metadata is also returned.

Return type

(grads, info)

Examples

>>> batch = worker.sample()
>>> grads, info = worker.compute_gradients(batch)
apply_gradients(grads: Union[List[Tuple[Any, Any]], List[Any]]) → Dict[str, Any][source]

Applies the given gradients to this worker’s weights.

Examples

>>> samples = worker.sample()
>>> grads, info = worker.compute_gradients(samples)
>>> worker.apply_gradients(grads)
learn_on_batch(samples: Union[SampleBatch, MultiAgentBatch]) → dict[source]

Update policies based on the given batch.

This is the equivalent to apply_gradients(compute_gradients(samples)), but can be optimized to avoid pulling gradients into CPU memory.

Returns

dictionary of extra metadata from compute_gradients().

Return type

info

Examples

>>> batch = worker.sample()
>>> worker.learn_on_batch(batch)
sample_and_learn(expected_batch_size: int, num_sgd_iter: int, sgd_minibatch_size: str, standardize_fields: List[str]) → Tuple[dict, int][source]

Sample a batch and learn on it.

This is typically used in combination with distributed allreduce.

Parameters
  • expected_batch_size (int) – Expected number of samples to learn on.

  • num_sgd_iter (int) – Number of SGD iterations.

  • sgd_minibatch_size (int) – SGD minibatch size.

  • standardize_fields (list) – List of sample fields to normalize.

Returns

dictionary of extra metadata from learn_on_batch(). count: number of samples learned on.

Return type

info

get_metrics() → List[Union[ray.rllib.evaluation.rollout_metrics.RolloutMetrics, ray.rllib.offline.off_policy_estimator.OffPolicyEstimate]][source]

Returns a list of new RolloutMetric objects from evaluation.

foreach_env(func: Callable[[ray.rllib.env.base_env.BaseEnv], T]) → List[T][source]

Apply the given function to each underlying env instance.

get_policy(policy_id: Optional[str] = 'default_policy') → ray.rllib.policy.policy.Policy[source]

Return policy for the specified id, or None.

Parameters

policy_id (str) – id of policy to return.

for_policy(func: Callable[[ray.rllib.policy.policy.Policy], T], policy_id: Optional[str] = 'default_policy') → T[source]

Apply the given function to the specified policy.

foreach_policy(func: Callable[[ray.rllib.policy.policy.Policy, str], T]) → List[T][source]

Apply the given function to each (policy, policy_id) tuple.

foreach_trainable_policy(func: Callable[[ray.rllib.policy.policy.Policy, str], T]) → List[T][source]

Applies the given function to each (policy, policy_id) tuple, which can be found in self.policies_to_train.

Parameters

func (callable) – A function - taking a Policy and its ID - that is called on all Policies within self.policies_to_train.

Returns

The list of n return values of all

func([policy], [ID])-calls.

Return type

List[any]

sync_filters(new_filters: dict) → None[source]

Changes self’s filters to the given ones and rebases any accumulated delta.

Parameters

new_filters (dict) – Filters with new state to update local copy.

get_filters(flush_after: bool = False) → dict[source]

Returns a snapshot of filters.

Parameters

flush_after (bool) – Whether to clear the filter buffer state after taking the snapshot.

Returns

Dict for serializable filters

Return type

return_filters (dict)

creation_args() → dict[source]

Returns the args used to create this worker.

get_host() → str[source]

Returns the hostname of the process running this evaluator.

apply(func: Callable[[RolloutWorker], T], *args) → T[source]

Apply the given function to this rollout worker instance.

setup_torch_data_parallel(url: str, world_rank: int, world_size: int, backend: str) → None[source]

Join a torch process group for distributed SGD.

get_node_ip() → str[source]

Returns the IP address of the current node.

find_free_port() → int[source]

Finds a free port on the current node.

class ray.rllib.evaluation.SampleBatchBuilder[source]

Util to build a SampleBatch incrementally.

For efficiency, SampleBatches hold values in column form (as arrays). However, it is useful to add data one row (dict) at a time.

add_values(**values: Dict[str, Any]) → None[source]

Add the given dictionary (row) of values to this batch.

add_batch(batch: ray.rllib.policy.sample_batch.SampleBatch) → None[source]

Add the given batch of values to this batch.

build_and_reset() → ray.rllib.policy.sample_batch.SampleBatch[source]

Returns a sample batch including all previously added values.
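
A minimal usage sketch (the column names and values are arbitrary):

>>> builder = SampleBatchBuilder()
>>> builder.add_values(obs=1, actions=0, rewards=1.0)
>>> builder.add_values(obs=2, actions=1, rewards=0.5)
>>> batch = builder.build_and_reset()  # SampleBatch holding the two rows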

class ray.rllib.evaluation.MultiAgentSampleBatchBuilder(policy_map: Dict[str, ray.rllib.policy.policy.Policy], clip_rewards: bool, callbacks: DefaultCallbacks)[source]

Util to build SampleBatches for each policy in a multi-agent env.

Input data is per-agent, while output data is per-policy. There is an M:N mapping between agents and policies. We retain one local batch builder per agent. When an agent is done, then its local batch is appended into the corresponding policy batch for the agent’s policy.

total() → int[source]

Returns the total number of steps taken in the env (all agents).

Returns

The number of steps taken in total in the environment over all

agents.

Return type

int

has_pending_agent_data() → bool[source]

Returns whether there is pending unprocessed data.

Returns

True if there is at least one per-agent builder (with data

in it).

Return type

bool

add_values(agent_id: Any, policy_id: Any, **values: Dict[str, Any]) → None[source]

Add the given dictionary (row) of values to this batch.

Parameters
  • agent_id (obj) – Unique id for the agent we are adding values for.

  • policy_id (obj) – Unique id for policy controlling the agent.

  • values (dict) – Row of values to add for this agent.

postprocess_batch_so_far(episode: Optional[ray.rllib.evaluation.episode.MultiAgentEpisode] = None) → None[source]

Apply policy postprocessors to any unprocessed rows.

This pushes the postprocessed per-agent batches onto the per-policy builders, clearing per-agent state.

Parameters

episode (Optional[MultiAgentEpisode]) – The Episode object that holds this MultiAgentBatchBuilder object.

build_and_reset(episode: Optional[ray.rllib.evaluation.episode.MultiAgentEpisode] = None) → ray.rllib.policy.sample_batch.MultiAgentBatch[source]

Returns the accumulated sample batches for each policy.

Any unprocessed rows will be first postprocessed with a policy postprocessor. The internal state of this builder will be reset.

Parameters

episode (Optional[MultiAgentEpisode]) – The Episode object that holds this MultiAgentBatchBuilder object or None.

Returns

Returns the accumulated sample batches for each

policy.

Return type

MultiAgentBatch

class ray.rllib.evaluation.SyncSampler(*, worker: RolloutWorker, env: ray.rllib.env.base_env.BaseEnv, policies: Dict[str, ray.rllib.policy.policy.Policy], policy_mapping_fn: Callable[[Any], str], preprocessors: Dict[str, ray.rllib.models.preprocessors.Preprocessor], obs_filters: Dict[str, ray.rllib.utils.filter.Filter], clip_rewards: bool, rollout_fragment_length: int, callbacks: DefaultCallbacks, horizon: int = None, pack_multiple_episodes_in_batch: bool = False, tf_sess=None, clip_actions: bool = True, soft_horizon: bool = False, no_done_at_end: bool = False, observation_fn: ObservationFunction = None, _use_trajectory_view_api: bool = False)[source]

Sync SamplerInput that collects experiences when get_data() is called.

class ray.rllib.evaluation.AsyncSampler(*, worker: RolloutWorker, env: ray.rllib.env.base_env.BaseEnv, policies: Dict[str, ray.rllib.policy.policy.Policy], policy_mapping_fn: Callable[[Any], str], preprocessors: Dict[str, ray.rllib.models.preprocessors.Preprocessor], obs_filters: Dict[str, ray.rllib.utils.filter.Filter], clip_rewards: bool, rollout_fragment_length: int, callbacks: DefaultCallbacks, horizon: int = None, pack_multiple_episodes_in_batch: bool = False, tf_sess=None, clip_actions: bool = True, blackhole_outputs: bool = False, soft_horizon: bool = False, no_done_at_end: bool = False, observation_fn: ObservationFunction = None, _use_trajectory_view_api: bool = False)[source]

Async SamplerInput that collects experiences in a background thread and queues them.

Once started, experiences are continuously collected and put into a Queue, from where they can be unqueued by the caller of get_data().

run()[source]

Method representing the thread’s activity.

You may override this method in a subclass. The standard run() method invokes the callable object passed to the object’s constructor as the target argument, if any, with sequential and keyword arguments taken from the args and kwargs arguments, respectively.

ray.rllib.evaluation.compute_advantages(rollout: ray.rllib.policy.sample_batch.SampleBatch, last_r: float, gamma: float = 0.9, lambda_: float = 1.0, use_gae: bool = True, use_critic: bool = True)[source]

Given a rollout, compute its value targets and the advantage.

Parameters
  • rollout (SampleBatch) – SampleBatch of a single trajectory

  • last_r (float) – Value estimation for last observation

  • gamma (float) – Discount factor.

  • lambda_ (float) – Parameter for GAE

  • use_gae (bool) – Using Generalized Advantage Estimation

  • use_critic (bool) – Whether to use critic (value estimates). Setting this to False will use 0 as baseline.

Returns

Object with experience from rollout and

processed rewards.

Return type

SampleBatch (SampleBatch)
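
A minimal sketch with use_gae=False and use_critic=False, in which case only the “actions” and “rewards” columns are needed and the advantages reduce to discounted returns; with the default use_gae=True, the rollout must also carry value-function predictions:

>>> rollout = SampleBatch({"actions": [0, 1, 0], "rewards": [1.0, 1.0, 1.0]})
>>> processed = compute_advantages(
...     rollout, last_r=0.0, gamma=0.99, use_gae=False, use_critic=False)
>>> # processed now contains an "advantages" column with the discounted returns.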

ray.rllib.evaluation.collect_metrics(local_worker: Optional[RolloutWorker] = None, remote_workers: List[ActorHandle] = [], to_be_collected: List[ObjectRef] = [], timeout_seconds: int = 180) → dict[source]

Gathers episode metrics from RolloutWorker instances.

class ray.rllib.evaluation.SampleBatch(*args, **kwargs)[source]

Wrapper around a dictionary with string keys and array-like values.

For example, {“obs”: [1, 2, 3], “reward”: [0, -1, 1]} is a batch of three samples, each with an “obs” and “reward” attribute.

static concat_samples(samples: List[Dict[str, Any]]) → Union[ray.rllib.policy.sample_batch.SampleBatch, ray.rllib.policy.sample_batch.MultiAgentBatch][source]

Concatenates n data dicts or MultiAgentBatches.

Parameters

samples (List[Dict[str, TensorType]]) – List of dicts of data (numpy).

Returns

A new (compressed)

SampleBatch or MultiAgentBatch.

Return type

Union[SampleBatch, MultiAgentBatch]

concat(other: ray.rllib.policy.sample_batch.SampleBatch) → ray.rllib.policy.sample_batch.SampleBatch[source]

Returns a new SampleBatch with each data column concatenated.

Parameters

other (SampleBatch) – The other SampleBatch object to concat to this one.

Returns

The new SampleBatch, resulting from concating other

to self.

Return type

SampleBatch

Examples

>>> b1 = SampleBatch({"a": [1, 2]})
>>> b2 = SampleBatch({"a": [3, 4, 5]})
>>> print(b1.concat(b2))
{"a": [1, 2, 3, 4, 5]}
copy() → ray.rllib.policy.sample_batch.SampleBatch[source]

Creates a (deep) copy of this SampleBatch and returns it.

Returns

A (deep) copy of this SampleBatch object.

Return type

SampleBatch

rows() → Dict[str, Any][source]

Returns an iterator over data rows, i.e. dicts with column values.

Yields

Dict[str, TensorType]

The column values of the row in this

iteration.

Examples

>>> batch = SampleBatch({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> for row in batch.rows():
       print(row)
{"a": 1, "b": 4}
{"a": 2, "b": 5}
{"a": 3, "b": 6}
columns(keys: List[str]) → List[any][source]

Returns a list of the batch-data in the specified columns.

Parameters

keys (List[str]) – List of column names for which to return the data.

Returns

The list of data items ordered by the order of column

names in keys.

Return type

List[any]

Examples

>>> batch = SampleBatch({"a": [1], "b": [2], "c": [3]})
>>> print(batch.columns(["a", "b"]))
[[1], [2]]
shuffle() → None[source]

Shuffles the rows of this batch in-place.

split_by_episode() → List[ray.rllib.policy.sample_batch.SampleBatch][source]

Splits this batch’s data by eps_id.

Returns

List of batches, one per distinct episode.

Return type

List[SampleBatch]

slice(start: int, end: int) → ray.rllib.policy.sample_batch.SampleBatch[source]

Returns a slice of the row data of this batch (w/o copying).

Parameters
  • start (int) – Starting index.

  • end (int) – Ending index.

Returns

A new SampleBatch, which has a slice of this batch’s

data.

Return type

SampleBatch

timeslices(k: int) → List[ray.rllib.policy.sample_batch.SampleBatch][source]

Returns SampleBatches, each one representing a k-slice of this one.

Will start from timestep 0 and produce slices of size=k.

Parameters

k (int) – The size (in timesteps) of each returned SampleBatch.

Returns

The list of (new) SampleBatches (each one of

size k).

Return type

List[SampleBatch]
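
A sketch of the behavior described above:

>>> batch = SampleBatch({"a": [1, 2, 3, 4, 5]})
>>> slices = batch.timeslices(2)
>>> # Expected to yield three SampleBatches with columns
>>> # {"a": [1, 2]}, {"a": [3, 4]} and {"a": [5]}.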

keys() → Iterable[str][source]
Returns

The keys() iterable over self.data.

Return type

Iterable[str]

items() → Iterable[Any][source]
Returns

The items() iterable over self.data.

Return type

Iterable[TensorType]

get(key: str) → Optional[Any][source]

Returns one column (by key) from the data or None if key not found.

Parameters

key (str) – The key (column name) to return.

Returns

The data under the given key. None if key

not found in data.

Return type

Optional[TensorType]

size_bytes() → int[source]
Returns

The overall size in bytes of the data buffer (all columns).

Return type

int

compress(bulk: bool = False, columns: Set[str] = frozenset({'new_obs', 'obs'})) → None[source]

Compresses the data buffers (by column) in place.

Parameters
  • bulk (bool) – Whether to compress across the batch dimension (0) as well. If False will compress n separate list items, where n is the batch size.

  • columns (Set[str]) – The columns to compress. Default: Only compress the obs and new_obs columns.

decompress_if_needed(columns: Set[str] = frozenset({'new_obs', 'obs'})) → ray.rllib.policy.sample_batch.SampleBatch[source]

Decompresses data buffers (per column, if already compressed) in place.

Parameters

columns (Set[str]) – The columns to decompress. Default: Only decompress the obs and new_obs columns.

Returns

This very SampleBatch.

Return type

SampleBatch

class ray.rllib.evaluation.MultiAgentBatch(policy_batches: Dict[Any, ray.rllib.policy.sample_batch.SampleBatch], env_steps: int)[source]

A batch of experiences from multiple agents in the environment.

policy_batches

Mapping from policy ids to SampleBatches of experiences.

Type

Dict[PolicyID, SampleBatch]

count

The number of env steps in this batch.

Type

int

env_steps() → int[source]

The number of env steps (there are >= 1 agent steps per env step).

Returns

The number of environment steps contained in this batch.

Return type

int

agent_steps() → int[source]

The number of agent steps (there are >= 1 agent steps per env step).

Returns

The number of agent steps total in this batch.

Return type

int

timeslices(k: int) → List[ray.rllib.policy.sample_batch.MultiAgentBatch][source]

Returns k-step batches holding data for each agent at those steps.

For example, suppose we have agent1 observations [a1t1, a1t2, a1t3], for agent2, [a2t1, a2t3], and for agent3, [a3t3] only.

Calling timeslices(1) would return three MultiAgentBatches containing [a1t1, a2t1], [a1t2], and [a1t3, a2t3, a3t3].

Calling timeslices(2) would return two MultiAgentBatches containing [a1t1, a1t2, a2t1], and [a1t3, a2t3, a3t3].

This method is used to implement “lockstep” replay mode. Note that this method does not guarantee each batch contains only data from a single unroll. Batches might contain data from multiple different envs.

static wrap_as_needed(policy_batches: Dict[Any, ray.rllib.policy.sample_batch.SampleBatch], env_steps: int) → Union[ray.rllib.policy.sample_batch.SampleBatch, ray.rllib.policy.sample_batch.MultiAgentBatch][source]

Returns SampleBatch or MultiAgentBatch, depending on given policies.

Parameters
  • policy_batches (Dict[PolicyID, SampleBatch]) – Mapping from policy ids to SampleBatch.

  • env_steps (int) – Number of env steps in the batch.

Returns

The single default policy’s

SampleBatch or a MultiAgentBatch (more than one policy).

Return type

Union[SampleBatch, MultiAgentBatch]
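
A sketch of both cases, assuming the “default_policy” id used elsewhere in this reference:

>>> single = MultiAgentBatch.wrap_as_needed(
...     {"default_policy": SampleBatch({"obs": [1, 2]})}, env_steps=2)
>>> # -> plain SampleBatch, since only the default policy is present.
>>> multi = MultiAgentBatch.wrap_as_needed(
...     {"policy_1": SampleBatch({"obs": [1]}),
...      "policy_2": SampleBatch({"obs": [2]})}, env_steps=1)
>>> # -> MultiAgentBatch keyed by policy id.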

static concat_samples(samples: List[MultiAgentBatch]) → ray.rllib.policy.sample_batch.MultiAgentBatch[source]

Concatenates a list of MultiAgentBatches into a new MultiAgentBatch.

Parameters

samples (List[MultiAgentBatch]) – List of MultiagentBatch objects to concatenate.

Returns

A new MultiAgentBatch consisting of the

concatenated inputs.

Return type

MultiAgentBatch

copy() → ray.rllib.policy.sample_batch.MultiAgentBatch[source]

Deep-copies self into a new MultiAgentBatch.

Returns

The copy of self with deep-copied data.

Return type

MultiAgentBatch

size_bytes() → int[source]
Returns

The overall size in bytes of all policy batches (all columns).

Return type

int

compress(bulk: bool = False, columns: Set[str] = frozenset({'new_obs', 'obs'})) → None[source]

Compresses each policy batch (per column) in place.

Parameters
  • bulk (bool) – Whether to compress across the batch dimension (0) as well. If False will compress n separate list items, where n is the batch size.

  • columns (Set[str]) – Set of column names to compress.

decompress_if_needed(columns: Set[str] = frozenset({'new_obs', 'obs'})) → ray.rllib.policy.sample_batch.MultiAgentBatch[source]

Decompresses each policy batch (per column), if already compressed.

Parameters

columns (Set[str]) – Set of column names to decompress.

Returns

This very MultiAgentBatch.

Return type

MultiAgentBatch

ray.rllib.execution

ray.rllib.models

class ray.rllib.models.ActionDistribution(inputs: List[Any], model: ray.rllib.models.modelv2.ModelV2)[source]

The policy action distribution of an agent.

inputs

input vector to compute samples from.

Type

Tensors

model

reference to model producing the inputs.

Type

ModelV2

sample() → Any[source]

Draw a sample from the action distribution.

deterministic_sample() → Any[source]

Get the deterministic “sampling” output from the distribution. This is usually the max likelihood output, i.e. mean for Normal, argmax for Categorical, etc..

sampled_action_logp() → Any[source]

Returns the log probability of the last sampled action.

logp(x: Any) → Any[source]

The log-likelihood of the action distribution.

kl(other: ray.rllib.models.action_dist.ActionDistribution) → Any[source]

The KL-divergence between two action distributions.

entropy() → Any[source]

The entropy of the action distribution.

multi_kl(other: ray.rllib.models.action_dist.ActionDistribution) → Any[source]

The KL-divergence between two action distributions.

This differs from kl() in that it can return an array for MultiDiscrete. TODO(ekl) consider removing this.

multi_entropy() → Any[source]

The entropy of the action distribution.

This differs from entropy() in that it can return an array for MultiDiscrete. TODO(ekl) consider removing this.

static required_model_output_shape(action_space: <Mock name='mock.Space' id='139809893919824'>, model_config: dict) → Union[int, numpy.ndarray][source]

Returns the required shape of an input parameter tensor for a particular action space and an optional dict of distribution-specific options.

Parameters
  • action_space (gym.Space) – The action space this distribution will be used for, whose shape attributes will be used to determine the required shape of the input parameter tensor.

  • model_config (dict) – Model’s config dict (as defined in catalog.py)

Returns

size of the

required input vector (minus leading batch dimension).

Return type

model_output_shape (int or np.ndarray of ints)

class ray.rllib.models.ModelCatalog[source]

Registry of models, preprocessors, and action distributions for envs.

Examples

>>> prep = ModelCatalog.get_preprocessor(env)
>>> observation = prep.transform(raw_observation)
>>> dist_class, dist_dim = ModelCatalog.get_action_dist(
...     env.action_space, {})
>>> model = ModelCatalog.get_model_v2(
...     obs_space, action_space, num_outputs, options)
>>> dist = dist_class(model.outputs, model)
>>> action = dist.sample()
static get_action_dist(action_space: <Mock name='mock.Space' id='139809893919824'>, config: dict, dist_type: str = None, framework: str = 'tf', **kwargs) -> (<class 'type'>, <class 'int'>)[source]

Returns a distribution class and size for the given action space.

Parameters
  • action_space (Space) – Action space of the target gym env.

  • config (Optional[dict]) – Optional model config.

  • dist_type (Optional[str]) – Identifier of the action distribution interpreted as a hint.

  • framework (str) – One of “tf”, “tfe”, or “torch”.

  • kwargs (dict) – Optional kwargs to pass on to the Distribution’s constructor.

Returns

  • dist_class (ActionDistribution): Python class of the

    distribution.

  • dist_dim (int): The size of the input vector to the

    distribution.

Return type

Tuple

static get_action_shape(action_space: <Mock name='mock.Space' id='139809893919824'>) -> (<class 'numpy.dtype'>, typing.List[int])[source]

Returns action tensor dtype and shape for the action space.

Parameters

action_space (Space) – Action space of the target gym env.

Returns

Dtype and shape of the actions tensor.

Return type

(dtype, shape)

static get_action_placeholder(action_space: <Mock name='mock.Space' id='139809893919824'>, name: str = 'action') → Any[source]

Returns an action placeholder consistent with the action space

Parameters
  • action_space (Space) – Action space of the target gym env.

  • name (str) – An optional string to name the placeholder by. Default: “action”.

Returns

A placeholder for the actions

Return type

action_placeholder (Tensor)

static get_model_v2(obs_space: <Mock name='mock.Space' id='139809893919824'>, action_space: <Mock name='mock.Space' id='139809893919824'>, num_outputs: int, model_config: dict, framework: str = 'tf', name: str = 'default_model', model_interface: type = None, default_model: type = None, **model_kwargs) → ray.rllib.models.modelv2.ModelV2[source]

Returns a suitable model compatible with given spaces and output.

Parameters
  • obs_space (Space) – Observation space of the target gym env. This may have an original_space attribute that specifies how to unflatten the tensor into a ragged tensor.

  • action_space (Space) – Action space of the target gym env.

  • num_outputs (int) – The size of the output vector of the model.

  • framework (str) – One of “tf”, “tfe”, or “torch”.

  • name (str) – Name (scope) for the model.

  • model_interface (cls) – Interface required for the model

  • default_model (cls) – Override the default class for the model. This only has an effect when not using a custom model

  • model_kwargs (dict) – args to pass to the ModelV2 constructor

Returns

Model to use for the policy.

Return type

model (ModelV2)

static get_preprocessor(env: <Mock name='mock.Env' id='139809907678904'>, options: dict = None) → ray.rllib.models.preprocessors.Preprocessor[source]

Returns a suitable preprocessor for the given env.

This is a wrapper for get_preprocessor_for_space().

static get_preprocessor_for_space(observation_space: <Mock name='mock.Space' id='139809893919824'>, options: dict = None) → ray.rllib.models.preprocessors.Preprocessor[source]

Returns a suitable preprocessor for the given observation space.

Parameters
  • observation_space (Space) – The input observation space.

  • options (dict) – Options to pass to the preprocessor.

Returns

Preprocessor for the observations.

Return type

preprocessor (Preprocessor)

static register_custom_preprocessor(preprocessor_name: str, preprocessor_class: type) → None[source]

Register a custom preprocessor class by name.

The preprocessor can later be used by specifying {“custom_preprocessor”: preprocessor_name} in the model config.

Parameters
  • preprocessor_name (str) – Name to register the preprocessor under.

  • preprocessor_class (type) – Python class of the preprocessor.

static register_custom_model(model_name: str, model_class: type) → None[source]

Register a custom model class by name.

The model can later be used by specifying {“custom_model”: model_name} in the model config.

Parameters
  • model_name (str) – Name to register the model under.

  • model_class (type) – Python class of the model.
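
For example (a sketch; MyModelClass stands in for your own ModelV2 subclass):

>>> ModelCatalog.register_custom_model("my_model", MyModelClass)
>>> trainer = PGTrainer(env="CartPole-v0", config={
...     "model": {"custom_model": "my_model"},
... })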

static register_custom_action_dist(action_dist_name: str, action_dist_class: type) → None[source]

Register a custom action distribution class by name.

The action distribution can later be used by specifying {“custom_action_dist”: action_dist_name} in the model config.

Parameters
  • action_dist_name (str) – Name to register the action distribution under.

  • action_dist_class (type) – Python class of the action distribution.

static get_model(input_dict, obs_space, action_space, num_outputs, options, state_in=None, seq_lens=None)[source]

Deprecated: Use get_model_v2() instead.

class ray.rllib.models.Model(input_dict, obs_space, action_space, num_outputs, options, state_in=None, seq_lens=None)[source]

This class is deprecated! Use ModelV2 instead.

value_function()[source]

Builds the value function output.

This method can be overridden to customize the implementation of the value function (e.g., not sharing hidden layers).

Returns

Tensor of size [BATCH_SIZE] for the value function.

custom_loss(policy_loss, loss_inputs)[source]

Override to customize the loss function used to optimize this model.

This can be used to incorporate self-supervised losses (by defining a loss over existing input and output tensors of this model), and supervised losses (by defining losses over a variable-sharing copy of this model’s layers).

You can find a runnable example in examples/custom_loss.py.

Parameters
  • policy_loss (Tensor) – scalar policy loss from the policy.

  • loss_inputs (dict) – map of input placeholders for rollout data.

Returns

Scalar tensor for the customized loss for this model.

custom_stats()[source]

Override to return custom metrics from your model.

The stats will be reported as part of the learner stats, i.e.:

info:
  learner:
    model:
      key1: metric1
      key2: metric2

Returns

Dict of string keys to scalar tensors.

loss()[source]

Deprecated: use self.custom_loss().

class ray.rllib.models.Preprocessor(obs_space: <Mock name='mock.Space' id='139809893919824'>, options: dict = None)[source]

Defines an abstract observation preprocessor function.

shape

Shape of the preprocessed output.

Type

List[int]

transform(observation: Any) → numpy.ndarray[source]

Returns the preprocessed observation.

write(observation: Any, array: numpy.ndarray, offset: int) → None[source]

Alternative to transform for more efficient flattening.

check_shape(observation: Any) → None[source]

Checks the shape of the given observation.

class ray.rllib.models.FullyConnectedNetwork(input_dict, obs_space, action_space, num_outputs, options, state_in=None, seq_lens=None)[source]

Generic fully connected network.

class ray.rllib.models.VisionNetwork(input_dict, obs_space, action_space, num_outputs, options, state_in=None, seq_lens=None)[source]

Generic vision network.

ray.rllib.utils

ray.rllib.utils.override(cls)[source]

Annotation for documenting method overrides.

Parameters

cls (type) – The superclass that provides the overridden method. If this cls does not actually have the method, an error is raised.
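
A short usage sketch:

>>> from ray.rllib.policy import Policy
>>> from ray.rllib.utils import override
>>> class MyPolicy(Policy):
...     @override(Policy)
...     def compute_actions(self, obs_batch, **kwargs):
...         ...  # custom implementation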

ray.rllib.utils.PublicAPI(obj)[source]

Annotation for documenting public APIs.

Public APIs are classes and methods exposed to end users of RLlib. You can expect these APIs to remain stable across RLlib releases.

Subclasses that inherit from a @PublicAPI base class can be assumed part of the RLlib public API as well (e.g., all trainer classes are in public API because Trainer is @PublicAPI).

In addition, you can assume all trainer configurations are part of their public API as well.

ray.rllib.utils.DeveloperAPI(obj)[source]

Annotation for documenting developer APIs.

Developer APIs are classes and methods explicitly exposed to developers for the purposes of building custom algorithms or advanced training strategies on top of RLlib internals. You can generally expect these APIs to be stable sans minor changes (but less stable than public APIs).

Subclasses that inherit from a @DeveloperAPI base class can be assumed part of the RLlib developer API as well.

ray.rllib.utils.try_import_tf(error=False)[source]

Tries importing tf and returns the module (or None).

Parameters

error (bool) – Whether to raise an error if tf cannot be imported.

Returns

  • tf1.x module (either from tf2.x.compat.v1 OR as tf1.x).

  • tf module (resulting from import tensorflow).

    Either tf1.x or 2.x.

  • The actually installed tf version as int: 1 or 2.

Return type

Tuple

Raises

ImportError – If error=True and tf is not installed.
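
Typical usage (a sketch; per the description above, a 3-tuple is returned and the module entries are None if TensorFlow is not installed):

>>> from ray.rllib.utils import try_import_tf
>>> tf1, tf, tf_version = try_import_tf()
>>> if tf is None:
...     print("TensorFlow is not installed.")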

ray.rllib.utils.try_import_tfp(error=False)[source]

Tries importing tfp and returns the module (or None).

Parameters

error (bool) – Whether to raise an error if tfp cannot be imported.

Returns

The tfp module.

Raises

ImportError – If error=True and tfp is not installed.

ray.rllib.utils.try_import_torch(error=False)[source]

Tries importing torch and returns the module (or None).

Parameters

error (bool) – Whether to raise an error if torch cannot be imported.

Returns

torch AND torch.nn modules.

Return type

tuple

Raises

ImportError – If error=True and PyTorch is not installed.

ray.rllib.utils.deprecation_warning(old, new=None, error=None)[source]

Logs (via the logger object) or throws a deprecation warning/error.

Parameters
  • old (str) – A description of the “thing” that is to be deprecated.

  • new (Optional[str]) – A description of the new “thing” that replaces it.

  • error (Optional[bool,Exception]) – Whether or which exception to throw. If True, throw ValueError.

ray.rllib.utils.renamed_agent(cls)[source]

Helper class for renaming Agent => Trainer with a warning.

ray.rllib.utils.renamed_class(cls, old_name)[source]

Helper class for renaming classes with a warning.

ray.rllib.utils.renamed_function(func, old_name)[source]

Helper function for renaming a function.

class ray.rllib.utils.FilterManager[source]

Manages filters and coordination across remote evaluators that expose get_filters and sync_filters.

static synchronize(local_filters, remotes, update_remote=True)[source]

Aggregates all filters from remote evaluators.

Local copy is updated and then broadcasted to all remote evaluators.

Parameters
  • local_filters (dict) – Filters to be synchronized.

  • remotes (list) – Remote evaluators with filters.

  • update_remote (bool) – Whether to push updates to remote filters.

class ray.rllib.utils.Filter[source]

Processes input, possibly statefully.

apply_changes(other, *args, **kwargs)[source]

Updates self with “new state” from other filter.

copy()[source]

Creates a new object with same state as self.

Returns

A copy of self.

sync(other)[source]

Copies all state from other filter to self.

clear_buffer()[source]

Creates a copy of the current state and clears the accumulated state.

ray.rllib.utils.sigmoid(x, derivative=False)[source]

Returns the sigmoid function applied to x. Alternatively, can return the derivative of the sigmoid function.

Parameters
  • x (np.ndarray) – The input to the sigmoid function.

  • derivative (bool) – Whether to return the derivative or not. Default: False.

Returns

The sigmoid function (or its derivative) applied to x.

Return type

np.ndarray

ray.rllib.utils.softmax(x, axis=-1)[source]

Returns the softmax values for x as: S(xi) = e^xi / SUMj(e^xj), where j goes over all elements in x.

Parameters
  • x (np.ndarray) – The input to the softmax function.

  • axis (int) – The axis along which to softmax.

Returns

The softmax over x.

Return type

np.ndarray
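
A small worked example of the formula above (output values are approximate):

>>> import numpy as np
>>> from ray.rllib.utils import softmax
>>> softmax(np.array([1.0, 2.0, 3.0]))  # -> approx. [0.090, 0.245, 0.665]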

ray.rllib.utils.relu(x, alpha=0.0)[source]

Implementation of the leaky ReLU function: y = x * alpha if x < 0 else x

Parameters
  • x (np.ndarray) – The input values.

  • alpha (float) – A scaling (“leak”) factor to use for negative x.

Returns

The leaky ReLU output for x.

Return type

np.ndarray

ray.rllib.utils.one_hot(x, depth=0, on_value=1, off_value=0)[source]

One-hot utility function for numpy. Thanks to qianyizhang: https://gist.github.com/qianyizhang/07ee1c15cad08afb03f5de69349efc30.

Parameters
  • x (np.ndarray) – The input to be one-hot encoded.

  • depth (int) – The max. number to be one-hot encoded (size of last rank).

  • on_value (float) – The value to use for on. Default: 1.0.

  • off_value (float) – The value to use for off. Default: 0.0.

Returns

The one-hot encoded equivalent of the input array.

Return type

np.ndarray
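
For example (a sketch; output values are shown approximately):

>>> import numpy as np
>>> from ray.rllib.utils import one_hot
>>> one_hot(np.array([0, 2]), depth=3)
>>> # -> roughly [[1., 0., 0.], [0., 0., 1.]]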

ray.rllib.utils.fc(x, weights, biases=None, framework=None)[source]

Calculates the outputs of a fully-connected (dense) layer given weights/biases and an input.

Parameters
  • x (np.ndarray) – The input to the dense layer.

  • weights (np.ndarray) – The weights matrix.

  • biases (Optional[np.ndarray]) – The biases vector. All 0s if None.

  • framework (Optional[str]) – An optional framework hint (to figure out, e.g. whether to transpose torch weight matrices).

Returns

The dense layer’s output.

ray.rllib.utils.lstm(x, weights, biases=None, initial_internal_states=None, time_major=False, forget_bias=1.0)[source]

Calculates the outputs of an LSTM layer given weights/biases, internal_states, and input.

Parameters
  • x (np.ndarray) – The inputs to the LSTM layer including time-rank (0th if time-major, else 1st) and the batch-rank (1st if time-major, else 0th).

  • weights (np.ndarray) – The weights matrix.

  • biases (Optional[np.ndarray]) – The biases vector. All 0s if None.

  • initial_internal_states (Optional[np.ndarray]) – The initial internal states to pass into the layer. All 0s if None.

  • time_major (bool) – Whether to use time-major or not. Default: False.

  • forget_bias (float) – Gets added to first sigmoid (forget gate) output. Default: 1.0.

Returns

  • The LSTM layer’s output.

  • Tuple: Last (c-state, h-state).

Return type

Tuple

class ray.rllib.utils.PolicyClient(address)[source]

DEPRECATED: Please use rllib.env.PolicyClient instead.

start_episode(episode_id=None, training_enabled=True)[source]

Record the start of an episode.

Parameters
  • episode_id (Optional[str]) – Unique string id for the episode or None for it to be auto-assigned.

  • training_enabled (bool) – Whether to use experiences for this episode to improve the policy.

Returns

Unique string id for the episode.

Return type

episode_id (str)

get_action(episode_id, observation)[source]

Record an observation and get the on-policy action.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • observation (obj) – Current environment observation.

Returns

Action from the env action space.

Return type

action (obj)

log_action(episode_id, observation, action)[source]

Record an observation and (off-policy) action taken.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • observation (obj) – Current environment observation.

  • action (obj) – Action for the observation.

log_returns(episode_id, reward, info=None)[source]

Record returns from the environment.

The reward will be attributed to the previous action taken by the episode. Rewards accumulate until the next action. If no reward is logged before the next action, a reward of 0.0 is assumed.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • reward (float) – Reward from the environment.

end_episode(episode_id, observation)[source]

Record the end of an episode.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • observation (obj) – Current environment observation.

class ray.rllib.utils.PolicyServer(external_env, address, port)[source]

DEPRECATED: Please use rllib.env.PolicyServerInput instead.

class ray.rllib.utils.LinearSchedule(**kwargs)[source]

Linear interpolation between initial_p and final_p. Simply uses Polynomial with power=1.0.

final_p + (initial_p - final_p) * (1 - t/t_max)

class ray.rllib.utils.PiecewiseSchedule(endpoints, framework, interpolation=<function _linear_interpolation>, outside_value=None)[source]
class ray.rllib.utils.PolynomialSchedule(schedule_timesteps, final_p, framework, initial_p=1.0, power=2.0)[source]
class ray.rllib.utils.ExponentialSchedule(schedule_timesteps, framework, initial_p=1.0, decay_rate=0.1)[source]
class ray.rllib.utils.ConstantSchedule(value, framework)[source]

A Schedule where the value remains constant over time.

ray.rllib.utils.check(x, y, decimals=5, atol=None, rtol=None, false=False)[source]

Checks two structures (dict, tuple, list, np.array, float, int, etc.) for (almost) numeric identity. All numbers in the two structures must match up to the given number of digits after the decimal point. Uses assertions.

Parameters
  • x (any) – The value to be compared (to the expectation: y). This may be a Tensor.

  • y (any) – The expected value to be compared to x. This must not be a tf-Tensor, but may be a tfe/torch-Tensor.

  • decimals (int) – The number of digits after the floating point up to which all numeric values have to match.

  • atol (float) – Absolute tolerance of the difference between x and y (overrides decimals if given).

  • rtol (float) – Relative tolerance of the difference between x and y (overrides decimals if given).

  • false (bool) – Whether to check that x and y are NOT the same.
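
For example (a sketch of both a positive and a negated check):

>>> from ray.rllib.utils import check
>>> check({"a": [1.0, 2.0]}, {"a": [1.0, 2.0000001]})  # passes (within 5 decimals)
>>> check(0.1, 0.2, false=True)  # passes, because the values differ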

ray.rllib.utils.check_compute_single_action(trainer, include_state=False, include_prev_action_reward=False)[source]

Tests different combinations of arguments for trainer.compute_action.

Parameters
  • trainer (Trainer) – The Trainer object to test.

  • include_state (bool) – Whether to include the initial state of the Policy’s Model in the compute_action call.

  • include_prev_action_reward (bool) – Whether to include the prev-action and -reward in the compute_action call.

Throws:

ValueError: If anything unexpected happens.

ray.rllib.utils.framework_iterator(config=None, frameworks=('tf2', 'tf', 'tfe', 'torch'), session=False)[source]

A generator that allows looping through n frameworks for testing.

Provides the correct config entries (“framework”) as well as the correct eager/non-eager contexts for tfe/tf.

Parameters
  • config (Optional[dict]) – An optional config dict to alter in place depending on the iteration.

  • frameworks (Tuple[str]) – A list/tuple of the frameworks to be tested. Allowed are: “tf2”, “tf”, “tfe”, “torch”, and None.

  • session (bool) – If True and only in the tf-case: Enter a tf.Session() and yield that as second return value (otherwise yield (fw, None)).

Yields

str: If session is False, the current framework (“tf2”, “tf”, “tfe”, “torch”) being used.

Tuple[str, Union[None, tf.Session]]: If session is True, a tuple of the current framework and the tf.Session (or None if the framework is not “tf”).
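
A sketch of a typical test loop (PGTrainer as used elsewhere in this reference; the “framework” key of config is set on each iteration):

>>> config = {"num_workers": 0}
>>> for fw in framework_iterator(config, frameworks=("tf", "torch")):
...     trainer = PGTrainer(env="CartPole-v0", config=config)
...     trainer.train()
...     trainer.stop()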

ray.rllib.utils.merge_dicts(d1, d2)[source]
Parameters
  • d1 (dict) – Dict 1.

  • d2 (dict) – Dict 2.

Returns

A new dict that is d1 and d2 deep merged.

Return type

dict
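
For instance (a sketch of the deep merge):

>>> from ray.rllib.utils import merge_dicts
>>> merge_dicts({"a": 1, "b": {"c": 2}}, {"b": {"d": 3}})
>>> # -> {"a": 1, "b": {"c": 2, "d": 3}}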

ray.rllib.utils.deep_update(original, new_dict, new_keys_allowed=False, allow_new_subkey_list=None, override_all_if_type_changes=None)[source]

Updates original dict with values from new_dict recursively.

If a new key is introduced in new_dict and new_keys_allowed is not True, an error will be thrown. Further, for sub-dicts, if the key is in allow_new_subkey_list, new subkeys can be introduced.

Parameters
  • original (dict) – Dictionary with default values.

  • new_dict (dict) – Dictionary with values to be updated

  • new_keys_allowed (bool) – Whether new keys are allowed.

  • allow_new_subkey_list (Optional[List[str]]) – List of keys that correspond to dict values where new subkeys can be introduced. This is only at the top level.

  • override_all_if_type_changes (Optional[List[str]]) – List of top level keys with value=dict, for which we always simply override the entire value (dict), iff the “type” key in that value dict changes.

ray.rllib.utils.add_mixins(base, mixins)[source]

Returns a new class with mixins applied in priority order.

ray.rllib.utils.force_list(elements=None, to_tuple=False)[source]

Makes sure elements is returned as a list, whether elements is a single item, already a list, or a tuple.

Parameters
  • elements (Optional[any]) – The inputs as single item, list, or tuple to be converted into a list/tuple. If None, returns empty list/tuple.

  • to_tuple (bool) – Whether to use tuple (instead of list).

Returns

All given elements in a list/tuple depending on

to_tuple’s value. If elements is None, returns an empty list/tuple.

Return type

Union[list,tuple]
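
For example:

>>> from ray.rllib.utils import force_list
>>> force_list(5)        # -> [5]
>>> force_list((1, 2))   # -> [1, 2]
>>> force_list(None)     # -> []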

ray.rllib.utils.force_tuple(elements=None, *, to_tuple=True)

Makes sure elements is returned as a tuple (to_tuple defaults to True here), whether elements is a single item, a list, or already a tuple.

Parameters
  • elements (Optional[any]) – The inputs as single item, list, or tuple to be converted into a list/tuple. If None, returns empty list/tuple.

  • to_tuple (bool) – Whether to use tuple (instead of list).

Returns

All given elements in a list/tuple depending on

to_tuple’s value. If elements is None, returns an empty list/tuple.

Return type

Union[list,tuple]