RLlib Package Reference¶
ray.rllib.policy¶
-
class
ray.rllib.policy.
Policy
(observation_space: gym.spaces.Space, action_space: gym.spaces.Space, config: dict)[source]¶ An agent policy and loss, i.e., a TFPolicy or other subclass.
This object defines how to act in the environment, and also losses used to improve the policy based on its experiences. Note that both policy and loss are defined together for convenience, though the policy itself is logically separate.
All policies can directly extend Policy, however TensorFlow users may find TFPolicy simpler to implement. TFPolicy also enables RLlib to apply TensorFlow-specific optimizations such as fusing multiple policy graphs and multi-GPU support.
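Examples
A minimal sketch of a direct Policy subclass (the class name and the random-action behavior are illustrative only, not part of RLlib); it shows the handful of methods a hand-written policy typically overrides:

from ray.rllib.policy import Policy


class MyRandomPolicy(Policy):
    """Illustrative Policy subclass that returns random actions and never learns."""

    def compute_actions(self,
                        obs_batch,
                        state_batches=None,
                        prev_action_batch=None,
                        prev_reward_batch=None,
                        info_batch=None,
                        episodes=None,
                        explore=None,
                        timestep=None,
                        **kwargs):
        # One (random) action per observation, no RNN state outs, no extra info.
        actions = [self.action_space.sample() for _ in obs_batch]
        return actions, [], {}

    def learn_on_batch(self, samples):
        # A random policy does not learn; return an empty stats dict.
        return {}

    def get_weights(self):
        return {}

    def set_weights(self, weights):
        pass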
-
observation_space
¶ Observation space of the policy. For complex spaces (e.g., Dict), this will be a flattened version of the space, and you can access the original space via observation_space.original_space.
- Type
gym.Space
-
action_space
¶ Action space of the policy.
- Type
gym.Space
-
exploration
¶ The exploration object to use for computing actions, or None.
- Type
Exploration
-
abstract
compute_actions
(obs_batch: Union[List[Any], Any], state_batches: Optional[List[Any]] = None, prev_action_batch: Union[List[Any], Any] = None, prev_reward_batch: Union[List[Any], Any] = None, info_batch: Optional[Dict[str, list]] = None, episodes: Optional[List[MultiAgentEpisode]] = None, explore: Optional[bool] = None, timestep: Optional[int] = None, **kwargs) → Tuple[Any, List[Any], Dict[str, Any]][source]¶ Computes actions for the current policy.
- Parameters
obs_batch (Union[List[TensorType], TensorType]) – Batch of observations.
state_batches (Optional[List[TensorType]]) – List of RNN state input batches, if any.
prev_action_batch (Union[List[TensorType], TensorType]) – Batch of previous action values.
prev_reward_batch (Union[List[TensorType], TensorType]) – Batch of previous rewards.
info_batch (Optional[Dict[str, list]]) – Batch of info objects.
episodes (Optional[List[MultiAgentEpisode]]) – List of MultiAgentEpisode, one for each obs in obs_batch. This provides access to all of the internal episode state, which may be useful for model-based or multiagent algorithms.
explore (Optional[bool]) – Whether to pick an exploitation or exploration action. Set to None (default) for using the value of self.config[“explore”].
timestep (Optional[int]) – The current (sampling) time step.
- Keyword Arguments
kwargs – forward compatibility placeholder
- Returns
- actions (TensorType): Batch of output actions, with shape like [BATCH_SIZE, ACTION_SHAPE].
- state_outs (List[TensorType]): List of RNN state output batches, if any, with shape like [STATE_SIZE, BATCH_SIZE].
- info (List[dict]): Dictionary of extra feature batches, if any, with shape like {"f1": [BATCH_SIZE, ...], "f2": [BATCH_SIZE, ...]}.
- Return type
Tuple
-
compute_single_action
(obs: Any, state: Optional[List[Any]] = None, prev_action: Optional[Any] = None, prev_reward: Optional[Any] = None, info: dict = None, episode: Optional[MultiAgentEpisode] = None, clip_actions: bool = False, explore: Optional[bool] = None, timestep: Optional[int] = None, **kwargs) → Tuple[Any, List[Any], Dict[str, Any]][source]¶ Unbatched version of compute_actions.
- Parameters
obs (TensorType) – Single observation.
state (Optional[List[TensorType]]) – List of RNN state inputs, if any.
prev_action (Optional[TensorType]) – Previous action value, if any.
prev_reward (Optional[TensorType]) – Previous reward, if any.
info (dict) – Info object, if any.
episode (Optional[MultiAgentEpisode]) – this provides access to all of the internal episode state, which may be useful for model-based or multi-agent algorithms.
clip_actions (bool) – Should actions be clipped?
explore (Optional[bool]) – Whether to pick an exploitation or exploration action (default: None -> use self.config[“explore”]).
timestep (Optional[int]) – The current (sampling) time step.
- Keyword Arguments
kwargs – Forward compatibility.
- Returns
- actions (TensorType): Single action.
- state_outs (List[TensorType]): List of RNN state outputs, if any.
- info (dict): Dictionary of extra features, if any.
- Return type
Tuple
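Examples
A usage sketch (the trainer and environment chosen here, PPOTrainer on CartPole-v0, are illustrative assumptions; any trainer's policy is used the same way):

import gym
import ray
from ray.rllib.agents.ppo import PPOTrainer

ray.init(ignore_reinit_error=True)
trainer = PPOTrainer(env="CartPole-v0",
                     config={"framework": "torch", "num_workers": 0})
policy = trainer.get_policy()  # the trainer's default policy

env = gym.make("CartPole-v0")
obs = env.reset()
# Returns (action, rnn_state_outs, extra_info) for a single observation.
action, state_outs, info = policy.compute_single_action(obs, explore=False)
obs, reward, done, _ = env.step(action)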
-
compute_actions_from_input_dict
(input_dict: Dict[str, Any], explore: bool = None, timestep: Optional[int] = None, episodes: Optional[List[MultiAgentEpisode]] = None, **kwargs) → Tuple[Any, List[Any], Dict[str, Any]][source]¶ Computes actions from collected samples (across multiple-agents).
Note: This is an experimental API method.
Only used so far by the Sampler iff _use_trajectory_view_api=True (also only supported for torch). Uses the currently “forward-pass-registered” samples from the collector to construct the input_dict for the Model.
- Parameters
input_dict (Dict[str, TensorType]) – An input dict mapping str keys to Tensors. input_dict already abides to the Policy’s as well as the Model’s view requirements and can be passed to the Model as-is.
explore (bool) – Whether to pick an exploitation or exploration action (default: None -> use self.config[“explore”]).
timestep (Optional[int]) – The current (sampling) time step.
kwargs – forward compatibility placeholder
- Returns
- actions (TensorType): Batch of output actions, with shape like [BATCH_SIZE, ACTION_SHAPE].
- state_outs (List[TensorType]): List of RNN state output batches, if any, with shape like [STATE_SIZE, BATCH_SIZE].
- info (dict): Dictionary of extra feature batches, if any, with shape like {"f1": [BATCH_SIZE, ...], "f2": [BATCH_SIZE, ...]}.
- Return type
Tuple
-
compute_log_likelihoods
(actions: Union[List[Any], Any], obs_batch: Union[List[Any], Any], state_batches: Optional[List[Any]] = None, prev_action_batch: Union[List[Any], Any, None] = None, prev_reward_batch: Union[List[Any], Any, None] = None) → Any[source]¶ Computes the log-prob/likelihood for a given action and observation.
- Parameters
actions (Union[List[TensorType], TensorType]) – Batch of actions, for which to retrieve the log-probs/likelihoods (given all other inputs: obs, states, ..).
obs_batch (Union[List[TensorType], TensorType]) – Batch of observations.
state_batches (Optional[List[TensorType]]) – List of RNN state input batches, if any.
prev_action_batch (Optional[Union[List[TensorType], TensorType]]) – Batch of previous action values.
prev_reward_batch (Optional[Union[List[TensorType], TensorType]]) – Batch of previous rewards.
- Returns
Batch of log probs/likelihoods, with shape: [BATCH_SIZE].
- Return type
TensorType
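Examples
A short sketch of querying action log-likelihoods (the PPOTrainer setup and the hand-written CartPole observations are illustrative assumptions):

import numpy as np
import ray
from ray.rllib.agents.ppo import PPOTrainer

ray.init(ignore_reinit_error=True)
trainer = PPOTrainer(env="CartPole-v0",
                     config={"framework": "torch", "num_workers": 0})
policy = trainer.get_policy()

obs_batch = np.array([[0.0, 0.1, 0.0, -0.1],
                      [0.0, -0.1, 0.0, 0.1]])  # two CartPole observations
actions = np.array([0, 1])
# Log-likelihoods of the given actions under the current policy, shape [2].
logps = policy.compute_log_likelihoods(actions=actions, obs_batch=obs_batch)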
-
postprocess_trajectory
(sample_batch: ray.rllib.policy.sample_batch.SampleBatch, other_agent_batches: Optional[Dict[Any, Tuple[Policy, ray.rllib.policy.sample_batch.SampleBatch]]] = None, episode: Optional[MultiAgentEpisode] = None) → ray.rllib.policy.sample_batch.SampleBatch[source]¶ Implements algorithm-specific trajectory postprocessing.
This will be called on each trajectory fragment computed during policy evaluation. Each fragment is guaranteed to be only from one episode.
- Parameters
sample_batch (SampleBatch) – batch of experiences for the policy, which will contain at most one episode trajectory.
other_agent_batches (dict) – In a multi-agent env, this contains a mapping of agent ids to (policy, agent_batch) tuples containing the policy and experiences of the other agents.
episode (Optional[MultiAgentEpisode]) – An optional multi-agent episode object to provide access to all of the internal episode state, which may be useful for model-based or multi-agent algorithms.
- Returns
Postprocessed sample batch.
- Return type
SampleBatch
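Examples
A sketch of an override that appends a discounted-return column to each fragment (the gamma value and the "returns" key are illustrative, not the built-in postprocessing of any RLlib algorithm; the other abstract methods of Policy are omitted here):

import numpy as np
from ray.rllib.policy import Policy
from ray.rllib.policy.sample_batch import SampleBatch


class MyPolicy(Policy):
    # compute_actions(), get/set_weights(), etc. omitted for brevity.

    def postprocess_trajectory(self,
                               sample_batch,
                               other_agent_batches=None,
                               episode=None):
        gamma = self.config.get("gamma", 0.99)
        rewards = sample_batch[SampleBatch.REWARDS]
        returns = np.zeros_like(rewards, dtype=np.float32)
        running = 0.0
        for i in reversed(range(len(rewards))):
            running = rewards[i] + gamma * running
            returns[i] = running
        # Extra columns added here become part of the training batch.
        sample_batch["returns"] = returns
        return sample_batch
-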
learn_on_batch
(samples: ray.rllib.policy.sample_batch.SampleBatch) → Dict[str, Any][source]¶ Fused compute gradients and apply gradients call.
Either this or the combination of compute/apply grads must be implemented by subclasses.
- Parameters
samples (SampleBatch) – The SampleBatch object to learn from.
- Returns
Dictionary of extra metadata from compute_gradients().
- Return type
Dict[str, TensorType]
Examples
>>> sample_batch = ev.sample()
>>> ev.learn_on_batch(sample_batch)
-
compute_gradients
(postprocessed_batch: ray.rllib.policy.sample_batch.SampleBatch) → Tuple[Union[List[Tuple[Any, Any]], List[Any]], Dict[str, Any]][source]¶ Computes gradients against a batch of experiences.
Either this or learn_on_batch() must be implemented by subclasses.
- Parameters
postprocessed_batch (SampleBatch) – The SampleBatch object to use for calculating gradients.
- Returns
List of gradient output values.
Extra policy-specific info values.
- Return type
Tuple[ModelGradients, Dict[str, TensorType]]
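Examples
A sketch of the split gradient path that learn_on_batch() fuses into one call (the PPOTrainer setup is an illustrative assumption; in normal operation RLlib drives these calls for you):

import ray
from ray.rllib.agents.ppo import PPOTrainer

ray.init(ignore_reinit_error=True)
trainer = PPOTrainer(env="CartPole-v0",
                     config={"framework": "torch", "num_workers": 0})
policy = trainer.get_policy()
worker = trainer.workers.local_worker()

# Sample (already postprocessed) experiences, compute grads, then apply them.
batch = worker.sample()
grads, grad_info = policy.compute_gradients(batch)
policy.apply_gradients(grads)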
-
apply_gradients
(gradients: Union[List[Tuple[Any, Any]], List[Any]]) → None[source]¶ Applies previously computed gradients.
Either this or learn_on_batch() must be implemented by subclasses.
- Parameters
gradients (ModelGradients) – The already calculated gradients to apply to this Policy.
-
get_weights
() → dict[source]¶ Returns model weights.
- Returns
Serializable copy or view of model weights.
- Return type
ModelWeights
-
set_weights
(weights: dict) → None[source]¶ Sets model weights.
- Parameters
weights (ModelWeights) – Serializable copy or view of model weights.
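Examples
A sketch of syncing weights from one policy to another via get_weights()/set_weights() (the two PPOTrainer instances are illustrative assumptions):

import ray
from ray.rllib.agents.ppo import PPOTrainer

ray.init(ignore_reinit_error=True)
config = {"framework": "torch", "num_workers": 0}
trainer_a = PPOTrainer(env="CartPole-v0", config=config)
trainer_b = PPOTrainer(env="CartPole-v0", config=config)

# Copy the weights of trainer_a's default policy into trainer_b's policy.
weights = trainer_a.get_policy().get_weights()
trainer_b.get_policy().set_weights(weights)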
-
get_exploration_info
() → Dict[str, Any][source]¶ Returns the current exploration information of this policy.
This information depends on the policy’s Exploration object.
- Returns
Serializable information on the self.exploration object.
- Return type
Dict[str, TensorType]
-
is_recurrent
() → bool[source]¶ Whether this Policy holds a recurrent Model.
- Returns
True if this Policy has-a RNN-based Model.
- Return type
bool
-
num_state_tensors
() → int[source]¶ The number of internal states needed by the RNN-Model of the Policy.
- Returns
The number of RNN internal states kept by this Policy’s Model.
- Return type
int
-
get_initial_state
() → List[Any][source]¶ Returns initial RNN state for the current policy.
- Returns
Initial RNN state for the current policy.
- Return type
List[TensorType]
-
get_state
() → Union[Dict[str, Any], List[Any]][source]¶ Saves all local state.
- Returns
Serialized local state.
- Return type
Union[Dict[str, TensorType], List[TensorType]]
-
set_state
(state: object) → None[source]¶ Restores all local state.
- Parameters
state (obj) – Serialized local state.
-
on_global_var_update
(global_vars: Dict[str, Any]) → None[source]¶ Called on an update to global vars.
- Parameters
global_vars (Dict[str, TensorType]) – Global variables by str key, broadcast from the driver.
-
export_model
(export_dir: str) → None[source]¶ Export Policy to local directory for serving.
- Parameters
export_dir (str) – Local writable directory.
-
-
class
ray.rllib.policy.
TorchPolicy
(observation_space: gym.spaces.Space, action_space: gym.spaces.Space, config: dict, *, model: ray.rllib.models.modelv2.ModelV2, loss: Callable[[ray.rllib.policy.policy.Policy, ray.rllib.models.modelv2.ModelV2, Type[ray.rllib.models.torch.torch_action_dist.TorchDistributionWrapper], ray.rllib.policy.sample_batch.SampleBatch], Union[Any, List[Any]]], action_distribution_class: Type[ray.rllib.models.torch.torch_action_dist.TorchDistributionWrapper], action_sampler_fn: Optional[Callable[[Any, List[Any]], Tuple[Any, Any]]] = None, action_distribution_fn: Optional[Callable[[ray.rllib.policy.policy.Policy, ray.rllib.models.modelv2.ModelV2, Any, Any, Any], Tuple[Any, Type[ray.rllib.models.torch.torch_action_dist.TorchDistributionWrapper], List[Any]]]] = None, max_seq_len: int = 20, get_batch_divisibility_req: Optional[Callable[[ray.rllib.policy.policy.Policy], int]] = None)[source]¶ Template for a PyTorch policy and loss to use with RLlib.
-
observation_space
¶ observation space of the policy.
- Type
gym.Space
-
action_space
¶ action space of the policy.
- Type
gym.Space
-
config
¶ config of the policy.
- Type
dict
-
model
¶ Torch model instance.
- Type
TorchModel
-
dist_class
¶ Torch action distribution class.
- Type
type
-
compute_actions
(obs_batch: Union[List[Any], Any], state_batches: Optional[List[Any]] = None, prev_action_batch: Union[List[Any], Any] = None, prev_reward_batch: Union[List[Any], Any] = None, info_batch: Optional[Dict[str, list]] = None, episodes: Optional[List[MultiAgentEpisode]] = None, explore: Optional[bool] = None, timestep: Optional[int] = None, **kwargs) → Tuple[Any, List[Any], Dict[str, Any]][source]¶ Computes actions for the current policy.
- Parameters
obs_batch (Union[List[TensorType], TensorType]) – Batch of observations.
state_batches (Optional[List[TensorType]]) – List of RNN state input batches, if any.
prev_action_batch (Union[List[TensorType], TensorType]) – Batch of previous action values.
prev_reward_batch (Union[List[TensorType], TensorType]) – Batch of previous rewards.
info_batch (Optional[Dict[str, list]]) – Batch of info objects.
episodes (Optional[List[MultiAgentEpisode]]) – List of MultiAgentEpisode, one for each obs in obs_batch. This provides access to all of the internal episode state, which may be useful for model-based or multiagent algorithms.
explore (Optional[bool]) – Whether to pick an exploitation or exploration action. Set to None (default) for using the value of self.config[“explore”].
timestep (Optional[int]) – The current (sampling) time step.
- Keyword Arguments
kwargs – forward compatibility placeholder
- Returns
- actions (TensorType): Batch of output actions, with shape like [BATCH_SIZE, ACTION_SHAPE].
- state_outs (List[TensorType]): List of RNN state output batches, if any, with shape like [STATE_SIZE, BATCH_SIZE].
- info (List[dict]): Dictionary of extra feature batches, if any, with shape like {"f1": [BATCH_SIZE, ...], "f2": [BATCH_SIZE, ...]}.
- Return type
Tuple
-
compute_actions_from_input_dict
(input_dict: Dict[str, Any], explore: bool = None, timestep: Optional[int] = None, **kwargs) → Tuple[Any, List[Any], Dict[str, Any]][source]¶ Computes actions from collected samples (across multiple-agents).
Note: This is an experimental API method.
Only used so far by the Sampler iff _use_trajectory_view_api=True (also only supported for torch). Uses the currently “forward-pass-registered” samples from the collector to construct the input_dict for the Model.
- Parameters
input_dict (Dict[str, TensorType]) – An input dict mapping str keys to Tensors. input_dict already abides to the Policy’s as well as the Model’s view requirements and can be passed to the Model as-is.
explore (bool) – Whether to pick an exploitation or exploration action (default: None -> use self.config[“explore”]).
timestep (Optional[int]) – The current (sampling) time step.
kwargs – forward compatibility placeholder
- Returns
- actions (TensorType): Batch of output actions, with shape like [BATCH_SIZE, ACTION_SHAPE].
- state_outs (List[TensorType]): List of RNN state output batches, if any, with shape like [STATE_SIZE, BATCH_SIZE].
- info (dict): Dictionary of extra feature batches, if any, with shape like {"f1": [BATCH_SIZE, ...], "f2": [BATCH_SIZE, ...]}.
- Return type
Tuple
-
apply_gradients
(gradients: Union[List[Tuple[Any, Any]], List[Any]]) → None[source]¶ Applies previously computed gradients.
Either this or learn_on_batch() must be implemented by subclasses.
- Parameters
gradients (ModelGradients) – The already calculated gradients to apply to this Policy.
-
get_weights
() → dict[source]¶ Returns model weights.
- Returns
Serializable copy or view of model weights.
- Return type
ModelWeights
-
set_weights
(weights: dict) → None[source]¶ Sets model weights.
- Parameters
weights (ModelWeights) – Serializable copy or view of model weights.
-
is_recurrent
() → bool[source]¶ Whether this Policy holds a recurrent Model.
- Returns
True if this Policy has-a RNN-based Model.
- Return type
bool
-
num_state_tensors
() → int[source]¶ The number of internal states needed by the RNN-Model of the Policy.
- Returns
The number of RNN internal states kept by this Policy’s Model.
- Return type
int
-
get_initial_state
() → List[Any][source]¶ Returns initial RNN state for the current policy.
- Returns
Initial RNN state for the current policy.
- Return type
List[TensorType]
-
get_state
() → Union[Dict[str, Any], List[Any]][source]¶ Saves all local state.
- Returns
Serialized local state.
- Return type
Union[Dict[str, TensorType], List[TensorType]]
-
set_state
(state: object) → None[source]¶ Restores all local state.
- Parameters
state (obj) – Serialized local state.
-
extra_grad_process
(optimizer: torch.optim.Optimizer, loss: Any)[source]¶ Called after each optimizer.zero_grad() + loss.backward() call.
Called for each self._optimizers/loss-value pair. Allows for gradient processing before optimizer.step() is called. E.g. for gradient clipping.
- Parameters
optimizer (torch.optim.Optimizer) – A torch optimizer object.
loss (TensorType) – The loss tensor associated with the optimizer.
- Returns
A dict with information on the gradient processing step.
- Return type
Dict[str, TensorType]
-
extra_compute_grad_fetches
() → Dict[str, any][source]¶ Extra values to fetch and return from compute_gradients().
- Returns
Extra fetch dict to be added to the fetch dict of the compute_gradients call.
- Return type
Dict[str, any]
-
extra_action_out
(input_dict: Dict[str, Any], state_batches: List[Any], model: ray.rllib.models.torch.torch_modelv2.TorchModelV2, action_dist: ray.rllib.models.torch.torch_action_dist.TorchDistributionWrapper) → Dict[str, Any][source]¶ Returns dict of extra info to include in experience batch.
- Parameters
input_dict (Dict[str, TensorType]) – Dict of model input tensors.
state_batches (List[TensorType]) – List of state tensors.
model (TorchModelV2) – Reference to the model object.
action_dist (TorchDistributionWrapper) – Torch action dist object to get log-probs (e.g. for already sampled actions).
- Returns
Extra outputs to return in a compute_actions() call (3rd return value).
- Return type
Dict[str, TensorType]
-
extra_grad_info
(train_batch: ray.rllib.policy.sample_batch.SampleBatch) → Dict[str, Any][source]¶ Return dict of extra grad info.
- Parameters
train_batch (SampleBatch) – The training batch for which to produce extra grad info.
- Returns
The info dict carrying grad info per str key.
- Return type
Dict[str, TensorType]
-
optimizer
() → Union[List[torch.optim.Optimizer], torch.optim.Optimizer][source]¶ Customizes the local PyTorch optimizer(s) to use.
- Returns
The local PyTorch optimizer(s) to use for this Policy.
- Return type
Union[List[torch.optim.Optimizer], torch.optim.Optimizer]
-
-
class
ray.rllib.policy.
TFPolicy
(observation_space: gym.spaces.Space, action_space: gym.spaces.Space, config: dict, sess: tf1.Session, obs_input: Any, sampled_action: Any, loss: Any, loss_inputs: List[Tuple[str, Any]], model: ray.rllib.models.modelv2.ModelV2 = None, sampled_action_logp: Optional[Any] = None, action_input: Optional[Any] = None, log_likelihood: Optional[Any] = None, dist_inputs: Optional[Any] = None, dist_class: Optional[type] = None, state_inputs: Optional[List[Any]] = None, state_outputs: Optional[List[Any]] = None, prev_action_input: Optional[Any] = None, prev_reward_input: Optional[Any] = None, seq_lens: Optional[Any] = None, max_seq_len: int = 20, batch_divisibility_req: int = 1, update_ops: List[Any] = None, explore: Optional[Any] = None, timestep: Optional[Any] = None)[source]¶ An agent policy and loss implemented in TensorFlow.
Do not sub-class this class directly (neither should you sub-class DynamicTFPolicy), but rather use rllib.policy.tf_policy_template.build_tf_policy to generate your custom tf (graph-mode or eager) Policy classes.
Extending this class enables RLlib to perform TensorFlow specific optimizations on the policy, e.g., parallelization across gpus or fusing multiple graphs together in the multi-agent setting.
Input tensors are typically shaped like [BATCH_SIZE, …].
-
observation_space
¶ observation space of the policy.
- Type
gym.Space
-
action_space
¶ action space of the policy.
- Type
gym.Space
-
model
¶ RLlib model used for the policy.
- Type
rllib.models.Model
Examples
>>> policy = TFPolicySubclass(
...     sess, obs_input, sampled_action, loss, loss_inputs)
>>> print(policy.compute_actions([1, 0, 2]))
(array([0, 1, 1]), [], {})
>>> print(policy.postprocess_trajectory(SampleBatch({...})))
SampleBatch({"action": ..., "advantages": ..., ...})
-
get_placeholder
(name) → tf1.placeholder[source]¶ Returns the given action or loss input placeholder by name.
If the loss has not been initialized and a loss input placeholder is requested, an error is raised.
- Parameters
name (str) – The name of the placeholder to return. One of SampleBatch.CUR_OBS|PREV_ACTION/REWARD or a valid key from self._loss_input_dict.
- Returns
The placeholder under the given str key.
- Return type
tf1.placeholder
-
get_session
() → tf1.Session[source]¶ Returns a reference to the TF session for this policy.
-
compute_actions
(obs_batch: Union[List[Any], Any], state_batches: Optional[List[Any]] = None, prev_action_batch: Union[List[Any], Any] = None, prev_reward_batch: Union[List[Any], Any] = None, info_batch: Optional[Dict[str, list]] = None, episodes: Optional[List[MultiAgentEpisode]] = None, explore: Optional[bool] = None, timestep: Optional[int] = None, **kwargs)[source]¶ Computes actions for the current policy.
- Parameters
obs_batch (Union[List[TensorType], TensorType]) – Batch of observations.
state_batches (Optional[List[TensorType]]) – List of RNN state input batches, if any.
prev_action_batch (Union[List[TensorType], TensorType]) – Batch of previous action values.
prev_reward_batch (Union[List[TensorType], TensorType]) – Batch of previous rewards.
info_batch (Optional[Dict[str, list]]) – Batch of info objects.
episodes (Optional[List[MultiAgentEpisode]]) – List of MultiAgentEpisode, one for each obs in obs_batch. This provides access to all of the internal episode state, which may be useful for model-based or multiagent algorithms.
explore (Optional[bool]) – Whether to pick an exploitation or exploration action. Set to None (default) for using the value of self.config[“explore”].
timestep (Optional[int]) – The current (sampling) time step.
- Keyword Arguments
kwargs – forward compatibility placeholder
- Returns
- actions (TensorType): Batch of output actions, with shape like [BATCH_SIZE, ACTION_SHAPE].
- state_outs (List[TensorType]): List of RNN state output batches, if any, with shape like [STATE_SIZE, BATCH_SIZE].
- info (List[dict]): Dictionary of extra feature batches, if any, with shape like {"f1": [BATCH_SIZE, ...], "f2": [BATCH_SIZE, ...]}.
- Return type
Tuple
-
compute_actions_from_input_dict
(input_dict: Dict[str, Any], explore: bool = None, timestep: Optional[int] = None, episodes: Optional[List[MultiAgentEpisode]] = None, **kwargs) → Tuple[Any, List[Any], Dict[str, Any]][source]¶ Computes actions from collected samples (across multiple-agents).
Note: This is an experimental API method.
Only used so far by the Sampler iff _use_trajectory_view_api=True (also only supported for torch). Uses the currently “forward-pass-registered” samples from the collector to construct the input_dict for the Model.
- Parameters
input_dict (Dict[str, TensorType]) – An input dict mapping str keys to Tensors. input_dict already abides to the Policy’s as well as the Model’s view requirements and can be passed to the Model as-is.
explore (bool) – Whether to pick an exploitation or exploration action (default: None -> use self.config[“explore”]).
timestep (Optional[int]) – The current (sampling) time step.
kwargs – forward compatibility placeholder
- Returns
- actions (TensorType): Batch of output actions, with shape like [BATCH_SIZE, ACTION_SHAPE].
- state_outs (List[TensorType]): List of RNN state output batches, if any, with shape like [STATE_SIZE, BATCH_SIZE].
- info (dict): Dictionary of extra feature batches, if any, with shape like {"f1": [BATCH_SIZE, ...], "f2": [BATCH_SIZE, ...]}.
- Return type
Tuple
-
compute_log_likelihoods
(actions: Union[List[Any], Any], obs_batch: Union[List[Any], Any], state_batches: Optional[List[Any]] = None, prev_action_batch: Union[List[Any], Any, None] = None, prev_reward_batch: Union[List[Any], Any, None] = None) → Any[source]¶ Computes the log-prob/likelihood for a given action and observation.
- Parameters
actions (Union[List[TensorType], TensorType]) – Batch of actions, for which to retrieve the log-probs/likelihoods (given all other inputs: obs, states, ..).
obs_batch (Union[List[TensorType], TensorType]) – Batch of observations.
state_batches (Optional[List[TensorType]]) – List of RNN state input batches, if any.
prev_action_batch (Optional[Union[List[TensorType], TensorType]]) – Batch of previous action values.
prev_reward_batch (Optional[Union[List[TensorType], TensorType]]) – Batch of previous rewards.
- Returns
Batch of log probs/likelihoods, with shape: [BATCH_SIZE].
- Return type
TensorType
-
learn_on_batch
(postprocessed_batch: ray.rllib.policy.sample_batch.SampleBatch) → Dict[str, Any][source]¶ Fused compute gradients and apply gradients call.
Either this or the combination of compute/apply grads must be implemented by subclasses.
- Parameters
postprocessed_batch (SampleBatch) – The SampleBatch object to learn from.
- Returns
Dictionary of extra metadata from compute_gradients().
- Return type
Dict[str, TensorType]
Examples
>>> sample_batch = ev.sample()
>>> ev.learn_on_batch(sample_batch)
-
compute_gradients
(postprocessed_batch: ray.rllib.policy.sample_batch.SampleBatch) → Tuple[Union[List[Tuple[Any, Any]], List[Any]], Dict[str, Any]][source]¶ Computes gradients against a batch of experiences.
Either this or learn_on_batch() must be implemented by subclasses.
- Parameters
postprocessed_batch (SampleBatch) – The SampleBatch object to use for calculating gradients.
- Returns
List of gradient output values.
Extra policy-specific info values.
- Return type
Tuple[ModelGradients, Dict[str, TensorType]]
-
apply_gradients
(gradients: Union[List[Tuple[Any, Any]], List[Any]]) → None[source]¶ Applies previously computed gradients.
Either this or learn_on_batch() must be implemented by subclasses.
- Parameters
gradients (ModelGradients) – The already calculated gradients to apply to this Policy.
-
get_exploration_info
() → Dict[str, Any][source]¶ Returns the current exploration information of this policy.
This information depends on the policy’s Exploration object.
- Returns
Serializable information on the self.exploration object.
- Return type
Dict[str, TensorType]
-
get_weights
() → Union[Dict[str, Any], List[Any]][source]¶ Returns model weights.
- Returns
Serializable copy or view of model weights.
- Return type
ModelWeights
-
set_weights
(weights) → None[source]¶ Sets model weights.
- Parameters
weights (ModelWeights) – Serializable copy or view of model weights.
-
get_state
() → Union[Dict[str, Any], List[Any]][source]¶ Saves all local state.
- Returns
Serialized local state.
- Return type
Union[Dict[str, TensorType], List[TensorType]]
-
set_state
(state) → None[source]¶ Restores all local state.
- Parameters
state (obj) – Serialized local state.
-
export_checkpoint
(export_dir: str, filename_prefix: str = 'model') → None[source]¶ Export tensorflow checkpoint to export_dir.
-
copy
(existing_inputs: List[Tuple[str, tf1.placeholder]]) → ray.rllib.policy.tf_policy.TFPolicy[source]¶ Creates a copy of self using existing input placeholders.
Optional: Only required to work with the multi-GPU optimizer.
- Parameters
existing_inputs (List[Tuple[str, tf1.placeholder]]) – List of tuples mapping names (str) to tf1.placeholders to re-use (share) with the returned copy of self.
- Returns
A copy of self.
- Return type
TFPolicy
-
is_recurrent
() → bool[source]¶ Whether this Policy holds a recurrent Model.
- Returns
True if this Policy has-a RNN-based Model.
- Return type
bool
-
num_state_tensors
() → int[source]¶ The number of internal states needed by the RNN-Model of the Policy.
- Returns
The number of RNN internal states kept by this Policy’s Model.
- Return type
int
-
extra_compute_action_feed_dict
() → Dict[Any, Any][source]¶ Extra dict to pass to the compute actions session run.
- Returns
A feed dict to be added to the feed_dict passed to the compute_actions session.run() call.
- Return type
Dict[TensorType, TensorType]
-
extra_compute_action_fetches
() → Dict[str, Any][source]¶ Extra values to fetch and return from compute_actions().
By default we return action probability/log-likelihood info and action distribution inputs (if present).
- Returns
An extra fetch-dict to be passed to and returned from the compute_actions() call.
- Return type
Dict[str, TensorType]
-
extra_compute_grad_feed_dict
() → Dict[Any, Any][source]¶ Extra dict to pass to the compute gradients session run.
- Returns
Extra feed_dict to be passed to the compute_gradients Session.run() call.
- Return type
Dict[TensorType, TensorType]
-
extra_compute_grad_fetches
() → Dict[str, any][source]¶ Extra values to fetch and return from compute_gradients().
- Returns
Extra fetch dict to be added to the fetch dict of the compute_gradients Session.run() call.
- Return type
Dict[str, any]
-
optimizer
() → tf.keras.optimizers.Optimizer[source]¶ TF optimizer to use for policy optimization.
- Returns
The local optimizer to use for this Policy’s Model.
- Return type
tf.keras.optimizers.Optimizer
-
gradients
(optimizer: tf.keras.optimizers.Optimizer, loss: Any) → List[Tuple[Any, Any]][source]¶ Override this for a custom gradient computation behavior.
- Returns
List of tuples with grad values and the grad-value’s corresponding tf.variable in it.
- Return type
List[Tuple[TensorType, TensorType]]
-
build_apply_op
(optimizer: tf.keras.optimizers.Optimizer, grads_and_vars: List[Tuple[Any, Any]]) → tf.Operation[source]¶ Override this for a custom gradient apply computation behavior.
- Parameters
optimizer (tf.keras.optimizers.Optimizer) – The local tf optimizer to use for applying the grads and vars.
grads_and_vars (List[Tuple[TensorType, TensorType]]) – List of tuples with grad values and the grad-value’s corresponding tf.variable in it.
-
-
ray.rllib.policy.
build_policy_class
(name: str, framework: str, *, loss_fn: Optional[Callable[[ray.rllib.policy.policy.Policy, ray.rllib.models.modelv2.ModelV2, Type[ray.rllib.models.torch.torch_action_dist.TorchDistributionWrapper], ray.rllib.policy.sample_batch.SampleBatch], Union[Any, List[Any]]]], get_default_config: Optional[Callable[[], dict]] = None, stats_fn: Optional[Callable[[ray.rllib.policy.policy.Policy, ray.rllib.policy.sample_batch.SampleBatch], Dict[str, Any]]] = None, postprocess_fn: Optional[Callable[[ray.rllib.policy.policy.Policy, ray.rllib.policy.sample_batch.SampleBatch, Optional[Dict[Any, ray.rllib.policy.sample_batch.SampleBatch]], Optional[MultiAgentEpisode]], ray.rllib.policy.sample_batch.SampleBatch]] = None, extra_action_out_fn: Optional[Callable[[ray.rllib.policy.policy.Policy, Dict[str, Any], List[Any], ray.rllib.models.modelv2.ModelV2, ray.rllib.models.torch.torch_action_dist.TorchDistributionWrapper], Dict[str, Any]]] = None, extra_grad_process_fn: Optional[Callable[[ray.rllib.policy.policy.Policy, torch.optim.Optimizer, Any], Dict[str, Any]]] = None, extra_learn_fetches_fn: Optional[Callable[[ray.rllib.policy.policy.Policy], Dict[str, Any]]] = None, optimizer_fn: Optional[Callable[[ray.rllib.policy.policy.Policy, dict], torch.optim.Optimizer]] = None, validate_spaces: Optional[Callable[[ray.rllib.policy.policy.Policy, gym.Space, gym.Space, dict], None]] = None, before_init: Optional[Callable[[ray.rllib.policy.policy.Policy, gym.Space, gym.Space, dict], None]] = None, before_loss_init: Optional[Callable[[ray.rllib.policy.policy.Policy, gym.spaces.Space, gym.spaces.Space, dict], None]] = None, after_init: Optional[Callable[[ray.rllib.policy.policy.Policy, gym.Space, gym.Space, dict], None]] = None, _after_loss_init: Optional[Callable[[ray.rllib.policy.policy.Policy, gym.spaces.Space, gym.spaces.Space, dict], None]] = None, action_sampler_fn: Optional[Callable[[Any, List[Any]], Tuple[Any, Any]]] = None, action_distribution_fn: Optional[Callable[[ray.rllib.policy.policy.Policy, ray.rllib.models.modelv2.ModelV2, Any, Any, Any], Tuple[Any, type, List[Any]]]] = None, make_model: Optional[Callable[[ray.rllib.policy.policy.Policy, gym.spaces.Space, gym.spaces.Space, dict], ray.rllib.models.modelv2.ModelV2]] = None, make_model_and_action_dist: Optional[Callable[[ray.rllib.policy.policy.Policy, gym.spaces.Space, gym.spaces.Space, dict], Tuple[ray.rllib.models.modelv2.ModelV2, Type[ray.rllib.models.torch.torch_action_dist.TorchDistributionWrapper]]]] = None, apply_gradients_fn: Optional[Callable[[ray.rllib.policy.policy.Policy, torch.optim.Optimizer], None]] = None, mixins: Optional[List[type]] = None, get_batch_divisibility_req: Optional[Callable[[ray.rllib.policy.policy.Policy], int]] = None) → Type[ray.rllib.policy.torch_policy.TorchPolicy][source]¶ Helper function for creating a new Policy class at runtime.
Supports frameworks JAX and PyTorch.
- Parameters
name (str) – name of the policy (e.g., “PPOTorchPolicy”)
framework (str) – Either “jax” or “torch”.
loss_fn (Optional[Callable[[Policy, ModelV2, Type[TorchDistributionWrapper], SampleBatch], Union[TensorType, List[TensorType]]]]) – Callable that returns a loss tensor.
get_default_config (Optional[Callable[[None], TrainerConfigDict]]) – Optional callable that returns the default config to merge with any overrides. If None, uses only(!) the user-provided PartialTrainerConfigDict as dict for this Policy.
postprocess_fn (Optional[Callable[[Policy, SampleBatch, Optional[Dict[Any, SampleBatch]], Optional[MultiAgentEpisode]], SampleBatch]]) – Optional callable for post-processing experience batches (called after the super’s postprocess_trajectory method).
stats_fn (Optional[Callable[[Policy, SampleBatch], Dict[str, TensorType]]]) – Optional callable that returns a dict of values given the policy and training batch. If None, will use TorchPolicy.extra_grad_info() instead. The stats dict is used for logging (e.g. in TensorBoard).
extra_action_out_fn (Optional[Callable[[Policy, Dict[str, TensorType], List[TensorType], ModelV2, TorchDistributionWrapper], Dict[str, TensorType]]]) – Optional callable that returns a dict of extra values to include in experiences. If None, no extra computations will be performed.
extra_grad_process_fn (Optional[Callable[[Policy, "torch.optim.Optimizer", TensorType], Dict[str, TensorType]]]) – Optional callable that is called after gradients are computed and returns a processing info dict. If None, will call the TorchPolicy.extra_grad_process() method instead. (TODO (sven): dissolve naming mismatch between "learn" and "compute..".)
extra_learn_fetches_fn (Optional[Callable[[Policy], Dict[str, TensorType]]]) – Optional callable that returns a dict of extra tensors from the policy after loss evaluation. If None, will call the TorchPolicy.extra_compute_grad_fetches() method instead.
optimizer_fn (Optional[Callable[[Policy, TrainerConfigDict], "torch.optim.Optimizer"]]) – Optional callable that returns a torch optimizer given the policy and config. If None, will call the TorchPolicy.optimizer() method instead (which returns a torch Adam optimizer).
validate_spaces (Optional[Callable[[Policy, gym.Space, gym.Space, TrainerConfigDict], None]]) – Optional callable that takes the Policy, observation_space, action_space, and config to check for correctness. If None, no spaces checking will be done.
before_init (Optional[Callable[[Policy, gym.Space, gym.Space, TrainerConfigDict], None]]) – Optional callable to run at the beginning of Policy.__init__ that takes the same arguments as the Policy constructor. If None, this step will be skipped.
before_loss_init (Optional[Callable[[Policy, gym.spaces.Space, gym.spaces.Space, TrainerConfigDict], None]]) – Optional callable to run prior to loss init. If None, this step will be skipped.
after_init (Optional[Callable[[Policy, gym.Space, gym.Space, TrainerConfigDict], None]]) – DEPRECATED: Use before_loss_init instead.
_after_loss_init (Optional[Callable[[Policy, gym.spaces.Space, gym.spaces.Space, TrainerConfigDict], None]]) – Optional callable to run after the loss init. If None, this step will be skipped. This will be deprecated at some point and renamed into after_init to match build_tf_policy() behavior.
action_sampler_fn (Optional[Callable[[TensorType, List[TensorType]], Tuple[TensorType, TensorType]]]) – Optional callable returning a sampled action and its log-likelihood given some (obs and state) inputs. If None, will either use action_distribution_fn or compute actions by calling self.model, then sampling from the so parameterized action distribution.
action_distribution_fn (Optional[Callable[[Policy, ModelV2, TensorType, TensorType, TensorType], Tuple[TensorType, Type[TorchDistributionWrapper], List[TensorType]]]]) – A callable that takes the Policy, Model, the observation batch, an explore-flag, a timestep, and an is_training flag and returns a tuple of a) distribution inputs (parameters), b) a dist-class to generate an action distribution object from, and c) internal-state outputs (empty list if not applicable). If None, will either use action_sampler_fn or compute actions by calling self.model, then sampling from the parameterized action distribution.
make_model (Optional[Callable[[Policy, gym.spaces.Space, gym.spaces.Space, TrainerConfigDict], ModelV2]]) – Optional callable that takes the same arguments as Policy.__init__ and returns a model instance. The distribution class will be determined automatically. Note: Only one of make_model or make_model_and_action_dist should be provided. If both are None, a default Model will be created.
make_model_and_action_dist (Optional[Callable[[Policy, gym.spaces.Space, gym.spaces.Space, TrainerConfigDict], Tuple[ModelV2, Type[TorchDistributionWrapper]]]]) – Optional callable that takes the same arguments as Policy.__init__ and returns a tuple of model instance and torch action distribution class. Note: Only one of make_model or make_model_and_action_dist should be provided. If both are None, a default Model will be created.
apply_gradients_fn (Optional[Callable[[Policy, "torch.optim.Optimizer"], None]]) – Optional callable that takes a grads list and applies these to the Model’s parameters. If None, will call the TorchPolicy.apply_gradients() method instead.
mixins (Optional[List[type]]) – Optional list of any class mixins for the returned policy class. These mixins will be applied in order and will have higher precedence than the TorchPolicy class.
get_batch_divisibility_req (Optional[Callable[[Policy], int]]) – Optional callable that returns the divisibility requirement for sample batches. If None, will assume a value of 1.
- Returns
TorchPolicy child class constructed from the specified args.
- Return type
Type[TorchPolicy]
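Examples
A sketch of building a torch policy class at runtime (the REINFORCE-style loss shown here is illustrative, not an RLlib built-in):

from ray.rllib.policy import build_policy_class
from ray.rllib.policy.sample_batch import SampleBatch


def illustrative_pg_loss(policy, model, dist_class, train_batch):
    # -logp(action) * reward, averaged over the batch (illustrative only).
    logits, _ = model.from_batch(train_batch)
    action_dist = dist_class(logits, model)
    logps = action_dist.logp(train_batch[SampleBatch.ACTIONS])
    return -(logps * train_batch[SampleBatch.REWARDS]).mean()


MyTorchPolicy = build_policy_class(
    name="MyTorchPolicy",
    framework="torch",
    loss_fn=illustrative_pg_loss,
)

The resulting class can then be handed to a trainer builder or used directly; that wiring is omitted here.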
-
ray.rllib.policy.
build_tf_policy
(name: str, *, loss_fn: Callable[[ray.rllib.policy.policy.Policy, ray.rllib.models.modelv2.ModelV2, Type[ray.rllib.models.tf.tf_action_dist.TFActionDistribution], ray.rllib.policy.sample_batch.SampleBatch], Union[Any, List[Any]]], get_default_config: Optional[Callable[[None], dict]] = None, postprocess_fn: Optional[Callable[[ray.rllib.policy.policy.Policy, ray.rllib.policy.sample_batch.SampleBatch, Optional[Dict[Any, ray.rllib.policy.sample_batch.SampleBatch]], Optional[MultiAgentEpisode]], ray.rllib.policy.sample_batch.SampleBatch]] = None, stats_fn: Optional[Callable[[ray.rllib.policy.policy.Policy, ray.rllib.policy.sample_batch.SampleBatch], Dict[str, Any]]] = None, optimizer_fn: Optional[Callable[[ray.rllib.policy.policy.Policy, dict], tf.keras.optimizers.Optimizer]] = None, gradients_fn: Optional[Callable[[ray.rllib.policy.policy.Policy, tf.keras.optimizers.Optimizer, Any], Union[List[Tuple[Any, Any]], List[Any]]]] = None, apply_gradients_fn: Optional[Callable[[ray.rllib.policy.policy.Policy, tf.keras.optimizers.Optimizer, Union[List[Tuple[Any, Any]], List[Any]]], tf.Operation]] = None, grad_stats_fn: Optional[Callable[[ray.rllib.policy.policy.Policy, ray.rllib.policy.sample_batch.SampleBatch, Union[List[Tuple[Any, Any]], List[Any]]], Dict[str, Any]]] = None, extra_action_fetches_fn: Optional[Callable[[ray.rllib.policy.policy.Policy], Dict[str, Any]]] = None, extra_learn_fetches_fn: Optional[Callable[[ray.rllib.policy.policy.Policy], Dict[str, Any]]] = None, validate_spaces: Optional[Callable[[ray.rllib.policy.policy.Policy, gym.Space, gym.Space, dict], None]] = None, before_init: Optional[Callable[[ray.rllib.policy.policy.Policy, gym.Space, gym.Space, dict], None]] = None, before_loss_init: Optional[Callable[[ray.rllib.policy.policy.Policy, gym.spaces.Space, gym.spaces.Space, dict], None]] = None, after_init: Optional[Callable[[ray.rllib.policy.policy.Policy, gym.Space, gym.Space, dict], None]] = None, make_model: Optional[Callable[[ray.rllib.policy.policy.Policy, gym.spaces.Space, gym.spaces.Space, dict], ray.rllib.models.modelv2.ModelV2]] = None, action_sampler_fn: Optional[Callable[[Any, List[Any]], Tuple[Any, Any]]] = None, action_distribution_fn: Optional[Callable[[ray.rllib.policy.policy.Policy, ray.rllib.models.modelv2.ModelV2, Any, Any, Any], Tuple[Any, type, List[Any]]]] = None, mixins: Optional[List[type]] = None, get_batch_divisibility_req: Optional[Callable[[ray.rllib.policy.policy.Policy], int]] = None, obs_include_prev_action_reward: bool = True) → Type[ray.rllib.policy.dynamic_tf_policy.DynamicTFPolicy][source]¶ Helper function for creating a dynamic tf policy at runtime.
Functions will be run in this order to initialize the policy:
- Placeholder setup: postprocess_fn
- Loss init: loss_fn, stats_fn
- Optimizer init: optimizer_fn, gradients_fn, apply_gradients_fn, grad_stats_fn
This means that you can, e.g., depend on any policy attributes created in the running of loss_fn in later functions such as stats_fn.
In eager mode, the following functions will be run repeatedly on each eager execution: loss_fn, stats_fn, gradients_fn, apply_gradients_fn, and grad_stats_fn.
This means that these functions should not define any variables internally, otherwise they will fail in eager mode execution. Variables should only be created in make_model (if defined).
- Parameters
name (str) – Name of the policy (e.g., “PPOTFPolicy”).
loss_fn (Callable[[Policy, ModelV2, Type[TFActionDistribution], SampleBatch], Union[TensorType, List[TensorType]]]) – Callable for calculating a loss tensor.
get_default_config (Optional[Callable[[None], TrainerConfigDict]]) – Optional callable that returns the default config to merge with any overrides. If None, uses only(!) the user-provided PartialTrainerConfigDict as dict for this Policy.
postprocess_fn (Optional[Callable[[Policy, SampleBatch, Optional[Dict[AgentID, SampleBatch]], MultiAgentEpisode], None]]) – Optional callable for post-processing experience batches (called after the parent class’ postprocess_trajectory method).
stats_fn (Optional[Callable[[Policy, SampleBatch], Dict[str, TensorType]]]) – Optional callable that returns a dict of TF tensors to fetch given the policy and batch input tensors. If None, will not compute any stats.
optimizer_fn (Optional[Callable[[Policy, TrainerConfigDict], "tf.keras.optimizers.Optimizer"]]) – Optional callable that returns a tf.Optimizer given the policy and config. If None, will call the base class’ optimizer() method instead (which returns a tf1.train.AdamOptimizer).
gradients_fn (Optional[Callable[[Policy, "tf.keras.optimizers.Optimizer", TensorType], ModelGradients]]) – Optional callable that returns a list of gradients. If None, this defaults to optimizer.compute_gradients([loss]).
apply_gradients_fn (Optional[Callable[[Policy, "tf.keras.optimizers.Optimizer", ModelGradients], "tf.Operation"]]) – Optional callable that returns an apply gradients op given policy, tf-optimizer, and grads_and_vars. If None, will call the base class’ build_apply_op() method instead.
grad_stats_fn (Optional[Callable[[Policy, SampleBatch, ModelGradients], Dict[str, TensorType]]]) – Optional callable that returns a dict of TF fetches given the policy, batch input, and gradient tensors. If None, will not collect any gradient stats.
extra_action_fetches_fn (Optional[Callable[[Policy], Dict[str, TensorType]]]) – Optional callable that returns a dict of TF fetches given the policy object. If None, will not perform any extra fetches.
extra_learn_fetches_fn (Optional[Callable[[Policy], Dict[str, TensorType]]]) – Optional callable that returns a dict of extra values to fetch and return when learning on a batch. If None, will call the base class’ extra_compute_grad_fetches() method instead.
validate_spaces (Optional[Callable[[Policy, gym.Space, gym.Space, TrainerConfigDict], None]]) – Optional callable that takes the Policy, observation_space, action_space, and config to check the spaces for correctness. If None, no spaces checking will be done.
before_init (Optional[Callable[[Policy, gym.Space, gym.Space, TrainerConfigDict], None]]) – Optional callable to run at the beginning of policy init that takes the same arguments as the policy constructor. If None, this step will be skipped.
before_loss_init (Optional[Callable[[Policy, gym.spaces.Space, gym.spaces.Space, TrainerConfigDict], None]]) – Optional callable to run prior to loss init. If None, this step will be skipped.
after_init (Optional[Callable[[Policy, gym.Space, gym.Space, TrainerConfigDict], None]]) – Optional callable to run at the end of policy init. If None, this step will be skipped.
make_model (Optional[Callable[[Policy, gym.spaces.Space, gym.spaces.Space, TrainerConfigDict], ModelV2]]) – Optional callable that returns a ModelV2 object. All policy variables should be created in this function. If None, a default ModelV2 object will be created.
action_sampler_fn (Optional[Callable[[TensorType, List[TensorType]], Tuple[TensorType, TensorType]]]) – A callable returning a sampled action and its log-likelihood given observation and state inputs. If None, will either use action_distribution_fn or compute actions by calling self.model, then sampling from the so parameterized action distribution.
action_distribution_fn (Optional[Callable[[Policy, ModelV2, TensorType, TensorType, TensorType], Tuple[TensorType, type, List[TensorType]]]]) – Optional callable returning distribution inputs (parameters), a dist-class to generate an action distribution object from, and internal-state outputs (or an empty list if not applicable). If None, will either use action_sampler_fn or compute actions by calling self.model, then sampling from the so parameterized action distribution.
mixins (Optional[List[type]]) – Optional list of any class mixins for the returned policy class. These mixins will be applied in order and will have higher precedence than the DynamicTFPolicy class.
get_batch_divisibility_req (Optional[Callable[[Policy], int]]) – Optional callable that returns the divisibility requirement for sample batches. If None, will assume a value of 1.
obs_include_prev_action_reward (bool) – Whether to include the previous action and reward in the model input.
- Returns
A child class of DynamicTFPolicy based on the specified args.
- Return type
Type[DynamicTFPolicy]
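Examples
A sketch of building a dynamic TF policy at runtime (the REINFORCE-style loss is illustrative, not an RLlib built-in):

from ray.rllib.policy import build_tf_policy
from ray.rllib.policy.sample_batch import SampleBatch
from ray.rllib.utils.framework import try_import_tf

tf1, tf, tfv = try_import_tf()


def illustrative_pg_loss(policy, model, dist_class, train_batch):
    # -logp(action) * reward, averaged over the batch (illustrative only).
    logits, _ = model.from_batch(train_batch)
    action_dist = dist_class(logits, model)
    logps = action_dist.logp(train_batch[SampleBatch.ACTIONS])
    return -tf.reduce_mean(logps * train_batch[SampleBatch.REWARDS])


MyTFPolicy = build_tf_policy(
    name="MyTFPolicy",
    loss_fn=illustrative_pg_loss,
)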
ray.rllib.env¶
-
class
ray.rllib.env.
BaseEnv
[source]¶ The lowest-level env interface used by RLlib for sampling.
BaseEnv models multiple agents executing asynchronously in multiple environments. A call to poll() returns observations from ready agents keyed by their environment and agent ids, and actions for those agents can be sent back via send_actions().
All other env types can be adapted to BaseEnv. RLlib handles these conversions internally in RolloutWorker, for example:
gym.Env => rllib.VectorEnv => rllib.BaseEnv rllib.MultiAgentEnv => rllib.BaseEnv rllib.ExternalEnv => rllib.BaseEnv
-
action_space
¶ Action space. This must be defined for single-agent envs. Multi-agent envs can set this to None.
- Type
gym.Space
-
observation_space
¶ Observation space. This must be defined for single-agent envs. Multi-agent envs can set this to None.
- Type
gym.Space
Examples
>>> env = MyBaseEnv()
>>> obs, rewards, dones, infos, off_policy_actions = env.poll()
>>> print(obs)
{
    "env_0": {"car_0": [2.4, 1.6], "car_1": [3.4, -3.2]},
    "env_1": {"car_0": [8.0, 4.1]},
    "env_2": {"car_0": [2.3, 3.3], "car_1": [1.4, -0.2], "car_3": [1.2, 0.1]},
}
>>> env.send_actions(actions={"env_0": {"car_0": 0, "car_1": 1}, ...})
>>> obs, rewards, dones, infos, off_policy_actions = env.poll()
>>> print(obs)
{"env_0": {"car_0": [4.1, 1.7], "car_1": [3.2, -4.2]}, ...}
>>> print(dones)
{"env_0": {"__all__": False, "car_0": False, "car_1": True}, ...}
-
static
to_base_env
(env: Any, make_env: Callable[[int], Any] = None, num_envs: int = 1, remote_envs: bool = False, remote_env_batch_wait_ms: int = 0) → ray.rllib.env.base_env.BaseEnv[source]¶ Wraps any env type as needed to expose the async interface.
-
poll
() → Tuple[Dict[Union[int, str], Dict[Any, Any]], Dict[Union[int, str], Dict[Any, Any]], Dict[Union[int, str], Dict[Any, Any]], Dict[Union[int, str], Dict[Any, Any]], Dict[Union[int, str], Dict[Any, Any]]][source]¶ Returns observations from ready agents.
The returns are two-level dicts mapping from env_id to a dict of agent_id to values. The number of agents and envs can vary over time.
- Returns
obs (dict): New observations for each ready agent.
rewards (dict): Reward values for each ready agent. If the episode is just started, the value will be None.
dones (dict): Done values for each ready agent. The special key "__all__" is used to indicate env termination.
infos (dict): Info values for each ready agent.
off_policy_actions (dict): Agents may take off-policy actions. When that happens, there will be an entry in this dict that contains the taken action. There is no need to send_actions() for agents that have already chosen off-policy actions.
-
send_actions
(action_dict: Dict[Union[int, str], Dict[Any, Any]]) → None[source]¶ Called to send actions back to running agents in this env.
Actions should be sent for each ready agent that returned observations in the previous poll() call.
- Parameters
action_dict (dict) – Actions values keyed by env_id and agent_id.
-
try_reset
(env_id: Union[int, str, None] = None) → Optional[Dict[Any, Any]][source]¶ Attempt to reset the sub-env with the given id or all sub-envs.
If the environment does not support synchronous reset, None can be returned here.
- Parameters
env_id (Optional[int]) – The sub-env ID if applicable. If None, reset the entire Env (i.e. all sub-envs).
- Returns
The reset (multi-agent) observation dict, or None if reset is not supported.
- Return type
Optional[MultiAgentDict]
-
-
class
ray.rllib.env.
EnvContext
(env_config: dict, worker_index: int, vector_index: int = 0, remote: bool = False, num_workers: Optional[int] = None)[source]¶ Wraps env configurations to include extra rllib metadata.
These attributes can be used to parameterize environments per process. For example, one might use worker_index to control which data file an environment reads in on initialization.
RLlib auto-sets these attributes when constructing registered envs.
-
worker_index
¶ When there are multiple workers created, this uniquely identifies the worker the env is created in.
- Type
int
-
vector_index
¶ When there are multiple envs per worker, this uniquely identifies the env index within the worker.
- Type
int
-
remote
¶ Whether environment should be remote or not.
- Type
bool
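Examples
A sketch of using the EnvContext metadata inside a registered env constructor (the env class, the "data_dir" key, and the shard naming are illustrative assumptions):

import gym
from gym.spaces import Box, Discrete
from ray.tune.registry import register_env


class ShardedEnv(gym.Env):
    """Illustrative env that picks a data shard per worker/env copy."""

    def __init__(self, env_config):
        # env_config is an EnvContext: it behaves like a dict and also
        # carries worker_index / vector_index / remote metadata.
        base = env_config.get("data_dir", "/tmp/data")  # hypothetical key
        self.shard = "{}/shard_{}_{}".format(
            base, env_config.worker_index, env_config.vector_index)
        self.observation_space = Box(-1.0, 1.0, shape=(4,))
        self.action_space = Discrete(2)

    def reset(self):
        return self.observation_space.sample()

    def step(self, action):
        return self.observation_space.sample(), 0.0, True, {}


register_env("sharded_env", lambda env_config: ShardedEnv(env_config))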
-
-
class
ray.rllib.env.
ExternalEnv
(action_space: gym.Space, observation_space: gym.Space, max_concurrent: int = 100)[source]¶ An environment that interfaces with external agents.
Unlike simulator envs, control is inverted. The environment queries the policy to obtain actions and logs observations and rewards for training. This is in contrast to gym.Env, where the algorithm drives the simulation through env.step() calls.
You can use ExternalEnv as the backend for policy serving (by serving HTTP requests in the run loop), for ingesting offline logs data (by reading offline transitions in the run loop), or other custom use cases not easily expressed through gym.Env.
ExternalEnv supports both on-policy actions (through self.get_action()), and off-policy actions (through self.log_action()).
This env is thread-safe, but individual episodes must be executed serially.
-
action_space
¶ Action space.
- Type
gym.Space
-
observation_space
¶ Observation space.
- Type
gym.Space
Examples
>>> register_env("my_env", lambda config: YourExternalEnv(config)) >>> trainer = DQNTrainer(env="my_env") >>> while True: >>> print(trainer.train())
-
run
()[source]¶ Override this to implement the run loop.
Your loop should continuously:
- Call self.start_episode(episode_id)
- Call self.get_action(episode_id, obs) -or- self.log_action(episode_id, obs, action)
- Call self.log_returns(episode_id, reward)
- Call self.end_episode(episode_id, obs)
- Wait if nothing to do.
Multiple episodes may be started at the same time.
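Examples
A sketch of a run() implementation (the simulator object queried here is a hypothetical external system, not an RLlib API):

from ray.rllib.env import ExternalEnv


class MyServingEnv(ExternalEnv):
    def __init__(self, action_space, observation_space, simulator):
        super().__init__(action_space, observation_space)
        self.simulator = simulator  # hypothetical external system

    def run(self):
        while True:
            eid = self.start_episode()
            obs = self.simulator.reset()
            done = False
            while not done:
                # On-policy query; use self.log_action() instead for
                # off-policy actions chosen by the external system.
                action = self.get_action(eid, obs)
                obs, reward, done, _ = self.simulator.step(action)
                self.log_returns(eid, reward)
            self.end_episode(eid, obs)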
-
start_episode
(episode_id: Optional[str] = None, training_enabled: bool = True) → str[source]¶ Record the start of an episode.
- Parameters
episode_id (Optional[str]) – Unique string id for the episode or None for it to be auto-assigned and returned.
training_enabled (bool) – Whether to use experiences for this episode to improve the policy.
- Returns
Unique string id for the episode.
- Return type
episode_id (str)
-
get_action
(episode_id: str, observation: Any) → Any[source]¶ Record an observation and get the on-policy action.
- Parameters
episode_id (str) – Episode id returned from start_episode().
observation (obj) – Current environment observation.
- Returns
Action from the env action space.
- Return type
action (obj)
-
log_action
(episode_id: str, observation: Any, action: Any) → None[source]¶ Record an observation and (off-policy) action taken.
- Parameters
episode_id (str) – Episode id returned from start_episode().
observation (obj) – Current environment observation.
action (obj) – Action for the observation.
-
log_returns
(episode_id: str, reward: float, info: dict = None) → None[source]¶ Record returns from the environment.
The reward will be attributed to the previous action taken by the episode. Rewards accumulate until the next action. If no reward is logged before the next action, a reward of 0.0 is assumed.
- Parameters
episode_id (str) – Episode id returned from start_episode().
reward (float) – Reward from the environment.
info (dict) – Optional info dict.
-
-
class
ray.rllib.env.
ExternalMultiAgentEnv
(action_space: gym.Space, observation_space: gym.Space, max_concurrent: int = 100)[source]¶ This is the multi-agent version of ExternalEnv.
-
run
()[source]¶ Override this to implement the multi-agent run loop.
Your loop should continuously:
- Call self.start_episode(episode_id)
- Call self.get_action(episode_id, obs_dict) -or- self.log_action(episode_id, obs_dict, action_dict)
- Call self.log_returns(episode_id, reward_dict)
- Call self.end_episode(episode_id, obs_dict)
- Wait if nothing to do.
Multiple episodes may be started at the same time.
-
start_episode
(episode_id: Optional[str] = None, training_enabled: bool = True) → str[source]¶ Record the start of an episode.
- Parameters
episode_id (Optional[str]) – Unique string id for the episode or None for it to be auto-assigned and returned.
training_enabled (bool) – Whether to use experiences for this episode to improve the policy.
- Returns
Unique string id for the episode.
- Return type
episode_id (str)
-
get_action
(episode_id: str, observation_dict: Dict[Any, Any]) → Dict[Any, Any][source]¶ Record an observation and get the on-policy action. observation_dict is expected to contain the observation of all agents acting in this episode step.
- Parameters
episode_id (str) – Episode id returned from start_episode().
observation_dict (dict) – Current environment observation.
- Returns
Action from the env action space.
- Return type
action (dict)
-
log_action
(episode_id: str, observation_dict: Dict[Any, Any], action_dict: Dict[Any, Any]) → None[source]¶ Record an observation and (off-policy) action taken.
- Parameters
episode_id (str) – Episode id returned from start_episode().
observation_dict (dict) – Current environment observation.
action_dict (dict) – Action for the observation.
-
log_returns
(episode_id: str, reward_dict: Dict[Any, Any], info_dict: Dict[Any, Any] = None, multiagent_done_dict: Dict[Any, Any] = None) → None[source]¶ Record returns from the environment.
The reward will be attributed to the previous action taken by the episode. Rewards accumulate until the next action. If no reward is logged before the next action, a reward of 0.0 is assumed.
- Parameters
episode_id (str) – Episode id returned from start_episode().
reward_dict (dict) – Reward from the environment agents.
info_dict (dict) – Optional info dict.
multiagent_done_dict (dict) – Optional done dict for agents.
-
-
class
ray.rllib.env.
MultiAgentEnv
[source]¶ An environment that hosts multiple independent agents.
Agents are identified by (string) agent ids. Note that these “agents” here are not to be confused with RLlib agents.
Examples
>>> env = MyMultiAgentEnv()
>>> obs = env.reset()
>>> print(obs)
{
    "car_0": [2.4, 1.6],
    "car_1": [3.4, -3.2],
    "traffic_light_1": [0, 3, 5, 1],
}
>>> obs, rewards, dones, infos = env.step(
...     action_dict={
...         "car_0": 1, "car_1": 0, "traffic_light_1": 2,
...     })
>>> print(rewards)
{
    "car_0": 3,
    "car_1": -1,
    "traffic_light_1": 0,
}
>>> print(dones)
{
    "car_0": False,    # car_0 is still running
    "car_1": True,     # car_1 is done
    "__all__": False,  # the env is not done
}
>>> print(infos)
{
    "car_0": {},  # info for car_0
    "car_1": {},  # info for car_1
}
-
reset
() → Dict[Any, Any][source]¶ Resets the env and returns observations from ready agents.
- Returns
New observations for each ready agent.
- Return type
obs (dict)
-
step
(action_dict: Dict[Any, Any]) → Tuple[Dict[Any, Any], Dict[Any, Any], Dict[Any, Any], Dict[Any, Any]][source]¶ Returns observations from ready agents.
The returns are dicts mapping from agent_id strings to values. The number of agents in the env can vary over time.
- Returns
obs (dict) – New observations for each ready agent.
rewards (dict) – Reward values for each ready agent. If the episode has just started, the value will be None.
dones (dict) – Done values for each ready agent. The special key “__all__” (required) is used to indicate env termination.
infos (dict) – Optional info values for each agent id.
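To make the return contract of reset() and step() concrete, here is a minimal, purely illustrative subclass (the two-agent dynamics are invented for this sketch):
>>> class TwoAgentEnv(MultiAgentEnv):
...     def reset(self):
...         self.t = 0
...         return {"agent_0": 0.0, "agent_1": 0.0}
...     def step(self, action_dict):
...         self.t += 1
...         obs = {aid: float(self.t) for aid in action_dict}
...         rewards = {aid: 1.0 for aid in action_dict}
...         done = self.t >= 10
...         dones = {aid: done for aid in action_dict}
...         dones["__all__"] = done  # required special key
...         infos = {aid: {} for aid in action_dict}
...         return obs, rewards, dones, infos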
-
with_agent_groups
(groups: Dict[str, List[Any]], obs_space: <Mock name='mock.Space' id='139801234788560'> = None, act_space: <Mock name='mock.Space' id='139801234788560'> = None) → ray.rllib.env.multi_agent_env.MultiAgentEnv[source]¶ Convenience method for grouping together agents in this env.
An agent group is a list of agent ids that are mapped to a single logical agent. All agents of the group must act at the same time in the environment. The grouped agent exposes Tuple action and observation spaces that are the concatenated action and obs spaces of the individual agents.
The rewards of all the agents in a group are summed. The individual agent rewards are available under the “individual_rewards” key of the group info return.
Agent grouping is required to leverage algorithms such as Q-Mix.
This API is experimental.
- Parameters
groups (dict) – Mapping from group id to a list of the agent ids of group members. If an agent id is not present in any group value, it will be left ungrouped.
obs_space (Space) – Optional observation space for the grouped env. Must be a tuple space.
act_space (Space) – Optional action space for the grouped env. Must be a tuple space.
Examples
>>> env = YourMultiAgentEnv(...)
>>> grouped_env = env.with_agent_groups(env, {
...     "group1": ["agent1", "agent2", "agent3"],
...     "group2": ["agent4", "agent5"],
... })
-
-
class
ray.rllib.env.
PolicyClient
(address: str, inference_mode: str = 'local', update_interval: float = 10.0)[source]¶ REST client to interact with a RLlib policy server.
-
start_episode
(episode_id: Optional[str] = None, training_enabled: bool = True) → str[source]¶ Record the start of one or more episode(s).
- Parameters
episode_id (Optional[str]) – Unique string id for the episode or None for it to be auto-assigned.
training_enabled (bool) – Whether to use experiences for this episode to improve the policy.
- Returns
Unique string id for the episode.
- Return type
episode_id (str)
-
get_action
(episode_id: str, observation: Union[Any, Dict[Any, Any]]) → Union[Any, Dict[Any, Any]][source]¶ Record an observation and get the on-policy action.
- Parameters
episode_id (str) – Episode id returned from start_episode().
observation (obj) – Current environment observation.
- Returns
Action from the env action space.
- Return type
action (obj)
-
log_action
(episode_id: str, observation: Union[Any, Dict[Any, Any]], action: Union[Any, Dict[Any, Any]]) → None[source]¶ Record an observation and (off-policy) action taken.
- Parameters
episode_id (str) – Episode id returned from start_episode().
observation (obj) – Current environment observation.
action (obj) – Action for the observation.
-
log_returns
(episode_id: str, reward: int, info: Union[dict, Dict[Any, Any]] = None, multiagent_done_dict: Optional[Dict[Any, Any]] = None) → None[source]¶ Record returns from the environment.
The reward will be attributed to the previous action taken by the episode. Rewards accumulate until the next action. If no reward is logged before the next action, a reward of 0.0 is assumed.
- Parameters
episode_id (str) – Episode id returned from start_episode().
reward (float) – Reward from the environment.
info (dict) – Extra info dict.
multiagent_done_dict (dict) – Multi-agent done information.
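For orientation, a typical client-side episode loop looks roughly as follows; the server address, the local env object, and the use of end_episode() (mirroring ExternalEnv) are assumptions of this sketch:
>>> client = PolicyClient("http://localhost:9900", inference_mode="remote")
>>> eid = client.start_episode(training_enabled=True)
>>> obs = env.reset()  # `env` is some local simulator, not provided by RLlib
>>> done = False
>>> while not done:
...     action = client.get_action(eid, obs)
...     obs, reward, done, info = env.step(action)
...     client.log_returns(eid, reward, info=info)
>>> client.end_episode(eid, obs)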
-
-
class
ray.rllib.env.
PolicyServerInput
(ioctx, address, port)[source]¶ REST policy server that acts as an offline data source.
This launches a multi-threaded server that listens on the specified host and port to serve policy requests and forward experiences to RLlib. For high performance experience collection, it implements InputReader.
For an example, run examples/cartpole_server.py along with examples/cartpole_client.py –inference-mode=local|remote.
Examples
>>> pg = PGTrainer(
...     env="CartPole-v0", config={
...         "input": lambda ioctx:
...             PolicyServerInput(ioctx, addr, port),
...         "num_workers": 0,  # Run just 1 server, in the trainer.
...     })
>>> while True:
>>>     pg.train()
>>> client = PolicyClient("localhost:9900", inference_mode="local")
>>> eps_id = client.start_episode()
>>> action = client.get_action(eps_id, obs)
>>> ...
>>> client.log_returns(eps_id, reward)
>>> ...
>>> client.log_returns(eps_id, reward)
-
next
()[source]¶ Returns the next batch of experiences read.
- Returns
The experience read.
- Return type
Union[SampleBatch, MultiAgentBatch]
-
-
class
ray.rllib.env.
RemoteVectorEnv
(make_env: Callable[[int], Any], num_envs: int, multiagent: bool, remote_env_batch_wait_ms: int)[source]¶ Vector env that executes envs in remote workers.
This provides dynamic batching of inference as observations are returned from the remote simulator actors. Both single and multi-agent child envs are supported, and envs can be stepped synchronously or async.
You shouldn’t need to instantiate this class directly. It’s automatically inserted when you use the remote_worker_envs option for Trainers.
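For reference, a sketch of the Trainer config that causes this wrapper to be inserted (the PPOTrainer and the concrete values are only an example):
>>> trainer = PPOTrainer(env="CartPole-v0", config={
...     "num_workers": 2,
...     "num_envs_per_worker": 4,
...     "remote_worker_envs": True,       # wrap sub-envs as remote actors
...     "remote_env_batch_wait_ms": 10,   # max wait when batching inference
... })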
-
poll
() → Tuple[Dict[Union[int, str], Dict[Any, Any]], Dict[Union[int, str], Dict[Any, Any]], Dict[Union[int, str], Dict[Any, Any]], Dict[Union[int, str], Dict[Any, Any]], Dict[Union[int, str], Dict[Any, Any]]][source]¶ Returns observations from ready agents.
The returns are two-level dicts mapping from env_id to a dict of agent_id to values. The number of agents and envs can vary over time.
- Returns
obs (dict) – New observations for each ready agent.
rewards (dict) – Reward values for each ready agent. If the episode has just started, the value will be None.
dones (dict) – Done values for each ready agent. The special key “__all__” is used to indicate env termination.
infos (dict) – Info values for each ready agent.
off_policy_actions (dict) – Agents may take off-policy actions. When that happens, there will be an entry in this dict that contains the taken action. There is no need to send_actions() for agents that have already chosen off-policy actions.
-
send_actions
(action_dict: Dict[Union[int, str], Dict[Any, Any]]) → None[source]¶ Called to send actions back to running agents in this env.
Actions should be sent for each ready agent that returned observations in the previous poll() call.
- Parameters
action_dict (dict) – Actions values keyed by env_id and agent_id.
-
try_reset
(env_id: Union[int, str, None] = None) → Optional[Dict[Any, Any]][source]¶ Attempt to reset the sub-env with the given id or all sub-envs.
If the environment does not support synchronous reset, None can be returned here.
- Parameters
env_id (Optional[int]) – The sub-env ID if applicable. If None, reset the entire Env (i.e. all sub-envs).
- Returns
- Resetted (multi-agent) observation dict
or None if reset is not supported.
- Return type
Optional[MultiAgentDict]
-
-
class
ray.rllib.env.
VectorEnv
(observation_space: <Mock name='mock.Space' id='139801234788560'>, action_space: <Mock name='mock.Space' id='139801234788560'>, num_envs: int)[source]¶ An environment that supports batch evaluation using clones of sub-envs.
-
vector_reset
() → List[Any][source]¶ Resets all sub-environments.
- Returns
List of observations from each environment.
- Return type
obs (List[any])
-
reset_at
(index: int) → Any[source]¶ Resets a single environment.
- Returns
Observations from the reset sub environment.
- Return type
obs (obj)
-
vector_step
(actions: List[Any]) → Tuple[List[Any], List[float], List[bool], List[dict]][source]¶ Performs a vectorized step on all sub environments using actions.
- Parameters
actions (List[any]) – List of actions (one for each sub-env).
- Returns
obs (List[any]) – New observations for each sub-env.
rewards (List[float]) – Reward values for each sub-env.
dones (List[bool]) – Done values for each sub-env.
infos (List[dict]) – Info values for each sub-env.
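A minimal illustrative subclass over n copies of a gym env (this is a sketch, not the built-in wrapper RLlib uses internally):
>>> import gym
>>> class MyVectorEnv(VectorEnv):
...     def __init__(self, num_envs=4):
...         self.envs = [gym.make("CartPole-v0") for _ in range(num_envs)]
...         super().__init__(
...             observation_space=self.envs[0].observation_space,
...             action_space=self.envs[0].action_space,
...             num_envs=num_envs)
...     def vector_reset(self):
...         return [env.reset() for env in self.envs]
...     def reset_at(self, index):
...         return self.envs[index].reset()
...     def vector_step(self, actions):
...         obs, rewards, dones, infos = [], [], [], []
...         for env, action in zip(self.envs, actions):
...             o, r, d, i = env.step(action)
...             obs.append(o)
...             rewards.append(r)
...             dones.append(d)
...             infos.append(i)
...         return obs, rewards, dones, infos
...     def get_unwrapped(self):
...         return self.envs  # the underlying per-index envs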
-
-
class
ray.rllib.env.
GroupAgentsWrapper
(env, groups, obs_space=None, act_space=None)[source]¶ Wraps a MultiAgentEnv environment with agents grouped as specified.
See multi_agent_env.py for the specification of groups.
This API is experimental.
-
reset
()[source]¶ Resets the env and returns observations from ready agents.
- Returns
New observations for each ready agent.
- Return type
obs (dict)
-
step
(action_dict)[source]¶ Returns observations from ready agents.
The returns are dicts mapping from agent_id strings to values. The number of agents in the env can vary over time.
- Returns
obs (dict) – New observations for each ready agent.
rewards (dict) – Reward values for each ready agent. If the episode has just started, the value will be None.
dones (dict) – Done values for each ready agent. The special key “__all__” (required) is used to indicate env termination.
infos (dict) – Optional info values for each agent id.
-
-
class
ray.rllib.env.
KaggleFootballMultiAgentEnv
(configuration: Optional[Dict[str, Any]] = None)[source]¶ An interface to Kaggle’s football environment.
See: https://github.com/Kaggle/kaggle-environments
-
reset
() → Dict[Any, Any][source]¶ Resets the env and returns observations from ready agents.
- Returns
New observations for each ready agent.
- Return type
obs (dict)
-
step
(action_dict: Dict[Any, int]) → Tuple[Dict[Any, Any], Dict[Any, Any], Dict[Any, Any], Dict[Any, Any]][source]¶ Returns observations from ready agents.
The returns are dicts mapping from agent_id strings to values. The number of agents in the env can vary over time.
- Returns
obs (dict) – New observations for each ready agent.
rewards (dict) – Reward values for each ready agent. If the episode has just started, the value will be None.
dones (dict) – Done values for each ready agent. The special key “__all__” (required) is used to indicate env termination.
infos (dict) – Optional info values for each agent id.
-
build_agent_spaces
() → Tuple[<Mock name='mock.Space' id='139801123920144'>, <Mock name='mock.Space' id='139801123920144'>][source]¶ Construct the action and observation spaces.
Description of actions and observations: https://github.com/google-research/football/blob/master/gfootball/doc/observation.md
-
-
class
ray.rllib.env.
PettingZooEnv
(env)[source]¶ An interface to the PettingZoo MARL environment library.
See: https://github.com/PettingZoo-Team/PettingZoo
Inherits from MultiAgentEnv and exposes a given AEC (actor-environment-cycle) game from the PettingZoo project via the MultiAgentEnv public API.
Note that the wrapper has some important limitations:
All agents have the same action_spaces and observation_spaces. Note: If, within your AEC game, agents do not have homogeneous action / observation spaces, apply SuperSuit wrappers to add the needed padding functionality: https://github.com/PettingZoo-Team/SuperSuit#built-in-multi-agent-only-functions
Environments are positive sum games (-> Agents are expected to cooperate to maximize reward). This isn’t a hard restriction; it’s just that standard algorithms aren’t expected to work well in highly competitive games.
Examples
>>> from pettingzoo.butterfly import prison_v2
>>> env = PettingZooEnv(prison_v2.env())
>>> obs = env.reset()
>>> # Only returns the observation for the agent which should be stepping.
>>> print(obs)
{
    'prisoner_0': array([[[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        ...,
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]]], dtype=uint8)
}
>>> obs, rewards, dones, infos = env.step({
...     "prisoner_0": 1
... })
>>> # Only returns the observation, reward, info, etc., for the agent
>>> # whose turn is next.
>>> print(obs)
{
    'prisoner_1': array([[[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        ...,
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]]], dtype=uint8)
}
>>> print(rewards)
{
    'prisoner_1': 0
}
>>> print(dones)
{
    'prisoner_1': False, '__all__': False
}
>>> print(infos)
{
    'prisoner_1': {'map_tuple': (1, 0)}
}
-
reset
()[source]¶ Resets the env and returns observations from ready agents.
- Returns
New observations for each ready agent.
- Return type
obs (dict)
-
step
(action)[source]¶ Returns observations from ready agents.
The returns are dicts mapping from agent_id strings to values. The number of agents in the env can vary over time.
- Returns
obs (dict) – New observations for each ready agent.
rewards (dict) – Reward values for each ready agent. If the episode has just started, the value will be None.
dones (dict) – Done values for each ready agent. The special key “__all__” (required) is used to indicate env termination.
infos (dict) – Optional info values for each agent id.
-
class
ray.rllib.env.
Unity3DEnv
(file_name: str = None, port: Optional[int] = None, seed: int = 0, no_graphics: bool = False, timeout_wait: int = 300, episode_horizon: int = 1000)[source]¶ A MultiAgentEnv representing a single Unity3D game instance.
For an example on how to use this Env with a running Unity3D editor or with a compiled game, see: rllib/examples/unity3d_env_local.py For an example on how to use it inside a Unity game client, which connects to an RLlib Policy server, see: rllib/examples/serving/unity3d_[client|server].py
Supports all Unity3D (MLAgents) examples, multi- or single-agent and gets converted automatically into an ExternalMultiAgentEnv, when used inside an RLlib PolicyClient for cloud/distributed training of Unity games.
-
step
(action_dict: Dict[Any, Any]) → Tuple[Dict[Any, Any], Dict[Any, Any], Dict[Any, Any], Dict[Any, Any]][source]¶ Performs one multi-agent step through the game.
- Parameters
action_dict (dict) – Multi-agent action dict with: keys=agent identifier consisting of [MLagents behavior name, e.g. “Goalie?team=1”] + “_” + [Agent index, a unique MLAgent-assigned index per single agent]
- Returns
obs – Multi-agent observation dict. Only observations for agents that require new actions are returned.
rewards – Rewards dict matching obs.
dones – Done dict with only an “__all__” multi-agent entry in it; “__all__” is True if the episode is done for all agents.
infos – An (empty) info dict.
- Return type
tuple
-
ray.rllib.evaluation¶
-
class
ray.rllib.evaluation.
MultiAgentEpisode
(policies: Dict[str, ray.rllib.policy.policy.Policy], policy_mapping_fn: Callable[[Any], str], batch_builder_factory: Callable[], MultiAgentSampleBatchBuilder], extra_batch_callback: Callable[[Union[SampleBatch, MultiAgentBatch]], None], env_id: Union[int, str])[source]¶ Tracks the current state of a (possibly multi-agent) episode.
-
new_batch_builder
¶ Create a new MultiAgentSampleBatchBuilder.
- Type
func
-
add_extra_batch
¶ Return a built MultiAgentBatch to the sampler.
- Type
func
-
batch_builder
¶ Batch builder for the current episode.
- Type
obj
-
total_reward
¶ Summed reward across all agents in this episode.
- Type
float
-
length
¶ Length of this episode.
- Type
int
-
episode_id
¶ Unique id identifying this trajectory.
- Type
int
-
agent_rewards
¶ Summed rewards broken down by agent.
- Type
dict
-
custom_metrics
¶ Dict where you can add custom metrics.
- Type
dict
-
user_data
¶ Dict that you can use for temporary storage. E.g. in between two custom callbacks referring to the same episode.
- Type
dict
-
hist_data
¶ Dict mapping str keys to List[float] for storage of per-timestep float data throughout the episode.
- Type
dict
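The custom_metrics, user_data, and hist_data dicts above are typically written from custom callbacks; the following sketch assumes the DefaultCallbacks API of this RLlib version, and the metric names are invented:
>>> from ray.rllib.agents.callbacks import DefaultCallbacks
>>> class MyCallbacks(DefaultCallbacks):
...     def on_episode_start(self, *, worker, base_env, policies, episode, **kwargs):
...         episode.user_data["speeds"] = []  # temporary per-episode storage
...     def on_episode_step(self, *, worker, base_env, episode, **kwargs):
...         episode.user_data["speeds"].append(1.0)  # e.g. some env signal
...     def on_episode_end(self, *, worker, base_env, policies, episode, **kwargs):
...         speeds = episode.user_data["speeds"]
...         episode.custom_metrics["mean_speed"] = sum(speeds) / len(speeds)
...         episode.hist_data["speeds"] = speeds  # per-timestep floats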
- Use case 1: Model-based rollouts in multi-agent:
A custom compute_actions() function in a policy can inspect the current episode state and perform a number of rollouts based on the policies and state of other agents in the environment.
- Use case 2: Returning extra rollouts data.
The model rollouts can be returned back to the sampler by calling:
>>> batch = episode.new_batch_builder()
>>> for each transition:
...     batch.add_values(...)  # see sampler for usage
>>> episode.extra_batches.add(batch.build_and_reset())
-
soft_reset
() → None[source]¶ Clears rewards and metrics, but retains RNN and other state.
This is used to carry state across multiple logical episodes in the same env (i.e., if soft_horizon is set).
-
policy_for
(agent_id: Any = 'agent0') → str[source]¶ Returns and stores the policy ID for the specified agent.
If the agent is new, the policy mapping fn will be called to bind the agent to a policy for the duration of the episode.
- Parameters
agent_id (AgentID) – The agent ID to lookup the policy ID for.
- Returns
The policy ID for the specified agent.
- Return type
PolicyID
-
last_observation_for
(agent_id: Any = 'agent0') → Any[source]¶ Returns the last observation for the specified agent.
-
last_raw_obs_for
(agent_id: Any = 'agent0') → Any[source]¶ Returns the last un-preprocessed obs for the specified agent.
-
last_info_for
(agent_id: Any = 'agent0') → dict[source]¶ Returns the last info for the specified agent.
-
last_action_for
(agent_id: Any = 'agent0') → Any[source]¶ Returns the last action for the specified agent, or zeros.
-
prev_action_for
(agent_id: Any = 'agent0') → Any[source]¶ Returns the previous action for the specified agent.
-
prev_reward_for
(agent_id: Any = 'agent0') → float[source]¶ Returns the previous reward for the specified agent.
-
-
class
ray.rllib.evaluation.
RolloutWorker
(*, env_creator: Callable[[ray.rllib.env.env_context.EnvContext], Any], validate_env: Optional[Callable[[Any, ray.rllib.env.env_context.EnvContext], None]] = None, policy_spec: Union[type, Dict[str, Tuple[Optional[type], <Mock name='mock.Space' id='139801234788560'>, <Mock name='mock.Space' id='139801234788560'>, dict]]] = None, policy_mapping_fn: Optional[Callable[[Any], str]] = None, policies_to_train: Optional[List[str]] = None, tf_session_creator: Optional[Callable[[], tf1.Session]] = None, rollout_fragment_length: int = 100, count_steps_by: str = 'env_steps', batch_mode: str = 'truncate_episodes', episode_horizon: int = None, preprocessor_pref: str = 'deepmind', sample_async: bool = False, compress_observations: bool = False, num_envs: int = 1, observation_fn: ObservationFunction = None, observation_filter: str = 'NoFilter', clip_rewards: bool = None, clip_actions: bool = True, env_config: dict = None, model_config: dict = None, policy_config: dict = None, worker_index: int = 0, num_workers: int = 0, monitor_path: str = None, log_dir: str = None, log_level: str = None, callbacks: Type[DefaultCallbacks] = None, input_creator: Callable[[ray.rllib.offline.io_context.IOContext], ray.rllib.offline.input_reader.InputReader] = <function RolloutWorker.<lambda>>, input_evaluation: List[str] = frozenset({}), output_creator: Callable[[ray.rllib.offline.io_context.IOContext], ray.rllib.offline.output_writer.OutputWriter] = <function RolloutWorker.<lambda>>, remote_worker_envs: bool = False, remote_env_batch_wait_ms: int = 0, soft_horizon: bool = False, no_done_at_end: bool = False, seed: int = None, extra_python_environs: dict = None, fake_sampler: bool = False, spaces: Optional[Dict[str, Tuple[<Mock name='mock.spaces.Space' id='139801235170768'>, <Mock name='mock.spaces.Space' id='139801235170768'>]]] = None, _use_trajectory_view_api: bool = True, policy: Union[type, Dict[str, Tuple[Optional[type], <Mock name='mock.Space' id='139801234788560'>, <Mock name='mock.Space' id='139801234788560'>, dict]]] = None)[source]¶ Common experience collection class.
This class wraps a policy instance and an environment class to collect experiences from the environment. You can create many replicas of this class as Ray actors to scale RL training.
This class supports vectorized and multi-agent policy evaluation (e.g., VectorEnv, MultiAgentEnv, etc.)
Examples
>>> # Create a rollout worker and use it to collect experiences.
>>> worker = RolloutWorker(
...     env_creator=lambda _: gym.make("CartPole-v0"),
...     policy_spec=PGTFPolicy)
>>> print(worker.sample())
SampleBatch({
    "obs": [[...]], "actions": [[...]], "rewards": [[...]],
    "dones": [[...]], "new_obs": [[...]]})
>>> # Creating a multi-agent rollout worker
>>> worker = RolloutWorker(
...     env_creator=lambda _: MultiAgentTrafficGrid(num_cars=25),
...     policy_spec={
...         # Use an ensemble of two policies for car agents
...         "car_policy1":
...             (PGTFPolicy, Box(...), Discrete(...), {"gamma": 0.99}),
...         "car_policy2":
...             (PGTFPolicy, Box(...), Discrete(...), {"gamma": 0.95}),
...         # Use a single shared policy for all traffic lights
...         "traffic_light_policy":
...             (PGTFPolicy, Box(...), Discrete(...), {}),
...     },
...     policy_mapping_fn=lambda agent_id:
...         random.choice(["car_policy1", "car_policy2"])
...         if agent_id.startswith("car_") else "traffic_light_policy")
>>> print(worker.sample())
MultiAgentBatch({
    "car_policy1": SampleBatch(...),
    "car_policy2": SampleBatch(...),
    "traffic_light_policy": SampleBatch(...)})
-
sample
() → Union[SampleBatch, MultiAgentBatch][source]¶ Returns a batch of experience sampled from this worker.
This method must be implemented by subclasses.
- Returns
A columnar batch of experiences (e.g., tensors).
- Return type
SampleBatchType
Examples
>>> print(worker.sample())
SampleBatch({"obs": [1, 2, 3], "action": [0, 1, 0], ...})
-
sample_with_count
() → Tuple[Union[SampleBatch, MultiAgentBatch], int][source]¶ Same as sample() but returns the count as a separate future.
-
get_weights
(policies: List[str] = None) -> (<class 'dict'>, <class 'dict'>)[source]¶ Returns the model weights of this worker.
- Returns
weights – Weights that can be set on another worker.
info – Dictionary of extra metadata.
- Return type
object
Examples
>>> weights = worker.get_weights()
-
set_weights
(weights: dict, global_vars: dict = None) → None[source]¶ Sets the model weights of this worker.
Examples
>>> weights = worker.get_weights()
>>> worker.set_weights(weights)
-
compute_gradients
(samples: Union[SampleBatch, MultiAgentBatch]) → Tuple[Union[List[Tuple[Any, Any]], List[Any]], dict][source]¶ Returns a gradient computed w.r.t the specified samples.
- Returns
A list of gradients that can be applied on a compatible worker. In the multi-agent case, returns a dict of gradients keyed by policy ids. An info dictionary of extra metadata is also returned.
- Return type
(grads, info)
Examples
>>> batch = worker.sample()
>>> grads, info = worker.compute_gradients(batch)
-
apply_gradients
(grads: Union[List[Tuple[Any, Any]], List[Any]]) → Dict[str, Any][source]¶ Applies the given gradients to this worker’s weights.
Examples
>>> samples = worker.sample()
>>> grads, info = worker.compute_gradients(samples)
>>> worker.apply_gradients(grads)
-
learn_on_batch
(samples: Union[SampleBatch, MultiAgentBatch]) → dict[source]¶ Update policies based on the given batch.
This is the equivalent to apply_gradients(compute_gradients(samples)), but can be optimized to avoid pulling gradients into CPU memory.
- Returns
dictionary of extra metadata from compute_gradients().
- Return type
info
Examples
>>> batch = worker.sample()
>>> worker.learn_on_batch(batch)
-
sample_and_learn
(expected_batch_size: int, num_sgd_iter: int, sgd_minibatch_size: str, standardize_fields: List[str]) → Tuple[dict, int][source]¶ Sample a batch and learn on it.
This is typically used in combination with distributed allreduce.
- Parameters
expected_batch_size (int) – Expected number of samples to learn on.
num_sgd_iter (int) – Number of SGD iterations.
sgd_minibatch_size (int) – SGD minibatch size.
standardize_fields (list) – List of sample fields to normalize.
- Returns
info – Dictionary of extra metadata from learn_on_batch().
count – Number of samples learned on.
- Return type
info
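A hedged usage sketch (the PPO-style values are arbitrary):
>>> info, count = worker.sample_and_learn(
...     expected_batch_size=4000, num_sgd_iter=10,
...     sgd_minibatch_size=128, standardize_fields=["advantages"])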
-
get_metrics
() → List[Union[ray.rllib.evaluation.rollout_metrics.RolloutMetrics, ray.rllib.offline.off_policy_estimator.OffPolicyEstimate]][source]¶ Returns a list of new RolloutMetric objects from evaluation.
-
foreach_env
(func: Callable[[ray.rllib.env.base_env.BaseEnv], T]) → List[T][source]¶ Apply the given function to each underlying env instance.
-
get_policy
(policy_id: Optional[str] = 'default_policy') → ray.rllib.policy.policy.Policy[source]¶ Return policy for the specified id, or None.
- Parameters
policy_id (str) – id of policy to return.
-
for_policy
(func: Callable[[ray.rllib.policy.policy.Policy], T], policy_id: Optional[str] = 'default_policy', **kwargs) → T[source]¶ Apply the given function to the specified policy.
-
foreach_policy
(func: Callable[[ray.rllib.policy.policy.Policy, str], T], **kwargs) → List[T][source]¶ Apply the given function to each (policy, policy_id) tuple.
-
foreach_trainable_policy
(func: Callable[[ray.rllib.policy.policy.Policy, str], T], **kwargs) → List[T][source]¶ Applies the given function to each (policy, policy_id) tuple, which can be found in self.policies_to_train.
- Parameters
func (callable) – A function - taking a Policy and its ID - that is called on all Policies within self.policies_to_train.
- Returns
- The list of n return values of all
func([policy], [ID])-calls.
- Return type
List[any]
-
sync_filters
(new_filters: dict) → None[source]¶ Changes self’s filter to given and rebases any accumulated delta.
- Parameters
new_filters (dict) – Filters with new state to update local copy.
-
get_filters
(flush_after: bool = False) → dict[source]¶ Returns a snapshot of filters.
- Parameters
flush_after (bool) – Clears the filter buffer state.
- Returns
Dict for serializable filters
- Return type
return_filters (dict)
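One possible (simplified) manual synchronization pattern; in practice RLlib’s FilterManager performs this bookkeeping automatically, and remote_worker here is assumed to be a Ray actor handle of a RolloutWorker:
>>> filters = ray.get(remote_worker.get_filters.remote(flush_after=True))
>>> local_worker.sync_filters(filters)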
-
apply
(func: Callable[[RolloutWorker], T], *args) → T[source]¶ Apply the given function to this rollout worker instance.
-
-
class
ray.rllib.evaluation.
SampleBatchBuilder
[source]¶ Util to build a SampleBatch incrementally.
For efficiency, SampleBatches hold values in column form (as arrays). However, it is useful to add data one row (dict) at a time.
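An illustrative row-at-a-time build (the column names are arbitrary):
>>> builder = SampleBatchBuilder()
>>> builder.add_values(obs=[0.1, 0.2], actions=0, rewards=1.0, dones=False)
>>> builder.add_values(obs=[0.3, 0.4], actions=1, rewards=0.5, dones=True)
>>> batch = builder.build_and_reset()
>>> print(batch.count)
2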
-
class
ray.rllib.evaluation.
MultiAgentSampleBatchBuilder
(policy_map: Dict[str, ray.rllib.policy.policy.Policy], clip_rewards: bool, callbacks: DefaultCallbacks)[source]¶ Util to build SampleBatches for each policy in a multi-agent env.
Input data is per-agent, while output data is per-policy. There is an M:N mapping between agents and policies. We retain one local batch builder per agent. When an agent is done, its local batch is appended to the corresponding policy batch for the agent’s policy.
-
total
() → int[source]¶ Returns the total number of steps taken in the env (all agents).
- Returns
- The number of steps taken in total in the environment over all
agents.
- Return type
int
-
has_pending_agent_data
() → bool[source]¶ Returns whether there is pending unprocessed data.
- Returns
- True if there is at least one per-agent builder (with data
in it).
- Return type
bool
-
add_values
(agent_id: Any, policy_id: Any, **values: Any) → None[source]¶ Add the given dictionary (row) of values to this batch.
- Parameters
agent_id (obj) – Unique id for the agent we are adding values for.
policy_id (obj) – Unique id for policy controlling the agent.
values (dict) – Row of values to add for this agent.
-
postprocess_batch_so_far
(episode: Optional[ray.rllib.evaluation.episode.MultiAgentEpisode] = None) → None[source]¶ Apply policy postprocessors to any unprocessed rows.
This pushes the postprocessed per-agent batches onto the per-policy builders, clearing per-agent state.
- Parameters
episode (Optional[MultiAgentEpisode]) – The Episode object that holds this MultiAgentBatchBuilder object.
-
build_and_reset
(episode: Optional[ray.rllib.evaluation.episode.MultiAgentEpisode] = None) → ray.rllib.policy.sample_batch.MultiAgentBatch[source]¶ Returns the accumulated sample batches for each policy.
Any unprocessed rows will be first postprocessed with a policy postprocessor. The internal state of this builder will be reset.
- Parameters
episode (Optional[MultiAgentEpisode]) – The Episode object that holds this MultiAgentBatchBuilder object or None.
- Returns
- Returns the accumulated sample batches for each
policy.
- Return type
-
-
class
ray.rllib.evaluation.
SyncSampler
(*, worker: RolloutWorker, env: ray.rllib.env.base_env.BaseEnv, policies: Dict[str, ray.rllib.policy.policy.Policy], policy_mapping_fn: Callable[[Any], str], preprocessors: Dict[str, ray.rllib.models.preprocessors.Preprocessor], obs_filters: Dict[str, ray.rllib.utils.filter.Filter], clip_rewards: bool, rollout_fragment_length: int, count_steps_by: str = 'env_steps', callbacks: DefaultCallbacks, horizon: int = None, multiple_episodes_in_batch: bool = False, tf_sess=None, clip_actions: bool = True, soft_horizon: bool = False, no_done_at_end: bool = False, observation_fn: ObservationFunction = None, _use_trajectory_view_api: bool = False, sample_collector_class: Optional[Type[ray.rllib.evaluation.collectors.sample_collector.SampleCollector]] = None)[source]¶ Sync SamplerInput that collects experiences when get_data() is called.
-
class
ray.rllib.evaluation.
AsyncSampler
(*, worker: RolloutWorker, env: ray.rllib.env.base_env.BaseEnv, policies: Dict[str, ray.rllib.policy.policy.Policy], policy_mapping_fn: Callable[[Any], str], preprocessors: Dict[str, ray.rllib.models.preprocessors.Preprocessor], obs_filters: Dict[str, ray.rllib.utils.filter.Filter], clip_rewards: bool, rollout_fragment_length: int, count_steps_by: str = 'env_steps', callbacks: DefaultCallbacks, horizon: int = None, multiple_episodes_in_batch: bool = False, tf_sess=None, clip_actions: bool = True, blackhole_outputs: bool = False, soft_horizon: bool = False, no_done_at_end: bool = False, observation_fn: ObservationFunction = None, _use_trajectory_view_api: bool = False, sample_collector_class: Optional[Type[ray.rllib.evaluation.collectors.sample_collector.SampleCollector]] = None)[source]¶ Async SamplerInput that collects experiences in thread and queues them.
Once started, experiences are continuously collected and put into a Queue, from where they can be unqueued by the caller of get_data().
-
run
()[source]¶ Method representing the thread’s activity.
You may override this method in a subclass. The standard run() method invokes the callable object passed to the object’s constructor as the target argument, if any, with sequential and keyword arguments taken from the args and kwargs arguments, respectively.
-
-
ray.rllib.evaluation.
compute_advantages
(rollout: ray.rllib.policy.sample_batch.SampleBatch, last_r: float, gamma: float = 0.9, lambda_: float = 1.0, use_gae: bool = True, use_critic: bool = True)[source]¶ Given a rollout, compute its value targets and the advantages.
- Parameters
rollout (SampleBatch) – SampleBatch of a single trajectory.
last_r (float) – Value estimation for last observation.
gamma (float) – Discount factor.
lambda_ (float) – Parameter for GAE.
use_gae (bool) – Using Generalized Advantage Estimation.
use_critic (bool) – Whether to use critic (value estimates). Setting this to False will use 0 as baseline.
- Returns
- Object with experience from rollout and
processed rewards.
- Return type
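An illustrative call on a single trajectory; the “rewards” and “vf_preds” columns follow standard SampleBatch naming conventions, and the concrete values are made up:
>>> import numpy as np
>>> rollout = SampleBatch({
...     "obs": np.zeros((5, 4)), "actions": np.zeros(5),
...     "rewards": np.ones(5), "dones": np.array([False] * 4 + [True]),
...     "vf_preds": np.zeros(5)})
>>> out = compute_advantages(rollout, last_r=0.0, gamma=0.99, lambda_=0.95)
>>> print(out["advantages"].shape, out["value_targets"].shape)
(5,) (5,)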
-
ray.rllib.evaluation.
collect_metrics
(local_worker: Optional[RolloutWorker] = None, remote_workers: List[ActorHandle] = [], to_be_collected: List[ObjectRef] = [], timeout_seconds: int = 180) → dict[source]¶ Gathers episode metrics from RolloutWorker instances.
-
class
ray.rllib.evaluation.
SampleBatch
(*args, **kwargs)[source]¶ Wrapper around a dictionary with string keys and array-like values.
For example, {“obs”: [1, 2, 3], “reward”: [0, -1, 1]} is a batch of three samples, each with an “obs” and “reward” attribute.
-
static
concat_samples
(samples: List[SampleBatch]) → Union[ray.rllib.policy.sample_batch.SampleBatch, ray.rllib.policy.sample_batch.MultiAgentBatch][source]¶ Concatenates n data dicts or MultiAgentBatches.
- Parameters
samples (List[Dict[TensorType]]]) – List of dicts of data (numpy).
- Returns
- A new (compressed)
SampleBatch or MultiAgentBatch.
- Return type
Union[SampleBatch, MultiAgentBatch]
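Mirroring the concat() example below, an illustrative use of the static variant:
>>> b1 = SampleBatch({"a": [1, 2]})
>>> b2 = SampleBatch({"a": [3]})
>>> print(SampleBatch.concat_samples([b1, b2]))
{"a": [1, 2, 3]}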
-
concat
(other: ray.rllib.policy.sample_batch.SampleBatch) → ray.rllib.policy.sample_batch.SampleBatch[source]¶ Returns a new SampleBatch with each data column concatenated.
- Parameters
other (SampleBatch) – The other SampleBatch object to concat to this one.
- Returns
- The new SampleBatch, resulting from concating other
to self.
- Return type
Examples
>>> b1 = SampleBatch({"a": [1, 2]})
>>> b2 = SampleBatch({"a": [3, 4, 5]})
>>> print(b1.concat(b2))
{"a": [1, 2, 3, 4, 5]}
-
copy
() → ray.rllib.policy.sample_batch.SampleBatch[source]¶ Creates a (deep) copy of this SampleBatch and returns it.
- Returns
A (deep) copy of this SampleBatch object.
- Return type
-
rows
() → Dict[str, Any][source]¶ Returns an iterator over data rows, i.e. dicts with column values.
- Yields
Dict[str, TensorType] –
- The column values of the row in this
iteration.
Examples
>>> batch = SampleBatch({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> for row in batch.rows(): print(row)
{"a": 1, "b": 4}
{"a": 2, "b": 5}
{"a": 3, "b": 6}
-
columns
(keys: List[str]) → List[any][source]¶ Returns a list of the batch-data in the specified columns.
- Parameters
keys (List[str]) – List of column names for which to return the data.
- Returns
- The list of data items ordered by the order of column
names in keys.
- Return type
List[any]
Examples
>>> batch = SampleBatch({"a": [1], "b": [2], "c": [3]})
>>> print(batch.columns(["a", "b"]))
[[1], [2]]
-
split_by_episode
() → List[ray.rllib.policy.sample_batch.SampleBatch][source]¶ Splits this batch’s data by eps_id.
- Returns
List of batches, one per distinct episode.
- Return type
List[SampleBatch]
-
slice
(start: int, end: int) → ray.rllib.policy.sample_batch.SampleBatch[source]¶ Returns a slice of the row data of this batch (w/o copying).
- Parameters
start (int) – Starting index.
end (int) – Ending index.
- Returns
- A new SampleBatch, which has a slice of this batch’s
data.
- Return type
-
timeslices
(k: int) → List[ray.rllib.policy.sample_batch.SampleBatch][source]¶ Returns SampleBatches, each one representing a k-slice of this one.
Will start from timestep 0 and produce slices of size=k.
- Parameters
k (int) – The size (in timesteps) of each returned SampleBatch.
- Returns
- The list of (new) SampleBatches (each one of
size k).
- Return type
List[SampleBatch]
-
keys
() → Iterable[str][source]¶ - Returns
The keys() iterable over self.data.
- Return type
Iterable[str]
-
items
() → Iterable[Any][source]¶ - Returns
The items() iterable over self.data.
- Return type
Iterable[TensorType]
-
get
(key: str) → Optional[Any][source]¶ Returns one column (by key) from the data or None if key not found.
- Parameters
key (str) – The key (column name) to return.
- Returns
- The data under the given key. None if key
not found in data.
- Return type
Optional[TensorType]
-
size_bytes
() → int[source]¶ - Returns
The overall size in bytes of the data buffer (all columns).
- Return type
int
-
compress
(bulk: bool = False, columns: Set[str] = frozenset({'new_obs', 'obs'})) → None[source]¶ Compresses the data buffers (by column) in place.
- Parameters
bulk (bool) – Whether to compress across the batch dimension (0) as well. If False will compress n separate list items, where n is the batch size.
columns (Set[str]) – The columns to compress. Default: Only compress the obs and new_obs columns.
-
decompress_if_needed
(columns: Set[str] = frozenset({'new_obs', 'obs'})) → ray.rllib.policy.sample_batch.SampleBatch[source]¶ Decompresses data buffers (per column if not compressed) in place.
- Parameters
columns (Set[str]) – The columns to decompress. Default: Only decompress the obs and new_obs columns.
- Returns
This very SampleBatch.
- Return type
-
static
-
class
ray.rllib.evaluation.
MultiAgentBatch
(policy_batches: Dict[str, ray.rllib.policy.sample_batch.SampleBatch], env_steps: int)[source]¶ A batch of experiences from multiple agents in the environment.
-
policy_batches
¶ Mapping from policy ids to SampleBatches of experiences.
- Type
Dict[PolicyID, SampleBatch]
-
count
¶ The number of env steps in this batch.
- Type
int
-
env_steps
() → int[source]¶ The number of env steps (there are >= 1 agent steps per env step).
- Returns
The number of environment steps contained in this batch.
- Return type
int
-
agent_steps
() → int[source]¶ The number of agent steps (there are >= 1 agent steps per env step).
- Returns
The number of agent steps total in this batch.
- Return type
int
-
timeslices
(k: int) → List[ray.rllib.policy.sample_batch.MultiAgentBatch][source]¶ Returns k-step batches holding data for each agent at those steps.
For example, suppose we have agent1 observations [a1t1, a1t2, a1t3], for agent2, [a2t1, a2t3], and for agent3, [a3t3] only.
Calling timeslices(1) would return three MultiAgentBatches containing [a1t1, a2t1], [a1t2], and [a1t3, a2t3, a3t3].
Calling timeslices(2) would return two MultiAgentBatches containing [a1t1, a1t2, a2t1], and [a1t3, a2t3, a3t3].
This method is used to implement “lockstep” replay mode. Note that this method does not guarantee each batch contains only data from a single unroll. Batches might contain data from multiple different envs.
-
static
wrap_as_needed
(policy_batches: Dict[str, ray.rllib.policy.sample_batch.SampleBatch], env_steps: int) → Union[ray.rllib.policy.sample_batch.SampleBatch, ray.rllib.policy.sample_batch.MultiAgentBatch][source]¶ Returns SampleBatch or MultiAgentBatch, depending on given policies.
- Parameters
policy_batches (Dict[PolicyID, SampleBatch]) – Mapping from policy ids to SampleBatch.
env_steps (int) – Number of env steps in the batch.
- Returns
- The single default policy’s
SampleBatch or a MultiAgentBatch (more than one policy).
- Return type
Union[SampleBatch, MultiAgentBatch]
-
static
concat_samples
(samples: List[MultiAgentBatch]) → ray.rllib.policy.sample_batch.MultiAgentBatch[source]¶ Concatenates a list of MultiAgentBatches into a new MultiAgentBatch.
- Parameters
samples (List[MultiAgentBatch]) – List of MultiagentBatch objects to concatenate.
- Returns
- A new MultiAgentBatch consisting of the
concatenated inputs.
- Return type
-
copy
() → ray.rllib.policy.sample_batch.MultiAgentBatch[source]¶ Deep-copies self into a new MultiAgentBatch.
- Returns
The copy of self with deep-copied data.
- Return type
-
size_bytes
() → int[source]¶ - Returns
The overall size in bytes of all policy batches (all columns).
- Return type
int
-
compress
(bulk: bool = False, columns: Set[str] = frozenset({'new_obs', 'obs'})) → None[source]¶ Compresses each policy batch (per column) in place.
- Parameters
bulk (bool) – Whether to compress across the batch dimension (0) as well. If False will compress n separate list items, where n is the batch size.
columns (Set[str]) – Set of column names to compress.
-
decompress_if_needed
(columns: Set[str] = frozenset({'new_obs', 'obs'})) → ray.rllib.policy.sample_batch.MultiAgentBatch[source]¶ Decompresses each policy batch (per column), if already compressed.
- Parameters
columns (Set[str]) – Set of column names to decompress.
- Returns
This very MultiAgentBatch.
- Return type
-
ray.rllib.execution¶
-
ray.rllib.execution.
Concurrently
(ops: List[ray.util.iter.LocalIterator], *, mode: str = 'round_robin', output_indexes: Optional[List[int]] = None, round_robin_weights: Optional[List[int]] = None) → ray.util.iter.LocalIterator[Union[ray.rllib.policy.sample_batch.SampleBatch, ray.rllib.policy.sample_batch.MultiAgentBatch]][source]¶ Operator that runs the given parent iterators concurrently.
- Parameters
mode (str) – One of ‘round_robin’, ‘async’. In ‘round_robin’ mode, we alternate between pulling items from each parent iterator in order deterministically. In ‘async’ mode, we pull from each parent iterator as fast as they are produced. This is non-deterministic.
output_indexes (list) – If specified, only output results from the given ops. For example, if
output_indexes=[0]
, only results from the first op in ops will be returned.round_robin_weights (list) – List of weights to use for round robin mode. For example,
[2, 1]
will cause the iterator to pull twice as many items from the first iterator as the second.[2, 1, *]
will cause as many items to be pulled as possible from the third iterator without blocking. This is only allowed in round robin mode.
Examples
>>> sim_op = ParallelRollouts(...).for_each(...)
>>> replay_op = LocalReplay(...).for_each(...)
>>> combined_op = Concurrently([sim_op, replay_op], mode="async")
-
class
ray.rllib.execution.
Enqueue
(output_queue: queue.Queue)[source]¶ Enqueue data items into a queue.Queue instance.
Returns the input item as output.
The enqueue is non-blocking, so Enqueue operations can be executed with Dequeue via the Concurrently() operator.
Examples
>>> queue = queue.Queue(100)
>>> write_op = ParallelRollouts(...).for_each(Enqueue(queue))
>>> read_op = Dequeue(queue)
>>> combined_op = Concurrently([write_op, read_op], mode="async")
>>> next(combined_op)
SampleBatch(...)
-
ray.rllib.execution.
Dequeue
(input_queue: queue.Queue, check=<function <lambda>>) → ray.util.iter.LocalIterator[Union[ray.rllib.policy.sample_batch.SampleBatch, ray.rllib.policy.sample_batch.MultiAgentBatch]][source]¶ Dequeue data items from a queue.Queue instance.
The dequeue is non-blocking, so Dequeue operations can be executed with Enqueue via the Concurrently() operator.
- Parameters
input_queue (Queue) – queue to pull items from.
check (fn) – liveness check. When this function returns false, Dequeue() will raise an error to halt execution.
Examples
>>> queue = queue.Queue(100)
>>> write_op = ParallelRollouts(...).for_each(Enqueue(queue))
>>> read_op = Dequeue(queue)
>>> combined_op = Concurrently([write_op, read_op], mode="async")
>>> next(combined_op)
SampleBatch(...)
-
ray.rllib.execution.
StandardMetricsReporting
(train_op: ray.util.iter.LocalIterator[Any], workers: ray.rllib.evaluation.worker_set.WorkerSet, config: dict, selected_workers: List[ActorHandle] = None) → ray.util.iter.LocalIterator[dict][source]¶ Operator to periodically collect and report metrics.
- Parameters
train_op (LocalIterator) – Operator for executing training steps. We ignore the output values.
workers (WorkerSet) – Rollout workers to collect metrics from.
config (dict) – Trainer configuration, used to determine the frequency of stats reporting.
selected_workers (list) – Override the list of remote workers to collect metrics from.
- Returns
A local iterator over training results.
- Return type
LocalIterator[dict]
Examples
>>> train_op = ParallelRollouts(...).for_each(TrainOneStep(...))
>>> metrics_op = StandardMetricsReporting(train_op, workers, config)
>>> next(metrics_op)
{"episode_reward_max": ..., "episode_reward_mean": ..., ...}
-
class
ray.rllib.execution.
CollectMetrics
(workers: ray.rllib.evaluation.worker_set.WorkerSet, min_history: int = 100, timeout_seconds: int = 180, selected_workers: List[ActorHandle] = None)[source]¶ Callable that collects metrics from workers.
The metrics are smoothed over a given history window.
This should be used with the .for_each() operator. For a higher level API, consider using StandardMetricsReporting instead.
Examples
>>> output_op = train_op.for_each(CollectMetrics(workers))
>>> print(next(output_op))
{"episode_reward_max": ..., "episode_reward_mean": ..., ...}
-
class
ray.rllib.execution.
OncePerTimeInterval
(delay: int)[source]¶ Callable that returns True once per given interval.
This should be used with the .filter() operator to throttle / rate-limit metrics reporting. For a higher-level API, consider using StandardMetricsReporting instead.
Examples
>>> throttled_op = train_op.filter(OncePerTimeInterval(5))
>>> start = time.time()
>>> next(throttled_op)
>>> print(time.time() - start)
5.00001  # will be greater than 5 seconds
-
class
ray.rllib.execution.
OncePerTimestepsElapsed
(delay_steps: int)[source]¶ Callable that returns True once per given number of timesteps.
This should be used with the .filter() operator to throttle / rate-limit metrics reporting. For a higher-level API, consider using StandardMetricsReporting instead.
Examples
>>> throttled_op = train_op.filter(OncePerTimestepsElapsed(1000))
>>> next(throttled_op)  # will only return after 1000 steps have elapsed
-
class
ray.rllib.execution.
StoreToReplayBuffer
(*, local_buffer: ray.rllib.execution.replay_buffer.LocalReplayBuffer = None, actors: List[ActorHandle] = None)[source]¶ Callable that stores data into replay buffer actors.
If constructed with a local replay actor, data will be stored into that buffer. If constructed with a list of replay actor handles, data will be stored randomly among those actors.
This should be used with the .for_each() operator on a rollouts iterator. The batch that was stored is returned.
Examples
>>> actors = [ReplayActor.remote() for _ in range(4)]
>>> rollouts = ParallelRollouts(...)
>>> store_op = rollouts.for_each(StoreToReplayBuffer(actors=actors))
>>> next(store_op)
SampleBatch(...)
-
ray.rllib.execution.
Replay
(*, local_buffer: ray.rllib.execution.replay_buffer.LocalReplayBuffer = None, actors: List[ActorHandle] = None, num_async: int = 4) → ray.util.iter.LocalIterator[Union[SampleBatch, MultiAgentBatch]][source]¶ Replay experiences from the given buffer or actors.
This should be combined with the StoreToReplayBuffer operation using the Concurrently() operator.
- Parameters
local_buffer (LocalReplayBuffer) – Local buffer to use. Only one of this and replay_actors can be specified.
actors (list) – List of replay actors. Only one of this and local_buffer can be specified.
num_async (int) – In async mode, the max number of async requests in flight per actor.
Examples
>>> actors = [ReplayActor.remote() for _ in range(4)]
>>> replay_op = Replay(actors=actors)
>>> next(replay_op)
SampleBatch(...)
-
class
ray.rllib.execution.
SimpleReplayBuffer
(num_slots: int, replay_proportion: Optional[float] = None)[source]¶ Simple replay buffer that operates over batches.
-
class
ray.rllib.execution.
MixInReplay
(num_slots: int, replay_proportion: float)[source]¶ This operator adds replay to a stream of experiences.
It takes input batches, and returns a list of batches that include replayed data as well. The number of replayed batches is determined by the configured replay proportion. The max age of a batch is determined by the number of replay slots.
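An illustrative use with the .for_each() operator (the 1:1 replay ratio is arbitrary):
>>> rollouts = ParallelRollouts(workers, mode="bulk_sync")
>>> mixed_op = rollouts.for_each(MixInReplay(num_slots=100, replay_proportion=1.0))
>>> batches = next(mixed_op)  # a list holding the new batch plus replayed batches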
-
ray.rllib.execution.
ParallelRollouts
(workers: ray.rllib.evaluation.worker_set.WorkerSet, *, mode='bulk_sync', num_async=1) → ray.util.iter.LocalIterator[ray.rllib.policy.sample_batch.SampleBatch][source]¶ Operator to collect experiences in parallel from rollout workers.
If there are no remote workers, experiences will be collected serially from the local worker instance instead.
- Parameters
workers (WorkerSet) – set of rollout workers to use.
mode (str) – One of ‘async’, ‘bulk_sync’, ‘raw’. In ‘async’ mode, batches are returned as soon as they are computed by rollout workers with no order guarantees. In ‘bulk_sync’ mode, we collect one batch from each worker and concatenate them together into a large batch to return. In ‘raw’ mode, the ParallelIterator object is returned directly and the caller is responsible for implementing gather and updating the timesteps counter.
num_async (int) – In async mode, the max number of async requests in flight per actor.
- Returns
A local iterator over experiences collected in parallel.
Examples
>>> rollouts = ParallelRollouts(workers, mode="async")
>>> batch = next(rollouts)
>>> print(batch.count)
50  # config.rollout_fragment_length
>>> rollouts = ParallelRollouts(workers, mode="bulk_sync")
>>> batch = next(rollouts)
>>> print(batch.count)
200  # config.rollout_fragment_length * config.num_workers
Updates the STEPS_SAMPLED_COUNTER counter in the local iterator context.
-
ray.rllib.execution.
AsyncGradients
(workers: ray.rllib.evaluation.worker_set.WorkerSet) → ray.util.iter.LocalIterator[Tuple[Union[List[Tuple[Any, Any]], List[Any]], int]][source]¶ Operator to compute gradients in parallel from rollout workers.
- Parameters
workers (WorkerSet) – set of rollout workers to use.
- Returns
A local iterator over policy gradients computed on rollout workers.
Examples
>>> grads_op = AsyncGradients(workers)
>>> print(next(grads_op))
{"var_0": ..., ...}, 50  # grads, batch count
Updates the STEPS_SAMPLED_COUNTER counter and LEARNER_INFO field in the local iterator context.
-
class
ray.rllib.execution.
ConcatBatches
(min_batch_size: int, count_steps_by: str = 'env_steps')[source]¶ Callable used to merge batches into larger batches for training.
This should be used with the .combine() operator.
Examples
>>> rollouts = ParallelRollouts(...)
>>> rollouts = rollouts.combine(ConcatBatches(
...     min_batch_size=10000, count_steps_by="env_steps"))
>>> print(next(rollouts).count)
10000
-
class
ray.rllib.execution.
SelectExperiences
(policy_ids: List[str])[source]¶ Callable used to select experiences from a MultiAgentBatch.
This should be used with the .for_each() operator.
Examples
>>> rollouts = ParallelRollouts(...)
>>> rollouts = rollouts.for_each(SelectExperiences(["pol1", "pol2"]))
>>> print(next(rollouts).policy_batches.keys())
{"pol1", "pol2"}
-
class
ray.rllib.execution.
StandardizeFields
(fields: List[str])[source]¶ Callable used to standardize fields of batches.
This should be used with the .for_each() operator. Note that the input may be mutated by this operator for efficiency.
Examples
>>> rollouts = ParallelRollouts(...)
>>> rollouts = rollouts.for_each(StandardizeFields(["advantages"]))
>>> print(np.std(next(rollouts)["advantages"]))
1.0
-
class
ray.rllib.execution.
TrainOneStep
(workers: ray.rllib.evaluation.worker_set.WorkerSet, policies: List[str] = frozenset({}), num_sgd_iter: int = 1, sgd_minibatch_size: int = 0)[source]¶ Callable that improves the policy and updates workers.
This should be used with the .for_each() operator. A tuple of the input and learner stats will be returned.
Examples
>>> rollouts = ParallelRollouts(...)
>>> train_op = rollouts.for_each(TrainOneStep(workers))
>>> print(next(train_op))  # This trains the policy on one batch.
SampleBatch(...), {"learner_stats": ...}
Updates the STEPS_TRAINED_COUNTER counter and LEARNER_INFO field in the local iterator context.
-
class
ray.rllib.execution.
TrainTFMultiGPU
(workers: ray.rllib.evaluation.worker_set.WorkerSet, sgd_minibatch_size: int, num_sgd_iter: int, num_gpus: int, rollout_fragment_length: int, num_envs_per_worker: int, train_batch_size: int, shuffle_sequences: bool, policies: List[str] = frozenset({}), _fake_gpus: bool = False, framework: str = 'tf')[source]¶ TF Multi-GPU version of TrainOneStep.
This should be used with the .for_each() operator. A tuple of the input and learner stats will be returned.
Examples
>>> rollouts = ParallelRollouts(...)
>>> train_op = rollouts.for_each(TrainTFMultiGPU(workers, ...))
>>> print(next(train_op))  # This trains the policy on one batch.
SampleBatch(...), {"learner_stats": ...}
Updates the STEPS_TRAINED_COUNTER counter and LEARNER_INFO field in the local iterator context.
-
class
ray.rllib.execution.
ComputeGradients
(workers: ray.rllib.evaluation.worker_set.WorkerSet)[source]¶ Callable that computes gradients with respect to the policy loss.
This should be used with the .for_each() operator.
Examples
>>> grads_op = rollouts.for_each(ComputeGradients(workers))
>>> print(next(grads_op))
{"var_0": ..., ...}, 50  # grads, batch count
Updates the LEARNER_INFO info field in the local iterator context.
-
class
ray.rllib.execution.
ApplyGradients
(workers, policies: List[str] = frozenset({}), update_all=True)[source]¶ Callable that applies gradients and updates workers.
This should be used with the .for_each() operator.
Examples
>>> apply_op = grads_op.for_each(ApplyGradients(workers))
>>> print(next(apply_op))
None
Updates the STEPS_TRAINED_COUNTER counter in the local iterator context.
-
class
ray.rllib.execution.
AverageGradients
[source]¶ Callable that averages the gradients in a batch.
This should be used with the .for_each() operator after a set of gradients have been batched with .batch().
Examples
>>> batched_grads = grads_op.batch(32)
>>> avg_grads = batched_grads.for_each(AverageGradients())
>>> print(next(avg_grads))
{"var_0": ..., ...}, 1600  # averaged grads, summed batch count
-
class
ray.rllib.execution.
UpdateTargetNetwork
(workers: ray.rllib.evaluation.worker_set.WorkerSet, target_update_freq: int, by_steps_trained: bool = False, policies: List[str] = frozenset({}))[source]¶ Periodically call policy.update_target() on all trainable policies.
This should be used with the .for_each() operator after training step has been taken.
Examples
>>> train_op = ParallelRollouts(...).for_each(TrainOneStep(...))
>>> update_op = train_op.for_each(
...     UpdateTargetNetwork(workers, target_update_freq=500))
>>> print(next(update_op))
None
Updates the LAST_TARGET_UPDATE_TS and NUM_TARGET_UPDATES counters in the local iterator context. The value of the last update counter is used to track when we should update the target next.
ray.rllib.models¶
-
class
ray.rllib.models.
ActionDistribution
(inputs: List[Any], model: ray.rllib.models.modelv2.ModelV2)[source]¶ The policy action distribution of an agent.
-
inputs
¶ input vector to compute samples from.
- Type
Tensors
-
deterministic_sample
() → Any[source]¶ Get the deterministic “sampling” output from the distribution. This is usually the max likelihood output, i.e. mean for Normal, argmax for Categorical, etc.
-
kl
(other: ray.rllib.models.action_dist.ActionDistribution) → Any[source]¶ The KL-divergence between two action distributions.
-
multi_kl
(other: ray.rllib.models.action_dist.ActionDistribution) → Any[source]¶ The KL-divergence between two action distributions.
This differs from kl() in that it can return an array for MultiDiscrete. TODO(ekl) consider removing this.
-
multi_entropy
() → Any[source]¶ The entropy of the action distribution.
This differs from entropy() in that it can return an array for MultiDiscrete. TODO(ekl) consider removing this.
-
static
required_model_output_shape
(action_space: <Mock name='mock.Space' id='139801234788560'>, model_config: dict) → Union[int, numpy.ndarray][source]¶ Returns the required shape of an input parameter tensor for a particular action space and an optional dict of distribution-specific options.
- Parameters
action_space (gym.Space) – The action space this distribution will be used for, whose shape attributes will be used to determine the required shape of the input parameter tensor.
model_config (dict) – Model’s config dict (as defined in catalog.py)
- Returns
- size of the
required input vector (minus leading batch dimension).
- Return type
model_output_shape (int or np.ndarray of ints)
-
-
class
ray.rllib.models.
ModelCatalog
[source]¶ Registry of models, preprocessors, and action distributions for envs.
Examples
>>> prep = ModelCatalog.get_preprocessor(env)
>>> observation = prep.transform(raw_observation)
>>> dist_class, dist_dim = ModelCatalog.get_action_dist(
...     env.action_space, {})
>>> model = ModelCatalog.get_model_v2(
...     obs_space, action_space, num_outputs, options)
>>> dist = dist_class(model.outputs, model)
>>> action = dist.sample()
-
static
get_action_dist
(action_space: <Mock name='mock.Space' id='139801234788560'>, config: dict, dist_type: Union[str, Type[ray.rllib.models.action_dist.ActionDistribution], None] = None, framework: str = 'tf', **kwargs) -> (<class 'type'>, <class 'int'>)[source]¶ Returns a distribution class and size for the given action space.
- Parameters
action_space (Space) – Action space of the target gym env.
config (Optional[dict]) – Optional model config.
dist_type (Optional[Union[str, Type[ActionDistribution]]]) – Identifier of the action distribution (str) interpreted as a hint or the actual ActionDistribution class to use.
framework (str) – One of “tf2”, “tf”, “tfe”, “torch”, or “jax”.
kwargs (dict) – Optional kwargs to pass on to the Distribution’s constructor.
- Returns
- dist_class (ActionDistribution): Python class of the
distribution.
- dist_dim (int): The size of the input vector to the
distribution.
- Return type
Tuple
-
static
get_action_shape
(action_space: <Mock name='mock.Space' id='139801234788560'>, framework: str = 'tf') -> (<class 'numpy.dtype'>, typing.List[int])[source]¶ Returns action tensor dtype and shape for the action space.
- Parameters
action_space (Space) – Action space of the target gym env.
framework (str) – The framework identifier. One of “tf” or “torch”.
- Returns
Dtype and shape of the actions tensor.
- Return type
(dtype, shape)
-
static
get_action_placeholder
(action_space: <Mock name='mock.Space' id='139801234788560'>, name: str = 'action') → Any[source]¶ Returns an action placeholder consistent with the action space.
- Parameters
action_space (Space) – Action space of the target gym env.
name (str) – An optional string to name the placeholder by. Default: “action”.
- Returns
A placeholder for the actions
- Return type
action_placeholder (Tensor)
-
static
get_model_v2
(obs_space: <Mock name='mock.Space' id='139801234788560'>, action_space: <Mock name='mock.Space' id='139801234788560'>, num_outputs: int, model_config: dict, framework: str = 'tf', name: str = 'default_model', model_interface: type = None, default_model: type = None, **model_kwargs) → ray.rllib.models.modelv2.ModelV2[source]¶ Returns a suitable model compatible with given spaces and output.
- Parameters
obs_space (Space) – Observation space of the target gym env. This may have an original_space attribute that specifies how to unflatten the tensor into a ragged tensor.
action_space (Space) – Action space of the target gym env.
num_outputs (int) – The size of the output vector of the model.
model_config (ModelConfigDict) – The “model” sub-config dict within the Trainer’s config dict.
framework (str) – One of “tf2”, “tf”, “tfe”, “torch”, or “jax”.
name (str) – Name (scope) for the model.
model_interface (cls) – Interface required for the model.
default_model (cls) – Override the default class for the model. This only has an effect when not using a custom model.
model_kwargs (dict) – Args to pass to the ModelV2 constructor.
- Returns
Model to use for the policy.
- Return type
model (ModelV2)
-
static
get_preprocessor
(env: <Mock name='mock.Env' id='139801239523600'>, options: Optional[dict] = None) → ray.rllib.models.preprocessors.Preprocessor[source]¶ Returns a suitable preprocessor for the given env.
This is a wrapper for get_preprocessor_for_space().
-
static
get_preprocessor_for_space
(observation_space: <Mock name='mock.Space' id='139801234788560'>, options: dict = None) → ray.rllib.models.preprocessors.Preprocessor[source]¶ Returns a suitable preprocessor for the given observation space.
- Parameters
observation_space (Space) – The input observation space.
options (dict) – Options to pass to the preprocessor.
- Returns
Preprocessor for the observations.
- Return type
preprocessor (Preprocessor)
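A minimal usage sketch (assumes a gym env object env is available; mirrors the class-level ModelCatalog example above):
>>> prep = ModelCatalog.get_preprocessor_for_space(env.observation_space)
>>> flat_obs = prep.transform(env.reset())
>>> print(prep.shape)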
-
static
register_custom_preprocessor
(preprocessor_name: str, preprocessor_class: type) → None[source]¶ Register a custom preprocessor class by name.
The preprocessor can later be used by specifying {“custom_preprocessor”: preprocessor_name} in the model config.
- Parameters
preprocessor_name (str) – Name to register the preprocessor under.
preprocessor_class (type) – Python class of the preprocessor.
-
static
register_custom_model
(model_name: str, model_class: type) → None[source]¶ Register a custom model class by name.
The model can later be used by specifying {“custom_model”: model_name} in the model config.
- Parameters
model_name (str) – Name to register the model under.
model_class (type) – Python class of the model.
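For example, a hypothetical custom model class MyTorchModel (a ModelV2 subclass) could be registered and then referenced by name in the model config (a sketch):
>>> ModelCatalog.register_custom_model("my_model", MyTorchModel)
>>> config = {"model": {"custom_model": "my_model"}}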
-
static
register_custom_action_dist
(action_dist_name: str, action_dist_class: type) → None[source]¶ Register a custom action distribution class by name.
The action distribution can later be used by specifying {“custom_action_dist”: action_dist_name} in the model config.
- Parameters
action_dist_name (str) – Name to register the action distribution under.
action_dist_class (type) – Python class of the action distribution.
-
class
ray.rllib.models.
ModelV2
(obs_space: <Mock name='mock.spaces.Space' id='139801235170768'>, action_space: <Mock name='mock.spaces.Space' id='139801235170768'>, num_outputs: int, model_config: dict, name: str, framework: str)[source]¶ Defines an abstract neural network model for use with RLlib.
Custom models should extend either TFModelV2 or TorchModelV2 instead of this class directly.
- Data flow:
- obs -> forward() -> model_out
- value_function() -> V(s)
-
get_initial_state
() → List[numpy.ndarray][source]¶ Get the initial recurrent state values for the model.
- Returns
- List of np.array objects containing the initial
hidden state of an RNN, if applicable.
- Return type
List[np.ndarray]
Examples
>>> def get_initial_state(self):
>>>     return [
>>>         np.zeros(self.cell_size, np.float32),
>>>         np.zeros(self.cell_size, np.float32),
>>>     ]
-
forward
(input_dict: Dict[str, Any], state: List[Any], seq_lens: Any)[source]¶ Call the model with the given input tensors and state.
Any complex observations (dicts, tuples, etc.) will be unpacked by __call__ before being passed to forward(). To access the flattened observation tensor, refer to input_dict[“obs_flat”].
This method can be called any number of times. In eager execution, each call to forward() will eagerly evaluate the model. In symbolic execution, each call to forward creates a computation graph that operates over the variables of this model (i.e., shares weights).
Custom models should override this instead of __call__.
- Parameters
input_dict (dict) – dictionary of input tensors, including “obs”, “obs_flat”, “prev_action”, “prev_reward”, “is_training”, “eps_id”, “agent_id”, “infos”, and “t”.
state (list) – list of state tensors with sizes matching those returned by get_initial_state + the batch dimension
seq_lens (Tensor) – 1d tensor holding input sequence lengths
- Returns
- The model output tensor of size
[BATCH, num_outputs], and the new RNN state.
- Return type
(outputs, state)
Examples
>>> def forward(self, input_dict, state, seq_lens):
>>>     model_out, self._value_out = self.base_model(
...         input_dict["obs"])
>>>     return model_out, state
-
value_function
() → Any[source]¶ Returns the value function output for the most recent forward pass.
Note that a forward() call has to be performed first before this method can return anything; calling this method does not cause an extra forward pass through the network.
- Returns
value estimate tensor of shape [BATCH].
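A sketch of a typical override, assuming a TF model whose forward() stored the value-branch output in self._value_out (as in the forward() example above):
>>> def value_function(self):
>>>     return tf.reshape(self._value_out, [-1])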
-
custom_loss
(policy_loss: Any, loss_inputs: Dict[str, Any]) → Any[source]¶ Override to customize the loss function used to optimize this model.
This can be used to incorporate self-supervised losses (by defining a loss over existing input and output tensors of this model), and supervised losses (by defining losses over a variable-sharing copy of this model’s layers).
You can find a runnable example in examples/custom_loss.py.
- Parameters
policy_loss (Union[List[Tensor],Tensor]) – List of or single policy loss(es) from the policy.
loss_inputs (dict) – map of input placeholders for rollout data.
- Returns
- List of or scalar tensor for the
customized loss(es) for this model.
- Return type
Union[List[Tensor],Tensor]
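A sketch of one possible override that adds an L2 regularization term on top of the policy loss (assumes a TF model and that policy_loss is a single tensor rather than a list):
>>> def custom_loss(self, policy_loss, loss_inputs):
>>>     # Hypothetical auxiliary loss: L2 regularization over all
>>>     # trainable variables of this model.
>>>     l2_reg = 1e-4 * tf.add_n(
...         [tf.nn.l2_loss(v) for v in self.trainable_variables()])
>>>     return policy_loss + l2_reg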
-
metrics
() → Dict[str, Any][source]¶ Override to return custom metrics from your model.
The stats will be reported as part of the learner stats, i.e.:
info:
  learner:
    model:
      key1: metric1
      key2: metric2
- Returns
Dict of string keys to scalar tensors.
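For example, a model could report a custom scalar alongside the learner stats (a sketch; assumes a TF model that stored self._value_out in forward()):
>>> def metrics(self):
>>>     return {"mean_value_estimate": tf.reduce_mean(self._value_out)}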
-
from_batch
(train_batch: ray.rllib.policy.sample_batch.SampleBatch, is_training: bool = True)[source]¶ Convenience function that calls this model with a tensor batch.
All this does is unpack the tensor batch to call this model with the right input dict, state, and seq len arguments.
-
import_from_h5
(h5_file: str) → None[source]¶ Imports weights from an h5 file.
- Parameters
h5_file (str) – The h5 file name to import weights from.
Example
>>> trainer = MyTrainer()
>>> trainer.import_policy_model_from_h5("/tmp/weights.h5")
>>> for _ in range(10):
>>>     trainer.train()
-
context
() → contextlib.AbstractContextManager[source]¶ Returns a contextmanager for the current forward pass.
-
variables
(as_dict: bool = False) → Union[List[Any], Dict[str, Any]][source]¶ Returns the list (or a dict) of variables for this model.
- Parameters
as_dict (bool) – Whether variables should be returned as dict-values (using descriptive str keys).
- Returns
- The list (or dict if as_dict is
True) of all variables of this ModelV2.
- Return type
Union[List[any],Dict[str,any]]
-
trainable_variables
(as_dict: bool = False) → Union[List[Any], Dict[str, Any]][source]¶ Returns the list of trainable variables for this model.
- Parameters
as_dict (bool) – Whether variables should be returned as dict-values (using descriptive keys).
- Returns
- The list (or dict if as_dict is
True) of all trainable (tf)/requires_grad (torch) variables of this ModelV2.
- Return type
Union[List[any],Dict[str,any]]
-
is_time_major
() → bool[source]¶ If True, data for calling this ModelV2 must be in time-major format.
- Returns
- bool: Whether this ModelV2 requires a time-major (TxBx…) data
format.
-
get_input_dict
(sample_batch, index: Union[int, str] = 'last') → Dict[str, Any][source]¶ Creates a single-timestep input dict at the given index from a SampleBatch.
- Parameters
sample_batch (SampleBatch) – A single-trajectory SampleBatch object to generate the compute_actions input dict from.
index (Union[int, str]) – An integer index value indicating the position in the trajectory for which to generate the compute_actions input dict. Set to “last” to generate the dict at the very end of the trajectory (e.g. for value estimation). Note that “last” is different from -1, as “last” will use the final NEXT_OBS as observation input.
- Returns
The (single-timestep) input dict for ModelV2 calls.
- Return type
ModelInputDict
-
class
ray.rllib.models.
Preprocessor
(obs_space: <Mock name='mock.Space' id='139801234788560'>, options: dict = None)[source]¶ Defines an abstract observation preprocessor function.
-
shape
¶ Shape of the preprocessed output.
- Type
List[int]
-
ray.rllib.utils¶
-
ray.rllib.utils.
override
(cls)[source]¶ Annotation for documenting method overrides.
- Parameters
cls (type) – The superclass that provides the overridden method. If this cls does not actually have the method, an error is raised.
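Typical usage is to decorate an overriding method with the superclass that defines it, e.g. (a sketch using ModelV2.forward):
>>> class MyModel(ModelV2):
...     @override(ModelV2)
...     def forward(self, input_dict, state, seq_lens):
...         ...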
-
ray.rllib.utils.
PublicAPI
(obj)[source]¶ Annotation for documenting public APIs.
Public APIs are classes and methods exposed to end users of RLlib. You can expect these APIs to remain stable across RLlib releases.
Subclasses that inherit from a @PublicAPI base class can be assumed part of the RLlib public API as well (e.g., all trainer classes are in the public API because Trainer is @PublicAPI). In addition, you can assume all trainer configurations are part of their public API as well.
-
ray.rllib.utils.
DeveloperAPI
(obj)[source]¶ Annotation for documenting developer APIs.
Developer APIs are classes and methods explicitly exposed to developers for the purposes of building custom algorithms or advanced training strategies on top of RLlib internals. You can generally expect these APIs to be stable sans minor changes (but less stable than public APIs).
Subclasses that inherit from a @DeveloperAPI base class can be assumed part of the RLlib developer API as well.
-
ray.rllib.utils.
try_import_tf
(error=False)[source]¶ Tries importing tf and returns the module (or None).
- Parameters
error (bool) – Whether to raise an error if tf cannot be imported.
- Returns
- tf1.x module (either from tf2.x.compat.v1 OR as tf1.x).
- tf module (resulting from import tensorflow). Either tf1.x or 2.x.
- The actually installed tf version as int: 1 or 2.
- Return type
Tuple
- Raises
ImportError – If error=True and tf is not installed.
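A common usage pattern, unpacking the returned tuple as described above (the module values may be None if TensorFlow is not installed):
>>> from ray.rllib.utils import try_import_tf
>>> tf1, tf, tfv = try_import_tf()
>>> if tf is not None:
...     print("Installed TensorFlow major version:", tfv)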
-
ray.rllib.utils.
try_import_tfp
(error=False)[source]¶ Tries importing tfp and returns the module (or None).
- Parameters
error (bool) – Whether to raise an error if tfp cannot be imported.
- Returns
The tfp module.
- Raises
ImportError – If error=True and tfp is not installed.
-
ray.rllib.utils.
try_import_torch
(error=False)[source]¶ Tries importing torch and returns the module (or None).
- Parameters
error (bool) – Whether to raise an error if torch cannot be imported.
- Returns
torch AND torch.nn modules.
- Return type
tuple
- Raises
ImportError – If error=True and PyTorch is not installed.
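Analogous usage for PyTorch (the returned modules may be None if torch is not installed):
>>> from ray.rllib.utils import try_import_torch
>>> torch, nn = try_import_torch()
>>> if torch is not None:
...     print("Installed torch version:", torch.__version__)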
-
ray.rllib.utils.
deprecation_warning
(old: str, new: Optional[str] = None, error: Union[bool, Exception, None] = None) → None[source]¶ Warns (via the logger object) or throws a deprecation warning/error.
- Parameters
old (str) – A description of the “thing” that is to be deprecated.
new (Optional[str]) – A description of the new “thing” that replaces it.
error (Optional[Union[bool, Exception]]) – Whether or which exception to throw. If True, throw ValueError. If False, just warn. If Exception, throw that Exception.
-
ray.rllib.utils.
renamed_agent
(cls)[source]¶ Helper class for renaming Agent => Trainer with a warning.
-
ray.rllib.utils.
renamed_class
(cls, old_name)[source]¶ Helper class for renaming classes with a warning.
-
class
ray.rllib.utils.
FilterManager
[source]¶ Manages filters and coordination across remote evaluators that expose get_filters and sync_filters.
-
static
synchronize
(local_filters, remotes, update_remote=True)[source]¶ Aggregates all filters from remote evaluators.
The local copy is updated and then broadcast to all remote evaluators.
- Parameters
local_filters (dict) – Filters to be synchronized.
remotes (list) – Remote evaluators with filters.
update_remote (bool) – Whether to push updates to remote filters.
-
ray.rllib.utils.
sigmoid
(x, derivative=False)[source]¶ Returns the sigmoid function applied to x. Alternatively, can return the derivative of the sigmoid function.
- Parameters
x (np.ndarray) – The input to the sigmoid function.
derivative (bool) – Whether to return the derivative or not. Default: False.
- Returns
The sigmoid function (or its derivative) applied to x.
- Return type
np.ndarray
-
ray.rllib.utils.
softmax
(x, axis=- 1)[source]¶ Returns the softmax values for x as: S(xi) = e^xi / SUMj(e^xj), where j goes over all elements in x.
- Parameters
x (np.ndarray) – The input to the softmax function.
axis (int) – The axis along which to softmax.
- Returns
The softmax over x.
- Return type
np.ndarray
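For example (a sketch; the exact printed formatting may differ):
>>> import numpy as np
>>> from ray.rllib.utils import softmax
>>> softmax(np.array([1.0, 2.0, 3.0]))
array([0.09003057, 0.24472847, 0.66524096])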
-
ray.rllib.utils.
relu
(x, alpha=0.0)[source]¶ Implementation of the leaky ReLU function: y = x * alpha if x < 0 else x
- Parameters
x (np.ndarray) – The input values.
alpha (float) – A scaling (“leak”) factor to use for negative x.
- Returns
The leaky ReLU output for x.
- Return type
np.ndarray
-
ray.rllib.utils.
one_hot
(x: Union[Any, int], depth: int = 0, on_value: int = 1.0, off_value: float = 0.0)[source]¶ One-hot utility function for numpy. Thanks to qianyizhang: https://gist.github.com/qianyizhang/07ee1c15cad08afb03f5de69349efc30.
- Parameters
x (TensorType) – The input to be one-hot encoded.
depth (int) – The max. number to be one-hot encoded (size of last rank).
on_value (float) – The value to use for on. Default: 1.0.
off_value (float) – The value to use for off. Default: 0.0.
- Returns
The one-hot encoded equivalent of the input array.
- Return type
np.ndarray
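For example, encoding two integer indices with depth=3 (a sketch; the exact dtype/formatting may differ):
>>> import numpy as np
>>> from ray.rllib.utils import one_hot
>>> one_hot(np.array([0, 2]), depth=3)
array([[1., 0., 0.],
       [0., 0., 1.]])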
-
ray.rllib.utils.
fc
(x, weights, biases=None, framework=None)[source]¶ Calculates the outputs of a fully-connected (dense) layer given weights/biases and an input.
- Parameters
x (np.ndarray) – The input to the dense layer.
weights (np.ndarray) – The weights matrix.
biases (Optional[np.ndarray]) – The biases vector. All 0s if None.
framework (Optional[str]) – An optional framework hint (to figure out, e.g. whether to transpose torch weight matrices).
- Returns
The dense layer’s output.
-
ray.rllib.utils.
lstm
(x, weights, biases=None, initial_internal_states=None, time_major=False, forget_bias=1.0)[source]¶ Calculates the outputs of an LSTM layer given weights/biases, internal_states, and input.
- Parameters
x (np.ndarray) – The inputs to the LSTM layer including time-rank (0th if time-major, else 1st) and the batch-rank (1st if time-major, else 0th).
weights (np.ndarray) – The weights matrix.
biases (Optional[np.ndarray]) – The biases vector. All 0s if None.
initial_internal_states (Optional[np.ndarray]) – The initial internal states to pass into the layer. All 0s if None.
time_major (bool) – Whether to use time-major or not. Default: False.
forget_bias (float) – Gets added to first sigmoid (forget gate) output. Default: 1.0.
- Returns
- The LSTM layer’s output.
- The last internal states as a tuple (c-state, h-state).
- Return type
Tuple
-
class
ray.rllib.utils.
LinearSchedule
(**kwargs)[source]¶ Linear interpolation between initial_p and final_p. Simply uses Polynomial with power=1.0.
final_p + (initial_p - final_p) * (1 - t/t_max)
-
class
ray.rllib.utils.
PiecewiseSchedule
(endpoints, framework, interpolation=<function _linear_interpolation>, outside_value=None)[source]¶
-
class
ray.rllib.utils.
PolynomialSchedule
(schedule_timesteps, final_p, framework, initial_p=1.0, power=2.0)[source]¶
-
class
ray.rllib.utils.
ExponentialSchedule
(schedule_timesteps, framework, initial_p=1.0, decay_rate=0.1)[source]¶
-
class
ray.rllib.utils.
ConstantSchedule
(value, framework)[source]¶ A Schedule where the value remains constant over time.
-
ray.rllib.utils.
check
(x, y, decimals=5, atol=None, rtol=None, false=False)[source]¶ Checks two structures (dict, tuple, list, np.array, float, int, etc.) for (almost) numeric identity. All numbers in the two structures have to match up to decimals digits after the floating point. Uses assertions.
- Parameters
x (any) – The value to be compared (to the expectation: y). This may be a Tensor.
y (any) – The expected value to be compared to x. This must not be a tf-Tensor, but may be a tfe/torch-Tensor.
decimals (int) – The number of digits after the floating point up to which all numeric values have to match.
atol (float) – Absolute tolerance of the difference between x and y (overrides decimals if given).
rtol (float) – Relative tolerance of the difference between x and y (overrides decimals if given).
false (bool) – Whether to check that x and y are NOT the same.
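For example (a sketch; check() simply passes on success and raises an assertion error on mismatch):
>>> import numpy as np
>>> from ray.rllib.utils import check
>>> check(np.array([1.0000001, 2.0]), np.array([1.0, 2.0]), decimals=5)
>>> check(1.0, 2.0, false=True)  # passes, since the two values differ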
-
ray.rllib.utils.
check_compute_single_action
(trainer, include_state=False, include_prev_action_reward=False)[source]¶ Tests different combinations of arguments for trainer.compute_action.
- Parameters
trainer (Trainer) – The Trainer object to test.
include_state (bool) – Whether to include the initial state of the Policy’s Model in the compute_action call.
include_prev_action_reward (bool) – Whether to include the prev-action and -reward in the compute_action call.
- Raises
ValueError – If anything unexpected happens.
-
ray.rllib.utils.
framework_iterator
(config=None, frameworks=('tf2', 'tf', 'tfe', 'torch'), session=False)[source]¶ A generator that allows for looping through n frameworks for testing.
Provides the correct config entries (“framework”) as well as the correct eager/non-eager contexts for tfe/tf.
- Parameters
config (Optional[dict]) – An optional config dict to alter in place depending on the iteration.
frameworks (Tuple[str]) – A list/tuple of the frameworks to be tested. Allowed are: “tf2”, “tf”, “tfe”, “torch”, and None.
session (bool) – If True and only in the tf-case: Enter a tf.Session() and yield that as second return value (otherwise yield (fw, None)).
- Yields
str – If session is False: The current framework (“tf2”, “tf”, “tfe”, “torch”) used.
Tuple[str, Union[None, tf.Session]] – If session is True: A tuple of the current framework and the tf.Session (if fw=”tf”, else None).
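A sketch of typical test usage (run_test_for is a hypothetical helper; on each iteration, config["framework"] is set to the current framework):
>>> config = {"num_workers": 0}
>>> for fw in framework_iterator(config, frameworks=("tf2", "torch")):
...     # config["framework"] now equals fw for this iteration.
...     run_test_for(config)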
-
ray.rllib.utils.
merge_dicts
(d1, d2)[source]¶ - Parameters
d1 (dict) – Dict 1.
d2 (dict) – Dict 2.
- Returns
A new dict that is d1 and d2 deep merged.
- Return type
dict
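For example, nested sub-dicts are merged rather than overwritten (a sketch):
>>> from ray.rllib.utils import merge_dicts
>>> merge_dicts({"a": 1, "b": {"c": 2}}, {"b": {"d": 3}})
{'a': 1, 'b': {'c': 2, 'd': 3}}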
-
ray.rllib.utils.
deep_update
(original, new_dict, new_keys_allowed=False, allow_new_subkey_list=None, override_all_if_type_changes=None)[source]¶ Updates original dict with values from new_dict recursively.
If new key is introduced in new_dict, then if new_keys_allowed is not True, an error will be thrown. Further, for sub-dicts, if the key is in the allow_new_subkey_list, then new subkeys can be introduced.
- Parameters
original (dict) – Dictionary with default values.
new_dict (dict) – Dictionary with values to be updated
new_keys_allowed (bool) – Whether new keys are allowed.
allow_new_subkey_list (Optional[List[str]]) – List of keys that correspond to dict values where new subkeys can be introduced. This is only at the top level.
override_all_if_type_changes (Optional[List[str]]) – List of top level keys with value=dict, for which we always simply override the entire value (dict), iff the “type” key in that value dict changes.
-
ray.rllib.utils.
add_mixins
(base, mixins)[source]¶ Returns a new class with mixins applied in priority order.
-
ray.rllib.utils.
force_list
(elements=None, to_tuple=False)[source]¶ Makes sure elements is returned as a list, whether elements is a single item, already a list, or a tuple.
- Parameters
elements (Optional[any]) – The inputs as single item, list, or tuple to be converted into a list/tuple. If None, returns empty list/tuple.
to_tuple (bool) – Whether to use tuple (instead of list).
- Returns
- All given elements in a list/tuple depending on
to_tuple’s value. If elements is None, returns an empty list/tuple.
- Return type
Union[list,tuple]
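For example (a sketch of the three input cases):
>>> from ray.rllib.utils import force_list
>>> force_list("a")
['a']
>>> force_list(("a", "b"))
['a', 'b']
>>> force_list(None)
[]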
-
ray.rllib.utils.
force_tuple
(elements=None, *, to_tuple=True)¶ Makes sure elements is returned as a tuple (same as force_list, but with to_tuple=True by default), whether elements is a single item, a list, or already a tuple.
- Parameters
elements (Optional[any]) – The inputs as single item, list, or tuple to be converted into a list/tuple. If None, returns empty list/tuple.
to_tuple (bool) – Whether to use tuple (instead of list).
- Returns
- All given elements in a list/tuple depending on
to_tuple’s value. If elements is None, returns an empty list/tuple.
- Return type
Union[list,tuple]