
abstract Policy.compute_actions(obs_batch: List[numpy.array | jnp.ndarray | tf.Tensor | torch.Tensor | dict | tuple] | numpy.array | jnp.ndarray | tf.Tensor | torch.Tensor | dict | tuple, state_batches: List[numpy.array | jnp.ndarray | tf.Tensor | torch.Tensor] | None = None, prev_action_batch: List[numpy.array | jnp.ndarray | tf.Tensor | torch.Tensor | dict | tuple] | numpy.array | jnp.ndarray | tf.Tensor | torch.Tensor | dict | tuple = None, prev_reward_batch: List[numpy.array | jnp.ndarray | tf.Tensor | torch.Tensor | dict | tuple] | numpy.array | jnp.ndarray | tf.Tensor | torch.Tensor | dict | tuple = None, info_batch: Dict[str, list] | None = None, episodes: List[Episode] | None = None, explore: bool | None = None, timestep: int | None = None, **kwargs) → Tuple[numpy.array | jnp.ndarray | tf.Tensor | torch.Tensor, List[numpy.array | jnp.ndarray | tf.Tensor | torch.Tensor], Dict[str, numpy.array | jnp.ndarray | tf.Tensor | torch.Tensor]]

Computes actions for the current policy.

Parameters:

  • obs_batch – Batch of observations.

  • state_batches – List of RNN state input batches, if any.

  • prev_action_batch – Batch of previous action values.

  • prev_reward_batch – Batch of previous rewards.

  • info_batch – Batch of info objects.

  • episodes – List of Episode objects, one for each obs in obs_batch. This provides access to all of the internal episode state, which may be useful for model-based or multi-agent algorithms.

  • explore – Whether to pick an exploration action (True) or an exploitation action (False). Set to None (default) to use the value of self.config["explore"].

  • timestep – The current (sampling) time step.

Keyword Arguments:

  • kwargs – Forward compatibility placeholder.
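To make the explore fallback described above concrete, here is a minimal sketch of how an implementation might resolve explore=None against a configured default. ToyPolicy, its config dict, and the scoring rule are hypothetical stand-ins for illustration only; this is not RLlib's Policy class. Only the fallback to config["explore"] comes from the parameter description.

```python
import numpy as np


class ToyPolicy:
    """Hypothetical stand-in for a policy; NOT an RLlib class."""

    def __init__(self, num_actions, config):
        self.num_actions = num_actions
        self.config = config
        self._rng = np.random.default_rng(0)

    def compute_actions(self, obs_batch, explore=None, timestep=None, **kwargs):
        # explore=None falls back to the configured default.
        if explore is None:
            explore = self.config["explore"]
        obs_batch = np.asarray(obs_batch, dtype=np.float32)
        batch_size = obs_batch.shape[0]
        if explore:
            # Exploration (toy rule, an assumption): uniform random actions.
            actions = self._rng.integers(0, self.num_actions, size=batch_size)
        else:
            # Exploitation (toy rule, an assumption): treat the first
            # num_actions observation features as action scores.
            actions = obs_batch[:, : self.num_actions].argmax(axis=1)
        # No RNN state and no extra features in this sketch.
        return actions, [], {}


policy = ToyPolicy(num_actions=2, config={"explore": False})
obs = np.array([[0.1, 0.9], [0.8, 0.2]])

# explore is omitted, so config["explore"] (False) applies: greedy actions.
greedy, _, _ = policy.compute_actions(obs)
# An explicit value overrides the configured default.
sampled, _, _ = policy.compute_actions(obs, explore=True)
```

Passing explore explicitly always wins; the config value is consulted only when the argument is left as None.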


Returns:

  • actions – Batch of output actions, with shape like [BATCH_SIZE, …].

  • state_outs (List[TensorType]) – List of RNN state output batches, if any, each with shape [BATCH_SIZE, STATE_SIZE].

  • info (Dict[str, TensorType]) – Dictionary of extra feature batches, if any, with shape like {"f1": [BATCH_SIZE, …], "f2": [BATCH_SIZE, …]}.
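The three return values can be illustrated with a small sketch that produces them with the documented batch shapes and feeds the state outputs back in on the next call. The function toy_compute_actions, the constants, the recurrent update, and the "action_logits" key are all hypothetical choices for illustration, not part of RLlib.

```python
import numpy as np

BATCH_SIZE, OBS_DIM, STATE_SIZE, NUM_ACTIONS = 4, 3, 5, 2


def toy_compute_actions(obs_batch, state_batches=None):
    """Hypothetical function mimicking the documented return structure."""
    obs_batch = np.asarray(obs_batch, dtype=np.float32)
    batch_size = obs_batch.shape[0]
    if state_batches is None:
        # Start from an all-zero recurrent state.
        state_batches = [np.zeros((batch_size, STATE_SIZE), dtype=np.float32)]
    # Toy recurrent update (an assumption): decay the state, add the mean obs.
    new_state = 0.5 * state_batches[0] + obs_batch.mean(axis=1, keepdims=True)
    # Toy logits: the first NUM_ACTIONS features of the new state.
    logits = new_state[:, :NUM_ACTIONS]
    actions = logits.argmax(axis=1)
    # "action_logits" is an illustrative key for an extra feature batch.
    info = {"action_logits": logits}
    return actions, [new_state], info


obs = np.ones((BATCH_SIZE, OBS_DIM), dtype=np.float32)

# First step: no incoming RNN state.
actions, state_outs, info = toy_compute_actions(obs)
# Next step: pass the previous state outputs back in as state_batches.
actions2, state_outs2, _ = toy_compute_actions(obs, state_batches=state_outs)

print(actions.shape)                # one action per observation
print(state_outs[0].shape)          # [BATCH_SIZE, STATE_SIZE]
print(info["action_logits"].shape)  # [BATCH_SIZE, ...]
```

Every returned array keeps the batch dimension first, so the i-th action, state row, and extra-feature row all belong to the i-th observation in obs_batch.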

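Return type:

Tuple[numpy.array | jnp.ndarray | tf.Tensor | torch.Tensor, List[numpy.array | jnp.ndarray | tf.Tensor | torch.Tensor], Dict[str, numpy.array | jnp.ndarray | tf.Tensor | torch.Tensor]]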
