ray.rllib.policy.policy.Policy.compute_single_action

Policy.compute_single_action(obs: numpy.array | jnp.ndarray | tf.Tensor | torch.Tensor | dict | tuple | None = None, state: List[numpy.array | jnp.ndarray | tf.Tensor | torch.Tensor] | None = None, *, prev_action: numpy.array | jnp.ndarray | tf.Tensor | torch.Tensor | dict | tuple | None = None, prev_reward: numpy.array | jnp.ndarray | tf.Tensor | torch.Tensor | dict | tuple | None = None, info: dict = None, input_dict: SampleBatch | None = None, episode: Episode | None = None, explore: bool | None = None, timestep: int | None = None, **kwargs) → Tuple[numpy.array | jnp.ndarray | tf.Tensor | torch.Tensor | dict | tuple, List[numpy.array | jnp.ndarray | tf.Tensor | torch.Tensor], Dict[str, numpy.array | jnp.ndarray | tf.Tensor | torch.Tensor]]

Computes and returns a single (B=1) action value.

Takes an input dict (usually a SampleBatch) as its main data input. This makes the method usable when a more complex input pattern (view requirements) is needed, for example when the Model requires the last n observations, the last m actions/rewards, or a combination of these. Alternatively, if no complex inputs are required, takes a single obs value (and possibly single state values, prev-action/reward values, etc.).
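The two calling styles described above can be illustrated with a small, self-contained sketch. It does not use RLlib: ToyPolicy and its _batched_forward method are hypothetical stand-ins, and the decision rule (action 1 if the observation sums to a positive number) is made up for illustration. The sketch only shows the general idea of wrapping single values into a batch of size 1 (B=1), running a batched computation, and stripping the batch dimension again.

```python
from typing import Any, Dict, List, Optional, Tuple


class ToyPolicy:
    """Hypothetical stand-in for a Policy; not RLlib's implementation."""

    def _batched_forward(
        self, batch: Dict[str, List[Any]]
    ) -> Tuple[List[int], List[List[Any]], Dict[str, List[Any]]]:
        # Batched path: one action per observation in the batch.
        obs_batch = batch["obs"]
        actions = [1 if sum(obs) > 0 else 0 for obs in obs_batch]
        extras = {"obs_sum": [sum(obs) for obs in obs_batch]}
        return actions, [], extras  # no RNN state in this toy example

    def compute_single_action(
        self,
        obs: Optional[List[int]] = None,
        state: Optional[List[Any]] = None,
        *,
        input_dict: Optional[Dict[str, Any]] = None,
        **kwargs: Any,
    ) -> Tuple[int, List[Any], Dict[str, Any]]:
        # If an input dict is given, it takes precedence over obs/state.
        if input_dict is None:
            input_dict = {"obs": obs}
            for i, s in enumerate(state or []):
                input_dict[f"state_in_{i}"] = s
        # Add a batch dimension of size 1 to every entry.
        batch = {key: [value] for key, value in input_dict.items()}
        actions, state_outs, extras = self._batched_forward(batch)
        # Remove the batch dimension again before returning.
        return (
            actions[0],
            [s[0] for s in state_outs],
            {key: value[0] for key, value in extras.items()},
        )


policy = ToyPolicy()
# Style 1: pass the single observation directly.
action, state_out, extra = policy.compute_single_action(obs=[2, -1])
# Style 2: pass an input dict holding the same unbatched value.
action2, _, _ = policy.compute_single_action(input_dict={"obs": [2, -1]})
```

Both calls go through the same batched path, so they yield the same action for the same observation.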

Parameters:
  • obs – Single observation.

  • state – List of RNN state inputs, if any.

  • prev_action – Previous action value, if any.

  • prev_reward – Previous reward, if any.

  • info – Info object, if any.

  • input_dict – A SampleBatch or input dict containing the single (unbatched) Tensors to compute actions. If given, it is used instead of obs, state, prev_action, prev_reward, and info.

  • episode – This provides access to all of the internal episode state, which may be useful for model-based or multi-agent algorithms.

  • explore – Whether to pick an exploitation or exploration action (default: None -> use self.config["explore"]).

  • timestep – The current (sampling) time step.

Keyword Arguments:

kwargs – Forward compatibility placeholder.

Returns:

Tuple consisting of the action, the list of RNN state outputs (if any), and a dictionary of extra features (if any).
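As a rough illustration of how the three returned values are typically consumed, the sketch below unpacks the tuple in a loop and feeds the returned state list back in as the state argument on the next step. ToyRecurrentPolicy is a hypothetical stand-in, not an RLlib class; its "RNN state" is just a running total of the observations, its get_initial_state method is part of the toy, and the "running_total" extra-feature key is invented for this example.

```python
from typing import Any, Dict, List, Optional, Tuple


class ToyRecurrentPolicy:
    """Hypothetical stand-in with a one-element state list; not RLlib code."""

    def get_initial_state(self) -> List[int]:
        # Toy initial state: a running total starting at zero.
        return [0]

    def compute_single_action(
        self, obs: int, state: Optional[List[int]] = None, **kwargs: Any
    ) -> Tuple[int, List[int], Dict[str, int]]:
        total = (state[0] if state else 0) + obs
        action = 1 if total > 0 else 0
        # Return (action, list of state outputs, dict of extra features).
        return action, [total], {"running_total": total}


policy = ToyRecurrentPolicy()
state = policy.get_initial_state()
actions = []
for obs in [1, -3, 5]:
    # The returned state list becomes the state input of the next call.
    action, state, extra = policy.compute_single_action(obs, state=state)
    actions.append(action)
```

For a policy without recurrent state, the second element of the tuple would simply be an empty list and can be ignored.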