Policy.compute_single_action(obs: Optional[Union[numpy.array, jnp.ndarray, tf.Tensor, torch.Tensor, dict, tuple]] = None, state: Optional[List[Union[numpy.array, jnp.ndarray, tf.Tensor, torch.Tensor]]] = None, *, prev_action: Optional[Union[numpy.array, jnp.ndarray, tf.Tensor, torch.Tensor, dict, tuple]] = None, prev_reward: Optional[Union[numpy.array, jnp.ndarray, tf.Tensor, torch.Tensor, dict, tuple]] = None, info: dict = None, input_dict: Optional[ray.rllib.policy.sample_batch.SampleBatch] = None, episode: Optional[Episode] = None, explore: Optional[bool] = None, timestep: Optional[int] = None, **kwargs) → Tuple[Union[numpy.array, jnp.ndarray, tf.Tensor, torch.Tensor, dict, tuple], List[Union[numpy.array, jnp.ndarray, tf.Tensor, torch.Tensor]], Dict[str, Union[numpy.array, jnp.ndarray, tf.Tensor, torch.Tensor]]]

Computes and returns a single (B=1) action value.

Takes an input dict (usually a SampleBatch) as its main data input. This makes the method usable when a more complex input pattern (view requirements) is needed, for example when the Model requires the last n observations, the last m actions/rewards, or a combination of these. Alternatively, when no complex inputs are required, it takes a single obs value (and possibly single state values, prev-action/reward values, etc.).
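The "single (B=1)" idea can be illustrated without RLlib: wrap one unbatched observation into a batch of size one, call a batched function, then strip the batch dimension from each output. The sketch below is only an illustration under that assumption; `single_from_batched` and `toy_batched_policy` are hypothetical names, and this is not RLlib's implementation.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# Hypothetical batched signature (not RLlib code):
# (obs_batch, state_batches) -> (action_batch, state_out_batches, extra_batches)
BatchedFn = Callable[
    [List[Any], List[List[Any]]],
    Tuple[List[Any], List[List[Any]], Dict[str, List[Any]]],
]


def single_from_batched(
    batched_fn: BatchedFn,
    obs: Any,
    state: Optional[List[Any]] = None,
) -> Tuple[Any, List[Any], Dict[str, Any]]:
    """Wrap one observation into a batch of size 1, call the batched
    function, then strip the batch dimension from every output."""
    obs_batch = [obs]                             # add the B=1 dimension
    state_batches = [[s] for s in (state or [])]  # one batch per state tensor
    actions, state_outs, extras = batched_fn(obs_batch, state_batches)
    return (
        actions[0],
        [s[0] for s in state_outs],
        {k: v[0] for k, v in extras.items()},
    )


def toy_batched_policy(obs_batch, state_batches):
    """Toy stand-in: action is 1 if the observation sum is positive, else 0.
    The single state tensor is a step counter."""
    actions = [1 if sum(o) > 0 else 0 for o in obs_batch]
    counters = state_batches[0] if state_batches else [0] * len(obs_batch)
    state_outs = [[c + 1 for c in counters]]
    extras = {"obs_sum": [sum(o) for o in obs_batch]}
    return actions, state_outs, extras


action, state_out, extra = single_from_batched(
    toy_batched_policy, obs=[2, -1], state=[3]
)
print(action, state_out, extra)
```

The caller passes and receives unbatched values only; the batch dimension exists solely inside the wrapper.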

Parameters

  • obs – Single observation.

  • state – List of RNN state inputs, if any.

  • prev_action – Previous action value, if any.

  • prev_reward – Previous reward, if any.

  • info – Info object, if any.

  • input_dict – A SampleBatch or input dict containing the single (unbatched) Tensors to compute actions. If given, it is used instead of obs, state, prev_action, prev_reward, and info.

  • episode – This provides access to all of the internal episode state, which may be useful for model-based or multi-agent algorithms.

  • explore – Whether to pick an exploitation or exploration action (default: None -> use self.config["explore"]).

  • timestep – The current (sampling) time step.

Keyword Arguments

kwargs – Forward compatibility placeholder.


Returns

Tuple consisting of the action, the list of RNN state outputs (if any), and a dictionary of extra features (if any).
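A typical way to consume the returned 3-tuple is to unpack it on every step and feed the RNN state outputs back in as `state` on the next call. The sketch below shows that pattern with a hypothetical `StubPolicy` that only mimics the call shape described above; it is not an RLlib Policy, and its action logic is invented for illustration.

```python
from typing import Any, Dict, List, Optional, Tuple


class StubPolicy:
    """Hypothetical stand-in with the same call shape as
    compute_single_action. It is not an RLlib Policy."""

    def get_initial_state(self) -> List[int]:
        # A single "RNN state" entry, used here as a step counter.
        return [0]

    def compute_single_action(
        self,
        obs: Any = None,
        state: Optional[List[int]] = None,
        *,
        explore: Optional[bool] = None,
        **kwargs,
    ) -> Tuple[int, List[int], Dict[str, Any]]:
        steps = (state or [0])[0]
        action = steps % 2  # alternate between actions 0 and 1
        return action, [steps + 1], {"explore": bool(explore)}


policy = StubPolicy()
state = policy.get_initial_state()
actions = []
for obs in ([0.0], [0.1], [0.2]):
    # Unpack (action, state_out, extra) and thread state_out into the next call.
    action, state, extra = policy.compute_single_action(obs, state, explore=False)
    actions.append(action)

print(actions, state, extra)
```

For a stateless policy the state list would simply be empty, and the second tuple element can be ignored.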