ray.rllib.policy.policy.Policy.compute_single_action
- Policy.compute_single_action(obs: numpy.ndarray | jnp.ndarray | tf.Tensor | torch.Tensor | dict | tuple | None = None, state: List[numpy.ndarray | jnp.ndarray | tf.Tensor | torch.Tensor] | None = None, *, prev_action: numpy.ndarray | jnp.ndarray | tf.Tensor | torch.Tensor | dict | tuple | None = None, prev_reward: numpy.ndarray | jnp.ndarray | tf.Tensor | torch.Tensor | dict | tuple | None = None, info: dict | None = None, input_dict: SampleBatch | None = None, episode: Episode | None = None, explore: bool | None = None, timestep: int | None = None, **kwargs) → Tuple[numpy.ndarray | jnp.ndarray | tf.Tensor | torch.Tensor | dict | tuple, List[numpy.ndarray | jnp.ndarray | tf.Tensor | torch.Tensor], Dict[str, numpy.ndarray | jnp.ndarray | tf.Tensor | torch.Tensor]]
Computes and returns a single (B=1) action value.
Takes an input dict (usually a SampleBatch) as its main data input. This allows the method to be used with more complex input patterns (view requirements), for example when the Model requires the last n observations, the last m actions/rewards, or a combination of any of these. Alternatively, in case no complex inputs are required, takes a single obs value (and possibly a single state value, prev-action/reward values, etc.). Usage sketches for both call patterns follow the Returns section below.
- Parameters:
obs – Single observation.
state – List of RNN state inputs, if any.
prev_action – Previous action value, if any.
prev_reward – Previous reward, if any.
info – Info object, if any.
input_dict – A SampleBatch or input dict containing the single (unbatched) Tensors from which to compute actions. If given, it'll be used instead of obs, state, prev_action|reward, and info.
episode – This provides access to all of the internal episode state, which may be useful for model-based or multi-agent algorithms.
explore – Whether to pick an exploitation or exploration action (default: None -> use self.config["explore"]).
timestep – The current (sampling) time step.
- Keyword Arguments:
kwargs – Forward compatibility placeholder.
- Returns:
Tuple consisting of the action, the list of RNN state outputs (if any), and a dictionary of extra features (if any).
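As a first illustration, here is a minimal sketch of the simple call pattern, passing a single unbatched observation directly. The PPO algorithm, the CartPole-v1 environment, and the surrounding setup are arbitrary choices for this example, not part of this API; it also assumes the classic Policy-based API stack, where Algorithm.get_policy() returns the default Policy.

```python
import gymnasium as gym
from ray.rllib.algorithms.ppo import PPOConfig

# Build an Algorithm and grab its default Policy (assumes the classic,
# Policy-based API stack). PPO and CartPole-v1 are arbitrary choices here.
algo = PPOConfig().environment("CartPole-v1").build()
policy = algo.get_policy()

env = gym.make("CartPole-v1")
obs, _ = env.reset()

# Compute one (B=1) action from a single, unbatched observation.
# explore=False forces a greedy (exploitation) action.
action, state_outs, extra_fetches = policy.compute_single_action(
    obs=obs,
    explore=False,
)
obs, reward, terminated, truncated, _ = env.step(action)
```

For a stateless model, state_outs is an empty list; extra_fetches may carry items such as action logits or value-function outputs, depending on the Policy.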
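And a sketch of the input_dict call pattern, passing single (unbatched) tensors keyed by standard SampleBatch column names instead of separate obs/state/prev_action arguments. It reuses policy, obs, and action from the sketch above; whether prev-actions/rewards are actually consumed depends on the model's view requirements.

```python
import numpy as np
from ray.rllib.policy.sample_batch import SampleBatch

# Single (unbatched) values, keyed by standard SampleBatch columns.
input_dict = {
    SampleBatch.OBS: obs,
    SampleBatch.PREV_ACTIONS: np.asarray(action),
    SampleBatch.PREV_REWARDS: np.asarray(0.0),
}

# When input_dict is given, it takes precedence over the individual
# obs/state/prev_action/prev_reward/info arguments.
action, state_outs, extra_fetches = policy.compute_single_action(
    input_dict=input_dict,
    explore=False,
)
```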