ray.rllib.policy.policy.Policy.compute_single_action
- Policy.compute_single_action(obs: Optional[Union[numpy.array, tf.Tensor, torch.Tensor, dict, tuple]] = None, state: Optional[List[Union[numpy.array, tf.Tensor, torch.Tensor]]] = None, *, prev_action: Optional[Union[numpy.array, tf.Tensor, torch.Tensor, dict, tuple]] = None, prev_reward: Optional[Union[numpy.array, tf.Tensor, torch.Tensor, dict, tuple]] = None, info: dict = None, input_dict: Optional[ray.rllib.policy.sample_batch.SampleBatch] = None, episode: Optional[Episode] = None, explore: Optional[bool] = None, timestep: Optional[int] = None, **kwargs) → Tuple[Union[numpy.array, tf.Tensor, torch.Tensor, dict, tuple], List[Union[numpy.array, tf.Tensor, torch.Tensor]], Dict[str, Union[numpy.array, tf.Tensor, torch.Tensor]]]
Computes and returns a single (B=1) action value.
Takes an input dict (usually a SampleBatch) as its main data input. This allows for using this method in case a more complex input pattern (view requirements) is needed, for example when the Model requires the last n observations, the last m actions/rewards, or a combination of any of these. Alternatively, in case no complex inputs are required, takes a single obs value (and possibly single state values, prev-action/reward values, etc.).
- Parameters
obs – Single observation.
state – List of RNN state inputs, if any.
prev_action – Previous action value, if any.
prev_reward – Previous reward, if any.
info – Info object, if any.
input_dict – A SampleBatch or input dict containing the single (unbatched) Tensors with which to compute actions. If given, it'll be used instead of obs, state, prev_action|reward, and info.
episode – This provides access to all of the internal episode state, which may be useful for model-based or multi-agent algorithms.
explore – Whether to pick an exploitation or exploration action (default: None -> use self.config["explore"]).
timestep – The current (sampling) time step.
- Keyword Arguments
kwargs – Forward compatibility placeholder.
- Returns
Tuple consisting of the action, the list of RNN state outputs (if any), and a dictionary of extra features (if any).
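For illustration, a minimal usage sketch of both call patterns. The setup (a PPO Algorithm built on CartPole-v1, and the names algo, policy, and obs) is assumed for the example and is not part of this API; compute_single_action and SampleBatch are the RLlib objects documented here.

```python
import numpy as np

from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.policy.sample_batch import SampleBatch

# Illustrative setup: build a PPO Algorithm on CartPole-v1 and grab its
# default (single-agent) policy.
algo = PPOConfig().environment("CartPole-v1").build()
policy = algo.get_policy()

# Simple pattern: pass a single (unbatched) observation directly.
obs = np.array([0.0, 0.1, 0.0, -0.1], dtype=np.float32)
action, state_outs, extra_fetches = policy.compute_single_action(obs=obs)

# input_dict pattern: pass the single (unbatched) tensors in a SampleBatch
# instead of obs/state/prev_action|reward/info. This is the route to take
# when the model's view requirements need more than the current observation.
input_dict = SampleBatch({SampleBatch.OBS: obs})
action, state_outs, extra_fetches = policy.compute_single_action(
    input_dict=input_dict,
    explore=False,  # force exploitation instead of self.config["explore"]
)
```

In both calls, the returned tuple unpacks as documented above: the action, the list of RNN state outputs (empty for a stateless model), and a dict of extra fetches.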