ray.rllib.policy.policy.Policy.compute_single_action

Computes and returns a single (B=1) action value.

Takes an input dict (usually a SampleBatch) as its main data input. This makes the method usable when a more complex input pattern (view requirements) is needed, for example when the Model requires the last n observations, the last m actions/rewards, or a combination of these. Alternatively, if no complex inputs are required, it takes a single obs value (and possibly single state values, prev-action/reward values, etc.).
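The two calling patterns can be sketched as follows. This is a minimal illustration, not RLlib's implementation: build_input_dict is a hypothetical helper, and the key names ("obs", "state_in_0", "prev_actions", "prev_rewards", "infos", "prev_n_obs") are assumptions modeled on typical SampleBatch columns. It shows only that a supplied input dict takes precedence over the individual arguments, and that otherwise a dict is assembled from the single, unbatched values.

```python
from typing import Any, Dict, List, Optional


def build_input_dict(
    obs: Any = None,
    state: Optional[List[Any]] = None,
    prev_action: Any = None,
    prev_reward: Any = None,
    info: Optional[Dict[str, Any]] = None,
    input_dict: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: merge the two calling patterns into one dict."""
    # Pattern 1: a ready-made input dict wins over all individual arguments.
    if input_dict is not None:
        return dict(input_dict)

    # Pattern 2: assemble the dict from single (unbatched) values.
    if obs is None:
        raise ValueError("Either `obs` or `input_dict` must be provided.")
    result: Dict[str, Any] = {"obs": obs}
    # Key names below are assumptions for this sketch.
    for i, s in enumerate(state or []):
        result[f"state_in_{i}"] = s
    if prev_action is not None:
        result["prev_actions"] = prev_action
    if prev_reward is not None:
        result["prev_rewards"] = prev_reward
    if info is not None:
        result["infos"] = info
    return result


# Simple case: only a single observation and the previous reward.
simple = build_input_dict(obs=[0.1, 0.2], prev_reward=1.0)

# Complex case: the input dict is used; the separate `obs` is ignored.
complex_case = build_input_dict(
    obs=[9.9],
    input_dict={"obs": [0.1, 0.2], "prev_n_obs": [[0.0, 0.0], [0.1, 0.1]]},
)
```

In the complex case the caller is responsible for putting every column the Model needs into the dict, since none of the individual arguments are consulted.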

Parameters:

obs – Single observation.
state – List of RNN state inputs, if any.
prev_action – Previous action value, if any.
prev_reward – Previous reward, if any.
info – Info object, if any.
input_dict – A SampleBatch or input dict containing the single (unbatched) Tensors from which to compute the action. If given, it is used instead of obs, state, prev_action, prev_reward, and info.
episode – This provides access to all of the internal episode state, which may be useful for model-based or multi-agent algorithms.
explore – Whether to pick an exploitation or exploration action (default: None -> use self.config["explore"]).
timestep – The current (sampling) time step.

Keyword Arguments:

kwargs – Forward compatibility placeholder.

Returns:

Tuple consisting of the action, the list of RNN state outputs (if any), and a dictionary of extra features (if any).
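To make the "single (B=1)" wording and the returned tuple concrete, here is a minimal sketch of how a single-action call can wrap a batched one: add a batch dimension of size 1, run the batched computation, then strip that dimension from every output. ToyPolicy, its compute_actions method, the argmax rule, and the "explore_used" key are all hypothetical stand-ins using plain Python lists; they are not RLlib's classes or its actual code path.

```python
from typing import Any, Dict, List, Optional, Tuple


class ToyPolicy:
    """Hypothetical stand-in for a policy; not RLlib's Policy class."""

    def __init__(self, explore: bool = True) -> None:
        self.config = {"explore": explore}

    def compute_actions(
        self,
        obs_batch: List[List[float]],
        state_batches: Optional[List[List[int]]] = None,
        explore: Optional[bool] = None,
    ) -> Tuple[List[int], List[List[int]], Dict[str, List[Any]]]:
        # Fall back to the configured default when `explore` is None.
        if explore is None:
            explore = self.config["explore"]
        # Toy rule: the action is the index of the largest observation entry.
        actions = [max(range(len(o)), key=o.__getitem__) for o in obs_batch]
        # Toy "RNN": every state value is incremented by one.
        state_out = [[s + 1 for s in batch] for batch in (state_batches or [])]
        extra = {"explore_used": [explore] * len(obs_batch)}
        return actions, state_out, extra

    def compute_single_action(
        self,
        obs: List[float],
        state: Optional[List[int]] = None,
        explore: Optional[bool] = None,
    ) -> Tuple[int, List[int], Dict[str, Any]]:
        # Add a batch dimension of size 1 to the observation and each state.
        state_batches = [[s] for s in (state or [])]
        actions, state_out, extra = self.compute_actions(
            [obs], state_batches, explore=explore
        )
        # Strip the batch dimension again from all three outputs.
        return (
            actions[0],
            [s[0] for s in state_out],
            {k: v[0] for k, v in extra.items()},
        )


policy = ToyPolicy(explore=False)
action, state_out, extra = policy.compute_single_action([0.1, 0.9, 0.3], state=[0])
```

The caller unpacks three values even when the policy has no RNN state or extra features; in this sketch those come back as an empty list and a dict of per-call values, respectively.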