ray.rllib.policy.policy.Policy.compute_log_likelihoods

Policy.compute_log_likelihoods(actions: List[numpy.array | jnp.ndarray | tf.Tensor | torch.Tensor] | numpy.array | jnp.ndarray | tf.Tensor | torch.Tensor, obs_batch: List[numpy.array | jnp.ndarray | tf.Tensor | torch.Tensor] | numpy.array | jnp.ndarray | tf.Tensor | torch.Tensor, state_batches: List[numpy.array | jnp.ndarray | tf.Tensor | torch.Tensor] | None = None, prev_action_batch: List[numpy.array | jnp.ndarray | tf.Tensor | torch.Tensor] | numpy.array | jnp.ndarray | tf.Tensor | torch.Tensor | None = None, prev_reward_batch: List[numpy.array | jnp.ndarray | tf.Tensor | torch.Tensor] | numpy.array | jnp.ndarray | tf.Tensor | torch.Tensor | None = None, actions_normalized: bool = True, in_training: bool = True) → numpy.array | jnp.ndarray | tf.Tensor | torch.Tensor

Computes the log-prob/likelihood for a given action and observation.

The log-likelihood is calculated using this Policy’s action distribution class (self.dist_class).

Parameters:
  • actions – Batch of actions for which to retrieve the log-probs/likelihoods (given all other inputs: obs, states, ...).

  • obs_batch – Batch of observations.

  • state_batches – List of RNN state input batches, if any.

  • prev_action_batch – Batch of previous action values.

  • prev_reward_batch – Batch of previous rewards.

  • actions_normalized – Whether the given actions are already normalized (between -1.0 and 1.0). If they are not and normalize_actions=True, the given actions are normalized first before the log likelihoods are computed.

  • in_training – Whether to use the forward_train() or the forward_exploration() method of the underlying RLModule.

Returns:

Batch of log probs/likelihoods, with shape: [BATCH_SIZE].

Return type:

numpy.array | jnp.ndarray | tf.Tensor | torch.Tensor
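
Example (a minimal sketch, not taken from the RLlib docs): computing the log-likelihoods of a small batch of actions under a freshly built PPO policy on CartPole-v1. The PPO/CartPole setup and the random observation batch are illustrative assumptions, and the exact config/build calls may vary between RLlib versions.

    import numpy as np
    from ray.rllib.algorithms.ppo import PPOConfig

    # Build a PPO algorithm on CartPole-v1 and grab its Policy object.
    # (Illustrative setup; config/build calls may differ between RLlib versions.)
    algo = PPOConfig().environment("CartPole-v1").build()
    policy = algo.get_policy()

    # Batch of 3 observations (CartPole observations are 4-dimensional) and the
    # discrete actions whose log-likelihoods we want under the current policy.
    obs_batch = np.random.uniform(-0.05, 0.05, size=(3, 4)).astype(np.float32)
    actions = np.array([0, 1, 1])

    log_likelihoods = policy.compute_log_likelihoods(
        actions=actions,
        obs_batch=obs_batch,
    )
    print(log_likelihoods)  # Shape [BATCH_SIZE], i.e. (3,).

    algo.stop()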