ray.rllib.env.single_agent_episode.SingleAgentEpisode.get_rewards

SingleAgentEpisode.get_rewards(indices: int | slice | List[int] | None = None, *, neg_index_as_lookback: bool = False, fill: float | None = None) -> Any

Returns individual rewards or batched ranges thereof from this episode.

Parameters:
  • indices – A single int returns the individual reward stored at that index. A list of ints returns a batch of rewards of size len(indices). A slice object returns the corresponding range of rewards. Negative indices are by default interpreted as "from the end", unless neg_index_as_lookback=True is used, in which case they are interpreted as "before ts=0", meaning they reach back into the lookback buffer. If None, all rewards are returned (from ts=0 to the end).

  • neg_index_as_lookback – If True, negative values in indices are interpreted as "before ts=0", meaning they reach back into the lookback buffer. For example, an episode with rewards [4, 5, 6, 7, 8, 9], where [4, 5, 6] is the lookback buffer (the ts=0 item is 7), responds to get_rewards(-1, neg_index_as_lookback=True) with 6 and to get_rewards(slice(-2, 1), neg_index_as_lookback=True) with [5, 6, 7].

  • fill – An optional float value used to pad the returned results at the boundaries. Padding only happens if the requested index range's start/stop exceeds the episode's boundaries (including the lookback buffer on the left side). This is handy for users who don't want to worry about hitting those boundaries and simply want zero-padding. For example, an episode with rewards [10, 11, 12, 13, 14] and a lookback buffer of size 2 (meaning rewards 10 and 11 are part of the lookback buffer) responds to get_rewards(slice(-7, -2), fill=0.0) with [0.0, 0.0, 10, 11, 12].

Examples:

from ray.rllib.env.single_agent_episode import SingleAgentEpisode

episode = SingleAgentEpisode(
    rewards=[1.0, 2.0, 3.0],
    observations=[0, 1, 2, 3], actions=[1, 2, 3],  # <- not relevant here
    len_lookback_buffer=0,  # no lookback; all data is actually "in" episode
)
# Plain usage (`indices` arg only).
episode.get_rewards(-1)  # 3.0
episode.get_rewards(0)  # 1.0
episode.get_rewards([0, 2])  # [1.0, 3.0]
episode.get_rewards([-1, 0])  # [3.0, 1.0]
episode.get_rewards(slice(None, 2))  # [1.0, 2.0]
episode.get_rewards(slice(-2, None))  # [2.0, 3.0]
# Using `fill=...` (requesting slices beyond the boundaries).
episode.get_rewards(slice(-5, -2), fill=0.0)  # [0.0, 0.0, 1.0]
episode.get_rewards(slice(1, 5), fill=0.0)  # [2.0, 3.0, 0.0, 0.0]
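
The example above uses len_lookback_buffer=0, so the lookback semantics never come into play. The following is an illustrative sketch (not RLlib's actual implementation; resolve_reward and its parameters are hypothetical names) of how a single int index resolves against the episode data plus its lookback buffer under the two interpretations of negative indices:

```python
# Illustrative sketch only -- mimics the documented index resolution of
# get_rewards() for a single int index; not RLlib's actual code.
from typing import List, Optional


def resolve_reward(
    rewards: List[float],   # lookback buffer + episode rewards, concatenated
    lookback: int,          # number of leading items that are lookback data
    index: int,
    neg_index_as_lookback: bool = False,
    fill: Optional[float] = None,
) -> float:
    if index >= 0 or neg_index_as_lookback:
        # Non-negative indices count from ts=0 (the first item after the
        # lookback buffer); with `neg_index_as_lookback=True`, negative
        # indices step back from ts=0 into the lookback buffer.
        pos = lookback + index
    else:
        # Default: negative indices count from the end of the episode.
        pos = len(rewards) + index
    if 0 <= pos < len(rewards):
        return rewards[pos]
    if fill is not None:
        return fill  # out-of-bounds request padded with `fill`
    raise IndexError(index)


# Mirrors the documented behavior for rewards [4, 5, 6, 7, 8, 9]
# with lookback buffer [4, 5, 6] (the ts=0 item is 7):
data, lb = [4.0, 5.0, 6.0, 7.0, 8.0, 9.0], 3
resolve_reward(data, lb, -1, neg_index_as_lookback=True)  # 6.0
resolve_reward(data, lb, 0)                               # 7.0
resolve_reward(data, lb, -1)                              # 9.0 (from the end)
resolve_reward(data, lb, -7, fill=0.0)                    # 0.0 (padded)
```

Slice arguments follow the same per-index resolution, with Python's usual exclusive-stop semantics determining the number of returned items.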
Returns:

The collected rewards. Returned as a batch (with a 0-axis) if indices is a list (even of length one) or a slice object; returned as a single item (no additional 0-axis) if indices is a single int.