ray.rllib.core.learner.learner.Learner.update_from_episodes

Learner.update_from_episodes(episodes: List[SingleAgentEpisode | MultiAgentEpisode], *, timesteps: Dict[str, Any] | None = None, num_epochs: int = 1, minibatch_size: int | None = None, shuffle_batch_per_epoch: bool = False, num_total_minibatches: int = 0, num_iters=-1) -> Dict

Run num_epochs epochs over the train batch generated from episodes.

You can use this method to perform more than one backward pass on the batch. The same minibatch_size and num_epochs are used for all module IDs in MultiRLModule.
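The following is a minimal usage sketch, not taken verbatim from the RLlib docs: it assumes learner is an already-built Learner instance and episodes is a list of SingleAgentEpisode objects collected elsewhere (for example, by an EnvRunner); the lifetime timestep count is a made-up placeholder.

    from ray.rllib.utils.metrics import NUM_ENV_STEPS_SAMPLED_LIFETIME

    # Assumption: `learner` and `episodes` were built/collected beforehand.
    results = learner.update_from_episodes(
        episodes,
        # Hypothetical lifetime env-step count; in practice this value comes
        # from the Algorithm's own metrics.
        timesteps={NUM_ENV_STEPS_SAMPLED_LIFETIME: 100_000},
        num_epochs=2,
        minibatch_size=128,
        shuffle_batch_per_epoch=True,
    )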

Parameters:
  • episodes – A list of episode objects to update from.

  • timesteps – Timesteps dict, which must contain the key NUM_ENV_STEPS_SAMPLED_LIFETIME.

  • num_epochs – The number of complete passes over the entire train batch. Each pass might be further split into n minibatches (if minibatch_size is provided). The train batch is generated from the given episodes through the Learner connector pipeline.

  • minibatch_size – The size of the minibatches into which the train batch is further split. The batch is then iterated over n times, where n is len(batch) // minibatch_size. The train batch is generated from the given episodes through the Learner connector pipeline.

  • shuffle_batch_per_epoch – Whether to shuffle the train batch once per epoch. If the train batch has a time rank (axis=1), shuffling only takes place along the batch axis so that intact (episode) trajectories aren't disturbed. Also, shuffling is always skipped if minibatch_size is None, meaning the entire train batch is processed each epoch, making shuffling unnecessary. The train batch is generated from the given episodes through the Learner connector pipeline.

  • num_total_minibatches – The total number of minibatches to loop through (across all num_epochs epochs). Setting this to a value other than 0 is only required in multi-agent + multi-GPU situations, in which the MultiAgentEpisodes themselves are sharded roughly equally across Learners but might contain SingleAgentEpisodes with very lopsided length distributions. Without this fixed, pre-computed value, one Learner might go through a different number of minibatch passes than the others, causing a deadlock. See the sketch after this parameter list for how this value relates to num_epochs and minibatch_size.
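To make the interplay of num_epochs, minibatch_size, and num_total_minibatches concrete, here is a small illustrative sketch; the numbers are made up and this is not the actual Learner loop:

    # Illustrative arithmetic only; not RLlib code.
    train_batch_len = 4000    # size of the train batch built from the episodes
    minibatch_size = 500      # as passed to update_from_episodes()
    num_epochs = 4

    minibatches_per_epoch = train_batch_len // minibatch_size      # 8
    total_minibatch_passes = num_epochs * minibatches_per_epoch    # 32

    # In multi-agent + multi-GPU setups, pre-computing a single value like
    # `total_minibatch_passes` once and passing it to every Learner as
    # `num_total_minibatches` keeps all Learners running the same number of
    # minibatch passes, avoiding the deadlock described above.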

Returns:

A ResultDict object produced by a call to self.metrics.reduce(). The returned dict may be arbitrarily nested and must have Stats objects at all its leaves, allowing components further downstream (for example, a user of this Learner) to reduce these results further (for instance, over n parallel Learners).
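As a sketch of what "arbitrarily nested with Stats objects at the leaves" means in practice, the following assumes results is the dict returned by the call shown earlier and simply walks it; it doesn't rely on any particular Stats API:

    # Walk the nested result dict and yield (path, leaf) pairs.
    def iter_leaf_stats(results, prefix=()):
        for key, value in results.items():
            if isinstance(value, dict):
                # Nested sub-dict: recurse.
                yield from iter_leaf_stats(value, prefix + (key,))
            else:
                # Leaf: a Stats object that downstream code may reduce further.
                yield prefix + (key,), value

    for path, stats in iter_leaf_stats(results):
        print("/".join(map(str, path)), stats)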