ray.rllib.core.learner.learner.Learner.update_from_episodes
- Learner.update_from_episodes(episodes: List[SingleAgentEpisode | MultiAgentEpisode], *, timesteps: Dict[str, Any] | None = None, num_epochs: int = 1, minibatch_size: int | None = None, shuffle_batch_per_epoch: bool = False, num_total_minibatches: int = 0, num_iters=-1) → Dict [source]
Run num_epochs epochs over the train batch generated from episodes.

You can use this method to take more than one backward pass on the batch. The same minibatch_size and num_epochs are used for all module IDs in the MultiRLModule.

- Parameters:
  - episodes – A list of episode objects to update from.
  - timesteps – Timesteps dict, which must have the key NUM_ENV_STEPS_SAMPLED_LIFETIME. # TODO (sven): Make this a more formal structure with its own type.
  - num_epochs – The number of complete passes over the entire train batch. Each pass might be further split into n minibatches (if minibatch_size is provided). The train batch is generated from the given episodes through the Learner connector pipeline.
  - minibatch_size – The size of minibatches to use to further split the train batch into sub-batches. The batch is then iterated over n times, where n is len(batch) // minibatch_size. The train batch is generated from the given episodes through the Learner connector pipeline.
  - shuffle_batch_per_epoch – Whether to shuffle the train batch once per epoch. If the train batch has a time rank (axis=1), shuffling only takes place along the batch axis so as not to disturb any intact (episode) trajectories. Shuffling is always skipped if minibatch_size is None, meaning the entire train batch is processed each epoch, making it unnecessary to shuffle.
  - num_total_minibatches – The total number of minibatches to loop through (over all num_epochs epochs). Setting this to a value != 0 is only required in multi-agent + multi-GPU situations, in which the MultiAgentEpisodes themselves are sharded roughly equally, but might contain SingleAgentEpisodes with very lopsided length distributions. Without this fixed, pre-computed value, one Learner might go through a different number of minibatch passes than others, causing a deadlock. See the sketch after this parameter list for how these arguments interact.
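The interaction of num_epochs, minibatch_size, and num_total_minibatches can be pictured with a minimal Python sketch. This is not RLlib's internal implementation; the list-slicing batch and the helper name iterate_minibatches are illustrative assumptions only.

```python
# Illustrative sketch only (not RLlib's actual code): how num_epochs,
# minibatch_size, and num_total_minibatches bound the minibatch loop.
def iterate_minibatches(batch, num_epochs, minibatch_size=None, num_total_minibatches=0):
    seen = 0
    for _ in range(num_epochs):
        if minibatch_size is None:
            # No splitting: one pass over the entire train batch per epoch.
            sub_batches = [batch]
        else:
            # n = len(batch) // minibatch_size sub-batches per epoch.
            sub_batches = [
                batch[i:i + minibatch_size]
                for i in range(0, len(batch) - minibatch_size + 1, minibatch_size)
            ]
        for sub_batch in sub_batches:
            yield sub_batch
            seen += 1
            # With a fixed total (multi-agent + multi-GPU case), every Learner
            # performs exactly the same number of minibatch passes.
            if num_total_minibatches > 0 and seen >= num_total_minibatches:
                return


# Example: 8 samples, 2 epochs, minibatch_size=4 -> 4 minibatch passes in total.
passes = list(iterate_minibatches(list(range(8)), num_epochs=2, minibatch_size=4))
assert len(passes) == 4
```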
- Returns:
  A ResultDict object produced by a call to self.metrics.reduce(). The returned dict may be arbitrarily nested and must have Stats objects at all its leaves, allowing components further downstream (i.e., a user of this Learner) to further reduce these results (for example over n parallel Learners).
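As a usage illustration, here is a hedged sketch of calling this method. It assumes a learner already built from an AlgorithmConfig and a list of already-collected episodes; both names are placeholders, not objects created by this snippet, and the metrics constant import path reflects recent RLlib versions.

```python
# Hypothetical usage sketch: `learner` and `episodes` are assumed to exist
# already (e.g. a Learner built from an AlgorithmConfig and episodes collected
# by env sampling); they are placeholders, not created here.
from ray.rllib.utils.metrics import NUM_ENV_STEPS_SAMPLED_LIFETIME

results = learner.update_from_episodes(
    episodes=episodes,
    timesteps={NUM_ENV_STEPS_SAMPLED_LIFETIME: 100_000},
    num_epochs=4,            # four passes over the train batch
    minibatch_size=128,      # each pass split into len(batch) // 128 sub-batches
    shuffle_batch_per_epoch=True,
)
# `results` is the reduced metrics dict: possibly nested, with Stats objects
# at its leaves, ready for further reduction across parallel Learners.
print(results)
```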