ray.rllib.core.learner.learner.Learner.update#

Learner.update(batch: MultiAgentBatch | None = None, batches: List[MultiAgentBatch] | None = None, batch_refs: List[ray._raylet.ObjectRef] | None = None, episodes: List[SingleAgentEpisode | MultiAgentEpisode] | None = None, episodes_refs: List[ray._raylet.ObjectRef] | None = None, data_iterators: List[DataIterator] | None = None, training_data: TrainingData | None = None, *, timesteps: Dict[str, Any] | None = None, num_total_minibatches: int = 0, num_epochs: int = 1, minibatch_size: int | None = None, shuffle_batch_per_epoch: bool = False, _no_metrics_reduce: bool = False, **kwargs) → Dict[source]#

Run num_epochs epochs over the given train batch.

You can use this method to perform more than one backward pass on the batch. The same minibatch_size and num_epochs are used for all module IDs in the MultiRLModule.
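A minimal usage sketch, assuming learner is an already constructed Learner and train_batch is a MultiAgentBatch of training data; both names and all concrete values are placeholders, not part of this API.

    # Hedged usage sketch: `learner` and `train_batch` are assumed to exist.
    results = learner.update(
        batch=train_batch,
        num_epochs=4,                  # four complete passes over the train batch
        minibatch_size=256,            # each pass runs len(batch) // 256 minibatch updates
        shuffle_batch_per_epoch=True,  # reshuffle along the batch axis before each epoch
    )
    # `results` is the (possibly nested) dict of Stats objects described
    # under "Returns" below.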

Parameters:
  • batch – A batch of training data to update from.

  • timesteps – Timesteps dict, which must contain the key NUM_ENV_STEPS_SAMPLED_LIFETIME (see the sketch after this parameter list). # TODO (sven): Make this a more formal structure with its own type.

  • num_epochs – The number of complete passes over the entire train batch. Each pass might be further split into n minibatches (if minibatch_size is provided).

  • minibatch_size – The size of the minibatches into which the train batch is further split. The batch is then iterated over n times per epoch, where n is len(batch) // minibatch_size.

  • shuffle_batch_per_epoch – Whether to shuffle the train batch once per epoch. If the train batch has a time rank (axis=1), shuffling only takes place along the batch axis so that intact (episode) trajectories are not disturbed. Shuffling is always skipped if minibatch_size is None, because the entire train batch is then processed in each epoch, making shuffling unnecessary.
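The sketch below shows how the timesteps dict mentioned above might be passed in; the import path of the key constant and the step count are assumptions for illustration, and learner and train_batch are the same placeholders as in the sketch above.

    # Sketch of passing the `timesteps` dict; the import path for the key
    # constant is an assumption based on recent RLlib versions.
    from ray.rllib.utils.metrics import NUM_ENV_STEPS_SAMPLED_LIFETIME

    results = learner.update(
        batch=train_batch,
        # Lifetime count of env steps sampled so far (value is illustrative).
        timesteps={NUM_ENV_STEPS_SAMPLED_LIFETIME: 100_000},
        num_epochs=1,
    )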

Returns:

A ResultDict object produced by a call to self.metrics.reduce(). The returned dict may be arbitrarily nested and must have Stats objects at all of its leaves, allowing components further downstream (for example, a user of this Learner) to further reduce these results (for example, over n parallel Learners).
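As a rough illustration, the nested result dict can be traversed like any ordinary dict; the helper below is hypothetical, and the concrete keys depend on the algorithm.

    # Hypothetical helper that walks the nested result dict returned by
    # update(); leaf values are Stats objects that downstream components
    # (for example, across n parallel Learners) can reduce further.
    def print_results(results, prefix=""):
        for key, value in results.items():
            if isinstance(value, dict):
                print_results(value, prefix + str(key) + "/")
            else:
                print(prefix + str(key), value)

    print_results(results)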