ray.rllib.algorithms.algorithm.Algorithm.training_step

Algorithm.training_step() → None

Default single iteration logic of an algorithm.

  • Collect on-policy samples (SampleBatches) in parallel using the Algorithm’s EnvRunners (@ray.remote).

  • Concatenate collected SampleBatches into one train batch.

  • Update the model(s) by calling each policy’s learn_on_batch (simple optimizer) or load_batch_into_buffer plus learn_on_loaded_batch (multi-GPU optimizer) methods to compute the loss and apply gradients. Note that there may be more than one policy in the multi-agent case.

  • Return all collected metrics for the iteration.
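
For orientation, here is a minimal sketch of what such a default, on-policy training_step() override has typically looked like on the old API stack. It assumes the helper utilities synchronous_parallel_sample and train_one_step / multi_gpu_train_one_step from ray.rllib.execution and the self.workers EnvRunner group attribute; exact names and signatures vary between RLlib versions, so treat this as illustrative rather than the built-in implementation.

    from ray.rllib.algorithms.algorithm import Algorithm
    from ray.rllib.execution.rollout_ops import synchronous_parallel_sample
    from ray.rllib.execution.train_ops import (
        multi_gpu_train_one_step,
        train_one_step,
    )

    class MyOnPolicyAlgo(Algorithm):
        def training_step(self):
            # 1) Collect on-policy SampleBatches in parallel from the remote
            #    EnvRunners and concatenate them into one train batch.
            train_batch = synchronous_parallel_sample(
                worker_set=self.workers,
                max_env_steps=self.config.train_batch_size,
            )
            train_batch = train_batch.as_multi_agent()

            # 2) Update the model(s): simple optimizer (learn_on_batch) vs.
            #    multi-GPU optimizer (load_batch_into_buffer +
            #    learn_on_loaded_batch), depending on the config.
            if self.config.simple_optimizer:
                train_results = train_one_step(self, train_batch)
            else:
                train_results = multi_gpu_train_one_step(self, train_batch)

            # 3) Sync the updated weights back to the remote EnvRunners so the
            #    next round of sampling is on-policy again.
            self.workers.sync_weights()

            # 4) Return all collected metrics for this iteration (old stack).
            return train_results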

Returns:

For the new API stack, returns None: results are compiled and extracted automatically through a single self.metrics.reduce() call at the very end of an iteration (an iteration may comprise more than one training_step() call). This ensures that the results of every individual training_step() call are accounted for. For the old API stack, returns the results dict from executing the training step.
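
As a hedged sketch of this new-API-stack contract, a custom override logs its stats through self.metrics (the Algorithm’s MetricsLogger) instead of returning a results dict; the class name and metric key below are hypothetical:

    from ray.rllib.algorithms.algorithm import Algorithm

    class MyNewStackAlgo(Algorithm):
        def training_step(self) -> None:
            # ... sample via self.env_runner_group and update the RLModule(s)
            # via self.learner_group here (omitted for brevity) ...

            # Do NOT return a results dict. Log everything through the
            # MetricsLogger; RLlib compiles the iteration results with one
            # self.metrics.reduce() call after all training_step() calls of
            # the iteration have finished.
            self.metrics.log_value(
                "my_custom/num_training_step_calls",  # hypothetical key
                1,
                reduce="sum",
            )
            return None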