Algorithm.training_step() → dict | NestedDict

Default single iteration logic of an algorithm.

  • Collect on-policy samples (SampleBatches) in parallel using the Algorithm’s EnvRunners (@ray.remote).

  • Concatenate collected SampleBatches into one train batch.

  • Update the model(s). Note that we may have more than one policy in the multi-agent case: call each policy’s learn_on_batch (simple optimizer) OR load_batch_into_buffer + learn_on_loaded_batch (multi-GPU optimizer) to compute the loss and update the model(s).

  • Return all collected metrics for the iteration.


Returns: The results dict from executing the training iteration.
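The four steps above can be sketched as a minimal control-flow example. This is an illustrative stand-in, not RLlib's actual implementation: `SampleBatch`, `ToyPolicy`, and the runner callables are simplified placeholders for RLlib's real EnvRunner/SampleBatch classes.

```python
from typing import Dict, List


class SampleBatch(dict):
    """Minimal batch stand-in: maps column names to lists of values."""


def concat_batches(batches: List[SampleBatch]) -> SampleBatch:
    # Step 2: concatenate the per-runner batches into one train batch.
    out = SampleBatch()
    for b in batches:
        for key, values in b.items():
            out.setdefault(key, []).extend(values)
    return out


class ToyPolicy:
    """Placeholder policy; learn_on_batch is stubbed for illustration."""

    def __init__(self):
        self.updates = 0

    def learn_on_batch(self, batch: SampleBatch) -> Dict:
        # Step 3: compute loss and update the model (stubbed here).
        self.updates += 1
        return {"num_samples": len(batch["obs"]), "updates": self.updates}


def training_step(env_runners, policies: Dict[str, ToyPolicy]) -> Dict:
    # Step 1: collect on-policy samples from each (remote) EnvRunner.
    sample_batches = [runner() for runner in env_runners]
    # Step 2: build a single train batch from the collected samples.
    train_batch = concat_batches(sample_batches)
    # Step 3: each policy learns on the batch (multi-agent: many policies).
    results = {pid: p.learn_on_batch(train_batch) for pid, p in policies.items()}
    # Step 4: return all collected metrics for this iteration.
    return results


# Usage: two "runners" that each return a small batch of observations.
runners = [lambda: SampleBatch(obs=[1, 2]), lambda: SampleBatch(obs=[3])]
metrics = training_step(runners, {"default_policy": ToyPolicy()})
```

In RLlib itself, step 1 runs in parallel via `@ray.remote` EnvRunners rather than the sequential loop shown here, and the learn path depends on the configured optimizer (simple vs. multi-GPU).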