ray.rllib.algorithms.algorithm_config.AlgorithmConfig.training
- AlgorithmConfig.training(*, gamma: float | None = <ray.rllib.utils.from_config._NotProvided object>, lr: float | List[List[int | float]] | List[Tuple[int, int | float]] | None = <ray.rllib.utils.from_config._NotProvided object>, grad_clip: float | None = <ray.rllib.utils.from_config._NotProvided object>, grad_clip_by: str | None = <ray.rllib.utils.from_config._NotProvided object>, train_batch_size: int | None = <ray.rllib.utils.from_config._NotProvided object>, train_batch_size_per_learner: int | None = <ray.rllib.utils.from_config._NotProvided object>, num_epochs: int | None = <ray.rllib.utils.from_config._NotProvided object>, minibatch_size: int | None = <ray.rllib.utils.from_config._NotProvided object>, shuffle_batch_per_epoch: bool | None = <ray.rllib.utils.from_config._NotProvided object>, model: dict | None = <ray.rllib.utils.from_config._NotProvided object>, optimizer: dict | None = <ray.rllib.utils.from_config._NotProvided object>, learner_class: Type[Learner] | None = <ray.rllib.utils.from_config._NotProvided object>, learner_connector: Callable[[RLModule], ConnectorV2 | List[ConnectorV2]] | None = <ray.rllib.utils.from_config._NotProvided object>, add_default_connectors_to_learner_pipeline: bool | None = <ray.rllib.utils.from_config._NotProvided object>, learner_config_dict: Dict[str, Any] | None = <ray.rllib.utils.from_config._NotProvided object>, num_aggregator_actors_per_learner=-1, max_requests_in_flight_per_aggregator_actor=-1, num_sgd_iter=-1, max_requests_in_flight_per_sampler_worker=-1) → AlgorithmConfig
Sets the training-related configuration.
- Parameters:
gamma – Float specifying the discount factor of the Markov decision process.
lr – The learning rate (float) or learning rate schedule in the format of [[timestep, lr-value], [timestep, lr-value], …]. In case of a schedule, intermediary timesteps are assigned to linearly interpolated learning rate values. A schedule config’s first entry must start with timestep 0, i.e.: [[0, initial_value], […]]. Note: If you require a) more than one optimizer (per RLModule), b) optimizer types that are not Adam, c) a learning rate schedule that is not a linearly interpolated, piecewise schedule as described above, or d) specifying constructor arguments of the optimizer that are not the learning rate (e.g. Adam’s epsilon), then you must override your Learner’s configure_optimizer_for_module() method and handle lr-scheduling yourself.
grad_clip – If None, no gradient clipping is applied. Otherwise, depending on the setting of grad_clip_by, the (float) value of grad_clip has the following effect: If grad_clip_by=value: Clips all computed gradients individually inside the interval [-grad_clip, +grad_clip]. If grad_clip_by=norm: Computes the L2-norm of each weight/bias gradient tensor individually and then clips all gradients such that these L2-norms do not exceed grad_clip. The L2-norm of a tensor is computed via: sqrt(SUM(w0^2, w1^2, ..., wn^2)), where w[i] are the elements of the tensor (no matter what the shape of this tensor is). If grad_clip_by=global_norm: Computes the square of the L2-norm of each weight/bias gradient tensor individually, sums up all these squared L2-norms across all given gradient tensors (e.g. the entire module to be updated), takes the square root of that overall sum, and then clips all gradients such that this global L2-norm does not exceed the given value. The global L2-norm over a list of tensors (e.g. W and V) is computed via: sqrt[SUM(w0^2, w1^2, ..., wn^2) + SUM(v0^2, v1^2, ..., vm^2)], where w[i] and v[j] are the elements of the tensors W and V (no matter what the shapes of these tensors are).
grad_clip_by – See grad_clip for the effect of this setting on gradient clipping. Allowed values are value, norm, and global_norm.
train_batch_size_per_learner – Train batch size per individual Learner worker. This setting only applies to the new API stack. The number of Learner workers can be set via config.resources(num_learners=...). The total effective batch size is then num_learners x train_batch_size_per_learner and you can access it with the property AlgorithmConfig.total_train_batch_size.
train_batch_size – Training batch size, if applicable. When on the new API stack, this setting should no longer be used. Instead, use train_batch_size_per_learner (in combination with num_learners).
num_epochs – The number of complete passes over the entire train batch (per Learner). Each pass might be further split into n minibatches (if minibatch_size is provided).
minibatch_size – The size of the minibatches into which the train batch is further split.
shuffle_batch_per_epoch – Whether to shuffle the train batch once per epoch. If the train batch has a time rank (axis=1), shuffling only takes place along the batch axis to not disturb any intact (episode) trajectories.
model – Arguments passed into the policy model. See models/catalog.py for a full list of the available model options. TODO: Provide ModelConfig objects instead of dicts.
optimizer – Arguments to pass to the policy optimizer. This setting is not used when enable_rl_module_and_learner=True.
learner_class – The Learner class to use for (distributed) updating of the RLModule. Only used when enable_rl_module_and_learner=True.
learner_connector – A callable taking an env observation space and an env action space as inputs and returning a learner ConnectorV2 (might be a pipeline) object.
add_default_connectors_to_learner_pipeline – If True (default), RLlib’s Learners automatically add the default Learner ConnectorV2 pieces to the LearnerPipeline. These automatically perform the following steps, each only if it has not already been done by a user-provided connector piece: a) add observations from episodes to the train batch; b) if the RLModule is stateful, add a time rank to the train batch, zero-pad the data, and add the correct state inputs; c) add all other information (actions, rewards, terminateds, etc.) to the train batch. Set this to False only if you know exactly what you are doing. Note that this setting is only relevant if the new API stack is used (including the new EnvRunner classes).
learner_config_dict – A dict to insert any settings accessible from within the Learner instance. This should only be used in connection with custom Learner subclasses and in case the user doesn’t want to write an extra AlgorithmConfig subclass just to add a few settings to the base Algo’s own config class.
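The schedule format accepted by lr can be illustrated with a small standalone sketch. This is not RLlib's implementation: the helper name lr_at is hypothetical, and holding the last value once the final schedule point has been passed is an assumption made for this example. It only shows what "linearly interpolated between [timestep, lr-value] entries" means in practice.

```python
from bisect import bisect_right
from typing import Sequence, Union

# A schedule is a sequence of [timestep, lr-value] pairs, e.g. [[0, 1e-3], [1000, 1e-4]].
Schedule = Sequence[Sequence[float]]


def lr_at(lr: Union[float, Schedule], timestep: int) -> float:
    """Hypothetical helper: resolve a constant lr or a piecewise-linear schedule."""
    if isinstance(lr, (int, float)):
        # A plain float means a constant learning rate.
        return float(lr)
    if lr[0][0] != 0:
        raise ValueError("The first schedule entry must start at timestep 0.")
    if timestep < 0:
        raise ValueError("timestep must be non-negative.")

    steps = [point[0] for point in lr]
    # Index of the last schedule point whose timestep is <= the current timestep.
    i = bisect_right(steps, timestep) - 1
    if i >= len(lr) - 1:
        # Assumption for this sketch: keep the final value after the last point.
        return float(lr[-1][1])

    t0, v0 = lr[i]
    t1, v1 = lr[i + 1]
    # Linear interpolation between the two surrounding schedule points.
    fraction = (timestep - t0) / (t1 - t0)
    return float(v0 + fraction * (v1 - v0))


schedule = [[0, 1e-3], [1000, 1e-4]]
halfway = lr_at(schedule, 500)  # midpoint between 1e-3 and 1e-4
```

With this schedule, timestep 500 lies halfway between the two entries, so the resolved value is the midpoint of 1e-3 and 1e-4.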
- Returns:
This updated AlgorithmConfig object.
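The three grad_clip_by modes described under grad_clip differ only in which norm is compared against grad_clip. The following sketch mirrors those descriptions in plain Python on flattened gradient lists. It is an illustration, not RLlib's code: the function clip_gradients and the dict-of-lists gradient layout are hypothetical, and real gradients would be framework tensors.

```python
import math
from typing import Dict, List, Optional

# Hypothetical layout: parameter name -> flattened gradient tensor.
Grads = Dict[str, List[float]]


def sum_of_squares(values: List[float]) -> float:
    """SUM(w0^2, w1^2, ..., wn^2) over the elements of one tensor."""
    return sum(v * v for v in values)


def clip_gradients(grads: Grads, grad_clip: Optional[float], grad_clip_by: str) -> Grads:
    """Return clipped copies of the gradients for one of the three modes."""
    if grad_clip is None:
        # No clipping requested.
        return {name: list(tensor) for name, tensor in grads.items()}

    if grad_clip_by == "value":
        # Clamp every element individually into [-grad_clip, +grad_clip].
        return {
            name: [max(-grad_clip, min(grad_clip, g)) for g in tensor]
            for name, tensor in grads.items()
        }

    if grad_clip_by == "norm":
        # Rescale each tensor separately so its own L2-norm is at most grad_clip.
        clipped = {}
        for name, tensor in grads.items():
            norm = math.sqrt(sum_of_squares(tensor))
            scale = grad_clip / norm if norm > grad_clip else 1.0
            clipped[name] = [g * scale for g in tensor]
        return clipped

    if grad_clip_by == "global_norm":
        # One L2-norm over all tensors together, then one shared scaling factor.
        global_norm = math.sqrt(sum(sum_of_squares(t) for t in grads.values()))
        scale = grad_clip / global_norm if global_norm > grad_clip else 1.0
        return {name: [g * scale for g in tensor] for name, tensor in grads.items()}

    raise ValueError(f"Unknown grad_clip_by: {grad_clip_by!r}")


grads = {"W": [3.0, 4.0], "V": [0.0, 12.0]}  # L2-norms 5 and 12, global norm 13
by_value = clip_gradients(grads, 5.0, "value")
by_norm = clip_gradients(grads, 5.0, "norm")
by_global_norm = clip_gradients(grads, 6.5, "global_norm")
```

In this example, value clipping only touches the single element above 5, norm clipping rescales V but leaves W alone because its norm already equals the limit, and global-norm clipping scales both tensors by the same factor (6.5 / 13 = 0.5), which preserves the direction of the overall gradient.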