ray.rllib.algorithms.algorithm_config.AlgorithmConfig.training
- AlgorithmConfig.training(*, gamma: float | None = <ray.rllib.utils.from_config._NotProvided object>, lr: float | ~typing.List[~typing.List[int | float]] | ~typing.List[~typing.Tuple[int, int | float]] | None = <ray.rllib.utils.from_config._NotProvided object>, grad_clip: float | None = <ray.rllib.utils.from_config._NotProvided object>, grad_clip_by: str | None = <ray.rllib.utils.from_config._NotProvided object>, train_batch_size: int | None = <ray.rllib.utils.from_config._NotProvided object>, train_batch_size_per_learner: int | None = <ray.rllib.utils.from_config._NotProvided object>, num_epochs: int | None = <ray.rllib.utils.from_config._NotProvided object>, minibatch_size: int | None = <ray.rllib.utils.from_config._NotProvided object>, shuffle_batch_per_epoch: bool | None = <ray.rllib.utils.from_config._NotProvided object>, model: dict | None = <ray.rllib.utils.from_config._NotProvided object>, optimizer: dict | None = <ray.rllib.utils.from_config._NotProvided object>, num_aggregator_actors_per_learner=-1, max_requests_in_flight_per_aggregator_actor=-1, num_sgd_iter=-1, max_requests_in_flight_per_sampler_worker=-1, learner_class: ~typing.Type[Learner] | None = <ray.rllib.utils.from_config._NotProvided object>, learner_connector: ~typing.Callable[[RLModule], ConnectorV2 | ~typing.List[ConnectorV2]] | None = <ray.rllib.utils.from_config._NotProvided object>, add_default_connectors_to_learner_pipeline: bool | None = <ray.rllib.utils.from_config._NotProvided object>, learner_config_dict: ~typing.Dict[str, ~typing.Any] | None = <ray.rllib.utils.from_config._NotProvided object>) → AlgorithmConfig
Sets the training related configuration.
- Parameters:
gamma – Float specifying the discount factor of the Markov Decision process.
lr – The learning rate (float) or learning rate schedule in the format of [[timestep, lr-value], [timestep, lr-value], …]. In case of a schedule, intermediary timesteps are assigned linearly interpolated learning rate values. A schedule config's first entry must start with timestep 0, i.e.: [[0, initial_value], […]]. Note: If you require a) more than one optimizer (per RLModule), b) optimizer types other than Adam, c) a learning rate schedule that is not a linearly interpolated, piecewise schedule as described above, or d) c'tor arguments of the optimizer other than the learning rate (e.g. Adam's epsilon), then you must override your Learner's configure_optimizer_for_module() method and handle lr-scheduling yourself. A schedule usage sketch follows the Returns section below.
grad_clip – If None, no gradient clipping is applied. Otherwise, depending on the setting of grad_clip_by, the (float) value of grad_clip has the following effect: If grad_clip_by=value: Clips all computed gradients individually inside the interval [-grad_clip, +grad_clip]. If grad_clip_by=norm: Computes the L2-norm of each weight/bias gradient tensor individually and then clips all gradients such that these L2-norms do not exceed grad_clip. The L2-norm of a tensor is computed via sqrt(SUM(w0^2, w1^2, ..., wn^2)), where w[i] are the elements of the tensor (no matter what the shape of this tensor is). If grad_clip_by=global_norm: Computes the squared L2-norm of each weight/bias gradient tensor individually, sums these squared L2-norms across all given gradient tensors (e.g. the entire module to be updated), takes the square root of that overall sum, and then clips all gradients such that this global L2-norm does not exceed the given value. The global L2-norm over a list of tensors (e.g. W and V) is computed via sqrt[SUM(w0^2, w1^2, ..., wn^2) + SUM(v0^2, v1^2, ..., vm^2)], where w[i] and v[j] are the elements of the tensors W and V (no matter what the shapes of these tensors are). A small norm-computation sketch follows the Returns section below.
grad_clip_by – See grad_clip for the effect of this setting on gradient clipping. Allowed values are value, norm, and global_norm.
train_batch_size_per_learner – Train batch size per individual Learner worker. This setting only applies to the new API stack. The number of Learner workers can be set via config.resources(num_learners=...). The total effective batch size is then num_learners x train_batch_size_per_learner and can be accessed through the property AlgorithmConfig.total_train_batch_size (see the batch-size sketch after the Returns section below).
train_batch_size – Training batch size, if applicable. When on the new API stack, this setting should no longer be used. Instead, use train_batch_size_per_learner (in combination with num_learners).
num_epochs – The number of complete passes over the entire train batch (per Learner). Each pass might be further split into n minibatches (if minibatch_size is provided).
minibatch_size – The size of the minibatches into which the train batch is further split.
shuffle_batch_per_epoch – Whether to shuffle the train batch once per epoch. If the train batch has a time rank (axis=1), shuffling only takes place along the batch axis to not disturb any intact (episode) trajectories.
model – Arguments passed into the policy model. See models/catalog.py for a full list of the available model options. TODO: Provide ModelConfig objects instead of dicts.
optimizer – Arguments to pass to the policy optimizer. This setting is not used when enable_rl_module_and_learner=True.
- Returns:
This updated AlgorithmConfig object.
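For reference, a minimal usage sketch covering the lr schedule and the other training settings described above. PPOConfig, the environment name, and all concrete values are illustrative assumptions, not part of this signature:
```python
from ray.rllib.algorithms.ppo import PPOConfig

# Minimal sketch: set training-related options on an AlgorithmConfig.
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .training(
        gamma=0.99,
        # Piecewise lr schedule: starts at 5e-4 at timestep 0 and is linearly
        # interpolated down to 5e-5 by timestep 1,000,000.
        lr=[[0, 5e-4], [1_000_000, 5e-5]],
        grad_clip=40.0,
        grad_clip_by="global_norm",
        train_batch_size_per_learner=2000,
        num_epochs=10,
        minibatch_size=500,
        shuffle_batch_per_epoch=True,
    )
)
```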
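To make the grad_clip_by formulas concrete, here is a small NumPy sketch (not RLlib code; the tensors W and V and the grad_clip value are made up) computing the per-tensor L2-norm used by norm clipping and the global L2-norm used by global_norm clipping:
```python
import numpy as np

# Two hypothetical gradient tensors W and V (their shapes are irrelevant).
W = np.array([[0.5, -1.0], [2.0, 0.0]])
V = np.array([3.0, -0.5, 1.5])

# grad_clip_by="norm": each tensor's own L2-norm,
# sqrt(SUM(w0^2, ..., wn^2)), is compared against grad_clip.
norm_W = np.sqrt(np.sum(W ** 2))
norm_V = np.sqrt(np.sum(V ** 2))

# grad_clip_by="global_norm": one L2-norm over all tensors together,
# sqrt[SUM(w_i^2) + SUM(v_j^2)], is compared against grad_clip.
global_norm = np.sqrt(np.sum(W ** 2) + np.sum(V ** 2))

# If the (global) norm exceeds grad_clip, the gradients are rescaled so
# that the norm no longer exceeds grad_clip.
grad_clip = 2.0
scale = min(1.0, grad_clip / global_norm)
W_clipped, V_clipped = W * scale, V * scale
```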
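The relationship between the number of Learner workers and the effective batch size can be sketched as follows. PPOConfig and the concrete values are illustrative assumptions; depending on your Ray version, num_learners is set through config.learners() (shown here) or config.resources(), as noted in the parameter list above:
```python
from ray.rllib.algorithms.ppo import PPOConfig

# Sketch: effective batch size = num_learners x train_batch_size_per_learner.
config = (
    PPOConfig()
    .learners(num_learners=4)
    .training(train_batch_size_per_learner=2000)
)

# 4 Learners x 2000 samples per Learner = 8000 samples per training step.
print(config.total_train_batch_size)  # -> 8000
```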