ray.rllib.algorithms.algorithm_config.AlgorithmConfig.training#

AlgorithmConfig.training(*, gamma: float | None = <ray.rllib.utils.from_config._NotProvided object>, lr: float | ~typing.List[~typing.List[int | float]] | ~typing.List[~typing.Tuple[int, int | float]] | None = <ray.rllib.utils.from_config._NotProvided object>, grad_clip: float | None = <ray.rllib.utils.from_config._NotProvided object>, grad_clip_by: str | None = <ray.rllib.utils.from_config._NotProvided object>, train_batch_size: int | None = <ray.rllib.utils.from_config._NotProvided object>, train_batch_size_per_learner: int | None = <ray.rllib.utils.from_config._NotProvided object>, num_epochs: int | None = <ray.rllib.utils.from_config._NotProvided object>, minibatch_size: int | None = <ray.rllib.utils.from_config._NotProvided object>, shuffle_batch_per_epoch: bool | None = <ray.rllib.utils.from_config._NotProvided object>, model: dict | None = <ray.rllib.utils.from_config._NotProvided object>, optimizer: dict | None = <ray.rllib.utils.from_config._NotProvided object>, learner_class: ~typing.Type[Learner] | None = <ray.rllib.utils.from_config._NotProvided object>, learner_connector: ~typing.Callable[[RLModule], ConnectorV2 | ~typing.List[ConnectorV2]] | None = <ray.rllib.utils.from_config._NotProvided object>, add_default_connectors_to_learner_pipeline: bool | None = <ray.rllib.utils.from_config._NotProvided object>, learner_config_dict: ~typing.Dict[str, ~typing.Any] | None = <ray.rllib.utils.from_config._NotProvided object>, num_aggregator_actors_per_learner=-1, max_requests_in_flight_per_aggregator_actor=-1, num_sgd_iter=-1, max_requests_in_flight_per_sampler_worker=-1) AlgorithmConfig[source]#

Sets the training related configuration.

Parameters:
  • gamma – Float specifying the discount factor of the Markov Decision process.

  • lr – The learning rate (float) or learning rate schedule in the format of [[timestep, lr-value], [timestep, lr-value], …] In case of a schedule, intermediary timesteps are assigned to linearly interpolated learning rate values. A schedule config’s first entry must start with timestep 0, i.e.: [[0, initial_value], […]]. Note: If you require a) more than one optimizer (per RLModule), b) optimizer types that are not Adam, c) a learning rate schedule that is not a linearly interpolated, piecewise schedule as described above, or d) specifying c’tor arguments of the optimizer that are not the learning rate (e.g. Adam’s epsilon), then you must override your Learner’s configure_optimizer_for_module() method and handle lr-scheduling yourself.

  • grad_clip – If None, no gradient clipping is applied. Otherwise, depending on the setting of grad_clip_by, the (float) value of grad_clip has the following effect: If grad_clip_by=value: Clips all computed gradients individually inside the interval [-grad_clip, +`grad_clip`]. If grad_clip_by=norm, computes the L2-norm of each weight/bias gradient tensor individually and then clip all gradients such that these L2-norms do not exceed grad_clip. The L2-norm of a tensor is computed via: sqrt(SUM(w0^2, w1^2, ..., wn^2)) where w[i] are the elements of the tensor (no matter what the shape of this tensor is). If grad_clip_by=global_norm, computes the square of the L2-norm of each weight/bias gradient tensor individually, sum up all these squared L2-norms across all given gradient tensors (e.g. the entire module to be updated), square root that overall sum, and then clip all gradients such that this global L2-norm does not exceed the given value. The global L2-norm over a list of tensors (e.g. W and V) is computed via: sqrt[SUM(w0^2, w1^2, ..., wn^2) + SUM(v0^2, v1^2, ..., vm^2)], where w[i] and v[j] are the elements of the tensors W and V (no matter what the shapes of these tensors are).

  • grad_clip_by – See grad_clip for the effect of this setting on gradient clipping. Allowed values are value, norm, and global_norm.

  • train_batch_size_per_learner – Train batch size per individual Learner worker. This setting only applies to the new API stack. The number of Learner workers can be set via config.resources( num_learners=...). The total effective batch size is then num_learners x train_batch_size_per_learner and you can access it with the property AlgorithmConfig.total_train_batch_size.

  • train_batch_size – Training batch size, if applicable. When on the new API stack, this setting should no longer be used. Instead, use train_batch_size_per_learner (in combination with num_learners).

  • num_epochs – The number of complete passes over the entire train batch (per Learner). Each pass might be further split into n minibatches (if minibatch_size provided).

  • minibatch_size – The size of minibatches to use to further split the train batch into.

  • shuffle_batch_per_epoch – Whether to shuffle the train batch once per epoch. If the train batch has a time rank (axis=1), shuffling only takes place along the batch axis to not disturb any intact (episode) trajectories.

  • model – Arguments passed into the policy model. See models/catalog.py for a full list of the available model options. TODO: Provide ModelConfig objects instead of dicts.

  • optimizer – Arguments to pass to the policy optimizer. This setting is not used when enable_rl_module_and_learner=True.

  • learner_class – The Learner class to use for (distributed) updating of the RLModule. Only used when enable_rl_module_and_learner=True.

  • learner_connector – A callable taking an env observation space and an env action space as inputs and returning a learner ConnectorV2 (might be a pipeline) object.

  • add_default_connectors_to_learner_pipeline – If True (default), RLlib’s Learners automatically add the default Learner ConnectorV2 pieces to the LearnerPipeline. These automatically perform: a) adding observations from episodes to the train batch, if this has not already been done by a user-provided connector piece b) if RLModule is stateful, add a time rank to the train batch, zero-pad the data, and add the correct state inputs, if this has not already been done by a user-provided connector piece. c) add all other information (actions, rewards, terminateds, etc..) to the train batch, if this has not already been done by a user-provided connector piece. Only if you know exactly what you are doing, you should set this setting to False. Note that this setting is only relevant if the new API stack is used (including the new EnvRunner classes).

  • learner_config_dict – A dict to insert any settings accessible from within the Learner instance. This should only be used in connection with custom Learner subclasses and in case the user doesn’t want to write an extra AlgorithmConfig subclass just to add a few settings to the base Algo’s own config class.

Returns:

This updated AlgorithmConfig object.