Algorithms

The Algorithm class is the highest-level API in RLlib. It allows you to train and evaluate policies, save an experiment’s progress and restore from a prior saved experiment when continuing an RL run. Algorithm is a sub-class of Trainable and thus fully supports distributed hyperparameter tuning for RL.

[Figure: trainer_class_overview.svg – A typical RLlib Algorithm object. The components sitting inside an Algorithm are normally N RolloutWorkers and zero or more @ray.remote BaseEnvs per worker.]

Defining Algorithms with the AlgorithmConfig Class

The AlgorithmConfig class represents the primary way of configuring and building an Algorithm. You don’t use AlgorithmConfig directly in practice, but rather use its algorithm-specific implementations such as PPOConfig, which each come with their own set of arguments to their respective .training() method.

Here’s how you work with an AlgorithmConfig.

class ray.rllib.algorithms.algorithm_config.AlgorithmConfig(algo_class=None)[source]

An RLlib AlgorithmConfig builds an RLlib Algorithm from a given configuration.

Example

>>> from ray.rllib.algorithms.algorithm_config import AlgorithmConfig
>>> from ray.rllib.algorithms.callbacks import MemoryTrackingCallbacks
>>> # Construct a generic config object, specifying values within different
>>> # sub-categories, e.g. "training".
>>> config = (
...     AlgorithmConfig()
...     .training(gamma=0.9, lr=0.01)
...     .environment(env="CartPole-v1")
...     .resources(num_gpus=0)
...     .rollouts(num_rollout_workers=4)
...     .callbacks(MemoryTrackingCallbacks)
... )
>>> # A config object can be used to construct the respective Algorithm.
>>> rllib_algo = config.build()

Example

>>> from ray.rllib.algorithms.algorithm_config import AlgorithmConfig
>>> from ray import tune
>>> # In combination with a tune.grid_search:
>>> config = AlgorithmConfig()
>>> config.training(lr=tune.grid_search([0.01, 0.001])) 
>>> # Use `to_dict()` method to get the legacy plain python config dict
>>> # for usage with `tune.Tuner().fit()`.
>>> tune.Tuner(  
...     "[registered trainer class]", param_space=config.to_dict()
...     ).fit()
classmethod from_dict(config_dict: dict) ray.rllib.algorithms.algorithm_config.AlgorithmConfig[source]

Creates an AlgorithmConfig from a legacy python config dict.

Examples

>>> from ray.rllib.algorithms.ppo.ppo import DEFAULT_CONFIG, PPOConfig
>>> ppo_config = PPOConfig.from_dict(DEFAULT_CONFIG)
>>> ppo = ppo_config.build(env="Pendulum-v1")
Parameters

config_dict – The legacy formatted python config dict for some algorithm.

Returns

A new AlgorithmConfig object that matches the given python config dict.

to_dict() dict[source]

Converts all settings into a legacy config dict for backward compatibility.

Returns

A complete AlgorithmConfigDict, usable in backward-compatible Tune/RLlib use cases, e.g. w/ tune.Tuner().fit().

update_from_dict(config_dict: dict) ray.rllib.algorithms.algorithm_config.AlgorithmConfig[source]

Modifies this AlgorithmConfig via the provided python config dict.

Warns if config_dict contains deprecated keys. Silently sets even properties of self that do NOT exist. This way, this method may be used to configure custom Policies which do not have their own specific AlgorithmConfig classes, e.g. ray.rllib.examples.policy.random_policy::RandomPolicy.

Parameters

config_dict – The old-style python config dict (PartialAlgorithmConfigDict) to use for overriding some properties defined in there.

Returns

This updated AlgorithmConfig object.
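
A minimal sketch of how update_from_dict can layer a legacy-style partial dict on top of an existing config (the dict keys shown are standard AlgorithmConfig settings; the values are arbitrary):

>>> from ray.rllib.algorithms.algorithm_config import AlgorithmConfig
>>> config = AlgorithmConfig().training(lr=0.001)
>>> # Override/extend settings via an old-style (partial) config dict.
>>> config = config.update_from_dict({"gamma": 0.95, "train_batch_size": 2000})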

copy(copy_frozen: Optional[bool] = None) ray.rllib.algorithms.algorithm_config.AlgorithmConfig[source]

Creates a deep copy of this config and (un)freezes if necessary.

Parameters

copy_frozen – Whether the created deep copy will be frozen or not. If None, keep the same frozen status that self currently has.

Returns

A deep copy of self that is (un)frozen.

freeze() None[source]

Freezes this config object, such that no attributes can be set anymore.

Algorithms should use this method to make sure that their config objects remain read-only after this.

validate() None[source]

Validates all values in this config.

Note: This should NOT include immediate checks on single value correctness, e.g. “batch_mode” = [complete_episodes|truncate_episodes]. Those singular, independent checks should instead go directly into their respective methods.

build(env: Optional[Union[str, Any]] = None, logger_creator: Optional[Callable[[], ray.tune.logger.logger.Logger]] = None, use_copy: bool = True) Algorithm[source]

Builds an Algorithm from this AlgorithmConfig (or a copy thereof).

Parameters
  • env – Name of the environment to use (e.g. a gym-registered str), a full class path (e.g. “ray.rllib.examples.env.random_env.RandomEnv”), or an Env class directly. Note that this arg can also be specified via the “env” key in config.

  • logger_creator – Callable that creates a ray.tune.Logger object. If unspecified, a default logger is created.

  • use_copy – Whether to deepcopy self and pass the copy to the Algorithm (instead of self) as config. This is useful in case you would like to recycle the same AlgorithmConfig over and over, e.g. in a test case, in which we loop over different DL-frameworks.

Returns

A ray.rllib.algorithms.algorithm.Algorithm object.
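
For illustration, a minimal build-and-train loop (assuming PPOConfig as in the examples above; the number of iterations is arbitrary):

>>> from ray.rllib.algorithms.ppo import PPOConfig
>>> config = PPOConfig().environment(env="CartPole-v1").rollouts(num_rollout_workers=2)
>>> algo = config.build()
>>> for _ in range(3):
...     results = algo.train()  # one training iteration per call
>>> algo.stop()  # release workers and other resources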

python_environment(*, extra_python_environs_for_driver: Optional[dict] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, extra_python_environs_for_worker: Optional[dict] = <ray.rllib.algorithms.algorithm_config._NotProvided object>) ray.rllib.algorithms.algorithm_config.AlgorithmConfig[source]

Sets the config’s python environment settings.

Parameters
  • extra_python_environs_for_driver – Any extra python env vars to set in the algorithm’s process, e.g., {“OMP_NUM_THREADS”: “16”}.

  • extra_python_environs_for_worker – Any extra python env vars to set for the worker processes.

Returns

This updated AlgorithmConfig object.
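
For example (a sketch; the env var values are arbitrary):

>>> from ray.rllib.algorithms.algorithm_config import AlgorithmConfig
>>> config = AlgorithmConfig().python_environment(
...     extra_python_environs_for_driver={"OMP_NUM_THREADS": "16"},
...     extra_python_environs_for_worker={"OMP_NUM_THREADS": "1"},
... )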

resources(*, num_gpus: Optional[Union[float, int]] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, _fake_gpus: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, num_cpus_per_worker: Optional[Union[float, int]] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, num_gpus_per_worker: Optional[Union[float, int]] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, num_cpus_for_local_worker: Optional[int] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, custom_resources_per_worker: Optional[dict] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, placement_strategy: Optional[str] = <ray.rllib.algorithms.algorithm_config._NotProvided object>) ray.rllib.algorithms.algorithm_config.AlgorithmConfig[source]

Specifies resources allocated for an Algorithm and its ray actors/workers.

Parameters
  • num_gpus – Number of GPUs to allocate to the algorithm process. Note that not all algorithms can take advantage of GPUs. Support for multi-GPU is currently only available for tf-[PPO/IMPALA/DQN/PG]. This can be fractional (e.g., 0.3 GPUs).

  • _fake_gpus – Set to True for debugging (multi-)GPU functionality on a CPU machine. GPU towers will be simulated by graphs located on CPUs in this case. Use num_gpus to test for different numbers of fake GPUs.

  • num_cpus_per_worker – Number of CPUs to allocate per worker.

  • num_gpus_per_worker – Number of GPUs to allocate per worker. This can be fractional. This is usually needed only if your env itself requires a GPU (i.e., it is a GPU-intensive video game), or model inference is unusually expensive.

  • custom_resources_per_worker – Any custom Ray resources to allocate per worker.

  • num_cpus_for_local_worker – Number of CPUs to allocate for the algorithm. Note: this only takes effect when running in Tune. Otherwise, the algorithm runs in the main program (driver).

  • placement_strategy – The strategy for the placement group factory returned by Algorithm.default_resource_request(). A PlacementGroup defines which devices (resources) should always be co-located on the same node. For example, an Algorithm with 2 rollout workers, running with num_gpus=1, will request a placement group with the bundles: [{"gpu": 1, "cpu": 1}, {"cpu": 1}, {"cpu": 1}], where the first bundle is for the driver and the other 2 bundles are for the two workers. These bundles can now be "placed" on the same or different nodes depending on the value of placement_strategy: "PACK": Packs bundles into as few nodes as possible. "SPREAD": Places bundles across distinct nodes as evenly as possible. "STRICT_PACK": Packs bundles into one node; the group is not allowed to span multiple nodes. "STRICT_SPREAD": Packs bundles across distinct nodes.

Returns

This updated AlgorithmConfig object.
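
A short sketch combining some of these settings (the numbers are illustrative, not tuned recommendations):

>>> from ray.rllib.algorithms.algorithm_config import AlgorithmConfig
>>> config = AlgorithmConfig().resources(
...     num_gpus=0.5,               # fractional GPU for the algorithm process
...     num_cpus_per_worker=1,
...     num_gpus_per_worker=0,
...     placement_strategy="PACK",  # co-locate bundles on as few nodes as possible
... )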

framework(framework: Optional[str] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, *, eager_tracing: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, eager_max_retraces: Optional[int] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, tf_session_args: Optional[Dict[str, Any]] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, local_tf_session_args: Optional[Dict[str, Any]] = <ray.rllib.algorithms.algorithm_config._NotProvided object>) ray.rllib.algorithms.algorithm_config.AlgorithmConfig[source]

Sets the config’s DL framework settings.

Parameters
  • framework – tf: TensorFlow (static-graph); tf2: TensorFlow 2.x (eager or traced, if eager_tracing=True); torch: PyTorch

  • eager_tracing – Enable tracing in eager mode. This greatly improves performance (speedup ~2x), but makes it slightly harder to debug since Python code won’t be evaluated after the initial eager pass. Only possible if framework=tf2.

  • eager_max_retraces – Maximum number of tf.function re-traces before a runtime error is raised. This is to prevent unnoticed retraces of methods inside the _eager_traced Policy, which could slow down execution by a factor of 4, without the user noticing what the root cause for this slowdown could be. Only necessary for framework=tf2. Set to None to ignore the re-trace count and never throw an error.

  • tf_session_args – Configures TF for single-process operation by default.

  • local_tf_session_args – Overrides for the tf_session_args on the local worker.

Returns

This updated AlgorithmConfig object.
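
For example (a sketch showing the two most common choices):

>>> from ray.rllib.algorithms.algorithm_config import AlgorithmConfig
>>> # Eager-traced TensorFlow 2.x ...
>>> config = AlgorithmConfig().framework("tf2", eager_tracing=True)
>>> # ... or PyTorch.
>>> config = AlgorithmConfig().framework("torch")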

environment(env: Optional[Union[str, Any]] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, *, env_config: Optional[dict] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, observation_space: Optional[gym.spaces.Space] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, action_space: Optional[gym.spaces.Space] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, env_task_fn: Optional[Callable[[dict, Any, ray.rllib.env.env_context.EnvContext], Any]] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, render_env: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, clip_rewards: Optional[Union[bool, float]] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, normalize_actions: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, clip_actions: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, disable_env_checking: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>) ray.rllib.algorithms.algorithm_config.AlgorithmConfig[source]

Sets the config’s RL-environment settings.

Parameters
  • env – The environment specifier. This can either be a tune-registered env, via tune.register_env([name], lambda env_ctx: [env object]), or a string specifier of an RLlib supported type. In the latter case, RLlib will try to interpret the specifier as either an openAI gym env, a PyBullet env, a ViZDoomGym env, or a fully qualified classpath to an Env class, e.g. “ray.rllib.examples.env.random_env.RandomEnv”.

  • env_config – Arguments dict passed to the env creator as an EnvContext object (which is a dict plus the properties: num_rollout_workers, worker_index, vector_index, and remote).

  • observation_space – The observation space for the Policies of this Algorithm.

  • action_space – The action space for the Policies of this Algorithm.

  • env_task_fn – A callable taking the last train results, the base env and the env context as args and returning a new task to set the env to. The env must be a TaskSettableEnv sub-class for this to work. See examples/curriculum_learning.py for an example.

  • render_env – If True, try to render the environment on the local worker or on worker 1 (if num_rollout_workers > 0). For vectorized envs, this usually means that only the first sub-environment will be rendered. In order for this to work, your env will have to implement the render() method which either: a) handles window generation and rendering itself (returning True) or b) returns a numpy uint8 image of shape [height x width x 3 (RGB)].

  • clip_rewards – Whether to clip rewards during Policy’s postprocessing. None (default): Clip for Atari only (r=sign(r)). True: r=sign(r): Fixed rewards -1.0, 1.0, or 0.0. False: Never clip. [float value]: Clip at -value and + value. Tuple[value1, value2]: Clip at value1 and value2.

  • normalize_actions – If True, RLlib will learn entirely inside a normalized action space (0.0 centered with small stddev; only affecting Box components). We will unsquash actions (and clip, just in case) to the bounds of the env’s action space before sending actions back to the env.

  • clip_actions – If True, RLlib will clip actions according to the env’s bounds before sending them back to the env. TODO: (sven) This option should be deprecated and always be False.

  • disable_env_checking – If True, disable the environment pre-checking module.

Returns

This updated AlgorithmConfig object.
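
A hedged sketch of registering a custom env via tune.register_env and passing it an env_config (RandomEnv is the example env mentioned above; the "some_env_option" key is hypothetical and only illustrates that env_config is forwarded to the env creator as an EnvContext):

>>> from ray import tune
>>> from ray.rllib.algorithms.algorithm_config import AlgorithmConfig
>>> from ray.rllib.examples.env.random_env import RandomEnv
>>> # Register the env under a name; the lambda receives the EnvContext.
>>> tune.register_env("my_random_env", lambda ctx: RandomEnv(ctx))
>>> config = AlgorithmConfig().environment(
...     env="my_random_env",
...     env_config={"some_env_option": 1.0},  # hypothetical key, passed to the env
...     normalize_actions=True,
... )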

rollouts(*, num_rollout_workers: Optional[int] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, num_envs_per_worker: Optional[int] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, create_env_on_local_worker: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, sample_collector: Optional[Type[ray.rllib.evaluation.collectors.sample_collector.SampleCollector]] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, sample_async: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, enable_connectors: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, rollout_fragment_length: Optional[Union[int, str]] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, batch_mode: Optional[str] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, remote_worker_envs: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, remote_env_batch_wait_ms: Optional[float] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, validate_workers_after_construction: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, ignore_worker_failures: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, recreate_failed_workers: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, restart_failed_sub_environments: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, num_consecutive_worker_failures_tolerance: Optional[int] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, no_done_at_end: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, preprocessor_pref: Optional[str] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, observation_filter: Optional[str] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, synchronize_filter: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, compress_observations: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, enable_tf1_exec_eagerly: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, sampler_perf_stats_ema_coef: Optional[float] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, horizon=-1, soft_horizon=-1) ray.rllib.algorithms.algorithm_config.AlgorithmConfig[source]

Sets the rollout worker configuration.

Parameters
  • num_rollout_workers – Number of rollout worker actors to create for parallel sampling. Setting this to 0 will force rollouts to be done in the local worker (driver process or the Algorithm’s actor when using Tune).

  • num_envs_per_worker – Number of environments to evaluate vector-wise per worker. This enables model inference batching, which can improve performance for inference bottlenecked workloads.

  • sample_collector – The SampleCollector class to be used to collect and retrieve environment-, model-, and sampler data. Override the SampleCollector base class to implement your own collection/buffering/retrieval logic.

  • create_env_on_local_worker – When num_rollout_workers > 0, the driver (local_worker; worker-idx=0) does not need an environment, because it doesn’t have to sample (done by remote_workers; worker_indices > 0) nor evaluate (done by evaluation workers; see below). Set this to True to create an environment on the local worker anyway (e.g. for debugging purposes).

  • sample_async – Use a background thread for sampling (slightly off-policy, usually not advisable to turn on unless your env specifically requires it).

  • enable_connectors – Use connector based environment runner, so that all preprocessing of obs and postprocessing of actions are done in agent and action connectors.

  • rollout_fragment_length – Divide episodes into fragments of this many steps each during rollouts. Trajectories of this size are collected from rollout workers and combined into a larger batch of train_batch_size for learning. For example, given rollout_fragment_length=100 and train_batch_size=1000: 1. RLlib collects 10 fragments of 100 steps each from rollout workers. 2. These fragments are concatenated and we perform an epoch of SGD. When using multiple envs per worker, the fragment size is multiplied by num_envs_per_worker. This is since we are collecting steps from multiple envs in parallel. For example, if num_envs_per_worker=5, then rollout workers will return experiences in chunks of 5*100 = 500 steps. The dataflow here can vary per algorithm. For example, PPO further divides the train batch into minibatches for multi-epoch SGD. Set to “auto” to have RLlib compute an exact rollout_fragment_length to match the given batch size.

  • batch_mode – How to build per-Sampler (RolloutWorker) batches, which are then usually concat’d to form the train batch. Note that “steps” below can mean different things (either env- or agent-steps) and depends on the count_steps_by setting, adjustable via AlgorithmConfig.multi_agent(count_steps_by=..): 1) “truncate_episodes”: Each call to sample() will return a batch of at most rollout_fragment_length * num_envs_per_worker in size. The batch will be exactly rollout_fragment_length * num_envs in size if postprocessing does not change batch sizes. Episodes may be truncated in order to meet this size requirement. This mode guarantees evenly sized batches, but increases variance as the future return must now be estimated at truncation boundaries. 2) “complete_episodes”: Each call to sample() will return a batch of at least rollout_fragment_length * num_envs_per_worker in size. Episodes will not be truncated, but multiple episodes may be packed within one batch to meet the (minimum) batch size. Note that when num_envs_per_worker > 1, episode steps will be buffered until the episode completes, and hence batches may contain significant amounts of off-policy data.

  • remote_worker_envs – If using num_envs_per_worker > 1, whether to create those new envs in remote processes instead of in the same worker. This adds overhead, but can make sense if your envs take a long time to step/reset (e.g., for StarCraft). Use this cautiously; overheads are significant.

  • remote_env_batch_wait_ms – The timeout (in milliseconds) that remote workers wait when polling their environments. 0 (continue when at least one env is ready) is a reasonable default, but the optimal value could be obtained by measuring your environment step/reset and model inference performance.

  • validate_workers_after_construction – Whether to validate that each created remote worker is healthy after its construction process.

  • ignore_worker_failures – Whether to attempt to continue training if a worker crashes. The number of currently healthy workers is reported as the “num_healthy_workers” metric.

  • recreate_failed_workers – Whether - upon a worker failure - RLlib will try to recreate the lost worker as an identical copy of the failed one. The new worker will only differ from the failed one in its self.recreated_worker=True property value. It will have the same worker_index as the original one. If True, the ignore_worker_failures setting will be ignored.

  • restart_failed_sub_environments – If True and any sub-environment (within a vectorized env) throws any error during env stepping, the Sampler will try to restart the faulty sub-environment. This is done without disturbing the other (still intact) sub-environment and without the RolloutWorker crashing.

  • num_consecutive_worker_failures_tolerance – The number of consecutive times a rollout worker (or evaluation worker) failure is tolerated before finally crashing the Algorithm. Only useful if either ignore_worker_failures or recreate_failed_workers is True. Note that for restart_failed_sub_environments and sub-environment failures, the worker itself is NOT affected and won’t throw any errors as the flawed sub-environment is silently restarted under the hood.

  • no_done_at_end – If True, don’t set a ‘done=True’ at the end of the episode.

  • preprocessor_pref – Whether to use “rllib” or “deepmind” preprocessors by default. Set to None for using no preprocessor. In this case, the model will have to handle possibly complex observations from the environment.

  • observation_filter – Element-wise observation filter, either “NoFilter” or “MeanStdFilter”.

  • synchronize_filter – Whether to synchronize the statistics of remote filters.

  • compress_observations – Whether to LZ4 compress individual observations in the SampleBatches collected during rollouts.

  • enable_tf1_exec_eagerly – Explicitly tells the rollout worker to enable TF eager execution. This is useful for example when framework is “torch”, but a TF2 policy needs to be restored for evaluation or league-based purposes.

  • sampler_perf_stats_ema_coef – If specified, perf stats are in EMAs. This is the coeff of how much new data points contribute to the averages. Default is None, which uses simple global average instead. The EMA update rule is: updated = (1 - ema_coef) * old + ema_coef * new

Returns

This updated AlgorithmConfig object.
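
A short sketch of a rollout setup (values are illustrative):

>>> from ray.rllib.algorithms.algorithm_config import AlgorithmConfig
>>> config = AlgorithmConfig().rollouts(
...     num_rollout_workers=4,
...     num_envs_per_worker=2,
...     rollout_fragment_length="auto",   # derive from train_batch_size
...     batch_mode="complete_episodes",   # never truncate episodes
... )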

training(gamma: Optional[float] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, lr: Optional[float] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, train_batch_size: Optional[int] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, model: Optional[dict] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, optimizer: Optional[dict] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, max_requests_in_flight_per_sampler_worker: Optional[int] = <ray.rllib.algorithms.algorithm_config._NotProvided object>) ray.rllib.algorithms.algorithm_config.AlgorithmConfig[source]

Sets the training related configuration.

Parameters
  • gamma – Float specifying the discount factor of the Markov Decision process.

  • lr – The default learning rate.

  • train_batch_size – Training batch size, if applicable.

  • model – Arguments passed into the policy model. See models/catalog.py for a full list of the available model options. TODO: Provide ModelConfig objects instead of dicts.

  • optimizer – Arguments to pass to the policy optimizer.

  • max_requests_in_flight_per_sampler_worker – Max number of in-flight requests to each sampling worker. See the FaultTolerantActorManager class for more details. Tuning these values is important when running experiments with large sample batches, where there is the risk that the object store may fill up, causing spilling of objects to disk. This can cause any asynchronous requests to become very slow, making your experiment run slowly as well. You can inspect the object store during your experiment via a call to ray memory on your head node, and by using the Ray dashboard. If you’re seeing that the object store is filling up, turn down the number of remote requests in flight, or enable compression of the collected timesteps in your experiment.

Returns

This updated AlgorithmConfig object.
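
For illustration, a minimal training setup on top of PPOConfig (values are arbitrary examples; fcnet_hiddens is one of the standard model catalog options):

>>> from ray.rllib.algorithms.ppo import PPOConfig
>>> config = PPOConfig().training(
...     gamma=0.99,
...     lr=5e-5,
...     train_batch_size=4000,
...     model={"fcnet_hiddens": [64, 64]},  # passed through to the model catalog
... )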

callbacks(callbacks_class) ray.rllib.algorithms.algorithm_config.AlgorithmConfig[source]

Sets the callbacks configuration.

Parameters

callbacks_class – Callbacks class, whose methods will be run during various phases of training and environment sample collection. See the DefaultCallbacks class and examples/custom_metrics_and_callbacks.py for more usage information.

Returns

This updated AlgorithmConfig object.
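
A minimal sketch of a custom callbacks class recording a per-episode custom metric (DefaultCallbacks is the documented base class; the metric name is arbitrary):

>>> from ray.rllib.algorithms.algorithm_config import AlgorithmConfig
>>> from ray.rllib.algorithms.callbacks import DefaultCallbacks
>>> class MyCallbacks(DefaultCallbacks):
...     def on_episode_end(self, *, episode, **kwargs):
...         # Record the episode length as a custom metric.
...         episode.custom_metrics["episode_len"] = episode.length
>>> config = AlgorithmConfig().callbacks(MyCallbacks)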

exploration(*, explore: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, exploration_config: Optional[dict] = <ray.rllib.algorithms.algorithm_config._NotProvided object>) ray.rllib.algorithms.algorithm_config.AlgorithmConfig[source]

Sets the config’s exploration settings.

Parameters
  • explore – Default exploration behavior, iff explore=None is passed into compute_action(s). Set to False for no exploration behavior (e.g., for evaluation).

  • exploration_config – A dict specifying the Exploration object’s config.

Returns

This updated AlgorithmConfig object.

evaluation(*, evaluation_interval: Optional[int] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, evaluation_duration: Optional[Union[int, str]] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, evaluation_duration_unit: Optional[str] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, evaluation_sample_timeout_s: Optional[float] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, evaluation_parallel_to_training: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, evaluation_config: Optional[Union[ray.rllib.algorithms.algorithm_config.AlgorithmConfig, dict]] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, off_policy_estimation_methods: Optional[Dict] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, ope_split_batch_by_episode: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, evaluation_num_workers: Optional[int] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, custom_evaluation_function: Optional[Callable] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, always_attach_evaluation_results: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, enable_async_evaluation: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, evaluation_num_episodes=-1) ray.rllib.algorithms.algorithm_config.AlgorithmConfig[source]

Sets the config’s evaluation settings.

Parameters
  • evaluation_interval – Evaluate every evaluation_interval training iterations. The evaluation stats will be reported under the “evaluation” metric key. Note that, for Ape-X, metrics are already reported only for the lowest-epsilon (least random) workers. Set to None (or 0) for no evaluation.

  • evaluation_duration – Duration for which to run evaluation each evaluation_interval. The unit for the duration can be set via evaluation_duration_unit to either “episodes” (default) or “timesteps”. If using multiple evaluation workers (evaluation_num_workers > 1), the load to run will be split amongst these. If the value is “auto”: - For evaluation_parallel_to_training=True: Will run as many episodes/timesteps as fit into the (parallel) training step. - For evaluation_parallel_to_training=False: Error.

  • evaluation_duration_unit – The unit, with which to count the evaluation duration. Either “episodes” (default) or “timesteps”.

  • evaluation_sample_timeout_s – The timeout (in seconds) for the ray.get call to the remote evaluation worker(s) sample() method. After this time, the user will receive a warning and instructions on how to fix the issue. This could be either to make sure the episode ends, increasing the timeout, or switching to evaluation_duration_unit=timesteps.

  • evaluation_parallel_to_training – Whether to run evaluation in parallel to an Algorithm.train() call, using threading. Default=False. E.g. evaluation_interval=2 -> For every other training iteration, the Algorithm.train() and Algorithm.evaluate() calls run in parallel. Note: This is experimental. Possible pitfalls could be race conditions for weight syncing at the beginning of the evaluation loop.

  • evaluation_config – Typical usage is to pass extra args to evaluation env creator and to disable exploration by computing deterministic actions. IMPORTANT NOTE: Policy gradient algorithms are able to find the optimal policy, even if this is a stochastic one. Setting “explore=False” here will result in the evaluation workers not using this optimal policy!

  • off_policy_estimation_methods – Specify how to evaluate the current policy, along with any optional config parameters. This only has an effect when reading offline experiences (“input” is not “sampler”). Available keys: {ope_method_name: {“type”: ope_type, …}} where ope_method_name is a user-defined string to save the OPE results under, and ope_type can be any subclass of OffPolicyEstimator, e.g. ray.rllib.offline.estimators.is::ImportanceSampling or your own custom subclass, or the full class path to the subclass. You can also add additional config arguments to be passed to the OffPolicyEstimator in the dict, e.g. {“qreg_dr”: {“type”: DoublyRobust, “q_model_type”: “qreg”, “k”: 5}}

  • ope_split_batch_by_episode – Whether to use SampleBatch.split_by_episode() to split the input batch into episodes before estimating the OPE metrics. For bandits, you should set this to False to speed up OPE evaluation; since each record is already one timestep, splitting by episode is unnecessary there. The default is True.

  • evaluation_num_workers – Number of parallel workers to use for evaluation. Note that this is set to zero by default, which means evaluation will be run in the algorithm process (only if evaluation_interval is not None). If you increase this, it will increase the Ray resource usage of the algorithm since evaluation workers are created separately from rollout workers (used to sample data for training).

  • custom_evaluation_function – Customize the evaluation method. This must be a function of signature (algo: Algorithm, eval_workers: WorkerSet) -> metrics: dict. See the Algorithm.evaluate() method to see the default implementation. The Algorithm guarantees all eval workers have the latest policy state before this function is called.

  • always_attach_evaluation_results – Make sure the latest available evaluation results are always attached to a step result dict. This may be useful if Tune or some other meta controller needs access to evaluation metrics all the time.

  • enable_async_evaluation – If True, use an AsyncRequestsManager for the evaluation workers and use this manager to send sample() requests to the evaluation workers. This way, the Algorithm becomes more robust against long running episodes and/or failing (and restarting) workers.

Returns

This updated AlgorithmConfig object.
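
A short sketch of an evaluation setup (values are illustrative; note the caveat above about explore=False for policy-gradient algorithms):

>>> from ray.rllib.algorithms.algorithm_config import AlgorithmConfig
>>> config = AlgorithmConfig().evaluation(
...     evaluation_interval=2,            # evaluate every 2 training iterations
...     evaluation_duration=10,
...     evaluation_duration_unit="episodes",
...     evaluation_num_workers=1,
...     evaluation_config={"explore": False},  # deterministic evaluation actions
... )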

offline_data(*, input_=<ray.rllib.algorithms.algorithm_config._NotProvided object>, input_config=<ray.rllib.algorithms.algorithm_config._NotProvided object>, actions_in_input_normalized=<ray.rllib.algorithms.algorithm_config._NotProvided object>, input_evaluation=<ray.rllib.algorithms.algorithm_config._NotProvided object>, postprocess_inputs=<ray.rllib.algorithms.algorithm_config._NotProvided object>, shuffle_buffer_size=<ray.rllib.algorithms.algorithm_config._NotProvided object>, output=<ray.rllib.algorithms.algorithm_config._NotProvided object>, output_config=<ray.rllib.algorithms.algorithm_config._NotProvided object>, output_compress_columns=<ray.rllib.algorithms.algorithm_config._NotProvided object>, output_max_file_size=<ray.rllib.algorithms.algorithm_config._NotProvided object>, offline_sampling=<ray.rllib.algorithms.algorithm_config._NotProvided object>) ray.rllib.algorithms.algorithm_config.AlgorithmConfig[source]

Sets the config’s offline data settings.

Parameters
  • input – Specify how to generate experiences: - “sampler”: Generate experiences via online (env) simulation (default). - A local directory or file glob expression (e.g., “/tmp/*.json”). - A list of individual file paths/URIs (e.g., [“/tmp/1.json”, “s3://bucket/2.json”]). - A dict with string keys and sampling probabilities as values (e.g., {“sampler”: 0.4, “/tmp/*.json”: 0.4, “s3://bucket/expert.json”: 0.2}). - A callable that takes an IOContext object as its only arg and returns a ray.rllib.offline.InputReader. - A string key that indexes a callable registered with tune.registry.register_input.

  • input_config – Arguments that describe the settings for reading the input. If input is “sampler”, this will be the environment configuration, e.g. env_name and env_config, etc. See EnvContext for more info. If the input is “dataset”, this will be e.g. format and path.

  • actions_in_input_normalized – True, if the actions in a given offline “input” are already normalized (between -1.0 and 1.0). This is usually the case when the offline file has been generated by another RLlib algorithm (e.g. PPO or SAC), while “normalize_actions” was set to True.

  • postprocess_inputs – Whether to run postprocess_trajectory() on the trajectory fragments from offline inputs. Note that postprocessing will be done using the current policy, not the behavior policy, which is typically undesirable for on-policy algorithms.

  • shuffle_buffer_size – If positive, input batches will be shuffled via a sliding window buffer of this number of batches. Use this if the input data is not in random enough order. Input is delayed until the shuffle buffer is filled.

  • output – Specify where experiences should be saved: - None: don’t save any experiences - “logdir” to save to the agent log dir - a path/URI to save to a custom output directory (e.g., “s3://bckt/”) - a function that returns an rllib.offline.OutputWriter

  • output_config – Arguments accessible from the IOContext for configuring custom output.

  • output_compress_columns – What sample batch columns to LZ4 compress in the output data.

  • output_max_file_size – Max output file size before rolling over to a new file.

  • offline_sampling – Whether sampling for the Algorithm happens via reading from offline data. If True, RolloutWorkers will NOT limit the number of collected batches within the same sample() call based on the number of sub-environments within the worker (no sub-environments present).

Returns

This updated AlgorithmConfig object.
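
For illustration, a hedged sketch that reads experiences from a (hypothetical) local JSON directory and writes any generated experiences to the agent log dir:

>>> from ray.rllib.algorithms.algorithm_config import AlgorithmConfig
>>> config = AlgorithmConfig().offline_data(
...     input_="/tmp/my_offline_data/",  # hypothetical path to JSON sample files
...     output="logdir",                 # write generated experiences to the agent log dir
... )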

multi_agent(*, policies=<ray.rllib.algorithms.algorithm_config._NotProvided object>, policy_map_capacity=<ray.rllib.algorithms.algorithm_config._NotProvided object>, policy_map_cache=<ray.rllib.algorithms.algorithm_config._NotProvided object>, policy_mapping_fn=<ray.rllib.algorithms.algorithm_config._NotProvided object>, policies_to_train=<ray.rllib.algorithms.algorithm_config._NotProvided object>, observation_fn=<ray.rllib.algorithms.algorithm_config._NotProvided object>, count_steps_by=<ray.rllib.algorithms.algorithm_config._NotProvided object>, replay_mode=-1) ray.rllib.algorithms.algorithm_config.AlgorithmConfig[source]

Sets the config’s multi-agent settings.

Validates the new multi-agent settings and translates everything into a unified multi-agent setup format. For example a policies list or set of IDs is properly converted into a dict mapping these IDs to PolicySpecs.

Parameters
  • policies – Map of type MultiAgentPolicyConfigDict from policy ids to either 4-tuples of (policy_cls, obs_space, act_space, config) or PolicySpecs. These tuples or PolicySpecs define the class of the policy, the observation- and action spaces of the policies, and any extra config.

  • policy_map_capacity – Keep this many policies in the “policy_map” (before writing least-recently used ones to disk/S3).

  • policy_map_cache – Where to store overflowing (least-recently used) policies? Could be a directory (str) or an S3 location. None for using the default output dir.

  • policy_mapping_fn – Function mapping agent ids to policy ids. The signature is: (agent_id, episode, worker, **kwargs) -> PolicyID.

  • policies_to_train – Determines those policies that should be updated. Options are: - None, for training all policies. - An iterable of PolicyIDs that should be trained. - A callable, taking a PolicyID and a SampleBatch or MultiAgentBatch and returning a bool (indicating whether the given policy is trainable or not, given the particular batch). This allows you to have a policy trained only on certain data (e.g. when playing against a certain opponent).

  • observation_fn – Optional function that can be used to enhance the local agent observations to include more state. See rllib/evaluation/observation_function.py for more info.

  • count_steps_by – Which metric to use as the “batch size” when building a MultiAgentBatch. The two supported values are: “env_steps”: Count each time the env is “stepped” (no matter how many multi-agent actions are passed/how many multi-agent observations have been returned in the previous step). “agent_steps”: Count each individual agent step as one step.

Returns

This updated AlgorithmConfig object.
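
A hedged sketch of a two-policy setup (the policy and agent IDs are made up; classes and spaces are left as None so they get inferred from the env):

>>> from ray.rllib.algorithms.ppo import PPOConfig
>>> from ray.rllib.policy.policy import PolicySpec
>>> config = PPOConfig().multi_agent(
...     policies={
...         "learned": PolicySpec(),  # class/spaces inferred later
...         "random": PolicySpec(),
...     },
...     # Hypothetical mapping: "agent_0" uses the learned policy, all others "random".
...     policy_mapping_fn=lambda agent_id, episode, worker, **kw: (
...         "learned" if agent_id == "agent_0" else "random"
...     ),
...     policies_to_train=["learned"],
... )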

is_multi_agent() bool[source]

Returns whether this config specifies a multi-agent setup.

Returns

True, if a) more than one policy is defined, OR b) exactly one policy is defined, but its ID is NOT DEFAULT_POLICY_ID.

reporting(*, keep_per_episode_custom_metrics: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, metrics_episode_collection_timeout_s: Optional[float] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, metrics_num_episodes_for_smoothing: Optional[int] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, min_time_s_per_iteration: Optional[int] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, min_train_timesteps_per_iteration: Optional[int] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, min_sample_timesteps_per_iteration: Optional[int] = <ray.rllib.algorithms.algorithm_config._NotProvided object>) ray.rllib.algorithms.algorithm_config.AlgorithmConfig[source]

Sets the config’s reporting settings.

Parameters
  • keep_per_episode_custom_metrics – Store raw custom metrics without calculating max, min, or mean.

  • metrics_episode_collection_timeout_s – Wait for metric batches for at most this many seconds. Those that have not returned in time will be collected in the next train iteration.

  • metrics_num_episodes_for_smoothing – Smooth rollout metrics over this many episodes, if possible. In case rollouts (sample collection) just started, there may be fewer than this many episodes in the buffer and we’ll compute metrics over this smaller number of available episodes. In case there are more than this many episodes collected in a single training iteration, use all of these episodes for metrics computation, meaning don’t ever cut any “excess” episodes.

  • min_time_s_per_iteration – Minimum time to accumulate within a single train() call. This value does not affect learning, only the number of times Algorithm.training_step() is called by Algorithm.train(). If - after one such step attempt, the time taken has not reached min_time_s_per_iteration, will perform n more training_step() calls until the minimum time has been consumed. Set to 0 or None for no minimum time.

  • min_train_timesteps_per_iteration – Minimum training timesteps to accumulate within a single train() call. This value does not affect learning, only the number of times Algorithm.training_step() is called by Algorithm.train(). If - after one such step attempt, the training timestep count has not been reached, will perform n more training_step() calls until the minimum timesteps have been executed. Set to 0 or None for no minimum timesteps.

  • min_sample_timesteps_per_iteration – Minimum env sampling timesteps to accumulate within a single train() call. This value does not affect learning, only the number of times Algorithm.training_step() is called by Algorithm.train(). If - after one such step attempt, the env sampling timestep count has not been reached, will perform n more training_step() calls until the minimum timesteps have been executed. Set to 0 or None for no minimum timesteps.

Returns

This updated AlgorithmConfig object.

checkpointing(export_native_model_files: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>) ray.rllib.algorithms.algorithm_config.AlgorithmConfig[source]

Sets the config’s checkpointing settings.

Parameters

export_native_model_files – Whether an individual Policy- or the Algorithm’s checkpoints also contain (tf or torch) native model files. These could be used to restore just the NN models from these files w/o requiring RLlib. These files are generated by calling the tf- or torch- built-in saving utility methods on the actual models.

Returns

This updated AlgorithmConfig object.

debugging(*, logger_creator: Optional[Callable[[], ray.tune.logger.logger.Logger]] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, logger_config: Optional[dict] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, log_level: Optional[str] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, log_sys_usage: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, fake_sampler: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, seed: Optional[int] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, worker_cls: Optional[Type[ray.rllib.evaluation.rollout_worker.RolloutWorker]] = <ray.rllib.algorithms.algorithm_config._NotProvided object>) ray.rllib.algorithms.algorithm_config.AlgorithmConfig[source]

Sets the config’s debugging settings.

Parameters
  • logger_creator – Callable that creates a ray.tune.Logger object. If unspecified, a default logger is created.

  • logger_config – Define logger-specific configuration to be used inside the Logger. Default value None allows overwriting with nested dicts.

  • log_level – Set the ray.rllib.* log level for the agent process and its workers. Should be one of DEBUG, INFO, WARN, or ERROR. The DEBUG level will also periodically print out summaries of relevant internal dataflow (this is also printed out once at startup at the INFO level). When using the rllib train command, you can also use the -v and -vv flags as shorthand for INFO and DEBUG.

  • log_sys_usage – Log system resource metrics to results. This requires psutil to be installed for sys stats, and gputil for GPU metrics.

  • fake_sampler – Use fake (infinite speed) sampler. For testing only.

  • seed – This argument, in conjunction with worker_index, sets the random seed of each worker, so that identically configured trials will have identical results. This makes experiments reproducible.

  • worker_cls – Use a custom RolloutWorker type for unit testing purpose.

Returns

This updated AlgorithmConfig object.
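
For example (a sketch; the seed value is arbitrary):

>>> from ray.rllib.algorithms.algorithm_config import AlgorithmConfig
>>> config = AlgorithmConfig().debugging(
...     log_level="INFO",
...     seed=42,  # make identically configured trials reproducible
... )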

experimental(*, _tf_policy_handles_more_than_one_loss=<ray.rllib.algorithms.algorithm_config._NotProvided object>, _disable_preprocessor_api=<ray.rllib.algorithms.algorithm_config._NotProvided object>, _disable_action_flattening=<ray.rllib.algorithms.algorithm_config._NotProvided object>, _disable_execution_plan_api=<ray.rllib.algorithms.algorithm_config._NotProvided object>) ray.rllib.algorithms.algorithm_config.AlgorithmConfig[source]

Sets the config’s experimental settings.

Parameters
  • _tf_policy_handles_more_than_one_loss – Experimental flag. If True, TFPolicy will handle more than one loss/optimizer. Set this to True, if you would like to return more than one loss term from your loss_fn and an equal number of optimizers from your optimizer_fn. In the future, the default for this will be True.

  • _disable_preprocessor_api – Experimental flag. If True, no (observation) preprocessor will be created and observations will arrive in model as they are returned by the env. In the future, the default for this will be True.

  • _disable_action_flattening – Experimental flag. If True, RLlib will no longer flatten the policy-computed actions into a single tensor (for storage in SampleCollectors/output files/etc..), but leave (possibly nested) actions as-is. Disabling flattening affects: - SampleCollectors: Have to store possibly nested action structs. - Models that have the previous action(s) as part of their input. - Algorithms reading from offline files (incl. action information).

  • _disable_execution_plan_api – Experimental flag. If True, the execution plan API will not be used. Instead, an Algorithm’s training_iteration method will be called as-is each training iteration.

Returns

This updated AlgorithmConfig object.

get_rollout_fragment_length(worker_index: int = 0) int[source]

Automatically infers a proper rollout_fragment_length setting if “auto”.

Uses the simple formula: rollout_fragment_length = train_batch_size / (num_envs_per_worker * num_rollout_workers)

If the result is a fraction AND worker_index is provided, makes some of those workers add one additional timestep, such that the overall batch size (across the workers) adds up to exactly the train_batch_size.

Returns

The user-provided rollout_fragment_length or a computed one (if user value is “auto”).
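
As a worked example of the formula above: with train_batch_size=4000, num_rollout_workers=4, and num_envs_per_worker=2, rollout_fragment_length="auto" resolves to 4000 / (2 * 4) = 500 steps per fragment (a sketch; actual values depend on your config):

>>> from ray.rllib.algorithms.algorithm_config import AlgorithmConfig
>>> config = (
...     AlgorithmConfig()
...     .training(train_batch_size=4000)
...     .rollouts(num_rollout_workers=4, num_envs_per_worker=2,
...               rollout_fragment_length="auto")
... )
>>> config.get_rollout_fragment_length()  # 4000 / (2 * 4)
500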

get_evaluation_config_object() Optional[ray.rllib.algorithms.algorithm_config.AlgorithmConfig][source]

Creates a full AlgorithmConfig object from self.evaluation_config.

Returns

A fully valid AlgorithmConfig object that can be used for the evaluation WorkerSet. If self is already an evaluation config object, return None.

get_multi_agent_setup(*, policies: Optional[Dict[str, PolicySpec]] = None, env: Optional[Any] = None, spaces: Optional[Dict[str, Tuple[gym.spaces.Space, gym.spaces.Space]]] = None, default_policy_class: Optional[Type[ray.rllib.policy.policy.Policy]] = None) Tuple[Dict[str, PolicySpec], Callable[[str, Union[SampleBatch, MultiAgentBatch]], bool]][source]

Compiles complete multi-agent config (dict) from the information in self.

Infers the observation- and action spaces, the policy classes, and the policy’s configs. The returned MultiAgentPolicyConfigDict is fully unified and strictly maps PolicyIDs to complete PolicySpec objects (with all their fields not-None).

Examples

>>> import numpy as np
>>> from ray.rllib.algorithms.ppo import PPOConfig
>>> config = (
...   PPOConfig()
...   .environment("CartPole-v1")
...   .framework("torch")
...   .multi_agent(policies={"pol1", "pol2"}, policies_to_train=["pol1"])
... )
>>> policy_dict, is_policy_to_train = \  
...     config.get_multi_agent_setup()
>>> is_policy_to_train("pol1") 
True
>>> is_policy_to_train("pol2") 
False
>>> print(policy_dict) 
{
  "pol1": PolicySpec(
    PPOTorchPolicyV2,  # infered from Algo's default policy class
    Box(-2.0, 2.0, (4,), np.float),  # infered from env
    Discrete(2),  # infered from env
    {},  # not provided -> empty dict
  ),
  "pol2": PolicySpec(
    PPOTorchPolicyV2,  # infered from Algo's default policy class
    Box(-2.0, 2.0, (4,), np.float),  # infered from env
    Discrete(2),  # infered from env
    {},  # not provided -> empty dict
  ),
}
Parameters
  • policies – An optional multi-agent policies dict, mapping policy IDs to PolicySpec objects. If not provided, will use self.policies instead. Note that the policy_class, observation_space, and action_space properties in these PolicySpecs may be None and must therefore be inferred here.

  • env – An optional env instance, from which to infer the different spaces for the different policies. If not provided, will try to infer from spaces, otherwise from self.observation_space and self.action_space. If no information on spaces can be inferred, will raise an error.

  • spaces – Optional dict mapping policy IDs to tuples of 1) observation space and 2) action space that should be used for the respective policy. These spaces are usually provided by an already instantiated remote RolloutWorker. If not provided, will try to infer from env, otherwise from self.observation_space and self.action_space. If no information on spaces can be inferred, will raise an error.

  • default_policy_class – The Policy class to use should a PolicySpec have its policy_class property set to None.

Returns

A tuple consisting of 1) a MultiAgentPolicyConfigDict and 2) a is_policy_to_train(PolicyID, SampleBatchType) -> bool callable.

Raises
  • ValueError – In case no spaces can be inferred for the policy/ies.

  • ValueError – In case two agents in the env map to the same PolicyID (according to self.policy_mapping_fn), but have different action or observation spaces according to the inferred space information.

validate_train_batch_size_vs_rollout_fragment_length() None[source]

Detects mismatches for train_batch_size vs rollout_fragment_length.

Only applicable to algorithms whose train_batch_size should be directly dependent on rollout_fragment_length (synchronous sampling, on-policy PG algos).

If rollout_fragment_length != “auto”, makes sure that the product of rollout_fragment_length x num_rollout_workers x num_envs_per_worker roughly (within 10%) matches the provided train_batch_size. Otherwise, raises an error asking the user to set rollout_fragment_length to “auto” or to a matching value.

Also, only checks this if train_batch_size > 0 (DDPPO sets this to -1 to auto-calculate the actual batch size later).

Raises
  • ValueError – If there is a mismatch between the user-provided rollout_fragment_length and train_batch_size.

get(key, default=None)[source]

Shim method to help pretend we are a dict.

pop(key, default=None)[source]

Shim method to help pretend we are a dict.

keys()[source]

Shim method to help pretend we are a dict.

values()[source]

Shim method to help pretend we are a dict.

items()[source]

Shim method to help pretend we are a dict.

property multiagent

Shim method to help pretend we are a dict with ‘multiagent’ key.

Building Custom Algorithm Classes

Warning

As of Ray >= 1.9, it is no longer recommended to use the build_trainer() utility function for creating custom Algorithm sub-classes. Instead, follow the simple guidelines here for directly sub-classing from Algorithm.

In order to create a custom Algorithm, sub-class the Algorithm class and override one or more of its methods, most notably training_step(), and possibly setup() or get_default_policy_class() (all documented below).

See here for an example of how to override Algorithm.

Interacting with an Algorithm

Once you’ve built an AlgorithmConfig and retrieved an Algorithm from it via the build() method, you can use the Algorithm to train and evaluate your experiments.

Here’s the full Algorithm API reference.

class ray.rllib.algorithms.algorithm.Algorithm(config: Optional[ray.rllib.algorithms.algorithm_config.AlgorithmConfig] = None, env=None, logger_creator: Optional[Callable[[], ray.tune.logger.logger.Logger]] = None, **kwargs)[source]

An RLlib algorithm responsible for optimizing one or more Policies.

Algorithms contain a WorkerSet under self.workers. A WorkerSet is normally composed of a single local worker (self.workers.local_worker()), used to compute and apply learning updates, and optionally one or more remote workers used to generate environment samples in parallel. The WorkerSet is fault tolerant and elastic: it tracks health states for all the managed remote worker actors. As a result, an Algorithm should never access the underlying actor handles directly. Instead, always access them via the WorkerSet’s foreach APIs, using the assigned IDs of the underlying workers.

Each worker (remote or local) contains a PolicyMap, which itself may contain either one policy for single-agent training or one or more policies for multi-agent training. Policies are synchronized automatically from time to time using ray.remote calls. The exact synchronization logic depends on the specific algorithm used, but this usually happens from the local worker to all remote workers and after each training update.

You can write your own Algorithm classes by sub-classing from Algorithm or any of its built-in sub-classes. This allows you to override the training_step method to implement your own algorithm logic. You can find the different built-in algorithms’ training_step() methods in their respective main .py files, e.g. rllib.algorithms.dqn.dqn.py or rllib.algorithms.impala.impala.py.

The most important API methods an Algorithm exposes are train(), evaluate(), save() and restore().

static from_checkpoint(checkpoint: Union[str, ray.air.checkpoint.Checkpoint], policy_ids: Optional[Container[str]] = None, policy_mapping_fn: Optional[Callable[[Any, int], str]] = None, policies_to_train: Optional[Union[Container[str], Callable[[str, Optional[Union[SampleBatch, MultiAgentBatch]]], bool]]] = None) Algorithm[source]

Creates a new algorithm instance from a given checkpoint.

Note: This method must remain backward compatible from 2.0.0 on.

Parameters
  • checkpoint – The path (str) to the checkpoint directory to use or an AIR Checkpoint instance to restore from.

  • policy_ids – Optional list of PolicyIDs to recover. This allows users to restore an Algorithm with only a subset of the originally present Policies.

  • policy_mapping_fn – An optional (updated) policy mapping function to use from here on.

  • policies_to_train – An optional list of policy IDs to be trained or a callable taking PolicyID and SampleBatchType and returning a bool (trainable or not?). If None, will keep the existing setup in place. Policies, whose IDs are not in the list (or for which the callable returns False) will not be updated.

Returns

The instantiated Algorithm.
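
A minimal sketch of the save/restore round trip (assuming algo.save() returns the checkpoint directory path, as in current RLlib versions):

>>> from ray.rllib.algorithms.algorithm import Algorithm
>>> from ray.rllib.algorithms.ppo import PPOConfig
>>> algo = PPOConfig().environment(env="CartPole-v1").build()
>>> checkpoint_path = algo.save()                   # write a checkpoint to disk
>>> restored = Algorithm.from_checkpoint(checkpoint_path)
>>> results = restored.train()                      # continue training from the checkpoint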

static from_state(state: Dict) ray.rllib.algorithms.algorithm.Algorithm[source]

Recovers an Algorithm from a state object.

The state of an instantiated Algorithm can be retrieved by calling its get_state method. It contains all information necessary to create the Algorithm from scratch. No access to the original code (e.g. configs, knowledge of the Algorithm’s class, etc.) is needed.

Parameters

state – The state to recover a new Algorithm instance from.

Returns

A new Algorithm instance.

__init__(config: Optional[ray.rllib.algorithms.algorithm_config.AlgorithmConfig] = None, env=None, logger_creator: Optional[Callable[[], ray.tune.logger.logger.Logger]] = None, **kwargs)[source]

Initializes an Algorithm instance.

Parameters
  • config – Algorithm-specific configuration object.

  • logger_creator – Callable that creates a ray.tune.Logger object. If unspecified, a default logger is created.

  • **kwargs – Arguments passed to the Trainable base class.

setup(config: ray.rllib.algorithms.algorithm_config.AlgorithmConfig) None[source]

Subclasses should override this for custom initialization.

New in version 0.8.7.

Parameters

config – Hyperparameters and other configs given. Copy of self.config.

classmethod get_default_policy_class(config: ray.rllib.algorithms.algorithm_config.AlgorithmConfig) Optional[Type[ray.rllib.policy.policy.Policy]][source]

Returns a default Policy class to use, given a config.

This class will be used by an Algorithm in case the policy class is not provided by the user in any single- or multi-agent PolicySpec.

step() dict[source]

Implements the main Trainer.train() logic.

Takes n attempts to perform a single training step. Thereby catches RayErrors resulting from worker failures. After n attempts, fails gracefully.

Override this method in your Trainer sub-classes if you would like to handle worker failures yourself. Otherwise, override only training_step() to implement the core algorithm logic.

Returns

The results dict with stats/infos on sampling, training, and - if required - evaluation.

evaluate(duration_fn: Optional[Callable[[int], int]] = None) dict[source]

Evaluates current policy under evaluation_config settings.

Note that this default implementation does not do anything beyond merging evaluation_config with the normal trainer config.

Parameters

duration_fn – An optional callable taking the already run num episodes as only arg and returning the number of episodes left to run. It’s used to find out whether evaluation should continue.

restore_workers(workers: ray.rllib.evaluation.worker_set.WorkerSet)[source]

Try to restore failed workers if necessary.

Algorithms that use custom RolloutWorkers may override this method to disable the default behavior and implement their own restoration logic.

Parameters

workers – The WorkerSet to restore. This may be Rollout or Evaluation workers.

training_step() dict[source]

Default single iteration logic of an algorithm.

  • Collect on-policy samples (SampleBatches) in parallel using the Trainer’s RolloutWorkers (@ray.remote).

  • Concatenate collected SampleBatches into one train batch.

  • Note that we may have more than one policy in the multi-agent case: Call the different policies’ learn_on_batch (simple optimizer) OR load_batch_into_buffer + learn_on_loaded_batch (multi-GPU optimizer) methods to calculate loss and update the model(s).

  • Return all collected metrics for the iteration.

Returns

The results dict from executing the training iteration.
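
To customize the per-iteration logic of an existing algorithm, override training_step() rather than step(); a minimal sketch that wraps PPO's default implementation and attaches a made-up metric:

>>> from ray.rllib.algorithms.ppo import PPO
>>> class MyPPO(PPO):
...     def training_step(self) -> dict:
...         # Run PPO's default sampling + update logic.
...         results = super().training_step()
...         # Attach a hypothetical custom metric to the results dict.
...         results["my_custom_metric"] = 1.0
...         return results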

compute_single_action(observation: Optional[Union[numpy.array, tf.Tensor, torch.Tensor, dict, tuple]] = None, state: Optional[List[Union[numpy.array, tf.Tensor, torch.Tensor, dict, tuple]]] = None, *, prev_action: Optional[Union[numpy.array, tf.Tensor, torch.Tensor, dict, tuple]] = None, prev_reward: Optional[float] = None, info: Optional[dict] = None, input_dict: Optional[ray.rllib.policy.sample_batch.SampleBatch] = None, policy_id: str = 'default_policy', full_fetch: bool = False, explore: Optional[bool] = None, timestep: Optional[int] = None, episode: Optional[ray.rllib.evaluation.episode.Episode] = None, unsquash_action: Optional[bool] = None, clip_action: Optional[bool] = None, unsquash_actions=- 1, clip_actions=- 1, **kwargs) Union[numpy.array, tf.Tensor, torch.Tensor, dict, tuple, Tuple[Union[numpy.array, tf.Tensor, torch.Tensor, dict, tuple], List[Union[numpy.array, tf.Tensor, torch.Tensor]], Dict[str, Union[numpy.array, tf.Tensor, torch.Tensor]]]][source]

Computes an action for the specified policy on the local worker.

Note that you can also access the policy object through self.get_policy(policy_id) and call compute_single_action() on it directly.

Parameters
  • observation – Single (unbatched) observation from the environment.

  • state – List of all RNN hidden (single, unbatched) state tensors.

  • prev_action – Single (unbatched) previous action value.

  • prev_reward – Single (unbatched) previous reward value.

  • info – Env info dict, if any.

  • input_dict – An optional SampleBatch that holds all the values for: obs, state, prev_action, and prev_reward, plus possibly custom defined views of the current env trajectory. Note that exactly one of observation or input_dict must be non-None.

  • policy_id – Policy to query (only applies to multi-agent). Default: “default_policy”.

  • full_fetch – Whether to return extra action fetch results. This is always set to True if state is specified.

  • explore – Whether to apply exploration to the action. Default: None -> use self.config[“explore”].

  • timestep – The current (sampling) time step.

  • episode – This provides access to all of the internal episodes’ state, which may be useful for model-based or multi-agent algorithms.

  • unsquash_action – Should actions be unsquashed according to the env’s/Policy’s action space? If None, use the value of self.config[“normalize_actions”].

  • clip_action – Should actions be clipped according to the env’s/Policy’s action space? If None, use the value of self.config[“clip_actions”].

Keyword Arguments

kwargs – forward compatibility placeholder

Returns

The computed action if full_fetch=False, or the full output of policy.compute_actions() (a tuple of action, RNN state outputs, and extra action fetches) if full_fetch=True or the Policy is RNN-based.

Raises

KeyError – If the policy_id cannot be found in this Trainer’s local worker.
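
A typical inference loop against a trained algorithm might look as follows (sketch; assumes a CartPole-trained algo and the classic 4-tuple gym step API):

>>> import gym
>>> env = gym.make("CartPole-v1")
>>> obs = env.reset()
>>> done = False
>>> while not done:
...     action = algo.compute_single_action(obs, explore=False)
...     obs, reward, done, info = env.step(action)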

compute_actions(observations: Union[numpy.array, tf.Tensor, torch.Tensor, dict, tuple], state: Optional[List[Union[numpy.array, tf.Tensor, torch.Tensor, dict, tuple]]] = None, *, prev_action: Optional[Union[numpy.array, tf.Tensor, torch.Tensor, dict, tuple]] = None, prev_reward: Optional[Union[numpy.array, tf.Tensor, torch.Tensor, dict, tuple]] = None, info: Optional[dict] = None, policy_id: str = 'default_policy', full_fetch: bool = False, explore: Optional[bool] = None, timestep: Optional[int] = None, episodes: Optional[List[ray.rllib.evaluation.episode.Episode]] = None, unsquash_actions: Optional[bool] = None, clip_actions: Optional[bool] = None, normalize_actions=None, **kwargs)[source]

Computes an action for the specified policy on the local Worker.

Note that you can also access the policy object through self.get_policy(policy_id) and call compute_actions() on it directly.

Parameters
  • observations – Observations from the environment.

  • state – RNN hidden state, if any. If state is given, the full output is returned (computed actions, RNN state(s), logits dictionary); otherwise, only the computed actions are returned.

  • prev_action – Previous action value, if any.

  • prev_reward – Previous reward, if any.

  • info – Env info dict, if any.

  • policy_id – Policy to query (only applies to multi-agent).

  • full_fetch – Whether to return extra action fetch results. This is always set to True if RNN state is specified.

  • explore – Whether to pick an exploitation or exploration action (default: None -> use self.config[“explore”]).

  • timestep – The current (sampling) time step.

  • episodes – This provides access to all of the internal episodes’ state, which may be useful for model-based or multi-agent algorithms.

  • unsquash_actions – Should actions be unsquashed according to the env’s/Policy’s action space? If None, use self.config[“normalize_actions”].

  • clip_actions – Should actions be clipped according to the env’s/Policy’s action space? If None, use self.config[“clip_actions”].

Keyword Arguments

kwargs – forward compatibility placeholder

Returns

The computed action if full_fetch=False, or a tuple consisting of the full output of policy.compute_actions_from_input_dict() if full_fetch=True or we have an RNN-based Policy.
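
In contrast to compute_single_action, this method expects a dict of observations keyed by (agent or episode) IDs and returns a dict of actions under the same keys; a sketch with hypothetical placeholders obs0 and obs1:

>>> # obs0 / obs1 are hypothetical per-agent observations from the env.
>>> actions = algo.compute_actions({"agent_0": obs0, "agent_1": obs1})
>>> # `actions` maps "agent_0" and "agent_1" to their computed actions.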

get_policy(policy_id: str = 'default_policy') ray.rllib.policy.policy.Policy[source]

Return policy for the specified id, or None.

Parameters

policy_id – ID of the policy to return.

get_weights(policies: Optional[List[str]] = None) dict[source]

Return a dictionary of policy ids to weights.

Parameters

policies – Optional list of policies to return weights for, or None for all policies.

set_weights(weights: Dict[str, dict])[source]

Set policy weights by policy id.

Parameters

weights – Map of policy ids to weights to set.
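
Together, get_weights() and set_weights() can be used to copy weights between Algorithm instances, e.g. for weight syncing in self-play; a sketch assuming two compatible algorithms algo_a and algo_b:

>>> weights = algo_a.get_weights(["default_policy"])
>>> # Overwrite algo_b's default policy with algo_a's weights.
>>> algo_b.set_weights(weights)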

add_policy(policy_id: str, policy_cls: Optional[Type[ray.rllib.policy.policy.Policy]] = None, policy: Optional[ray.rllib.policy.policy.Policy] = None, *, observation_space: Optional[gym.spaces.Space] = None, action_space: Optional[gym.spaces.Space] = None, config: Optional[Union[ray.rllib.algorithms.algorithm_config.AlgorithmConfig, dict]] = None, policy_state: Optional[Dict[str, Union[numpy.array, tf.Tensor, torch.Tensor, dict, tuple]]] = None, policy_mapping_fn: Optional[Callable[[Any, int], str]] = None, policies_to_train: Optional[Union[Container[str], Callable[[str, Optional[Union[SampleBatch, MultiAgentBatch]]], bool]]] = None, evaluation_workers: bool = True, workers: Optional[List[Union[ray.rllib.evaluation.rollout_worker.RolloutWorker, ray.actor.ActorHandle]]] = -1) Optional[ray.rllib.policy.policy.Policy][source]

Adds a new policy to this Algorithm.

Parameters
  • policy_id – ID of the policy to add. IMPORTANT: Must not contain characters that are disallowed in Unix/Windows filesystems, such as <>:"/|?*, and must not end with a dot, space, or backslash.

  • policy_cls – The Policy class to use for constructing the new Policy. Note: Only one of policy_cls or policy must be provided.

  • policy – The Policy instance to add to this algorithm. If not None, the given Policy object will be directly inserted into the Algorithm’s local worker and clones of that Policy will be created on all remote workers as well as all evaluation workers. Note: Only one of policy_cls or policy must be provided.

  • observation_space – The observation space of the policy to add. If None, try to infer this space from the environment.

  • action_space – The action space of the policy to add. If None, try to infer this space from the environment.

  • config – The config object or overrides for the policy to add.

  • policy_state – Optional state dict to apply to the new policy instance, right after its construction.

  • policy_mapping_fn – An optional (updated) policy mapping function to use from here on. Note that already ongoing episodes will not change their mapping but will use the old mapping till the end of the episode.

  • policies_to_train – An optional list of policy IDs to be trained or a callable taking PolicyID and SampleBatchType and returning a bool (trainable or not?). If None, will keep the existing setup in place. Policies, whose IDs are not in the list (or for which the callable returns False) will not be updated.

  • evaluation_workers – Whether to add the new policy also to the evaluation WorkerSet.

  • workers – A list of RolloutWorker/ActorHandles (remote RolloutWorkers) to add this policy to. If defined, will only add the given policy to these workers.

Returns

The newly added policy (the copy that got added to the local worker). If workers was provided, None is returned.
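
A common pattern (e.g. in self-play) is to add a clone of an existing policy under a new ID and re-map some agents to it; a sketch (policy ID and mapping function are illustrative, and the mapping function's exact signature may vary across RLlib versions):

>>> main_policy = algo.get_policy("default_policy")
>>> new_policy = algo.add_policy(
...     policy_id="opponent_v1",
...     policy_cls=type(main_policy),
...     policy_state=main_policy.get_state(),
...     policy_mapping_fn=lambda agent_id, episode, **kwargs: "opponent_v1",
... )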

remove_policy(policy_id: str = 'default_policy', *, policy_mapping_fn: Optional[Callable[[Any], str]] = None, policies_to_train: Optional[Union[Container[str], Callable[[str, Optional[Union[SampleBatch, MultiAgentBatch]]], bool]]] = None, evaluation_workers: bool = True) None[source]

Removes a policy from this Algorithm.

Parameters
  • policy_id – ID of the policy to be removed.

  • policy_mapping_fn – An optional (updated) policy mapping function to use from here on. Note that already ongoing episodes will not change their mapping but will use the old mapping till the end of the episode.

  • policies_to_train – An optional list of policy IDs to be trained or a callable taking PolicyID and SampleBatchType and returning a bool (trainable or not?). If None, will keep the existing setup in place. Policies, whose IDs are not in the list (or for which the callable returns False) will not be updated.

  • evaluation_workers – Whether to also remove the policy from the evaluation WorkerSet.
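
The counterpart to the add_policy sketch above, removing the (hypothetical) policy again and mapping all agents back to the default policy:

>>> algo.remove_policy(
...     "opponent_v1",
...     policy_mapping_fn=lambda agent_id, **kwargs: "default_policy",
... )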

export_policy_model(export_dir: str, policy_id: str = 'default_policy', onnx: Optional[int] = None) None[source]

Exports policy model with given policy_id to a local directory.

Parameters
  • export_dir – Writable local directory.

  • policy_id – Optional policy id to export.

  • onnx – If given, will export the model in ONNX format. The value of this parameter sets the ONNX OpSet version to use. If None, the output format will be DL-framework specific.

Example

>>> from ray.rllib.algorithms.ppo import PPO
>>> # Use an Algorithm from RLlib or define your own.
>>> algo = PPO(...) 
>>> for _ in range(10): 
>>>     algo.train() 
>>> algo.export_policy_model("/tmp/dir") 
>>> algo.export_policy_model("/tmp/dir/onnx", onnx=1) 
export_policy_checkpoint(export_dir: str, filename_prefix=-1, policy_id: str = 'default_policy') None[source]

Exports Policy checkpoint to a local directory and returns an AIR Checkpoint.

Parameters
  • export_dir – Writable local directory to store the AIR Checkpoint information into.

  • policy_id – Optional policy ID to export. If not provided, will export “default_policy”. If policy_id does not exist in this Algorithm, will raise a KeyError.

Raises

KeyError – If policy_id cannot be found in this Algorithm.

Example

>>> from ray.rllib.algorithms.ppo import PPO
>>> # Use an Algorithm from RLlib or define your own.
>>> algo = PPO(...) 
>>> for _ in range(10): 
>>>     algo.train() 
>>> algo.export_policy_checkpoint("/tmp/export_dir") 
import_policy_model_from_h5(import_file: str, policy_id: str = 'default_policy') None[source]

Imports a policy’s model with given policy_id from a local h5 file.

Parameters
  • import_file – The h5 file to import from.

  • policy_id – Optional policy id to import into.

Example

>>> from ray.rllib.algorithms.ppo import PPO
>>> algo = PPO(...) 
>>> algo.import_policy_model_from_h5("/tmp/weights.h5") 
>>> for _ in range(10): 
>>>     algo.train() 
save_checkpoint(checkpoint_dir: str) str[source]

Exports AIR Checkpoint to a local directory and returns its directory path.

The structure of an Algorithm checkpoint dir will be as follows:

policies/
    pol_1/
        policy_state.pkl
    pol_2/
        policy_state.pkl
rllib_checkpoint.json
algorithm_state.pkl

Note: rllib_checkpoint.json contains a “version” key (e.g. with value 0.1) that helps RLlib remain backward compatible when restoring checkpoints created with Ray 2.0 or later.

Parameters

checkpoint_dir – The directory where the checkpoint files will be stored.

Returns

The path to the created AIR Checkpoint directory.
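
save_checkpoint() is normally invoked for you by Tune (or indirectly via Algorithm.save()); calling it directly might look like this (sketch, path hypothetical):

>>> import os
>>> os.makedirs("/tmp/my_algo_ckpt", exist_ok=True)
>>> path = algo.save_checkpoint("/tmp/my_algo_ckpt")
>>> # Later, e.g. in a fresh process, restore from the written directory.
>>> restored = Algorithm.from_checkpoint(path)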

load_checkpoint(checkpoint: Union[Dict, str]) None[source]

Subclasses should override this to implement restore().

Warning

In this method, do not rely on absolute paths. The absolute path of the checkpoint_dir used in Trainable.save_checkpoint may be changed.

If Trainable.save_checkpoint returned a prefixed string, the prefix of the checkpoint string returned by Trainable.save_checkpoint may be changed. This is because trial pausing depends on temporary directories.

The directory structure under the checkpoint_dir provided to Trainable.save_checkpoint is preserved.

See the examples below.

Example

>>> import os
>>> from ray.tune.trainable import Trainable
>>> class Example(Trainable):
...    def save_checkpoint(self, checkpoint_path):
...        my_checkpoint_path = os.path.join(checkpoint_path, "my/path")
...        return my_checkpoint_path
...    def load_checkpoint(self, my_checkpoint_path):
...        print(my_checkpoint_path)
>>> trainer = Example()
>>> # This is used when PAUSED.
>>> obj = trainer.save_to_object() 
<logdir>/tmpc8k_c_6hsave_to_object/checkpoint_0/my/path
>>> # Note the different prefix.
>>> trainer.restore_from_object(obj) 
<logdir>/tmpb87b5axfrestore_from_object/checkpoint_0/my/path

If Trainable.save_checkpoint returned a dict, then Tune will directly pass the dict data as the argument to this method.

Example

>>> from ray.tune.trainable import Trainable
>>> class Example(Trainable):
...    def save_checkpoint(self, checkpoint_path):
...        return {"my_data": 1}
...    def load_checkpoint(self, checkpoint_dict):
...        print(checkpoint_dict["my_data"])

New in version 0.8.7.

Parameters

checkpoint – If dict, the return value is as returned by save_checkpoint. If a string, then it is a checkpoint path that may have a different prefix than that returned by save_checkpoint. The directory structure underneath the checkpoint_dir from save_checkpoint is preserved.

log_result(result: dict) None[source]

Subclasses can optionally override this to customize logging.

The logging here is done on the worker process rather than the driver.

New in version 0.8.7.

Parameters

result – Training result returned by step().

cleanup() None[source]

Subclasses should override this for any cleanup on stop.

If any Ray actors are launched in the Trainable (i.e., with an RLlib trainer), be sure to kill the Ray actor processes here.

This process should be lightweight.

You can kill a Ray actor by calling ray.kill(actor) on the actor, or by removing all references to it and waiting for garbage collection.

New in version 0.8.7.

classmethod default_resource_request(config: Union[ray.rllib.algorithms.algorithm_config.AlgorithmConfig, dict]) Union[ray.tune.resources.Resources, ray.tune.execution.placement_groups.PlacementGroupFactory][source]

Provides a static resource requirement for the given configuration.

This can be overridden by sub-classes to set the correct trial resource allocation, so the user does not need to.

@classmethod
def default_resource_request(cls, config):
    # PlacementGroupFactory lives in ray.tune.execution.placement_groups.
    # Reserve one CPU bundle for the trainer and one for a single rollout worker.
    return PlacementGroupFactory([{"CPU": 1}, {"CPU": 1}])
Parameters

config – The Trainable’s config dict.

Returns

A Resources object or a PlacementGroupFactory, consumed by Tune for queueing.

Return type

Union[Resources, PlacementGroupFactory]

classmethod resource_help(config: Union[ray.rllib.algorithms.algorithm_config.AlgorithmConfig, dict]) str[source]

Returns a help string for configuring this trainable’s resources.

Parameters

config – The Trainer’s config dict.

get_auto_filled_metrics(now: Optional[datetime.datetime] = None, time_this_iter: Optional[float] = None, debug_metrics_only: bool = False) dict[source]

Return a dict with metrics auto-filled by the trainable.

If debug_metrics_only is True, only metrics that don’t require at least one iteration will be returned (ray.tune.result.DEBUG_METRICS).

classmethod merge_trainer_configs(config1: dict, config2: dict, _allow_unknown_configs: Optional[bool] = None) dict[source]

Merges a complete Algorithm config dict with a partial override dict.

Respects nested structures within the config dicts. The values in the partial override dict take priority.

Parameters
  • config1 – The complete Algorithm’s dict to be merged (overridden) with config2.

  • config2 – The partial override config dict to merge on top of config1.

  • _allow_unknown_configs – If True, keys in config2 that don’t exist in config1 are allowed and will be added to the final config.

Returns

The merged full algorithm config dict.
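
A sketch of merging a partial override on top of a complete config dict (the chosen keys are ordinary RLlib config keys, used here only for illustration):

>>> from ray.rllib.algorithms.ppo import PPO, PPOConfig
>>> full_config = PPOConfig().to_dict()
>>> merged = PPO.merge_trainer_configs(
...     full_config, {"lr": 0.0001, "train_batch_size": 2000}
... )
>>> # merged["lr"] is now 0.0001; all other keys keep their values from full_config.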

static validate_env(env: Any, env_context: ray.rllib.env.env_context.EnvContext) None[source]

Env validator function for this Algorithm class.

Override this in child classes to define custom validation behavior.

Parameters
  • env – The (sub-)environment to validate. This is normally a single sub-environment (e.g. a gym.Env) within a vectorized setup.

  • env_context – The EnvContext to configure the environment.

Raises

Exception in case something is wrong with the given environment.
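
A sketch of a custom validator in an Algorithm subclass (the discrete-action-space requirement is purely illustrative):

>>> import gym
>>> from ray.rllib.algorithms.ppo import PPO
>>> class MyAlgo(PPO):
...     @staticmethod
...     def validate_env(env, env_context):
...         # Hypothetical requirement: only discrete action spaces are supported.
...         if not isinstance(env.action_space, gym.spaces.Discrete):
...             raise ValueError("MyAlgo only supports discrete action spaces.")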

import_model(import_file: str)[source]

Imports a model from import_file.

Note: Currently, only h5 files are supported.

Parameters

import_file – The file to import the model from.

Returns

A dict that maps ExportFormats to successfully exported models.
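
Sketch (the h5 file path is hypothetical):

>>> algo.import_model("/tmp/my_keras_weights.h5")
>>> algo.train()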