Algorithms#
The Algorithm class is the highest-level API in RLlib. It allows you to train and evaluate policies, save an experiment's progress, and restore from a prior saved experiment when continuing an RL run. Algorithm is a sub-class of Trainable and thus fully supports distributed hyperparameter tuning for RL.
A typical RLlib Algorithm object: the components sitting inside an Algorithm are normally N RolloutWorkers and zero or more @ray.remote BaseEnvs per worker.
Defining Algorithms with the AlgorithmConfig Class#
The AlgorithmConfig class is the primary way of configuring and building an Algorithm. You don't use AlgorithmConfig directly in practice, but rather one of its algorithm-specific implementations such as PPOConfig, each of which comes with its own set of arguments to its respective .training() method.
Here's how you work with an AlgorithmConfig.
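For instance, a minimal sketch of the typical workflow (assuming RLlib with PPO and the Gym CartPole-v1 env are available; all hyperparameter values are illustrative only):

from ray.rllib.algorithms.ppo import PPOConfig

# Build a PPO-specific config; generic settings (.environment(), .rollouts())
# chain with the PPO-specific .training() arguments.
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .rollouts(num_rollout_workers=2)
    .training(gamma=0.99, lr=5e-5, train_batch_size=4000)
)
algo = config.build()                        # construct the PPO Algorithm from the config
print(algo.train()["episode_reward_mean"])   # run one training iteration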
- class ray.rllib.algorithms.algorithm_config.AlgorithmConfig(algo_class=None)[source]
An RLlib AlgorithmConfig builds an RLlib Algorithm from a given configuration.
Example
>>> from ray.rllib.algorithms.algorithm_config import AlgorithmConfig
>>> from ray.rllib.algorithms.callbacks import MemoryTrackingCallbacks
>>> # Construct a generic config object, specifying values within different
>>> # sub-categories, e.g. "training".
>>> config = AlgorithmConfig().training(gamma=0.9, lr=0.01)
...     .environment(env="CartPole-v1")
...     .resources(num_gpus=0)
...     .rollouts(num_rollout_workers=4)
...     .callbacks(MemoryTrackingCallbacks)
>>> # A config object can be used to construct the respective Trainer.
>>> rllib_algo = config.build()
Example
>>> from ray.rllib.algorithms.algorithm_config import AlgorithmConfig
>>> from ray import tune
>>> # In combination with a tune.grid_search:
>>> config = AlgorithmConfig()
>>> config.training(lr=tune.grid_search([0.01, 0.001]))
>>> # Use `to_dict()` method to get the legacy plain python config dict
>>> # for usage with `tune.Tuner().fit()`.
>>> tune.Tuner(
...     "[registered trainer class]", param_space=config.to_dict()
... ).fit()
- classmethod from_dict(config_dict: dict) ray.rllib.algorithms.algorithm_config.AlgorithmConfig [source]
Creates an AlgorithmConfig from a legacy python config dict.
Examples
>>> from ray.rllib.algorithms.ppo.ppo import DEFAULT_CONFIG, PPOConfig
>>> ppo_config = PPOConfig.from_dict(DEFAULT_CONFIG)
>>> ppo = ppo_config.build(env="Pendulum-v1")
- Parameters
config_dict – The legacy formatted python config dict for some algorithm.
- Returns
A new AlgorithmConfig object that matches the given python config dict.
- to_dict() dict [source]
Converts all settings into a legacy config dict for backward compatibility.
- Returns
A complete AlgorithmConfigDict, usable in backward-compatible Tune/RLlib use cases, e.g. with tune.Tuner().fit().
- update_from_dict(config_dict: dict) ray.rllib.algorithms.algorithm_config.AlgorithmConfig [source]
Modifies this AlgorithmConfig via the provided python config dict.
Warns if config_dict contains deprecated keys. Silently sets even properties of self that do NOT exist. This way, this method may be used to configure custom Policies which do not have their own specific AlgorithmConfig classes, e.g. ray.rllib.examples.policy.random_policy::RandomPolicy.
- Parameters
config_dict – The old-style python config dict (PartialAlgorithmConfigDict) to use for overriding some properties defined in there.
- Returns
This updated AlgorithmConfig object.
- copy(copy_frozen: Optional[bool] = None) ray.rllib.algorithms.algorithm_config.AlgorithmConfig [source]
Creates a deep copy of this config and (un)freezes if necessary.
- Parameters
copy_frozen – Whether the created deep copy will be frozen or not. If None, keep the same frozen status that self currently has.
- Returns
A deep copy of self that is (un)frozen.
- freeze() None [source]
Freezes this config object, such that no attributes can be set anymore.
Algorithms should use this method to make sure that their config objects remain read-only after this.
- validate() None [source]
Validates all values in this config.
Note: This should NOT include immediate checks on single value correctness, e.g. “batch_mode” = [complete_episodes|truncate_episodes]. Those singular, independent checks should instead go directly into their respective methods.
- build(env: Optional[Union[str, Any]] = None, logger_creator: Optional[Callable[[], ray.tune.logger.logger.Logger]] = None, use_copy: bool = True) Algorithm [source]
Builds an Algorithm from this AlgorithmConfig (or a copy thereof).
- Parameters
env – Name of the environment to use (e.g. a gym-registered str), a full class path (e.g. “ray.rllib.examples.env.random_env.RandomEnv”), or an Env class directly. Note that this arg can also be specified via the “env” key in config.
logger_creator – Callable that creates a ray.tune.Logger object. If unspecified, a default logger is created.
use_copy – Whether to deepcopy self and pass the copy to the Algorithm (instead of self) as config. This is useful in case you would like to recycle the same AlgorithmConfig over and over, e.g. in a test case, in which we loop over different DL-frameworks.
- Returns
A ray.rllib.algorithms.algorithm.Algorithm object.
- python_environment(*, extra_python_environs_for_driver: Optional[dict] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, extra_python_environs_for_worker: Optional[dict] = <ray.rllib.algorithms.algorithm_config._NotProvided object>) ray.rllib.algorithms.algorithm_config.AlgorithmConfig [source]
Sets the config’s python environment settings.
- Parameters
extra_python_environs_for_driver – Any extra python env vars to set in the algorithm’s process, e.g., {“OMP_NUM_THREADS”: “16”}.
extra_python_environs_for_worker – Any extra python env vars to set for worker processes.
- Returns
This updated AlgorithmConfig object.
- resources(*, num_gpus: Optional[Union[float, int]] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, _fake_gpus: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, num_cpus_per_worker: Optional[Union[float, int]] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, num_gpus_per_worker: Optional[Union[float, int]] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, num_cpus_for_local_worker: Optional[int] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, custom_resources_per_worker: Optional[dict] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, placement_strategy: Optional[str] = <ray.rllib.algorithms.algorithm_config._NotProvided object>) ray.rllib.algorithms.algorithm_config.AlgorithmConfig [source]
Specifies resources allocated for an Algorithm and its ray actors/workers.
- Parameters
num_gpus – Number of GPUs to allocate to the algorithm process. Note that not all algorithms can take advantage of GPUs. Support for multi-GPU is currently only available for tf-[PPO/IMPALA/DQN/PG]. This can be fractional (e.g., 0.3 GPUs).
_fake_gpus – Set to True for debugging (multi-)GPU functionality on a CPU machine. GPU towers will be simulated by graphs located on CPUs in this case. Use num_gpus to test for different numbers of fake GPUs.
num_cpus_per_worker – Number of CPUs to allocate per worker.
num_gpus_per_worker – Number of GPUs to allocate per worker. This can be fractional. This is usually needed only if your env itself requires a GPU (i.e., it is a GPU-intensive video game), or model inference is unusually expensive.
custom_resources_per_worker – Any custom Ray resources to allocate per worker.
num_cpus_for_local_worker – Number of CPUs to allocate for the algorithm. Note: this only takes effect when running in Tune. Otherwise, the algorithm runs in the main program (driver).
placement_strategy – The strategy for the placement group factory returned by Algorithm.default_resource_request(). A PlacementGroup defines which devices (resources) should always be co-located on the same node. For example, an Algorithm with 2 rollout workers, running with num_gpus=1, will request a placement group with the bundles: [{“gpu”: 1, “cpu”: 1}, {“cpu”: 1}, {“cpu”: 1}], where the first bundle is for the driver and the other 2 bundles are for the two workers. These bundles can now be “placed” on the same or different nodes depending on the value of placement_strategy: “PACK”: Packs bundles into as few nodes as possible. “SPREAD”: Places bundles across distinct nodes as evenly as possible. “STRICT_PACK”: Packs bundles into one node; the group is not allowed to span multiple nodes. “STRICT_SPREAD”: Packs bundles across distinct nodes.
- Returns
This updated AlgorithmConfig object.
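For example, a hypothetical resource setup (values are illustrative; num_gpus only helps algorithms with GPU support):

from ray.rllib.algorithms.ppo import PPOConfig

config = PPOConfig().resources(
    num_gpus=0.5,                # fractional GPU share for the algorithm (driver) process
    num_cpus_per_worker=1,       # CPUs per rollout worker
    num_gpus_per_worker=0,       # rollout workers usually don't need GPUs
    placement_strategy="PACK",   # pack all bundles onto as few nodes as possible
)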
- framework(framework: Optional[str] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, *, eager_tracing: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, eager_max_retraces: Optional[int] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, tf_session_args: Optional[Dict[str, Any]] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, local_tf_session_args: Optional[Dict[str, Any]] = <ray.rllib.algorithms.algorithm_config._NotProvided object>) ray.rllib.algorithms.algorithm_config.AlgorithmConfig [source]
Sets the config’s DL framework settings.
- Parameters
framework – tf: TensorFlow (static-graph); tf2: TensorFlow 2.x (eager or traced, if eager_tracing=True); torch: PyTorch
eager_tracing – Enable tracing in eager mode. This greatly improves performance (speedup ~2x), but makes it slightly harder to debug since Python code won’t be evaluated after the initial eager pass. Only possible if framework=tf2.
eager_max_retraces – Maximum number of tf.function re-traces before a runtime error is raised. This is to prevent unnoticed retraces of methods inside the _eager_traced Policy, which could slow down execution by a factor of 4 without the user noticing the root cause of this slowdown. Only necessary for framework=tf2. Set to None to ignore the re-trace count and never throw an error.
tf_session_args – Configures TF for single-process operation by default.
local_tf_session_args – Overrides for the tf session args on the local worker.
- Returns
This updated AlgorithmConfig object.
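A short sketch of choosing the DL framework (eager_tracing only applies to framework="tf2"):

from ray.rllib.algorithms.ppo import PPOConfig

config = PPOConfig().framework("torch")
# Alternatively, traced eager-mode TensorFlow 2.x:
# config = PPOConfig().framework("tf2", eager_tracing=True)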
- environment(env: Optional[Union[str, Any]] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, *, env_config: Optional[dict] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, observation_space: Optional[gym.spaces.Space] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, action_space: Optional[gym.spaces.Space] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, env_task_fn: Optional[Callable[[dict, Any, ray.rllib.env.env_context.EnvContext], Any]] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, render_env: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, clip_rewards: Optional[Union[bool, float]] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, normalize_actions: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, clip_actions: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, disable_env_checking: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>) ray.rllib.algorithms.algorithm_config.AlgorithmConfig [source]
Sets the config’s RL-environment settings.
- Parameters
env – The environment specifier. This can either be a tune-registered env, via tune.register_env([name], lambda env_ctx: [env object]), or a string specifier of an RLlib supported type. In the latter case, RLlib will try to interpret the specifier as either an openAI gym env, a PyBullet env, a ViZDoomGym env, or a fully qualified classpath to an Env class, e.g. “ray.rllib.examples.env.random_env.RandomEnv”.
env_config – Arguments dict passed to the env creator as an EnvContext object (which is a dict plus the properties: num_rollout_workers, worker_index, vector_index, and remote).
observation_space – The observation space for the Policies of this Algorithm.
action_space – The action space for the Policies of this Algorithm.
env_task_fn – A callable taking the last train results, the base env and the env context as args and returning a new task to set the env to. The env must be a TaskSettableEnv sub-class for this to work. See examples/curriculum_learning.py for an example.
render_env – If True, try to render the environment on the local worker or on worker 1 (if num_rollout_workers > 0). For vectorized envs, this usually means that only the first sub-environment will be rendered. In order for this to work, your env will have to implement the render() method, which either: a) handles window generation and rendering itself (returning True), or b) returns a numpy uint8 image of shape [height x width x 3 (RGB)].
clip_rewards – Whether to clip rewards during Policy’s postprocessing. None (default): Clip for Atari only (r=sign(r)). True: r=sign(r): Fixed rewards -1.0, 1.0, or 0.0. False: Never clip. [float value]: Clip at -value and +value. Tuple[value1, value2]: Clip at value1 and value2.
normalize_actions – If True, RLlib will learn entirely inside a normalized action space (0.0 centered with small stddev; only affecting Box components). We will unsquash actions (and clip, just in case) to the bounds of the env’s action space before sending actions back to the env.
clip_actions – If True, RLlib will clip actions according to the env’s bounds before sending them back to the env. TODO: (sven) This option should be deprecated and always be False.
disable_env_checking – If True, disable the environment pre-checking module.
- Returns
This updated AlgorithmConfig object.
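For example, a sketch that registers a custom env under a name and passes per-env settings via env_config (this uses the bundled RandomEnv example env; the env_config key shown is illustrative):

from ray.tune.registry import register_env
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.examples.env.random_env import RandomEnv

# The env creator receives an EnvContext (a dict plus worker/vector indices).
register_env("my_random_env", lambda env_ctx: RandomEnv(env_ctx))

config = PPOConfig().environment(
    env="my_random_env",
    env_config={"max_episode_len": 100},  # forwarded to the env creator
    normalize_actions=True,
    clip_rewards=None,
)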
- rollouts(*, num_rollout_workers: Optional[int] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, num_envs_per_worker: Optional[int] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, create_env_on_local_worker: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, sample_collector: Optional[Type[ray.rllib.evaluation.collectors.sample_collector.SampleCollector]] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, sample_async: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, enable_connectors: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, rollout_fragment_length: Optional[Union[int, str]] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, batch_mode: Optional[str] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, remote_worker_envs: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, remote_env_batch_wait_ms: Optional[float] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, validate_workers_after_construction: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, ignore_worker_failures: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, recreate_failed_workers: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, restart_failed_sub_environments: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, num_consecutive_worker_failures_tolerance: Optional[int] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, horizon: Optional[int] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, soft_horizon: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, no_done_at_end: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, preprocessor_pref: Optional[str] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, observation_filter: Optional[str] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, synchronize_filter: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, compress_observations: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, enable_tf1_exec_eagerly: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, sampler_perf_stats_ema_coef: Optional[float] = <ray.rllib.algorithms.algorithm_config._NotProvided object>) ray.rllib.algorithms.algorithm_config.AlgorithmConfig [source]
Sets the rollout worker configuration.
- Parameters
num_rollout_workers – Number of rollout worker actors to create for parallel sampling. Setting this to 0 will force rollouts to be done in the local worker (driver process or the Algorithm’s actor when using Tune).
num_envs_per_worker – Number of environments to evaluate vector-wise per worker. This enables model inference batching, which can improve performance for inference bottlenecked workloads.
sample_collector – The SampleCollector class to be used to collect and retrieve environment-, model-, and sampler data. Override the SampleCollector base class to implement your own collection/buffering/retrieval logic.
create_env_on_local_worker – When num_rollout_workers > 0, the driver (local_worker; worker-idx=0) does not need an environment. This is because it doesn’t have to sample (done by remote_workers; worker_indices > 0) nor evaluate (done by evaluation workers; see below).
sample_async – Use a background thread for sampling (slightly off-policy, usually not advisable to turn on unless your env specifically requires it).
enable_connectors – Use connector based environment runner, so that all preprocessing of obs and postprocessing of actions are done in agent and action connectors.
rollout_fragment_length – Divide episodes into fragments of this many steps each during rollouts. Trajectories of this size are collected from rollout workers and combined into a larger batch of train_batch_size for learning. For example, given rollout_fragment_length=100 and train_batch_size=1000: 1. RLlib collects 10 fragments of 100 steps each from rollout workers. 2. These fragments are concatenated and we perform an epoch of SGD. When using multiple envs per worker, the fragment size is multiplied by num_envs_per_worker, since we are collecting steps from multiple envs in parallel. For example, if num_envs_per_worker=5, then rollout workers will return experiences in chunks of 5*100 = 500 steps. The dataflow here can vary per algorithm. For example, PPO further divides the train batch into minibatches for multi-epoch SGD. Set to “auto” to have RLlib compute an exact rollout_fragment_length to match the given batch size.
batch_mode – How to build per-Sampler (RolloutWorker) batches, which are then usually concat’d to form the train batch. Note that “steps” below can mean different things (either env- or agent-steps) and depends on the count_steps_by setting, adjustable via AlgorithmConfig.multi_agent(count_steps_by=..): 1) “truncate_episodes”: Each call to sample() will return a batch of at most rollout_fragment_length * num_envs_per_worker in size. The batch will be exactly rollout_fragment_length * num_envs in size if postprocessing does not change batch sizes. Episodes may be truncated in order to meet this size requirement. This mode guarantees evenly sized batches, but increases variance as the future return must now be estimated at truncation boundaries. 2) “complete_episodes”: Each call to sample() will return a batch of at least rollout_fragment_length * num_envs_per_worker in size. Episodes will not be truncated, but multiple episodes may be packed within one batch to meet the (minimum) batch size. Note that when num_envs_per_worker > 1, episode steps will be buffered until the episode completes, and hence batches may contain significant amounts of off-policy data.
remote_worker_envs – If using num_envs_per_worker > 1, whether to create those new envs in remote processes instead of in the same worker. This adds overheads, but can make sense if your envs can take much time to step / reset (e.g., for StarCraft). Use this cautiously; overheads are significant.
remote_env_batch_wait_ms – Timeout that remote workers are waiting when polling environments. 0 (continue when at least one env is ready) is a reasonable default, but optimal value could be obtained by measuring your environment step / reset and model inference perf.
validate_workers_after_construction – Whether to validate that each created remote worker is healthy after its construction process.
ignore_worker_failures – Whether to attempt to continue training if a worker crashes. The number of currently healthy workers is reported as the “num_healthy_workers” metric.
recreate_failed_workers – Whether - upon a worker failure - RLlib will try to recreate the lost worker as an identical copy of the failed one. The new worker will only differ from the failed one in its self.recreated_worker=True property value. It will have the same worker_index as the original one. If True, the ignore_worker_failures setting will be ignored.
restart_failed_sub_environments – If True and any sub-environment (within a vectorized env) throws any error during env stepping, the Sampler will try to restart the faulty sub-environment. This is done without disturbing the other (still intact) sub-environments and without the RolloutWorker crashing.
num_consecutive_worker_failures_tolerance – The number of consecutive times a rollout worker (or evaluation worker) failure is tolerated before finally crashing the Algorithm. Only useful if either ignore_worker_failures or recreate_failed_workers is True. Note that for restart_failed_sub_environments and sub-environment failures, the worker itself is NOT affected and won’t throw any errors, as the flawed sub-environment is silently restarted under the hood.
horizon – Number of steps after which the episode is forced to terminate. Defaults to env.spec.max_episode_steps (if present) for Gym envs.
soft_horizon – Calculate rewards but don’t reset the environment when the horizon is hit. This allows value estimation and RNN state to span across logical episodes denoted by horizon. This only has an effect if horizon != inf.
no_done_at_end – Don’t set ‘done’ at the end of the episode. In combination with soft_horizon, this works as follows: - no_done_at_end=False soft_horizon=False: Reset env and add done=True at end of each episode. - no_done_at_end=True soft_horizon=False: Reset env, but do NOT add done=True at end of the episode. - no_done_at_end=False soft_horizon=True: Do NOT reset env at horizon, but add done=True at the horizon (pretending the episode has terminated). - no_done_at_end=True soft_horizon=True: Do NOT reset env at horizon and do NOT add done=True at the horizon.
preprocessor_pref – Whether to use “rllib” or “deepmind” preprocessors by default. Set to None for using no preprocessor. In this case, the model will have to handle possibly complex observations from the environment.
observation_filter – Element-wise observation filter, either “NoFilter” or “MeanStdFilter”.
synchronize_filter – Whether to synchronize the statistics of remote filters.
compress_observations – Whether to LZ4 compress individual observations in the SampleBatches collected during rollouts.
enable_tf1_exec_eagerly – Explicitly tells the rollout worker to enable TF eager execution. This is useful for example when framework is “torch”, but a TF2 policy needs to be restored for evaluation or league-based purposes.
sampler_perf_stats_ema_coef – If specified, perf stats are exponential moving averages (EMAs); this is the coefficient determining how much new data points contribute to the average. Default is None, which uses a simple global average instead. The EMA update rule is: updated = (1 - ema_coef) * old + ema_coef * new
- Returns
This updated AlgorithmConfig object.
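A sketch of a common sampling setup (values are illustrative): 4 workers with 2 vectorized sub-envs each, letting RLlib derive a rollout_fragment_length that exactly matches the train batch size:

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .rollouts(
        num_rollout_workers=4,
        num_envs_per_worker=2,
        rollout_fragment_length="auto",   # 4000 / (4 * 2) = 500 steps per fragment
        batch_mode="truncate_episodes",
    )
    .training(train_batch_size=4000)
)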
- training(gamma: Optional[float] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, lr: Optional[float] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, train_batch_size: Optional[int] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, model: Optional[dict] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, optimizer: Optional[dict] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, max_requests_in_flight_per_sampler_worker: Optional[int] = <ray.rllib.algorithms.algorithm_config._NotProvided object>) ray.rllib.algorithms.algorithm_config.AlgorithmConfig [source]
Sets the training related configuration.
- Parameters
gamma – Float specifying the discount factor of the Markov Decision process.
lr – The default learning rate.
train_batch_size – Training batch size, if applicable.
model – Arguments passed into the policy model. See models/catalog.py for a full list of the available model options. TODO: Provide ModelConfig objects instead of dicts.
optimizer – Arguments to pass to the policy optimizer.
max_requests_in_flight_per_sampler_worker – Max number of inflight requests to each sampling worker. See the FaultTolerantActorManager class for more details. Tuning these values is important when running experiments with large sample batches, where there is the risk that the object store may fill up, causing spilling of objects to disk. This can cause any asynchronous requests to become very slow, making your experiment run slowly as well. You can inspect the object store during your experiment via a call to ray memory on your head node, and by using the ray dashboard. If you’re seeing that the object store is filling up, turn down the number of remote requests in flight, or enable compression of timesteps in your experiment.
- Returns
This updated AlgorithmConfig object.
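For instance, a sketch of generic training settings plus a model dict (fcnet_hiddens and fcnet_activation are standard catalog model options; all values are illustrative):

from ray.rllib.algorithms.ppo import PPOConfig

config = PPOConfig().training(
    gamma=0.99,
    lr=1e-4,
    train_batch_size=2000,
    model={"fcnet_hiddens": [128, 128], "fcnet_activation": "relu"},
)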
- callbacks(callbacks_class) ray.rllib.algorithms.algorithm_config.AlgorithmConfig [source]
Sets the callbacks configuration.
- Parameters
callbacks_class – Callbacks class, whose methods will be run during various phases of training and environment sample collection. See the DefaultCallbacks class and examples/custom_metrics_and_callbacks.py for more usage information.
- Returns
This updated AlgorithmConfig object.
- exploration(*, explore: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, exploration_config: Optional[dict] = <ray.rllib.algorithms.algorithm_config._NotProvided object>) ray.rllib.algorithms.algorithm_config.AlgorithmConfig [source]
Sets the config’s exploration settings.
- Parameters
explore – Default exploration behavior, iff explore=None is passed into compute_action(s). Set to False for no exploration behavior (e.g., for evaluation).
exploration_config – A dict specifying the Exploration object’s config.
- Returns
This updated AlgorithmConfig object.
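As an example, a sketch of an epsilon-greedy exploration schedule for a value-based algorithm such as DQN (key names follow RLlib's built-in EpsilonGreedy exploration; values are illustrative):

from ray.rllib.algorithms.dqn import DQNConfig

config = DQNConfig().exploration(
    explore=True,
    exploration_config={
        "type": "EpsilonGreedy",
        "initial_epsilon": 1.0,
        "final_epsilon": 0.02,
        "epsilon_timesteps": 10_000,  # decay epsilon linearly over this many timesteps
    },
)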
- evaluation(*, evaluation_interval: Optional[int] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, evaluation_duration: Optional[Union[int, str]] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, evaluation_duration_unit: Optional[str] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, evaluation_sample_timeout_s: Optional[float] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, evaluation_parallel_to_training: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, evaluation_config: Optional[Union[ray.rllib.algorithms.algorithm_config.AlgorithmConfig, dict]] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, off_policy_estimation_methods: Optional[Dict] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, ope_split_batch_by_episode: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, evaluation_num_workers: Optional[int] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, custom_evaluation_function: Optional[Callable] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, always_attach_evaluation_results: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, enable_async_evaluation: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, evaluation_num_episodes=-1) ray.rllib.algorithms.algorithm_config.AlgorithmConfig [source]
Sets the config’s evaluation settings.
- Parameters
evaluation_interval – Evaluate with every evaluation_interval training iterations. The evaluation stats will be reported under the “evaluation” metric key. Note that for Ape-X, metrics are already only reported for the lowest epsilon workers (least random workers). Set to None (or 0) for no evaluation.
evaluation_duration – Duration for which to run evaluation each evaluation_interval. The unit for the duration can be set via evaluation_duration_unit to either “episodes” (default) or “timesteps”. If using multiple evaluation workers (evaluation_num_workers > 1), the load to run will be split amongst these. If the value is “auto”: - For evaluation_parallel_to_training=True: Will run as many episodes/timesteps as fit into the (parallel) training step. - For evaluation_parallel_to_training=False: Error.
evaluation_duration_unit – The unit with which to count the evaluation duration. Either “episodes” (default) or “timesteps”.
evaluation_sample_timeout_s – The timeout (in seconds) for the ray.get call to the remote evaluation workers’ sample() method. After this time, the user will receive a warning and instructions on how to fix the issue. This could be either to make sure the episode ends, increasing the timeout, or switching to evaluation_duration_unit=timesteps.
evaluation_parallel_to_training – Whether to run evaluation in parallel to an Algorithm.train() call using threading. Default=False. E.g. evaluation_interval=2 -> For every other training iteration, the Algorithm.train() and Algorithm.evaluate() calls run in parallel. Note: This is experimental. Possible pitfalls could be race conditions for weight synching at the beginning of the evaluation loop.
evaluation_config – Typical usage is to pass extra args to evaluation env creator and to disable exploration by computing deterministic actions. IMPORTANT NOTE: Policy gradient algorithms are able to find the optimal policy, even if this is a stochastic one. Setting “explore=False” here will result in the evaluation workers not using this optimal policy!
off_policy_estimation_methods – Specify how to evaluate the current policy, along with any optional config parameters. This only has an effect when reading offline experiences (“input” is not “sampler”). Available keys: {ope_method_name: {“type”: ope_type, …}} where ope_method_name is a user-defined string to save the OPE results under, and ope_type can be any subclass of OffPolicyEstimator, e.g. ray.rllib.offline.estimators.is::ImportanceSampling or your own custom subclass, or the full class path to the subclass. You can also add additional config arguments to be passed to the OffPolicyEstimator in the dict, e.g. {“qreg_dr”: {“type”: DoublyRobust, “q_model_type”: “qreg”, “k”: 5}}
ope_split_batch_by_episode – Whether to use SampleBatch.split_by_episode() to split the input batch into episodes before estimating the OPE metrics. In the case of bandits, you should make this False to see improvements in OPE evaluation speed; it is ok not to split by episode there, since each record is one timestep already. The default is True.
evaluation_num_workers – Number of parallel workers to use for evaluation. Note that this is set to zero by default, which means evaluation will be run in the algorithm process (only if evaluation_interval is not None). If you increase this, it will increase the Ray resource usage of the algorithm since evaluation workers are created separately from rollout workers (used to sample data for training).
custom_evaluation_function – Customize the evaluation method. This must be a function of signature (algo: Algorithm, eval_workers: WorkerSet) -> metrics: dict. See the Algorithm.evaluate() method to see the default implementation. The Algorithm guarantees all eval workers have the latest policy state before this function is called.
always_attach_evaluation_results – Make sure the latest available evaluation results are always attached to a step result dict. This may be useful if Tune or some other meta controller needs access to evaluation metrics all the time.
enable_async_evaluation – If True, use an AsyncRequestsManager for the evaluation workers and use this manager to send sample() requests to the evaluation workers. This way, the Algorithm becomes more robust against long running episodes and/or failing (and restarting) workers.
- Returns
This updated AlgorithmConfig object.
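For example, a sketch that evaluates every other training iteration on one dedicated worker, for 10 deterministic episodes, in parallel with training (note the caveat above about explore=False and stochastic optimal policies):

from ray.rllib.algorithms.ppo import PPOConfig

config = PPOConfig().evaluation(
    evaluation_interval=2,
    evaluation_duration=10,
    evaluation_duration_unit="episodes",
    evaluation_num_workers=1,
    evaluation_parallel_to_training=True,
    evaluation_config={"explore": False},
)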
- offline_data(*, input_=<ray.rllib.algorithms.algorithm_config._NotProvided object>, input_config=<ray.rllib.algorithms.algorithm_config._NotProvided object>, actions_in_input_normalized=<ray.rllib.algorithms.algorithm_config._NotProvided object>, input_evaluation=<ray.rllib.algorithms.algorithm_config._NotProvided object>, postprocess_inputs=<ray.rllib.algorithms.algorithm_config._NotProvided object>, shuffle_buffer_size=<ray.rllib.algorithms.algorithm_config._NotProvided object>, output=<ray.rllib.algorithms.algorithm_config._NotProvided object>, output_config=<ray.rllib.algorithms.algorithm_config._NotProvided object>, output_compress_columns=<ray.rllib.algorithms.algorithm_config._NotProvided object>, output_max_file_size=<ray.rllib.algorithms.algorithm_config._NotProvided object>, offline_sampling=<ray.rllib.algorithms.algorithm_config._NotProvided object>) ray.rllib.algorithms.algorithm_config.AlgorithmConfig [source]
Sets the config’s offline data settings.
- Parameters
input – Specify how to generate experiences: - “sampler”: Generate experiences via online (env) simulation (default). - A local directory or file glob expression (e.g., “/tmp/*.json”). - A list of individual file paths/URIs (e.g., [“/tmp/1.json”, “s3://bucket/2.json”]). - A dict with string keys and sampling probabilities as values (e.g., {“sampler”: 0.4, “/tmp/*.json”: 0.4, “s3://bucket/expert.json”: 0.2}). - A callable that takes an IOContext object as only arg and returns a ray.rllib.offline.InputReader. - A string key that indexes a callable with tune.registry.register_input.
input_config – Arguments that describe the settings for reading the input. If input is sample, this will be environment configuration, e.g. env_name and env_config, etc. See EnvContext for more info. If the input is dataset, this will be e.g. format, path.
actions_in_input_normalized – True, if the actions in a given offline “input” are already normalized (between -1.0 and 1.0). This is usually the case when the offline file has been generated by another RLlib algorithm (e.g. PPO or SAC), while “normalize_actions” was set to True.
postprocess_inputs – Whether to run postprocess_trajectory() on the trajectory fragments from offline inputs. Note that postprocessing will be done using the current policy, not the behavior policy, which is typically undesirable for on-policy algorithms.
shuffle_buffer_size – If positive, input batches will be shuffled via a sliding window buffer of this number of batches. Use this if the input data is not in random enough order. Input is delayed until the shuffle buffer is filled.
output – Specify where experiences should be saved: - None: don’t save any experiences - “logdir” to save to the agent log dir - a path/URI to save to a custom output directory (e.g., “s3://bckt/”) - a function that returns a rllib.offline.OutputWriter
output_config – Arguments accessible from the IOContext for configuring custom output.
output_compress_columns – What sample batch columns to LZ4 compress in the output data.
output_max_file_size – Max output file size before rolling over to a new file.
offline_sampling – Whether sampling for the Algorithm happens via reading from offline data. If True, RolloutWorkers will NOT limit the number of collected batches within the same sample() call based on the number of sub-environments within the worker (no sub-environments present).
- Returns
This updated AlgorithmConfig object.
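A sketch of reading experiences from offline JSON files instead of sampling an env, while also writing newly generated experiences to the log dir ("/tmp/demos" is a placeholder path):

from ray.rllib.algorithms.ppo import PPOConfig

config = PPOConfig().offline_data(
    input_="/tmp/demos",   # directory or glob of offline experience JSON files
    output="logdir",       # write new experiences to the agent log dir
)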
- multi_agent(*, policies=<ray.rllib.algorithms.algorithm_config._NotProvided object>, policy_map_capacity=<ray.rllib.algorithms.algorithm_config._NotProvided object>, policy_map_cache=<ray.rllib.algorithms.algorithm_config._NotProvided object>, policy_mapping_fn=<ray.rllib.algorithms.algorithm_config._NotProvided object>, policies_to_train=<ray.rllib.algorithms.algorithm_config._NotProvided object>, observation_fn=<ray.rllib.algorithms.algorithm_config._NotProvided object>, count_steps_by=<ray.rllib.algorithms.algorithm_config._NotProvided object>, replay_mode=-1) ray.rllib.algorithms.algorithm_config.AlgorithmConfig [source]
Sets the config’s multi-agent settings.
Validates the new multi-agent settings and translates everything into a unified multi-agent setup format. For example, a policies list or set of IDs is properly converted into a dict mapping these IDs to PolicySpecs.
- Parameters
policies – Map of type MultiAgentPolicyConfigDict from policy ids to either 4-tuples of (policy_cls, obs_space, act_space, config) or PolicySpecs. These tuples or PolicySpecs define the class of the policy, the observation- and action spaces of the policies, and any extra config.
policy_map_capacity – Keep this many policies in the “policy_map” (before writing least-recently used ones to disk/S3).
policy_map_cache – Where to store overflowing (least-recently used) policies? Could be a directory (str) or an S3 location. None for using the default output dir.
policy_mapping_fn – Function mapping agent ids to policy ids. The signature is: (agent_id, episode, worker, **kwargs) -> PolicyID.
policies_to_train – Determines those policies that should be updated. Options are: - None, for training all policies. - An iterable of PolicyIDs that should be trained. - A callable, taking a PolicyID and a SampleBatch or MultiAgentBatch and returning a bool (indicating whether the given policy is trainable or not, given the particular batch). This allows you to have a policy trained only on certain data (e.g. when playing against a certain opponent).
observation_fn – Optional function that can be used to enhance the local agent observations to include more state. See rllib/evaluation/observation_function.py for more info.
count_steps_by – Which metric to use as the “batch size” when building a MultiAgentBatch. The two supported values are: “env_steps”: Count each time the env is “stepped” (no matter how many multi-agent actions are passed/how many multi-agent observations have been returned in the previous step). “agent_steps”: Count each individual agent step as one step.
- Returns
This updated AlgorithmConfig object.
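A sketch of a two-policy setup in which agents are mapped by their (assumed integer) agent id and only one policy is trained (a real setup would pair this with a MultiAgentEnv):

from ray.rllib.algorithms.ppo import PPOConfig

config = PPOConfig().multi_agent(
    policies={"pol0", "pol1"},   # IDs only; the specs get filled in from the algo's defaults
    policy_mapping_fn=(
        lambda agent_id, episode, worker, **kwargs:
            "pol0" if int(agent_id) % 2 == 0 else "pol1"
    ),
    policies_to_train=["pol0"],
)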
- is_multi_agent() bool [source]
Returns whether this config specifies a multi-agent setup.
- Returns
True, if a) >1 policies defined OR b) 1 policy defined, but its ID is NOT DEFAULT_POLICY_ID.
- reporting(*, keep_per_episode_custom_metrics: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, metrics_episode_collection_timeout_s: Optional[float] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, metrics_num_episodes_for_smoothing: Optional[int] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, min_time_s_per_iteration: Optional[int] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, min_train_timesteps_per_iteration: Optional[int] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, min_sample_timesteps_per_iteration: Optional[int] = <ray.rllib.algorithms.algorithm_config._NotProvided object>) ray.rllib.algorithms.algorithm_config.AlgorithmConfig [source]
Sets the config’s reporting settings.
- Parameters
keep_per_episode_custom_metrics – Store raw custom metrics without calculating max, min, mean.
metrics_episode_collection_timeout_s – Wait for metric batches for at most this many seconds. Those that have not returned in time will be collected in the next train iteration.
metrics_num_episodes_for_smoothing – Smooth rollout metrics over this many episodes, if possible. In case rollouts (sample collection) just started, there may be fewer than this many episodes in the buffer and we’ll compute metrics over this smaller number of available episodes. In case there are more than this many episodes collected in a single training iteration, use all of these episodes for metrics computation, meaning don’t ever cut any “excess” episodes.
min_time_s_per_iteration – Minimum time to accumulate within a single train() call. This value does not affect learning, only the number of times Algorithm.training_step() is called by Algorithm.train(). If - after one such step attempt - the time taken has not reached min_time_s_per_iteration, will perform n more training_step() calls until the minimum time has been consumed. Set to 0 or None for no minimum time.
min_train_timesteps_per_iteration – Minimum training timesteps to accumulate within a single train() call. This value does not affect learning, only the number of times Algorithm.training_step() is called by Algorithm.train(). If - after one such step attempt - the training timestep count has not been reached, will perform n more training_step() calls until the minimum timesteps have been executed. Set to 0 or None for no minimum timesteps.
min_sample_timesteps_per_iteration – Minimum env sampling timesteps to accumulate within a single train() call. This value does not affect learning, only the number of times Algorithm.training_step() is called by Algorithm.train(). If - after one such step attempt - the env sampling timestep count has not been reached, will perform n more training_step() calls until the minimum timesteps have been executed. Set to 0 or None for no minimum timesteps.
- Returns
This updated AlgorithmConfig object.
- checkpointing(export_native_model_files: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>) ray.rllib.algorithms.algorithm_config.AlgorithmConfig [source]
Sets the config’s checkpointing settings.
- Parameters
export_native_model_files – Whether an individual Policy- or the Algorithm’s checkpoints also contain (tf or torch) native model files. These could be used to restore just the NN models from these files w/o requiring RLlib. These files are generated by calling the tf- or torch- built-in saving utility methods on the actual models.
- Returns
This updated AlgorithmConfig object.
- debugging(*, logger_creator: Optional[Callable[[], ray.tune.logger.logger.Logger]] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, logger_config: Optional[dict] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, log_level: Optional[str] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, log_sys_usage: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, fake_sampler: Optional[bool] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, seed: Optional[int] = <ray.rllib.algorithms.algorithm_config._NotProvided object>, worker_cls: Optional[Type[ray.rllib.evaluation.rollout_worker.RolloutWorker]] = <ray.rllib.algorithms.algorithm_config._NotProvided object>) ray.rllib.algorithms.algorithm_config.AlgorithmConfig [source]
Sets the config’s debugging settings.
- Parameters
logger_creator – Callable that creates a ray.tune.Logger object. If unspecified, a default logger is created.
logger_config – Define logger-specific configuration to be used inside the Logger. Default value None allows overwriting with nested dicts.
log_level – Set the ray.rllib.* log level for the agent process and its workers. Should be one of DEBUG, INFO, WARN, or ERROR. The DEBUG level will also periodically print out summaries of relevant internal dataflow (this is also printed out once at startup at the INFO level). When using the rllib train command, you can also use the -v and -vv flags as shorthand for INFO and DEBUG.
log_sys_usage – Log system resource metrics to results. This requires psutil to be installed for sys stats, and gputil for GPU metrics.
fake_sampler – Use fake (infinite speed) sampler. For testing only.
seed – This argument, in conjunction with worker_index, sets the random seed of each worker, so that identically configured trials will have identical results. This makes experiments reproducible.
worker_cls – Use a custom RolloutWorker type for unit testing purpose.
- Returns
This updated AlgorithmConfig object.
- experimental(*, _tf_policy_handles_more_than_one_loss=<ray.rllib.algorithms.algorithm_config._NotProvided object>, _disable_preprocessor_api=<ray.rllib.algorithms.algorithm_config._NotProvided object>, _disable_action_flattening=<ray.rllib.algorithms.algorithm_config._NotProvided object>, _disable_execution_plan_api=<ray.rllib.algorithms.algorithm_config._NotProvided object>) ray.rllib.algorithms.algorithm_config.AlgorithmConfig [source]
Sets the config’s experimental settings.
- Parameters
_tf_policy_handles_more_than_one_loss – Experimental flag. If True, TFPolicy will handle more than one loss/optimizer. Set this to True, if you would like to return more than one loss term from your loss_fn and an equal number of optimizers from your optimizer_fn. In the future, the default for this will be True.
_disable_preprocessor_api – Experimental flag. If True, no (observation) preprocessor will be created and observations will arrive in model as they are returned by the env. In the future, the default for this will be True.
_disable_action_flattening – Experimental flag. If True, RLlib will no longer flatten the policy-computed actions into a single tensor (for storage in SampleCollectors/output files/etc..), but leave (possibly nested) actions as-is. Disabling flattening affects: - SampleCollectors: Have to store possibly nested action structs. - Models that have the previous action(s) as part of their input. - Algorithms reading from offline files (incl. action information).
_disable_execution_plan_api – Experimental flag. If True, the execution plan API will not be used. Instead, an Algorithm’s training_iteration method will be called as-is each training iteration.
- Returns
This updated AlgorithmConfig object.
- get_rollout_fragment_length(worker_index: int = 0) int [source]
Automatically infers a proper rollout_fragment_length setting if “auto”.
Uses the simple formula: rollout_fragment_length = train_batch_size / (num_envs_per_worker * num_rollout_workers).
If the result is a fraction AND worker_index is provided, will make those workers add another timestep, such that the overall batch size (across the workers) will add up to exactly the train_batch_size.
- Returns
The user-provided rollout_fragment_length or a computed one (if the user value is “auto”).
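A quick worked example of the formula (values are illustrative): with train_batch_size=4000, 4 rollout workers and 2 envs per worker, "auto" resolves to 4000 / (2 * 4) = 500:

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .rollouts(num_rollout_workers=4, num_envs_per_worker=2, rollout_fragment_length="auto")
    .training(train_batch_size=4000)
)
print(config.get_rollout_fragment_length())  # -> 500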
- get_evaluation_config_object() Optional[ray.rllib.algorithms.algorithm_config.AlgorithmConfig] [source]
Creates a full AlgorithmConfig object from self.evaluation_config.
- Returns
A fully valid AlgorithmConfig object that can be used for the evaluation WorkerSet. If self is already an evaluation config object, returns None.
- get_multi_agent_setup(*, policies: Optional[Dict[str, PolicySpec]] = None, env: Optional[Any] = None, spaces: Optional[Dict[str, Tuple[gym.Space, gym.Space]]] = None, default_policy_class: Optional[Type[ray.rllib.policy.policy.Policy]] = None) Tuple[Dict[str, PolicySpec], Callable[[str, Union[SampleBatch, MultiAgentBatch]], bool]] [source]
Compiles a complete multi-agent config (dict) from the information in self.
Infers the observation- and action spaces, the policy classes, and the policy’s configs. The returned MultiAgentPolicyConfigDict is fully unified and strictly maps PolicyIDs to complete PolicySpec objects (with all their fields not-None).
Examples
>>> import numpy as np
>>> from ray.rllib.algorithms.ppo import PPOConfig
>>> config = (
...     PPOConfig()
...     .environment("CartPole-v1")
...     .framework("torch")
...     .multi_agent(policies={"pol1", "pol2"}, policies_to_train=["pol1"])
... )
>>> policy_dict, is_policy_to_train = \
...     config.get_multi_agent_setup()
>>> is_policy_to_train("pol1")
True
>>> is_policy_to_train("pol2")
False
>>> print(policy_dict)
{
  "pol1": PolicySpec(
      PPOTorchPolicyV2,  # inferred from Algo's default policy class
      Box(-2.0, 2.0, (4,), np.float),  # inferred from env
      Discrete(2),  # inferred from env
      {},  # not provided -> empty dict
  ),
  "pol2": PolicySpec(
      PPOTorchPolicyV2,  # inferred from Algo's default policy class
      Box(-2.0, 2.0, (4,), np.float),  # inferred from env
      Discrete(2),  # inferred from env
      {},  # not provided -> empty dict
  ),
}
- Parameters
policies – An optional multi-agent policies dict, mapping policy IDs to PolicySpec objects. If not provided, will use self.policies instead. Note that the policy_class, observation_space, and action_space properties in these PolicySpecs may be None and must therefore be inferred here.
env – An optional env instance, from which to infer the different spaces for the different policies. If not provided, will try to infer from spaces. Otherwise from self.observation_space and self.action_space. If no information on spaces can be inferred, will raise an error.
spaces – Optional dict mapping policy IDs to tuples of 1) observation space and 2) action space that should be used for the respective policy. These spaces are usually provided by an already instantiated remote RolloutWorker. If not provided, will try to infer from env. Otherwise from self.observation_space and self.action_space. If no information on spaces can be inferred, will raise an error.
default_policy_class – The Policy class to use should a PolicySpec have its policy_class property set to None.
- Returns
A tuple consisting of 1) a MultiAgentPolicyConfigDict and 2) an is_policy_to_train(PolicyID, SampleBatchType) -> bool callable.
- Raises
ValueError – In case no spaces can be inferred for the policy/ies.
ValueError – In case two agents in the env map to the same PolicyID (according to self.policy_mapping_fn), but have different action- or observation spaces according to the inferred space information.
- validate_train_batch_size_vs_rollout_fragment_length() None [source]
Detects mismatches for train_batch_size vs rollout_fragment_length.
Only applicable for algorithms whose train_batch_size should be directly dependent on rollout_fragment_length (synchronous sampling, on-policy PG algos).
If rollout_fragment_length != “auto”, makes sure that the product of rollout_fragment_length x num_rollout_workers x num_envs_per_worker roughly (within 10%) matches the provided train_batch_size. Otherwise, errors, asking the user to set rollout_fragment_length to auto or to a matching value.
Also, only checks this if train_batch_size > 0 (DDPPO sets this to -1 to auto-calculate the actual batch size later).
- Raises
ValueError – If there is a mismatch between the user-provided rollout_fragment_length and train_batch_size.
- get(key, default=None)[source]
Shim method to help pretend we are a dict.
- pop(key, default=None)[source]
Shim method to help pretend we are a dict.
- keys()[source]
Shim method to help pretend we are a dict.
- values()[source]
Shim method to help pretend we are a dict.
- items()[source]
Shim method to help pretend we are a dict.
- property multiagent
Shim method to help pretend we are a dict with ‘multiagent’ key.
Building Custom Algorithm Classes#
Warning
As of Ray >= 1.9, it is no longer recommended to use the build_trainer() utility function for creating custom Algorithm sub-classes. Instead, follow the simple guidelines here for directly sub-classing from Algorithm.
In order to create a custom Algorithm, sub-class the Algorithm class and override one or more of its methods. Those are in particular (a short sketch follows the list):
get_default_config()
validate_config()
training_iteration()
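A rough sketch of such a sub-class, not a drop-in implementation (it assumes the synchronous_parallel_sample and train_one_step helpers from ray.rllib.execution and samples roughly train_batch_size env steps per iteration):

from ray.rllib.algorithms.algorithm import Algorithm
from ray.rllib.algorithms.algorithm_config import AlgorithmConfig
from ray.rllib.execution.rollout_ops import synchronous_parallel_sample
from ray.rllib.execution.train_ops import train_one_step


class MyAlgo(Algorithm):
    @classmethod
    def get_default_config(cls) -> AlgorithmConfig:
        # Start from the generic config and tweak a few defaults.
        return AlgorithmConfig().rollouts(num_rollout_workers=2)

    def training_step(self) -> dict:
        # Collect a train batch from the rollout workers ...
        train_batch = synchronous_parallel_sample(
            worker_set=self.workers, max_env_steps=self.config["train_batch_size"]
        )
        train_batch = train_batch.as_multi_agent()
        # ... and run one learning update on the local worker.
        return train_one_step(self, train_batch)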
Interacting with an Algorithm#
Once you’ve built an AlgorithmConfig and retrieved an Algorithm from it via the build() method, you can use it to train and evaluate your experiments, as in the short sketch below.
Here’s the full Algorithm API reference.
- class ray.rllib.algorithms.algorithm.Algorithm(config: Optional[ray.rllib.algorithms.algorithm_config.AlgorithmConfig] = None, env=None, logger_creator: Optional[Callable[[], ray.tune.logger.logger.Logger]] = None, **kwargs)[source]#
An RLlib algorithm responsible for optimizing one or more Policies.
Algorithms contain a WorkerSet under self.workers. A WorkerSet is normally composed of a single local worker (self.workers.local_worker()), used to compute and apply learning updates, and optionally one or more remote workers used to generate environment samples in parallel. WorkerSet is fault tolerant and elastic. It tracks health states for all the managed remote worker actors. As a result, Algorithm should never access the underlying actor handles directly. Instead, always access them via all the foreach APIs with assigned IDs of the underlying workers.
Each worker (remote or local) contains a PolicyMap, which itself may contain either one policy for single-agent training or one or more policies for multi-agent training. Policies are synchronized automatically from time to time using ray.remote calls. The exact synchronization logic depends on the specific algorithm used, but this usually happens from local worker to all remote workers and after each training update.
You can write your own Algorithm classes by sub-classing from Algorithm or any of its built-in sub-classes. This allows you to override the training_step method to implement your own algorithm logic. You can find the different built-in algorithms’ training_step() methods in their respective main .py files, e.g. rllib.algorithms.dqn.dqn.py or rllib.algorithms.impala.impala.py.
The most important API methods an Algorithm exposes are train(), evaluate(), save() and restore().
- static from_checkpoint(checkpoint: Union[str, ray.air.checkpoint.Checkpoint], policy_ids: Optional[Container[str]] = None, policy_mapping_fn: Optional[Callable[[Any, int], str]] = None, policies_to_train: Optional[Union[Container[str], Callable[[str, Optional[Union[SampleBatch, MultiAgentBatch]]], bool]]] = None) Algorithm [source]#
Creates a new algorithm instance from a given checkpoint.
Note: This method must remain backward compatible from 2.0.0 on.
- Parameters
checkpoint – The path (str) to the checkpoint directory to use or an AIR Checkpoint instance to restore from.
policy_ids – Optional list of PolicyIDs to recover. This allows users to restore an Algorithm with only a subset of the originally present Policies.
policy_mapping_fn – An optional (updated) policy mapping function to use from here on.
policies_to_train – An optional list of policy IDs to be trained or a callable taking PolicyID and SampleBatchType and returning a bool (trainable or not?). If None, will keep the existing setup in place. Policies, whose IDs are not in the list (or for which the callable returns False) will not be updated.
- Returns
The instantiated Algorithm.
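For example, a sketch of restoring from a previously written checkpoint directory ("/tmp/my_ppo_ckpt" is a placeholder path):

from ray.rllib.algorithms.algorithm import Algorithm

restored_algo = Algorithm.from_checkpoint("/tmp/my_ppo_ckpt")
print(restored_algo.train()["episode_reward_mean"])  # training continues from the restored state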
- static from_state(state: Dict) ray.rllib.algorithms.algorithm.Algorithm [source]#
Recovers an Algorithm from a state object.
The state of an instantiated Algorithm can be retrieved by calling its get_state method. It contains all information necessary to create the Algorithm from scratch. No access to the original code (e.g. configs, knowledge of the Algorithm’s class, etc.) is needed.
- Parameters
state – The state to recover a new Algorithm instance from.
- Returns
A new Algorithm instance.
- __init__(config: Optional[ray.rllib.algorithms.algorithm_config.AlgorithmConfig] = None, env=None, logger_creator: Optional[Callable[[], ray.tune.logger.logger.Logger]] = None, **kwargs)[source]#
Initializes an Algorithm instance.
- Parameters
config – Algorithm-specific configuration object.
logger_creator – Callable that creates a ray.tune.Logger object. If unspecified, a default logger is created.
**kwargs – Arguments passed to the Trainable base class.
- setup(config: ray.rllib.algorithms.algorithm_config.AlgorithmConfig) None [source]#
Subclasses should override this for custom initialization.
New in version 0.8.7.
- Parameters
config – Hyperparameters and other configs given. Copy of self.config.
- classmethod get_default_policy_class(config: ray.rllib.algorithms.algorithm_config.AlgorithmConfig) Optional[Type[ray.rllib.policy.policy.Policy]] [source]#
Returns a default Policy class to use, given a config.
This class will be used by an Algorithm in case the policy class is not provided by the user in any single- or multi-agent PolicySpec.
- step() dict [source]#
Implements the main Trainer.train() logic.
Takes n attempts to perform a single training step. Thereby catches RayErrors resulting from worker failures. After n attempts, fails gracefully.
Override this method in your Trainer sub-classes if you would like to handle worker failures yourself. Otherwise, override only training_step() to implement the core algorithm logic.
- Returns
The results dict with stats/infos on sampling, training, and - if required - evaluation.
- evaluate(duration_fn: Optional[Callable[[int], int]] = None) dict [source]#
Evaluates the current policy under the evaluation_config settings.
Note that this default implementation does not do anything beyond merging evaluation_config with the normal trainer config.
- Parameters
duration_fn – An optional callable taking the already run num episodes as only arg and returning the number of episodes left to run. It’s used to find out whether evaluation should continue.
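A hedged usage sketch, assuming algo is a built Algorithm (the exact keys in the returned dict depend on your evaluation settings):
>>> eval_results = algo.evaluate()
>>> # Evaluation metrics typically live under an "evaluation" key.
>>> print(eval_results.get("evaluation", {}).get("episode_reward_mean"))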
- restore_workers(workers: ray.rllib.evaluation.worker_set.WorkerSet)[source]#
Try to restore failed workers if necessary.
Algorithms that use custom RolloutWorkers may override this method to disable the default restoration behavior and implement their own restoration logic.
- Parameters
workers – The WorkerSet to restore. This may be Rollout or Evaluation workers.
- training_step() dict [source]#
Default single iteration logic of an algorithm:
- Collect on-policy samples (SampleBatches) in parallel using the Trainer's RolloutWorkers (@ray.remote).
- Concatenate the collected SampleBatches into one train batch.
- Call the policies' learn_on_batch (simple optimizer) OR load_batch_into_buffer + learn_on_loaded_batch (multi-GPU optimizer) methods to calculate loss and update the model(s). Note that there may be more than one policy in the multi-agent case.
- Return all collected metrics for the iteration.
- Returns
The results dict from executing the training iteration.
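As an illustration of overriding training_step() in a custom Algorithm, here is a sketch that assumes the rollout and train utilities below are available under these names in your RLlib version:
>>> from ray.rllib.algorithms.algorithm import Algorithm
>>> from ray.rllib.execution.rollout_ops import synchronous_parallel_sample
>>> from ray.rllib.execution.train_ops import train_one_step
>>> class MyAlgo(Algorithm):
...     def training_step(self) -> dict:
...         # Sample from all remote RolloutWorkers in parallel.
...         train_batch = synchronous_parallel_sample(worker_set=self.workers)
...         # Update the (possibly multiple) policies on the collected batch.
...         results = train_one_step(self, train_batch)
...         # Push the updated weights back to the remote workers.
...         self.workers.sync_weights()
...         return results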
- compute_single_action(observation: Optional[Union[numpy.array, tf.Tensor, torch.Tensor, dict, tuple]] = None, state: Optional[List[Union[numpy.array, tf.Tensor, torch.Tensor, dict, tuple]]] = None, *, prev_action: Optional[Union[numpy.array, tf.Tensor, torch.Tensor, dict, tuple]] = None, prev_reward: Optional[float] = None, info: Optional[dict] = None, input_dict: Optional[ray.rllib.policy.sample_batch.SampleBatch] = None, policy_id: str = 'default_policy', full_fetch: bool = False, explore: Optional[bool] = None, timestep: Optional[int] = None, episode: Optional[ray.rllib.evaluation.episode.Episode] = None, unsquash_action: Optional[bool] = None, clip_action: Optional[bool] = None, unsquash_actions=- 1, clip_actions=- 1, **kwargs) Union[numpy.array, tf.Tensor, torch.Tensor, dict, tuple, Tuple[Union[numpy.array, tf.Tensor, torch.Tensor, dict, tuple], List[Union[numpy.array, tf.Tensor, torch.Tensor]], Dict[str, Union[numpy.array, tf.Tensor, torch.Tensor]]]] [source]#
Computes an action for the specified policy on the local worker.
Note that you can also access the policy object through self.get_policy(policy_id) and call compute_single_action() on it directly.
- Parameters
observation – Single (unbatched) observation from the environment.
state – List of all RNN hidden (single, unbatched) state tensors.
prev_action – Single (unbatched) previous action value.
prev_reward – Single (unbatched) previous reward value.
info – Env info dict, if any.
input_dict – An optional SampleBatch that holds all the values for: obs, state, prev_action, and prev_reward, plus maybe custom defined views of the current env trajectory. Note that only one of obs or input_dict must be non-None.
policy_id – Policy to query (only applies to multi-agent). Default: "default_policy".
full_fetch – Whether to return extra action fetch results. This is always set to True if state is specified.
explore – Whether to apply exploration to the action. Default: None -> use self.config["explore"].
timestep – The current (sampling) time step.
episode – This provides access to all of the internal episodes’ state, which may be useful for model-based or multi-agent algorithms.
unsquash_action – Should actions be unsquashed according to the env’s/Policy’s action space? If None, use the value of self.config[“normalize_actions”].
clip_action – Should actions be clipped according to the env’s/Policy’s action space? If None, use the value of self.config[“clip_actions”].
- Keyword Arguments
kwargs – forward compatibility placeholder
- Returns
The computed action if full_fetch=False, or a tuple consisting of the full output of policy.compute_actions() if full_fetch=True or the Policy is RNN-based.
- Raises
KeyError – If the policy_id cannot be found in this Trainer's local worker.
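A usage sketch (assuming algo was trained on CartPole-v1; depending on your gym version, reset() may also return an info dict):
>>> import gym
>>> env = gym.make("CartPole-v1")
>>> obs = env.reset()
>>> action = algo.compute_single_action(obs, explore=False)
>>> obs, reward, done, info = env.step(action)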
- compute_actions(observations: Union[numpy.array, tf.Tensor, torch.Tensor, dict, tuple], state: Optional[List[Union[numpy.array, tf.Tensor, torch.Tensor, dict, tuple]]] = None, *, prev_action: Optional[Union[numpy.array, tf.Tensor, torch.Tensor, dict, tuple]] = None, prev_reward: Optional[Union[numpy.array, tf.Tensor, torch.Tensor, dict, tuple]] = None, info: Optional[dict] = None, policy_id: str = 'default_policy', full_fetch: bool = False, explore: Optional[bool] = None, timestep: Optional[int] = None, episodes: Optional[List[ray.rllib.evaluation.episode.Episode]] = None, unsquash_actions: Optional[bool] = None, clip_actions: Optional[bool] = None, normalize_actions=None, **kwargs)[source]#
Computes an action for the specified policy on the local Worker.
Note that you can also access the policy object through self.get_policy(policy_id) and call compute_actions() on it directly.
- Parameters
observations – Observation(s) from the environment.
state – RNN hidden state, if any. If state is not None, then all of compute_single_action(…) is returned (computed action, rnn state(s), logits dictionary). Otherwise compute_single_action(…)[0] is returned (computed action).
prev_action – Previous action value, if any.
prev_reward – Previous reward, if any.
info – Env info dict, if any.
policy_id – Policy to query (only applies to multi-agent).
full_fetch – Whether to return extra action fetch results. This is always set to True if RNN state is specified.
explore – Whether to pick an exploitation or exploration action (default: None -> use self.config[“explore”]).
timestep – The current (sampling) time step.
episodes – This provides access to all of the internal episodes’ state, which may be useful for model-based or multi-agent algorithms.
unsquash_actions – Should actions be unsquashed according to the env’s/Policy’s action space? If None, use self.config[“normalize_actions”].
clip_actions – Should actions be clipped according to the env’s/Policy’s action space? If None, use self.config[“clip_actions”].
- Keyword Arguments
kwargs – forward compatibility placeholder
- Returns
The computed action if full_fetch=False, or a tuple consisting of the full output of policy.compute_actions_from_input_dict() if full_fetch=True or we have an RNN-based Policy.
- get_policy(policy_id: str = 'default_policy') ray.rllib.policy.policy.Policy [source]#
Return policy for the specified id, or None.
- Parameters
policy_id – ID of the policy to return.
- get_weights(policies: Optional[List[str]] = None) dict [source]#
Return a dictionary of policy ids to weights.
- Parameters
policies – Optional list of policies to return weights for, or None for all policies.
- set_weights(weights: Dict[str, dict])[source]#
Set policy weights by policy id.
- Parameters
weights – Map of policy ids to weights to set.
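A sketch of syncing weights between two Algorithm instances (trainer_algo and eval_algo are hypothetical, separately built instances of the same config):
>>> weights = trainer_algo.get_weights(["default_policy"])
>>> eval_algo.set_weights(weights)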
- add_policy(policy_id: str, policy_cls: Optional[Type[ray.rllib.policy.policy.Policy]] = None, policy: Optional[ray.rllib.policy.policy.Policy] = None, *, observation_space: Optional[gym.spaces.Space] = None, action_space: Optional[gym.spaces.Space] = None, config: Optional[Union[ray.rllib.algorithms.algorithm_config.AlgorithmConfig, dict]] = None, policy_state: Optional[Dict[str, Union[numpy.array, tf.Tensor, torch.Tensor, dict, tuple]]] = None, policy_mapping_fn: Optional[Callable[[Any, int], str]] = None, policies_to_train: Optional[Union[Container[str], Callable[[str, Optional[Union[SampleBatch, MultiAgentBatch]]], bool]]] = None, evaluation_workers: bool = True, workers: Optional[List[Union[ray.rllib.evaluation.rollout_worker.RolloutWorker, ray.actor.ActorHandle]]] = -1) Optional[ray.rllib.policy.policy.Policy] [source]#
Adds a new policy to this Algorithm.
- Parameters
policy_id – ID of the policy to add. IMPORTANT: Must not contain characters that are not allowed in Unix/Win filesystems, such as <>:"/|?*, and must not end with a dot . or a space.
policy_cls – The Policy class to use for constructing the new Policy. Note: Only one of policy_cls or policy must be provided.
policy – The Policy instance to add to this algorithm. If not None, the given Policy object will be directly inserted into the Algorithm's local worker and clones of that Policy will be created on all remote workers as well as all evaluation workers. Note: Only one of policy_cls or policy must be provided.
observation_space – The observation space of the policy to add. If None, try to infer this space from the environment.
action_space – The action space of the policy to add. If None, try to infer this space from the environment.
config – The config object or overrides for the policy to add.
policy_state – Optional state dict to apply to the new policy instance, right after its construction.
policy_mapping_fn – An optional (updated) policy mapping function to use from here on. Note that already ongoing episodes will not change their mapping but will use the old mapping till the end of the episode.
policies_to_train – An optional list of policy IDs to be trained or a callable taking PolicyID and SampleBatchType and returning a bool (trainable or not?). If None, will keep the existing setup in place. Policies whose IDs are not in the list (or for which the callable returns False) will not be updated.
evaluation_workers – Whether to add the new policy also to the evaluation WorkerSet.
workers – A list of RolloutWorker/ActorHandles (remote RolloutWorkers) to add this policy to. If defined, will only add the given policy to these workers.
- Returns
The newly added policy (the copy that got added to the local worker). If workers was provided, None is returned.
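A sketch of adding a policy on the fly (MyPolicyClass and the mapping function are hypothetical placeholders; the exact policy_mapping_fn signature depends on your RLlib version):
>>> algo.add_policy(
...     policy_id="new_agent_policy",
...     policy_cls=MyPolicyClass,
...     policy_mapping_fn=lambda agent_id, *args, **kwargs: "new_agent_policy",
... )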
- remove_policy(policy_id: str = 'default_policy', *, policy_mapping_fn: Optional[Callable[[Any], str]] = None, policies_to_train: Optional[Union[Container[str], Callable[[str, Optional[Union[SampleBatch, MultiAgentBatch]]], bool]]] = None, evaluation_workers: bool = True) None [source]#
Removes a policy from this Algorithm.
- Parameters
policy_id – ID of the policy to be removed.
policy_mapping_fn – An optional (updated) policy mapping function to use from here on. Note that already ongoing episodes will not change their mapping but will use the old mapping till the end of the episode.
policies_to_train – An optional list of policy IDs to be trained or a callable taking PolicyID and SampleBatchType and returning a bool (trainable or not?). If None, will keep the existing setup in place. Policies whose IDs are not in the list (or for which the callable returns False) will not be updated.
evaluation_workers – Whether to also remove the policy from the evaluation WorkerSet.
- export_policy_model(export_dir: str, policy_id: str = 'default_policy', onnx: Optional[int] = None) None [source]#
Exports policy model with given policy_id to a local directory.
- Parameters
export_dir – Writable local directory.
policy_id – Optional policy id to export.
onnx – If given, will export model in ONNX format. The value of this parameter sets the ONNX OpSet version to use. If None, the output format will be DL framework specific.
Example
>>> from ray.rllib.algorithms.ppo import PPO
>>> # Use an Algorithm from RLlib or define your own.
>>> algo = PPO(...)
>>> for _ in range(10):
...     algo.train()
>>> algo.export_policy_model("/tmp/dir")
>>> algo.export_policy_model("/tmp/dir/onnx", onnx=1)
- export_policy_checkpoint(export_dir: str, filename_prefix=- 1, policy_id: str = 'default_policy') None [source]#
Exports Policy checkpoint to a local directory and returns an AIR Checkpoint.
- Parameters
export_dir – Writable local directory to store the AIR Checkpoint information into.
policy_id – Optional policy ID to export. If not provided, will export "default_policy". If policy_id does not exist in this Algorithm, will raise a KeyError.
- Raises
KeyError – If policy_id cannot be found in this Algorithm.
Example
>>> from ray.rllib.algorithms.ppo import PPO
>>> # Use an Algorithm from RLlib or define your own.
>>> algo = PPO(...)
>>> for _ in range(10):
...     algo.train()
>>> algo.export_policy_checkpoint("/tmp/export_dir")
- import_policy_model_from_h5(import_file: str, policy_id: str = 'default_policy') None [source]#
Imports a policy’s model with given policy_id from a local h5 file.
- Parameters
import_file – The h5 file to import from.
policy_id – Optional policy id to import into.
Example
>>> from ray.rllib.algorithms.ppo import PPO
>>> algo = PPO(...)
>>> algo.import_policy_model_from_h5("/tmp/weights.h5")
>>> for _ in range(10):
...     algo.train()
- save_checkpoint(checkpoint_dir: str) str [source]#
Exports AIR Checkpoint to a local directory and returns its directory path.
The structure of an Algorithm checkpoint dir will be as follows:
policies/
    pol_1/
        policy_state.pkl
    pol_2/
        policy_state.pkl
rllib_checkpoint.json
algorithm_state.pkl
Note: rllib_checkpoint.json contains a "version" key (e.g. with value 0.1) that helps RLlib remain backward compatible with respect to restoring from checkpoints written from Ray 2.0 onwards.
- Parameters
checkpoint_dir – The directory where the checkpoint files will be stored.
- Returns
The path to the created AIR Checkpoint directory.
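A sketch of writing a checkpoint into a pre-created directory and restoring from it (the paths are placeholders):
>>> import os
>>> from ray.rllib.algorithms.algorithm import Algorithm
>>> os.makedirs("/tmp/my_algo_ckpt", exist_ok=True)
>>> ckpt_dir = algo.save_checkpoint("/tmp/my_algo_ckpt")
>>> restored = Algorithm.from_checkpoint(ckpt_dir)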
- load_checkpoint(checkpoint: Union[Dict, str]) None [source]#
Subclasses should override this to implement restore().
Warning
In this method, do not rely on absolute paths. The absolute path of the checkpoint_dir used in Trainable.save_checkpoint may be changed.
If Trainable.save_checkpoint returned a prefixed string, the prefix of the checkpoint string returned by Trainable.save_checkpoint may be changed. This is because trial pausing depends on temporary directories.
The directory structure under the checkpoint_dir provided to Trainable.save_checkpoint is preserved.
See the examples below.
Example
>>> import os
>>> from ray.tune.trainable import Trainable
>>> class Example(Trainable):
...     def save_checkpoint(self, checkpoint_path):
...         my_checkpoint_path = os.path.join(checkpoint_path, "my/path")
...         return my_checkpoint_path
...     def load_checkpoint(self, my_checkpoint_path):
...         print(my_checkpoint_path)
>>> trainer = Example()
>>> # This is used when PAUSED.
>>> obj = trainer.save_to_object()
<logdir>/tmpc8k_c_6hsave_to_object/checkpoint_0/my/path
>>> # Note the different prefix.
>>> trainer.restore_from_object(obj)
<logdir>/tmpb87b5axfrestore_from_object/checkpoint_0/my/path
If Trainable.save_checkpoint returned a dict, then Tune will directly pass the dict data as the argument to this method.
Example
>>> from ray.tune.trainable import Trainable
>>> class Example(Trainable):
...     def save_checkpoint(self, checkpoint_path):
...         return {"my_data": 1}
...     def load_checkpoint(self, checkpoint_dict):
...         print(checkpoint_dict["my_data"])
New in version 0.8.7.
- Parameters
checkpoint – If dict, the return value is as returned by save_checkpoint. If a string, then it is a checkpoint path that may have a different prefix than that returned by save_checkpoint. The directory structure underneath the checkpoint_dir from save_checkpoint is preserved.
- log_result(result: dict) None [source]#
Subclasses can optionally override this to customize logging.
The logging here is done on the worker process rather than the driver.
New in version 0.8.7.
- Parameters
result – Training result returned by step().
- cleanup() None [source]#
Subclasses should override this for any cleanup on stop.
If any Ray actors are launched in the Trainable (i.e., with an RLlib trainer), be sure to kill the Ray actor process here.
This process should be lightweight.
You can kill a Ray actor by calling ray.kill(actor) on the actor or by removing all references to it and waiting for garbage collection.
New in version 0.8.7.
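A sketch of a subclass cleaning up a custom Ray actor on stop (MyHelperActor is a hypothetical @ray.remote class created in setup()):
>>> import ray
>>> class MyAlgo(Algorithm):
...     def setup(self, config):
...         super().setup(config)
...         self._helper = MyHelperActor.remote()
...     def cleanup(self):
...         super().cleanup()
...         if getattr(self, "_helper", None) is not None:
...             ray.kill(self._helper)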
- classmethod default_resource_request(config: Union[ray.rllib.algorithms.algorithm_config.AlgorithmConfig, dict]) Union[ray.tune.resources.Resources, ray.tune.execution.placement_groups.PlacementGroupFactory] [source]#
Provides a static resource requirement for the given configuration.
This can be overridden by sub-classes to set the correct trial resource allocation, so the user does not need to.
@classmethod
def default_resource_request(cls, config):
    return PlacementGroupFactory([{"CPU": 1}, {"CPU": 1}])
- Parameters
config – The Trainable's config dict.
- Returns
A Resources object or PlacementGroupFactory consumed by Tune for queueing.
- Return type
Union[Resources, PlacementGroupFactory]
- classmethod resource_help(config: Union[ray.rllib.algorithms.algorithm_config.AlgorithmConfig, dict]) str [source]#
Returns a help string for configuring this trainable’s resources.
- Parameters
config – The Trainer’s config dict.
- get_auto_filled_metrics(now: Optional[datetime.datetime] = None, time_this_iter: Optional[float] = None, debug_metrics_only: bool = False) dict [source]#
Return a dict with metrics auto-filled by the trainable.
If debug_metrics_only is True, only metrics that don't require at least one iteration will be returned (ray.tune.result.DEBUG_METRICS).
- classmethod merge_trainer_configs(config1: dict, config2: dict, _allow_unknown_configs: Optional[bool] = None) dict [source]#
Merges a complete Algorithm config dict with a partial override dict.
Respects nested structures within the config dicts. The values in the partial override dict take priority.
- Parameters
config1 – The complete Algorithm's dict to be merged (overridden) with config2.
config2 – The partial override config dict to merge on top of config1.
_allow_unknown_configs – If True, keys in config2 that don't exist in config1 are allowed and will be added to the final config.
- Returns
The merged full algorithm config dict.
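A usage sketch merging a partial override dict into a full PPO config dict:
>>> from ray.rllib.algorithms.ppo import PPO, PPOConfig
>>> full_config = PPOConfig().to_dict()
>>> overrides = {"lr": 0.0001, "train_batch_size": 2000}
>>> merged = PPO.merge_trainer_configs(full_config, overrides)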
- static validate_env(env: Any, env_context: ray.rllib.env.env_context.EnvContext) None [source]#
Env validator function for this Algorithm class.
Override this in child classes to define custom validation behavior.
- Parameters
env – The (sub-)environment to validate. This is normally a single sub-environment (e.g. a gym.Env) within a vectorized setup.
env_context – The EnvContext to configure the environment.
- Raises
Exception – In case something is wrong with the given environment.
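A sketch of a child Algorithm enforcing a discrete action space in validate_env():
>>> import gym
>>> from ray.rllib.algorithms.algorithm import Algorithm
>>> class MyAlgo(Algorithm):
...     @staticmethod
...     def validate_env(env, env_context):
...         # Reject sub-environments without a discrete action space.
...         if not isinstance(env.action_space, gym.spaces.Discrete):
...             raise ValueError("MyAlgo only supports Discrete action spaces.")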