Note

Ray 2.10.0 introduces the alpha stage of RLlib’s “new API stack”. The team is currently transitioning algorithms, example scripts, and documentation to the new code base throughout the subsequent minor releases leading up to Ray 3.0.

See here for more details on how to activate and use the new API stack.

Advanced Python APIs#

Custom Training Workflows#

In the basic training example, Tune will call train() on your algorithm once per training iteration and report the new training results. Sometimes, it’s desirable to have full control over training, but still run inside Tune. Tune supports custom trainable functions that can be used to implement custom training workflows (example).

Curriculum Learning#

In curriculum learning, you can set the environment to different difficulties throughout the training process. This setting allows the algorithm to learn how to solve the actual and final problem incrementally, by interacting with and exploring in more and more difficult phases. Normally, such a curriculum starts with setting the environment to an easy level and then - as training progresses - transitions more toward a harder-to-solve difficulty. See the Reverse Curriculum Generation for Reinforcement Learning Agents blog post for another example of how you can do curriculum learning.

RLlib’s Algorithm and custom callbacks APIs allow for implementing any arbitrary curricula. This example script introduces the basic concepts you need to understand.

First, define some env options. This example uses the FrozenLake-v1 environment, a grid world, whose map is fully customizable. Three tasks of different env difficulties are represented by slightly different maps that the agent has to navigate.

ENV_OPTIONS = {
    "is_slippery": False,
    # Limit the number of steps the agent is allowed to make in the env to
    # make it almost impossible to learn without the curriculum.
    "max_episode_steps": 16,
}

# Our 3 tasks: 0=easiest, 1=medium, 2=hard
ENV_MAPS = [
    # 0
    [
        "SFFHFFFH",
        "FFFHFFFF",
        "FFGFFFFF",
        "FFFFFFFF",
        "HFFFFFFF",
        "HHFFFFHF",
        "FFFFFHHF",
        "FHFFFFFF",
    ],
    # 1
    [
        "SFFHFFFH",
        "FFFHFFFF",
        "FFFFFFFF",
        "FFFFFFFF",
        "HFFFFFFF",
        "HHFFGFHF",
        "FFFFFHHF",
        "FHFFFFFF",
    ],
    # 2
    [
        "SFFHFFFH",
        "FFFHFFFF",
        "FFFFFFFF",
        "FFFFFFFF",
        "HFFFFFFF",
        "HHFFFFHF",
        "FFFFFHHF",
        "FHFFFFFG",
    ],
]

Then, define the central piece controlling the curriculum, which is a custom callbacks class overriding the on_train_result().

import ray
from ray import tune
from ray.rllib.agents.callbacks import DefaultCallbacks

class MyCallbacks(DefaultCallbacks):
    def on_train_result(self, algorithm, result, **kwargs):
        if result["env_runners"]["episode_return_mean"] > 200:
            task = 2
        elif result["env_runners"]["episode_return_mean"] > 100:
            task = 1
        else:
            task = 0
        algorithm.env_runner_group.foreach_worker(
            lambda ev: ev.foreach_env(
                lambda env: env.set_task(task)))

ray.init()
tune.Tuner(
    "PPO",
    param_space={
        "env": YourEnv,
        "callbacks": MyCallbacks,
    },
).fit()

Global Coordination#

Sometimes, it’s necessary to coordinate between pieces of code that live in different processes managed by RLlib. For example, it can be useful to maintain a global average of a certain variable, or centrally control a hyperparameter used by policies. Ray provides a general way to achieve this through named actors (learn more about Ray actors here). These actors are assigned a global name and handles to them can be retrieved using these names. As an example, consider maintaining a shared global counter that’s incremented by environments and read periodically from your driver program:

import ray


@ray.remote
class Counter:
    def __init__(self):
        self.count = 0

    def inc(self, n):
        self.count += n

    def get(self):
        return self.count


# on the driver
counter = Counter.options(name="global_counter").remote()
print(ray.get(counter.get.remote()))  # get the latest count

# in your envs
counter = ray.get_actor("global_counter")
counter.inc.remote(1)  # async call to increment the global count

Ray actors provide high levels of performance, so in more complex cases they can be used implement communication patterns such as parameter servers and all-reduce.

Callbacks and Custom Metrics#

You can provide callbacks to be called at points during policy evaluation. These callbacks have access to state for the current episode. Certain callbacks such as on_postprocess_trajectory, on_sample_end, and on_train_result are also places where custom postprocessing can be applied to intermediate data or results.

User-defined state can be stored for the episode in the episode.user_data dict, and custom scalar metrics reported by saving values to the episode.custom_metrics dict. These custom metrics are aggregated and reported as part of training results. For a full example, take a look at this example script here and these unit test cases here.

Tip

You can create custom logic that can run on each evaluation episode by checking if the RolloutWorker is in evaluation mode, through accessing worker.policy_config["in_evaluation"]. You can then implement this check in on_episode_start() or on_episode_end() in your subclass of DefaultCallbacks. For running callbacks before and after the evaluation runs in whole we provide on_evaluate_start() and on_evaluate_end.

Click here to see the full API of the DefaultCallbacks class
class ray.rllib.algorithms.callbacks.DefaultCallbacks[source]#

Abstract base class for RLlib callbacks (similar to Keras callbacks).

These callbacks can be used for custom metrics and custom postprocessing.

By default, all of these callbacks are no-ops. To configure custom training callbacks, subclass DefaultCallbacks and then set {“callbacks”: YourCallbacksClass} in the algo config.

on_algorithm_init(*, algorithm: Algorithm, metrics_logger: MetricsLogger | None = None, **kwargs) None[source]#

Callback run when a new Algorithm instance has finished setup.

This method gets called at the end of Algorithm.setup() after all the initialization is done, and before actually training starts.

Parameters:
  • algorithm – Reference to the Algorithm instance.

  • metrics_logger – The MetricsLogger object inside the Algorithm. Can be used to log custom metrics after algo initialization.

  • kwargs – Forward compatibility placeholder.

on_workers_recreated(*, algorithm: Algorithm, worker_set: EnvRunnerGroup, worker_ids: List[int], is_evaluation: bool, **kwargs) None[source]#

Callback run after one or more workers have been recreated.

You can access (and change) the worker(s) in question via the following code snippet inside your custom override of this method:

Note that any “worker” inside the algorithm’s self.env_runner_group and self.eval_env_runner_group are instances of a subclass of EnvRunner.

class MyCallbacks(DefaultCallbacks):
    def on_workers_recreated(
        self,
        *,
        algorithm,
        worker_set,
        worker_ids,
        is_evaluation,
        **kwargs,
    ):
        # Define what you would like to do on the recreated
        # workers:
        def func(w):
            # Here, we just set some arbitrary property to 1.
            if is_evaluation:
                w._custom_property_for_evaluation = 1
            else:
                w._custom_property_for_training = 1

        # Use the `foreach_workers` method of the worker set and
        # only loop through those worker IDs that have been restarted.
        # Note that we set `local_worker=False` to NOT include it (local
        # workers are never recreated; if they fail, the entire Algorithm
        # fails).
        worker_set.foreach_worker(
            func,
            remote_worker_ids=worker_ids,
            local_worker=False,
        )
Parameters:
  • algorithm – Reference to the Algorithm instance.

  • worker_set – The EnvRunnerGroup object in which the workers in question reside. You can use a worker_set.foreach_worker(remote_worker_ids=..., local_worker=False) method call to execute custom code on the recreated (remote) workers. Note that the local worker is never recreated as a failure of this would also crash the Algorithm.

  • worker_ids – The list of (remote) worker IDs that have been recreated.

  • is_evaluation – Whether worker_set is the evaluation EnvRunnerGroup (located in Algorithm.eval_env_runner_group) or not.

on_checkpoint_loaded(*, algorithm: Algorithm, **kwargs) None[source]#

Callback run when an Algorithm has loaded a new state from a checkpoint.

This method gets called at the end of Algorithm.load_checkpoint().

Parameters:
  • algorithm – Reference to the Algorithm instance.

  • kwargs – Forward compatibility placeholder.

on_create_policy(*, policy_id: str, policy: Policy) None[source]#

Callback run whenever a new policy is added to an algorithm.

Parameters:
  • policy_id – ID of the newly created policy.

  • policy – The policy just created.

on_environment_created(*, env_runner: EnvRunner, metrics_logger: MetricsLogger | None = None, env: gymnasium.Env, env_context: EnvContext, **kwargs) None[source]#

Callback run when a new environment object has been created.

Note: This only applies to the new API stack. The env used is usually a gym.Env (or more specifically a gym.vector.Env).

Parameters:
  • env_runner – Reference to the current EnvRunner instance.

  • metrics_logger – The MetricsLogger object inside the env_runner. Can be used to log custom metrics after environment creation.

  • env – The environment object that has been created on env_runner. This is usually a gym.Env (or a gym.vector.Env) object.

  • env_context – The EnvContext object that has been passed to the gym.make() call as kwargs (and to the gym.Env as config). It should have all the config key/value pairs in it as well as the EnvContext-typical properties: worker_index, num_workers, and remote.

  • kwargs – Forward compatibility placeholder.

on_sub_environment_created(*, worker: EnvRunner, sub_environment: Any | gymnasium.Env, env_context: EnvContext, env_index: int | None = None, **kwargs) None[source]#

Callback run when a new sub-environment has been created.

This method gets called after each sub-environment (usually a gym.Env) has been created, validated (RLlib built-in validation + possible custom validation function implemented by overriding Algorithm.validate_env()), wrapped (e.g. video-wrapper), and seeded.

Parameters:
  • worker – Reference to the current rollout worker.

  • sub_environment – The sub-environment instance that has been created. This is usually a gym.Env object.

  • env_context – The EnvContext object that has been passed to the env’s constructor.

  • env_index – The index of the sub-environment that has been created (within the vector of sub-environments of the BaseEnv).

  • kwargs – Forward compatibility placeholder.

on_episode_created(*, episode: SingleAgentEpisode | MultiAgentEpisode | EpisodeV2, worker: EnvRunner | None = None, env_runner: EnvRunner | None = None, metrics_logger: MetricsLogger | None = None, base_env: BaseEnv | None = None, env: gymnasium.Env | None = None, policies: Dict[str, Policy] | None = None, rl_module: RLModule | None = None, env_index: int, **kwargs) None[source]#

Callback run when a new episode is created (but has not started yet!).

This method gets called after a new Episode(V2) (old stack) or MultiAgentEpisode instance has been created. This happens before the respective sub-environment’s (usually a gym.Env) reset() is called by RLlib.

Note, at the moment this callback does not get called in the new API stack and single-agent mode.

  1. Episode(V2)/MultiAgentEpisode created: This callback is called.

  2. Respective sub-environment (gym.Env) is reset().

  3. Callback on_episode_start is called.

  4. Stepping through sub-environment/episode commences.

Parameters:
  • episode – The newly created episode. On the new API stack, this will be a MultiAgentEpisode object. On the old API stack, this will be a Episode or EpisodeV2 object. This is the episode that is about to be started with an upcoming env.reset(). Only after this reset call, the on_episode_start callback will be called.

  • env_runner – Replaces worker arg. Reference to the current EnvRunner.

  • metrics_logger – The MetricsLogger object inside the env_runner. Can be used to log custom metrics after Episode creation.

  • env – Replaces base_env arg. The gym.Env (new API stack) or RLlib BaseEnv (old API stack) running the episode. On the old stack, the underlying sub environment objects can be retrieved by calling base_env.get_sub_environments().

  • rl_module – Replaces policies arg. Either the RLModule (new API stack) or a dict mapping policy IDs to policy objects (old stack). In single agent mode there will only be a single policy/RLModule under the rl_module["default_policy"] key.

  • env_index – The index of the sub-environment that is about to be reset (within the vector of sub-environments of the BaseEnv).

  • kwargs – Forward compatibility placeholder.

on_episode_start(*, episode: SingleAgentEpisode | MultiAgentEpisode | EpisodeV2, env_runner: EnvRunner | None = None, metrics_logger: MetricsLogger | None = None, env: gymnasium.Env | None = None, env_index: int, rl_module: RLModule | None = None, worker: EnvRunner | None = None, base_env: BaseEnv | None = None, policies: Dict[str, Policy] | None = None, **kwargs) None[source]#

Callback run right after an Episode has been started.

This method gets called after a SingleAgentEpisode or MultiAgentEpisode instance has been reset with a call to env.reset() by the EnvRunner.

  1. Single-/MultiAgentEpisode created: on_episode_created() is called.

  2. Respective sub-environment (gym.Env) is reset().

  3. Single-/MultiAgentEpisode starts: This callback is called.

  4. Stepping through sub-environment/episode commences.

Parameters:
  • episode – The just started (after env.reset()) SingleAgentEpisode or MultiAgentEpisode object.

  • env_runner – Reference to the EnvRunner running the env and episode.

  • metrics_logger – The MetricsLogger object inside the env_runner. Can be used to log custom metrics during env/episode stepping.

  • env – The gym.Env or gym.vector.Env object running the started episode.

  • env_index – The index of the sub-environment that is about to be reset (within the vector of sub-environments of the BaseEnv).

  • rl_module – The RLModule used to compute actions for stepping the env. In a single-agent setup, this is a (single-agent) RLModule, in a multi- agent setup, this will be a MultiRLModule.

  • kwargs – Forward compatibility placeholder.

on_episode_step(*, episode: SingleAgentEpisode | MultiAgentEpisode | EpisodeV2, env_runner: EnvRunner | None = None, metrics_logger: MetricsLogger | None = None, env: gymnasium.Env | None = None, env_index: int, rl_module: RLModule | None = None, worker: EnvRunner | None = None, base_env: BaseEnv | None = None, policies: Dict[str, Policy] | None = None, **kwargs) None[source]#

Called on each episode step (after the action(s) has/have been logged).

Note that on the new API stack, this callback is also called after the final step of an episode, meaning when terminated/truncated are returned as True from the env.step() call, but is still provided with the non-finalized episode object (meaning the data has NOT been converted to numpy arrays yet).

The exact time of the call of this callback is after env.step([action]) and also after the results of this step (observation, reward, terminated, truncated, infos) have been logged to the given episode object.

Parameters:
  • episode – The just stepped SingleAgentEpisode or MultiAgentEpisode object (after env.step() and after returned obs, rewards, etc.. have been logged to the episode object).

  • env_runner – Reference to the EnvRunner running the env and episode.

  • metrics_logger – The MetricsLogger object inside the env_runner. Can be used to log custom metrics during env/episode stepping.

  • env – The gym.Env or gym.vector.Env object running the started episode.

  • env_index – The index of the sub-environment that has just been stepped.

  • rl_module – The RLModule used to compute actions for stepping the env. In a single-agent setup, this is a (single-agent) RLModule, in a multi- agent setup, this will be a MultiRLModule.

  • kwargs – Forward compatibility placeholder.

on_episode_end(*, episode: SingleAgentEpisode | MultiAgentEpisode | EpisodeV2, env_runner: EnvRunner | None = None, metrics_logger: MetricsLogger | None = None, env: gymnasium.Env | None = None, env_index: int, rl_module: RLModule | None = None, worker: EnvRunner | None = None, base_env: BaseEnv | None = None, policies: Dict[str, Policy] | None = None, **kwargs) None[source]#

Called when an episode is done (after terminated/truncated have been logged).

The exact time of the call of this callback is after env.step([action]) and also after the results of this step (observation, reward, terminated, truncated, infos) have been logged to the given episode object, where either terminated or truncated were True:

  • The env is stepped: final_obs, rewards, ... = env.step([action])

  • The step results are logged episode.add_env_step(final_obs, rewards)

  • Callback on_episode_step is fired.

  • Another env-to-module connector call is made (even though we won’t need any RLModule forward pass anymore). We make this additional call to ensure that in case users use the connector pipeline to process observations (and write them back into the episode), the episode object has all observations - even the terminal one - properly processed.

  • —> This callback on_episode_end() is fired. <—

  • The episode is finalized (i.e. lists of obs/rewards/actions/etc.. are converted into numpy arrays).

Parameters:
  • episode – The terminated/truncated SingleAgent- or MultiAgentEpisode object (after env.step() that returned terminated=True OR truncated=True and after the returned obs, rewards, etc.. have been logged to the episode object). Note that this method is still called before(!) the episode object is finalized, meaning all its timestep data is still present in lists of individual timestep data.

  • env_runner – Reference to the EnvRunner running the env and episode.

  • metrics_logger – The MetricsLogger object inside the env_runner. Can be used to log custom metrics during env/episode stepping.

  • env – The gym.Env or gym.vector.Env object running the started episode.

  • env_index – The index of the sub-environment that has just been terminated or truncated.

  • rl_module – The RLModule used to compute actions for stepping the env. In a single-agent setup, this is a (single-agent) RLModule, in a multi- agent setup, this will be a MultiRLModule.

  • kwargs – Forward compatibility placeholder.

on_evaluate_start(*, algorithm: Algorithm, metrics_logger: MetricsLogger | None = None, **kwargs) None[source]#

Callback before evaluation starts.

This method gets called at the beginning of Algorithm.evaluate().

Parameters:
  • algorithm – Reference to the algorithm instance.

  • metrics_logger – The MetricsLogger object inside the Algorithm. Can be used to log custom metrics before running the next round of evaluation.

  • kwargs – Forward compatibility placeholder.

on_evaluate_end(*, algorithm: Algorithm, metrics_logger: MetricsLogger | None = None, evaluation_metrics: dict, **kwargs) None[source]#

Runs when the evaluation is done.

Runs at the end of Algorithm.evaluate().

Parameters:
  • algorithm – Reference to the algorithm instance.

  • metrics_logger – The MetricsLogger object inside the Algorithm. Can be used to log custom metrics after the most recent evaluation round.

  • evaluation_metrics – Results dict to be returned from algorithm.evaluate(). You can mutate this object to add additional metrics.

  • kwargs – Forward compatibility placeholder.

on_postprocess_trajectory(*, worker: EnvRunner, episode, agent_id: Any, policy_id: str, policies: Dict[str, Policy], postprocessed_batch: SampleBatch, original_batches: Dict[Any, Tuple[Policy, SampleBatch]], **kwargs) None[source]#

Called immediately after a policy’s postprocess_fn is called.

You can use this callback to do additional postprocessing for a policy, including looking at the trajectory data of other agents in multi-agent settings.

Parameters:
  • worker – Reference to the current rollout worker.

  • episode – Episode object.

  • agent_id – Id of the current agent.

  • policy_id – Id of the current policy for the agent.

  • policies – Dict mapping policy IDs to policy objects. In single agent mode there will only be a single “default_policy”.

  • postprocessed_batch – The postprocessed sample batch for this agent. You can mutate this object to apply your own trajectory postprocessing.

  • original_batches – Dict mapping agent IDs to their unpostprocessed trajectory data. You should not mutate this object.

  • kwargs – Forward compatibility placeholder.

on_sample_end(*, env_runner: EnvRunner | None = None, metrics_logger: MetricsLogger | None = None, samples: SampleBatch | List[SingleAgentEpisode | MultiAgentEpisode], worker: EnvRunner | None = None, **kwargs) None[source]#

Called at the end of EnvRunner.sample().

Parameters:
  • env_runner – Reference to the current EnvRunner object.

  • metrics_logger – The MetricsLogger object inside the env_runner. Can be used to log custom metrics during env/episode stepping.

  • samples – Batch to be returned. You can mutate this object to modify the samples generated.

  • kwargs – Forward compatibility placeholder.

on_learn_on_batch(*, policy: Policy, train_batch: SampleBatch, result: dict, **kwargs) None[source]#

Called at the beginning of Policy.learn_on_batch().

Note: This is called before 0-padding via pad_batch_to_sequences_of_same_size.

Also note, SampleBatch.INFOS column will not be available on train_batch within this callback if framework is tf1, due to the fact that tf1 static graph would mistake it as part of the input dict if present. It is available though, for tf2 and torch frameworks.

Parameters:
  • policy – Reference to the current Policy object.

  • train_batch – SampleBatch to be trained on. You can mutate this object to modify the samples generated.

  • result – A results dict to add custom metrics to.

  • kwargs – Forward compatibility placeholder.

on_train_result(*, algorithm: Algorithm, metrics_logger: MetricsLogger | None = None, result: dict, **kwargs) None[source]#

Called at the end of Algorithm.train().

Parameters:
  • algorithm – Current Algorithm instance.

  • metrics_logger – The MetricsLogger object inside the Algorithm. Can be used to log custom metrics after traing results are available.

  • result – Dict of results returned from Algorithm.train() call. You can mutate this object to add additional metrics.

  • kwargs – Forward compatibility placeholder.

Chaining Callbacks#

Use the make_multi_callbacks() utility to chain multiple callbacks together.

ray.rllib.algorithms.callbacks.make_multi_callbacks(callback_class_list: List[Type[DefaultCallbacks]]) DefaultCallbacks[source]#

Allows combining multiple sub-callbacks into one new callbacks class.

The resulting DefaultCallbacks will call all the sub-callbacks’ callbacks when called.

config.callbacks(make_multi_callbacks([
    MyCustomStatsCallbacks,
    MyCustomVideoCallbacks,
    MyCustomTraceCallbacks,
    ....
]))
Parameters:

callback_class_list – The list of sub-classes of DefaultCallbacks to be baked into the to-be-returned class. All of these sub-classes’ implemented methods will be called in the given order.

Returns:

A DefaultCallbacks subclass that combines all the given sub-classes.

Visualizing Custom Metrics#

Custom metrics can be accessed and visualized like any other training result:

../_images/custom_metric.png

Customizing Exploration Behavior#

RLlib offers a unified top-level API to configure and customize an agent’s exploration behavior, including the decisions (how and whether) to sample actions from distributions (stochastically or deterministically). The setup can be done using built-in Exploration classes (see this package), which are specified (and further configured) inside AlgorithmConfig().env_runners(..). Besides using one of the available classes, one can sub-class any of these built-ins, add custom behavior to it, and use that new class in the config instead.

Every policy has-an Exploration object, which is created from the AlgorithmConfig’s .env_runners(exploration_config=...) method, which specifies the class to use through the special “type” key, as well as constructor arguments through all other keys, e.g.:

from ray.rllib.algorithms.algorithm_config import AlgorithmConfig

config = AlgorithmConfig().env_runners(
    exploration_config={
        # Special `type` key provides class information
        "type": "StochasticSampling",
        # Add any needed constructor args here.
        "constructor_arg": "value",
    }
)

The following table lists all built-in Exploration sub-classes and the agents that currently use these by default:

../_images/rllib-exploration-api-table.svg

An Exploration class implements the get_exploration_action method, in which the exact exploratory behavior is defined. It takes the model’s output, the action distribution class, the model itself, a timestep (the global env-sampling steps already taken), and an explore switch and outputs a tuple of a) action and b) log-likelihood:


    def get_exploration_action(self,
                               *,
                               action_distribution: ActionDistribution,
                               timestep: Union[TensorType, int],
                               explore: bool = True):
        """Returns a (possibly) exploratory action and its log-likelihood.

        Given the Model's logits outputs and action distribution, returns an
        exploratory action.

        Args:
            action_distribution: The instantiated
                ActionDistribution object to work with when creating
                exploration actions.
            timestep: The current sampling time step. It can be a tensor
                for TF graph mode, otherwise an integer.
            explore: True: "Normal" exploration behavior.
                False: Suppress all exploratory behavior and return
                a deterministic action.

        Returns:
            A tuple consisting of 1) the chosen exploration action or a
            tf-op to fetch the exploration action from the graph and
            2) the log-likelihood of the exploration action.
        """
        pass

On the highest level, the Algorithm.compute_actions and Policy.compute_actions methods have a boolean explore switch, which is passed into Exploration.get_exploration_action. If explore=None, the value of Algorithm.config[“explore”] is used, which thus serves as a main switch for exploratory behavior, allowing for example turning off any exploration easily for evaluation purposes (see Customized Evaluation During Training).

The following are example excerpts from different Algorithms’ configs (see rllib/algorithms/algorithm.py) to setup different exploration behaviors:

# All of the following configs go into Algorithm.config.

# 1) Switching *off* exploration by default.
# Behavior: Calling `compute_action(s)` without explicitly setting its `explore`
# param will result in no exploration.
# However, explicitly calling `compute_action(s)` with `explore=True` will
# still(!) result in exploration (per-call overrides default).
"explore": False,

# 2) Switching *on* exploration by default.
# Behavior: Calling `compute_action(s)` without explicitly setting its
# explore param will result in exploration.
# However, explicitly calling `compute_action(s)` with `explore=False`
# will result in no(!) exploration (per-call overrides default).
"explore": True,

# 3) Example exploration_config usages:
# a) DQN: see rllib/algorithms/dqn/dqn.py
"explore": True,
"exploration_config": {
   # Exploration sub-class by name or full path to module+class
   # (e.g. “ray.rllib.utils.exploration.epsilon_greedy.EpsilonGreedy”)
   "type": "EpsilonGreedy",
   # Parameters for the Exploration class' constructor:
   "initial_epsilon": 1.0,
   "final_epsilon": 0.02,
   "epsilon_timesteps": 10000,  # Timesteps over which to anneal epsilon.
},

# b) DQN Soft-Q: In order to switch to Soft-Q exploration, do instead:
"explore": True,
"exploration_config": {
   "type": "SoftQ",
   # Parameters for the Exploration class' constructor:
   "temperature": 1.0,
},

# c) All policy-gradient algos and SAC: see rllib/algorithms/algorithm.py
# Behavior: The algo samples stochastically from the
# model-parameterized distribution. This is the global Algorithm default
# setting defined in algorithm.py and used by all PG-type algos (plus SAC).
"explore": True,
"exploration_config": {
   "type": "StochasticSampling",
   "random_timesteps": 0,  # timesteps at beginning, over which to act uniformly randomly
},

Customized Evaluation During Training#

RLlib reports online training rewards, however in some cases you may want to compute rewards with different settings (for example, with exploration turned off, or on a specific set of environment configurations). You can activate evaluating policies during training (Algorithm.train()) by setting the evaluation_interval to an int value (> 0) indicating every how many Algorithm.train() calls an “evaluation step” should be run:

from ray.rllib.algorithms.algorithm_config import AlgorithmConfig

# Run one evaluation step on every 3rd `Algorithm.train()` call.
config = AlgorithmConfig().evaluation(
    evaluation_interval=3,
)

An evaluation step runs - using its own EnvRunner instances - for evaluation_duration episodes or time-steps, depending on the evaluation_duration_unit setting, which can take values of either "episodes" (default) or "timesteps".

# Every time we run an evaluation step, run it for exactly 10 episodes.
config = AlgorithmConfig().evaluation(
    evaluation_duration=10,
    evaluation_duration_unit="episodes",
)
# Every time we run an evaluation step, run it for (close to) 200 timesteps.
config = AlgorithmConfig().evaluation(
    evaluation_duration=200,
    evaluation_duration_unit="timesteps",
)

Note: When using evaluation_duration_unit=timesteps and your evaluation_duration setting isn’t divisible by the number of evaluation workers (configurable with evaluation_num_env_runners), RLlib rounds up the number of time-steps specified to the nearest whole number of time-steps that is divisible by the number of evaluation workers. Also, when using evaluation_duration_unit=episodes and your evaluation_duration setting isn’t divisible by the number of evaluation workers (configurable with evaluation_num_env_runners), RLlib runs the remainder of episodes on the first n evaluation EnvRunners and leave the remaining workers idle for that time.

For example:

# Every time we run an evaluation step, run it for exactly 10 episodes, no matter,
# how many eval workers we have.
config = AlgorithmConfig().evaluation(
    evaluation_duration=10,
    evaluation_duration_unit="episodes",
    # What if number of eval workers is non-dividable by 10?
    # -> Run 7 episodes (1 per eval worker), then run 3 more episodes only using
    #    evaluation workers 1-3 (evaluation workers 4-7 remain idle during that time).
    evaluation_num_env_runners=7,
)

Before each evaluation step, weights from the main model are synchronized to all evaluation workers.

By default, the evaluation step (if there is one in the current iteration) is run right after the respective training step. For example, for evaluation_interval=1, the sequence of events is: train(0->1), eval(1), train(1->2), eval(2), train(2->3), .... Here, the indices show the version of neural network weights used. train(0->1) is an update step that changes the weights from version 0 to version 1 and eval(1) then uses weights version 1. Weights index 0 represents the randomly initialized weights of the neural network.

Another example: For evaluation_interval=2, the sequence is: train(0->1), train(1->2), eval(2), train(2->3), train(3->4), eval(4), ....

Instead of running train- and eval-steps in sequence, it is also possible to run them in parallel with the evaluation_parallel_to_training=True config setting. In this case, both training- and evaluation steps are run at the same time using multi-threading. This can speed up the evaluation process significantly, but leads to a 1-iteration delay between reported training- and evaluation results. The evaluation results are behind in this case b/c they use slightly outdated model weights (synchronized after the previous training step).

For example, for evaluation_parallel_to_training=True and evaluation_interval=1, the sequence is now: train(0->1) + eval(0), train(1->2) + eval(1), train(2->3) + eval(2), where + connects phases happening at the same time. Note that the change in the weights indices with respect to the non-parallel examples. The evaluation weights indices are now “one behind” the resulting train weights indices (train(1->**2**) + eval(**1**)).

When running with the evaluation_parallel_to_training=True setting, a special “auto” value is supported for evaluation_duration. This can be used to make the evaluation step take roughly as long as the concurrently ongoing training step:

# Run evaluation and training at the same time via threading and make sure they roughly
# take the same time, such that the next `Algorithm.train()` call can execute
# immediately and not have to wait for a still ongoing (e.g. b/c of very long episodes)
# evaluation step:
config = AlgorithmConfig().evaluation(
    evaluation_interval=2,
    # run evaluation and training in parallel
    evaluation_parallel_to_training=True,
    # automatically end evaluation when train step has finished
    evaluation_duration="auto",
    evaluation_duration_unit="timesteps",  # <- this setting is ignored; RLlib
    # will always run by timesteps (not by complete
    # episodes) in this duration=auto mode
)

The evaluation_config key allows you to override any config settings for the evaluation workers. For example, to switch off exploration in the evaluation steps, do:

# Switching off exploration behavior for evaluation workers
# (see rllib/algorithms/algorithm.py). Use any keys in this sub-dict that are
# also supported in the main Algorithm config.
config = AlgorithmConfig().evaluation(
    evaluation_config=AlgorithmConfig.overrides(explore=False),
)
# ... which is a more type-checked version of the old-style:
# config = AlgorithmConfig().evaluation(
#    evaluation_config={"explore": False},
# )

Note

Policy gradient algorithms are able to find the optimal policy, even if this is a stochastic one. Setting “explore=False” results in the evaluation workers not using this stochastic policy.

The level of parallelism within the evaluation step is determined by the evaluation_num_env_runners setting. Set this to larger values if you want the desired evaluation episodes or time-steps to run as much in parallel as possible. For example, if your evaluation_duration=10, evaluation_duration_unit=episodes, and evaluation_num_env_runners=10, each evaluation EnvRunner only has to run one episode in each evaluation step.

In case you observe occasional failures in your (evaluation) EnvRunners during evaluation (for example you have an environment that sometimes crashes or stalls), you should use the following combination of settings, minimizing the negative effects of such environment behavior:

Note that with or without parallel evaluation, all fault tolerance settings, such as ignore_env_runner_failures or restart_failed_env_runners are respected and applied to the failed evaluation workers.

Here’s an example:

# Having an environment that occasionally blocks completely for e.g. 10min would
# also affect (and block) training. Here is how you can defend your evaluation setup
# against oft-crashing or -stalling envs (or other unstable components on your evaluation
# workers).
config = AlgorithmConfig().evaluation(
    evaluation_interval=1,
    evaluation_parallel_to_training=True,
    evaluation_duration="auto",
    evaluation_duration_unit="timesteps",  # <- default anyway
    evaluation_force_reset_envs_before_iteration=True,  # <- default anyway
)

This runs the parallel sampling of all evaluation EnvRunners, such that if one of the workers takes too long to run through an episode and return data or fails entirely, the other evaluation EnvRunners still complete the job.

In case you would like to entirely customize the evaluation step, set custom_eval_function in your config to a callable, which takes the Algorithm object and an EnvRunnerGroup object (the Algorithm’s self.evaluation_workers EnvRunnerGroup instance) and returns a metrics dictionary. See algorithm.py for further documentation.

There is also an end-to-end example of how to set up a custom online evaluation in custom_evaluation.py. Note that if you only want to evaluate your policy at the end of training, you can set evaluation_interval: [int], where [int] should be the number of training iterations before stopping.

Below are some examples of how the custom evaluation metrics are reported nested under the evaluation key of normal training results:

------------------------------------------------------------------------
Sample output for `python custom_evaluation.py --no-custom-eval`
------------------------------------------------------------------------

INFO algorithm.py:623 -- Evaluating current policy for 10 episodes.
INFO algorithm.py:650 -- Running round 0 of parallel evaluation (2/10 episodes)
INFO algorithm.py:650 -- Running round 1 of parallel evaluation (4/10 episodes)
INFO algorithm.py:650 -- Running round 2 of parallel evaluation (6/10 episodes)
INFO algorithm.py:650 -- Running round 3 of parallel evaluation (8/10 episodes)
INFO algorithm.py:650 -- Running round 4 of parallel evaluation (10/10 episodes)

Result for PG_SimpleCorridor_2c6b27dc:
  ...
  evaluation:
    env_runners:
      custom_metrics: {}
      episode_len_mean: 15.864661654135338
      episode_return_max: 1.0
      episode_return_mean: 0.49624060150375937
      episode_return_min: 0.0
      episodes_this_iter: 133
------------------------------------------------------------------------
Sample output for `python custom_evaluation.py`
------------------------------------------------------------------------

INFO algorithm.py:631 -- Running custom eval function <function ...>
Update corridor length to 4
Update corridor length to 7
Custom evaluation round 1
Custom evaluation round 2
Custom evaluation round 3
Custom evaluation round 4

Result for PG_SimpleCorridor_0de4e686:
  ...
  evaluation:
    env_runners:
      custom_metrics: {}
      episode_len_mean: 9.15695067264574
      episode_return_max: 1.0
      episode_return_mean: 0.9596412556053812
      episode_return_min: 0.0
      episodes_this_iter: 223
      foo: 1

Rewriting Trajectories#

Note that in the on_postprocess_traj callback you have full access to the trajectory batch (post_batch) and other training state. This can be used to rewrite the trajectory, which has a number of uses including:

  • Backdating rewards to previous time steps (for example, based on values in info).

  • Adding model-based curiosity bonuses to rewards (you can train the model with a custom model supervised loss).

To access the policy / model (policy.model) in the callbacks, note that info['pre_batch'] returns a tuple where the first element is a policy and the second one is the batch itself. You can also access all the rollout worker state using the following call:

from ray.rllib.evaluation.rollout_worker import get_global_worker

# You can use this from any callback to get a reference to the
# RolloutWorker running in the process, which in turn has references to
# all the policies, etc: see rollout_worker.py for more info.
rollout_worker = get_global_worker()

Policy losses are defined over the post_batch data, so you can mutate that in the callbacks to change what data the policy loss function sees.