Note

Ray 2.40 uses RLlib’s new API stack by default. The Ray team has mostly completed transitioning algorithms, example scripts, and documentation to the new code base.

If you’re still using the old API stack, see New API stack migration guide for details on how to migrate.

New API stack migration guide#

This page explains, step by step, how to convert and translate your existing old API stack RLlib classes and code to RLlib’s new API stack.

What’s the new API stack?#

The new API stack is the result of re-writing the core RLlib APIs from scratch and reducing user-facing classes from more than a dozen critical ones down to only a handful of classes, without any loss of features. When designing these new interfaces, the Ray Team strictly applied the following principles:

  • Classes must be usable outside of RLlib.

  • Separation of concerns. Try to answer: “What should get done when and by whom?” and give each class as few non-overlapping and clearly defined tasks as possible.

  • Offer fine-grained modularity, full interoperability, and frictionless pluggability of classes.

  • Use widely accepted third-party standards and APIs wherever possible.

Applying the preceding principles, the Ray Team reduced the important must-know classes for the average RLlib user from eight on the old stack, to only five on the new stack. The core new API stack classes are:

  • RLModule, which replaces ModelV2 and PolicyMap APIs

  • Learner, which replaces RolloutWorker and some of Policy

  • SingleAgentEpisode and MultiAgentEpisode, which replace ViewRequirement, SampleCollector, Episode, and EpisodeV2

  • ConnectorV2, which replaces Connector and some of RolloutWorker and Policy

The AlgorithmConfig and Algorithm APIs remain as-is. These classes are already established APIs on the old stack.

Note

Even though the new API stack still provides rudimentary support for TensorFlow, RLlib officially supports a single deep learning framework, PyTorch, and is dropping TensorFlow support entirely. Note, though, that the Ray team continues to design RLlib to be framework-agnostic and may add support for additional frameworks in the future.

Check your AlgorithmConfig#

RLlib turns on the new API stack by default for all RLlib algorithms.

Note

To deactivate the new API stack and switch back to the old one, use the api_stack() method in your AlgorithmConfig object like so:

config.api_stack(
    enable_rl_module_and_learner=False,
    enable_env_runner_and_connector_v2=False,
)

Note that there are a few other differences between configuring an old API stack algorithm and its new stack counterpart. Go through the following sections and make sure you’re translating the respective settings. Remove settings that the new stack doesn’t support or need.

AlgorithmConfig.framework()#

Even though the new API stack still provides rudimentary support for TensorFlow, RLlib supports a single deep learning framework, the PyTorch framework.

The new API stack deprecates the following framework-related settings:

# Make sure you always set the framework to "torch"...
config.framework("torch")

# ... and drop all tf-specific settings.
config.framework(
    eager_tracing=True,
    eager_max_retraces=20,
    tf_session_args={},
    local_tf_session_args={},
)

AlgorithmConfig.resources()#

The Ray team deprecated the num_gpus and _fake_gpus settings. To place your RLModule on one or more GPUs on the Learner side, do the following:

# The following setting is equivalent to the old stack's `config.resources(num_gpus=2)`.
config.learners(
    num_learners=2,
    num_gpus_per_learner=1,
)

Hint

The num_learners setting determines how many remote Learner workers there are in your Algorithm’s LearnerGroup. If you set this parameter to 0, your LearnerGroup only contains a local Learner that runs on the main process and shares its compute resources, typically 1 CPU. For asynchronous algorithms like IMPALA or APPO, this setting should therefore always be >0.

See here for an example of how to train with fractional GPUs. Also note that for fractional GPUs, you should always set num_learners to 0 or 1.

If GPUs aren’t available, but you want to learn with more than one Learner in a multi-CPU fashion, you can do the following:

config.learners(
    num_learners=2,  # or >2
    num_cpus_per_learner=1,  # <- default
    num_gpus_per_learner=0,  # <- default
)

The Ray team renamed the num_cpus_for_local_worker setting to num_cpus_for_main_process.

config.resources(num_cpus_for_main_process=0)  # default is 1

AlgorithmConfig.training()#

Train batch size#

Due to the new API stack’s Learner worker architecture, training may happen in a distributed fashion over n Learner workers, so you now set the train batch size per individual Learner. Don’t use the train_batch_size setting any longer:

config.training(
    train_batch_size_per_learner=512,
)

You don’t need to change this setting when increasing the number of Learners through config.learners(num_learners=...).

Note that a good rule of thumb for scaling on the learner axis is to keep the train_batch_size_per_learner value constant with a growing number of Learners and to increase the learning rate as follows:

lr = original_lr * (num_learners ** 0.5)
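
For example, a minimal sketch that applies this rule of thumb when scaling from one Learner with an original learning rate of 1e-4 to four Learners; the concrete numbers are only for illustration:

num_learners = 4
original_lr = 1e-4

config.learners(num_learners=num_learners)
config.training(
    # Keep the per-Learner batch size constant while scaling out.
    train_batch_size_per_learner=512,
    # Scale the learning rate with the square root of the number of Learners.
    lr=original_lr * (num_learners ** 0.5),
)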

Neural network configuration#

The old stack’s config.training(model=...) is no longer supported on the new API stack. Instead, use the new rl_module() method to configure RLlib’s default RLModule or specify and configure a custom RLModule.

See RLModules API, a general guide that also explains the use of the config.rl_module() method.
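
For example, a minimal sketch of configuring the default RLModule’s network through config.rl_module(); the DefaultModelConfig import path and the field values shown here are assumptions for illustration, so check the RLModules guide for the exact options your Ray version supports:

from ray.rllib.core.rl_module.default_model_config import DefaultModelConfig

config.rl_module(
    model_config=DefaultModelConfig(
        fcnet_hiddens=[256, 256],  # hidden layer sizes of the default MLP
        fcnet_activation="relu",   # activation function
    ),
)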

If you have an old stack ModelV2 and want to migrate the entire NN logic to the new stack, see ModelV2 to RLModule for migration instructions.

Learning rate- and coefficient schedules#

If you’re using schedules for learning rate or other coefficients, for example, the entropy_coeff setting in PPO, provide scheduling information directly in the respective setting. Scheduling behavior doesn’t require a specific, separate setting anymore.

When defining a schedule, provide a list of 2-tuples, where the first item is the global timestep (num_env_steps_sampled_lifetime in the reported metrics) and the second item is the value that the setting should reach at that timestep. Always start the first 2-tuple with timestep 0. Note that RLlib linearly interpolates values between two provided timesteps.

For example, to create a learning rate schedule that starts with a value of 1e-5, then increases over 1M timesteps to 1e-4 and stays constant after that, do the following:

config.training(
    lr=[
        [0, 1e-5],  # <- initial value at timestep 0
        [1000000, 1e-4],  # <- final value at 1M timesteps
    ],
)

In the preceding example, the value after 500k timesteps is 5.5e-5, the linear interpolation between 1e-5 and 1e-4.

As another example, to create an entropy coefficient schedule that starts with a value of 0.05, increases to 0.1 over 1M timesteps, and then suddenly drops to 0.0 right after the 1Mth timestep, do the following:

config.training(
    entropy_coeff=[
        [0, 0.05],  # <- initial value at timestep 0
        [1000000, 0.1],  # <- value at 1M timesteps
        [1000001, 0.0],  # <- sudden drop to 0.0 right after 1M timesteps
    ]
)

In case you need to configure a more complex learning rate scheduling behavior or chain different schedulers into a pipeline, you can use the experimental _torch_lr_schedule_classes config property. See this example script for more details. Note that this example only covers learning rate schedules, but not any other coefficients.

AlgorithmConfig.learners()#

This method isn’t used on the old API stack because the old stack doesn’t use Learner workers.

It allows you to specify the following; see the combined sketch after this list:

  1. the number of Learner workers through .learners(num_learners=...).

  2. the resources per learner; use .learners(num_gpus_per_learner=1) for GPU training and .learners(num_gpus_per_learner=0) for CPU training.

  3. the custom Learner class you want to use. See this example for more details.

  4. a config dict you would like to set for your custom learner: .learners(learner_config_dict={...}). Note that every Learner has access to the entire AlgorithmConfig object through self.config, but setting the learner_config_dict is a convenient way to avoid having to create an entirely new AlgorithmConfig subclass only to support a few extra settings for your custom Learner class.
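
For example, a minimal sketch that combines these settings; the my_custom_setting key is hypothetical and only meaningful to a custom Learner that reads it from self.config.learner_config_dict:

config.learners(
    num_learners=2,          # two remote Learner workers
    num_gpus_per_learner=1,  # one GPU per Learner; use 0 for CPU-only training
    # Hypothetical extra setting that a custom Learner can read through
    # `self.config.learner_config_dict`.
    learner_config_dict={"my_custom_setting": 0.01},
)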

AlgorithmConfig.env_runners()#

# RolloutWorkers have been replaced by EnvRunners. EnvRunners are more efficient and
# offer a cleaner separation-of-concerns design as well as cleaner code.
config.env_runners(
    num_env_runners=2,  # use this instead of `num_workers`
)

# The following `env_runners` settings are deprecated and should no longer be explicitly
# set on the new stack:
config.env_runners(
    create_env_on_local_worker=False,
    sample_collector=None,
    enable_connectors=True,
    remote_worker_envs=False,
    remote_env_batch_wait_ms=0,
    preprocessor_pref="deepmind",
    enable_tf1_exec_eagerly=False,
    sampler_perf_stats_ema_coef=None,
)

Hint

If you want to IDE-debug what’s going on inside your EnvRunners, set num_env_runners=0 and make sure you are running your experiment locally and not through Ray Tune. In order to do this with any of RLlib’s example or tuned_example scripts, simply set the command line args: --no-tune --num-env-runners=0.

In case you were using the observation_filter setting, perform the following translations:

# For `observation_filter="NoFilter"`, don't set anything in particular. This is the default.

# For `observation_filter="MeanStdFilter"`, do the following:
from ray.rllib.connectors.env_to_module import MeanStdFilter

config.env_runners(
    env_to_module_connector=lambda env: MeanStdFilter(multi_agent=False),  # <- or True
)

Hint

The main switch for whether to explore or not during sample collection has moved to the env_runners() method. See here for more details.

AlgorithmConfig.exploration()#

The main switch for whether to explore or not during sample collection has moved from the deprecated AlgorithmConfig.exploration() method to env_runners(). It determines whether the EnvRunner calls your RLModule’s _forward_exploration() method, in the case explore=True, or its _forward_inference() method, in the case explore=False:

config.env_runners(explore=True)  # <- or False

The Ray team has deprecated the exploration_config setting. Instead, define the exact exploratory behavior, for example, sample an action from a distribution, inside the overridden _forward_exploration() method of your RLModule.
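
As a minimal sketch, the following hypothetical epsilon-greedy behavior, with an assumed epsilon of 0.05 and a placeholder for the actual network call, shows how such exploration logic can live directly inside _forward_exploration():

import torch

from ray.rllib.core.columns import Columns
from ray.rllib.core.rl_module.torch.torch_rl_module import TorchRLModule


class MyEpsilonGreedyRLModule(TorchRLModule):
    def _forward_exploration(self, batch):
        logits = ...  # your network's forward pass over `batch`
        greedy_actions = torch.argmax(logits, dim=-1)
        # With probability epsilon (0.05 here), pick a uniformly random action instead.
        random_actions = torch.randint_like(greedy_actions, logits.shape[-1])
        pick_random = torch.rand(greedy_actions.shape) < 0.05
        return {Columns.ACTIONS: torch.where(pick_random, random_actions, greedy_actions)}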

Custom callbacks#

If you’re using custom callbacks on the old API stack, you’re subclassing the DefaultCallbacks class, which the Ray team renamed to RLlibCallback. You can continue this approach with the new API stack and pass your custom subclass to your config like the following:

config.callbacks(YourCallbacksClass)

However, if you’re overriding methods that trigger on the EnvRunner side, for example on_episode_start, on_episode_step, or on_episode_end, you may have to translate some call arguments.

The following is a one-to-one translation guide for these types of RLlibCallback methods:

from ray.rllib.callbacks.callbacks import RLlibCallback

class YourCallbacksClass(RLlibCallback):

    def on_episode_start(
        self,
        *,
        episode,
        env_runner,
        metrics_logger,
        env,
        env_index,
        rl_module,

        # Old API stack args; don't use or access these inside your method code.
        worker=None,
        base_env=None,
        policies=None,
        **kwargs,
    ):
        # The `SingleAgentEpisode` or `MultiAgentEpisode` that RLlib has just started.
        # See https://docs.ray.io/en/latest/rllib/single-agent-episode.html for more details:
        print(episode)

        # The `EnvRunner` class that collects the episode in question.
        # This class used to be a `RolloutWorker`. On the new stack, this class is either a
        # `SingleAgentEnvRunner` or a `MultiAgentEnvRunner` holding the gymnasium Env,
        # the RLModule, and the 2 connector pipelines, env-to-module and module-to-env.
        print(env_runner)

        # The MetricsLogger object on the EnvRunner (documentation is a WIP).
        print(metrics_logger.peek("episode_return_mean", default=0.0))

        # The gymnasium env that sample collection uses. Note that this env may be a
        # gymnasium.vector.VectorEnv.
        print(env)

        # The env index, in case of a vector env, that handles the `episode`.
        print(env_index)

        # The RL Module that this EnvRunner uses. Note that this module may be a "plain", single-agent
        # `RLModule`, or a `MultiRLModule` in the multi-agent case.
        print(rl_module)

# Change similarly:
# on_episode_created()
# on_episode_step()
# on_episode_end()

The following callback methods are no longer available on the new API stack:

  • on_sub_environment_created(): The new API stack uses Farama’s gymnasium vector Envs, which leaves RLlib no hook for calling a callback on each individual sub-environment’s creation.

  • on_create_policy(): This method is no longer available on the new API stack because only the old stack’s RolloutWorker calls it.

  • on_postprocess_trajectory(): The new API stack no longer triggers and calls this method because ConnectorV2 pipelines handle trajectory processing entirely. The documentation for ConnectorV2 is under development.

ModelV2 to RLModule#

If you’re using a custom ModelV2 class and want to translate the entire NN architecture and possibly action distribution logic to the new API stack, see RL Modules in addition to this section.

Also, see these example scripts on how to write a custom CNN-containing RL Module and how to write a custom LSTM-containing RL Module.

There are various options for translating an existing, custom ModelV2 from the old API stack to the new API stack’s RLModule:

  1. Move your ModelV2 code to a new, custom RLModule class (see RL Modules for details).

  2. Use an Algorithm checkpoint or a Policy checkpoint that you have from an old API stack training run and use this checkpoint with the new stack RL Module convenience wrapper.

  3. Use an existing AlgorithmConfig object from an old API stack training run, with the new stack RL Module convenience wrapper.

In more complex scenarios, you might have implemented a custom Policy to modify how RLlib constructs models and action distributions.

Translating Policy.compute_actions_from_input_dict#

This old API stack method, as well as compute_actions and compute_single_action, directly translates to _forward_inference() and _forward_exploration(). The RLModule guide explains how to implement these methods.

Translating Policy.action_distribution_fn#

To translate action_distribution_fn, write the following custom RLModule code:

from ray.rllib.models.torch.torch_distributions import YOUR_DIST_CLASS


class MyRLModule(TorchRLModule):
    def setup(self):
        ...
        # Set the following attribute at the end of your custom `setup()`.
        self.action_dist_cls = YOUR_DIST_CLASS

Alternatively, if your distribution class differs depending on whether RLlib computes inference actions, exploration actions, or a train forward pass, override the following three methods instead:

from ray.rllib.models.torch.torch_distributions import (
    YOUR_INFERENCE_DIST_CLASS,
    YOUR_EXPLORATION_DIST_CLASS,
    YOUR_TRAIN_DIST_CLASS,
)


class MyRLModule(TorchRLModule):
    def get_inference_action_dist_cls(self):
        return YOUR_INFERENCE_DIST_CLASS

    def get_exploration_action_dist_cls(self):
        return YOUR_EXPLORATION_DIST_CLASS

    def get_train_action_dist_cls(self):
        return YOUR_TRAIN_DIST_CLASS

Translating Policy.action_sampler_fn#

To translate action_sampler_fn, write the following custom RLModule code:

from ray.rllib.core.columns import Columns
from ray.rllib.models.torch.torch_distributions import YOUR_DIST_CLASS


class MyRLModule(TorchRLModule):

    def _forward_exploration(self, batch):
        computation_results = ...
        my_dist = YOUR_DIST_CLASS(computation_results)
        actions = my_dist.sample()
        return {Columns.ACTIONS: actions}

    # Maybe for inference, you would like to sample from the deterministic version
    # of your distribution:
    def _forward_inference(self, batch):
        computation_results = ...
        my_dist = YOUR_DIST_CLASS(computation_results)
        greedy_actions = my_dist.to_deterministic().sample()
        return {Columns.ACTIONS: greedy_actions}

Policy.compute_log_likelihoods#

Implement your custom RLModule’s _forward_train() method and return the Columns.ACTION_LOGP key together with the corresponding action log probabilities. Your loss function, which runs after _forward_train(), can then access this information through Columns.ACTION_LOGP.
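
The following is a minimal sketch of this pattern, assuming a module whose action distribution class supports from_logits() and with a placeholder for the actual network call:

from ray.rllib.core.columns import Columns
from ray.rllib.core.rl_module.torch.torch_rl_module import TorchRLModule


class MyRLModule(TorchRLModule):
    def _forward_train(self, batch):
        logits = ...  # your network's forward pass over `batch`
        action_dist = self.get_train_action_dist_cls().from_logits(logits)
        return {
            Columns.ACTION_DIST_INPUTS: logits,
            # Log probabilities of the actions in the train batch; the loss
            # logic can access these through `Columns.ACTION_LOGP`.
            Columns.ACTION_LOGP: action_dist.logp(batch[Columns.ACTIONS]),
        }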

Custom loss functions and policies#

If you’re using one or more custom loss functions or custom (PyTorch) optimizers to train your models, instead of doing these customizations inside the old stack’s Policy class, you need to move the logic into the new API stack’s Learner class.

See Learner for details on how to write a custom Learner.

The following example scripts show how to write:

  • a simple custom loss function

  • a custom Learner with 2 optimizers and different learning rates for each.
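
As a rough sketch of what such a custom loss can look like, the following adds a hypothetical L2 penalty on top of PPO’s loss; the PPOTorchLearner base class and the compute_loss_for_module() signature are assumptions to verify against the Learner documentation and the linked examples:

import torch

from ray.rllib.algorithms.ppo.torch.ppo_torch_learner import PPOTorchLearner


class MyPPOTorchLearner(PPOTorchLearner):
    def compute_loss_for_module(self, *, module_id, config, batch, fwd_out):
        # Compute PPO's regular loss first.
        base_loss = super().compute_loss_for_module(
            module_id=module_id, config=config, batch=batch, fwd_out=fwd_out
        )
        # Hypothetical extra term: L2 regularization over the module's weights.
        l2 = sum(torch.sum(p ** 2) for p in self.module[module_id].parameters())
        return base_loss + 0.0001 * l2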

Note that the new API stack no longer has the Policy class. On the old stack, this class holds a neural network, which is now the RLModule; an old stack connector, which is now the ConnectorV2; and one or more optimizers and losses, which the Learner class now handles.

The RL Module API is much more flexible than the old stack’s Policy API and provides a cleaner separation-of-concerns experience: things related to action inference run on the EnvRunners, while things related to updating run on the Learner workers. It also provides superior scalability, allowing multi-GPU training on any Ray cluster and multi-node, multi-GPU training on the Anyscale platform.

Custom connectors#

If you’re using custom connectors from the old API stack, move your logic into the new ConnectorV2 API. Translate your agent connectors into env-to-module ConnectorV2 pieces and your action connectors into module-to-env ConnectorV2 pieces.

The ConnectorV2 documentation is under development.
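
As a rough sketch, the following shows where such custom pieces plug into the config; MyEnvToModule and MyModuleToEnv are hypothetical ConnectorV2 subclasses, and the exact factory-callable signatures are assumptions, so check the linked example scripts for the ones your Ray version expects:

config.env_runners(
    # Old agent connectors move into the env-to-module pipeline.
    env_to_module_connector=lambda env: MyEnvToModule(),
    # Old action connectors move into the module-to-env pipeline.
    module_to_env_connector=lambda env, *args, **kwargs: MyModuleToEnv(),
)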

The following are some examples on how to write ConnectorV2 pieces for the different pipelines:

  1. Observation frame-stacking.

  2. Add the most recent action and reward to the RL Module’s input.

  3. Mean-std filtering on all observations.

  4. Flatten any complex observation space to a 1D space.