Note

Ray 2.10.0 introduces the alpha stage of RLlib’s “new API stack”. The Ray Team plans to transition algorithms, example scripts, and documentation to the new code base, thereby incrementally replacing the “old API stack” (e.g., ModelV2, Policy, RolloutWorker) throughout the subsequent minor releases leading up to Ray 3.0.

Note, however, that so far only PPO (single- and multi-agent) and SAC (single-agent only) support the “new API stack”, and both continue to run by default with the old APIs. You can continue to use the existing custom (old stack) classes.

See here for more details on how to use the new API stack.

RLlib Utilities#

Here is a list of all the utilities available in RLlib.

Exploration API#

Exploration is crucial in RL for enabling a learning agent to find new, potentially high-reward states by reaching unexplored areas of the environment.

RLlib has several built-in exploration components that the different algorithms use. You can also customize an algorithm’s exploration behavior by subclassing the Exploration base class and implementing your own logic.

Built-in Exploration components#

Exploration

Implements an exploration strategy for Policies.

Random

A random action selector (deterministic/greedy for explore=False).

StochasticSampling

An exploration that simply samples from a distribution.

EpsilonGreedy

Epsilon-greedy Exploration class that produces exploration actions.

GaussianNoise

An exploration that adds white noise to continuous actions.

OrnsteinUhlenbeckNoise

An exploration that adds Ornstein-Uhlenbeck noise to continuous actions.

RE3

Random Encoder for Efficient Exploration.

Curiosity

Implementation of "Curiosity-driven Exploration by Self-supervised Prediction" (Pathak, Agrawal, Efros, and Darrell, UC Berkeley, ICML 2017).

ParameterNoise

An exploration that changes a Model's parameters.
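To make the idea behind a component such as EpsilonGreedy concrete, the following is a minimal, framework-free sketch of epsilon-greedy action selection. The function name `epsilon_greedy_action` and its parameters are hypothetical and are not part of RLlib; the built-in class works on batched tensors and typically anneals epsilon over time, which this sketch does not attempt to reproduce.

```python
import random


def epsilon_greedy_action(q_values, epsilon, explore=True, rng=random):
    """Pick an action index from a list of Q-values.

    Hypothetical helper for illustration only, not an RLlib function.
    """
    if explore and rng.random() < epsilon:
        # Explore: choose uniformly among all available actions.
        return rng.randrange(len(q_values))
    # Exploit: choose the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `explore=False` the sketch always returns the greedy action, mirroring the deterministic behavior the list above describes for non-exploring components.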

Inference#

get_exploration_action

Returns a (possibly) exploratory action and its log-likelihood.
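As a rough illustration of what "an action and its log-likelihood" means, the sketch below samples from a softmax distribution over logits and returns the chosen action with its log-probability. The name `sample_action_with_logp` is hypothetical, and reporting the greedy action's log-probability when `explore=False` is a choice made for this sketch, not a statement about how RLlib handles that case.

```python
import math
import random


def sample_action_with_logp(logits, explore=True, rng=random):
    """Return (action, log_prob) for a categorical policy (illustrative only)."""
    # Numerically stable softmax: subtract the max logit before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    if explore:
        # Stochastic: sample an action index according to its probability.
        action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    else:
        # Deterministic: take the most likely action.
        action = max(range(len(probs)), key=lambda a: probs[a])

    return action, math.log(probs[action])
```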

Callback hooks#

before_compute_actions

Hook for preparations before policy.compute_actions() is called.

on_episode_start

Handles necessary exploration logic at the beginning of an episode.

on_episode_end

Handles necessary exploration logic at the end of an episode.

postprocess_trajectory

Handles post-processing of done episode trajectories.

Setting and getting states#

get_state

Returns the current exploration state.

set_state

Sets the Exploration object's state to the given values.

Scheduler API#

Use a scheduler to set scheduled values for variables (in Python, PyTorch, or TensorFlow) based on an (int64) timestep input. The computed values are usually float32 types.

Built-in Scheduler components#

Schedule

Schedule classes implement various time-dependent scheduling schemas.

ConstantSchedule

A Schedule where the value remains constant over time.

LinearSchedule

Linear interpolation between initial_p and final_p.

PiecewiseSchedule

Implements a Piecewise Scheduler.

ExponentialSchedule

Exponential decay schedule from initial_p to final_p.

PolynomialSchedule

Polynomial interpolation between initial_p and final_p.
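The interpolation these schedules describe can be sketched in plain Python. The function names below are hypothetical, the parameter names `initial_p`, `final_p`, and `schedule_timesteps` are borrowed for readability, and the polynomial form shown is one common definition assumed here rather than taken from the RLlib source.

```python
def linear_value(t, schedule_timesteps, initial_p, final_p):
    """Linearly interpolate from initial_p to final_p over schedule_timesteps."""
    # Clamp progress to [0, 1] so the value stays at final_p after the horizon.
    fraction = min(max(t / schedule_timesteps, 0.0), 1.0)
    return initial_p + fraction * (final_p - initial_p)


def polynomial_value(t, schedule_timesteps, initial_p, final_p, power=2.0):
    """Polynomial interpolation (assumed form); power=1.0 reduces to linear."""
    fraction = min(max(t / schedule_timesteps, 0.0), 1.0)
    return final_p + (initial_p - final_p) * (1.0 - fraction) ** power
```

For example, `linear_value(50, 100, 1.0, 0.0)` is halfway between the two endpoints, while the quadratic variant has already decayed further at the same timestep.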

Methods#

value

Generates the value given a timestep (based on schedule's logic).

__call__

Simply calls self.value(t).
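The relationship between `value` and `__call__` can be sketched with a small piecewise-linear schedule. The class `PiecewiseLinearSketch` is hypothetical and stands in for no RLlib class; in particular, clamping to the nearest endpoint outside the defined range is an assumption of this sketch.

```python
class PiecewiseLinearSketch:
    """Illustrative schedule defined by (timestep, value) endpoints."""

    def __init__(self, endpoints):
        # Sort by timestep so neighboring pairs define the segments.
        self.endpoints = sorted(endpoints)

    def value(self, t):
        points = self.endpoints
        # Outside the covered range, clamp to the nearest endpoint's value.
        if t <= points[0][0]:
            return points[0][1]
        if t >= points[-1][0]:
            return points[-1][1]
        # Inside the range, interpolate linearly within the matching segment.
        for (t0, v0), (t1, v1) in zip(points, points[1:]):
            if t0 <= t < t1:
                alpha = (t - t0) / (t1 - t0)
                return v0 + alpha * (v1 - v0)
        raise ValueError("no segment covers timestep %r" % (t,))

    def __call__(self, t):
        # Calling the schedule is shorthand for value(t).
        return self.value(t)
```

Usage: `schedule = PiecewiseLinearSketch([(0, 1.0), (10, 0.5), (20, 0.0)])`, after which `schedule(5)` and `schedule.value(5)` return the same number.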

Training Operations Utilities#

multi_gpu_train_one_step

Multi-GPU version of train_one_step.

train_one_step

Function that improves all policies in train_batch on the local worker.

Framework Utilities#

Import utilities#

try_import_torch

Tries importing torch and returns the module (or None).

try_import_tf

Tries importing tf and returns the module (or None).

try_import_tfp

Tries importing tfp and returns the module (or None).
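The "import or return None" pattern behind these helpers can be sketched with the standard library. The generic `try_import` function below is hypothetical; the actual RLlib helpers are framework-specific, and their exact return values may differ from this simplified version.

```python
import importlib


def try_import(module_name):
    """Return the imported module, or None if it is not installed.

    Hypothetical generic helper illustrating the optional-import pattern.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError, so it is covered.
        return None
```

Callers can then guard framework-specific code with a check such as `if module is not None:` instead of failing at import time.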

TensorFlow utilities#

explained_variance

Computes the explained variance for a pair of labels and predictions.

flatten_inputs_to_1d_tensor

Flattens arbitrary input structs according to the given spaces struct.

get_gpu_devices

Returns a list of GPU device names.

get_placeholder

Returns a tf1.placeholder object given optional hints, such as a space.

huber_loss

Computes the huber loss for a given term and delta parameter.

l2_loss

Computes half the L2 norm over a tensor's values without the sqrt.

make_tf_callable

Returns a function that can be executed in either graph or eager mode.

minimize_and_clip

Computes, then clips gradients using objective, optimizer and var list.

one_hot

Returns a one-hot tensor, given an int tensor and a space.

reduce_mean_ignore_inf

Same as tf.reduce_mean() but ignores -inf values.

scope_vars

Get variables inside a given scope.

warn_if_infinite_kl_divergence

zero_logps_from_actions

Helper function useful for returning dummy logp's (0) for some actions.
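Two of the quantities named above, explained variance and Huber loss, have standard textbook definitions that can be written out in plain Python. The sketch below uses scalars and lists instead of tensors, and it is an assumption that the library versions follow these definitions without extra clipping or reduction steps.

```python
def _variance(xs):
    """Population variance of a list of floats."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def explained_variance(y, pred):
    """1 - Var(y - pred) / Var(y); assumes y is not constant."""
    residual = [a - b for a, b in zip(y, pred)]
    return 1.0 - _variance(residual) / _variance(y)


def huber_loss(x, delta=1.0):
    """Quadratic near zero, linear beyond |x| >= delta."""
    abs_x = abs(x)
    if abs_x < delta:
        return 0.5 * x * x
    return delta * (abs_x - 0.5 * delta)
```

A perfect prediction gives an explained variance of 1.0, and predicting a constant gives 0.0.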

Torch utilities#

apply_grad_clipping

Applies gradient clipping to already computed grads inside optimizer.

concat_multi_gpu_td_errors

Concatenates multi-GPU (per-tower) TD error tensors given TorchPolicy.

convert_to_torch_tensor

Converts any struct to torch.Tensors.

explained_variance

Computes the explained variance for a pair of labels and predictions.

flatten_inputs_to_1d_tensor

Flattens arbitrary input structs according to the given spaces struct.

get_device

Returns a torch device depending on a config and current worker index.

global_norm

Returns the global L2 norm over a list of tensors.

huber_loss

Computes the huber loss for a given term and delta parameter.

l2_loss

Computes half the L2 norm over a tensor's values without the sqrt.

minimize_and_clip

Clips grads found in optimizer.param_groups to given value in place.

one_hot

Returns a one-hot tensor, given an int tensor and a space.

reduce_mean_ignore_inf

Same as torch.mean() but ignores -inf values.

sequence_mask

Offers same behavior as tf.sequence_mask for torch.

warn_if_infinite_kl_divergence

set_torch_seed

Sets the torch random seed to the given value.

softmax_cross_entropy_with_logits

Same behavior as tf.nn.softmax_cross_entropy_with_logits.
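For readers unfamiliar with the terms, the sketch below shows what a global L2 norm and a sequence mask compute, using nested Python lists in place of torch tensors. Both function bodies are illustrative assumptions about the semantics described above, not copies of the RLlib implementations.

```python
import math


def global_norm(tensors):
    """L2 norm over all values in a list of flat 'tensors' (lists of floats)."""
    return math.sqrt(sum(v * v for tensor in tensors for v in tensor))


def sequence_mask(lengths, maxlen=None):
    """Boolean mask where row i has True in its first lengths[i] positions."""
    if maxlen is None:
        # Default to the longest sequence, as tf.sequence_mask does.
        maxlen = max(lengths)
    return [[j < n for j in range(maxlen)] for n in lengths]
```

Such a mask is typically used to ignore padded timesteps when averaging a loss over variable-length sequences.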

Numpy utilities#

aligned_array

concat_aligned

convert_to_numpy

Converts values in stats to non-Tensor numpy or python types.

fc

Calculates FC (dense) layer outputs given weights/biases and input.

flatten_inputs_to_1d_tensor

Flattens arbitrary input structs according to the given spaces struct.

make_action_immutable

Flags actions immutable to notify users when trying to change them.

huber_loss

Reference: https://en.wikipedia.org/wiki/Huber_loss.

l2_loss

Computes half the L2 norm of a tensor (w/o the sqrt): sum(x**2) / 2.

lstm

Calculates LSTM layer output given weights/biases, states, and input.

one_hot

One-hot utility function for numpy.

relu

Implementation of the leaky ReLU function.

sigmoid

Returns the sigmoid function applied to x.

softmax

Returns the softmax values for x.
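The math behind several of these helpers is short enough to write out. The sketch below uses plain Python lists and floats as stand-ins for NumPy arrays to stay dependency-free; the function names mirror the list above for readability, but the signatures (for example the `alpha` parameter of the leaky ReLU) are assumptions made for this illustration.

```python
import math


def l2_loss(x):
    """Half the squared L2 norm: sum(x**2) / 2."""
    return sum(v * v for v in x) / 2


def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(x):
    """Softmax over a list, shifted by the max for numerical stability."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def leaky_relu(x, alpha=0.0):
    """Leaky ReLU; alpha=0.0 gives the ordinary ReLU."""
    return x if x > 0 else alpha * x


def one_hot(index, depth):
    """List of length depth with 1.0 at position index and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(depth)]
```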