RLlib Package Reference

ray.rllib.policy

class ray.rllib.policy.Policy(observation_space, action_space, config)[source]

An agent policy and loss, i.e., a TFPolicy or other subclass.

This object defines how to act in the environment, and also losses used to improve the policy based on its experiences. Note that both policy and loss are defined together for convenience, though the policy itself is logically separate.

All policies can directly extend Policy, however TensorFlow users may find TFPolicy simpler to implement. TFPolicy also enables RLlib to apply TensorFlow-specific optimizations such as fusing multiple policy graphs and multi-GPU support.
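
As an illustration, a minimal sketch of a custom subclass that acts randomly and skips learning (the class and its behavior are only examples, not part of RLlib):

>>> from ray.rllib.policy import Policy
>>> class MyRandomPolicy(Policy):
...     """Illustrative policy that samples random actions and never learns."""
...     def compute_actions(self, obs_batch, state_batches=None,
...                         prev_action_batch=None, prev_reward_batch=None,
...                         info_batch=None, episodes=None, **kwargs):
...         # Return (actions, state_outs, info) as documented below.
...         return [self.action_space.sample() for _ in obs_batch], [], {}
...     def learn_on_batch(self, samples):
...         return {}  # no learning
...     def get_weights(self):
...         return {}
...     def set_weights(self, weights):
...         pass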

observation_space

Observation space of the policy.

Type

gym.Space

action_space

Action space of the policy.

Type

gym.Space

exploration

The exploration object to use for computing actions, or None.

Type

Exploration

abstract compute_actions(obs_batch, state_batches=None, prev_action_batch=None, prev_reward_batch=None, info_batch=None, episodes=None, explore=None, timestep=None, **kwargs)[source]

Computes actions for the current policy.

Parameters
  • obs_batch (Union[List,np.ndarray]) – Batch of observations.

  • state_batches (Optional[list]) – List of RNN state input batches, if any.

  • prev_action_batch (Optional[List,np.ndarray]) – Batch of previous action values.

  • prev_reward_batch (Optional[List,np.ndarray]) – Batch of previous rewards.

  • info_batch (info) – Batch of info objects.

  • episodes (list) – MultiAgentEpisode for each obs in obs_batch. This provides access to all of the internal episode state, which may be useful for model-based or multiagent algorithms.

  • explore (bool) – Whether to pick an exploitation or exploration action (default: None -> use self.config[“explore”]).

  • timestep (int) – The current (sampling) time step.

  • kwargs – forward compatibility placeholder

Returns

actions (np.ndarray): batch of output actions, with shape like [BATCH_SIZE, ACTION_SHAPE].

state_outs (list): list of RNN state output batches, if any, with shape like [STATE_SIZE, BATCH_SIZE].

info (dict): dictionary of extra feature batches, if any, with shape like {"f1": [BATCH_SIZE, …], "f2": [BATCH_SIZE, …]}.

Return type

actions (np.ndarray)
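
Example call sketch (the trainer and obs_batch variables are assumed to already exist and are not part of this signature):

>>> policy = trainer.get_policy()
>>> actions, state_outs, info = policy.compute_actions(obs_batch, explore=False)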

compute_single_action(obs, state=None, prev_action=None, prev_reward=None, info=None, episode=None, clip_actions=False, explore=None, timestep=None, **kwargs)[source]

Unbatched version of compute_actions.

Parameters
  • obs (obj) – Single observation.

  • state (list) – List of RNN state inputs, if any.

  • prev_action (obj) – Previous action value, if any.

  • prev_reward (float) – Previous reward, if any.

  • info (dict) – info object, if any

  • episode (MultiAgentEpisode) – this provides access to all of the internal episode state, which may be useful for model-based or multi-agent algorithms.

  • clip_actions (bool) – Should actions be clipped?

  • explore (bool) – Whether to pick an exploitation or exploration action (default: None -> use self.config[“explore”]).

  • timestep (int) – The current (sampling) time step.

  • kwargs – forward compatibility placeholder

Returns

actions (obj): single action

state_outs (list): list of RNN state outputs, if any

info (dict): dictionary of extra features, if any

Return type

actions (obj)
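
Example call sketch for a recurrent policy (policy and obs are assumed to already exist):

>>> state = policy.get_initial_state()
>>> action, state, info = policy.compute_single_action(obs, state=state, explore=True)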

compute_log_likelihoods(actions, obs_batch, state_batches=None, prev_action_batch=None, prev_reward_batch=None)[source]

Computes the log-prob/likelihood for a given action and observation.

Parameters
  • actions (Union[List,np.ndarray]) – Batch of actions, for which to retrieve the log-probs/likelihoods (given all other inputs: obs, states, ..).

  • obs_batch (Union[List,np.ndarray]) – Batch of observations.

  • state_batches (Optional[list]) – List of RNN state input batches, if any.

  • prev_action_batch (Optional[List,np.ndarray]) – Batch of previous action values.

  • prev_reward_batch (Optional[List,np.ndarray]) – Batch of previous rewards.

Returns

Batch of log probs/likelihoods, with shape: [BATCH_SIZE].

Return type

log-likelihoods (np.ndarray)
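
Example call sketch (policy and obs_batch are assumed to exist; the discrete actions shown are arbitrary):

>>> import numpy as np
>>> logp = policy.compute_log_likelihoods(
...     actions=np.array([0, 1, 1]), obs_batch=obs_batch)
>>> # logp is an np.ndarray of shape [3] (one log-likelihood per sample).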

postprocess_trajectory(sample_batch, other_agent_batches=None, episode=None)[source]

Implements algorithm-specific trajectory postprocessing.

This will be called on each trajectory fragment computed during policy evaluation. Each fragment is guaranteed to be only from one episode.

Parameters
  • sample_batch (SampleBatch) – batch of experiences for the policy, which will contain at most one episode trajectory.

  • other_agent_batches (dict) – In a multi-agent env, this contains a mapping of agent ids to (policy, agent_batch) tuples containing the policy and experiences of the other agents.

  • episode (MultiAgentEpisode) – this provides access to all of the internal episode state, which may be useful for model-based or multi-agent algorithms.

Returns

Postprocessed sample batch.

Return type

SampleBatch
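
As an illustration, a sketch of an override that fills in an "advantages" column with plain discounted returns via compute_advantages; the exact postprocessing is algorithm-specific, and this is not RLlib's built-in behavior:

>>> from ray.rllib.policy import Policy
>>> from ray.rllib.evaluation import compute_advantages
>>> class MyPolicy(Policy):
...     def postprocess_trajectory(self, sample_batch,
...                                other_agent_batches=None, episode=None):
...         # Add discounted returns as "advantages" (no value function assumed).
...         return compute_advantages(
...             sample_batch, last_r=0.0, gamma=0.99,
...             use_gae=False, use_critic=False)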

learn_on_batch(samples)[source]

Fused compute gradients and apply gradients call.

Either this or the combination of compute/apply grads must be implemented by subclasses.

Returns

dictionary of extra metadata from compute_gradients().

Return type

grad_info

Examples

>>> batch = ev.sample()
>>> ev.learn_on_batch(batch)

compute_gradients(postprocessed_batch)[source]

Computes gradients against a batch of experiences.

Either this or learn_on_batch() must be implemented by subclasses.

Returns

grads (list): List of gradient output values

info (dict): Extra policy-specific values

Return type

grads (list)

apply_gradients(gradients)[source]

Applies previously computed gradients.

Either this or learn_on_batch() must be implemented by subclasses.

get_weights()[source]

Returns model weights.

Returns

Serializable copy or view of model weights

Return type

weights (obj)

set_weights(weights)[source]

Sets model weights.

Parameters

weights (obj) – Serializable copy or view of model weights

get_exploration_info()[source]

Returns the current exploration information of this policy.

This information depends on the policy’s Exploration object.

Returns

Serializable information on the self.exploration object.

Return type

any

is_recurrent()[source]

Whether this Policy holds a recurrent Model.

Returns

True if this Policy has-a RNN-based Model.

Return type

bool

num_state_tensors()[source]

The number of internal states needed by the RNN-Model of the Policy.

Returns

The number of RNN internal states kept by this Policy’s Model.

Return type

int

get_initial_state()[source]

Returns initial RNN state for the current policy.

get_state()[source]

Saves all local state.

Returns

Serialized local state.

Return type

state (obj)

set_state(state)[source]

Restores all local state.

Parameters

state (obj) – Serialized local state.

on_global_var_update(global_vars)[source]

Called on an update to global vars.

Parameters

global_vars (dict) – Global variables broadcast from the driver.

export_model(export_dir)[source]

Export Policy to local directory for serving.

Parameters

export_dir (str) – Local writable directory.

export_checkpoint(export_dir)[source]

Export Policy checkpoint to local directory.

Parameters

export_dir (str) – Local writable directory.

import_model_from_h5(import_file)[source]

Imports Policy from local file.

Parameters

import_file (str) – Local readable file.

ray.rllib.policy.PolicyID

alias of builtins.str

class ray.rllib.policy.TorchPolicy(observation_space, action_space, config, *, model, loss, action_distribution_class, action_sampler_fn=None, action_distribution_fn=None, max_seq_len=20, get_batch_divisibility_req=None)[source]

Template for a PyTorch policy and loss to use with RLlib.

This is similar to TFPolicy, but for PyTorch.

observation_space

observation space of the policy.

Type

gym.Space

action_space

action space of the policy.

Type

gym.Space

config

config of the policy.

Type

dict

model

Torch model instance.

Type

TorchModel

dist_class

Torch action distribution class.

Type

type

compute_actions(obs_batch, state_batches=None, prev_action_batch=None, prev_reward_batch=None, info_batch=None, episodes=None, explore=None, timestep=None, **kwargs)[source]

Computes actions for the current policy.

Parameters
  • obs_batch (Union[List,np.ndarray]) – Batch of observations.

  • state_batches (Optional[list]) – List of RNN state input batches, if any.

  • prev_action_batch (Optional[List,np.ndarray]) – Batch of previous action values.

  • prev_reward_batch (Optional[List,np.ndarray]) – Batch of previous rewards.

  • info_batch (info) – Batch of info objects.

  • episodes (list) – MultiAgentEpisode for each obs in obs_batch. This provides access to all of the internal episode state, which may be useful for model-based or multiagent algorithms.

  • explore (bool) – Whether to pick an exploitation or exploration action (default: None -> use self.config[“explore”]).

  • timestep (int) – The current (sampling) time step.

  • kwargs – forward compatibility placeholder

Returns

actions (np.ndarray): batch of output actions, with shape like [BATCH_SIZE, ACTION_SHAPE].

state_outs (list): list of RNN state output batches, if any, with shape like [STATE_SIZE, BATCH_SIZE].

info (dict): dictionary of extra feature batches, if any, with shape like {"f1": [BATCH_SIZE, …], "f2": [BATCH_SIZE, …]}.

Return type

actions (np.ndarray)

compute_log_likelihoods(actions, obs_batch, state_batches=None, prev_action_batch=None, prev_reward_batch=None)[source]

Computes the log-prob/likelihood for a given action and observation.

Parameters
  • actions (Union[List,np.ndarray]) – Batch of actions, for which to retrieve the log-probs/likelihoods (given all other inputs: obs, states, ..).

  • obs_batch (Union[List,np.ndarray]) – Batch of observations.

  • state_batches (Optional[list]) – List of RNN state input batches, if any.

  • prev_action_batch (Optional[List,np.ndarray]) – Batch of previous action values.

  • prev_reward_batch (Optional[List,np.ndarray]) – Batch of previous rewards.

Returns

Batch of log probs/likelihoods, with shape: [BATCH_SIZE].

Return type

log-likelihoods (np.ndarray)

learn_on_batch(postprocessed_batch)[source]

Fused compute gradients and apply gradients call.

Either this or the combination of compute/apply grads must be implemented by subclasses.

Returns

dictionary of extra metadata from compute_gradients().

Return type

grad_info

Examples

>>> batch = ev.sample()
>>> ev.learn_on_batch(batch)

compute_gradients(postprocessed_batch)[source]

Computes gradients against a batch of experiences.

Either this or learn_on_batch() must be implemented by subclasses.

Returns

grads (list): List of gradient output values

info (dict): Extra policy-specific values

Return type

grads (list)

apply_gradients(gradients)[source]

Applies previously computed gradients.

Either this or learn_on_batch() must be implemented by subclasses.

get_weights()[source]

Returns model weights.

Returns

Serializable copy or view of model weights

Return type

weights (obj)

set_weights(weights)[source]

Sets model weights.

Parameters

weights (obj) – Serializable copy or view of model weights

is_recurrent()[source]

Whether this Policy holds a recurrent Model.

Returns

True if this Policy has-a RNN-based Model.

Return type

bool

num_state_tensors()[source]

The number of internal states needed by the RNN-Model of the Policy.

Returns

The number of RNN internal states kept by this Policy’s Model.

Return type

int

get_initial_state()[source]

Returns initial RNN state for the current policy.

extra_grad_process(optimizer, loss)[source]

Called after each optimizer.zero_grad() + loss.backward() call.

Called for each self._optimizers/loss-value pair. Allows for gradient processing before optimizer.step() is called. E.g. for gradient clipping.

Parameters
  • optimizer (torch.optim.Optimizer) – A torch optimizer object.

  • loss (torch.Tensor) – The loss tensor associated with the optimizer.

Returns

An info dict.

Return type

dict
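
As an illustration, a sketch of an override that applies global-norm gradient clipping (the 40.0 threshold is an arbitrary choice):

>>> import torch
>>> from ray.rllib.policy import TorchPolicy
>>> class MyTorchPolicy(TorchPolicy):
...     def extra_grad_process(self, optimizer, loss):
...         # Clip all parameters handled by this optimizer to a max global norm.
...         params = [p for g in optimizer.param_groups for p in g["params"]]
...         total_norm = torch.nn.utils.clip_grad_norm_(params, 40.0)
...         return {"grad_gnorm": total_norm}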

extra_action_out(input_dict, state_batches, model, action_dist)[source]

Returns dict of extra info to include in experience batch.

Parameters
  • input_dict (dict) – Dict of model input tensors.

  • state_batches (list) – List of state tensors.

  • model (TorchModelV2) – Reference to the model.

  • action_dist (TorchActionDistribution) – Torch action dist object to get log-probs (e.g. for already sampled actions).

extra_grad_info(train_batch)[source]

Return dict of extra grad info.

optimizer()[source]

Custom PyTorch optimizer to use.

export_model(export_dir)[source]

TODO(sven): implement for torch.

export_checkpoint(export_dir)[source]

TODO(sven): implement for torch.

import_model_from_h5(import_file)[source]

Imports weights into torch model.

class ray.rllib.policy.TFPolicy(observation_space, action_space, config, sess, obs_input, sampled_action, loss, loss_inputs, model=None, sampled_action_logp=None, action_input=None, log_likelihood=None, dist_inputs=None, dist_class=None, state_inputs=None, state_outputs=None, prev_action_input=None, prev_reward_input=None, seq_lens=None, max_seq_len=20, batch_divisibility_req=1, update_ops=None, explore=None, timestep=None)[source]

An agent policy and loss implemented in TensorFlow.

Extending this class enables RLlib to perform TensorFlow specific optimizations on the policy, e.g., parallelization across gpus or fusing multiple graphs together in the multi-agent setting.

Input tensors are typically shaped like [BATCH_SIZE, …].

observation_space

observation space of the policy.

Type

gym.Space

action_space

action space of the policy.

Type

gym.Space

model

RLlib model used for the policy.

Type

rllib.models.Model

Examples

>>> policy = TFPolicySubclass(
...     sess, obs_input, sampled_action, loss, loss_inputs)
>>> print(policy.compute_actions([1, 0, 2]))
(array([0, 1, 1]), [], {})
>>> print(policy.postprocess_trajectory(SampleBatch({...})))
SampleBatch({"action": ..., "advantages": ..., ...})

variables()[source]

Return the list of all savable variables for this policy.

get_placeholder(name)[source]

Returns the given action or loss input placeholder by name.

If the loss has not been initialized and a loss input placeholder is requested, an error is raised.

get_session()[source]

Returns a reference to the TF session for this policy.

loss_initialized()[source]

Returns whether the loss function has been initialized.

compute_actions(obs_batch, state_batches=None, prev_action_batch=None, prev_reward_batch=None, info_batch=None, episodes=None, explore=None, timestep=None, **kwargs)[source]

Computes actions for the current policy.

Parameters
  • obs_batch (Union[List,np.ndarray]) – Batch of observations.

  • state_batches (Optional[list]) – List of RNN state input batches, if any.

  • prev_action_batch (Optional[List,np.ndarray]) – Batch of previous action values.

  • prev_reward_batch (Optional[List,np.ndarray]) – Batch of previous rewards.

  • info_batch (info) – Batch of info objects.

  • episodes (list) – MultiAgentEpisode for each obs in obs_batch. This provides access to all of the internal episode state, which may be useful for model-based or multiagent algorithms.

  • explore (bool) – Whether to pick an exploitation or exploration action (default: None -> use self.config[“explore”]).

  • timestep (int) – The current (sampling) time step.

  • kwargs – forward compatibility placeholder

Returns

actions (np.ndarray): batch of output actions, with shape like [BATCH_SIZE, ACTION_SHAPE].

state_outs (list): list of RNN state output batches, if any, with shape like [STATE_SIZE, BATCH_SIZE].

info (dict): dictionary of extra feature batches, if any, with shape like {"f1": [BATCH_SIZE, …], "f2": [BATCH_SIZE, …]}.

Return type

actions (np.ndarray)

compute_log_likelihoods(actions, obs_batch, state_batches=None, prev_action_batch=None, prev_reward_batch=None)[source]

Computes the log-prob/likelihood for a given action and observation.

Parameters
  • actions (Union[List,np.ndarray]) – Batch of actions, for which to retrieve the log-probs/likelihoods (given all other inputs: obs, states, ..).

  • obs_batch (Union[List,np.ndarray]) – Batch of observations.

  • state_batches (Optional[list]) – List of RNN state input batches, if any.

  • prev_action_batch (Optional[List,np.ndarray]) – Batch of previous action values.

  • prev_reward_batch (Optional[List,np.ndarray]) – Batch of previous rewards.

Returns

Batch of log probs/likelihoods, with shape: [BATCH_SIZE].

Return type

log-likelihoods (np.ndarray)

compute_gradients(postprocessed_batch)[source]

Computes gradients against a batch of experiences.

Either this or learn_on_batch() must be implemented by subclasses.

Returns

grads (list): List of gradient output values

info (dict): Extra policy-specific values

Return type

grads (list)

apply_gradients(gradients)[source]

Applies previously computed gradients.

Either this or learn_on_batch() must be implemented by subclasses.

learn_on_batch(postprocessed_batch)[source]

Fused compute gradients and apply gradients call.

Either this or the combination of compute/apply grads must be implemented by subclasses.

Returns

dictionary of extra metadata from compute_gradients().

Return type

grad_info

Examples

>>> batch = ev.sample()
>>> ev.learn_on_batch(batch)

get_exploration_info()[source]

Returns the current exploration information of this policy.

This information depends on the policy’s Exploration object.

Returns

Serializable information on the self.exploration object.

Return type

any

get_weights()[source]

Returns model weights.

Returns

Serializable copy or view of model weights

Return type

weights (obj)

set_weights(weights)[source]

Sets model weights.

Parameters

weights (obj) – Serializable copy or view of model weights

export_model(export_dir)[source]

Export tensorflow graph to export_dir for serving.

export_checkpoint(export_dir, filename_prefix='model')[source]

Export tensorflow checkpoint to export_dir.

import_model_from_h5(import_file)[source]

Imports weights into tf model.

copy(existing_inputs)[source]

Creates a copy of self using existing input placeholders.

Optional, only required to work with the multi-GPU optimizer.

is_recurrent()[source]

Whether this Policy holds a recurrent Model.

Returns

True if this Policy has-a RNN-based Model.

Return type

bool

num_state_tensors()[source]

The number of internal states needed by the RNN-Model of the Policy.

Returns

The number of RNN internal states kept by this Policy’s Model.

Return type

int

extra_compute_action_feed_dict()[source]

Extra dict to pass to the compute actions session run.

extra_compute_action_fetches()[source]

Extra values to fetch and return from compute_actions().

By default we return action probability/log-likelihood info and action distribution inputs (if present).

extra_compute_grad_feed_dict()[source]

Extra dict to pass to the compute gradients session run.

extra_compute_grad_fetches()[source]

Extra values to fetch and return from compute_gradients().

optimizer()[source]

TF optimizer to use for policy optimization.

gradients(optimizer, loss)[source]

Override for custom gradient computation.

build_apply_op(optimizer, grads_and_vars)[source]

Override for custom gradient apply computation.

ray.rllib.policy.build_torch_policy(name, *, loss_fn, get_default_config=None, stats_fn=None, postprocess_fn=None, extra_action_out_fn=None, extra_grad_process_fn=None, optimizer_fn=None, before_init=None, after_init=None, action_sampler_fn=None, action_distribution_fn=None, make_model=None, make_model_and_action_dist=None, apply_gradients_fn=None, mixins=None, get_batch_divisibility_req=None)[source]

Helper function for creating a torch policy class at runtime.

Parameters
  • name (str) – name of the policy (e.g., “PPOTorchPolicy”)

  • loss_fn (callable) – Callable that returns a loss tensor given the arguments (policy, model, dist_class, train_batch).

  • get_default_config (Optional[callable]) – Optional callable that returns the default config to merge with any overrides.

  • stats_fn (Optional[callable]) – Optional callable that returns a dict of values given the policy and batch input tensors.

  • postprocess_fn (Optional[callable]) – Optional experience postprocessing function that takes the same args as Policy.postprocess_trajectory().

  • extra_action_out_fn (Optional[callable]) – Optional callable that returns a dict of extra values to include in experiences.

  • extra_grad_process_fn (Optional[callable]) – Optional callable that is called after gradients are computed and returns processing info.

  • optimizer_fn (Optional[callable]) – Optional callable that returns a torch optimizer given the policy and config.

  • before_init (Optional[callable]) – Optional callable to run at the beginning of Policy.__init__ that takes the same arguments as the Policy constructor.

  • after_init (Optional[callable]) – Optional callable to run at the end of policy init that takes the same arguments as the policy constructor.

  • action_sampler_fn (Optional[callable]) – Optional callable returning a sampled action and its log-likelihood given some (obs and state) inputs.

  • action_distribution_fn (Optional[callable]) – A callable that takes the Policy, Model, the observation batch, an explore-flag, a timestep, and an is_training flag and returns a tuple of a) distribution inputs (parameters), b) a dist-class to generate an action distribution object from, and c) internal-state outputs (empty list if not applicable).

  • make_model (Optional[callable]) – Optional func that takes the same arguments as Policy.__init__ and returns a model instance. The distribution class will be determined automatically. Note: Only one of make_model or make_model_and_action_dist should be provided.

  • make_model_and_action_dist (Optional[callable]) – Optional func that takes the same arguments as Policy.__init__ and returns a tuple of model instance and torch action distribution class. Note: Only one of make_model or make_model_and_action_dist should be provided.

  • apply_gradients_fn (Optional[callable]) – Optional callable that takes a grads list and applies these to the Model’s parameters.

  • mixins (list) – list of any class mixins for the returned policy class. These mixins will be applied in order and will have higher precedence than the TorchPolicy class.

  • get_batch_divisibility_req (Optional[callable]) – Optional callable that returns the divisibility requirement for sample batches.

Returns

TorchPolicy child class constructed from the specified args.

Return type

type
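
Example sketch of building a policy class from a loss function; the loss shown simply maximizes the log-probability of the taken actions and is only for demonstration:

>>> from ray.rllib.policy import build_torch_policy
>>> def my_loss(policy, model, dist_class, train_batch):
...     logits, _ = model.from_batch(train_batch)
...     action_dist = dist_class(logits, model)
...     return -action_dist.logp(train_batch["actions"]).mean()
>>> MyTorchPolicy = build_torch_policy(name="MyTorchPolicy", loss_fn=my_loss)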

ray.rllib.policy.build_tf_policy(name, *, loss_fn, get_default_config=None, postprocess_fn=None, stats_fn=None, optimizer_fn=None, gradients_fn=None, apply_gradients_fn=None, grad_stats_fn=None, extra_action_fetches_fn=None, extra_learn_fetches_fn=None, before_init=None, before_loss_init=None, after_init=None, make_model=None, action_sampler_fn=None, action_distribution_fn=None, mixins=None, get_batch_divisibility_req=None, obs_include_prev_action_reward=True)[source]

Helper function for creating a dynamic tf policy at runtime.

Functions will be run in this order to initialize the policy:
  1. Placeholder setup: postprocess_fn

  2. Loss init: loss_fn, stats_fn

  3. Optimizer init: optimizer_fn, gradients_fn, apply_gradients_fn, grad_stats_fn

This means that you can e.g., depend on any policy attributes created in the running of loss_fn in later functions such as stats_fn.

In eager mode, the following functions will be run repeatedly on each eager execution: loss_fn, stats_fn, gradients_fn, apply_gradients_fn, and grad_stats_fn.

This means that these functions should not define any variables internally, otherwise they will fail in eager mode execution. Variables should only be created in make_model (if defined).

Parameters
  • name (str) – name of the policy (e.g., “PPOTFPolicy”)

  • loss_fn (func) – function that returns a loss tensor given the arguments (policy, model, dist_class, train_batch)

  • get_default_config (func) – optional function that returns the default config to merge with any overrides

  • postprocess_fn (func) – optional experience postprocessing function that takes the same args as Policy.postprocess_trajectory()

  • stats_fn (func) – optional function that returns a dict of TF fetches given the policy and batch input tensors

  • optimizer_fn (func) – optional function that returns a tf.Optimizer given the policy and config

  • gradients_fn (func) – optional function that returns a list of gradients given (policy, optimizer, loss). If not specified, this defaults to optimizer.compute_gradients(loss)

  • apply_gradients_fn (func) – optional function that returns an apply gradients op given (policy, optimizer, grads_and_vars)

  • grad_stats_fn (func) – optional function that returns a dict of TF fetches given the policy, batch input, and gradient tensors

  • extra_action_fetches_fn (func) – optional function that returns a dict of TF fetches given the policy object

  • extra_learn_fetches_fn (func) – optional function that returns a dict of extra values to fetch and return when learning on a batch

  • before_init (func) – optional function to run at the beginning of policy init that takes the same arguments as the policy constructor

  • before_loss_init (func) – optional function to run prior to loss init that takes the same arguments as the policy constructor

  • after_init (func) – optional function to run at the end of policy init that takes the same arguments as the policy constructor

  • make_model (func) – optional function that returns a ModelV2 object given (policy, obs_space, action_space, config). All policy variables should be created in this function. If not specified, a default model will be created.

  • action_sampler_fn (Optional[callable]) – A callable returning a sampled action and its log-likelihood given some (obs and state) inputs.

  • action_distribution_fn (Optional[callable]) – A callable returning distribution inputs (parameters), a dist-class to generate an action distribution object from, and internal-state outputs (or an empty list if not applicable).

  • mixins (list) – list of any class mixins for the returned policy class. These mixins will be applied in order and will have higher precedence than the DynamicTFPolicy class

  • get_batch_divisibility_req (func) – optional function that returns the divisibility requirement for sample batches

  • obs_include_prev_action_reward (bool) – whether to include the previous action and reward in the model input

Returns

a DynamicTFPolicy instance that uses the specified args
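
Example sketch mirroring the torch helper above; the demonstration loss again maximizes the log-probability of the taken actions:

>>> import tensorflow as tf
>>> from ray.rllib.policy import build_tf_policy
>>> def my_loss(policy, model, dist_class, train_batch):
...     logits, _ = model.from_batch(train_batch)
...     action_dist = dist_class(logits, model)
...     return -tf.reduce_mean(action_dist.logp(train_batch["actions"]))
>>> MyTFPolicy = build_tf_policy(name="MyTFPolicy", loss_fn=my_loss)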

ray.rllib.env

class ray.rllib.env.BaseEnv[source]

The lowest-level env interface used by RLlib for sampling.

BaseEnv models multiple agents executing asynchronously in multiple environments. A call to poll() returns observations from ready agents keyed by their environment and agent ids, and actions for those agents can be sent back via send_actions().

All other env types can be adapted to BaseEnv. RLlib handles these conversions internally in RolloutWorker, for example:

gym.Env => rllib.VectorEnv => rllib.BaseEnv
rllib.MultiAgentEnv => rllib.BaseEnv
rllib.ExternalEnv => rllib.BaseEnv

action_space

Action space. This must be defined for single-agent envs. Multi-agent envs can set this to None.

Type

gym.Space

observation_space

Observation space. This must be defined for single-agent envs. Multi-agent envs can set this to None.

Type

gym.Space

Examples

>>> env = MyBaseEnv()
>>> obs, rewards, dones, infos, off_policy_actions = env.poll()
>>> print(obs)
{
    "env_0": {
        "car_0": [2.4, 1.6],
        "car_1": [3.4, -3.2],
    },
    "env_1": {
        "car_0": [8.0, 4.1],
    },
    "env_2": {
        "car_0": [2.3, 3.3],
        "car_1": [1.4, -0.2],
        "car_3": [1.2, 0.1],
    },
}
>>> env.send_actions(
...     action_dict={
...         "env_0": {
...             "car_0": 0,
...             "car_1": 1,
...         }, ...
...     })
>>> obs, rewards, dones, infos, off_policy_actions = env.poll()
>>> print(obs)
{
    "env_0": {
        "car_0": [4.1, 1.7],
        "car_1": [3.2, -4.2],
    }, ...
}
>>> print(dones)
{
    "env_0": {
        "__all__": False,
        "car_0": False,
        "car_1": True,
    }, ...
}
static to_base_env(env, make_env=None, num_envs=1, remote_envs=False, remote_env_batch_wait_ms=0)[source]

Wraps any env type as needed to expose the async interface.

poll()[source]

Returns observations from ready agents.

The returns are two-level dicts mapping from env_id to a dict of agent_id to values. The number of agents and envs can vary over time.

Returns

  • obs (dict) – New observations for each ready agent.

  • rewards (dict) – Reward values for each ready agent. If the episode has just started, the value will be None.

  • dones (dict) – Done values for each ready agent. The special key "__all__" is used to indicate env termination.

  • infos (dict) – Info values for each ready agent.

  • off_policy_actions (dict) – Agents may take off-policy actions. When that happens, there will be an entry in this dict that contains the taken action. There is no need to send_actions() for agents that have already chosen off-policy actions.

send_actions(action_dict)[source]

Called to send actions back to running agents in this env.

Actions should be sent for each ready agent that returned observations in the previous poll() call.

Parameters

action_dict (dict) – Actions values keyed by env_id and agent_id.

try_reset(env_id=None)[source]

Attempt to reset the sub-env with the given id or all sub-envs.

If the environment does not support synchronous reset, None can be returned here.

Parameters

env_id (Optional[int]) – The sub-env ID if applicable. If None, reset the entire Env (i.e. all sub-envs).

Returns

The reset observation, or None if not supported.

Return type

obs (dict|None)

get_unwrapped()[source]

Return a reference to the underlying gym envs, if any.

Returns

Underlying gym envs or [].

Return type

envs (list)

stop()[source]

Releases all resources used.

class ray.rllib.env.MultiAgentEnv[source]

An environment that hosts multiple independent agents.

Agents are identified by (string) agent ids. Note that these “agents” here are not to be confused with RLlib agents.

Examples

>>> env = MyMultiAgentEnv()
>>> obs = env.reset()
>>> print(obs)
{
    "car_0": [2.4, 1.6],
    "car_1": [3.4, -3.2],
    "traffic_light_1": [0, 3, 5, 1],
}
>>> obs, rewards, dones, infos = env.step(
...    action_dict={
...        "car_0": 1, "car_1": 0, "traffic_light_1": 2,
...    })
>>> print(rewards)
{
    "car_0": 3,
    "car_1": -1,
    "traffic_light_1": 0,
}
>>> print(dones)
{
    "car_0": False,    # car_0 is still running
    "car_1": True,     # car_1 is done
    "__all__": False,  # the env is not done
}
>>> print(infos)
{
    "car_0": {},  # info for car_0
    "car_1": {},  # info for car_1
}
reset()[source]

Resets the env and returns observations from ready agents.

Returns

New observations for each ready agent.

Return type

obs (dict)

step(action_dict)[source]

Returns observations from ready agents.

The returns are dicts mapping from agent_id strings to values. The number of agents in the env can vary over time.

Returns

  • obs (dict) – New observations for each ready agent.

  • rewards (dict) – Reward values for each ready agent. If the episode has just started, the value will be None.

  • dones (dict) – Done values for each ready agent. The special key "__all__" (required) is used to indicate env termination.

  • infos (dict) – Optional info values for each agent id.

with_agent_groups(groups, obs_space=None, act_space=None)[source]

Convenience method for grouping together agents in this env.

An agent group is a list of agent ids that are mapped to a single logical agent. All agents of the group must act at the same time in the environment. The grouped agent exposes Tuple action and observation spaces that are the concatenated action and obs spaces of the individual agents.

The rewards of all the agents in a group are summed. The individual agent rewards are available under the “individual_rewards” key of the group info return.

Agent grouping is required to leverage algorithms such as Q-Mix.

This API is experimental.

Parameters
  • groups (dict) – Mapping from group id to a list of the agent ids of group members. If an agent id is not present in any group value, it will be left ungrouped.

  • obs_space (Space) – Optional observation space for the grouped env. Must be a tuple space.

  • act_space (Space) – Optional action space for the grouped env. Must be a tuple space.

Examples

>>> env = YourMultiAgentEnv(...)
>>> grouped_env = env.with_agent_groups(env, {
...   "group1": ["agent1", "agent2", "agent3"],
...   "group2": ["agent4", "agent5"],
... })
class ray.rllib.env.ExternalEnv(action_space, observation_space, max_concurrent=100)[source]

An environment that interfaces with external agents.

Unlike simulator envs, control is inverted. The environment queries the policy to obtain actions and logs observations and rewards for training. This is in contrast to gym.Env, where the algorithm drives the simulation through env.step() calls.

You can use ExternalEnv as the backend for policy serving (by serving HTTP requests in the run loop), for ingesting offline logs data (by reading offline transitions in the run loop), or other custom use cases not easily expressed through gym.Env.

ExternalEnv supports both on-policy actions (through self.get_action()), and off-policy actions (through self.log_action()).

This env is thread-safe, but individual episodes must be executed serially.

action_space

Action space.

Type

gym.Space

observation_space

Observation space.

Type

gym.Space

Examples

>>> register_env("my_env", lambda config: YourExternalEnv(config))
>>> trainer = DQNTrainer(env="my_env")
>>> while True:
>>>     print(trainer.train())
run()[source]

Override this to implement the run loop.

Your loop should continuously:
  1. Call self.start_episode(episode_id)

  2. Call self.get_action(episode_id, obs)

    -or- self.log_action(episode_id, obs, action)

  3. Call self.log_returns(episode_id, reward)

  4. Call self.end_episode(episode_id, obs)

  5. Wait if nothing to do.

Multiple episodes may be started at the same time.
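
As an illustration, a minimal run loop sketch driving episodes from a wrapped gym environment (the gym_env argument and self.env attribute are assumptions of this sketch):

>>> from ray.rllib.env import ExternalEnv
>>> class MyServingEnv(ExternalEnv):
...     def __init__(self, gym_env):
...         ExternalEnv.__init__(
...             self, gym_env.action_space, gym_env.observation_space)
...         self.env = gym_env
...     def run(self):
...         while True:
...             eps_id = self.start_episode()
...             obs = self.env.reset()
...             done = False
...             while not done:
...                 action = self.get_action(eps_id, obs)
...                 obs, reward, done, info = self.env.step(action)
...                 self.log_returns(eps_id, reward)
...             self.end_episode(eps_id, obs)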

start_episode(episode_id=None, training_enabled=True)[source]

Record the start of an episode.

Parameters
  • episode_id (Optional[str]) – Unique string id for the episode or None for it to be auto-assigned and returned.

  • training_enabled (bool) – Whether to use experiences for this episode to improve the policy.

Returns

Unique string id for the episode.

Return type

episode_id (str)

get_action(episode_id, observation)[source]

Record an observation and get the on-policy action.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • observation (obj) – Current environment observation.

Returns

Action from the env action space.

Return type

action (obj)

log_action(episode_id, observation, action)[source]

Record an observation and (off-policy) action taken.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • observation (obj) – Current environment observation.

  • action (obj) – Action for the observation.

log_returns(episode_id, reward, info=None)[source]

Record returns from the environment.

The reward will be attributed to the previous action taken by the episode. Rewards accumulate until the next action. If no reward is logged before the next action, a reward of 0.0 is assumed.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • reward (float) – Reward from the environment.

  • info (dict) – Optional info dict.

end_episode(episode_id, observation)[source]

Record the end of an episode.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • observation (obj) – Current environment observation.

class ray.rllib.env.ExternalMultiAgentEnv(action_space, observation_space, max_concurrent=100)[source]

This is the multi-agent version of ExternalEnv.

run()[source]

Override this to implement the multi-agent run loop.

Your loop should continuously:
  1. Call self.start_episode(episode_id)

  2. Call self.get_action(episode_id, obs_dict)

    -or- self.log_action(episode_id, obs_dict, action_dict)

  3. Call self.log_returns(episode_id, reward_dict)

  4. Call self.end_episode(episode_id, obs_dict)

  5. Wait if nothing to do.

Multiple episodes may be started at the same time.

start_episode(episode_id=None, training_enabled=True)[source]

Record the start of an episode.

Parameters
  • episode_id (Optional[str]) – Unique string id for the episode or None for it to be auto-assigned and returned.

  • training_enabled (bool) – Whether to use experiences for this episode to improve the policy.

Returns

Unique string id for the episode.

Return type

episode_id (str)

get_action(episode_id, observation_dict)[source]

Record an observation and get the on-policy action. observation_dict is expected to contain the observation of all agents acting in this episode step.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • observation_dict (dict) – Current environment observation.

Returns

Action from the env action space.

Return type

action (dict)

log_action(episode_id, observation_dict, action_dict)[source]

Record an observation and (off-policy) action taken.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • observation_dict (dict) – Current environment observation.

  • action_dict (dict) – Action for the observation.

log_returns(episode_id, reward_dict, info_dict=None, multiagent_done_dict=None)[source]

Record returns from the environment.

The reward will be attributed to the previous action taken by the episode. Rewards accumulate until the next action. If no reward is logged before the next action, a reward of 0.0 is assumed.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • reward_dict (dict) – Reward from the environment agents.

  • info_dict (dict) – Optional info dict.

  • multiagent_done_dict (dict) – Optional done dict for agents.

end_episode(episode_id, observation_dict)[source]

Record the end of an episode.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • observation_dict (dict) – Current environment observation.

class ray.rllib.env.VectorEnv(observation_space, action_space, num_envs)[source]

An environment that supports batch evaluation using clones of sub-envs.

vector_reset()[source]

Resets all sub-environments.

Returns

List of observations from each environment.

Return type

obs (List[any])

reset_at(index)[source]

Resets a single environment.

Returns

Observations from the reset sub environment.

Return type

obs (obj)

vector_step(actions)[source]

Performs a vectorized step on all sub environments using actions.

Parameters

actions (List[any]) – List of actions (one for each sub-env).

Returns

obs (List[any]): New observations for each sub-env.

rewards (List[any]): Reward values for each sub-env.

dones (List[any]): Done values for each sub-env.

infos (List[any]): Info values for each sub-env.

Return type

obs (List[any])

get_unwrapped()[source]

Returns the underlying sub environments.

Returns

List of all underlying sub environments.

Return type

List[Env]
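
As an illustration, a minimal subclass sketch that vectorizes copies of a gym environment ("CartPole-v0" is an arbitrary choice):

>>> import gym
>>> from ray.rllib.env import VectorEnv
>>> class MyVectorEnv(VectorEnv):
...     def __init__(self, num_envs):
...         self.envs = [gym.make("CartPole-v0") for _ in range(num_envs)]
...         VectorEnv.__init__(self, self.envs[0].observation_space,
...                            self.envs[0].action_space, num_envs)
...     def vector_reset(self):
...         return [env.reset() for env in self.envs]
...     def reset_at(self, index):
...         return self.envs[index].reset()
...     def vector_step(self, actions):
...         obs, rewards, dones, infos = zip(
...             *[env.step(a) for env, a in zip(self.envs, actions)])
...         return list(obs), list(rewards), list(dones), list(infos)
...     def get_unwrapped(self):
...         return self.envs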

class ray.rllib.env.EnvContext(env_config, worker_index, vector_index=0, remote=False)[source]

Wraps env configurations to include extra rllib metadata.

These attributes can be used to parameterize environments per process. For example, one might use worker_index to control which data file an environment reads in on initialization.

RLlib auto-sets these attributes when constructing registered envs.

worker_index

When there are multiple workers created, this uniquely identifies the worker the env is created in.

Type

int

vector_index

When there are multiple envs per worker, this uniquely identifies the env index within the worker.

Type

int

remote

Whether environment should be remote or not.

Type

bool
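
Example sketch of an env creator that reads these attributes; the "env_name" config key and the print call are only for demonstration:

>>> import gym
>>> from ray.tune.registry import register_env
>>> def env_creator(env_config):  # env_config is an EnvContext
...     # E.g., pick a data shard or seed per worker/vector slot.
...     print("worker", env_config.worker_index,
...           "vector slot", env_config.vector_index)
...     return gym.make(env_config.get("env_name", "CartPole-v0"))
>>> register_env("my_env", env_creator)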

class ray.rllib.env.PolicyClient(address, inference_mode='local', update_interval=10.0)[source]

REST client to interact with a RLlib policy server.

start_episode(episode_id=None, training_enabled=True)[source]

Record the start of one or more episode(s).

Parameters
  • episode_id (Optional[str]) – Unique string id for the episode or None for it to be auto-assigned.

  • training_enabled (bool) – Whether to use experiences for this episode to improve the policy.

Returns

Unique string id for the episode.

Return type

episode_id (str)

get_action(episode_id, observation)[source]

Record an observation and get the on-policy action.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • observation (obj) – Current environment observation.

Returns

Action from the env action space.

Return type

action (obj)

log_action(episode_id, observation, action)[source]

Record an observation and (off-policy) action taken.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • observation (obj) – Current environment observation.

  • action (obj) – Action for the observation.

log_returns(episode_id, reward, info=None, multiagent_done_dict=None)[source]

Record returns from the environment.

The reward will be attributed to the previous action taken by the episode. Rewards accumulate until the next action. If no reward is logged before the next action, a reward of 0.0 is assumed.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • reward (float) – Reward from the environment.

  • info (dict) – Extra info dict.

  • multiagent_done_dict (dict) – Multi-agent done information.

end_episode(episode_id, observation)[source]

Record the end of an episode.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • observation (obj) – Current environment observation.

update_policy_weights()[source]

Query the server for new policy weights, if local inference is enabled.
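
Example client-side loop sketch; my_env is an assumed local simulator, and the server address matches the server example below:

>>> from ray.rllib.env import PolicyClient
>>> client = PolicyClient("localhost:9900", inference_mode="local")
>>> eps_id = client.start_episode(training_enabled=True)
>>> obs = my_env.reset()
>>> done = False
>>> while not done:
...     action = client.get_action(eps_id, obs)
...     obs, reward, done, info = my_env.step(action)
...     client.log_returns(eps_id, reward)
>>> client.end_episode(eps_id, obs)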

class ray.rllib.env.PolicyServerInput(ioctx, address, port)[source]

REST policy server that acts as an offline data source.

This launches a multi-threaded server that listens on the specified host and port to serve policy requests and forward experiences to RLlib. For high performance experience collection, it implements InputReader.

For an example, run examples/cartpole_server.py along with examples/cartpole_client.py --inference-mode=local|remote.

Examples

>>> pg = PGTrainer(
...     env="CartPole-v0", config={
...         "input": lambda ioctx:
...             PolicyServerInput(ioctx, addr, port),
...         "num_workers": 0,  # Run just 1 server, in the trainer.
...     })
>>> while True:
...     pg.train()
>>> client = PolicyClient("localhost:9900", inference_mode="local")
>>> eps_id = client.start_episode()
>>> action = client.get_action(eps_id, obs)
>>> ...
>>> client.log_returns(eps_id, reward)
>>> ...
>>> client.log_returns(eps_id, reward)
next()[source]

Return the next batch of experiences read.

Returns

SampleBatch or MultiAgentBatch read.

ray.rllib.evaluation

class ray.rllib.evaluation.MultiAgentEpisode(policies, policy_mapping_fn, batch_builder_factory, extra_batch_callback)[source]

Tracks the current state of a (possibly multi-agent) episode.

new_batch_builder

Create a new MultiAgentSampleBatchBuilder.

Type

func

add_extra_batch

Return a built MultiAgentBatch to the sampler.

Type

func

batch_builder

Batch builder for the current episode.

Type

obj

total_reward

Summed reward across all agents in this episode.

Type

float

length

Length of this episode.

Type

int

episode_id

Unique id identifying this trajectory.

Type

int

agent_rewards

Summed rewards broken down by agent.

Type

dict

custom_metrics

Dict where you can add custom metrics.

Type

dict

user_data

Dict that you can use for temporary storage.

Type

dict

Use case 1: Model-based rollouts in multi-agent:

A custom compute_actions() function in a policy can inspect the current episode state and perform a number of rollouts based on the policies and state of other agents in the environment.

Use case 2: Returning extra rollouts data.

The model rollouts can be returned back to the sampler by calling:

>>> batch = episode.new_batch_builder()
>>> for each transition:
       batch.add_values(...)  # see sampler for usage
>>> episode.extra_batches.add(batch.build_and_reset())
soft_reset()[source]

Clears rewards and metrics, but retains RNN and other state.

This is used to carry state across multiple logical episodes in the same env (i.e., if soft_horizon is set).

policy_for(agent_id='agent0')[source]

Returns the policy for the specified agent.

If the agent is new, the policy mapping fn will be called to bind the agent to a policy for the duration of the episode.

last_observation_for(agent_id='agent0')[source]

Returns the last observation for the specified agent.

last_raw_obs_for(agent_id='agent0')[source]

Returns the last un-preprocessed obs for the specified agent.

last_info_for(agent_id='agent0')[source]

Returns the last info for the specified agent.

last_action_for(agent_id='agent0')[source]

Returns the last action for the specified agent, or zeros.

prev_action_for(agent_id='agent0')[source]

Returns the previous action for the specified agent.

prev_reward_for(agent_id='agent0')[source]

Returns the previous reward for the specified agent.

rnn_state_for(agent_id='agent0')[source]

Returns the last RNN state for the specified agent.

last_pi_info_for(agent_id='agent0')[source]

Returns the last info object for the specified agent.

class ray.rllib.evaluation.RolloutWorker(env_creator, policy, policy_mapping_fn=None, policies_to_train=None, tf_session_creator=None, rollout_fragment_length=100, batch_mode='truncate_episodes', episode_horizon=None, preprocessor_pref='deepmind', sample_async=False, compress_observations=False, num_envs=1, observation_fn=None, observation_filter='NoFilter', clip_rewards=None, clip_actions=True, env_config=None, model_config=None, policy_config=None, worker_index=0, num_workers=0, monitor_path=None, log_dir=None, log_level=None, callbacks=None, input_creator=<function RolloutWorker.<lambda>>, input_evaluation=frozenset({}), output_creator=<function RolloutWorker.<lambda>>, remote_worker_envs=False, remote_env_batch_wait_ms=0, soft_horizon=False, no_done_at_end=False, seed=None, extra_python_environs=None, fake_sampler=False)[source]

Common experience collection class.

This class wraps a policy instance and an environment class to collect experiences from the environment. You can create many replicas of this class as Ray actors to scale RL training.

This class supports vectorized and multi-agent policy evaluation (e.g., VectorEnv, MultiAgentEnv, etc.)

Examples

>>> # Create a rollout worker and use it to collect experiences.
>>> worker = RolloutWorker(
...   env_creator=lambda _: gym.make("CartPole-v0"),
...   policy=PGTFPolicy)
>>> print(worker.sample())
SampleBatch({
    "obs": [[...]], "actions": [[...]], "rewards": [[...]],
    "dones": [[...]], "new_obs": [[...]]})
>>> # Creating a multi-agent rollout worker
>>> worker = RolloutWorker(
...   env_creator=lambda _: MultiAgentTrafficGrid(num_cars=25),
...   policies={
...       # Use an ensemble of two policies for car agents
...       "car_policy1":
...         (PGTFPolicy, Box(...), Discrete(...), {"gamma": 0.99}),
...       "car_policy2":
...         (PGTFPolicy, Box(...), Discrete(...), {"gamma": 0.95}),
...       # Use a single shared policy for all traffic lights
...       "traffic_light_policy":
...         (PGTFPolicy, Box(...), Discrete(...), {}),
...   },
...   policy_mapping_fn=lambda agent_id:
...     random.choice(["car_policy1", "car_policy2"])
...     if agent_id.startswith("car_") else "traffic_light_policy")
>>> print(worker.sample())
MultiAgentBatch({
    "car_policy1": SampleBatch(...),
    "car_policy2": SampleBatch(...),
    "traffic_light_policy": SampleBatch(...)})
sample()[source]

Returns a batch of experience sampled from this worker.

This method must be implemented by subclasses.

Returns

A columnar batch of experiences (e.g., tensors), or a multi-agent batch.

Return type

SampleBatch|MultiAgentBatch

Examples

>>> print(worker.sample())
SampleBatch({"obs": [1, 2, 3], "action": [0, 1, 0], ...})
sample_with_count()[source]

Same as sample() but returns the count as a separate future.

get_weights(policies=None)[source]

Returns the model weights of this worker.

Returns

weights that can be set on another worker.

info: dictionary of extra metadata.

Return type

object

Examples

>>> weights = worker.get_weights()
set_weights(weights, global_vars=None)[source]

Sets the model weights of this worker.

Examples

>>> weights = worker.get_weights()
>>> worker.set_weights(weights)
compute_gradients(samples)[source]

Returns a gradient computed w.r.t the specified samples.

Returns

A list of gradients that can be applied on a compatible worker. In the multi-agent case, returns a dict of gradients keyed by policy ids. An info dictionary of extra metadata is also returned.

Return type

(grads, info)

Examples

>>> batch = worker.sample()
>>> grads, info = worker.compute_gradients(batch)

apply_gradients(grads)[source]

Applies the given gradients to this worker’s weights.

Examples

>>> samples = worker.sample()
>>> grads, info = worker.compute_gradients(samples)
>>> worker.apply_gradients(grads)
learn_on_batch(samples)[source]

Update policies based on the given batch.

This is the equivalent to apply_gradients(compute_gradients(samples)), but can be optimized to avoid pulling gradients into CPU memory.

Returns

dictionary of extra metadata from compute_gradients().

Return type

info

Examples

>>> batch = worker.sample()
>>> worker.learn_on_batch(batch)

sample_and_learn(expected_batch_size, num_sgd_iter, sgd_minibatch_size, standardize_fields)[source]

Sample a batch and learn on it.

This is typically used in combination with distributed allreduce.

Parameters
  • expected_batch_size (int) – Expected number of samples to learn on.

  • num_sgd_iter (int) – Number of SGD iterations.

  • sgd_minibatch_size (int) – SGD minibatch size.

  • standardize_fields (list) – List of sample fields to normalize.

Returns

dictionary of extra metadata from learn_on_batch().

count: number of samples learned on.

Return type

info

get_metrics()[source]

Returns a list of new RolloutMetric objects from evaluation.

foreach_env(func)[source]

Apply the given function to each underlying env instance.

get_policy(policy_id='default_policy')[source]

Return policy for the specified id, or None.

Parameters

policy_id (str) – id of policy to return.

for_policy(func, policy_id='default_policy')[source]

Apply the given function to the specified policy.

foreach_policy(func)[source]

Apply the given function to each (policy, policy_id) tuple.

foreach_trainable_policy(func)[source]

Applies the given function to each (policy, policy_id) tuple, which can be found in self.policies_to_train.

Parameters

func (callable) – A function - taking a Policy and its ID - that is called on all Policies within self.policies_to_train.

Returns

The list of n return values of all func([policy], [ID])-calls.

Return type

List[any]

sync_filters(new_filters)[source]

Changes self’s filters to the given ones and rebases any accumulated delta.

Parameters

new_filters (dict) – Filters with new state to update local copy.

get_filters(flush_after=False)[source]

Returns a snapshot of filters.

Parameters

flush_after (bool) – Clears the filter buffer state.

Returns

Dict for serializable filters

Return type

return_filters (dict)

creation_args()[source]

Returns the args used to create this worker.

get_host()[source]

Returns the hostname of the process running this evaluator.

apply(func, *args)[source]

Apply the given function to this evaluator instance.

setup_torch_data_parallel(url, world_rank, world_size, backend)[source]

Join a torch process group for distributed SGD.

get_node_ip()[source]

Returns the IP address of the current node.

find_free_port()[source]

Finds a free port on the current node.

ray.rllib.evaluation.PolicyEvaluator

alias of ray.rllib.utils.deprecation.renamed_class.<locals>.DeprecationWrapper

ray.rllib.evaluation.PolicyGraph

alias of ray.rllib.utils.deprecation.renamed_class.<locals>.DeprecationWrapper

ray.rllib.evaluation.TFPolicyGraph

alias of ray.rllib.utils.deprecation.renamed_class.<locals>.DeprecationWrapper

ray.rllib.evaluation.TorchPolicyGraph

alias of ray.rllib.utils.deprecation.renamed_class.<locals>.DeprecationWrapper

class ray.rllib.evaluation.MultiAgentBatch(*args, **kw)
class ray.rllib.evaluation.SampleBatchBuilder[source]

Util to build a SampleBatch incrementally.

For efficiency, SampleBatches hold values in column form (as arrays). However, it is useful to add data one row (dict) at a time.

add_values(**values)[source]

Add the given dictionary (row) of values to this batch.

add_batch(batch)[source]

Add the given batch of values to this batch.

build_and_reset()[source]

Returns a sample batch including all previously added values.
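
Example usage sketch; the column names are arbitrary:

>>> from ray.rllib.evaluation import SampleBatchBuilder
>>> builder = SampleBatchBuilder()
>>> builder.add_values(obs=[0.1, 0.2], actions=1, rewards=0.5)
>>> builder.add_values(obs=[0.3, 0.4], actions=0, rewards=-0.5)
>>> batch = builder.build_and_reset()  # SampleBatch with 2 rows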

class ray.rllib.evaluation.MultiAgentSampleBatchBuilder(policy_map, clip_rewards, callbacks)[source]

Util to build SampleBatches for each policy in a multi-agent env.

Input data is per-agent, while output data is per-policy. There is an M:N mapping between agents and policies. We retain one local batch builder per agent. When an agent is done, its local batch is appended into the corresponding policy batch for the agent’s policy.

total()[source]

Returns summed number of steps across all agent buffers.

has_pending_agent_data()[source]

Returns whether there is pending unprocessed data.

add_values(agent_id, policy_id, **values)[source]

Add the given dictionary (row) of values to this batch.

Parameters
  • agent_id (obj) – Unique id for the agent we are adding values for.

  • policy_id (obj) – Unique id for policy controlling the agent.

  • values (dict) – Row of values to add for this agent.

postprocess_batch_so_far(episode)[source]

Apply policy postprocessors to any unprocessed rows.

This pushes the postprocessed per-agent batches onto the per-policy builders, clearing per-agent state.

Parameters

episode (Optional[MultiAgentEpisode]) – Current MultiAgentEpisode object.

build_and_reset(episode)[source]

Returns the accumulated sample batches for each policy.

Any unprocessed rows will be first postprocessed with a policy postprocessor. The internal state of this builder will be reset.

Parameters

episode (Optional[MultiAgentEpisode]) – Current MultiAgentEpisode object.

class ray.rllib.evaluation.SyncSampler(worker, env, policies, policy_mapping_fn, preprocessors, obs_filters, clip_rewards, rollout_fragment_length, callbacks, horizon=None, pack=False, tf_sess=None, clip_actions=True, soft_horizon=False, no_done_at_end=False, observation_fn=None)[source]
class ray.rllib.evaluation.AsyncSampler(worker, env, policies, policy_mapping_fn, preprocessors, obs_filters, clip_rewards, rollout_fragment_length, callbacks, horizon=None, pack=False, tf_sess=None, clip_actions=True, blackhole_outputs=False, soft_horizon=False, no_done_at_end=False, observation_fn=None)[source]
run()[source]

Method representing the thread’s activity.

You may override this method in a subclass. The standard run() method invokes the callable object passed to the object’s constructor as the target argument, if any, with sequential and keyword arguments taken from the args and kwargs arguments, respectively.

ray.rllib.evaluation.compute_advantages(rollout, last_r, gamma=0.9, lambda_=1.0, use_gae=True, use_critic=True)[source]

Given a rollout, compute its value targets and the advantage.

Parameters
  • rollout (SampleBatch) – SampleBatch of a single trajectory

  • last_r (float) – Value estimation for last observation

  • gamma (float) – Discount factor.

  • lambda_ (float) – Parameter for GAE

  • use_gae (bool) – Using Generalized Advantage Estimation

  • use_critic (bool) – Whether to use critic (value estimates). Setting this to False will use 0 as baseline.

Returns

Object with experience from rollout and processed rewards.

Return type

SampleBatch (SampleBatch)
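
Example sketch, assuming the rollout already carries value estimates under the "vf_preds" key (required when use_gae=True); the numbers are arbitrary:

>>> from ray.rllib.evaluation import SampleBatch, compute_advantages
>>> rollout = SampleBatch({
...     "obs": [[0.0], [1.0], [2.0]],
...     "actions": [0, 1, 0],
...     "rewards": [1.0, 1.0, 1.0],
...     "vf_preds": [0.5, 0.5, 0.5],
... })
>>> batch = compute_advantages(rollout, last_r=0.0, gamma=0.99, lambda_=0.95)
>>> # batch now also contains "advantages" and "value_targets" columns.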

ray.rllib.evaluation.collect_metrics(local_worker=None, remote_workers=[], to_be_collected=[], timeout_seconds=180)[source]

Gathers episode metrics from RolloutWorker instances.

class ray.rllib.evaluation.SampleBatch(*args, **kwargs)[source]

Wrapper around a dictionary with string keys and array-like values.

For example, {“obs”: [1, 2, 3], “reward”: [0, -1, 1]} is a batch of three samples, each with an “obs” and “reward” attribute.

concat(other)[source]

Returns a new SampleBatch with each data column concatenated.

Examples

>>> b1 = SampleBatch({"a": [1, 2]})
>>> b2 = SampleBatch({"a": [3, 4, 5]})
>>> print(b1.concat(b2))
{"a": [1, 2, 3, 4, 5]}
rows()[source]

Returns an iterator over data rows, i.e. dicts with column values.

Examples

>>> batch = SampleBatch({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> for row in batch.rows():
...     print(row)
{"a": 1, "b": 4}
{"a": 2, "b": 5}
{"a": 3, "b": 6}
columns(keys)[source]

Returns a list of just the specified columns.

Examples

>>> batch = SampleBatch({"a": [1], "b": [2], "c": [3]})
>>> print(batch.columns(["a", "b"]))
[[1], [2]]
shuffle()[source]

Shuffles the rows of this batch in-place.

split_by_episode()[source]

Splits this batch’s data by eps_id.

Returns

list of SampleBatch, one per distinct episode.
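
For example (assuming the batch carries the usual eps_id column):

>>> batch = SampleBatch({"a": [1, 2, 3], "eps_id": [0, 0, 1]})
>>> len(batch.split_by_episode())
2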

slice(start, end)[source]

Returns a slice of the row data of this batch.

Parameters
  • start (int) – Starting index.

  • end (int) – Ending index.

Returns

SampleBatch which has a slice of this batch’s data.
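
For example:

>>> batch = SampleBatch({"a": [1, 2, 3, 4]})
>>> batch.slice(1, 3)  # -> SampleBatch containing {"a": [2, 3]}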

ray.rllib.execution

ray.rllib.models

class ray.rllib.models.ActionDistribution(inputs, model)[source]

The policy action distribution of an agent.

inputs

input vector to compute samples from.

Type

Tensors

model

reference to model producing the inputs.

Type

ModelV2

sample()[source]

Draw a sample from the action distribution.

deterministic_sample()[source]

Get the deterministic “sampling” output from the distribution. This is usually the max-likelihood output, i.e., the mean for a Normal distribution or the argmax for a Categorical distribution.

sampled_action_logp()[source]

Returns the log probability of the last sampled action.

logp(x)[source]

The log-likelihood of the action distribution.

kl(other)[source]

The KL-divergence between two action distributions.

entropy()[source]

The entropy of the action distribution.

multi_kl(other)[source]

The KL-divergence between two action distributions.

This differs from kl() in that it can return an array for MultiDiscrete. TODO(ekl) consider removing this.

multi_entropy()[source]

The entropy of the action distribution.

This differs from entropy() in that it can return an array for MultiDiscrete. TODO(ekl) consider removing this.

static required_model_output_shape(action_space, model_config)[source]

Returns the required shape of an input parameter tensor for a particular action space and an optional dict of distribution-specific options.

Parameters
  • action_space (gym.Space) – The action space this distribution will be used for, whose shape attributes will be used to determine the required shape of the input parameter tensor.

  • model_config (dict) – Model’s config dict (as defined in catalog.py)

Returns

The size of the required input vector (minus the leading batch dimension).

Return type

model_output_shape (int or np.ndarray of ints)
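
For instance, a Categorical distribution over a Discrete(4) space needs one logit per action (the import path below is an assumption based on where the TF action distributions usually live):

>>> import gym
>>> from ray.rllib.models.tf.tf_action_dist import Categorical
>>> Categorical.required_model_output_shape(gym.spaces.Discrete(4), {})  # -> 4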

class ray.rllib.models.ModelCatalog[source]

Registry of models, preprocessors, and action distributions for envs.

Examples

>>> prep = ModelCatalog.get_preprocessor(env)
>>> observation = prep.transform(raw_observation)
>>> dist_class, dist_dim = ModelCatalog.get_action_dist(
...     env.action_space, {})
>>> model = ModelCatalog.get_model(inputs, dist_dim, options)
>>> dist = dist_class(model.outputs, model)
>>> action = dist.sample()
static get_action_dist(action_space, config, dist_type=None, framework='tf', **kwargs)[source]

Returns a distribution class and size for the given action space.

Parameters
  • action_space (Space) – Action space of the target gym env.

  • config (Optional[dict]) – Optional model config.

  • dist_type (Optional[str]) – Identifier of the action distribution.

  • framework (str) – One of “tf”, “tfe”, or “torch”.

  • kwargs (dict) – Optional kwargs to pass on to the Distribution’s constructor.

Returns

Python class of the distribution.

dist_dim (int): The size of the input vector to the distribution.

Return type

dist_class (ActionDistribution)
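
Illustrative example (for a Discrete(4) space the default distribution is typically Categorical, so dist_dim would be 4):

>>> import gym
>>> from ray.rllib.models import ModelCatalog
>>> dist_class, dist_dim = ModelCatalog.get_action_dist(
...     gym.spaces.Discrete(4), {})
>>> dist_dim  # -> 4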

static get_action_shape(action_space)[source]

Returns action tensor dtype and shape for the action space.

Parameters

action_space (Space) – Action space of the target gym env.

Returns

Dtype and shape of the actions tensor.

Return type

(dtype, shape)

static get_action_placeholder(action_space, name='action')[source]

Returns an action placeholder consistent with the action space.

Parameters
  • action_space (Space) – Action space of the target gym env.

  • name (str) – An optional string to name the placeholder by. Default: “action”.

Returns

A placeholder for the actions

Return type

action_placeholder (Tensor)

static get_model_v2(obs_space, action_space, num_outputs, model_config, framework='tf', name='default_model', model_interface=None, default_model=None, **model_kwargs)[source]

Returns a suitable model compatible with given spaces and output.

Parameters
  • obs_space (Space) – Observation space of the target gym env. This may have an original_space attribute that specifies how to unflatten the tensor into a ragged tensor.

  • action_space (Space) – Action space of the target gym env.

  • num_outputs (int) – The size of the output vector of the model.

  • framework (str) – One of “tf”, “tfe”, or “torch”.

  • name (str) – Name (scope) for the model.

  • model_interface (cls) – Interface required for the model.

  • default_model (cls) – Override the default class for the model. This only has an effect when not using a custom model.

  • model_kwargs (dict) – Args to pass to the ModelV2 constructor.

Returns

Model to use for the policy.

Return type

model (ModelV2)
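
A minimal sketch, assuming a gym env is available and that MODEL_DEFAULTS (the default model config) is importable from ray.rllib.models; num_outputs would usually come from ModelCatalog.get_action_dist():

>>> import gym
>>> from ray.rllib.models import ModelCatalog, MODEL_DEFAULTS
>>> env = gym.make("CartPole-v0")
>>> dist_class, dist_dim = ModelCatalog.get_action_dist(env.action_space, {})
>>> model = ModelCatalog.get_model_v2(
...     obs_space=env.observation_space,
...     action_space=env.action_space,
...     num_outputs=dist_dim,
...     model_config=MODEL_DEFAULTS.copy(),
...     framework="tf")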

static get_preprocessor(env, options=None)[source]

Returns a suitable preprocessor for the given env.

This is a wrapper for get_preprocessor_for_space().

static get_preprocessor_for_space(observation_space, options=None)[source]

Returns a suitable preprocessor for the given observation space.

Parameters
  • observation_space (Space) – The input observation space.

  • options (dict) – Options to pass to the preprocessor.

Returns

Preprocessor for the observations.

Return type

preprocessor (Preprocessor)
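
For example (for a small 1-D Box space, the default preprocessor typically passes observations through unchanged):

>>> import gym
>>> from ray.rllib.models import ModelCatalog
>>> prep = ModelCatalog.get_preprocessor_for_space(
...     gym.spaces.Box(low=-1.0, high=1.0, shape=(4,)))
>>> prep.shape  # -> (4,)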

static register_custom_preprocessor(preprocessor_name, preprocessor_class)[source]

Register a custom preprocessor class by name.

The preprocessor can later be used by specifying {“custom_preprocessor”: preprocessor_name} in the model config.

Parameters
  • preprocessor_name (str) – Name to register the preprocessor under.

  • preprocessor_class (type) – Python class of the preprocessor.

static register_custom_model(model_name, model_class)[source]

Register a custom model class by name.

The model can later be used by specifying {“custom_model”: model_name} in the model config.

Parameters
  • model_name (str) – Name to register the model under.

  • model_class (type) – Python class of the model.
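
A typical usage sketch (MyModel is a hypothetical custom model; the TFModelV2 import path is an assumption):

>>> from ray.rllib.models import ModelCatalog
>>> from ray.rllib.models.tf.tf_modelv2 import TFModelV2
>>> class MyModel(TFModelV2):
...     """Hypothetical custom model; implement forward() and value_function()."""
>>> ModelCatalog.register_custom_model("my_model", MyModel)
>>> # Then reference it in the trainer config:
>>> # config = {"model": {"custom_model": "my_model"}}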

static register_custom_action_dist(action_dist_name, action_dist_class)[source]

Register a custom action distribution class by name.

The action distribution can later be used by specifying {“custom_action_dist”: action_dist_name} in the model config.

Parameters
  • action_dist_name (str) – Name to register the action distribution under.

  • action_dist_class (type) – Python class of the action distribution.

static get_model(input_dict, obs_space, action_space, num_outputs, options, state_in=None, seq_lens=None)[source]

Deprecated: Use get_model_v2() instead.

class ray.rllib.models.Model(input_dict, obs_space, action_space, num_outputs, options, state_in=None, seq_lens=None)[source]

This class is deprecated! Use ModelV2 instead.

value_function()[source]

Builds the value function output.

This method can be overridden to customize the implementation of the value function (e.g., not sharing hidden layers).

Returns

Tensor of size [BATCH_SIZE] for the value function.

custom_loss(policy_loss, loss_inputs)[source]

Override to customize the loss function used to optimize this model.

This can be used to incorporate self-supervised losses (by defining a loss over existing input and output tensors of this model), and supervised losses (by defining losses over a variable-sharing copy of this model’s layers).

You can find a runnable example in examples/custom_loss.py.

Parameters
  • policy_loss (Tensor) – scalar policy loss from the policy.

  • loss_inputs (dict) – map of input placeholders for rollout data.

Returns

Scalar tensor for the customized loss for this model.

custom_stats()[source]

Override to return custom metrics from your model.

The stats will be reported as part of the learner stats, i.e. under:

info:
  learner:
    model:
      key1: metric1
      key2: metric2

Returns

Dict of string keys to scalar tensors.

loss()[source]

Deprecated: use self.custom_loss().

class ray.rllib.models.Preprocessor(obs_space, options=None)[source]

Defines an abstract observation preprocessor function.

shape

Shape of the preprocessed output.

Type

obj

transform(observation)[source]

Returns the preprocessed observation.

write(observation, array, offset)[source]

Alternative to transform for more efficient flattening.

check_shape(observation)[source]

Checks the shape of the given observation.

class ray.rllib.models.FullyConnectedNetwork(input_dict, obs_space, action_space, num_outputs, options, state_in=None, seq_lens=None)[source]

Generic fully connected network.

class ray.rllib.models.VisionNetwork(input_dict, obs_space, action_space, num_outputs, options, state_in=None, seq_lens=None)[source]

Generic vision network.

ray.rllib.utils

ray.rllib.utils.override(cls)[source]

Annotation for documenting method overrides.

Parameters

cls (type) – The superclass that provides the overridden method. If this cls does not actually have the method, an error is raised.
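
For example (MyPolicy is a hypothetical subclass):

>>> from ray.rllib.policy import Policy
>>> from ray.rllib.utils import override
>>> class MyPolicy(Policy):
...     @override(Policy)
...     def compute_actions(self, obs_batch, **kwargs):
...         return [], [], {}  # hypothetical no-op implementation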

ray.rllib.utils.PublicAPI(obj)[source]

Annotation for documenting public APIs.

Public APIs are classes and methods exposed to end users of RLlib. You can expect these APIs to remain stable across RLlib releases.

Subclasses that inherit from a @PublicAPI base class can be assumed part of the RLlib public API as well (e.g., all trainer classes are in public API because Trainer is @PublicAPI).

In addition, you can assume all trainer configurations are part of their public API as well.

ray.rllib.utils.DeveloperAPI(obj)[source]

Annotation for documenting developer APIs.

Developer APIs are classes and methods explicitly exposed to developers for the purposes of building custom algorithms or advanced training strategies on top of RLlib internals. You can generally expect these APIs to be stable sans minor changes (but less stable than public APIs).

Subclasses that inherit from a @DeveloperAPI base class can be assumed part of the RLlib developer API as well.

ray.rllib.utils.try_import_tf(error=False)[source]

Tries importing tf and returns the module (or None).

Parameters

error (bool) – Whether to raise an error if tf cannot be imported.

Returns

The tf module (either the tf.compat.v1 module of TF 2.x or the plain tf module of TF 1.x).

Raises

ImportError – If error=True and tf is not installed.
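
For example:

>>> from ray.rllib.utils import try_import_tf
>>> tf = try_import_tf()
>>> if tf is None:
...     print("TensorFlow is not installed")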

ray.rllib.utils.try_import_tfp(error=False)[source]

Tries importing tfp and returns the module (or None).

Parameters

error (bool) – Whether to raise an error if tfp cannot be imported.

Returns

The tfp module.

Raises

ImportError – If error=True and tfp is not installed.

ray.rllib.utils.try_import_torch(error=False)[source]

Tries importing torch and returns the module (or None).

Parameters

error (bool) – Whether to raise an error if torch cannot be imported.

Returns

The torch and torch.nn modules.

Return type

tuple

Raises

ImportError – If error=True and PyTorch is not installed.
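
Per the Returns description above, the call yields both the torch and the torch.nn module:

>>> from ray.rllib.utils import try_import_torch
>>> torch, nn = try_import_torch()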

ray.rllib.utils.check_framework(framework, allow_none=True)[source]

Checks whether the given framework is “valid”.

Meaning, whether all necessary dependencies are installed.

Parameters
  • framework (str) – One of “tf”, “torch”, or None.

  • allow_none (bool) – Whether framework=None (e.g. the numpy implementation) is allowed or not.

Returns

The input framework string.

Return type

str

Raises

ImportError – If given framework is not installed.

ray.rllib.utils.deprecation_warning(old, new=None, error=None)[source]

Logs (via the logger object) or throws a deprecation warning/error.

Parameters
  • old (str) – A description of the “thing” that is to be deprecated.

  • new (Optional[str]) – A description of the new “thing” that replaces it.

  • error (Optional[bool,Exception]) – Whether or which exception to throw. If True, throw ValueError.

ray.rllib.utils.renamed_agent(cls)[source]

Helper class for renaming Agent => Trainer with a warning.

ray.rllib.utils.renamed_class(cls, old_name)[source]

Helper class for renaming classes with a warning.

ray.rllib.utils.renamed_function(func, old_name)[source]

Helper function for renaming a function.

class ray.rllib.utils.FilterManager[source]

Manages filters and coordination across remote evaluators that expose get_filters and sync_filters.

static synchronize(local_filters, remotes, update_remote=True)[source]

Aggregates all filters from remote evaluators.

Local copy is updated and then broadcasted to all remote evaluators.

Parameters
  • local_filters (dict) – Filters to be synchronized.

  • remotes (list) – Remote evaluators with filters.

  • update_remote (bool) – Whether to push updates to remote filters.

class ray.rllib.utils.Filter[source]

Processes input, possibly statefully.

apply_changes(other, *args, **kwargs)[source]

Updates self with “new state” from other filter.

copy()[source]

Creates a new object with same state as self.

Returns

A copy of self.

sync(other)[source]

Copies all state from other filter to self.

clear_buffer()[source]

Creates a copy of the current state and clears the accumulated state.
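
An illustrative sketch using MeanStdFilter, a concrete Filter (the ray.rllib.utils.filter import path is an assumption):

>>> import numpy as np
>>> from ray.rllib.utils.filter import MeanStdFilter
>>> filt = MeanStdFilter(shape=(2,))
>>> normalized = filt(np.array([1.0, 2.0]))  # update running stats, normalize obs
>>> local_copy = filt.copy()
>>> local_copy.sync(filt)    # copy all state from filt
>>> filt.clear_buffer()      # snapshot and clear the accumulated state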

ray.rllib.utils.sigmoid(x, derivative=False)[source]

Returns the sigmoid function applied to x. Alternatively, can return the derivative of the sigmoid function at x.

Parameters
  • x (np.ndarray) – The input to the sigmoid function.

  • derivative (bool) – Whether to return the derivative or not. Default: False.

Returns

The sigmoid function (or its derivative) applied to x.

Return type

np.ndarray

ray.rllib.utils.softmax(x, axis=-1)[source]

Returns the softmax values for x as: S(xi) = e^xi / SUMj(e^xj), where j goes over all elements in x.

Parameters
  • x (np.ndarray) – The input to the softmax function.

  • axis (int) – The axis along which to softmax.

Returns

The softmax over x.

Return type

np.ndarray
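
For example:

>>> import numpy as np
>>> from ray.rllib.utils import softmax
>>> softmax(np.array([1.0, 1.0]))  # -> array([0.5, 0.5])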

ray.rllib.utils.relu(x, alpha=0.0)[source]

Implementation of the leaky ReLU function: y = x * alpha if x < 0 else x

Parameters
  • x (np.ndarray) – The input values.

  • alpha (float) – A scaling (“leak”) factor to use for negative x.

Returns

The leaky ReLU output for x.

Return type

np.ndarray

ray.rllib.utils.one_hot(x, depth=0, on_value=1, off_value=0)[source]

One-hot utility function for numpy. Thanks to qianyizhang: https://gist.github.com/qianyizhang/07ee1c15cad08afb03f5de69349efc30.

Parameters
  • x (np.ndarray) – The input to be one-hot encoded.

  • depth (int) – The max. number to be one-hot encoded (size of last rank).

  • on_value (float) – The value to use for on. Default: 1.0.

  • off_value (float) – The value to use for off. Default: 0.0.

Returns

The one-hot encoded equivalent of the input array.

Return type

np.ndarray
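
For example:

>>> import numpy as np
>>> from ray.rllib.utils import one_hot
>>> one_hot(np.array([0, 2]), depth=3)  # -> rows [1, 0, 0] and [0, 0, 1]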

ray.rllib.utils.fc(x, weights, biases=None, framework=None)[source]

Calculates the outputs of a fully-connected (dense) layer given weights/biases and an input.

Parameters
  • x (np.ndarray) – The input to the dense layer.

  • weights (np.ndarray) – The weights matrix.

  • biases (Optional[np.ndarray]) – The biases vector. All 0s if None.

  • framework (Optional[str]) – An optional framework hint (to figure out, e.g. whether to transpose torch weight matrices).

Returns

The dense layer’s output.

ray.rllib.utils.lstm(x, weights, biases=None, initial_internal_states=None, time_major=False, forget_bias=1.0)[source]

Calculates the outputs of an LSTM layer given weights/biases, internal_states, and input.

Parameters
  • x (np.ndarray) – The inputs to the LSTM layer including time-rank (0th if time-major, else 1st) and the batch-rank (1st if time-major, else 0th).

  • weights (np.ndarray) – The weights matrix.

  • biases (Optional[np.ndarray]) – The biases vector. All 0s if None.

  • initial_internal_states (Optional[np.ndarray]) – The initial internal states to pass into the layer. All 0s if None.

  • time_major (bool) – Whether to use time-major or not. Default: False.

  • forget_bias (float) – Gets added to first sigmoid (forget gate) output. Default: 1.0.

Returns

  • The LSTM layer’s output.

  • Tuple: Last (c-state, h-state).

Return type

Tuple

class ray.rllib.utils.PolicyClient(address)[source]

DEPRECATED: Please use rllib.env.PolicyClient instead.

start_episode(episode_id=None, training_enabled=True)[source]

Record the start of an episode.

Parameters
  • episode_id (Optional[str]) – Unique string id for the episode or None for it to be auto-assigned.

  • training_enabled (bool) – Whether to use experiences for this episode to improve the policy.

Returns

Unique string id for the episode.

Return type

episode_id (str)

get_action(episode_id, observation)[source]

Record an observation and get the on-policy action.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • observation (obj) – Current environment observation.

Returns

Action from the env action space.

Return type

action (obj)

log_action(episode_id, observation, action)[source]

Record an observation and (off-policy) action taken.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • observation (obj) – Current environment observation.

  • action (obj) – Action for the observation.

log_returns(episode_id, reward, info=None)[source]

Record returns from the environment.

The reward will be attributed to the previous action taken by the episode. Rewards accumulate until the next action. If no reward is logged before the next action, a reward of 0.0 is assumed.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • reward (float) – Reward from the environment.

end_episode(episode_id, observation)[source]

Record the end of an episode.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • observation (obj) – Current environment observation.

class ray.rllib.utils.PolicyServer(external_env, address, port)[source]

DEPRECATED: Please use rllib.env.PolicyServerInput instead.

class ray.rllib.utils.LinearSchedule(**kwargs)[source]

Linear interpolation between initial_p and final_p. Simply uses PolynomialSchedule with power=1.0:

value(t) = final_p + (initial_p - final_p) * (1 - t / t_max)

class ray.rllib.utils.PiecewiseSchedule(endpoints, framework, interpolation=<function _linear_interpolation>, outside_value=None)[source]
class ray.rllib.utils.PolynomialSchedule(schedule_timesteps, final_p, framework, initial_p=1.0, power=2.0)[source]
class ray.rllib.utils.ExponentialSchedule(schedule_timesteps, framework, initial_p=1.0, decay_rate=0.1)[source]
class ray.rllib.utils.ConstantSchedule(value, framework)[source]

A Schedule where the value remains constant over time.
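
An illustrative sketch of a piecewise (linearly interpolated) schedule, assuming the common Schedule interface exposes value(t):

>>> from ray.rllib.utils import PiecewiseSchedule
>>> schedule = PiecewiseSchedule(
...     endpoints=[(0, 1.0), (10000, 0.1)], framework=None, outside_value=0.1)
>>> schedule.value(0)      # -> 1.0
>>> schedule.value(5000)   # -> interpolated between 1.0 and 0.1
>>> schedule.value(20000)  # -> 0.1 (outside_value)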

ray.rllib.utils.check(x, y, decimals=5, atol=None, rtol=None, false=False)[source]

Checks two structures (dict, tuple, list, np.array, float, int, etc.) for (almost) numeric identity. All numbers in the two structures have to match up to the given number of decimal digits after the floating point. Uses assertions.

Parameters
  • x (any) – The value to be compared (to the expectation: y). This may be a Tensor.

  • y (any) – The expected value to be compared to x. This must not be a Tensor.

  • decimals (int) – The number of digits after the floating point up to which all numeric values have to match.

  • atol (float) – Absolute tolerance of the difference between x and y (overrides decimals if given).

  • rtol (float) – Relative tolerance of the difference between x and y (overrides decimals if given).

  • false (bool) – Whether to check that x and y are NOT the same.
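
For example:

>>> import numpy as np
>>> from ray.rllib.utils import check
>>> check(np.array([1.000001, 2.0]), np.array([1.0, 2.0]), decimals=5)  # passes
>>> check(0.3, 0.4, false=True)  # passes, since the two values do differ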

ray.rllib.utils.framework_iterator(config=None, frameworks=('tf', 'tfe', 'torch'), session=False)[source]

A generator that allows looping through n frameworks for testing.

Provides the correct config entries (“framework”) as well as the correct eager/non-eager contexts for tfe/tf.

Parameters
  • config (Optional[dict]) – An optional config dict to alter in place depending on the iteration.

  • frameworks (Tuple[str]) – A list/tuple of the frameworks to be tested. Allowed are: “tf”, “tfe”, “torch”, and “auto”.

  • session (bool) – If True, enter a tf.Session() and yield that as well in the tf-case (otherwise, yield (fw, None)).

Yields

str – If session is False: the current framework (“tf”, “tfe”, or “torch”) being used.

Tuple[str, Union[None, tf.Session]] – If session is True: a tuple of the current framework and the tf.Session (for the “tf” case, else None).
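
An illustrative sketch; per the description above, the iterator sets the “framework” entry of the given config on each iteration:

>>> from ray.rllib.utils import framework_iterator
>>> config = {"lr": 0.001}
>>> for fw in framework_iterator(config, frameworks=("tf", "torch")):
...     print(fw, config["framework"])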

ray.rllib.utils.check_compute_action(trainer, include_prev_action_reward=False)[source]

Tests different combinations of arguments for trainer.compute_action.

Parameters
  • trainer (Trainer) – The trainer object to test.

  • include_prev_action_reward (bool) – Whether to include the prev-action and -reward in the compute_action call.

Raises

ValueError – If anything unexpected happens.

ray.rllib.utils.merge_dicts(d1, d2)[source]

Parameters
  • d1 (dict) – Dict 1.

  • d2 (dict) – Dict 2.

Returns

A new dict that is d1 and d2 deep merged.

Return type

dict
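
For example:

>>> from ray.rllib.utils import merge_dicts
>>> merged = merge_dicts({"a": 1, "b": {"c": 2}}, {"b": {"d": 3}})
>>> merged  # -> {"a": 1, "b": {"c": 2, "d": 3}}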

ray.rllib.utils.deep_update(original, new_dict, new_keys_allowed=False, whitelist=None, override_all_if_type_changes=None)[source]

Updates original dict with values from new_dict recursively.

If a new key is introduced in new_dict and new_keys_allowed is not True, an error will be thrown. Further, for sub-dicts, if the parent key is in the whitelist, then new subkeys can be introduced.

Parameters
  • original (dict) – Dictionary with default values.

  • new_dict (dict) – Dictionary with values to be updated.

  • new_keys_allowed (bool) – Whether new keys are allowed.

  • whitelist (Optional[List[str]]) – List of keys that correspond to dict values where new subkeys can be introduced. This is only at the top level.

  • override_all_if_type_changes (Optional[List[str]]) – List of top level keys with value=dict, for which we always simply override the entire value (dict), iff the “type” key in that value dict changes.
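
An illustrative example (keys are made up; no new top-level keys are introduced, so new_keys_allowed can stay False):

>>> from ray.rllib.utils import deep_update
>>> original = {"model": {"fcnet_hiddens": [256, 256]}}
>>> updated = deep_update(original, {"model": {"fcnet_hiddens": [64]}})
>>> updated["model"]["fcnet_hiddens"]  # -> [64]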

ray.rllib.utils.add_mixins(base, mixins)[source]

Returns a new class with mixins applied in priority order.

ray.rllib.utils.force_list(elements=None, to_tuple=False)[source]

Makes sure elements is returned as a list, whether elements is a single item, already a list, or a tuple.

Parameters
  • elements (Optional[any]) – The inputs as single item, list, or tuple to be converted into a list/tuple. If None, returns empty list/tuple.

  • to_tuple (bool) – Whether to use tuple (instead of list).

Returns

All given elements in a list/tuple depending on to_tuple’s value. If elements is None, returns an empty list/tuple.

Return type

Union[list,tuple]

ray.rllib.utils.force_tuple(elements=None, *, to_tuple=True)

Makes sure elements is returned as a tuple, whether elements is a single item, a list, or already a tuple (same as force_list, but with to_tuple=True).

Parameters
  • elements (Optional[any]) – The inputs as single item, list, or tuple to be converted into a list/tuple. If None, returns empty list/tuple.

  • to_tuple (bool) – Whether to use tuple (instead of list).

Returns

All given elements in a list/tuple depending on to_tuple’s value. If elements is None, returns an empty list/tuple.

Return type

Union[list,tuple]
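
For example:

>>> from ray.rllib.utils import force_list, force_tuple
>>> force_list("item")   # -> ["item"]
>>> force_list(None)     # -> []
>>> force_tuple([1, 2])  # -> (1, 2)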