RLlib Package Reference

ray.rllib.policy

class ray.rllib.policy.Policy(observation_space, action_space, config)[source]

An agent policy and loss, i.e., a TFPolicy or other subclass.

This object defines how to act in the environment, and also losses used to improve the policy based on its experiences. Note that both policy and loss are defined together for convenience, though the policy itself is logically separate.

All policies can directly extend Policy; however, TensorFlow users may find TFPolicy simpler to implement. TFPolicy also enables RLlib to apply TensorFlow-specific optimizations such as fusing multiple policy graphs and multi-GPU support.
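
For illustration only, a minimal sketch of a custom Policy subclass might look as follows (the random-action behavior and the trivial weight handling are hypothetical, not part of RLlib):

from ray.rllib.policy import Policy


class MyRandomPolicy(Policy):
    """Toy policy that samples random actions and does not learn."""

    def compute_actions(self, obs_batch, state_batches=None,
                        prev_action_batch=None, prev_reward_batch=None,
                        info_batch=None, episodes=None, explore=None,
                        timestep=None, **kwargs):
        # One random action per observation; no RNN states, no extra info.
        actions = [self.action_space.sample() for _ in obs_batch]
        return actions, [], {}

    def learn_on_batch(self, samples):
        return {}  # nothing to learn

    def get_weights(self):
        return {}

    def set_weights(self, weights):
        pass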

observation_space

Observation space of the policy.

Type

gym.Space

action_space

Action space of the policy.

Type

gym.Space

exploration

The exploration object to use for computing actions, or None.

Type

Exploration

abstract compute_actions(obs_batch, state_batches=None, prev_action_batch=None, prev_reward_batch=None, info_batch=None, episodes=None, explore=None, timestep=None, **kwargs)[source]

Computes actions for the current policy.

Parameters
  • obs_batch (Union[List,np.ndarray]) – Batch of observations.

  • state_batches (Optional[list]) – List of RNN state input batches, if any.

  • prev_action_batch (Optional[List,np.ndarray]) – Batch of previous action values.

  • prev_reward_batch (Optional[List,np.ndarray]) – Batch of previous rewards.

  • info_batch (info) – Batch of info objects.

  • episodes (list) – MultiAgentEpisode for each obs in obs_batch. This provides access to all of the internal episode state, which may be useful for model-based or multiagent algorithms.

  • explore (bool) – Whether to pick an exploitation or exploration action (default: None -> use self.config["explore"]).

  • timestep (int) – The current (sampling) time step.

  • kwargs – forward compatibility placeholder

Returns

actions (np.ndarray): batch of output actions, with shape like [BATCH_SIZE, ACTION_SHAPE].

state_outs (list): list of RNN state output batches, if any, with shape like [STATE_SIZE, BATCH_SIZE].

info (dict): dictionary of extra feature batches, if any, with shape like {"f1": [BATCH_SIZE, ...], "f2": [BATCH_SIZE, ...]}.

compute_single_action(obs, state=None, prev_action=None, prev_reward=None, info=None, episode=None, clip_actions=False, explore=None, timestep=None, **kwargs)[source]

Unbatched version of compute_actions.

Parameters
  • obs (obj) – Single observation.

  • state (list) – List of RNN state inputs, if any.

  • prev_action (obj) – Previous action value, if any.

  • prev_reward (float) – Previous reward, if any.

  • info (dict) – info object, if any

  • episode (MultiAgentEpisode) – this provides access to all of the internal episode state, which may be useful for model-based or multi-agent algorithms.

  • clip_actions (bool) – Should actions be clipped?

  • explore (bool) – Whether to pick an exploitation or exploration action (default: None -> use self.config["explore"]).

  • timestep (int) – The current (sampling) time step.

  • kwargs – forward compatibility placeholder

Returns

actions (obj): single action

state_outs (list): list of RNN state outputs, if any

info (dict): dictionary of extra features, if any
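
For example, assuming a trainer has already been built and obs is a single observation from the env (both names are illustrative here), the unbatched call looks like:

>>> policy = trainer.get_policy()  # default policy of a built trainer
>>> action, state_outs, info = policy.compute_single_action(obs)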

compute_log_likelihoods(actions, obs_batch, state_batches=None, prev_action_batch=None, prev_reward_batch=None)[source]

Computes the log-prob/likelihood for a given action and observation.

Parameters
  • actions (Union[List,np.ndarray]) – Batch of actions, for which to retrieve the log-probs/likelihoods (given all other inputs: obs, states, ..).

  • obs_batch (Union[List,np.ndarray]) – Batch of observations.

  • state_batches (Optional[list]) – List of RNN state input batches, if any.

  • prev_action_batch (Optional[List,np.ndarray]) – Batch of previous action values.

  • prev_reward_batch (Optional[List,np.ndarray]) – Batch of previous rewards.

Returns

Batch of log probs/likelihoods, with shape: [BATCH_SIZE].

Return type

log-likelihoods (np.ndarray)

postprocess_trajectory(sample_batch, other_agent_batches=None, episode=None)[source]

Implements algorithm-specific trajectory postprocessing.

This will be called on each trajectory fragment computed during policy evaluation. Each fragment is guaranteed to be only from one episode.

Parameters
  • sample_batch (SampleBatch) – batch of experiences for the policy, which will contain at most one episode trajectory.

  • other_agent_batches (dict) – In a multi-agent env, this contains a mapping of agent ids to (policy, agent_batch) tuples containing the policy and experiences of the other agents.

  • episode (MultiAgentEpisode) – this provides access to all of the internal episode state, which may be useful for model-based or multi-agent algorithms.

Returns

Postprocessed sample batch.

Return type

SampleBatch
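
As a sketch (not the implementation of any built-in algorithm), a postprocessing function with this signature could append an advantages column using compute_advantages; the discount settings below are arbitrary example values:

from ray.rllib.evaluation import compute_advantages


def my_postprocess_fn(policy, sample_batch, other_agent_batches=None,
                      episode=None):
    # Adds "advantages" (and "value_targets") columns to the fragment.
    return compute_advantages(
        sample_batch, last_r=0.0, gamma=0.99, use_gae=False, use_critic=False)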

learn_on_batch(samples)[source]

Fused compute gradients and apply gradients call.

Either this or the combination of compute/apply grads must be implemented by subclasses.

Returns

dictionary of extra metadata from compute_gradients().

Return type

grad_info

Examples

>>> samples = ev.sample()
>>> ev.learn_on_batch(samples)
compute_gradients(postprocessed_batch)[source]

Computes gradients against a batch of experiences.

Either this or learn_on_batch() must be implemented by subclasses.

Returns

grads (list): List of gradient output values.

info (dict): Extra policy-specific values.

apply_gradients(gradients)[source]

Applies previously computed gradients.

Either this or learn_on_batch() must be implemented by subclasses.

get_weights()[source]

Returns model weights.

Returns

Serializable copy or view of model weights

Return type

weights (obj)

set_weights(weights)[source]

Sets model weights.

Parameters

weights (obj) – Serializable copy or view of model weights

get_exploration_info()[source]

Returns the current exploration information of this policy.

This information depends on the policy’s Exploration object.

Returns

Serializable information on the self.exploration object.

Return type

any

is_recurrent()[source]

Whether this Policy holds a recurrent Model.

Returns

True if this Policy has-a RNN-based Model.

Return type

bool

num_state_tensors()[source]

The number of internal states needed by the RNN-Model of the Policy.

Returns

The number of RNN internal states kept by this Policy’s Model.

Return type

int

get_initial_state()[source]

Returns initial RNN state for the current policy.

get_state()[source]

Saves all local state.

Returns

Serialized local state.

Return type

state (obj)

set_state(state)[source]

Restores all local state.

Parameters

state (obj) – Serialized local state.

on_global_var_update(global_vars)[source]

Called on an update to global vars.

Parameters

global_vars (dict) – Global variables broadcast from the driver.

export_model(export_dir)[source]

Export Policy to local directory for serving.

Parameters

export_dir (str) – Local writable directory.

export_checkpoint(export_dir)[source]

Export Policy checkpoint to local directory.

Parameters

export_dir (str) – Local writable directory.

import_model_from_h5(import_file)[source]

Imports Policy from local file.

Parameters

import_file (str) – Local readable file.

class ray.rllib.policy.TorchPolicy(observation_space, action_space, config, *, model, loss, action_distribution_class, action_sampler_fn=None, action_distribution_fn=None, max_seq_len=20, get_batch_divisibility_req=None)[source]

Template for a PyTorch policy and loss to use with RLlib.

This is similar to TFPolicy, but for PyTorch.

observation_space

observation space of the policy.

Type

gym.Space

action_space

action space of the policy.

Type

gym.Space

config

config of the policy.

Type

dict

model

Torch model instance.

Type

TorchModel

dist_class

Torch action distribution class.

Type

type

compute_actions(obs_batch, state_batches=None, prev_action_batch=None, prev_reward_batch=None, info_batch=None, episodes=None, explore=None, timestep=None, **kwargs)[source]

Computes actions for the current policy.

Parameters
  • obs_batch (Union[List,np.ndarray]) – Batch of observations.

  • state_batches (Optional[list]) – List of RNN state input batches, if any.

  • prev_action_batch (Optional[List,np.ndarray]) – Batch of previous action values.

  • prev_reward_batch (Optional[List,np.ndarray]) – Batch of previous rewards.

  • info_batch (info) – Batch of info objects.

  • episodes (list) – MultiAgentEpisode for each obs in obs_batch. This provides access to all of the internal episode state, which may be useful for model-based or multiagent algorithms.

  • explore (bool) – Whether to pick an exploitation or exploration action (default: None -> use self.config["explore"]).

  • timestep (int) – The current (sampling) time step.

  • kwargs – forward compatibility placeholder

Returns

actions (np.ndarray): batch of output actions, with shape like [BATCH_SIZE, ACTION_SHAPE].

state_outs (list): list of RNN state output batches, if any, with shape like [STATE_SIZE, BATCH_SIZE].

info (dict): dictionary of extra feature batches, if any, with shape like {"f1": [BATCH_SIZE, ...], "f2": [BATCH_SIZE, ...]}.

compute_log_likelihoods(actions, obs_batch, state_batches=None, prev_action_batch=None, prev_reward_batch=None)[source]

Computes the log-prob/likelihood for a given action and observation.

Parameters
  • actions (Union[List,np.ndarray]) – Batch of actions, for which to retrieve the log-probs/likelihoods (given all other inputs: obs, states, ..).

  • obs_batch (Union[List,np.ndarray]) – Batch of observations.

  • state_batches (Optional[list]) – List of RNN state input batches, if any.

  • prev_action_batch (Optional[List,np.ndarray]) – Batch of previous action values.

  • prev_reward_batch (Optional[List,np.ndarray]) – Batch of previous rewards.

Returns

Batch of log probs/likelihoods, with shape: [BATCH_SIZE].

Return type

log-likelihoods (np.ndarray)

learn_on_batch(postprocessed_batch)[source]

Fused compute gradients and apply gradients call.

Either this or the combination of compute/apply grads must be implemented by subclasses.

Returns

dictionary of extra metadata from compute_gradients().

Return type

grad_info

Examples

>>> samples = ev.sample()
>>> ev.learn_on_batch(samples)
compute_gradients(postprocessed_batch)[source]

Computes gradients against a batch of experiences.

Either this or learn_on_batch() must be implemented by subclasses.

Returns

grads (list): List of gradient output values.

info (dict): Extra policy-specific values.

apply_gradients(gradients)[source]

Applies previously computed gradients.

Either this or learn_on_batch() must be implemented by subclasses.

get_weights()[source]

Returns model weights.

Returns

Serializable copy or view of model weights

Return type

weights (obj)

set_weights(weights)[source]

Sets model weights.

Parameters

weights (obj) – Serializable copy or view of model weights

is_recurrent()[source]

Whether this Policy holds a recurrent Model.

Returns

True if this Policy has-a RNN-based Model.

Return type

bool

num_state_tensors()[source]

The number of internal states needed by the RNN-Model of the Policy.

Returns

The number of RNN internal states kept by this Policy’s Model.

Return type

int

get_initial_state()[source]

Returns initial RNN state for the current policy.

extra_grad_process(optimizer, loss)[source]

Called after each optimizer.zero_grad() + loss.backward() call.

Called for each self._optimizers/loss-value pair. Allows for gradient processing before optimizer.step() is called. E.g. for gradient clipping.

Parameters
  • optimizer (torch.optim.Optimizer) – A torch optimizer object.

  • loss (torch.Tensor) – The loss tensor associated with the optimizer.

Returns

An info dict.

Return type

dict
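
As an illustrative sketch, a TorchPolicy subclass could use this hook to clip gradients before each optimizer step (the 40.0 threshold is an arbitrary example value):

import torch

from ray.rllib.policy import TorchPolicy


class ClippingTorchPolicy(TorchPolicy):
    def extra_grad_process(self, optimizer, loss):
        # Clip the global norm of all gradients this optimizer will apply.
        params = [p for group in optimizer.param_groups
                  for p in group["params"] if p.grad is not None]
        total_norm = torch.nn.utils.clip_grad_norm_(params, max_norm=40.0)
        return {"grad_gnorm": float(total_norm)}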

extra_action_out(input_dict, state_batches, model, action_dist)[source]

Returns dict of extra info to include in experience batch.

Parameters
  • input_dict (dict) – Dict of model input tensors.

  • state_batches (list) – List of state tensors.

  • model (TorchModelV2) – Reference to the model.

  • action_dist (TorchActionDistribution) – Torch action dist object to get log-probs (e.g. for already sampled actions).

extra_grad_info(train_batch)[source]

Return dict of extra grad info.

optimizer()[source]

Custom PyTorch optimizer to use.

export_model(export_dir)[source]

TODO(sven): implement for torch.

export_checkpoint(export_dir)[source]

TODO(sven): implement for torch.

import_model_from_h5(import_file)[source]

Imports weights into torch model.

class ray.rllib.policy.TFPolicy(observation_space, action_space, config, sess, obs_input, sampled_action, loss, loss_inputs, model=None, sampled_action_logp=None, action_input=None, log_likelihood=None, dist_inputs=None, dist_class=None, state_inputs=None, state_outputs=None, prev_action_input=None, prev_reward_input=None, seq_lens=None, max_seq_len=20, batch_divisibility_req=1, update_ops=None, explore=None, timestep=None)[source]

An agent policy and loss implemented in TensorFlow.

Extending this class enables RLlib to perform TensorFlow specific optimizations on the policy, e.g., parallelization across gpus or fusing multiple graphs together in the multi-agent setting.

Input tensors are typically shaped like [BATCH_SIZE, …].

observation_space

observation space of the policy.

Type

gym.Space

action_space

action space of the policy.

Type

gym.Space

model

RLlib model used for the policy.

Type

rllib.models.Model

Examples

>>> policy = TFPolicySubclass(
    sess, obs_input, sampled_action, loss, loss_inputs)
>>> print(policy.compute_actions([1, 0, 2]))
(array([0, 1, 1]), [], {})
>>> print(policy.postprocess_trajectory(SampleBatch({...})))
SampleBatch({"action": ..., "advantages": ..., ...})
variables()[source]

Return the list of all savable variables for this policy.

get_placeholder(name)[source]

Returns the given action or loss input placeholder by name.

If the loss has not been initialized and a loss input placeholder is requested, an error is raised.

get_session()[source]

Returns a reference to the TF session for this policy.

loss_initialized()[source]

Returns whether the loss function has been initialized.

compute_actions(obs_batch, state_batches=None, prev_action_batch=None, prev_reward_batch=None, info_batch=None, episodes=None, explore=None, timestep=None, **kwargs)[source]

Computes actions for the current policy.

Parameters
  • obs_batch (Union[List,np.ndarray]) – Batch of observations.

  • state_batches (Optional[list]) – List of RNN state input batches, if any.

  • prev_action_batch (Optional[List,np.ndarray]) – Batch of previous action values.

  • prev_reward_batch (Optional[List,np.ndarray]) – Batch of previous rewards.

  • info_batch (info) – Batch of info objects.

  • episodes (list) – MultiAgentEpisode for each obs in obs_batch. This provides access to all of the internal episode state, which may be useful for model-based or multiagent algorithms.

  • explore (bool) – Whether to pick an exploitation or exploration action (default: None -> use self.config["explore"]).

  • timestep (int) – The current (sampling) time step.

  • kwargs – forward compatibility placeholder

Returns

actions (np.ndarray): batch of output actions, with shape like [BATCH_SIZE, ACTION_SHAPE].

state_outs (list): list of RNN state output batches, if any, with shape like [STATE_SIZE, BATCH_SIZE].

info (dict): dictionary of extra feature batches, if any, with shape like {"f1": [BATCH_SIZE, ...], "f2": [BATCH_SIZE, ...]}.

compute_log_likelihoods(actions, obs_batch, state_batches=None, prev_action_batch=None, prev_reward_batch=None)[source]

Computes the log-prob/likelihood for a given action and observation.

Parameters
  • actions (Union[List,np.ndarray]) – Batch of actions, for which to retrieve the log-probs/likelihoods (given all other inputs: obs, states, ..).

  • obs_batch (Union[List,np.ndarray]) – Batch of observations.

  • state_batches (Optional[list]) – List of RNN state input batches, if any.

  • prev_action_batch (Optional[List,np.ndarray]) – Batch of previous action values.

  • prev_reward_batch (Optional[List,np.ndarray]) – Batch of previous rewards.

Returns

Batch of log probs/likelihoods, with shape: [BATCH_SIZE].

Return type

log-likelihoods (np.ndarray)

compute_gradients(postprocessed_batch)[source]

Computes gradients against a batch of experiences.

Either this or learn_on_batch() must be implemented by subclasses.

Returns

grads (list): List of gradient output values.

info (dict): Extra policy-specific values.

apply_gradients(gradients)[source]

Applies previously computed gradients.

Either this or learn_on_batch() must be implemented by subclasses.

learn_on_batch(postprocessed_batch)[source]

Fused compute gradients and apply gradients call.

Either this or the combination of compute/apply grads must be implemented by subclasses.

Returns

dictionary of extra metadata from compute_gradients().

Return type

grad_info

Examples

>>> samples = ev.sample()
>>> ev.learn_on_batch(samples)
get_exploration_info()[source]

Returns the current exploration information of this policy.

This information depends on the policy’s Exploration object.

Returns

Serializable information on the self.exploration object.

Return type

any

get_weights()[source]

Returns model weights.

Returns

Serializable copy or view of model weights

Return type

weights (obj)

set_weights(weights)[source]

Sets model weights.

Parameters

weights (obj) – Serializable copy or view of model weights

export_model(export_dir)[source]

Export tensorflow graph to export_dir for serving.

export_checkpoint(export_dir, filename_prefix='model')[source]

Export tensorflow checkpoint to export_dir.

import_model_from_h5(import_file)[source]

Imports weights into tf model.

copy(existing_inputs)[source]

Creates a copy of self using existing input placeholders.

Optional, only required to work with the multi-GPU optimizer.

is_recurrent()[source]

Whether this Policy holds a recurrent Model.

Returns

True if this Policy has-a RNN-based Model.

Return type

bool

num_state_tensors()[source]

The number of internal states needed by the RNN-Model of the Policy.

Returns

The number of RNN internal states kept by this Policy’s Model.

Return type

int

extra_compute_action_feed_dict()[source]

Extra dict to pass to the compute actions session run.

extra_compute_action_fetches()[source]

Extra values to fetch and return from compute_actions().

By default we return action probability/log-likelihood info and action distribution inputs (if present).

extra_compute_grad_feed_dict()[source]

Extra dict to pass to the compute gradients session run.

extra_compute_grad_fetches()[source]

Extra values to fetch and return from compute_gradients().

optimizer()[source]

TF optimizer to use for policy optimization.

gradients(optimizer, loss)[source]

Override for custom gradient computation.

build_apply_op(optimizer, grads_and_vars)[source]

Override for custom gradient apply computation.

ray.rllib.policy.build_torch_policy(name, *, loss_fn, get_default_config=None, stats_fn=None, postprocess_fn=None, extra_action_out_fn=None, extra_grad_process_fn=None, optimizer_fn=None, before_init=None, after_init=None, action_sampler_fn=None, action_distribution_fn=None, make_model=None, make_model_and_action_dist=None, apply_gradients_fn=None, mixins=None, get_batch_divisibility_req=None)[source]

Helper function for creating a torch policy class at runtime.

Parameters
  • name (str) – name of the policy (e.g., “PPOTorchPolicy”)

  • loss_fn (callable) – Callable that returns a loss tensor as arguments given (policy, model, dist_class, train_batch).

  • get_default_config (Optional[callable]) – Optional callable that returns the default config to merge with any overrides.

  • stats_fn (Optional[callable]) – Optional callable that returns a dict of values given the policy and batch input tensors.

  • postprocess_fn (Optional[callable]) – Optional experience postprocessing function that takes the same args as Policy.postprocess_trajectory().

  • extra_action_out_fn (Optional[callable]) – Optional callable that returns a dict of extra values to include in experiences.

  • extra_grad_process_fn (Optional[callable]) – Optional callable that is called after gradients are computed and returns processing info.

  • optimizer_fn (Optional[callable]) – Optional callable that returns a torch optimizer given the policy and config.

  • before_init (Optional[callable]) – Optional callable to run at the beginning of Policy.__init__ that takes the same arguments as the Policy constructor.

  • after_init (Optional[callable]) – Optional callable to run at the end of policy init that takes the same arguments as the policy constructor.

  • action_sampler_fn (Optional[callable]) – Optional callable returning a sampled action and its log-likelihood given some (obs and state) inputs.

  • action_distribution_fn (Optional[callable]) – A callable that takes the Policy, Model, the observation batch, an explore-flag, a timestep, and an is_training flag and returns a tuple of a) distribution inputs (parameters), b) a dist-class to generate an action distribution object from, and c) internal-state outputs (empty list if not applicable).

  • make_model (Optional[callable]) – Optional func that takes the same arguments as Policy.__init__ and returns a model instance. The distribution class will be determined automatically. Note: Only one of make_model or make_model_and_action_dist should be provided.

  • make_model_and_action_dist (Optional[callable]) – Optional func that takes the same arguments as Policy.__init__ and returns a tuple of model instance and torch action distribution class. Note: Only one of make_model or make_model_and_action_dist should be provided.

  • apply_gradients_fn (Optional[callable]) – Optional callable that takes a grads list and applies these to the Model’s parameters.

  • mixins (list) – list of any class mixins for the returned policy class. These mixins will be applied in order and will have higher precedence than the TorchPolicy class.

  • get_batch_divisibility_req (Optional[callable]) – Optional callable that returns the divisibility requirement for sample batches.

Returns

TorchPolicy child class constructed from the specified args.

Return type

type
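
A minimal illustrative use of build_torch_policy, assuming the standard ModelV2/action-distribution setup; the REINFORCE-style loss on raw rewards is only a placeholder (real algorithms use postprocessed advantages):

from ray.rllib.evaluation import SampleBatch
from ray.rllib.policy import build_torch_policy


def my_loss_fn(policy, model, dist_class, train_batch):
    # Placeholder policy-gradient loss: -E[logp(a) * r].
    logits, _ = model.from_batch(train_batch)
    action_dist = dist_class(logits, model)
    log_probs = action_dist.logp(train_batch[SampleBatch.ACTIONS])
    return -(log_probs * train_batch[SampleBatch.REWARDS]).mean()


MyTorchPolicy = build_torch_policy(name="MyTorchPolicy", loss_fn=my_loss_fn)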

ray.rllib.policy.build_tf_policy(name, *, loss_fn, get_default_config=None, postprocess_fn=None, stats_fn=None, optimizer_fn=None, gradients_fn=None, apply_gradients_fn=None, grad_stats_fn=None, extra_action_fetches_fn=None, extra_learn_fetches_fn=None, before_init=None, before_loss_init=None, after_init=None, make_model=None, action_sampler_fn=None, action_distribution_fn=None, mixins=None, get_batch_divisibility_req=None, obs_include_prev_action_reward=True)[source]

Helper function for creating a dynamic tf policy at runtime.

Functions will be run in this order to initialize the policy:
  1. Placeholder setup: postprocess_fn

  2. Loss init: loss_fn, stats_fn

  3. Optimizer init: optimizer_fn, gradients_fn, apply_gradients_fn, grad_stats_fn

This means that you can e.g., depend on any policy attributes created in the running of loss_fn in later functions such as stats_fn.

In eager mode, the following functions will be run repeatedly on each eager execution: loss_fn, stats_fn, gradients_fn, apply_gradients_fn, and grad_stats_fn.

This means that these functions should not define any variables internally, otherwise they will fail in eager mode execution. Variables should only be created in make_model (if defined).

Parameters
  • name (str) – name of the policy (e.g., “PPOTFPolicy”)

  • loss_fn (func) – function that returns a loss tensor as arguments (policy, model, dist_class, train_batch)

  • get_default_config (func) – optional function that returns the default config to merge with any overrides

  • postprocess_fn (func) – optional experience postprocessing function that takes the same args as Policy.postprocess_trajectory()

  • stats_fn (func) – optional function that returns a dict of TF fetches given the policy and batch input tensors

  • optimizer_fn (func) – optional function that returns a tf.Optimizer given the policy and config

  • gradients_fn (func) – optional function that returns a list of gradients given (policy, optimizer, loss). If not specified, this defaults to optimizer.compute_gradients(loss)

  • apply_gradients_fn (func) – optional function that returns an apply gradients op given (policy, optimizer, grads_and_vars)

  • grad_stats_fn (func) – optional function that returns a dict of TF fetches given the policy, batch input, and gradient tensors

  • extra_action_fetches_fn (func) – optional function that returns a dict of TF fetches given the policy object

  • extra_learn_fetches_fn (func) – optional function that returns a dict of extra values to fetch and return when learning on a batch

  • before_init (func) – optional function to run at the beginning of policy init that takes the same arguments as the policy constructor

  • before_loss_init (func) – optional function to run prior to loss init that takes the same arguments as the policy constructor

  • after_init (func) – optional function to run at the end of policy init that takes the same arguments as the policy constructor

  • make_model (func) – optional function that returns a ModelV2 object given (policy, obs_space, action_space, config). All policy variables should be created in this function. If not specified, a default model will be created.

  • action_sampler_fn (Optional[callable]) – A callable returning a sampled action and its log-likelihood given some (obs and state) inputs.

  • action_distribution_fn (Optional[callable]) – A callable returning distribution inputs (parameters), a dist-class to generate an action distribution object from, and internal-state outputs (or an empty list if not applicable).

  • mixins (list) – list of any class mixins for the returned policy class. These mixins will be applied in order and will have higher precedence than the DynamicTFPolicy class

  • get_batch_divisibility_req (func) – optional function that returns the divisibility requirement for sample batches

  • obs_include_prev_action_reward (bool) – whether to include the previous action and reward in the model input

Returns

a DynamicTFPolicy instance that uses the specified args
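
A minimal illustrative use of build_tf_policy, following the same pattern (TF1-style static-graph execution is assumed; the loss is again only a placeholder):

import tensorflow as tf

from ray.rllib.evaluation import SampleBatch
from ray.rllib.policy import build_tf_policy


def my_tf_loss_fn(policy, model, dist_class, train_batch):
    # Placeholder policy-gradient loss: -E[logp(a) * r].
    logits, _ = model.from_batch(train_batch)
    action_dist = dist_class(logits, model)
    log_probs = action_dist.logp(train_batch[SampleBatch.ACTIONS])
    return -tf.reduce_mean(log_probs * train_batch[SampleBatch.REWARDS])


MyTFPolicy = build_tf_policy(name="MyTFPolicy", loss_fn=my_tf_loss_fn)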

ray.rllib.env

class ray.rllib.env.BaseEnv[source]

The lowest-level env interface used by RLlib for sampling.

BaseEnv models multiple agents executing asynchronously in multiple environments. A call to poll() returns observations from ready agents keyed by their environment and agent ids, and actions for those agents can be sent back via send_actions().

All other env types can be adapted to BaseEnv. RLlib handles these conversions internally in RolloutWorker, for example:

gym.Env => rllib.VectorEnv => rllib.BaseEnv
rllib.MultiAgentEnv => rllib.BaseEnv
rllib.ExternalEnv => rllib.BaseEnv

action_space

Action space. This must be defined for single-agent envs. Multi-agent envs can set this to None.

Type

gym.Space

observation_space

Observation space. This must be defined for single-agent envs. Multi-agent envs can set this to None.

Type

gym.Space

Examples

>>> env = MyBaseEnv()
>>> obs, rewards, dones, infos, off_policy_actions = env.poll()
>>> print(obs)
{
    "env_0": {
        "car_0": [2.4, 1.6],
        "car_1": [3.4, -3.2],
    },
    "env_1": {
        "car_0": [8.0, 4.1],
    },
    "env_2": {
        "car_0": [2.3, 3.3],
        "car_1": [1.4, -0.2],
        "car_3": [1.2, 0.1],
    },
}
>>> env.send_actions(
    actions={
        "env_0": {
            "car_0": 0,
            "car_1": 1,
        }, ...
    })
>>> obs, rewards, dones, infos, off_policy_actions = env.poll()
>>> print(obs)
{
    "env_0": {
        "car_0": [4.1, 1.7],
        "car_1": [3.2, -4.2],
    }, ...
}
>>> print(dones)
{
    "env_0": {
        "__all__": False,
        "car_0": False,
        "car_1": True,
    }, ...
}
static to_base_env(env, make_env=None, num_envs=1, remote_envs=False, remote_env_batch_wait_ms=0)[source]

Wraps any env type as needed to expose the async interface.

poll()[source]

Returns observations from ready agents.

The returns are two-level dicts mapping from env_id to a dict of agent_id to values. The number of agents and envs can vary over time.

Returns

  • obs (dict) – New observations for each ready agent.

  • rewards (dict) – Reward values for each ready agent. If the episode has just started, the value will be None.

  • dones (dict) – Done values for each ready agent. The special key "__all__" is used to indicate env termination.

  • infos (dict) – Info values for each ready agent.

  • off_policy_actions (dict) – Agents may take off-policy actions. When that happens, there will be an entry in this dict that contains the taken action. There is no need to send_actions() for agents that have already chosen off-policy actions.

send_actions(action_dict)[source]

Called to send actions back to running agents in this env.

Actions should be sent for each ready agent that returned observations in the previous poll() call.

Parameters

action_dict (dict) – Actions values keyed by env_id and agent_id.

try_reset(env_id)[source]

Attempt to reset the env with the given id.

If the environment does not support synchronous reset, None can be returned here.

Returns

Reset observation, or None if not supported.

Return type

obs (dict|None)

get_unwrapped()[source]

Return a reference to the underlying gym envs, if any.

Returns

Underlying gym envs or [].

Return type

envs (list)

stop()[source]

Releases all resources used.

class ray.rllib.env.MultiAgentEnv[source]

An environment that hosts multiple independent agents.

Agents are identified by (string) agent ids. Note that these “agents” here are not to be confused with RLlib agents.

Examples

>>> env = MyMultiAgentEnv()
>>> obs = env.reset()
>>> print(obs)
{
    "car_0": [2.4, 1.6],
    "car_1": [3.4, -3.2],
    "traffic_light_1": [0, 3, 5, 1],
}
>>> obs, rewards, dones, infos = env.step(
    action_dict={
        "car_0": 1, "car_1": 0, "traffic_light_1": 2,
    })
>>> print(rewards)
{
    "car_0": 3,
    "car_1": -1,
    "traffic_light_1": 0,
}
>>> print(dones)
{
    "car_0": False,    # car_0 is still running
    "car_1": True,     # car_1 is done
    "__all__": False,  # the env is not done
}
>>> print(infos)
{
    "car_0": {},  # info for car_0
    "car_1": {},  # info for car_1
}
reset()[source]

Resets the env and returns observations from ready agents.

Returns

New observations for each ready agent.

Return type

obs (dict)

step(action_dict)[source]

Returns observations from ready agents.

The returns are dicts mapping from agent_id strings to values. The number of agents in the env can vary over time.

Returns

  • obs (dict) – New observations for each ready agent.

  • rewards (dict) – Reward values for each ready agent. If the episode has just started, the value will be None.

  • dones (dict) – Done values for each ready agent. The special key "__all__" (required) is used to indicate env termination.

  • infos (dict) – Optional info values for each agent id.

with_agent_groups(groups, obs_space=None, act_space=None)[source]

Convenience method for grouping together agents in this env.

An agent group is a list of agent ids that are mapped to a single logical agent. All agents of the group must act at the same time in the environment. The grouped agent exposes Tuple action and observation spaces that are the concatenated action and obs spaces of the individual agents.

The rewards of all the agents in a group are summed. The individual agent rewards are available under the “individual_rewards” key of the group info return.

Agent grouping is required to leverage algorithms such as Q-Mix.

This API is experimental.

Parameters
  • groups (dict) – Mapping from group id to a list of the agent ids of group members. If an agent id is not present in any group value, it will be left ungrouped.

  • obs_space (Space) – Optional observation space for the grouped env. Must be a tuple space.

  • act_space (Space) – Optional action space for the grouped env. Must be a tuple space.

Examples

>>> env = YourMultiAgentEnv(...)
>>> grouped_env = env.with_agent_groups({
...   "group1": ["agent1", "agent2", "agent3"],
...   "group2": ["agent4", "agent5"],
... })
class ray.rllib.env.ExternalEnv(action_space, observation_space, max_concurrent=100)[source]

An environment that interfaces with external agents.

Unlike simulator envs, control is inverted. The environment queries the policy to obtain actions and logs observations and rewards for training. This is in contrast to gym.Env, where the algorithm drives the simulation through env.step() calls.

You can use ExternalEnv as the backend for policy serving (by serving HTTP requests in the run loop), for ingesting offline logs data (by reading offline transitions in the run loop), or other custom use cases not easily expressed through gym.Env.

ExternalEnv supports both on-policy actions (through self.get_action()), and off-policy actions (through self.log_action()).

This env is thread-safe, but individual episodes must be executed serially.

action_space

Action space.

Type

gym.Space

observation_space

Observation space.

Type

gym.Space

Examples

>>> register_env("my_env", lambda config: YourExternalEnv(config))
>>> trainer = DQNTrainer(env="my_env")
>>> while True:
      print(trainer.train())
run()[source]

Override this to implement the run loop.

Your loop should continuously:
  1. Call self.start_episode(episode_id)

  2. Call self.get_action(episode_id, obs)

    -or- self.log_action(episode_id, obs, action)

  3. Call self.log_returns(episode_id, reward)

  4. Call self.end_episode(episode_id, obs)

  5. Wait if nothing to do.

Multiple episodes may be started at the same time.
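
An illustrative run loop that drives a local gym env and serves its own observations (CartPole is just an example choice):

import gym

from ray.rllib.env import ExternalEnv


class CartPoleExternalEnv(ExternalEnv):
    def __init__(self):
        self._env = gym.make("CartPole-v0")
        super().__init__(self._env.action_space, self._env.observation_space)

    def run(self):
        while True:
            episode_id = self.start_episode()
            obs = self._env.reset()
            done = False
            while not done:
                action = self.get_action(episode_id, obs)
                obs, reward, done, info = self._env.step(action)
                self.log_returns(episode_id, reward, info=info)
            self.end_episode(episode_id, obs)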

start_episode(episode_id=None, training_enabled=True)[source]

Record the start of an episode.

Parameters
  • episode_id (str) – Unique string id for the episode or None for it to be auto-assigned.

  • training_enabled (bool) – Whether to use experiences for this episode to improve the policy.

Returns

Unique string id for the episode.

Return type

episode_id (str)

get_action(episode_id, observation)[source]

Record an observation and get the on-policy action.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • observation (obj) – Current environment observation.

Returns

Action from the env action space.

Return type

action (obj)

log_action(episode_id, observation, action)[source]

Record an observation and (off-policy) action taken.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • observation (obj) – Current environment observation.

  • action (obj) – Action for the observation.

log_returns(episode_id, reward, info=None)[source]

Record returns from the environment.

The reward will be attributed to the previous action taken by the episode. Rewards accumulate until the next action. If no reward is logged before the next action, a reward of 0.0 is assumed.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • reward (float) – Reward from the environment.

  • info (dict) – Optional info dict.

end_episode(episode_id, observation)[source]

Record the end of an episode.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • observation (obj) – Current environment observation.

class ray.rllib.env.ExternalMultiAgentEnv(action_space, observation_space, max_concurrent=100)[source]

This is the multi-agent version of ExternalEnv.

run()[source]

Override this to implement the multi-agent run loop.

Your loop should continuously:
  1. Call self.start_episode(episode_id)

  2. Call self.get_action(episode_id, obs_dict)

    -or- self.log_action(episode_id, obs_dict, action_dict)

  3. Call self.log_returns(episode_id, reward_dict)

  4. Call self.end_episode(episode_id, obs_dict)

  5. Wait if nothing to do.

Multiple episodes may be started at the same time.

start_episode(episode_id=None, training_enabled=True)[source]

Record the start of an episode.

Parameters
  • episode_id (str) – Unique string id for the episode or None for it to be auto-assigned.

  • training_enabled (bool) – Whether to use experiences for this episode to improve the policy.

Returns

Unique string id for the episode.

Return type

episode_id (str)

get_action(episode_id, observation_dict)[source]

Record an observation and get the on-policy action. observation_dict is expected to contain the observations of all agents acting in this episode step.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • observation_dict (dict) – Current environment observation.

Returns

Action from the env action space.

Return type

action (dict)

log_action(episode_id, observation_dict, action_dict)[source]

Record an observation and (off-policy) action taken.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • observation_dict (dict) – Current environment observation.

  • action_dict (dict) – Action for the observation.

log_returns(episode_id, reward_dict, info_dict=None)[source]

Record returns from the environment.

The reward will be attributed to the previous action taken by the episode. Rewards accumulate until the next action. If no reward is logged before the next action, a reward of 0.0 is assumed.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • reward_dict (dict) – Reward from the environment agents.

  • info_dict (dict) – Optional info dict.

end_episode(episode_id, observation_dict)[source]

Record the end of an episode.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • observation_dict (dict) – Current environment observation.

class ray.rllib.env.VectorEnv[source]

An environment that supports batch evaluation.

Subclasses must define the following attributes:

action_space

Action space of individual envs.

Type

gym.Space

observation_space

Observation space of individual envs.

Type

gym.Space

num_envs

Number of envs in this vector env.

Type

int

vector_reset()[source]

Resets all environments.

Returns

Vector of observations from each environment.

Return type

obs (list)

reset_at(index)[source]

Resets a single environment.

Returns

Observations from the reset environment.

Return type

obs (obj)

vector_step(actions)[source]

Vectorized step.

Parameters

actions (list) – Actions for each env.

Returns

obs (list): New observations for each env.

rewards (list): Reward values for each env.

dones (list): Done values for each env.

infos (list): Info values for each env.

get_unwrapped()[source]

Returns the underlying env instances.
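
An illustrative VectorEnv subclass wrapping N copies of a gym env (CartPole chosen arbitrarily):

import gym

from ray.rllib.env import VectorEnv


class MyVectorEnv(VectorEnv):
    def __init__(self, num_envs=4):
        self.envs = [gym.make("CartPole-v0") for _ in range(num_envs)]
        self.action_space = self.envs[0].action_space
        self.observation_space = self.envs[0].observation_space
        self.num_envs = num_envs

    def vector_reset(self):
        return [env.reset() for env in self.envs]

    def reset_at(self, index):
        return self.envs[index].reset()

    def vector_step(self, actions):
        obs, rewards, dones, infos = [], [], [], []
        for env, action in zip(self.envs, actions):
            o, r, d, i = env.step(action)
            obs.append(o)
            rewards.append(r)
            dones.append(d)
            infos.append(i)
        return obs, rewards, dones, infos

    def get_unwrapped(self):
        return self.envs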

class ray.rllib.env.EnvContext(env_config, worker_index, vector_index=0, remote=False)[source]

Wraps env configurations to include extra rllib metadata.

These attributes can be used to parameterize environments per process. For example, one might use worker_index to control which data file an environment reads in on initialization.

RLlib auto-sets these attributes when constructing registered envs.

worker_index

When there are multiple workers created, this uniquely identifies the worker the env is created in.

Type

int

vector_index

When there are multiple envs per worker, this uniquely identifies the env index within the worker.

Type

int

remote

Whether environment should be remote or not.

Type

bool
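
For example, a registered env constructor receives an EnvContext as its config and can read these fields directly (the file-naming scheme below is purely illustrative):

import gym
from gym.spaces import Discrete


class ShardedDataEnv(gym.Env):
    def __init__(self, env_config):
        # env_config is an EnvContext: it behaves like the user config dict
        # and additionally exposes .worker_index and .vector_index.
        self.data_file = "shard_{}_{}.csv".format(
            env_config.worker_index, env_config.vector_index)
        self.action_space = Discrete(2)
        self.observation_space = Discrete(2)

    def reset(self):
        return 0

    def step(self, action):
        return 0, 0.0, True, {}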

class ray.rllib.env.PolicyClient(address, inference_mode='local', update_interval=10.0)[source]

REST client to interact with a RLlib policy server.

start_episode(episode_id=None, training_enabled=True)[source]

Record the start of an episode.

Parameters
  • episode_id (str) – Unique string id for the episode or None for it to be auto-assigned.

  • training_enabled (bool) – Whether to use experiences for this episode to improve the policy.

Returns

Unique string id for the episode.

Return type

episode_id (str)

get_action(episode_id, observation)[source]

Record an observation and get the on-policy action.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • observation (obj) – Current environment observation.

Returns

Action from the env action space.

Return type

action (obj)

log_action(episode_id, observation, action)[source]

Record an observation and (off-policy) action taken.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • observation (obj) – Current environment observation.

  • action (obj) – Action for the observation.

log_returns(episode_id, reward, info=None)[source]

Record returns from the environment.

The reward will be attributed to the previous action taken by the episode. Rewards accumulate until the next action. If no reward is logged before the next action, a reward of 0.0 is assumed.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • reward (float) – Reward from the environment.

end_episode(episode_id, observation)[source]

Record the end of an episode.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • observation (obj) – Current environment observation.
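
A typical client loop (the external simulator env here is illustrative) might look like:

>>> from ray.rllib.env import PolicyClient
>>> client = PolicyClient("localhost:9900", inference_mode="remote")
>>> episode_id = client.start_episode(training_enabled=True)
>>> obs = env.reset()  # env: some external simulator (illustrative)
>>> done = False
>>> while not done:
...     action = client.get_action(episode_id, obs)
...     obs, reward, done, info = env.step(action)
...     client.log_returns(episode_id, reward)
>>> client.end_episode(episode_id, obs)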

class ray.rllib.env.PolicyServerInput(ioctx, address, port)[source]

REST policy server that acts as an offline data source.

This launches a multi-threaded server that listens on the specified host and port to serve policy requests and forward experiences to RLlib. For high performance experience collection, it implements InputReader.

For an example, run examples/cartpole_server.py along with examples/cartpole_client.py --inference-mode=local|remote.

Examples

>>> pg = PGTrainer(
...     env="CartPole-v0", config={
...         "input": lambda ioctx:
...             PolicyServerInput(ioctx, addr, port),
...         "num_workers": 0,  # Run just 1 server, in the trainer.
...     })
>>> while True:
        pg.train()
>>> client = PolicyClient("localhost:9900", inference_mode="local")
>>> eps_id = client.start_episode()
>>> action = client.get_action(eps_id, obs)
>>> ...
>>> client.log_returns(eps_id, reward)
>>> ...
>>> client.log_returns(eps_id, reward)
next()[source]

Return the next batch of experiences read.

Returns

SampleBatch or MultiAgentBatch read.

ray.rllib.evaluation

class ray.rllib.evaluation.MultiAgentEpisode(policies, policy_mapping_fn, batch_builder_factory, extra_batch_callback)[source]

Tracks the current state of a (possibly multi-agent) episode.

new_batch_builder

Create a new MultiAgentSampleBatchBuilder.

Type

func

add_extra_batch

Return a built MultiAgentBatch to the sampler.

Type

func

batch_builder

Batch builder for the current episode.

Type

obj

total_reward

Summed reward across all agents in this episode.

Type

float

length

Length of this episode.

Type

int

episode_id

Unique id identifying this trajectory.

Type

int

agent_rewards

Summed rewards broken down by agent.

Type

dict

custom_metrics

Dict where you can add custom metrics.

Type

dict

user_data

Dict that you can use for temporary storage.

Type

dict

Use case 1: Model-based rollouts in multi-agent:

A custom compute_actions() function in a policy can inspect the current episode state and perform a number of rollouts based on the policies and state of other agents in the environment.

Use case 2: Returning extra rollouts data.

The model rollouts can be returned back to the sampler by calling:

>>> batch = episode.new_batch_builder()
>>> for each transition:
       batch.add_values(...)  # see sampler for usage
>>> episode.extra_batches.add(batch.build_and_reset())
soft_reset()[source]

Clears rewards and metrics, but retains RNN and other state.

This is used to carry state across multiple logical episodes in the same env (i.e., if soft_horizon is set).

policy_for(agent_id='agent0')[source]

Returns the policy for the specified agent.

If the agent is new, the policy mapping fn will be called to bind the agent to a policy for the duration of the episode.

last_observation_for(agent_id='agent0')[source]

Returns the last observation for the specified agent.

last_raw_obs_for(agent_id='agent0')[source]

Returns the last un-preprocessed obs for the specified agent.

last_info_for(agent_id='agent0')[source]

Returns the last info for the specified agent.

last_action_for(agent_id='agent0')[source]

Returns the last action for the specified agent, or zeros.

prev_action_for(agent_id='agent0')[source]

Returns the previous action for the specified agent.

prev_reward_for(agent_id='agent0')[source]

Returns the previous reward for the specified agent.

rnn_state_for(agent_id='agent0')[source]

Returns the last RNN state for the specified agent.

last_pi_info_for(agent_id='agent0')[source]

Returns the last info object for the specified agent.

class ray.rllib.evaluation.RolloutWorker(env_creator, policy, policy_mapping_fn=None, policies_to_train=None, tf_session_creator=None, rollout_fragment_length=100, batch_mode='truncate_episodes', episode_horizon=None, preprocessor_pref='deepmind', sample_async=False, compress_observations=False, num_envs=1, observation_filter='NoFilter', clip_rewards=None, clip_actions=True, env_config=None, model_config=None, policy_config=None, worker_index=0, num_workers=0, monitor_path=None, log_dir=None, log_level=None, callbacks=None, input_creator=<function RolloutWorker.<lambda>>, input_evaluation=frozenset({}), output_creator=<function RolloutWorker.<lambda>>, remote_worker_envs=False, remote_env_batch_wait_ms=0, soft_horizon=False, no_done_at_end=False, seed=None, _fake_sampler=False, extra_python_environs=None)[source]

Common experience collection class.

This class wraps a policy instance and an environment class to collect experiences from the environment. You can create many replicas of this class as Ray actors to scale RL training.

This class supports vectorized and multi-agent policy evaluation (e.g., VectorEnv, MultiAgentEnv, etc.)

Examples

>>> # Create a rollout worker and use it to collect experiences.
>>> worker = RolloutWorker(
...   env_creator=lambda _: gym.make("CartPole-v0"),
...   policy=PGTFPolicy)
>>> print(worker.sample())
SampleBatch({
    "obs": [[...]], "actions": [[...]], "rewards": [[...]],
    "dones": [[...]], "new_obs": [[...]]})
>>> # Creating a multi-agent rollout worker
>>> worker = RolloutWorker(
...   env_creator=lambda _: MultiAgentTrafficGrid(num_cars=25),
...   policy={
...       # Use an ensemble of two policies for car agents
...       "car_policy1":
...         (PGTFPolicy, Box(...), Discrete(...), {"gamma": 0.99}),
...       "car_policy2":
...         (PGTFPolicy, Box(...), Discrete(...), {"gamma": 0.95}),
...       # Use a single shared policy for all traffic lights
...       "traffic_light_policy":
...         (PGTFPolicy, Box(...), Discrete(...), {}),
...   },
...   policy_mapping_fn=lambda agent_id:
...     random.choice(["car_policy1", "car_policy2"])
...     if agent_id.startswith("car_") else "traffic_light_policy")
>>> print(worker.sample())
MultiAgentBatch({
    "car_policy1": SampleBatch(...),
    "car_policy2": SampleBatch(...),
    "traffic_light_policy": SampleBatch(...)})
sample()[source]

Evaluate the current policies and return a batch of experiences.

Returns

SampleBatch|MultiAgentBatch from evaluating the current policies.

sample_with_count()[source]

Same as sample() but returns the count as a separate future.

get_weights(policies=None)[source]

Returns the model weights of this Evaluator.

This method must be implemented by subclasses.

Returns

weights that can be set on a compatible evaluator. info: dictionary of extra metadata.

Return type

object

Examples

>>> weights = ev1.get_weights()
set_weights(weights, global_vars=None)[source]

Sets the model weights of this Evaluator.

This method must be implemented by subclasses.

Examples

>>> weights = ev1.get_weights()
>>> ev2.set_weights(weights)
compute_gradients(samples)[source]

Returns a gradient computed w.r.t the specified samples.

Either this or learn_on_batch() must be implemented by subclasses.

Returns

A list of gradients that can be applied on a compatible evaluator. In the multi-agent case, returns a dict of gradients keyed by policy ids. An info dictionary of extra metadata is also returned.

Return type

(grads, info)

Examples

>>> samples = ev.sample()
>>> grads, info = ev.compute_gradients(samples)
apply_gradients(grads)[source]

Applies the given gradients to this evaluator’s weights.

Either this or learn_on_batch() must be implemented by subclasses.

Examples

>>> samples = ev1.sample()
>>> grads, info = ev2.compute_gradients(samples)
>>> ev1.apply_gradients(grads)
learn_on_batch(samples)[source]

Update policies based on the given batch.

This is the equivalent to apply_gradients(compute_gradients(samples)), but can be optimized to avoid pulling gradients into CPU memory.

Either this or the combination of compute/apply grads must be implemented by subclasses.

Returns

dictionary of extra metadata from compute_gradients().

Return type

info

Examples

>>> samples = ev.sample()
>>> ev.learn_on_batch(samples)
sample_and_learn(expected_batch_size, num_sgd_iter, sgd_minibatch_size, standardize_fields)[source]

Sample a batch and learn on it.

This is typically used in combination with distributed allreduce.

Parameters
  • expected_batch_size (int) – Expected number of samples to learn on.

  • num_sgd_iter (int) – Number of SGD iterations.

  • sgd_minibatch_size (int) – SGD minibatch size.

  • standardize_fields (list) – List of sample fields to normalize.

Returns

info (dict): Dictionary of extra metadata from learn_on_batch().

count (int): Number of samples learned on.

get_metrics()[source]

Returns a list of new RolloutMetric objects from evaluation.

foreach_env(func)[source]

Apply the given function to each underlying env instance.

get_policy(policy_id='default_policy')[source]

Return policy for the specified id, or None.

Parameters

policy_id (str) – id of policy to return.

for_policy(func, policy_id='default_policy')[source]

Apply the given function to the specified policy.

foreach_policy(func)[source]

Apply the given function to each (policy, policy_id) tuple.

foreach_trainable_policy(func)[source]

Applies the given function to each (policy, policy_id) tuple, which can be found in self.policies_to_train.

Parameters

func (callable) – A function - taking a Policy and its ID - that is called on all Policies within self.policies_to_train.

Returns

The list of n return values of all func([policy], [ID]) calls.

Return type

List[any]
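
For example, assuming worker is a RolloutWorker built elsewhere:

>>> # Gather (policy_id, weights) pairs for all trainable policies.
>>> trainable_weights = worker.foreach_trainable_policy(
...     lambda policy, pid: (pid, policy.get_weights()))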

sync_filters(new_filters)[source]

Changes self’s filter to given and rebases any accumulated delta.

Parameters

new_filters (dict) – Filters with new state to update local copy.

get_filters(flush_after=False)[source]

Returns a snapshot of filters.

Parameters

flush_after (bool) – Clears the filter buffer state.

Returns

Dict for serializable filters

Return type

return_filters (dict)

creation_args()[source]

Returns the args used to create this worker.

setup_torch_data_parallel(url, world_rank, world_size, backend)[source]

Join a torch process group for distributed SGD.

get_node_ip()[source]

Returns the IP address of the current node.

find_free_port()[source]

Finds a free port on the current node.

ray.rllib.evaluation.PolicyEvaluator

alias of ray.rllib.utils.deprecation.renamed_class.<locals>.DeprecationWrapper

class ray.rllib.evaluation.EvaluatorInterface[source]

This is the interface between policy optimizers and policy evaluation.

See also: RolloutWorker

sample()[source]

Returns a batch of experience sampled from this evaluator.

This method must be implemented by subclasses.

Returns

A columnar batch of experiences (e.g., tensors), or a multi-agent batch.

Return type

SampleBatch|MultiAgentBatch

Examples

>>> print(ev.sample())
SampleBatch({"obs": [1, 2, 3], "action": [0, 1, 0], ...})
learn_on_batch(samples)[source]

Update policies based on the given batch.

This is the equivalent to apply_gradients(compute_gradients(samples)), but can be optimized to avoid pulling gradients into CPU memory.

Either this or the combination of compute/apply grads must be implemented by subclasses.

Returns

dictionary of extra metadata from compute_gradients().

Return type

info

Examples

>>> samples = ev.sample()
>>> ev.learn_on_batch(samples)
compute_gradients(samples)[source]

Returns a gradient computed w.r.t the specified samples.

Either this or learn_on_batch() must be implemented by subclasses.

Returns

A list of gradients that can be applied on a compatible evaluator. In the multi-agent case, returns a dict of gradients keyed by policy ids. An info dictionary of extra metadata is also returned.

Return type

(grads, info)

Examples

>>> samples = ev.sample()
>>> grads, info = ev.compute_gradients(samples)
apply_gradients(grads)[source]

Applies the given gradients to this evaluator’s weights.

Either this or learn_on_batch() must be implemented by subclasses.

Examples

>>> samples = ev1.sample()
>>> grads, info = ev2.compute_gradients(samples)
>>> ev1.apply_gradients(grads)
get_weights()[source]

Returns the model weights of this Evaluator.

This method must be implemented by subclasses.

Returns

weights that can be set on a compatible evaluator. info: dictionary of extra metadata.

Return type

object

Examples

>>> weights = ev1.get_weights()
set_weights(weights)[source]

Sets the model weights of this Evaluator.

This method must be implemented by subclasses.

Examples

>>> weights = ev1.get_weights()
>>> ev2.set_weights(weights)
get_host()[source]

Returns the hostname of the process running this evaluator.

apply(func, *args)[source]

Apply the given function to this evaluator instance.

ray.rllib.evaluation.PolicyGraph

alias of ray.rllib.utils.deprecation.renamed_class.<locals>.DeprecationWrapper

ray.rllib.evaluation.TFPolicyGraph

alias of ray.rllib.utils.deprecation.renamed_class.<locals>.DeprecationWrapper

ray.rllib.evaluation.TorchPolicyGraph

alias of ray.rllib.utils.deprecation.renamed_class.<locals>.DeprecationWrapper

class ray.rllib.evaluation.MultiAgentBatch(*args, **kw)
class ray.rllib.evaluation.SampleBatchBuilder[source]

Util to build a SampleBatch incrementally.

For efficiency, SampleBatches hold values in column form (as arrays). However, it is useful to add data one row (dict) at a time.

add_values(**values)[source]

Add the given dictionary (row) of values to this batch.

add_batch(batch)[source]

Add the given batch of values to this batch.

build_and_reset()[source]

Returns a sample batch including all previously added values.
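
For example (the field names here are arbitrary):

>>> from ray.rllib.evaluation import SampleBatchBuilder
>>> builder = SampleBatchBuilder()
>>> builder.add_values(obs=1, action=0, reward=0.5)   # row 1
>>> builder.add_values(obs=2, action=1, reward=-0.5)  # row 2
>>> batch = builder.build_and_reset()  # SampleBatch with 2 rows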

class ray.rllib.evaluation.MultiAgentSampleBatchBuilder(policy_map, clip_rewards, callbacks)[source]

Util to build SampleBatches for each policy in a multi-agent env.

Input data is per-agent, while output data is per-policy. There is an M:N mapping between agents and policies. We retain one local batch builder per agent. When an agent is done, its local batch is appended to the corresponding policy batch for the agent's policy.

total()[source]

Returns summed number of steps across all agent buffers.

has_pending_agent_data()[source]

Returns whether there is pending unprocessed data.

add_values(agent_id, policy_id, **values)[source]

Add the given dictionary (row) of values to this batch.

Parameters
  • agent_id (obj) – Unique id for the agent we are adding values for.

  • policy_id (obj) – Unique id for policy controlling the agent.

  • values (dict) – Row of values to add for this agent.

postprocess_batch_so_far(episode)[source]

Apply policy postprocessors to any unprocessed rows.

This pushes the postprocessed per-agent batches onto the per-policy builders, clearing per-agent state.

Parameters

episode – current MultiAgentEpisode object or None

build_and_reset(episode)[source]

Returns the accumulated sample batches for each policy.

Any unprocessed rows will be first postprocessed with a policy postprocessor. The internal state of this builder will be reset.

Parameters

episode – current MultiAgentEpisode object or None

class ray.rllib.evaluation.SyncSampler(worker, env, policies, policy_mapping_fn, preprocessors, obs_filters, clip_rewards, rollout_fragment_length, callbacks, horizon=None, pack=False, tf_sess=None, clip_actions=True, soft_horizon=False, no_done_at_end=False)[source]
class ray.rllib.evaluation.AsyncSampler(worker, env, policies, policy_mapping_fn, preprocessors, obs_filters, clip_rewards, rollout_fragment_length, callbacks, horizon=None, pack=False, tf_sess=None, clip_actions=True, blackhole_outputs=False, soft_horizon=False, no_done_at_end=False)[source]
run()[source]

Method representing the thread’s activity.

You may override this method in a subclass. The standard run() method invokes the callable object passed to the object’s constructor as the target argument, if any, with sequential and keyword arguments taken from the args and kwargs arguments, respectively.

ray.rllib.evaluation.compute_advantages(rollout, last_r, gamma=0.9, lambda_=1.0, use_gae=True, use_critic=True)[source]

Given a rollout, compute its value targets and the advantage.

Parameters
  • rollout (SampleBatch) – SampleBatch of a single trajectory

  • last_r (float) – Value estimation for last observation

  • gamma (float) – Discount factor.

  • lambda_ (float) – Parameter for GAE

  • use_gae (bool) – Using Generalized Advantage Estimation

  • use_critic (bool) – Whether to use critic (value estimates). Setting this to False will use 0 as baseline.

Returns

Object with experience from rollout and processed rewards.

Return type

SampleBatch (SampleBatch)
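
A hedged usage sketch (the exact set of columns required by the GAE path, e.g. “vf_preds”, may vary across RLlib versions):

>>> import numpy as np
>>> from ray.rllib.evaluation import SampleBatch, compute_advantages
>>> rollout = SampleBatch({
...     "obs": np.array([0, 1, 2]),
...     "actions": np.array([0, 1, 0]),
...     "rewards": np.array([1.0, 1.0, 1.0], dtype=np.float32),
...     "dones": np.array([False, False, True]),
...     "vf_preds": np.array([0.5, 0.5, 0.5], dtype=np.float32)})
>>> out = compute_advantages(rollout, last_r=0.0, gamma=0.99, lambda_=0.95)
>>> # "out" should now also carry "advantages" and "value_targets" columns.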

ray.rllib.evaluation.collect_metrics(local_worker=None, remote_workers=[], to_be_collected=[], timeout_seconds=180)[source]

Gathers episode metrics from RolloutWorker instances.

class ray.rllib.evaluation.SampleBatch(*args, **kwargs)[source]

Wrapper around a dictionary with string keys and array-like values.

For example, {“obs”: [1, 2, 3], “reward”: [0, -1, 1]} is a batch of three samples, each with an “obs” and “reward” attribute.

concat(other)[source]

Returns a new SampleBatch with each data column concatenated.

Examples

>>> b1 = SampleBatch({"a": [1, 2]})
>>> b2 = SampleBatch({"a": [3, 4, 5]})
>>> print(b1.concat(b2))
{"a": [1, 2, 3, 4, 5]}
rows()[source]

Returns an iterator over data rows, i.e. dicts with column values.

Examples

>>> batch = SampleBatch({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> for row in batch.rows():
       print(row)
{"a": 1, "b": 4}
{"a": 2, "b": 5}
{"a": 3, "b": 6}
columns(keys)[source]

Returns a list of just the specified columns.

Examples

>>> batch = SampleBatch({"a": [1], "b": [2], "c": [3]})
>>> print(batch.columns(["a", "b"]))
[[1], [2]]
shuffle()[source]

Shuffles the rows of this batch in-place.

split_by_episode()[source]

Splits this batch’s data by eps_id.

Returns

list of SampleBatch, one per distinct episode.

slice(start, end)[source]

Returns a slice of the row data of this batch.

Parameters
  • start (int) – Starting index.

  • end (int) – Ending index.

Returns

SampleBatch which has a slice of this batch’s data.
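
For example (mirroring the concat() example above; the printed form is illustrative):

>>> batch = SampleBatch({"a": [1, 2, 3, 4]})
>>> print(batch.slice(1, 3))
{"a": [2, 3]}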

ray.rllib.models

class ray.rllib.models.ActionDistribution(inputs, model)[source]

The policy action distribution of an agent.

inputs

input vector to compute samples from.

Type

Tensors

model

reference to model producing the inputs.

Type

ModelV2

sample()[source]

Draw a sample from the action distribution.

deterministic_sample()[source]

Get the deterministic “sampling” output from the distribution. This is usually the maximum-likelihood output, i.e., mean for Normal, argmax for Categorical, etc.

sampled_action_logp()[source]

Returns the log probability of the last sampled action.

logp(x)[source]

The log-likelihood of the action distribution.

kl(other)[source]

The KL-divergence between two action distributions.

entropy()[source]

The entropy of the action distribution.

multi_kl(other)[source]

The KL-divergence between two action distributions.

This differs from kl() in that it can return an array for MultiDiscrete. TODO(ekl) consider removing this.

multi_entropy()[source]

The entropy of the action distribution.

This differs from entropy() in that it can return an array for MultiDiscrete. TODO(ekl) consider removing this.

static required_model_output_shape(action_space, model_config)[source]

Returns the required shape of an input parameter tensor for a particular action space and an optional dict of distribution-specific options.

Parameters
  • action_space (gym.Space) – The action space this distribution will be used for, whose shape attributes will be used to determine the required shape of the input parameter tensor.

  • model_config (dict) – Model’s config dict (as defined in catalog.py)

Returns

Size of the required input vector (minus leading batch dimension).

Return type

model_output_shape (int or np.ndarray of ints)

class ray.rllib.models.ModelCatalog[source]

Registry of models, preprocessors, and action distributions for envs.

Examples

>>> prep = ModelCatalog.get_preprocessor(env)
>>> observation = prep.transform(raw_observation)
>>> dist_class, dist_dim = ModelCatalog.get_action_dist(
        env.action_space, {})
>>> model = ModelCatalog.get_model(inputs, dist_dim, options)
>>> dist = dist_class(model.outputs, model)
>>> action = dist.sample()
static get_action_dist(action_space, config, dist_type=None, framework='tf', **kwargs)[source]

Returns a distribution class and size for the given action space.

Parameters
  • action_space (Space) – Action space of the target gym env.

  • config (Optional[dict]) – Optional model config.

  • dist_type (Optional[str]) – Identifier of the action distribution.

  • framework (str) – One of “tf” or “torch”.

  • kwargs (dict) – Optional kwargs to pass on to the Distribution’s constructor.

Returns

Python class of the distribution. dist_dim (int): The size of the input vector to the distribution.

Return type

dist_class (ActionDistribution)

static get_action_shape(action_space)[source]

Returns action tensor dtype and shape for the action space.

Parameters

action_space (Space) – Action space of the target gym env.

Returns

Dtype and shape of the actions tensor.

Return type

(dtype, shape)

static get_action_placeholder(action_space, name=None)[source]

Returns an action placeholder consistent with the action space

Parameters

action_space (Space) – Action space of the target gym env.

Returns

A placeholder for the actions

Return type

action_placeholder (Tensor)

static get_model_v2(obs_space, action_space, num_outputs, model_config, framework='tf', name='default_model', model_interface=None, default_model=None, **model_kwargs)[source]

Returns a suitable model compatible with given spaces and output.

Parameters
  • obs_space (Space) – Observation space of the target gym env. This may have an original_space attribute that specifies how to unflatten the tensor into a ragged tensor.

  • action_space (Space) – Action space of the target gym env.

  • num_outputs (int) – The size of the output vector of the model.

  • framework (str) – One of “tf” or “torch”.

  • name (str) – Name (scope) for the model.

  • model_interface (cls) – Interface required for the model

  • default_model (cls) – Override the default class for the model. This only has an effect when not using a custom model

  • model_kwargs (dict) – args to pass to the ModelV2 constructor

Returns

Model to use for the policy.

Return type

model (ModelV2)

static get_preprocessor(env, options=None)[source]

Returns a suitable preprocessor for the given env.

This is a wrapper for get_preprocessor_for_space().

static get_preprocessor_for_space(observation_space, options=None)[source]

Returns a suitable preprocessor for the given observation space.

Parameters
  • observation_space (Space) – The input observation space.

  • options (dict) – Options to pass to the preprocessor.

Returns

Preprocessor for the observations.

Return type

preprocessor (Preprocessor)
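
A minimal sketch (assumes gym is installed; a small Box space maps to a pass-through preprocessor):

>>> import gym
>>> from ray.rllib.models import ModelCatalog
>>> space = gym.spaces.Box(low=-1.0, high=1.0, shape=(4,))
>>> prep = ModelCatalog.get_preprocessor_for_space(space)
>>> prep.transform(space.sample()).shape
(4,)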

static register_custom_preprocessor(preprocessor_name, preprocessor_class)[source]

Register a custom preprocessor class by name.

The preprocessor can later be used by specifying {“custom_preprocessor”: preprocessor_name} in the model config.

Parameters
  • preprocessor_name (str) – Name to register the preprocessor under.

  • preprocessor_class (type) – Python class of the preprocessor.

static register_custom_model(model_name, model_class)[source]

Register a custom model class by name.

The model can later be used by specifying {“custom_model”: model_name} in the model config.

Parameters
  • model_name (str) – Name to register the model under.

  • model_class (type) – Python class of the model.
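
A hedged sketch (MyModelClass is a hypothetical ModelV2 subclass you define yourself):

>>> from ray.rllib.models import ModelCatalog
>>> # MyModelClass: your (hypothetical) custom ModelV2 subclass.
>>> ModelCatalog.register_custom_model("my_model", MyModelClass)
>>> # Then reference it via the model config, e.g. inside a trainer config:
>>> config = {"model": {"custom_model": "my_model"}}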

static register_custom_action_dist(action_dist_name, action_dist_class)[source]

Register a custom action distribution class by name.

The action distribution can later be used by specifying {“custom_action_dist”: action_dist_name} in the model config.

Parameters
  • action_dist_name (str) – Name to register the action distribution under.

  • action_dist_class (type) – Python class of the action distribution.

static get_model(input_dict, obs_space, action_space, num_outputs, options, state_in=None, seq_lens=None)[source]

Deprecated: use get_model_v2() instead.

class ray.rllib.models.Model(input_dict, obs_space, action_space, num_outputs, options, state_in=None, seq_lens=None)[source]

This class is deprecated; please use TFModelV2 instead.

value_function()[source]

Builds the value function output.

This method can be overridden to customize the implementation of the value function (e.g., not sharing hidden layers).

Returns

Tensor of size [BATCH_SIZE] for the value function.

custom_loss(policy_loss, loss_inputs)[source]

Override to customize the loss function used to optimize this model.

This can be used to incorporate self-supervised losses (by defining a loss over existing input and output tensors of this model), and supervised losses (by defining losses over a variable-sharing copy of this model’s layers).

You can find a runnable example in examples/custom_loss.py.

Parameters
  • policy_loss (Tensor) – scalar policy loss from the policy.

  • loss_inputs (dict) – map of input placeholders for rollout data.

Returns

Scalar tensor for the customized loss for this model.
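
A hedged sketch of an override (illustrative only; the subclass name and the auxiliary L2 term are made up):

>>> from ray.rllib.models import Model
>>> from ray.rllib.utils import try_import_tf
>>> tf = try_import_tf()
>>> class MyModel(Model):  # hypothetical subclass (Model itself is deprecated)
...     def custom_loss(self, policy_loss, loss_inputs):
...         # Add a small self-supervised penalty on this model's outputs.
...         aux_loss = 0.01 * tf.reduce_mean(tf.square(self.outputs))
...         return policy_loss + aux_loss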

custom_stats()[source]

Override to return custom metrics from your model.

The stats will be reported as part of the learner stats, i.e.:

info:
    learner:
        model:
            key1: metric1
            key2: metric2

Returns

Dict of string keys to scalar tensors.

loss()[source]

Deprecated: use self.custom_loss().

class ray.rllib.models.Preprocessor(obs_space, options=None)[source]

Defines an abstract observation preprocessor function.

shape

Shape of the preprocessed output.

Type

obj

transform(observation)[source]

Returns the preprocessed observation.

write(observation, array, offset)[source]

Alternative to transform for more efficient flattening.

check_shape(observation)[source]

Checks the shape of the given observation.

class ray.rllib.models.FullyConnectedNetwork(input_dict, obs_space, action_space, num_outputs, options, state_in=None, seq_lens=None)[source]

Generic fully connected network.

class ray.rllib.models.VisionNetwork(input_dict, obs_space, action_space, num_outputs, options, state_in=None, seq_lens=None)[source]

Generic vision network.

ray.rllib.optimizers

class ray.rllib.optimizers.PolicyOptimizer(workers)[source]

Policy optimizers encapsulate distributed RL optimization strategies.

Policy optimizers serve as the “control plane” of algorithms.

For example, AsyncOptimizer is used for A3C, and LocalMultiGPUOptimizer is used for PPO. These optimizers are all pluggable, and it is possible to mix and match as needed.

config

The JSON configuration passed to this optimizer.

Type

dict

workers

The set of rollout workers to use.

Type

WorkerSet

num_steps_trained

Number of timesteps trained on so far.

Type

int

num_steps_sampled

Number of timesteps sampled so far.

Type

int

step()[source]

Takes a logical optimization step.

This should run for long enough to minimize call overheads (i.e., at least a couple seconds), but short enough to return control periodically to callers (i.e., at most a few tens of seconds).

Returns

Optional fetches from compute grads calls.

Return type

fetches (dict|None)

stats()[source]

Returns a dictionary of internal performance statistics.

save()[source]

Returns a serializable object representing the optimizer state.

restore(data)[source]

Restores optimizer state from the given data object.

stop()[source]

Release any resources used by this optimizer.

collect_metrics(timeout_seconds, min_history=100, selected_workers=None)[source]

Returns worker and optimizer stats.

Parameters
  • timeout_seconds (int) – Max wait time for a worker before dropping its results. This usually indicates a hung worker.

  • min_history (int) – Min history length to smooth results over.

  • selected_workers (list) – Override the list of remote workers to collect metrics from.

Returns

A training result dict from worker metrics with

info replaced with stats from self.

Return type

res (dict)

reset(remote_workers)[source]

Called to change the set of remote workers being used.

foreach_worker(func)[source]

Apply the given function to each worker instance.

foreach_worker_with_index(func)[source]

Apply the given function to each worker instance.

The index will be passed as the second arg to the given function.
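
A hedged sketch (opt is an existing PolicyOptimizer; get_host() is documented above on the evaluator interface):

>>> # opt: an existing PolicyOptimizer instance (assumed).
>>> hosts = opt.foreach_worker(lambda w: w.get_host())
>>> indexed = opt.foreach_worker_with_index(lambda w, i: (i, w.get_host()))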

class ray.rllib.optimizers.AsyncReplayOptimizer(workers, learning_starts=1000, buffer_size=10000, prioritized_replay=True, prioritized_replay_alpha=0.6, prioritized_replay_beta=0.4, prioritized_replay_eps=1e-06, train_batch_size=512, rollout_fragment_length=50, num_replay_buffer_shards=1, max_weight_sync_delay=400, debug=False, batch_replay=False)[source]

Main event loop of the Ape-X optimizer (async sampling with replay).

This class coordinates the data transfers between the learner thread, remote workers (Ape-X actors), and replay buffer actors.

This has two modes of operation:
  • normal replay: replays independent samples.

  • batch replay: simplified mode where entire sample batches are

    replayed. This supports RNNs, but not prioritization.

This optimizer requires that rollout workers return an additional “td_error” array in the info return of compute_gradients(). This error term will be used for sample prioritization.

step()[source]

Takes a logical optimization step.

This should run for long enough to minimize call overheads (i.e., at least a couple seconds), but short enough to return control periodically to callers (i.e., at most a few tens of seconds).

Returns

Optional fetches from compute grads calls.

Return type

fetches (dict|None)

stop()[source]

Release any resources used by this optimizer.

reset(remote_workers)[source]

Called to change the set of remote workers being used.

stats()[source]

Returns a dictionary of internal performance statistics.

class ray.rllib.optimizers.AsyncSamplesOptimizer(workers, train_batch_size=500, rollout_fragment_length=50, num_envs_per_worker=1, num_gpus=0, lr=0.0005, replay_buffer_num_slots=0, replay_proportion=0.0, num_data_loader_buffers=1, max_sample_requests_in_flight_per_worker=2, broadcast_interval=1, num_sgd_iter=1, minibatch_buffer_size=1, learner_queue_size=16, learner_queue_timeout=300, num_aggregation_workers=0, _fake_gpus=False)[source]

Main event loop of the IMPALA architecture.

This class coordinates the data transfers between the learner thread and remote workers (IMPALA actors).

step()[source]

Takes a logical optimization step.

This should run for long enough to minimize call overheads (i.e., at least a couple seconds), but short enough to return control periodically to callers (i.e., at most a few tens of seconds).

Returns

Optional fetches from compute grads calls.

Return type

fetches (dict|None)

stop()[source]

Release any resources used by this optimizer.

reset(remote_workers)[source]

Called to change the set of remote workers being used.

stats()[source]

Returns a dictionary of internal performance statistics.

class ray.rllib.optimizers.AsyncGradientsOptimizer(workers, grads_per_step=100)[source]

An asynchronous RL optimizer, e.g. for implementing A3C.

This optimizer asynchronously pulls and applies gradients from remote workers, sending updated weights back as needed. This pipelines the gradient computations on the remote workers.

step()[source]

Takes a logical optimization step.

This should run for long enough to minimize call overheads (i.e., at least a couple seconds), but short enough to return control periodically to callers (i.e., at most a few tens of seconds).

Returns

Optional fetches from compute grads calls.

Return type

fetches (dict|None)

stats()[source]

Returns a dictionary of internal performance statistics.

class ray.rllib.optimizers.SyncSamplesOptimizer(workers, num_sgd_iter=1, train_batch_size=1, sgd_minibatch_size=0, standardize_fields=frozenset({}))[source]

A simple synchronous RL optimizer.

In each step, this optimizer pulls samples from a number of remote workers, concatenates them, and then updates a local model. The updated model weights are then broadcast to all remote workers.

step()[source]

Takes a logical optimization step.

This should run for long enough to minimize call overheads (i.e., at least a couple seconds), but short enough to return control periodically to callers (i.e., at most a few tens of seconds).

Returns

Optional fetches from compute grads calls.

Return type

fetches (dict|None)

stats()[source]

Returns a dictionary of internal performance statistics.

class ray.rllib.optimizers.SyncReplayOptimizer(workers, learning_starts=1000, buffer_size=10000, prioritized_replay=True, prioritized_replay_alpha=0.6, prioritized_replay_beta=0.4, prioritized_replay_eps=1e-06, final_prioritized_replay_beta=0.4, train_batch_size=32, before_learn_on_batch=None, synchronize_sampling=False, prioritized_replay_beta_annealing_timesteps=20000.0)[source]

Variant of the local sync optimizer that supports replay (for DQN).

This optimizer requires that rollout workers return an additional “td_error” array in the info return of compute_gradients(). This error term will be used for sample prioritization.

step()[source]

Takes a logical optimization step.

This should run for long enough to minimize call overheads (i.e., at least a couple seconds), but short enough to return control periodically to callers (i.e., at most a few tens of seconds).

Returns

Optional fetches from compute grads calls.

Return type

fetches (dict|None)

stats()[source]

Returns a dictionary of internal performance statistics.

class ray.rllib.optimizers.SyncBatchReplayOptimizer(workers, learning_starts=1000, buffer_size=10000, train_batch_size=32)[source]

Variant of the sync replay optimizer that replays entire batches.

This enables RNN support. Does not currently support prioritization.

step()[source]

Takes a logical optimization step.

This should run for long enough to minimize call overheads (i.e., at least a couple seconds), but short enough to return control periodically to callers (i.e., at most a few tens of seconds).

Returns

Optional fetches from compute grads calls.

Return type

fetches (dict|None)

stats()[source]

Returns a dictionary of internal performance statistics.

class ray.rllib.optimizers.MicrobatchOptimizer(workers, train_batch_size=10000, microbatch_size=1000)[source]

A microbatching synchronous RL optimizer.

This optimizer pulls sample batches from workers until the target microbatch size is reached. Then, it computes and accumulates the policy gradient in a local buffer. This process is repeated until the number of samples collected equals the train batch size. Then, an accumulated gradient update is made.

This allows for training with effective batch sizes much larger than can fit in GPU or host memory.

step()[source]

Takes a logical optimization step.

This should run for long enough to minimize call overheads (i.e., at least a couple seconds), but short enough to return control periodically to callers (i.e., at most a few tens of seconds).

Returns

Optional fetches from compute grads calls.

Return type

fetches (dict|None)

stats()[source]

Returns a dictionary of internal performance statistics.

class ray.rllib.optimizers.LocalMultiGPUOptimizer(workers, sgd_batch_size=128, num_sgd_iter=10, rollout_fragment_length=200, num_envs_per_worker=1, train_batch_size=1024, num_gpus=0, standardize_fields=[], shuffle_sequences=True, _fake_gpus=False)[source]

A synchronous optimizer that uses multiple local GPUs.

Samples are pulled synchronously from multiple remote workers, concatenated, and then split across the memory of multiple local GPUs. A number of SGD passes are then taken over the in-memory data. For more details, see multi_gpu_impl.LocalSyncParallelOptimizer.

This optimizer is TensorFlow-specific and requires the underlying Policy to be a TFPolicy instance that implements the copy() method for multi-GPU tower generation.

Note that all replicas of the TFPolicy will merge their extra_compute_grad and apply_grad feed_dicts and fetches. This may result in unexpected behavior.

step()[source]

Takes a logical optimization step.

This should run for long enough to minimize call overheads (i.e., at least a couple seconds), but short enough to return control periodically to callers (i.e., at most a few tens of seconds).

Returns

Optional fetches from compute grads calls.

Return type

fetches (dict|None)

stats()[source]

Returns a dictionary of internal performance statistics.

class ray.rllib.optimizers.TorchDistributedDataParallelOptimizer(workers, expected_batch_size, num_sgd_iter=1, sgd_minibatch_size=0, standardize_fields=frozenset({}), keep_local_weights_in_sync=True, backend='gloo')[source]

EXPERIMENTAL: torch distributed multi-node SGD.

step()[source]

Takes a logical optimization step.

This should run for long enough to minimize call overheads (i.e., at least a couple seconds), but short enough to return control periodically to callers (i.e., at most a few tens of seconds).

Returns

Optional fetches from compute grads calls.

Return type

fetches (dict|None)

stats()[source]

Returns a dictionary of internal performance statistics.

ray.rllib.utils

ray.rllib.utils.override(cls)[source]

Annotation for documenting method overrides.

Parameters

cls (type) – The superclass that provides the overridden method. If this cls does not actually have the method, an error is raised.

ray.rllib.utils.PublicAPI(obj)[source]

Annotation for documenting public APIs.

Public APIs are classes and methods exposed to end users of RLlib. You can expect these APIs to remain stable across RLlib releases.

Subclasses that inherit from a @PublicAPI base class can be assumed part of the RLlib public API as well (e.g., all trainer classes are in public API because Trainer is @PublicAPI).

In addition, you can assume all trainer configurations are part of their public API as well.

ray.rllib.utils.DeveloperAPI(obj)[source]

Annotation for documenting developer APIs.

Developer APIs are classes and methods explicitly exposed to developers for the purposes of building custom algorithms or advanced training strategies on top of RLlib internals. You can generally expect these APIs to be stable sans minor changes (but less stable than public APIs).

Subclasses that inherit from a @DeveloperAPI base class can be assumed part of the RLlib developer API as well (e.g., all policy optimizers are developer API because PolicyOptimizer is @DeveloperAPI).

ray.rllib.utils.try_import_tf(error=False)[source]
Parameters

error (bool) – Whether to raise an error if tf cannot be imported.

Returns

The tf module (either tf 2.x’s tf.compat.v1 module or the tf 1.x module).

ray.rllib.utils.try_import_tfp(error=False)[source]
Parameters

error (bool) – Whether to raise an error if tfp cannot be imported.

Returns

The tfp module.

ray.rllib.utils.try_import_torch(error=False)[source]
Parameters

error (bool) – Whether to raise an error if torch cannot be imported.

Returns

torch AND torch.nn modules.

Return type

tuple
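
A minimal sketch of the import helpers (the returned modules may be None if the corresponding framework is not installed and error=False):

>>> from ray.rllib.utils import try_import_tf, try_import_torch
>>> tf = try_import_tf()
>>> torch, nn = try_import_torch()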

ray.rllib.utils.check_framework(framework='tf')[source]

Checks whether the given framework is valid, i.e., whether all necessary dependencies are installed. Raises an error otherwise.

Parameters

framework (str) – One of “tf”, “torch”, or None.

Returns

The input framework string.

Return type

str

ray.rllib.utils.deprecation_warning(old, new=None, error=None)[source]

Logs (via the logger object) or throws a deprecation warning/error.

Parameters
  • old (str) – A description of the “thing” that is to be deprecated.

  • new (Optional[str]) – A description of the new “thing” that replaces it.

  • error (Optional[bool,Exception]) – Whether or which exception to throw. If True, throw ValueError.

ray.rllib.utils.renamed_agent(cls)[source]

Helper class for renaming Agent => Trainer with a warning.

ray.rllib.utils.renamed_class(cls, old_name)[source]

Helper class for renaming classes with a warning.

ray.rllib.utils.renamed_function(func, old_name)[source]

Helper function for renaming a function.

class ray.rllib.utils.FilterManager[source]

Manages filters and coordination across remote evaluators that expose get_filters and sync_filters.

static synchronize(local_filters, remotes, update_remote=True)[source]

Aggregates all filters from remote evaluators.

The local copy is updated and then broadcast to all remote evaluators.

Parameters
  • local_filters (dict) – Filters to be synchronized.

  • remotes (list) – Remote evaluators with filters.

  • update_remote (bool) – Whether to push updates to remote filters.

class ray.rllib.utils.Filter[source]

Processes input, possibly statefully.

apply_changes(other, *args, **kwargs)[source]

Updates self with “new state” from other filter.

copy()[source]

Creates a new object with same state as self.

Returns

A copy of self.

sync(other)[source]

Copies all state from other filter to self.

clear_buffer()[source]

Creates a copy of the current state and clears the accumulated state.

ray.rllib.utils.sigmoid(x, derivative=False)[source]

Returns the sigmoid function applied to x. Alternatively, can return the derivative of the sigmoid function.

Parameters
  • x (np.ndarray) – The input to the sigmoid function.

  • derivative (bool) – Whether to return the derivative or not. Default: False.

Returns

The sigmoid function (or its derivative) applied to x.

Return type

np.ndarray

ray.rllib.utils.softmax(x, axis=- 1)[source]

Returns the softmax values for x: S(x_i) = e^(x_i) / sum_j(e^(x_j)), where j runs over all elements in x (along the given axis).

Parameters
  • x (np.ndarray) – The input to the softmax function.

  • axis (int) – The axis along which to softmax.

Returns

The softmax over x.

Return type

np.ndarray
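
For example (the result is approximate):

>>> import numpy as np
>>> from ray.rllib.utils import softmax
>>> softmax(np.array([1.0, 1.0]))  # -> approximately [0.5, 0.5]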

ray.rllib.utils.relu(x, alpha=0.0)[source]

Implementation of the leaky ReLU function: y = x * alpha if x < 0 else x

Parameters
  • x (np.ndarray) – The input values.

  • alpha (float) – A scaling (“leak”) factor to use for negative x.

Returns

The leaky ReLU output for x.

Return type

np.ndarray

ray.rllib.utils.one_hot(x, depth=0, on_value=1, off_value=0)[source]

One-hot utility function for numpy. Thanks to qianyizhang: https://gist.github.com/qianyizhang/07ee1c15cad08afb03f5de69349efc30.

Parameters
  • x (np.ndarray) – The input to be one-hot encoded.

  • depth (int) – The max. number to be one-hot encoded (size of last rank).

  • on_value (float) – The value to use for on. Default: 1.0.

  • off_value (float) – The value to use for off. Default: 0.0.

Returns

The one-hot encoded equivalent of the input array.

Return type

np.ndarray
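
For example (a minimal sketch of the documented behavior):

>>> import numpy as np
>>> from ray.rllib.utils import one_hot
>>> one_hot(np.array([0, 2]), depth=3)  # -> [[1, 0, 0], [0, 0, 1]]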

ray.rllib.utils.fc(x, weights, biases=None, framework=None)[source]

Calculates the outputs of a fully-connected (dense) layer given weights/biases and an input.

Parameters
  • x (np.ndarray) – The input to the dense layer.

  • weights (np.ndarray) – The weights matrix.

  • biases (Optional[np.ndarray]) – The biases vector. All 0s if None.

  • framework (Optional[str]) – An optional framework hint (to figure out, e.g. whether to transpose torch weight matrices).

Returns

The dense layer’s output.
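
For example (a plain numpy dense pass; biases default to zeros when None):

>>> import numpy as np
>>> from ray.rllib.utils import fc
>>> x = np.array([[1.0, 2.0]])
>>> weights = np.array([[1.0], [1.0]])
>>> fc(x, weights)  # -> [[3.0]], i.e. x.dot(weights) plus zero biases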

ray.rllib.utils.lstm(x, weights, biases=None, initial_internal_states=None, time_major=False, forget_bias=1.0)[source]

Calculates the outputs of an LSTM layer given weights/biases, internal_states, and input.

Parameters
  • x (np.ndarray) – The inputs to the LSTM layer including time-rank (0th if time-major, else 1st) and the batch-rank (1st if time-major, else 0th).

  • weights (np.ndarray) – The weights matrix.

  • biases (Optional[np.ndarray]) – The biases vector. All 0s if None.

  • initial_internal_states (Optional[np.ndarray]) – The initial internal states to pass into the layer. All 0s if None.

  • time_major (bool) – Whether to use time-major or not. Default: False.

  • forget_bias (float) – Gets added to first sigmoid (forget gate) output. Default: 1.0.

Returns

  • The LSTM layer’s output.

  • Tuple: Last (c-state, h-state).

Return type

Tuple

class ray.rllib.utils.PolicyClient(address)[source]

DEPRECATED: Please use rllib.env.PolicyClient instead.

start_episode(episode_id=None, training_enabled=True)[source]

Record the start of an episode.

Parameters
  • episode_id (str) – Unique string id for the episode or None for it to be auto-assigned.

  • training_enabled (bool) – Whether to use experiences for this episode to improve the policy.

Returns

Unique string id for the episode.

Return type

episode_id (str)

get_action(episode_id, observation)[source]

Record an observation and get the on-policy action.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • observation (obj) – Current environment observation.

Returns

Action from the env action space.

Return type

action (obj)

log_action(episode_id, observation, action)[source]

Record an observation and (off-policy) action taken.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • observation (obj) – Current environment observation.

  • action (obj) – Action for the observation.

log_returns(episode_id, reward, info=None)[source]

Record returns from the environment.

The reward will be attributed to the previous action taken by the episode. Rewards accumulate until the next action. If no reward is logged before the next action, a reward of 0.0 is assumed.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • reward (float) – Reward from the environment.

end_episode(episode_id, observation)[source]

Record the end of an episode.

Parameters
  • episode_id (str) – Episode id returned from start_episode().

  • observation (obj) – Current environment observation.

class ray.rllib.utils.PolicyServer(external_env, address, port)[source]

DEPRECATED: Please use rllib.env.PolicyServerInput instead.

class ray.rllib.utils.LinearSchedule(**kwargs)[source]

Linear interpolation between initial_p and final_p. Simply uses PolynomialSchedule with power=1.0.

final_p + (initial_p - final_p) * (1 - t/t_max)

class ray.rllib.utils.PiecewiseSchedule(endpoints, framework, interpolation=<function _linear_interpolation>, outside_value=None)[source]
class ray.rllib.utils.PolynomialSchedule(schedule_timesteps, final_p, framework, initial_p=1.0, power=2.0)[source]
class ray.rllib.utils.ExponentialSchedule(schedule_timesteps, framework, initial_p=1.0, decay_rate=0.1)[source]
class ray.rllib.utils.ConstantSchedule(value, framework)[source]

A Schedule where the value remains constant over time.
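
A hedged sketch (assumes the Schedule base class exposes a value(t) accessor for querying the schedule at a given timestep):

>>> from ray.rllib.utils import ConstantSchedule, PiecewiseSchedule
>>> const = ConstantSchedule(0.1, framework=None)
>>> const.value(1000)  # -> 0.1 at every timestep
>>> piecewise = PiecewiseSchedule(
...     endpoints=[(0, 1.0), (100, 0.0)], framework=None, outside_value=0.0)
>>> piecewise.value(50)  # -> 0.5 with the default linear interpolation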

ray.rllib.utils.check(x, y, decimals=5, atol=None, rtol=None, false=False)[source]

Checks two structures (dict, tuple, list, np.array, float, int, etc.) for (almost) numeric identity. All numbers in the two structures have to match up to decimals digits after the decimal point. Uses assertions.

Parameters
  • x (any) – The value to be compared (to the expectation: y). This may be a Tensor.

  • y (any) – The expected value to be compared to x. This must not be a Tensor.

  • decimals (int) – The number of digits after the floating point up to which all numeric values have to match.

  • atol (float) – Absolute tolerance of the difference between x and y (overrides decimals if given).

  • rtol (float) – Relative tolerance of the difference between x and y (overrides decimals if given).

  • false (bool) – Whether to check that x and y are NOT the same.
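
For example (a hedged sketch; check() asserts and returns nothing on success):

>>> from ray.rllib.utils import check
>>> check([1.00001, 2.0], [1.0, 2.0], decimals=3)  # passes: equal up to 3 decimals
>>> check(1.0, 2.0, false=True)  # passes: the two values are expected to differ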

ray.rllib.utils.framework_iterator(config=None, frameworks=('tf', 'eager', 'torch'), session=False)[source]

A generator that allows looping through n frameworks for testing.

Provides the correct config entries (“use_pytorch” and “eager”) as well as the correct eager/non-eager contexts for tf.

Parameters
  • config (Optional[dict]) – An optional config dict to alter in place depending on the iteration.

  • frameworks (Tuple[str]) – A list/tuple of the frameworks to be tested. Allowed are: “tf”, “eager”, and “torch”.

  • session (bool) – If True, enter a tf.Session() and yield that as well in the tf-case (otherwise, yield (fw, None)).

Yields
  • str – If session is False: the current framework (“tf”, “eager”, “torch”) being used.

  • Tuple[str, Union[None, tf.Session]] – If session is True: a tuple of the current framework and the tf.Session (or None if fw is not “tf”).
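
A hedged usage sketch (config is altered in place on each iteration):

>>> from ray.rllib.utils import framework_iterator
>>> config = {"lr": 0.0005}
>>> for fw in framework_iterator(config, frameworks=("tf", "torch")):
...     print(fw, config.get("use_pytorch"), config.get("eager"))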

ray.rllib.utils.merge_dicts(d1, d2)[source]
Parameters
  • d1 (dict) – Dict 1.

  • d2 (dict) – Dict 2.

Returns

A new dict that is d1 and d2 deep merged.

Return type

dict

ray.rllib.utils.deep_update(original, new_dict, new_keys_allowed=False, whitelist=None, override_all_if_type_changes=None)[source]

Updates original dict with values from new_dict recursively.

If a new key is introduced in new_dict and new_keys_allowed is not True, an error will be thrown. Further, for sub-dicts, if the parent key is in the whitelist, new subkeys can be introduced.

Parameters
  • original (dict) – Dictionary with default values.

  • new_dict (dict) – Dictionary with values to be updated

  • new_keys_allowed (bool) – Whether new keys are allowed.

  • whitelist (Optional[List[str]]) – List of keys that correspond to dict values where new subkeys can be introduced. This is only at the top level.

  • override_all_if_type_changes (Optional[List[str]]) – List of top level keys with value=dict, for which we always simply override the entire value (dict), iff the “type” key in that value dict changes.
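
For example (a minimal sketch of the merge semantics described above; “b” is whitelisted so the new subkey “d” is allowed):

>>> from ray.rllib.utils import merge_dicts, deep_update
>>> merge_dicts({"a": 1, "b": {"c": 2}}, {"b": {"d": 3}})  # -> {"a": 1, "b": {"c": 2, "d": 3}}
>>> original = {"a": 1, "b": {"c": 2}}
>>> deep_update(original, {"b": {"d": 3}}, new_keys_allowed=False, whitelist=["b"])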

ray.rllib.utils.add_mixins(base, mixins)[source]

Returns a new class with mixins applied in priority order.

ray.rllib.utils.force_list(elements=None, to_tuple=False)[source]

Makes sure elements is returned as a list, whether elements is a single item, already a list, or a tuple.

Parameters
  • elements (Optional[any]) – The inputs as single item, list, or tuple to be converted into a list/tuple. If None, returns empty list/tuple.

  • to_tuple (bool) – Whether to use tuple (instead of list).

Returns

All given elements in a list/tuple, depending on to_tuple’s value. If elements is None, returns an empty list/tuple.

Return type

Union[list,tuple]
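
For example (per the documented behavior):

>>> from ray.rllib.utils import force_list
>>> force_list("a")      # -> ["a"]
>>> force_list(None)     # -> []
>>> force_list((1, 2))   # -> [1, 2]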

ray.rllib.utils.force_tuple(elements=None, *, to_tuple=True)

Same as force_list, but with to_tuple defaulting to True, so that elements is returned as a tuple (whether elements is a single item, a list, or a tuple).

Parameters
  • elements (Optional[any]) – The inputs as single item, list, or tuple to be converted into a list/tuple. If None, returns empty list/tuple.

  • to_tuple (bool) – Whether to use tuple (instead of list).

Returns

All given elements in a list/tuple, depending on to_tuple’s value. If elements is None, returns an empty list/tuple.

Return type

Union[list,tuple]