Ray 2.10.0 introduces the alpha stage of RLlib’s “new API stack”. The Ray Team plans to transition algorithms, example scripts, and documentation to the new code base, thereby incrementally replacing the “old API stack” (e.g., ModelV2, Policy, RolloutWorker) throughout the subsequent minor releases leading up to Ray 3.0.

Note, however, that so far only PPO (single- and multi-agent) and SAC (single-agent only) support the “new API stack”, and both still run with the old APIs by default. You can continue to use the existing custom (old stack) classes.

See here for more details on how to use the new API stack.

External Application API

In some cases, for instance when interacting with an externally hosted simulator or production environment, it makes more sense to interact with RLlib as if it were an independently running service, rather than RLlib hosting the simulations itself. This is possible via RLlib’s external applications interface (full documentation).

class ray.rllib.env.policy_client.PolicyClient(address: str, inference_mode: str = 'local', update_interval: float = 10.0, session: Session | None = None)

REST client to interact with an RLlib policy server.

start_episode(episode_id: str | None = None, training_enabled: bool = True) -> str

Record the start of one or more episode(s).

Parameters:

  • episode_id (Optional[str]) – Unique string id for the episode, or None to have it auto-assigned.

  • training_enabled – Whether to use experiences for this episode to improve the policy.

Returns:

Unique string id for the episode.

Return type:

str

get_action(episode_id: str, observation: Any | Dict[Any, Any]) -> Any | Dict[Any, Any]

Record an observation and get the on-policy action.

Parameters:

  • episode_id – Episode id returned from start_episode().

  • observation – Current environment observation.

Returns:

Action from the env action space.

Return type:

Any | Dict[Any, Any]

log_action(episode_id: str, observation: Any | Dict[Any, Any], action: Any | Dict[Any, Any]) -> None

Record an observation and (off-policy) action taken.

Parameters:

  • episode_id – Episode id returned from start_episode().

  • observation – Current environment observation.

  • action – Action for the observation.
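
As a minimal sketch of the off-policy path, assume an external controller (not the RLlib policy) chooses each action, and the application only reports what happened. The helper name log_external_decisions, the lambda controller, and the MagicMock stand-in for the client are all hypothetical and exist only so the snippet runs without a policy server; a real PolicyClient would be passed in instead.

```python
from typing import Any, Callable, Sequence
from unittest.mock import MagicMock


def log_external_decisions(
    client: Any,
    observations: Sequence[Any],
    controller: Callable[[Any], Any],
) -> str:
    """Report externally chosen actions for one episode and return its id."""
    eps_id = client.start_episode(training_enabled=True)
    for obs in observations:
        # The action is decided by the external system, not by the policy.
        action = controller(obs)
        client.log_action(eps_id, obs, action)
    return eps_id


# Stand-in client (assumption for illustration): records calls, no networking.
mock_client = MagicMock()
mock_client.start_episode.return_value = "episode-0"

episode_id = log_external_decisions(
    mock_client,
    observations=[-0.5, 0.2],
    controller=lambda obs: 1 if obs < 0 else 0,
)
```

Rewards and episode termination are reported separately through log_returns() and end_episode().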

log_returns(episode_id: str, reward: float, info: dict | Dict[Any, Any] = None, multiagent_done_dict: Dict[Any, Any] | None = None) -> None

Record returns from the environment.

The reward will be attributed to the previous action taken by the episode. Rewards accumulate until the next action. If no reward is logged before the next action, a reward of 0.0 is assumed.

Parameters:

  • episode_id – Episode id returned from start_episode().

  • reward – Reward from the environment.

  • info – Extra info dict.

  • multiagent_done_dict – Multi-agent done information.
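
The attribution rule above can be illustrated with a small stand-alone model. RewardLedger is a hypothetical class written for this example, not RLlib code; it only mirrors the documented behavior (rewards add up on the most recent action, and an action with no logged reward counts as 0.0). Raising an error when a reward arrives before any action is an assumption of this sketch.

```python
from typing import List


class RewardLedger:
    """Hypothetical model of the documented reward attribution rule."""

    def __init__(self) -> None:
        self.per_action_rewards: List[float] = []

    def on_action(self) -> None:
        # Each new action opens a slot that starts at the default of 0.0.
        self.per_action_rewards.append(0.0)

    def on_log_returns(self, reward: float) -> None:
        if not self.per_action_rewards:
            # Assumption of this sketch: a reward needs a preceding action.
            raise RuntimeError("log_returns called before any action")
        # Rewards accumulate on the most recent action.
        self.per_action_rewards[-1] += reward


ledger = RewardLedger()
ledger.on_action()
ledger.on_log_returns(1.0)
ledger.on_log_returns(0.5)  # accumulates onto the first action
ledger.on_action()          # no reward logged before the next action
ledger.on_action()
ledger.on_log_returns(2.0)
print(ledger.per_action_rewards)  # [1.5, 0.0, 2.0]
```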

end_episode(episode_id: str, observation: Any | Dict[Any, Any]) -> None

Record the end of an episode.

Parameters:

  • episode_id – Episode id returned from start_episode().

  • observation – Current environment observation.

update_policy_weights() -> None

Query the server for new policy weights, if local inference is enabled.
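
Taken together, the client methods suggest a simple on-policy episode loop: start an episode, alternate get_action() and log_returns() until the environment finishes, then call end_episode() with the final observation. The sketch below is one possible arrangement under stated assumptions. CountdownEnv and StubClient are hypothetical stand-ins (a toy environment with a gymnasium-style reset()/step() interface and a client that only records call names), so the code runs without Ray or a policy server.

```python
from typing import Any, Dict, List, Tuple


class CountdownEnv:
    """Hypothetical toy environment with a gymnasium-style interface."""

    def __init__(self, horizon: int = 3) -> None:
        self.horizon = horizon
        self.t = 0

    def reset(self) -> Tuple[int, Dict[str, Any]]:
        self.t = 0
        return self.t, {}

    def step(self, action: int) -> Tuple[int, float, bool, bool, Dict[str, Any]]:
        self.t += 1
        terminated = self.t >= self.horizon
        return self.t, 1.0, terminated, False, {}


class StubClient:
    """Hypothetical stand-in exposing the PolicyClient methods used below."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def start_episode(self, episode_id=None, training_enabled=True) -> str:
        self.log.append("start_episode")
        return episode_id or "episode-0"

    def get_action(self, episode_id, observation) -> int:
        self.log.append("get_action")
        return 0  # fixed action; a real client would query the policy

    def log_returns(self, episode_id, reward, info=None, multiagent_done_dict=None) -> None:
        self.log.append("log_returns")

    def end_episode(self, episode_id, observation) -> None:
        self.log.append("end_episode")


def run_episode(client: Any, env: Any) -> float:
    """Run one on-policy episode and return the summed reward."""
    eps_id = client.start_episode(training_enabled=True)
    obs, info = env.reset()
    total, done = 0.0, False
    while not done:
        action = client.get_action(eps_id, obs)
        obs, reward, terminated, truncated, info = env.step(action)
        # The reward is attributed to the action returned just above.
        client.log_returns(eps_id, reward, info=info)
        total += reward
        done = terminated or truncated
    client.end_episode(eps_id, obs)
    return total


client = StubClient()
episode_return = run_episode(client, CountdownEnv(horizon=3))
```

Because run_episode() relies only on the method names, the same function should accept a real PolicyClient and a gymnasium environment in place of the stand-ins.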

class ray.rllib.env.policy_server_input.PolicyServerInput(ioctx: IOContext, address: str, port: int, idle_timeout: float = 3.0, max_sample_queue_size: int = 20)

REST policy server that acts as an offline data source.

This launches a multi-threaded server that listens on the specified host and port to serve policy requests and forward experiences to RLlib. For high-performance experience collection, it implements InputReader.

For an example, run examples/envs/external_envs/ along with examples/envs/external_envs/ --inference-mode=local|remote.

WARNING: This class is not meant to be publicly exposed. Anyone who can communicate with this server can execute arbitrary code on the machine. Use this with caution, in isolated environments, and at your own risk.
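
One way to act on this warning is to check that the bind address is loopback before constructing the server. The helper below is a hypothetical sketch that uses only the Python standard library; is_loopback_address is not part of RLlib, and a loopback check alone does not make the server safe.

```python
import ipaddress


def is_loopback_address(address: str) -> bool:
    """Return True if `address` is 'localhost' or a loopback IP literal."""
    if address == "localhost":
        return True
    try:
        return ipaddress.ip_address(address).is_loopback
    except ValueError:
        # Not an IP literal (e.g. a hostname): treat as not provably local.
        return False


for candidate in ("127.0.0.1", "::1", "0.0.0.0", "example.com"):
    print(candidate, is_loopback_address(candidate))
```

A caller could refuse to pass any address for which this returns False, for example the wildcard address 0.0.0.0, which listens on every interface.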

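import gymnasium as gym
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.env.policy_client import PolicyClient
from ray.rllib.env.policy_server_input import PolicyServerInput

# Server side. The host and port are left unspecified here.
addr, port = ...
# Assumed builder chain: a PPO config whose offline input is the policy server.
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .offline_data(
        input_=lambda ioctx: PolicyServerInput(ioctx, addr, port)
    )
    # Run just 1 server (in the Algorithm's EnvRunnerGroup).
    .env_runners(num_env_runners=0)
)
algo = config.build()
while True:
    algo.train()

# Client side. Run this in a separate process: the loop above never returns.
client = PolicyClient(
    "localhost:9900", inference_mode="local")
eps_id = client.start_episode()
env = gym.make("CartPole-v1")
obs, info = env.reset()
action = client.get_action(eps_id, obs)
_, reward, _, _, _ = env.step(action)
# Log the reward once; repeated calls would accumulate on the same action.
client.log_returns(eps_id, reward)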

next()

Returns the next batch of read experiences.

Returns:

The experience read (SampleBatch or MultiAgentBatch).