Offline reinforcement learning with Ray AIRΒΆ

In this example, we’ll train a reinforcement learning agent using offline training.

Offline training means that the data from the environment (and the actions performed by the agent) have been stored on disk. In contrast, online training samples experiences live by interacting with the environment.

Let’s start with installing our dependencies:

# !pip install -qU "ray[rllib]" gym

Now we can run some imports:

import argparse
import gym
import os

import numpy as np
import ray
from ray.air import Checkpoint
from ray.air.config import RunConfig
from ray.train.rl.rl_predictor import RLPredictor
from ray.train.rl.rl_trainer import RLTrainer
from ray.air.config import ScalingConfig
from ray.air.result import Result
from ray.rllib.algorithms.bc import BC
from ray.tune.tuner import Tuner

We will be training on offline data - this means we have full agent trajectories stored somewhere on disk and want to train on these past experiences.

Usually this data could come from external systems, or a database of historical data. But for this example, we’ll generate some offline data ourselves and store it using RLlibs output_config.

def generate_offline_data(path: str):
    print(f"Generating offline data for training at {path}")
    trainer = RLTrainer(
        algorithm="PPO",
        run_config=RunConfig(stop={"timesteps_total": 5000}),
        config={
            "env": "CartPole-v1",
            "output": "dataset",
            "output_config": {
                "format": "json",
                "path": path,
                "max_num_samples_per_file": 1,
            },
            "batch_mode": "complete_episodes",
        },
    )
    trainer.fit()

Here we define the training function. It will create an RLTrainer using the PPO algorithm and kick off training on the CartPole-v1 environment. It will use the offline data provided in path for this.

def train_rl_bc_offline(path: str, num_workers: int, use_gpu: bool = False) -> Result:
    print("Starting offline training")
    dataset = ray.data.read_json(
        path, parallelism=num_workers, ray_remote_args={"num_cpus": 1}
    )

    trainer = RLTrainer(
        run_config=RunConfig(stop={"training_iteration": 5}),
        scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
        datasets={"train": dataset},
        algorithm=BC,
        config={
            "env": "CartPole-v1",
            "framework": "tf",
            "evaluation_num_workers": 1,
            "evaluation_interval": 1,
            "evaluation_config": {"input": "sampler"},
        },
    )

    # Todo (krfricke/xwjiang): Enable checkpoint config in RunConfig
    # result = trainer.fit()
    tuner = Tuner(
        trainer,
        _tuner_kwargs={"checkpoint_at_end": True},
    )
    result = tuner.fit()[0]
    return result

Once we trained our RL policy, we want to evaluate it on a fresh environment. For this, we will also define a utility function:

def evaluate_using_checkpoint(checkpoint: Checkpoint, num_episodes) -> list:
    predictor = RLPredictor.from_checkpoint(checkpoint)

    env = gym.make("CartPole-v1")

    rewards = []
    for i in range(num_episodes):
        obs = env.reset()
        reward = 0.0
        done = False
        while not done:
            action = predictor.predict(np.array([obs]))
            obs, r, done, _ = env.step(action[0])
            reward += r
        rewards.append(reward)

    return rewards

Let’s put it all together. First, we initialize Ray and create the offline data:

ray.init(num_cpus=8)

path = "/tmp/out"
generate_offline_data(path)
2022-09-26 18:22:15,032	INFO worker.py:1509 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
Generating offline data for training at /tmp/out

Tune Status

Current time:2022-09-26 18:22:31
Running for: 00:00:15.61
Memory: 10.4/62.7 GiB

System Info

Using FIFO scheduling algorithm.
Resources requested: 0/8 CPUs, 0/0 GPUs, 0.0/33.26 GiB heap, 0.0/16.63 GiB objects

Trial Status

Trial name status loc iter total time (s) ts reward num_recreated_worker s episode_reward_max episode_reward_min
AIRPPO_d229c_00000TERMINATED192.168.1.241:3893828 2 8.775258528 45.760 137 10
(AIRPPO pid=3893828) 2022-09-26 18:22:18,476	INFO algorithm.py:2104 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.
(AIRPPO pid=3893828) 2022-09-26 18:22:18,476	INFO ppo.py:379 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
(AIRPPO pid=3893828) 2022-09-26 18:22:18,477	INFO algorithm.py:356 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(RolloutWorker pid=3893857) 2022-09-26 18:22:21,261	WARNING env.py:159 -- Your env reset() method appears to take 'seed' or 'return_info' arguments. Note that these are not yet supported in RLlib. Seeding will take place using 'env.seed()' and the info dict will not be returned from reset.
(AIRPPO pid=3893828) 2022-09-26 18:22:22,261	WARNING util.py:66 -- Install gputil for GPU system monitoring.
Repartition:   0%|          | 0/1 [00:00<?, ?it/s]
Repartition:   0%|          | 0/1 [00:00<?, ?it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00,  1.34it/s]
Write Progress:   0%|          | 0/1 [00:00<?, ?it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00,  1.34it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 175.38it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00,  1.33it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 232.05it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 288.70it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 265.45it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 295.96it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 381.47it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 273.05it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 240.86it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 222.25it/s]
Repartition:   0%|          | 0/1 [00:00<?, ?it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 219.46it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 152.97it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 237.54it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 229.13it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 283.36it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 351.08it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 302.44it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 217.21it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 226.01it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 335.54it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 206.81it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 397.15it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 287.34it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 245.70it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 224.11it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 174.41it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 376.54it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 231.93it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 325.49it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 280.89it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 206.81it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 263.11it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 236.59it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 320.96it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 372.13it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 293.16it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 318.30it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 225.89it/s]

Trial Progress

Trial name agent_timesteps_totalcounters custom_metrics date done episode_len_meanepisode_media episode_reward_max episode_reward_mean episode_reward_min episodes_this_iter episodes_totalexperiment_id hostname info iterations_since_restorenode_ip num_agent_steps_sampled num_agent_steps_trained num_env_steps_sampled num_env_steps_sampled_this_iter num_env_steps_trained num_env_steps_trained_this_iter num_faulty_episodes num_healthy_workers num_recreated_workers num_steps_trained_this_iterperf pidpolicy_reward_max policy_reward_mean policy_reward_min sampler_perf sampler_results time_since_restore time_this_iter_s time_total_stimers timestamp timesteps_since_restore timesteps_total training_iterationtrial_id warmup_time
AIRPPO_d229c_00000 8528{'num_env_steps_sampled': 8528, 'num_env_steps_trained': 8528, 'num_agent_steps_sampled': 8528, 'num_agent_steps_trained': 8528}{} 2022-09-26_18-22-31True 45.76{} 137 45.76 10 84 284eadfde34443046629ed77655da6915c9corvus {'learner': {'default_policy': {'learner_stats': {'cur_kl_coeff': 0.30000001192092896, 'cur_lr': 4.999999873689376e-05, 'total_loss': 9.522293, 'policy_loss': -0.03154374, 'vf_loss': 9.54884, 'vf_explained_var': -0.011132962, 'kl': 0.016653905, 'entropy': 0.6111665, 'entropy_coeff': 0.0, 'model': {}}, 'custom_metrics': {}, 'num_agent_steps_trained': 128.0}}, 'num_env_steps_sampled': 8528, 'num_env_steps_trained': 8528, 'num_agent_steps_sampled': 8528, 'num_agent_steps_trained': 8528} 2192.168.1.241 8528 8528 8528 4238 8528 4238 0 2 0 4238{'cpu_util_percent': 16.94, 'ram_util_percent': 16.5}3893828{} {} {} {'mean_raw_obs_processing_ms': 0.2064514408664396, 'mean_inference_ms': 0.31645616264795123, 'mean_action_processing_ms': 0.032597069914330125, 'mean_env_wait_ms': 0.027492389739157415, 'mean_env_render_ms': 0.0}{'episode_reward_max': 137.0, 'episode_reward_min': 10.0, 'episode_reward_mean': 45.76, 'episode_len_mean': 45.76, 'episode_media': {}, 'episodes_this_iter': 84, 'policy_reward_min': {}, 'policy_reward_max': {}, 'policy_reward_mean': {}, 'custom_metrics': {}, 'hist_stats': {'episode_reward': [22.0, 17.0, 27.0, 14.0, 13.0, 10.0, 16.0, 12.0, 25.0, 29.0, 22.0, 27.0, 18.0, 26.0, 35.0, 25.0, 41.0, 23.0, 69.0, 56.0, 53.0, 30.0, 120.0, 40.0, 38.0, 86.0, 10.0, 19.0, 137.0, 43.0, 72.0, 119.0, 21.0, 53.0, 45.0, 36.0, 14.0, 35.0, 69.0, 100.0, 118.0, 48.0, 12.0, 21.0, 12.0, 30.0, 59.0, 34.0, 72.0, 63.0, 50.0, 42.0, 32.0, 28.0, 44.0, 59.0, 19.0, 86.0, 32.0, 69.0, 47.0, 62.0, 73.0, 13.0, 72.0, 36.0, 12.0, 49.0, 17.0, 117.0, 19.0, 13.0, 24.0, 12.0, 17.0, 23.0, 49.0, 22.0, 86.0, 79.0, 92.0, 21.0, 101.0, 30.0, 12.0, 62.0, 80.0, 32.0, 18.0, 95.0, 18.0, 35.0, 80.0, 69.0, 72.0, 116.0, 67.0, 83.0, 35.0, 19.0], 'episode_lengths': [22, 17, 27, 14, 13, 10, 16, 12, 25, 29, 22, 27, 18, 26, 35, 25, 41, 23, 69, 56, 53, 30, 120, 40, 38, 86, 10, 19, 137, 43, 72, 119, 21, 53, 45, 36, 14, 35, 69, 100, 118, 48, 12, 21, 12, 30, 59, 34, 72, 63, 50, 42, 32, 28, 44, 59, 19, 86, 32, 69, 47, 62, 73, 13, 72, 36, 12, 49, 17, 117, 19, 13, 24, 12, 17, 23, 49, 22, 86, 79, 92, 21, 101, 30, 12, 62, 80, 32, 18, 95, 18, 35, 80, 69, 72, 116, 67, 83, 35, 19]}, 'sampler_perf': {'mean_raw_obs_processing_ms': 0.2064514408664396, 'mean_inference_ms': 0.31645616264795123, 'mean_action_processing_ms': 0.032597069914330125, 'mean_env_wait_ms': 0.027492389739157415, 'mean_env_render_ms': 0.0}, 'num_faulty_episodes': 0} 8.77525 3.47446 8.77525{'training_iteration_time_ms': 4384.142, 'load_time_ms': 0.287, 'load_throughput': 14835762.966, 'learn_time_ms': 2129.317, 'learn_throughput': 2002.52, 'synch_weights_time_ms': 1.33} 1664241751 0 8528 2d229c_00000 3.78836
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 227.44it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 221.02it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 221.55it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 241.52it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 267.02it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 299.79it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 353.41it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 339.84it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 218.85it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 277.73it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 252.29it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 233.57it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 291.35it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 345.12it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 230.60it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 331.38it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 246.51it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 239.32it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 265.46it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 228.80it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 256.47it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 294.52it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 343.40it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 227.59it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 243.35it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 226.79it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 227.47it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 219.45it/s]
Repartition:   0%|          | 0/1 [00:00<?, ?it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 253.25it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 240.10it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 236.03it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 220.66it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 222.38it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 224.67it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 327.73it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 233.71it/s]
2022-09-26 18:22:31,518	INFO tune.py:762 -- Total run time: 16.00 seconds (15.59 seconds for the tuning loop).

Then, we run training:

result = train_rl_bc_offline(path=path, num_workers=2, use_gpu=False)
Starting offline training

Tune Status

Current time:2022-09-26 18:22:55
Running for: 00:00:10.97
Memory: 10.9/62.7 GiB

System Info

Using FIFO scheduling algorithm.
Resources requested: 0/8 CPUs, 0/0 GPUs, 0.0/33.26 GiB heap, 0.0/16.63 GiB objects

Trial Status

Trial name status loc iter total time (s) ts reward num_recreated_worker s episode_reward_max episode_reward_min
AIRBC_e3afc_00000TERMINATED192.168.1.241:3894380 5 0.99661211084 nan0 nan nan
(AIRBC pid=3894380) 2022-09-26 18:22:47,815	INFO algorithm.py:2104 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.
(AIRBC pid=3894380) 2022-09-26 18:22:47,816	INFO algorithm.py:356 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
Read:   0%|          | 0/2 [00:00<?, ?it/s]
Read:  50%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ     | 1/2 [00:00<00:00,  3.37it/s]
Read: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:00<00:00,  5.58it/s]
Repartition:   0%|          | 0/2 [00:00<?, ?it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:00<00:00, 14.03it/s]
(RolloutWorker pid=3894910) 2022-09-26 18:22:51,123	WARNING env.py:159 -- Your env reset() method appears to take 'seed' or 'return_info' arguments. Note that these are not yet supported in RLlib. Seeding will take place using 'env.seed()' and the info dict will not be returned from reset.
(RolloutWorker pid=3894910) DatasetReader 1 has 239, samples.
(RolloutWorker pid=3894911) DatasetReader 2 has 239, samples.
(AIRBC pid=3894380) 2022-09-26 18:22:51,756	WARNING deprecation.py:47 -- DeprecationWarning: `simple_optimizer` has been deprecated. This will raise an error in the future!
(RolloutWorker pid=3895021) 2022-09-26 18:22:54,479	WARNING env.py:159 -- Your env reset() method appears to take 'seed' or 'return_info' arguments. Note that these are not yet supported in RLlib. Seeding will take place using 'env.seed()' and the info dict will not be returned from reset.
(AIRBC pid=3894380) 2022-09-26 18:22:54,735	WARNING util.py:66 -- Install gputil for GPU system monitoring.

Trial Progress

Trial name agent_timesteps_totalcounters custom_metrics date done episode_len_meanepisode_media episode_reward_max episode_reward_mean episode_reward_min episodes_this_iter episodes_totalevaluation experiment_id hostname info iterations_since_restorenode_ip num_agent_steps_sampled num_agent_steps_trained num_env_steps_sampled num_env_steps_sampled_this_iter num_env_steps_trained num_env_steps_trained_this_iter num_faulty_episodes num_healthy_workers num_recreated_workers num_steps_trained_this_iterperf pidpolicy_reward_max policy_reward_mean policy_reward_min sampler_perf sampler_results time_since_restore time_this_iter_s time_total_stimers timestamp timesteps_since_restore timesteps_total training_iterationtrial_id warmup_time
AIRBC_e3afc_00000 11084{'num_env_steps_sampled': 11084, 'num_env_steps_trained': 11084, 'num_agent_steps_sampled': 11084, 'num_agent_steps_trained': 11084}{} 2022-09-26_18-22-55True nan{} nan nan nan 0 0{'episode_reward_max': 24.0, 'episode_reward_min': 10.0, 'episode_reward_mean': 16.9, 'episode_len_mean': 16.9, 'episode_media': {}, 'episodes_this_iter': 10, 'policy_reward_min': {}, 'policy_reward_max': {}, 'policy_reward_mean': {}, 'custom_metrics': {}, 'hist_stats': {'episode_reward': [22.0, 12.0, 10.0, 21.0, 24.0, 19.0, 15.0, 17.0, 17.0, 12.0], 'episode_lengths': [22, 12, 10, 21, 24, 19, 15, 17, 17, 12]}, 'sampler_perf': {'mean_raw_obs_processing_ms': 0.1389556124250126, 'mean_inference_ms': 0.3053837110354277, 'mean_action_processing_ms': 0.031036834474401493, 'mean_env_wait_ms': 0.0260694335622386, 'mean_env_render_ms': 0.0}, 'num_faulty_episodes': 0, 'num_agent_steps_sampled_this_iter': 169, 'num_env_steps_sampled_this_iter': 169, 'timesteps_this_iter': 169, 'num_healthy_workers': 1, 'num_recreated_workers': 0}21b4e50f0a544d479bf6794c0eedc65acorvus {'learner': {'default_policy': {'learner_stats': {'policy_loss': 0.69113123, 'total_loss': 0.69113123, 'model': {}}, 'custom_metrics': {}, 'num_agent_steps_trained': 2000.0}}, 'num_env_steps_sampled': 11084, 'num_env_steps_trained': 11084, 'num_agent_steps_sampled': 11084, 'num_agent_steps_trained': 11084} 5192.168.1.241 11084 11084 11084 2270 11084 2270 0 2 0 2270{} 3894380{} {} {} {} {'episode_reward_max': nan, 'episode_reward_min': nan, 'episode_reward_mean': nan, 'episode_len_mean': nan, 'episode_media': {}, 'episodes_this_iter': 0, 'policy_reward_min': {}, 'policy_reward_max': {}, 'policy_reward_mean': {}, 'custom_metrics': {}, 'hist_stats': {'episode_reward': [], 'episode_lengths': []}, 'sampler_perf': {}, 'num_faulty_episodes': 0} 0.996612 0.116935 0.996612{'training_iteration_time_ms': 55.867, 'sample_time_ms': 32.326, 'load_time_ms': 0.227, 'load_throughput': 9744218.306, 'learn_time_ms': 21.78, 'learn_throughput': 101779.38, 'synch_weights_time_ms': 1.468} 1664241775 0 11084 5e3afc_00000 6.92411
2022-09-26 18:22:56,255	INFO tune.py:762 -- Total run time: 11.35 seconds (10.95 seconds for the tuning loop).

And then, using the obtained checkpoint, we evaluate the policy on a fresh environment:

num_eval_episodes = 3

rewards = evaluate_using_checkpoint(result.checkpoint, num_episodes=num_eval_episodes)
print(f"Average reward over {num_eval_episodes} episodes: " f"{np.mean(rewards)}")
2022-09-26 18:23:01,591	INFO algorithm.py:2104 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.
2022-09-26 18:23:01,591	WARNING deprecation.py:47 -- DeprecationWarning: `simple_optimizer` has been deprecated. This will raise an error in the future!
2022-09-26 18:23:01,593	INFO algorithm.py:356 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
Read: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:00<00:00,  2.54it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:00<00:00, 10.71it/s]
(RolloutWorker pid=3895623) 2022-09-26 18:23:05,106	WARNING env.py:159 -- Your env reset() method appears to take 'seed' or 'return_info' arguments. Note that these are not yet supported in RLlib. Seeding will take place using 'env.seed()' and the info dict will not be returned from reset.
(RolloutWorker pid=3895623) DatasetReader 1 has 239, samples.
(RolloutWorker pid=3895624) DatasetReader 2 has 239, samples.
2022-09-26 18:23:05,788	WARNING deprecation.py:47 -- DeprecationWarning: `simple_optimizer` has been deprecated. This will raise an error in the future!
(RolloutWorker pid=3895732) 2022-09-26 18:23:08,264	WARNING env.py:159 -- Your env reset() method appears to take 'seed' or 'return_info' arguments. Note that these are not yet supported in RLlib. Seeding will take place using 'env.seed()' and the info dict will not be returned from reset.
2022-09-26 18:23:08,520	WARNING util.py:66 -- Install gputil for GPU system monitoring.
2022-09-26 18:23:08,568	INFO trainable.py:690 -- Restored on 192.168.1.241 from checkpoint: /home/pdmurray/ray_results/AIRBC_2022-09-26_18-22-44/AIRBC_e3afc_00000_0_2022-09-26_18-22-45/checkpoint_000005
2022-09-26 18:23:08,569	INFO trainable.py:699 -- Current state after restoring: {'_iteration': 5, '_timesteps_total': None, '_time_total': 0.9966120719909668, '_episodes_total': 0}
Average reward over 3 episodes: 23.666666666666668