Offline reinforcement learning with Ray AIR

In this example, we’ll train a reinforcement learning agent on offline data.

Offline training means that the data from the environment (and the actions performed by the agent) have been stored on disk. In contrast, online training samples experiences live by interacting with the environment.
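
To make this concrete, here is a rough sketch of what a single stored experience might contain for CartPole. The field names and values are illustrative only; the exact schema depends on the output format RLlib is configured with.

# Illustrative only: roughly what one offline experience record looks like.
# The real schema depends on RLlib's configured output format.
example_transition = {
    "obs": [0.02, -0.01, 0.03, 0.04],      # current CartPole observation
    "actions": 1,                          # action the behavior policy took
    "rewards": 1.0,                        # reward returned by the environment
    "dones": False,                        # whether the episode terminated here
    "new_obs": [0.02, 0.01, 0.02, -0.01],  # next observation
}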

Let’s start with installing our dependencies:

!pip install -qU "ray[rllib]" gym

Now we can run some imports:

import argparse
import gym
import os

import numpy as np
import ray
from ray.air import Checkpoint
from ray.air.config import RunConfig
from ray.train.rl.rl_predictor import RLPredictor
from ray.train.rl.rl_trainer import RLTrainer
from ray.air.result import Result
from ray.rllib.agents.marwil import BCTrainer
from ray.tune.tuner import Tuner
2022-05-20 11:57:36,802	WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!
2022-05-20 11:57:36,815	WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.agents.marwil` has been deprecated. Use `ray.rllib.algorithms.marwil` instead. This will raise an error in the future!

We will be training on offline data - this means we have full agent trajectories stored somewhere on disk and want to train on these past experiences.

Usually, this data would come from external systems or a database of historical data. For this example, though, we’ll generate some offline data ourselves and store it using RLlib’s output_config.

def generate_offline_data(path: str):
    print(f"Generating offline data for training at {path}")
    trainer = RLTrainer(
        algorithm="PPO",
        run_config=RunConfig(stop={"timesteps_total": 5000}),
        config={
            "env": "CartPole-v0",
            "output": "dataset",
            "output_config": {
                "format": "json",
                "path": path,
                "max_num_samples_per_file": 1,
            },
            "batch_mode": "complete_episodes",
        },
    )
    trainer.fit()
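
Once this has run (we call it further below), the target path contains JSON files with the sampled experiences. As an optional sanity check, you could load them back with Ray Datasets; this snippet is illustrative and not part of the training pipeline:

# Optional: inspect the generated offline data after generate_offline_data() has run.
ds = ray.data.read_json("/tmp/out")  # the same path we pass to generate_offline_data() below
print("Number of records:", ds.count())
ds.show(1)  # print a single record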

Here we define the training function. It will create an RLTrainer using the BC (behavior cloning) algorithm and train on the offline data provided in path. The CartPole-v0 environment is only used to evaluate the learned policy.

def train_rl_bc_offline(path: str, num_workers: int, use_gpu: bool = False) -> Result:
    print("Starting offline training")
    dataset = ray.data.read_json(
        path, parallelism=num_workers, ray_remote_args={"num_cpus": 1}
    )

    trainer = RLTrainer(
        run_config=RunConfig(stop={"training_iteration": 5}),
        scaling_config={
            "num_workers": num_workers,
            "use_gpu": use_gpu,
        },
        datasets={"train": dataset},
        algorithm=BCTrainer,
        config={
            "env": "CartPole-v0",
            "framework": "tf",
            "evaluation_num_workers": 1,
            "evaluation_interval": 1,
            "evaluation_config": {"input": "sampler"},
        },
    )

    # Todo (krfricke/xwjiang): Enable checkpoint config in RunConfig
    # result = trainer.fit()
    tuner = Tuner(
        trainer,
        _tuner_kwargs={"checkpoint_at_end": True},
    )
    result = tuner.fit()[0]
    return result

Once we’ve trained our RL policy, we want to evaluate it on a fresh environment. For this, we will also define a utility function:

def evaluate_using_checkpoint(checkpoint: Checkpoint, num_episodes) -> list:
    predictor = RLPredictor.from_checkpoint(checkpoint)

    env = gym.make("CartPole-v0")

    rewards = []
    for i in range(num_episodes):
        obs = env.reset()
        reward = 0.0
        done = False
        while not done:
            action = predictor.predict([obs])
            obs, r, done, _ = env.step(action[0])
            reward += r
        rewards.append(reward)

    return rewards

Let’s put it all together. First, we initialize Ray and create the offline data:

ray.init(num_cpus=8)

path = "/tmp/out"
generate_offline_data(path)
2022-05-20 11:57:39,477	INFO services.py:1483 -- View the Ray dashboard at http://127.0.0.1:8265
2022-05-20 11:57:40,910	WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.agents.dqn.dqn.DEFAULT_CONFIG` has been deprecated. Use `ray.rllib.agents.dqn.dqn.DQNConfig(...)` instead. This will raise an error in the future!
Generating offline data for training at /tmp/out
== Status ==
Current time: 2022-05-20 11:58:13 (running for 00:00:31.89)
Memory usage on this node: 10.0/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/8 CPUs, 0/0 GPUs, 0.0/4.13 GiB heap, 0.0/2.0 GiB objects
Result logdir: /Users/kai/ray_results/AIRPPOTrainer_2022-05-20_11-57-41
Number of trials: 1/1 (1 TERMINATED)
Trial name                 status      loc              iter  total time (s)  ts    reward  episode_reward_max  episode_reward_min  episode_len_mean
AIRPPOTrainer_ab506_00000  TERMINATED  127.0.0.1:28838  2     11.5833         8665  46.31   147                 11                  46.31


(raylet) 2022-05-20 11:57:42,730	INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=63962 --object-store-name=/tmp/ray/session_2022-05-20_11-57-36_849562_28764/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-20_11-57-36_849562_28764/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=64061 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:64346 --redis-password=5241590000000000 --startup-token=8 --runtime-env-hash=-2010331134
(pid=28838) 2022-05-20 11:57:51,258	WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!
(AIRPPOTrainer pid=28838) 2022-05-20 11:57:51,947	INFO trainer.py:1728 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.
(AIRPPOTrainer pid=28838) 2022-05-20 11:57:51,948	INFO ppo.py:361 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
(AIRPPOTrainer pid=28838) 2022-05-20 11:57:51,948	INFO trainer.py:328 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(raylet) 2022-05-20 11:57:53,104	INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=63962 --object-store-name=/tmp/ray/session_2022-05-20_11-57-36_849562_28764/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-20_11-57-36_849562_28764/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=64061 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:64346 --redis-password=5241590000000000 --startup-token=9 --runtime-env-hash=-2010331134
(raylet) 2022-05-20 11:57:53,104	INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=63962 --object-store-name=/tmp/ray/session_2022-05-20_11-57-36_849562_28764/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-20_11-57-36_849562_28764/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=64061 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:64346 --redis-password=5241590000000000 --startup-token=10 --runtime-env-hash=-2010331134
(RolloutWorker pid=28848) 2022-05-20 11:58:00,061	WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!
(RolloutWorker pid=28849) 2022-05-20 11:58:00,061	WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!
(AIRPPOTrainer pid=28838) 2022-05-20 11:58:01,467	WARNING util.py:65 -- Install gputil for GPU system monitoring.
(raylet) 2022-05-20 11:58:02,584	INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=63962 --object-store-name=/tmp/ray/session_2022-05-20_11-57-36_849562_28764/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-20_11-57-36_849562_28764/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=64061 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:64346 --redis-password=5241590000000000 --startup-token=11 --runtime-env-hash=-2010331069
(raylet) 2022-05-20 11:58:02,584	INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=63962 --object-store-name=/tmp/ray/session_2022-05-20_11-57-36_849562_28764/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-20_11-57-36_849562_28764/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=64061 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:64346 --redis-password=5241590000000000 --startup-token=12 --runtime-env-hash=-2010331069
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:01<00:00,  1.98s/it]
Write Progress:   0%|          | 0/1 [00:00<?, ?it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:02<00:00,  2.04s/it]
Write Progress:   0%|          | 0/1 [00:00<?, ?it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 38.96it/s]
(raylet) 2022-05-20 11:58:04,608	INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=63962 --object-store-name=/tmp/ray/session_2022-05-20_11-57-36_849562_28764/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-20_11-57-36_849562_28764/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=64061 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:64346 --redis-password=5241590000000000 --startup-token=13 --runtime-env-hash=-2010331069
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:02<00:00,  2.11s/it]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 149.48it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 113.58it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 148.52it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 227.01it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 194.43it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 263.51it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 158.20it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 296.46it/s]
Repartition:   0%|          | 0/1 [00:00<?, ?it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 158.08it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 195.96it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 183.05it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 312.01it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 216.03it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 289.20it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 210.04it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 263.99it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 165.20it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 224.62it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 198.53it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 338.41it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 193.87it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 266.95it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 195.85it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 302.64it/s]
Repartition:   0%|          | 0/1 [00:00<?, ?it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 185.63it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 185.39it/s]
Write Progress:   0%|          | 0/1 [00:00<?, ?it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 300.90it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 238.33it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 259.00it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 313.19it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 218.53it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 278.25it/s]
(AIRPPOTrainer pid=28838) 2022-05-20 11:58:07,504	WARNING deprecation.py:47 -- DeprecationWarning: `slice` has been deprecated. Use `SampleBatch[start:stop]` instead. This will raise an error in the future!
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 264.41it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 329.79it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 215.19it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 299.66it/s]
Result for AIRPPOTrainer_ab506_00000:
  agent_timesteps_total: 4305
  counters:
    num_agent_steps_sampled: 4305
    num_agent_steps_trained: 4305
    num_env_steps_sampled: 4305
    num_env_steps_trained: 4305
  custom_metrics: {}
  date: 2022-05-20_11-58-09
  done: false
  episode_len_mean: 21.633165829145728
  episode_media: {}
  episode_reward_max: 83.0
  episode_reward_mean: 21.633165829145728
  episode_reward_min: 9.0
  episodes_this_iter: 199
  episodes_total: 199
  experiment_id: d6ab9eba2e4e488384aa2e958fab71c8
  hostname: Kais-MacBook-Pro.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.6652079820632935
          entropy_coeff: 0.0
          kl: 0.027841072529554367
          model: {}
          policy_loss: -0.042915552854537964
          total_loss: 9.028203010559082
          vf_explained_var: -0.05767782777547836
          vf_loss: 9.065549850463867
        num_agent_steps_trained: 128.0
    num_agent_steps_sampled: 4305
    num_agent_steps_trained: 4305
    num_env_steps_sampled: 4305
    num_env_steps_trained: 4305
  iterations_since_restore: 1
  node_ip: 127.0.0.1
  num_agent_steps_sampled: 4305
  num_agent_steps_trained: 4305
  num_env_steps_sampled: 4305
  num_env_steps_sampled_this_iter: 4305
  num_env_steps_trained: 4305
  num_env_steps_trained_this_iter: 4305
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 16.474999999999998
    ram_util_percent: 61.041666666666664
  pid: 28838
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.06155790977082133
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.04961143452632256
    mean_inference_ms: 0.5584241294994345
    mean_raw_obs_processing_ms: 0.09605169519383157
  sampler_results:
    custom_metrics: {}
    episode_len_mean: 21.633165829145728
    episode_media: {}
    episode_reward_max: 83.0
    episode_reward_mean: 21.633165829145728
    episode_reward_min: 9.0
    episodes_this_iter: 199
    hist_stats:
      episode_lengths:
      - 19
      - 13
      - 43
      - 26
      - 13
      - 16
      - 13
      - 12
      - 13
      - 27
      - 40
      - 18
      - 14
      - 16
      - 19
      - 19
      - 12
      - 13
      - 10
      - 24
      - 16
      - 18
      - 15
      - 11
      - 16
      - 63
      - 14
      - 15
      - 30
      - 12
      - 13
      - 20
      - 21
      - 20
      - 28
      - 29
      - 22
      - 20
      - 16
      - 14
      - 13
      - 17
      - 21
      - 12
      - 31
      - 25
      - 27
      - 19
      - 18
      - 28
      - 15
      - 19
      - 14
      - 22
      - 19
      - 22
      - 34
      - 43
      - 18
      - 17
      - 31
      - 18
      - 12
      - 13
      - 21
      - 16
      - 10
      - 24
      - 22
      - 9
      - 12
      - 34
      - 26
      - 19
      - 71
      - 14
      - 21
      - 29
      - 12
      - 10
      - 9
      - 12
      - 26
      - 13
      - 15
      - 14
      - 25
      - 21
      - 13
      - 21
      - 18
      - 16
      - 20
      - 18
      - 50
      - 25
      - 12
      - 13
      - 16
      - 28
      - 14
      - 11
      - 25
      - 10
      - 19
      - 23
      - 27
      - 11
      - 34
      - 9
      - 12
      - 30
      - 15
      - 59
      - 13
      - 49
      - 39
      - 24
      - 33
      - 10
      - 66
      - 21
      - 30
      - 19
      - 17
      - 29
      - 25
      - 19
      - 83
      - 12
      - 12
      - 27
      - 12
      - 31
      - 17
      - 27
      - 18
      - 14
      - 16
      - 21
      - 13
      - 30
      - 34
      - 10
      - 15
      - 14
      - 18
      - 23
      - 36
      - 35
      - 16
      - 20
      - 15
      - 22
      - 9
      - 22
      - 22
      - 12
      - 13
      - 11
      - 22
      - 21
      - 48
      - 12
      - 14
      - 16
      - 44
      - 13
      - 14
      - 33
      - 32
      - 26
      - 24
      - 22
      - 27
      - 16
      - 20
      - 14
      - 12
      - 59
      - 13
      - 12
      - 22
      - 31
      - 31
      - 13
      - 14
      - 15
      - 35
      - 14
      - 28
      - 21
      - 15
      - 41
      - 22
      - 13
      - 21
      - 11
      - 35
      episode_reward:
      - 19.0
      - 13.0
      - 43.0
      - 26.0
      - 13.0
      - 16.0
      - 13.0
      - 12.0
      - 13.0
      - 27.0
      - 40.0
      - 18.0
      - 14.0
      - 16.0
      - 19.0
      - 19.0
      - 12.0
      - 13.0
      - 10.0
      - 24.0
      - 16.0
      - 18.0
      - 15.0
      - 11.0
      - 16.0
      - 63.0
      - 14.0
      - 15.0
      - 30.0
      - 12.0
      - 13.0
      - 20.0
      - 21.0
      - 20.0
      - 28.0
      - 29.0
      - 22.0
      - 20.0
      - 16.0
      - 14.0
      - 13.0
      - 17.0
      - 21.0
      - 12.0
      - 31.0
      - 25.0
      - 27.0
      - 19.0
      - 18.0
      - 28.0
      - 15.0
      - 19.0
      - 14.0
      - 22.0
      - 19.0
      - 22.0
      - 34.0
      - 43.0
      - 18.0
      - 17.0
      - 31.0
      - 18.0
      - 12.0
      - 13.0
      - 21.0
      - 16.0
      - 10.0
      - 24.0
      - 22.0
      - 9.0
      - 12.0
      - 34.0
      - 26.0
      - 19.0
      - 71.0
      - 14.0
      - 21.0
      - 29.0
      - 12.0
      - 10.0
      - 9.0
      - 12.0
      - 26.0
      - 13.0
      - 15.0
      - 14.0
      - 25.0
      - 21.0
      - 13.0
      - 21.0
      - 18.0
      - 16.0
      - 20.0
      - 18.0
      - 50.0
      - 25.0
      - 12.0
      - 13.0
      - 16.0
      - 28.0
      - 14.0
      - 11.0
      - 25.0
      - 10.0
      - 19.0
      - 23.0
      - 27.0
      - 11.0
      - 34.0
      - 9.0
      - 12.0
      - 30.0
      - 15.0
      - 59.0
      - 13.0
      - 49.0
      - 39.0
      - 24.0
      - 33.0
      - 10.0
      - 66.0
      - 21.0
      - 30.0
      - 19.0
      - 17.0
      - 29.0
      - 25.0
      - 19.0
      - 83.0
      - 12.0
      - 12.0
      - 27.0
      - 12.0
      - 31.0
      - 17.0
      - 27.0
      - 18.0
      - 14.0
      - 16.0
      - 21.0
      - 13.0
      - 30.0
      - 34.0
      - 10.0
      - 15.0
      - 14.0
      - 18.0
      - 23.0
      - 36.0
      - 35.0
      - 16.0
      - 20.0
      - 15.0
      - 22.0
      - 9.0
      - 22.0
      - 22.0
      - 12.0
      - 13.0
      - 11.0
      - 22.0
      - 21.0
      - 48.0
      - 12.0
      - 14.0
      - 16.0
      - 44.0
      - 13.0
      - 14.0
      - 33.0
      - 32.0
      - 26.0
      - 24.0
      - 22.0
      - 27.0
      - 16.0
      - 20.0
      - 14.0
      - 12.0
      - 59.0
      - 13.0
      - 12.0
      - 22.0
      - 31.0
      - 31.0
      - 13.0
      - 14.0
      - 15.0
      - 35.0
      - 14.0
      - 28.0
      - 21.0
      - 15.0
      - 41.0
      - 22.0
      - 13.0
      - 21.0
      - 11.0
      - 35.0
    off_policy_estimator: {}
    policy_reward_max: {}
    policy_reward_mean: {}
    policy_reward_min: {}
    sampler_perf:
      mean_action_processing_ms: 0.06155790977082133
      mean_env_render_ms: 0.0
      mean_env_wait_ms: 0.04961143452632256
      mean_inference_ms: 0.5584241294994345
      mean_raw_obs_processing_ms: 0.09605169519383157
  time_since_restore: 7.9085540771484375
  time_this_iter_s: 7.9085540771484375
  time_total_s: 7.9085540771484375
  timers:
    learn_throughput: 2306.994
    learn_time_ms: 1866.064
    load_throughput: 22514312.618
    load_time_ms: 0.191
    training_iteration_time_ms: 7904.312
    update_time_ms: 2.387
  timestamp: 1653044289
  timesteps_since_restore: 0
  timesteps_total: 4305
  training_iteration: 1
  trial_id: ab506_00000
  warmup_time: 9.528029203414917
  
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 188.97it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 236.59it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 178.06it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 315.36it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 203.67it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 255.77it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 207.51it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 185.77it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 177.55it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 277.47it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 202.14it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 242.84it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 193.57it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 246.67it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 201.46it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 281.16it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 202.47it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 290.54it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 249.19it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 270.48it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 263.02it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 294.46it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 175.01it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 285.23it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 246.56it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 270.58it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 236.35it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 295.77it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 175.38it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 268.61it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 250.06it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 290.12it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 179.67it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 234.74it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 233.09it/s]
Write Progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 279.17it/s]
Result for AIRPPOTrainer_ab506_00000:
  agent_timesteps_total: 8665
  counters:
    num_agent_steps_sampled: 8665
    num_agent_steps_trained: 8665
    num_env_steps_sampled: 8665
    num_env_steps_trained: 8665
  custom_metrics: {}
  date: 2022-05-20_11-58-13
  done: true
  episode_len_mean: 46.31
  episode_media: {}
  episode_reward_max: 147.0
  episode_reward_mean: 46.31
  episode_reward_min: 11.0
  episodes_this_iter: 88
  episodes_total: 287
  experiment_id: d6ab9eba2e4e488384aa2e958fab71c8
  hostname: Kais-MacBook-Pro.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.30000001192092896
          cur_lr: 4.999999873689376e-05
          entropy: 0.6104190349578857
          entropy_coeff: 0.0
          kl: 0.015321698971092701
          model: {}
          policy_loss: -0.025790905579924583
          total_loss: 9.480770111083984
          vf_explained_var: -0.029562775045633316
          vf_loss: 9.50196361541748
        num_agent_steps_trained: 128.0
    num_agent_steps_sampled: 8665
    num_agent_steps_trained: 8665
    num_env_steps_sampled: 8665
    num_env_steps_trained: 8665
  iterations_since_restore: 2
  node_ip: 127.0.0.1
  num_agent_steps_sampled: 8665
  num_agent_steps_trained: 8665
  num_env_steps_sampled: 8665
  num_env_steps_sampled_this_iter: 4360
  num_env_steps_trained: 8665
  num_env_steps_trained_this_iter: 4360
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 24.18
    ram_util_percent: 62.260000000000005
  pid: 28838
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_action_processing_ms: 0.06236081053304994
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 0.05041366691869162
    mean_inference_ms: 0.5623494344695713
    mean_raw_obs_processing_ms: 0.09146254327599868
  sampler_results:
    custom_metrics: {}
    episode_len_mean: 46.31
    episode_media: {}
    episode_reward_max: 147.0
    episode_reward_mean: 46.31
    episode_reward_min: 11.0
    episodes_this_iter: 88
    hist_stats:
      episode_lengths:
      - 15
      - 35
      - 14
      - 28
      - 21
      - 15
      - 41
      - 22
      - 13
      - 21
      - 11
      - 35
      - 13
      - 24
      - 62
      - 35
      - 25
      - 37
      - 47
      - 112
      - 33
      - 22
      - 45
      - 24
      - 72
      - 19
      - 62
      - 67
      - 42
      - 113
      - 46
      - 28
      - 74
      - 96
      - 20
      - 24
      - 22
      - 31
      - 17
      - 14
      - 129
      - 32
      - 31
      - 27
      - 108
      - 62
      - 12
      - 45
      - 27
      - 45
      - 37
      - 93
      - 52
      - 54
      - 59
      - 86
      - 22
      - 38
      - 46
      - 16
      - 22
      - 37
      - 70
      - 13
      - 83
      - 78
      - 40
      - 147
      - 27
      - 81
      - 29
      - 21
      - 24
      - 42
      - 61
      - 58
      - 72
      - 16
      - 25
      - 52
      - 116
      - 22
      - 17
      - 76
      - 102
      - 26
      - 42
      - 81
      - 47
      - 22
      - 16
      - 59
      - 122
      - 86
      - 100
      - 19
      - 18
      - 18
      - 19
      - 107
      episode_reward:
      - 15.0
      - 35.0
      - 14.0
      - 28.0
      - 21.0
      - 15.0
      - 41.0
      - 22.0
      - 13.0
      - 21.0
      - 11.0
      - 35.0
      - 13.0
      - 24.0
      - 62.0
      - 35.0
      - 25.0
      - 37.0
      - 47.0
      - 112.0
      - 33.0
      - 22.0
      - 45.0
      - 24.0
      - 72.0
      - 19.0
      - 62.0
      - 67.0
      - 42.0
      - 113.0
      - 46.0
      - 28.0
      - 74.0
      - 96.0
      - 20.0
      - 24.0
      - 22.0
      - 31.0
      - 17.0
      - 14.0
      - 129.0
      - 32.0
      - 31.0
      - 27.0
      - 108.0
      - 62.0
      - 12.0
      - 45.0
      - 27.0
      - 45.0
      - 37.0
      - 93.0
      - 52.0
      - 54.0
      - 59.0
      - 86.0
      - 22.0
      - 38.0
      - 46.0
      - 16.0
      - 22.0
      - 37.0
      - 70.0
      - 13.0
      - 83.0
      - 78.0
      - 40.0
      - 147.0
      - 27.0
      - 81.0
      - 29.0
      - 21.0
      - 24.0
      - 42.0
      - 61.0
      - 58.0
      - 72.0
      - 16.0
      - 25.0
      - 52.0
      - 116.0
      - 22.0
      - 17.0
      - 76.0
      - 102.0
      - 26.0
      - 42.0
      - 81.0
      - 47.0
      - 22.0
      - 16.0
      - 59.0
      - 122.0
      - 86.0
      - 100.0
      - 19.0
      - 18.0
      - 18.0
      - 19.0
      - 107.0
    off_policy_estimator: {}
    policy_reward_max: {}
    policy_reward_mean: {}
    policy_reward_min: {}
    sampler_perf:
      mean_action_processing_ms: 0.06236081053304994
      mean_env_render_ms: 0.0
      mean_env_wait_ms: 0.05041366691869162
      mean_inference_ms: 0.5623494344695713
      mean_raw_obs_processing_ms: 0.09146254327599868
  time_since_restore: 11.58330774307251
  time_this_iter_s: 3.6747536659240723
  time_total_s: 11.58330774307251
  timers:
    learn_throughput: 2418.754
    learn_time_ms: 1791.211
    load_throughput: 15739993.14
    load_time_ms: 0.275
    training_iteration_time_ms: 5786.655
    update_time_ms: 2.414
  timestamp: 1653044293
  timesteps_since_restore: 0
  timesteps_total: 8665
  training_iteration: 2
  trial_id: ab506_00000
  warmup_time: 9.528029203414917
  
2022-05-20 11:58:13,583	INFO tune.py:753 -- Total run time: 32.49 seconds (31.86 seconds for the tuning loop).

Then, we run training:

result = train_rl_bc_offline(path=path, num_workers=2, use_gpu=False)
Starting offline training
== Status ==
Current time: 2022-05-20 11:58:39 (running for 00:00:25.89)
Memory usage on this node: 9.8/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/8 CPUs, 0/0 GPUs, 0.0/4.13 GiB heap, 0.0/2.0 GiB objects
Result logdir: /Users/kai/ray_results/AIRBCTrainer_2022-05-20_11-58-14
Number of trials: 1/1 (1 TERMINATED)
Trial name                status      loc              iter  total time (s)  ts    reward  episode_reward_max  episode_reward_min  episode_len_mean
AIRBCTrainer_bef2c_00000  TERMINATED  127.0.0.1:28876  5     9.28            2297  nan     nan                 nan                 nan


(raylet) 2022-05-20 11:58:14,957	INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=63962 --object-store-name=/tmp/ray/session_2022-05-20_11-57-36_849562_28764/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-20_11-57-36_849562_28764/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=64061 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:64346 --redis-password=5241590000000000 --startup-token=15 --runtime-env-hash=-2010331134
(pid=28876) 2022-05-20 11:58:21,630	WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!
(AIRBCTrainer pid=28876) 2022-05-20 11:58:21,973	INFO trainer.py:1728 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.
(AIRBCTrainer pid=28876) 2022-05-20 11:58:21,973	WARNING deprecation.py:47 -- DeprecationWarning: `config['multiagent']['replay_mode']` has been deprecated. config['replay_buffer_config']['replay_mode'] This will raise an error in the future!
(AIRBCTrainer pid=28876) 2022-05-20 11:58:21,973	INFO utils.py:241 -- No value for key `replay_batch_size` in replay_buffer_config. config['replay_buffer_config']['replay_batch_size'] will be automatically set to config['train_batch_size']
(AIRBCTrainer pid=28876) 2022-05-20 11:58:21,974	INFO trainer.py:328 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
Read:   0%|          | 0/2 [00:00<?, ?it/s]
Read: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:00<00:00, 19.56it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:00<00:00, 42.83it/s]
(raylet) 2022-05-20 11:58:22,976	INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=63962 --object-store-name=/tmp/ray/session_2022-05-20_11-57-36_849562_28764/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-20_11-57-36_849562_28764/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=64061 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:64346 --redis-password=5241590000000000 --startup-token=16 --runtime-env-hash=-2010331134
(raylet) 2022-05-20 11:58:22,988	INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=63962 --object-store-name=/tmp/ray/session_2022-05-20_11-57-36_849562_28764/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-20_11-57-36_849562_28764/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=64061 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:64346 --redis-password=5241590000000000 --startup-token=17 --runtime-env-hash=-2010331134
(RolloutWorker pid=28883) 2022-05-20 11:58:29,734	WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!
(RolloutWorker pid=28882) 2022-05-20 11:58:29,734	WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!
(RolloutWorker pid=28883) DatasetReader  2  has  57  samples.
(RolloutWorker pid=28882) DatasetReader  1  has  57  samples.
(AIRBCTrainer pid=28876) 2022-05-20 11:58:30,346	WARNING deprecation.py:47 -- DeprecationWarning: `simple_optimizer` has been deprecated. This will raise an error in the future!
(AIRBCTrainer pid=28876) 2022-05-20 11:58:30,346	WARNING deprecation.py:47 -- DeprecationWarning: `config['multiagent']['replay_mode']` has been deprecated. config['replay_buffer_config']['replay_mode'] This will raise an error in the future!
(AIRBCTrainer pid=28876) 2022-05-20 11:58:30,402	WARNING util.py:65 -- Install gputil for GPU system monitoring.
Stage 0:   0%|          | 0/1 [00:00<?, ?it/s]
Stage 0:   0%|          | 0/1 [00:00<?, ?it/s]
(raylet) 2022-05-20 11:58:31,224	INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=63962 --object-store-name=/tmp/ray/session_2022-05-20_11-57-36_849562_28764/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-20_11-57-36_849562_28764/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=64061 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:64346 --redis-password=5241590000000000 --startup-token=18 --runtime-env-hash=-2010331134
(RolloutWorker pid=28893) 2022-05-20 11:58:37,819	WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!
Result for AIRBCTrainer_bef2c_00000:
  agent_timesteps_total: 445
  counters:
    num_agent_steps_sampled: 445
    num_agent_steps_trained: 2000
    num_env_steps_sampled: 445
    num_env_steps_trained: 2000
  custom_metrics: {}
  date: 2022-05-20_11-58-38
  done: false
  episode_len_mean: .nan
  episode_media: {}
  episode_reward_max: .nan
  episode_reward_mean: .nan
  episode_reward_min: .nan
  episodes_this_iter: 0
  episodes_total: 0
  evaluation:
    custom_metrics: {}
    episode_len_mean: 22.5
    episode_media: {}
    episode_reward_max: 54.0
    episode_reward_mean: 22.5
    episode_reward_min: 10.0
    episodes_this_iter: 10
    hist_stats:
      episode_lengths:
      - 30
      - 10
      - 18
      - 54
      - 31
      - 14
      - 18
      - 16
      - 11
      - 23
      episode_reward:
      - 30.0
      - 10.0
      - 18.0
      - 54.0
      - 31.0
      - 14.0
      - 18.0
      - 16.0
      - 11.0
      - 23.0
    off_policy_estimator: {}
    policy_reward_max: {}
    policy_reward_mean: {}
    policy_reward_min: {}
    sampler_perf:
      mean_action_processing_ms: 0.05497447157328107
      mean_env_render_ms: 0.0
      mean_env_wait_ms: 0.04451886742515902
      mean_inference_ms: 0.4903911489301024
      mean_raw_obs_processing_ms: 0.07444250900133521
    timesteps_this_iter: 0
  experiment_id: e44358ccdd9e498cbd98dd52e498c2fb
  hostname: Kais-MacBook-Pro.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          model: {}
          policy_loss: 0.6931660175323486
          total_loss: 0.6931660175323486
        num_agent_steps_trained: 2000.0
    num_agent_steps_sampled: 445
    num_agent_steps_trained: 2000
    num_env_steps_sampled: 445
    num_env_steps_trained: 2000
  iterations_since_restore: 1
  node_ip: 127.0.0.1
  num_agent_steps_sampled: 445
  num_agent_steps_trained: 2000
  num_env_steps_sampled: 445
  num_env_steps_sampled_this_iter: 445
  num_env_steps_trained: 2000
  num_env_steps_trained_this_iter: 2000
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 9.483333333333333
    ram_util_percent: 60.383333333333326
  pid: 28876
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf: {}
  sampler_results:
    custom_metrics: {}
    episode_len_mean: .nan
    episode_media: {}
    episode_reward_max: .nan
    episode_reward_mean: .nan
    episode_reward_min: .nan
    episodes_this_iter: 0
    hist_stats:
      episode_lengths: []
      episode_reward: []
    off_policy_estimator: {}
    policy_reward_max: {}
    policy_reward_mean: {}
    policy_reward_min: {}
    sampler_perf: {}
  time_since_restore: 7.898306846618652
  time_this_iter_s: 7.898306846618652
  time_total_s: 7.898306846618652
  timers:
    learn_throughput: 21120.047
    learn_time_ms: 94.697
    load_throughput: 11881881.02
    load_time_ms: 0.168
    training_iteration_time_ms: 259.02
    update_time_ms: 1.614
  timestamp: 1653044318
  timesteps_since_restore: 0
  timesteps_total: 445
  training_iteration: 1
  trial_id: bef2c_00000
  warmup_time: 8.44019627571106
  
Result for AIRBCTrainer_bef2c_00000:
  agent_timesteps_total: 2297
  counters:
    num_agent_steps_sampled: 2297
    num_agent_steps_trained: 10000
    num_env_steps_sampled: 2297
    num_env_steps_trained: 10000
  custom_metrics: {}
  date: 2022-05-20_11-58-39
  done: true
  episode_len_mean: .nan
  episode_media: {}
  episode_reward_max: .nan
  episode_reward_mean: .nan
  episode_reward_min: .nan
  episodes_this_iter: 0
  episodes_total: 0
  evaluation:
    custom_metrics: {}
    episode_len_mean: 24.1
    episode_media: {}
    episode_reward_max: 43.0
    episode_reward_mean: 24.1
    episode_reward_min: 11.0
    episodes_this_iter: 10
    hist_stats:
      episode_lengths:
      - 11
      - 19
      - 27
      - 43
      - 33
      - 18
      - 19
      - 35
      - 15
      - 21
      episode_reward:
      - 11.0
      - 19.0
      - 27.0
      - 43.0
      - 33.0
      - 18.0
      - 19.0
      - 35.0
      - 15.0
      - 21.0
    off_policy_estimator: {}
    policy_reward_max: {}
    policy_reward_mean: {}
    policy_reward_min: {}
    sampler_perf:
      mean_action_processing_ms: 0.054491435182963496
      mean_env_render_ms: 0.0
      mean_env_wait_ms: 0.04467233881220088
      mean_inference_ms: 0.4441456947045478
      mean_raw_obs_processing_ms: 0.07285220421893394
    timesteps_this_iter: 0
  experiment_id: e44358ccdd9e498cbd98dd52e498c2fb
  hostname: Kais-MacBook-Pro.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          model: {}
          policy_loss: 0.6909552216529846
          total_loss: 0.6909552216529846
        num_agent_steps_trained: 2000.0
    num_agent_steps_sampled: 2297
    num_agent_steps_trained: 10000
    num_env_steps_sampled: 2297
    num_env_steps_trained: 10000
  iterations_since_restore: 5
  node_ip: 127.0.0.1
  num_agent_steps_sampled: 2297
  num_agent_steps_trained: 10000
  num_env_steps_sampled: 2297
  num_env_steps_sampled_this_iter: 493
  num_env_steps_trained: 10000
  num_env_steps_trained_this_iter: 2000
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 9.3
    ram_util_percent: 61.3
  pid: 28876
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf: {}
  sampler_results:
    custom_metrics: {}
    episode_len_mean: .nan
    episode_media: {}
    episode_reward_max: .nan
    episode_reward_mean: .nan
    episode_reward_min: .nan
    episodes_this_iter: 0
    hist_stats:
      episode_lengths: []
      episode_reward: []
    off_policy_estimator: {}
    policy_reward_max: {}
    policy_reward_mean: {}
    policy_reward_min: {}
    sampler_perf: {}
  time_since_restore: 9.279996871948242
  time_this_iter_s: 0.32008910179138184
  time_total_s: 9.279996871948242
  timers:
    learn_throughput: 86954.351
    learn_time_ms: 23.001
    load_throughput: 11342087.615
    load_time_ms: 0.176
    training_iteration_time_ms: 194.49
    update_time_ms: 1.59
  timestamp: 1653044319
  timesteps_since_restore: 0
  timesteps_total: 2297
  training_iteration: 5
  trial_id: bef2c_00000
  warmup_time: 8.44019627571106
  
2022-05-20 11:58:40,413	INFO tune.py:753 -- Total run time: 26.38 seconds (25.84 seconds for the tuning loop).
Read progress: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:00<00:00, 11.78it/s]

And then, using the obtained checkpoint, we evaluate the policy on a fresh environment:

num_eval_episodes = 3

rewards = evaluate_using_checkpoint(result.checkpoint, num_episodes=num_eval_episodes)
print(f"Average reward over {num_eval_episodes} episodes: " f"{np.mean(rewards)}")
2022-05-20 11:58:40,636	INFO trainer.py:1728 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.
2022-05-20 11:58:40,637	WARNING deprecation.py:47 -- DeprecationWarning: `simple_optimizer` has been deprecated. This will raise an error in the future!
2022-05-20 11:58:40,637	WARNING deprecation.py:47 -- DeprecationWarning: `config['multiagent']['replay_mode']` has been deprecated. config['replay_buffer_config']['replay_mode'] This will raise an error in the future!
2022-05-20 11:58:40,638	INFO trainer.py:328 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
Read: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:01<00:00,  1.58it/s]
Repartition: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:00<00:00,  6.84it/s]
(RolloutWorker pid=28906) 2022-05-20 11:58:49,326	WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!
(RolloutWorker pid=28907) 2022-05-20 11:58:49,324	WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!
(RolloutWorker pid=28906) DatasetReader  1  has  57  samples.
(RolloutWorker pid=28907) DatasetReader  2  has  57  samples.
2022-05-20 11:58:49,953	WARNING deprecation.py:47 -- DeprecationWarning: `simple_optimizer` has been deprecated. This will raise an error in the future!
2022-05-20 11:58:49,954	WARNING deprecation.py:47 -- DeprecationWarning: `config['multiagent']['replay_mode']` has been deprecated. config['replay_buffer_config']['replay_mode'] This will raise an error in the future!
2022-05-20 11:58:50,013	WARNING util.py:65 -- Install gputil for GPU system monitoring.
2022-05-20 11:58:50,042	INFO trainable.py:589 -- Restored on 127.0.0.1 from checkpoint: /Users/kai/ray_results/AIRBCTrainer_2022-05-20_11-58-14/AIRBCTrainer_bef2c_00000_0_2022-05-20_11-58-14/checkpoint_000005/checkpoint-5
2022-05-20 11:58:50,043	INFO trainable.py:597 -- Current state after restoring: {'_iteration': 5, '_timesteps_total': None, '_time_total': 9.279996871948242, '_episodes_total': 0}
Average reward over 3 episodes: 41.333333333333336
(RolloutWorker pid=28913) 2022-05-20 11:58:56,934	WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!
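
Finally, if you’re done experimenting, you can optionally shut down Ray:

# Optional cleanup: disconnect this process from Ray and stop the local cluster.
ray.shutdown()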