Offline reinforcement learning with Ray AIR
In this example, we'll train a reinforcement learning agent using offline training.
Offline training means that the data from the environment (and the actions performed by the agent) have been stored on disk. In contrast, online training samples experiences live by interacting with the environment.
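To make the distinction concrete, here is a tiny self-contained sketch (not part of the example itself; it assumes the gymnasium package we install below, and offline_path is only a placeholder) of online sampling versus reading back stored trajectories:

import gymnasium as gym

# Online: experiences are produced live by stepping the environment.
env = gym.make("CartPole-v1")
obs, _ = env.reset()
for _ in range(10):
    action = env.action_space.sample()  # a trained agent would choose this
    obs, reward, terminated, truncated, _ = env.step(action)
    if terminated or truncated:
        obs, _ = env.reset()

# Offline: experiences were recorded earlier; we only read them back, e.g.
#     dataset = ray.data.read_json(offline_path)
# and never interact with the environment while training.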
Let's start by installing our dependencies:
# !pip install -qU "ray[rllib]" gymnasium
Now we can run some imports:
import argparse
import gymnasium as gym
import os
import numpy as np
import ray
from ray.air import Checkpoint
from ray.air.config import RunConfig
from ray.train.rl.rl_predictor import RLPredictor
from ray.train.rl.rl_trainer import RLTrainer
from ray.air.config import ScalingConfig
from ray.air.result import Result
from ray.rllib.algorithms.bc import BC
from ray.tune.tuner import Tuner
We will be training on offline data - this means we have full agent trajectories stored somewhere on disk and want to train on these past experiences.
Usually this data would come from external systems or a database of historical data. For this example, however, we'll generate some offline data ourselves and store it using RLlib's `output_config`.
def generate_offline_data(path: str):
    print(f"Generating offline data for training at {path}")
    trainer = RLTrainer(
        algorithm="PPO",
        run_config=RunConfig(stop={"timesteps_total": 5000}),
        config={
            "env": "CartPole-v1",
            "output": "dataset",
            "output_config": {
                "format": "json",
                "path": path,
                "max_num_samples_per_file": 1,
            },
            "batch_mode": "complete_episodes",
        },
    )

    trainer.fit()
Here we define the training function. It creates an `RLTrainer` that uses the `BC` (behavior cloning) algorithm and trains on the offline CartPole-v1 data provided in `path`. Because training itself never touches the environment, the evaluation workers are configured to sample from a live environment (`"input": "sampler"`) so we can track how the learned policy actually performs.
def train_rl_bc_offline(path: str, num_workers: int, use_gpu: bool = False) -> Result:
    print("Starting offline training")
    dataset = ray.data.read_json(
        path, parallelism=num_workers, ray_remote_args={"num_cpus": 1}
    )

    trainer = RLTrainer(
        run_config=RunConfig(stop={"training_iteration": 5}),
        scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
        datasets={"train": dataset},
        algorithm=BC,
        config={
            "env": "CartPole-v1",
            "framework": "tf",
            "evaluation_num_workers": 1,
            "evaluation_interval": 1,
            "evaluation_config": {"input": "sampler"},
        },
    )

    # Todo (krfricke/xwjiang): Enable checkpoint config in RunConfig
    # result = trainer.fit()
    tuner = Tuner(
        trainer,
        _tuner_kwargs={"checkpoint_at_end": True},
    )
    result = tuner.fit()[0]
    return result
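The returned Result carries both the final training metrics and the checkpoint we evaluate further below. A minimal sketch of how one might inspect it (summarize_result is just an illustrative helper, not part of the example):

def summarize_result(result: Result) -> None:
    # result.metrics holds the last reported training result dict;
    # result.checkpoint is what we later hand to RLPredictor.
    print("Training iterations:", result.metrics.get("training_iteration"))
    print("Total timesteps:", result.metrics.get("timesteps_total"))
    print("Checkpoint:", result.checkpoint)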
Once we've trained our RL policy, we want to evaluate it on a fresh environment. For this, we also define a utility function:
def evaluate_using_checkpoint(checkpoint: Checkpoint, num_episodes) -> list:
    predictor = RLPredictor.from_checkpoint(checkpoint)

    env = gym.make("CartPole-v1")

    rewards = []
    for i in range(num_episodes):
        obs, _ = env.reset()
        reward = 0.0
        terminated = truncated = False
        while not terminated and not truncated:
            action = predictor.predict(np.array([obs]))
            obs, r, terminated, truncated, _ = env.step(action[0])
            reward += r
        rewards.append(reward)
    return rewards
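As a variation, gymnasium's RecordEpisodeStatistics wrapper can track episode returns for us. The following is only an alternative sketch of the same loop (assuming the wrapper's info["episode"]["r"] field at episode end), not what the example uses:

def evaluate_with_wrapper(checkpoint: Checkpoint, num_episodes: int) -> list:
    # Same idea as evaluate_using_checkpoint, but the wrapper records the
    # episode return and exposes it in the final step's info dict.
    predictor = RLPredictor.from_checkpoint(checkpoint)
    env = gym.wrappers.RecordEpisodeStatistics(gym.make("CartPole-v1"))
    rewards = []
    for _ in range(num_episodes):
        obs, _ = env.reset()
        terminated = truncated = False
        info = {}
        while not terminated and not truncated:
            action = predictor.predict(np.array([obs]))
            obs, _, terminated, truncated, info = env.step(action[0])
        rewards.append(float(info["episode"]["r"]))
    return rewards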
Let's put it all together. First, we initialize Ray and create the offline data:
ray.init(num_cpus=8)
path = "/tmp/out"
generate_offline_data(path)
2022-09-26 18:22:15,032 INFO worker.py:1509 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
Generating offline data for training at /tmp/out
Tune Status
Current time: 2022-09-26 18:22:31
Running for: 00:00:15.61
Memory: 10.4/62.7 GiB
System Info
Using FIFO scheduling algorithm. Resources requested: 0/8 CPUs, 0/0 GPUs, 0.0/33.26 GiB heap, 0.0/16.63 GiB objects
Trial Status
Trial name | status | loc | iter | total time (s) | ts | reward | num_recreated_workers | episode_reward_max | episode_reward_min |
---|---|---|---|---|---|---|---|---|---|
AIRPPO_d229c_00000 | TERMINATED | 192.168.1.241:3893828 | 2 | 8.77525 | 8528 | 45.76 | 0 | 137 | 10 |
(AIRPPO pid=3893828) 2022-09-26 18:22:18,476 INFO algorithm.py:2104 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.
(AIRPPO pid=3893828) 2022-09-26 18:22:18,476 INFO ppo.py:379 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
(AIRPPO pid=3893828) 2022-09-26 18:22:18,477 INFO algorithm.py:356 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(RolloutWorker pid=3893857) 2022-09-26 18:22:21,261 WARNING env.py:159 -- Your env reset() method appears to take 'seed' or 'return_info' arguments. Note that these are not yet supported in RLlib. Seeding will take place using 'env.seed()' and the info dict will not be returned from reset.
(AIRPPO pid=3893828) 2022-09-26 18:22:22,261 WARNING util.py:66 -- Install gputil for GPU system monitoring.
Trial Progress
Trial name | agent_timesteps_total | counters | custom_metrics | date | done | episode_len_mean | episode_media | episode_reward_max | episode_reward_mean | episode_reward_min | episodes_this_iter | episodes_total | experiment_id | hostname | info | iterations_since_restore | node_ip | num_agent_steps_sampled | num_agent_steps_trained | num_env_steps_sampled | num_env_steps_sampled_this_iter | num_env_steps_trained | num_env_steps_trained_this_iter | num_faulty_episodes | num_healthy_workers | num_recreated_workers | num_steps_trained_this_iter | perf | pid | policy_reward_max | policy_reward_mean | policy_reward_min | sampler_perf | sampler_results | time_since_restore | time_this_iter_s | time_total_s | timers | timestamp | timesteps_since_restore | timesteps_total | training_iteration | trial_id | warmup_time |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AIRPPO_d229c_00000 | 8528 | {'num_env_steps_sampled': 8528, 'num_env_steps_trained': 8528, 'num_agent_steps_sampled': 8528, 'num_agent_steps_trained': 8528} | {} | 2022-09-26_18-22-31 | True | 45.76 | {} | 137 | 45.76 | 10 | 84 | 284 | eadfde34443046629ed77655da6915c9 | corvus | {'learner': {'default_policy': {'learner_stats': {'cur_kl_coeff': 0.30000001192092896, 'cur_lr': 4.999999873689376e-05, 'total_loss': 9.522293, 'policy_loss': -0.03154374, 'vf_loss': 9.54884, 'vf_explained_var': -0.011132962, 'kl': 0.016653905, 'entropy': 0.6111665, 'entropy_coeff': 0.0, 'model': {}}, 'custom_metrics': {}, 'num_agent_steps_trained': 128.0}}, 'num_env_steps_sampled': 8528, 'num_env_steps_trained': 8528, 'num_agent_steps_sampled': 8528, 'num_agent_steps_trained': 8528} | 2 | 192.168.1.241 | 8528 | 8528 | 8528 | 4238 | 8528 | 4238 | 0 | 2 | 0 | 4238 | {'cpu_util_percent': 16.94, 'ram_util_percent': 16.5} | 3893828 | {} | {} | {} | {'mean_raw_obs_processing_ms': 0.2064514408664396, 'mean_inference_ms': 0.31645616264795123, 'mean_action_processing_ms': 0.032597069914330125, 'mean_env_wait_ms': 0.027492389739157415, 'mean_env_render_ms': 0.0} | {'episode_reward_max': 137.0, 'episode_reward_min': 10.0, 'episode_reward_mean': 45.76, 'episode_len_mean': 45.76, 'episode_media': {}, 'episodes_this_iter': 84, 'policy_reward_min': {}, 'policy_reward_max': {}, 'policy_reward_mean': {}, 'custom_metrics': {}, 'hist_stats': {'episode_reward': [22.0, 17.0, 27.0, 14.0, 13.0, 10.0, 16.0, 12.0, 25.0, 29.0, 22.0, 27.0, 18.0, 26.0, 35.0, 25.0, 41.0, 23.0, 69.0, 56.0, 53.0, 30.0, 120.0, 40.0, 38.0, 86.0, 10.0, 19.0, 137.0, 43.0, 72.0, 119.0, 21.0, 53.0, 45.0, 36.0, 14.0, 35.0, 69.0, 100.0, 118.0, 48.0, 12.0, 21.0, 12.0, 30.0, 59.0, 34.0, 72.0, 63.0, 50.0, 42.0, 32.0, 28.0, 44.0, 59.0, 19.0, 86.0, 32.0, 69.0, 47.0, 62.0, 73.0, 13.0, 72.0, 36.0, 12.0, 49.0, 17.0, 117.0, 19.0, 13.0, 24.0, 12.0, 17.0, 23.0, 49.0, 22.0, 86.0, 79.0, 92.0, 21.0, 101.0, 30.0, 12.0, 62.0, 80.0, 32.0, 18.0, 95.0, 18.0, 35.0, 80.0, 69.0, 72.0, 116.0, 67.0, 83.0, 35.0, 19.0], 'episode_lengths': [22, 17, 27, 14, 13, 10, 16, 12, 25, 29, 22, 27, 18, 26, 35, 25, 41, 23, 69, 56, 53, 30, 120, 40, 38, 86, 10, 19, 137, 43, 72, 119, 21, 53, 45, 36, 14, 35, 69, 100, 118, 48, 12, 21, 12, 30, 59, 34, 72, 63, 50, 42, 32, 28, 44, 59, 19, 86, 32, 69, 47, 62, 73, 13, 72, 36, 12, 49, 17, 117, 19, 13, 24, 12, 17, 23, 49, 22, 86, 79, 92, 21, 101, 30, 12, 62, 80, 32, 18, 95, 18, 35, 80, 69, 72, 116, 67, 83, 35, 19]}, 'sampler_perf': {'mean_raw_obs_processing_ms': 0.2064514408664396, 'mean_inference_ms': 0.31645616264795123, 'mean_action_processing_ms': 0.032597069914330125, 'mean_env_wait_ms': 0.027492389739157415, 'mean_env_render_ms': 0.0}, 'num_faulty_episodes': 0} | 8.77525 | 3.47446 | 8.77525 | {'training_iteration_time_ms': 4384.142, 'load_time_ms': 0.287, 'load_throughput': 14835762.966, 'learn_time_ms': 2129.317, 'learn_throughput': 2002.52, 'synch_weights_time_ms': 1.33} | 1664241751 | 0 | 8528 | 2 | d229c_00000 | 3.78836 |
2022-09-26 18:22:31,518 INFO tune.py:762 -- Total run time: 16.00 seconds (15.59 seconds for the tuning loop).
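Before training on these records, it can be useful to peek at what was written. A minimal, optional sketch using Ray Datasets (ray.data.read_json is the same call the training function uses; offline_ds is just an illustrative name):

# Optional sanity check: inspect the experiences recorded under `path`.
offline_ds = ray.data.read_json(path)
print("Number of records:", offline_ds.count())
print(offline_ds.take(1))  # one record of batched trajectory columns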
Then, we run training:
result = train_rl_bc_offline(path=path, num_workers=2, use_gpu=False)
Starting offline training
Tune Status
Current time: 2022-09-26 18:22:55
Running for: 00:00:10.97
Memory: 10.9/62.7 GiB
System Info
Using FIFO scheduling algorithm. Resources requested: 0/8 CPUs, 0/0 GPUs, 0.0/33.26 GiB heap, 0.0/16.63 GiB objects
Trial Status
Trial name | status | loc | iter | total time (s) | ts | reward | num_recreated_workers | episode_reward_max | episode_reward_min |
---|---|---|---|---|---|---|---|---|---|
AIRBC_e3afc_00000 | TERMINATED | 192.168.1.241:3894380 | 5 | 0.996612 | 11084 | nan | 0 | nan | nan |
(AIRBC pid=3894380) 2022-09-26 18:22:47,815 INFO algorithm.py:2104 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.
(AIRBC pid=3894380) 2022-09-26 18:22:47,816 INFO algorithm.py:356 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
Read: 0%| | 0/2 [00:00<?, ?it/s]
Read: 50%|█████ | 1/2 [00:00<00:00, 3.37it/s]
Read: 100%|██████████| 2/2 [00:00<00:00, 5.58it/s]
Repartition: 0%| | 0/2 [00:00<?, ?it/s]
Repartition: 100%|██████████| 2/2 [00:00<00:00, 14.03it/s]
(RolloutWorker pid=3894910) 2022-09-26 18:22:51,123 WARNING env.py:159 -- Your env reset() method appears to take 'seed' or 'return_info' arguments. Note that these are not yet supported in RLlib. Seeding will take place using 'env.seed()' and the info dict will not be returned from reset.
(RolloutWorker pid=3894910) DatasetReader 1 has 239, samples.
(RolloutWorker pid=3894911) DatasetReader 2 has 239, samples.
(AIRBC pid=3894380) 2022-09-26 18:22:51,756 WARNING deprecation.py:47 -- DeprecationWarning: `simple_optimizer` has been deprecated. This will raise an error in the future!
(RolloutWorker pid=3895021) 2022-09-26 18:22:54,479 WARNING env.py:159 -- Your env reset() method appears to take 'seed' or 'return_info' arguments. Note that these are not yet supported in RLlib. Seeding will take place using 'env.seed()' and the info dict will not be returned from reset.
(AIRBC pid=3894380) 2022-09-26 18:22:54,735 WARNING util.py:66 -- Install gputil for GPU system monitoring.
Trial Progress
Trial name | agent_timesteps_total | counters | custom_metrics | date | done | episode_len_mean | episode_media | episode_reward_max | episode_reward_mean | episode_reward_min | episodes_this_iter | episodes_total | evaluation | experiment_id | hostname | info | iterations_since_restore | node_ip | num_agent_steps_sampled | num_agent_steps_trained | num_env_steps_sampled | num_env_steps_sampled_this_iter | num_env_steps_trained | num_env_steps_trained_this_iter | num_faulty_episodes | num_healthy_workers | num_recreated_workers | num_steps_trained_this_iter | perf | pid | policy_reward_max | policy_reward_mean | policy_reward_min | sampler_perf | sampler_results | time_since_restore | time_this_iter_s | time_total_s | timers | timestamp | timesteps_since_restore | timesteps_total | training_iteration | trial_id | warmup_time |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AIRBC_e3afc_00000 | 11084 | {'num_env_steps_sampled': 11084, 'num_env_steps_trained': 11084, 'num_agent_steps_sampled': 11084, 'num_agent_steps_trained': 11084} | {} | 2022-09-26_18-22-55 | True | nan | {} | nan | nan | nan | 0 | 0 | {'episode_reward_max': 24.0, 'episode_reward_min': 10.0, 'episode_reward_mean': 16.9, 'episode_len_mean': 16.9, 'episode_media': {}, 'episodes_this_iter': 10, 'policy_reward_min': {}, 'policy_reward_max': {}, 'policy_reward_mean': {}, 'custom_metrics': {}, 'hist_stats': {'episode_reward': [22.0, 12.0, 10.0, 21.0, 24.0, 19.0, 15.0, 17.0, 17.0, 12.0], 'episode_lengths': [22, 12, 10, 21, 24, 19, 15, 17, 17, 12]}, 'sampler_perf': {'mean_raw_obs_processing_ms': 0.1389556124250126, 'mean_inference_ms': 0.3053837110354277, 'mean_action_processing_ms': 0.031036834474401493, 'mean_env_wait_ms': 0.0260694335622386, 'mean_env_render_ms': 0.0}, 'num_faulty_episodes': 0, 'num_agent_steps_sampled_this_iter': 169, 'num_env_steps_sampled_this_iter': 169, 'timesteps_this_iter': 169, 'num_healthy_workers': 1, 'num_recreated_workers': 0} | 21b4e50f0a544d479bf6794c0eedc65a | corvus | {'learner': {'default_policy': {'learner_stats': {'policy_loss': 0.69113123, 'total_loss': 0.69113123, 'model': {}}, 'custom_metrics': {}, 'num_agent_steps_trained': 2000.0}}, 'num_env_steps_sampled': 11084, 'num_env_steps_trained': 11084, 'num_agent_steps_sampled': 11084, 'num_agent_steps_trained': 11084} | 5 | 192.168.1.241 | 11084 | 11084 | 11084 | 2270 | 11084 | 2270 | 0 | 2 | 0 | 2270 | {} | 3894380 | {} | {} | {} | {} | {'episode_reward_max': nan, 'episode_reward_min': nan, 'episode_reward_mean': nan, 'episode_len_mean': nan, 'episode_media': {}, 'episodes_this_iter': 0, 'policy_reward_min': {}, 'policy_reward_max': {}, 'policy_reward_mean': {}, 'custom_metrics': {}, 'hist_stats': {'episode_reward': [], 'episode_lengths': []}, 'sampler_perf': {}, 'num_faulty_episodes': 0} | 0.996612 | 0.116935 | 0.996612 | {'training_iteration_time_ms': 55.867, 'sample_time_ms': 32.326, 'load_time_ms': 0.227, 'load_throughput': 9744218.306, 'learn_time_ms': 21.78, 'learn_throughput': 101779.38, 'synch_weights_time_ms': 1.468} | 1664241775 | 0 | 11084 | 5 | e3afc_00000 | 6.92411 |
2022-09-26 18:22:56,255 INFO tune.py:762 -- Total run time: 11.35 seconds (10.95 seconds for the tuning loop).
And then, using the obtained checkpoint, we evaluate the policy on a fresh environment:
num_eval_episodes = 3
rewards = evaluate_using_checkpoint(result.checkpoint, num_episodes=num_eval_episodes)
print(f"Average reward over {num_eval_episodes} episodes: " f"{np.mean(rewards)}")
2022-09-26 18:23:01,591 INFO algorithm.py:2104 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.
2022-09-26 18:23:01,591 WARNING deprecation.py:47 -- DeprecationWarning: `simple_optimizer` has been deprecated. This will raise an error in the future!
2022-09-26 18:23:01,593 INFO algorithm.py:356 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
Read: 100%|██████████| 2/2 [00:00<00:00, 2.54it/s]
Repartition: 100%|██████████| 2/2 [00:00<00:00, 10.71it/s]
(RolloutWorker pid=3895623) 2022-09-26 18:23:05,106 WARNING env.py:159 -- Your env reset() method appears to take 'seed' or 'return_info' arguments. Note that these are not yet supported in RLlib. Seeding will take place using 'env.seed()' and the info dict will not be returned from reset.
(RolloutWorker pid=3895623) DatasetReader 1 has 239, samples.
(RolloutWorker pid=3895624) DatasetReader 2 has 239, samples.
2022-09-26 18:23:05,788 WARNING deprecation.py:47 -- DeprecationWarning: `simple_optimizer` has been deprecated. This will raise an error in the future!
(RolloutWorker pid=3895732) 2022-09-26 18:23:08,264 WARNING env.py:159 -- Your env reset() method appears to take 'seed' or 'return_info' arguments. Note that these are not yet supported in RLlib. Seeding will take place using 'env.seed()' and the info dict will not be returned from reset.
2022-09-26 18:23:08,520 WARNING util.py:66 -- Install gputil for GPU system monitoring.
2022-09-26 18:23:08,568 INFO trainable.py:690 -- Restored on 192.168.1.241 from checkpoint: /home/pdmurray/ray_results/AIRBC_2022-09-26_18-22-44/AIRBC_e3afc_00000_0_2022-09-26_18-22-45/checkpoint_000005
2022-09-26 18:23:08,569 INFO trainable.py:699 -- Current state after restoring: {'_iteration': 5, '_timesteps_total': None, '_time_total': 0.9966120719909668, '_episodes_total': 0}
Average reward over 3 episodes: 23.666666666666668
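For context, the behavior-cloned policy can be compared against a uniformly random policy on the same environment; a minimal sketch, not part of the original run:

def random_baseline(num_episodes: int = 3) -> float:
    # Average return of a random policy on CartPole-v1, for comparison
    # with the behavior-cloned policy evaluated above.
    env = gym.make("CartPole-v1")
    returns = []
    for _ in range(num_episodes):
        obs, _ = env.reset()
        total, terminated, truncated = 0.0, False, False
        while not terminated and not truncated:
            obs, r, terminated, truncated, _ = env.step(env.action_space.sample())
            total += r
        returns.append(total)
    return float(np.mean(returns))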