Online reinforcement learning with Ray AIR
In this example, we’ll train a reinforcement learning agent using online training.
Online training means that the data from the environment is sampled while the algorithm is running. In contrast, offline training uses data that was collected and stored on disk beforehand.
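To make this distinction concrete, the snippet below is a minimal, RLlib-independent sketch of online sampling: a live environment is stepped and each freshly generated transition would be handed to the learner immediately instead of being read from a dataset. It is purely illustrative; the rollout workers started by RLlib take care of this for us in the actual example.
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, _ = env.reset()
for _ in range(100):
    # A trained policy would choose the action here; we sample randomly for illustration.
    action = env.action_space.sample()
    obs, reward, terminated, truncated, _ = env.step(action)
    # In online RL, this freshly sampled transition is consumed by the learner right away.
    if terminated or truncated:
        obs, _ = env.reset()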
Let’s start by installing our dependencies:
!pip install -qU "ray[rllib]" gymnasium
Now we can run some imports:
import argparse
import gymnasium as gym
import os
import numpy as np
import ray
from ray.air import Checkpoint
from ray.air.config import RunConfig
from ray.train.rl.rl_predictor import RLPredictor
from ray.train.rl.rl_trainer import RLTrainer
from ray.air.config import ScalingConfig
from ray.air.result import Result
from ray.rllib.algorithms.bc import BC
from ray.tune.tuner import Tuner
2022-05-19 13:54:16,520 WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!
2022-05-19 13:54:16,531 WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.agents.marwil` has been deprecated. Use `ray.rllib.algorithms.marwil` instead. This will raise an error in the future!
Here we define the training function. It creates an RLTrainer using the PPO algorithm and kicks off training on the CartPole-v1 environment:
def train_rl_ppo_online(num_workers: int, use_gpu: bool = False) -> Result:
    print("Starting online training")
    trainer = RLTrainer(
        run_config=RunConfig(stop={"training_iteration": 5}),
        scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
        algorithm="PPO",
        config={
            "env": "CartPole-v1",
            "framework": "tf",
        },
    )
    # Todo (krfricke/xwjiang): Enable checkpoint config in RunConfig
    # result = trainer.fit()
    tuner = Tuner(
        trainer,
        _tuner_kwargs={"checkpoint_at_end": True},
    )
    result = tuner.fit()[0]
    return result
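For reference, the returned Result object carries both the metrics from the last training report and the checkpoint created at the end of the trial. A minimal sketch of how these could be inspected (the helper name is ours, not part of the Ray API):
def summarize_result(result: Result) -> None:
    # result.metrics is the dict of metrics from the last reported training iteration.
    print("Mean episode reward:", result.metrics.get("episode_reward_mean"))
    # result.checkpoint is the checkpoint written at the end of training; we use it below.
    print("Checkpoint:", result.checkpoint)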
Once we have trained our RL policy, we want to evaluate it on a fresh environment. For this, we define a utility function:
def evaluate_using_checkpoint(checkpoint: Checkpoint, num_episodes) -> list:
    predictor = RLPredictor.from_checkpoint(checkpoint)
    env = gym.make("CartPole-v1")
    rewards = []
    for i in range(num_episodes):
        obs, _ = env.reset()
        reward = 0.0
        terminated = truncated = False
        while not terminated and not truncated:
            action = predictor.predict(np.array([obs]))
            obs, r, terminated, truncated, _ = env.step(action[0])
            reward += r
        rewards.append(reward)
    return rewards
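Note that the predictor works on batches: we wrap the single observation in np.array([obs]) to form a batch of one and then take action[0] to get the scalar action that the environment expects.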
Let’s put it all together. First, we run training:
result = train_rl_ppo_online(num_workers=2, use_gpu=False)
2022-05-19 13:54:16,582 WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.agents.dqn.dqn.DEFAULT_CONFIG` has been deprecated. Use `ray.rllib.agents.dqn.dqn.DQNConfig(...)` instead. This will raise an error in the future!
Starting online training
2022-05-19 13:54:19,326 INFO services.py:1483 -- View the Ray dashboard at http://127.0.0.1:8267
== Status ==
Current time: 2022-05-19 13:54:57 (running for 00:00:35.99)
Memory usage on this node: 9.6/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/4.54 GiB heap, 0.0/2.0 GiB objects
Result logdir: /Users/kai/ray_results/AIRPPOTrainer_2022-05-19_13-54-16
Number of trials: 1/1 (1 TERMINATED)
| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|
| AIRPPOTrainer_cd8d6_00000 | TERMINATED | 127.0.0.1:14174 | 5 | 16.7029 | 20000 | 124.79 | 200 | 9 | 124.79 |
(raylet) 2022-05-19 13:54:23,061 INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=63729 --object-store-name=/tmp/ray/session_2022-05-19_13-54-16_649144_14093/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_13-54-16_649144_14093/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=63909 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:65260 --redis-password=5241590000000000 --startup-token=16 --runtime-env-hash=-2010331134
(pid=14174) 2022-05-19 13:54:30,271 WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!
(AIRPPOTrainer pid=14174) 2022-05-19 13:54:30,749 INFO trainer.py:1728 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.
(AIRPPOTrainer pid=14174) 2022-05-19 13:54:30,750 INFO ppo.py:361 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
(AIRPPOTrainer pid=14174) 2022-05-19 13:54:30,750 INFO trainer.py:328 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(raylet) 2022-05-19 13:54:31,857 INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=63729 --object-store-name=/tmp/ray/session_2022-05-19_13-54-16_649144_14093/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_13-54-16_649144_14093/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=63909 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:65260 --redis-password=5241590000000000 --startup-token=17 --runtime-env-hash=-2010331134
(raylet) 2022-05-19 13:54:31,857 INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=63729 --object-store-name=/tmp/ray/session_2022-05-19_13-54-16_649144_14093/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_13-54-16_649144_14093/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=63909 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:65260 --redis-password=5241590000000000 --startup-token=18 --runtime-env-hash=-2010331134
(RolloutWorker pid=14179) 2022-05-19 13:54:39,442 WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!
(RolloutWorker pid=14180) 2022-05-19 13:54:39,492 WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!
(AIRPPOTrainer pid=14174) 2022-05-19 13:54:40,836 INFO trainable.py:163 -- Trainable.setup took 10.087 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
(AIRPPOTrainer pid=14174) 2022-05-19 13:54:40,836 WARNING util.py:65 -- Install gputil for GPU system monitoring.
(AIRPPOTrainer pid=14174) 2022-05-19 13:54:42,569 WARNING deprecation.py:47 -- DeprecationWarning: `slice` has been deprecated. Use `SampleBatch[start:stop]` instead. This will raise an error in the future!
Result for AIRPPOTrainer_cd8d6_00000:
agent_timesteps_total: 4000
counters:
num_agent_steps_sampled: 4000
num_agent_steps_trained: 4000
num_env_steps_sampled: 4000
num_env_steps_trained: 4000
custom_metrics: {}
date: 2022-05-19_13-54-44
done: false
episode_len_mean: 22.11731843575419
episode_media: {}
episode_reward_max: 87.0
episode_reward_mean: 22.11731843575419
episode_reward_min: 8.0
episodes_this_iter: 179
episodes_total: 179
experiment_id: 158c57d8b6e142ad85b393db57c8bdff
hostname: Kais-MacBook-Pro.local
info:
learner:
default_policy:
custom_metrics: {}
learner_stats:
cur_kl_coeff: 0.20000000298023224
cur_lr: 4.999999873689376e-05
entropy: 0.6653298139572144
entropy_coeff: 0.0
kl: 0.02798665314912796
model: {}
policy_loss: -0.0422092080116272
total_loss: 8.986403465270996
vf_explained_var: -0.06533512473106384
vf_loss: 9.023015022277832
num_agent_steps_trained: 128.0
num_agent_steps_sampled: 4000
num_agent_steps_trained: 4000
num_env_steps_sampled: 4000
num_env_steps_trained: 4000
iterations_since_restore: 1
node_ip: 127.0.0.1
num_agent_steps_sampled: 4000
num_agent_steps_trained: 4000
num_env_steps_sampled: 4000
num_env_steps_sampled_this_iter: 4000
num_env_steps_trained: 4000
num_env_steps_trained_this_iter: 4000
num_healthy_workers: 2
off_policy_estimator: {}
perf:
cpu_util_percent: 24.849999999999998
ram_util_percent: 61.199999999999996
pid: 14174
policy_reward_max: {}
policy_reward_mean: {}
policy_reward_min: {}
sampler_perf:
mean_action_processing_ms: 0.06886580197141673
mean_env_render_ms: 0.0
mean_env_wait_ms: 0.05465748139159193
mean_inference_ms: 0.6132523881103351
mean_raw_obs_processing_ms: 0.10609273714105154
sampler_results:
custom_metrics: {}
episode_len_mean: 22.11731843575419
episode_media: {}
episode_reward_max: 87.0
episode_reward_mean: 22.11731843575419
episode_reward_min: 8.0
episodes_this_iter: 179
hist_stats:
episode_lengths:
- 28
- 9
- 12
- 23
- 13
- 21
- 15
- 16
- 19
- 44
- 14
- 19
- 19
- 17
- 17
- 12
- 9
- 48
- 43
- 15
- 21
- 25
- 16
- 14
- 22
- 21
- 24
- 53
- 21
- 16
- 17
- 14
- 20
- 22
- 18
- 17
- 14
- 11
- 46
- 12
- 18
- 21
- 13
- 58
- 10
- 20
- 14
- 25
- 22
- 33
- 23
- 10
- 25
- 11
- 32
- 48
- 12
- 12
- 10
- 24
- 15
- 28
- 14
- 16
- 14
- 21
- 12
- 13
- 8
- 12
- 13
- 10
- 10
- 14
- 30
- 16
- 23
- 47
- 14
- 22
- 11
- 18
- 12
- 21
- 21
- 20
- 18
- 29
- 18
- 24
- 50
- 87
- 21
- 41
- 21
- 34
- 47
- 20
- 26
- 14
- 9
- 24
- 16
- 18
- 44
- 28
- 37
- 10
- 19
- 11
- 56
- 11
- 28
- 16
- 14
- 19
- 23
- 11
- 22
- 63
- 22
- 13
- 29
- 11
- 64
- 44
- 45
- 38
- 17
- 18
- 21
- 13
- 12
- 13
- 10
- 17
- 14
- 16
- 10
- 19
- 25
- 15
- 50
- 13
- 10
- 15
- 12
- 15
- 11
- 14
- 17
- 17
- 14
- 49
- 18
- 13
- 28
- 31
- 19
- 26
- 31
- 29
- 21
- 23
- 17
- 23
- 32
- 35
- 10
- 11
- 30
- 21
- 16
- 15
- 23
- 40
- 24
- 24
- 14
episode_reward:
- 28.0
- 9.0
- 12.0
- 23.0
- 13.0
- 21.0
- 15.0
- 16.0
- 19.0
- 44.0
- 14.0
- 19.0
- 19.0
- 17.0
- 17.0
- 12.0
- 9.0
- 48.0
- 43.0
- 15.0
- 21.0
- 25.0
- 16.0
- 14.0
- 22.0
- 21.0
- 24.0
- 53.0
- 21.0
- 16.0
- 17.0
- 14.0
- 20.0
- 22.0
- 18.0
- 17.0
- 14.0
- 11.0
- 46.0
- 12.0
- 18.0
- 21.0
- 13.0
- 58.0
- 10.0
- 20.0
- 14.0
- 25.0
- 22.0
- 33.0
- 23.0
- 10.0
- 25.0
- 11.0
- 32.0
- 48.0
- 12.0
- 12.0
- 10.0
- 24.0
- 15.0
- 28.0
- 14.0
- 16.0
- 14.0
- 21.0
- 12.0
- 13.0
- 8.0
- 12.0
- 13.0
- 10.0
- 10.0
- 14.0
- 30.0
- 16.0
- 23.0
- 47.0
- 14.0
- 22.0
- 11.0
- 18.0
- 12.0
- 21.0
- 21.0
- 20.0
- 18.0
- 29.0
- 18.0
- 24.0
- 50.0
- 87.0
- 21.0
- 41.0
- 21.0
- 34.0
- 47.0
- 20.0
- 26.0
- 14.0
- 9.0
- 24.0
- 16.0
- 18.0
- 44.0
- 28.0
- 37.0
- 10.0
- 19.0
- 11.0
- 56.0
- 11.0
- 28.0
- 16.0
- 14.0
- 19.0
- 23.0
- 11.0
- 22.0
- 63.0
- 22.0
- 13.0
- 29.0
- 11.0
- 64.0
- 44.0
- 45.0
- 38.0
- 17.0
- 18.0
- 21.0
- 13.0
- 12.0
- 13.0
- 10.0
- 17.0
- 14.0
- 16.0
- 10.0
- 19.0
- 25.0
- 15.0
- 50.0
- 13.0
- 10.0
- 15.0
- 12.0
- 15.0
- 11.0
- 14.0
- 17.0
- 17.0
- 14.0
- 49.0
- 18.0
- 13.0
- 28.0
- 31.0
- 19.0
- 26.0
- 31.0
- 29.0
- 21.0
- 23.0
- 17.0
- 23.0
- 32.0
- 35.0
- 10.0
- 11.0
- 30.0
- 21.0
- 16.0
- 15.0
- 23.0
- 40.0
- 24.0
- 24.0
- 14.0
off_policy_estimator: {}
policy_reward_max: {}
policy_reward_mean: {}
policy_reward_min: {}
sampler_perf:
mean_action_processing_ms: 0.06886580197141673
mean_env_render_ms: 0.0
mean_env_wait_ms: 0.05465748139159193
mean_inference_ms: 0.6132523881103351
mean_raw_obs_processing_ms: 0.10609273714105154
time_since_restore: 3.7304069995880127
time_this_iter_s: 3.7304069995880127
time_total_s: 3.7304069995880127
timers:
learn_throughput: 2006.2
learn_time_ms: 1993.819
load_throughput: 24708712.813
load_time_ms: 0.162
training_iteration_time_ms: 3726.731
update_time_ms: 1.95
timestamp: 1652964884
timesteps_since_restore: 0
timesteps_total: 4000
training_iteration: 1
trial_id: cd8d6_00000
warmup_time: 10.095139741897583
Result for AIRPPOTrainer_cd8d6_00000:
agent_timesteps_total: 12000
counters:
num_agent_steps_sampled: 12000
num_agent_steps_trained: 12000
num_env_steps_sampled: 12000
num_env_steps_trained: 12000
custom_metrics: {}
date: 2022-05-19_13-54-51
done: false
episode_len_mean: 65.15
episode_media: {}
episode_reward_max: 200.0
episode_reward_mean: 65.15
episode_reward_min: 9.0
episodes_this_iter: 44
episodes_total: 311
experiment_id: 158c57d8b6e142ad85b393db57c8bdff
hostname: Kais-MacBook-Pro.local
info:
learner:
default_policy:
custom_metrics: {}
learner_stats:
cur_kl_coeff: 0.30000001192092896
cur_lr: 4.999999873689376e-05
entropy: 0.5750519633293152
entropy_coeff: 0.0
kl: 0.012749233283102512
model: {}
policy_loss: -0.026830431073904037
total_loss: 9.414541244506836
vf_explained_var: 0.046859823167324066
vf_loss: 9.43754768371582
num_agent_steps_trained: 128.0
num_agent_steps_sampled: 12000
num_agent_steps_trained: 12000
num_env_steps_sampled: 12000
num_env_steps_trained: 12000
iterations_since_restore: 3
node_ip: 127.0.0.1
num_agent_steps_sampled: 12000
num_agent_steps_trained: 12000
num_env_steps_sampled: 12000
num_env_steps_sampled_this_iter: 4000
num_env_steps_trained: 12000
num_env_steps_trained_this_iter: 4000
num_healthy_workers: 2
off_policy_estimator: {}
perf:
cpu_util_percent: 20.9
ram_util_percent: 61.379999999999995
pid: 14174
policy_reward_max: {}
policy_reward_mean: {}
policy_reward_min: {}
sampler_perf:
mean_action_processing_ms: 0.06834399059626647
mean_env_render_ms: 0.0
mean_env_wait_ms: 0.05423359203664157
mean_inference_ms: 0.5997818239241897
mean_raw_obs_processing_ms: 0.0982917359628421
sampler_results:
custom_metrics: {}
episode_len_mean: 65.15
episode_media: {}
episode_reward_max: 200.0
episode_reward_mean: 65.15
episode_reward_min: 9.0
episodes_this_iter: 44
hist_stats:
episode_lengths:
- 34
- 37
- 38
- 23
- 29
- 56
- 38
- 13
- 10
- 18
- 40
- 23
- 46
- 84
- 29
- 44
- 54
- 32
- 30
- 100
- 28
- 67
- 47
- 40
- 74
- 133
- 32
- 28
- 86
- 133
- 46
- 60
- 17
- 43
- 12
- 51
- 57
- 70
- 54
- 73
- 16
- 29
- 113
- 45
- 31
- 44
- 103
- 62
- 72
- 20
- 15
- 35
- 12
- 9
- 24
- 10
- 102
- 93
- 73
- 27
- 52
- 144
- 19
- 140
- 91
- 133
- 147
- 140
- 90
- 14
- 73
- 71
- 200
- 55
- 184
- 103
- 196
- 168
- 177
- 38
- 33
- 50
- 149
- 67
- 87
- 25
- 134
- 42
- 26
- 24
- 121
- 61
- 109
- 19
- 200
- 60
- 40
- 51
- 88
- 30
episode_reward:
- 34.0
- 37.0
- 38.0
- 23.0
- 29.0
- 56.0
- 38.0
- 13.0
- 10.0
- 18.0
- 40.0
- 23.0
- 46.0
- 84.0
- 29.0
- 44.0
- 54.0
- 32.0
- 30.0
- 100.0
- 28.0
- 67.0
- 47.0
- 40.0
- 74.0
- 133.0
- 32.0
- 28.0
- 86.0
- 133.0
- 46.0
- 60.0
- 17.0
- 43.0
- 12.0
- 51.0
- 57.0
- 70.0
- 54.0
- 73.0
- 16.0
- 29.0
- 113.0
- 45.0
- 31.0
- 44.0
- 103.0
- 62.0
- 72.0
- 20.0
- 15.0
- 35.0
- 12.0
- 9.0
- 24.0
- 10.0
- 102.0
- 93.0
- 73.0
- 27.0
- 52.0
- 144.0
- 19.0
- 140.0
- 91.0
- 133.0
- 147.0
- 140.0
- 90.0
- 14.0
- 73.0
- 71.0
- 200.0
- 55.0
- 184.0
- 103.0
- 196.0
- 168.0
- 177.0
- 38.0
- 33.0
- 50.0
- 149.0
- 67.0
- 87.0
- 25.0
- 134.0
- 42.0
- 26.0
- 24.0
- 121.0
- 61.0
- 109.0
- 19.0
- 200.0
- 60.0
- 40.0
- 51.0
- 88.0
- 30.0
off_policy_estimator: {}
policy_reward_max: {}
policy_reward_mean: {}
policy_reward_min: {}
sampler_perf:
mean_action_processing_ms: 0.06834399059626647
mean_env_render_ms: 0.0
mean_env_wait_ms: 0.05423359203664157
mean_inference_ms: 0.5997818239241897
mean_raw_obs_processing_ms: 0.0982917359628421
time_since_restore: 10.289561986923218
time_this_iter_s: 3.3495230674743652
time_total_s: 10.289561986923218
timers:
learn_throughput: 2276.977
learn_time_ms: 1756.715
load_throughput: 20798201.653
load_time_ms: 0.192
training_iteration_time_ms: 3425.704
update_time_ms: 1.814
timestamp: 1652964891
timesteps_since_restore: 0
timesteps_total: 12000
training_iteration: 3
trial_id: cd8d6_00000
warmup_time: 10.095139741897583
Result for AIRPPOTrainer_cd8d6_00000:
agent_timesteps_total: 20000
counters:
num_agent_steps_sampled: 20000
num_agent_steps_trained: 20000
num_env_steps_sampled: 20000
num_env_steps_trained: 20000
custom_metrics: {}
date: 2022-05-19_13-54-57
done: true
episode_len_mean: 124.79
episode_media: {}
episode_reward_max: 200.0
episode_reward_mean: 124.79
episode_reward_min: 9.0
episodes_this_iter: 20
episodes_total: 354
experiment_id: 158c57d8b6e142ad85b393db57c8bdff
hostname: Kais-MacBook-Pro.local
info:
learner:
default_policy:
custom_metrics: {}
learner_stats:
cur_kl_coeff: 0.30000001192092896
cur_lr: 4.999999873689376e-05
entropy: 0.5436986684799194
entropy_coeff: 0.0
kl: 0.0034858626313507557
model: {}
policy_loss: -0.012989979237318039
total_loss: 9.49295425415039
vf_explained_var: 0.025460055097937584
vf_loss: 9.504897117614746
num_agent_steps_trained: 128.0
num_agent_steps_sampled: 20000
num_agent_steps_trained: 20000
num_env_steps_sampled: 20000
num_env_steps_trained: 20000
iterations_since_restore: 5
node_ip: 127.0.0.1
num_agent_steps_sampled: 20000
num_agent_steps_trained: 20000
num_env_steps_sampled: 20000
num_env_steps_sampled_this_iter: 4000
num_env_steps_trained: 20000
num_env_steps_trained_this_iter: 4000
num_healthy_workers: 2
off_policy_estimator: {}
perf:
cpu_util_percent: 24.599999999999998
ram_util_percent: 59.775
pid: 14174
policy_reward_max: {}
policy_reward_mean: {}
policy_reward_min: {}
sampler_perf:
mean_action_processing_ms: 0.06817872750804764
mean_env_render_ms: 0.0
mean_env_wait_ms: 0.05424549075766555
mean_inference_ms: 0.5976919122059019
mean_raw_obs_processing_ms: 0.09603803519062176
sampler_results:
custom_metrics: {}
episode_len_mean: 124.79
episode_media: {}
episode_reward_max: 200.0
episode_reward_mean: 124.79
episode_reward_min: 9.0
episodes_this_iter: 20
hist_stats:
episode_lengths:
- 45
- 31
- 44
- 103
- 62
- 72
- 20
- 15
- 35
- 12
- 9
- 24
- 10
- 102
- 93
- 73
- 27
- 52
- 144
- 19
- 140
- 91
- 133
- 147
- 140
- 90
- 14
- 73
- 71
- 200
- 55
- 184
- 103
- 196
- 168
- 177
- 38
- 33
- 50
- 149
- 67
- 87
- 25
- 134
- 42
- 26
- 24
- 121
- 61
- 109
- 19
- 200
- 60
- 40
- 51
- 88
- 30
- 200
- 186
- 200
- 182
- 196
- 200
- 200
- 200
- 200
- 200
- 200
- 43
- 200
- 109
- 156
- 200
- 183
- 200
- 200
- 200
- 200
- 200
- 107
- 200
- 200
- 200
- 200
- 200
- 200
- 200
- 200
- 200
- 200
- 200
- 89
- 200
- 200
- 200
- 200
- 200
- 200
- 200
- 200
episode_reward:
- 45.0
- 31.0
- 44.0
- 103.0
- 62.0
- 72.0
- 20.0
- 15.0
- 35.0
- 12.0
- 9.0
- 24.0
- 10.0
- 102.0
- 93.0
- 73.0
- 27.0
- 52.0
- 144.0
- 19.0
- 140.0
- 91.0
- 133.0
- 147.0
- 140.0
- 90.0
- 14.0
- 73.0
- 71.0
- 200.0
- 55.0
- 184.0
- 103.0
- 196.0
- 168.0
- 177.0
- 38.0
- 33.0
- 50.0
- 149.0
- 67.0
- 87.0
- 25.0
- 134.0
- 42.0
- 26.0
- 24.0
- 121.0
- 61.0
- 109.0
- 19.0
- 200.0
- 60.0
- 40.0
- 51.0
- 88.0
- 30.0
- 200.0
- 186.0
- 200.0
- 182.0
- 196.0
- 200.0
- 200.0
- 200.0
- 200.0
- 200.0
- 200.0
- 43.0
- 200.0
- 109.0
- 156.0
- 200.0
- 183.0
- 200.0
- 200.0
- 200.0
- 200.0
- 200.0
- 107.0
- 200.0
- 200.0
- 200.0
- 200.0
- 200.0
- 200.0
- 200.0
- 200.0
- 200.0
- 200.0
- 200.0
- 89.0
- 200.0
- 200.0
- 200.0
- 200.0
- 200.0
- 200.0
- 200.0
- 200.0
off_policy_estimator: {}
policy_reward_max: {}
policy_reward_mean: {}
policy_reward_min: {}
sampler_perf:
mean_action_processing_ms: 0.06817872750804764
mean_env_render_ms: 0.0
mean_env_wait_ms: 0.05424549075766555
mean_inference_ms: 0.5976919122059019
mean_raw_obs_processing_ms: 0.09603803519062176
time_since_restore: 16.702913284301758
time_this_iter_s: 3.1872010231018066
time_total_s: 16.702913284301758
timers:
learn_throughput: 2378.661
learn_time_ms: 1681.619
load_throughput: 16503261.853
load_time_ms: 0.242
training_iteration_time_ms: 3336.7
update_time_ms: 1.759
timestamp: 1652964897
timesteps_since_restore: 0
timesteps_total: 20000
training_iteration: 5
trial_id: cd8d6_00000
warmup_time: 10.095139741897583
2022-05-19 13:54:58,548 INFO tune.py:753 -- Total run time: 36.92 seconds (35.95 seconds for the tuning loop).
And then, using the obtained checkpoint, we evaluate the policy on a fresh environment:
num_eval_episodes = 3
rewards = evaluate_using_checkpoint(result.checkpoint, num_episodes=num_eval_episodes)
print(f"Average reward over {num_eval_episodes} episodes: " f"{np.mean(rewards)}")
2022-05-19 13:54:58,589 INFO trainer.py:1728 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.
2022-05-19 13:54:58,590 WARNING deprecation.py:47 -- DeprecationWarning: `simple_optimizer` has been deprecated. This will raise an error in the future!
2022-05-19 13:54:58,591 INFO ppo.py:361 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
2022-05-19 13:54:58,591 INFO trainer.py:328 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(RolloutWorker pid=14191) 2022-05-19 13:55:06,622 WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!
(RolloutWorker pid=14192) 2022-05-19 13:55:06,622 WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!
2022-05-19 13:55:07,968 WARNING util.py:65 -- Install gputil for GPU system monitoring.
2022-05-19 13:55:08,021 INFO trainable.py:589 -- Restored on 127.0.0.1 from checkpoint: /Users/kai/ray_results/AIRPPOTrainer_2022-05-19_13-54-16/AIRPPOTrainer_cd8d6_00000_0_2022-05-19_13-54-22/checkpoint_000005/checkpoint-5
2022-05-19 13:55:08,021 INFO trainable.py:597 -- Current state after restoring: {'_iteration': 5, '_timesteps_total': None, '_time_total': 16.702913284301758, '_episodes_total': 354}
Average reward over 3 episodes: 200.0
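If you want to keep the trained policy around for later use, the checkpoint can also be persisted to a directory and restored again. A minimal sketch, assuming the ray.air Checkpoint API (the target path is only an example):
# Write the checkpoint contents to a local directory (illustrative path).
checkpoint_dir = result.checkpoint.to_directory("/tmp/cartpole_ppo_checkpoint")

# Later, or in another process: restore the checkpoint and build a predictor from it.
restored = Checkpoint.from_directory(checkpoint_dir)
predictor = RLPredictor.from_checkpoint(restored)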