Logging results and uploading models to Weights & Biases

In this example, we train a simple XGBoost model and log the training results to Weights & Biases. We also save the resulting model checkpoints as artifacts.

Let’s start with installing our dependencies:

!pip install -qU "ray[tune]" scikit-learn xgboost_ray wandb

Then we need some imports:

import ray

from ray.air.config import RunConfig, ScalingConfig
from ray.air.result import Result
from ray.train.xgboost import XGBoostTrainer
from ray.air.callbacks.wandb import WandbLoggerCallback

We define a simple function that returns our training dataset as a Ray Dataset:

def get_train_dataset() -> ray.data.Dataset:
    dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")
    return dataset

Now we define a simple training function. The only part specific to Weights & Biases is the WandbLoggerCallback:

WandbLoggerCallback(
    project=wandb_project,
    save_checkpoints=True,
)

It automatically logs all results to Weights & Biases and uploads the checkpoints as artifacts. It assumes you are already logged in to W&B, either via an API key or via wandb login.

def train_model(train_dataset: ray.data.Dataset, wandb_project: str) -> Result:
    """Train a simple XGBoost model and return the result."""
    trainer = XGBoostTrainer(
        scaling_config=ScalingConfig(num_workers=2),
        params={"tree_method": "auto"},
        label_column="target",
        datasets={"train": train_dataset},
        num_boost_round=10,
        run_config=RunConfig(
            callbacks=[
                # This is the part needed to enable logging to Weights & Biases.
                # It assumes you've logged in before, e.g. with `wandb login`.
                WandbLoggerCallback(
                    project=wandb_project,
                    save_checkpoints=True,
                )
            ]
        ),
    )
    result = trainer.fit()
    return result
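
Because the callback assumes that W&B credentials are already in place, it can help to check for them before starting a run. The following is a minimal sketch, not part of Ray or wandb: the helper name find_wandb_api_key is hypothetical, and it assumes the key is either set in the WANDB_API_KEY environment variable or was stored by wandb login in ~/.netrc under the machine name api.wandb.ai.

```python
import netrc
import os
from pathlib import Path
from typing import Mapping, Optional


def find_wandb_api_key(
    env: Mapping[str, str] = os.environ,
    netrc_path: Optional[Path] = None,
) -> Optional[str]:
    """Return a W&B API key if one appears to be configured, else None.

    Hypothetical helper. Assumptions: the key lives in the WANDB_API_KEY
    environment variable, or `wandb login` stored it in ~/.netrc under
    the machine name "api.wandb.ai".
    """
    # 1) An environment variable takes precedence in this sketch.
    key = env.get("WANDB_API_KEY")
    if key:
        return key

    # 2) Fall back to the netrc file, if there is one.
    path = netrc_path or Path.home() / ".netrc"
    if not path.exists():
        return None
    try:
        auth = netrc.netrc(str(path)).authenticators("api.wandb.ai")
    except (netrc.NetrcParseError, OSError):
        return None
    # authenticators() returns (login, account, password) or None;
    # the API key is stored in the password field.
    return auth[2] if auth else None


if __name__ == "__main__":
    if find_wandb_api_key() is None:
        print("No W&B credentials found; run `wandb login` or set WANDB_API_KEY.")
    else:
        print("W&B credentials found.")
```

If the helper returns None, running wandb login or exporting WANDB_API_KEY before calling train_model should avoid an authentication prompt in the middle of the run.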

Let’s kick off a run:

wandb_project = "ray_air_example"

train_dataset = get_train_dataset()
result = train_model(train_dataset=train_dataset, wandb_project=wandb_project)
2022-05-19 15:22:11,956	INFO services.py:1483 -- View the Ray dashboard at http://127.0.0.1:8266
2022-05-19 15:22:15,995	INFO wandb.py:172 -- Already logged into W&B.
== Status ==
Current time: 2022-05-19 15:22:42 (running for 00:00:26.61)
Memory usage on this node: 10.2/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/4.6 GiB heap, 0.0/2.0 GiB objects
Result logdir: /Users/kai/ray_results/XGBoostTrainer_2022-05-19_15-22-14
Number of trials: 1/1 (1 TERMINATED)
Trial name                  status      loc              iter  total time (s)  train-rmse
XGBoostTrainer_14a73_00000  TERMINATED  127.0.0.1:20065  10    10.2724         0.030717


(raylet) 2022-05-19 15:22:17,422	INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=61838 --object-store-name=/tmp/ray/session_2022-05-19_15-22-09_017478_19912/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_15-22-09_017478_19912/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=63609 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:62933 --redis-password=5241590000000000 --startup-token=16 --runtime-env-hash=-2010331134
wandb: Currently logged in as: kaifricke. Use `wandb login --relogin` to force relogin
(GBDTTrainable pid=20065) UserWarning: Dataset 'train' has 1 blocks, which is less than the `num_workers` 2. This dataset will be automatically repartitioned to 2 blocks.
(raylet) 2022-05-19 15:22:23,215	INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=61838 --object-store-name=/tmp/ray/session_2022-05-19_15-22-09_017478_19912/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_15-22-09_017478_19912/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=63609 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:62933 --redis-password=5241590000000000 --startup-token=17 --runtime-env-hash=-2010331069
Tracking run with wandb version 0.12.16
Run data is saved locally in /Users/kai/coding/ray/doc/source/ray-air/examples/wandb/run-20220519_152218-14a73_00000
(GBDTTrainable pid=20065) 2022-05-19 15:22:24,711	INFO main.py:980 -- [RayXGBoost] Created 2 new actors (2 total actors). Waiting until actors are ready for training.
(raylet) 2022-05-19 15:22:26,090	INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=61838 --object-store-name=/tmp/ray/session_2022-05-19_15-22-09_017478_19912/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_15-22-09_017478_19912/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=63609 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:62933 --redis-password=5241590000000000 --startup-token=18 --runtime-env-hash=-2010331069
(raylet) 2022-05-19 15:22:26,234	INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=61838 --object-store-name=/tmp/ray/session_2022-05-19_15-22-09_017478_19912/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_15-22-09_017478_19912/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=63609 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:62933 --redis-password=5241590000000000 --startup-token=19 --runtime-env-hash=-2010331134
(raylet) 2022-05-19 15:22:26,236	INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=61838 --object-store-name=/tmp/ray/session_2022-05-19_15-22-09_017478_19912/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_15-22-09_017478_19912/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=63609 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:62933 --redis-password=5241590000000000 --startup-token=20 --runtime-env-hash=-2010331134
(raylet) 2022-05-19 15:22:26,239	INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=61838 --object-store-name=/tmp/ray/session_2022-05-19_15-22-09_017478_19912/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_15-22-09_017478_19912/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=63609 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:62933 --redis-password=5241590000000000 --startup-token=21 --runtime-env-hash=-2010331134
(raylet) 2022-05-19 15:22:26,263	INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=61838 --object-store-name=/tmp/ray/session_2022-05-19_15-22-09_017478_19912/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_15-22-09_017478_19912/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=63609 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:62933 --redis-password=5241590000000000 --startup-token=22 --runtime-env-hash=-2010331134
(GBDTTrainable pid=20065) 2022-05-19 15:22:29,260	INFO main.py:1025 -- [RayXGBoost] Starting XGBoost training.
(_RemoteRayXGBoostActor pid=20130) [15:22:29] task [xgboost.ray]:6859875216 got new rank 0
(_RemoteRayXGBoostActor pid=20131) [15:22:29] task [xgboost.ray]:4625795280 got new rank 1
wandb: Adding directory to artifact (/Users/kai/ray_results/XGBoostTrainer_2022-05-19_15-22-14/XGBoostTrainer_14a73_00000_0_2022-05-19_15-22-16/checkpoint_000000)... Done. 0.1s
Result for XGBoostTrainer_14a73_00000:
  date: 2022-05-19_15-22-31
  done: false
  experiment_id: 2d50bfe80d2a441e80f4ca05f7c3b607
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 1
  node_ip: 127.0.0.1
  pid: 20065
  should_checkpoint: true
  time_since_restore: 10.080440044403076
  time_this_iter_s: 10.080440044403076
  time_total_s: 10.080440044403076
  timestamp: 1652970151
  timesteps_since_restore: 0
  train-rmse: 0.357284
  training_iteration: 1
  trial_id: 14a73_00000
  warmup_time: 0.006903171539306641
  
wandb: Adding directory to artifact (/Users/kai/ray_results/XGBoostTrainer_2022-05-19_15-22-14/XGBoostTrainer_14a73_00000_0_2022-05-19_15-22-16/checkpoint_000001)... Done. 0.1s
(GBDTTrainable pid=20065) 2022-05-19 15:22:32,051	INFO main.py:1519 -- [RayXGBoost] Finished XGBoost training on training data with total N=569 in 7.37 seconds (2.79 pure XGBoost training time).
wandb: Adding directory to artifact (/Users/kai/ray_results/XGBoostTrainer_2022-05-19_15-22-14/XGBoostTrainer_14a73_00000_0_2022-05-19_15-22-16/checkpoint_000002)... Done. 0.1s
wandb: Adding directory to artifact (/Users/kai/ray_results/XGBoostTrainer_2022-05-19_15-22-14/XGBoostTrainer_14a73_00000_0_2022-05-19_15-22-16/checkpoint_000003)... Done. 0.1s
wandb: Adding directory to artifact (/Users/kai/ray_results/XGBoostTrainer_2022-05-19_15-22-14/XGBoostTrainer_14a73_00000_0_2022-05-19_15-22-16/checkpoint_000004)... Done. 0.1s
wandb: Adding directory to artifact (/Users/kai/ray_results/XGBoostTrainer_2022-05-19_15-22-14/XGBoostTrainer_14a73_00000_0_2022-05-19_15-22-16/checkpoint_000005)... Done. 0.1s
wandb: Adding directory to artifact (/Users/kai/ray_results/XGBoostTrainer_2022-05-19_15-22-14/XGBoostTrainer_14a73_00000_0_2022-05-19_15-22-16/checkpoint_000006)... Done. 0.1s
wandb: Adding directory to artifact (/Users/kai/ray_results/XGBoostTrainer_2022-05-19_15-22-14/XGBoostTrainer_14a73_00000_0_2022-05-19_15-22-16/checkpoint_000007)... Done. 0.1s
wandb: Adding directory to artifact (/Users/kai/ray_results/XGBoostTrainer_2022-05-19_15-22-14/XGBoostTrainer_14a73_00000_0_2022-05-19_15-22-16/checkpoint_000008)... Done. 0.1s
wandb: Adding directory to artifact (/Users/kai/ray_results/XGBoostTrainer_2022-05-19_15-22-14/XGBoostTrainer_14a73_00000_0_2022-05-19_15-22-16/checkpoint_000009)... Done. 0.1s
wandb: Adding directory to artifact (/Users/kai/ray_results/XGBoostTrainer_2022-05-19_15-22-14/XGBoostTrainer_14a73_00000_0_2022-05-19_15-22-16/checkpoint_000009)... Done. 0.1s
Waiting for W&B process to finish... (success).
Result for XGBoostTrainer_14a73_00000:
  date: 2022-05-19_15-22-32
  done: true
  experiment_id: 2d50bfe80d2a441e80f4ca05f7c3b607
  experiment_tag: '0'
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 10
  node_ip: 127.0.0.1
  pid: 20065
  should_checkpoint: true
  time_since_restore: 10.272444248199463
  time_this_iter_s: 0.023891210556030273
  time_total_s: 10.272444248199463
  timestamp: 1652970152
  timesteps_since_restore: 0
  train-rmse: 0.030717
  training_iteration: 10
  trial_id: 14a73_00000
  warmup_time: 0.006903171539306641
  
2022-05-19 15:22:42,727	INFO tune.py:753 -- Total run time: 27.83 seconds (26.61 seconds for the tuning loop).

Run history:


iterations_since_restore  ▁▂▃▃▄▅▆▆▇█
time_since_restore        ▁▂▃▃▄▅▅▆▇█
time_this_iter_s          █▁▁▁▁▁▁▁▁▁
time_total_s              ▁▂▃▃▄▅▅▆▇█
timestamp                 ▁▁▁▁▁▁▁▁██
timesteps_since_restore   ▁▁▁▁▁▁▁▁▁▁
train-rmse                █▆▄▃▂▂▂▁▁▁
training_iteration        ▁▂▃▃▄▅▆▆▇█
warmup_time               ▁▁▁▁▁▁▁▁▁▁

Run summary:


iterations_since_restore  10
time_since_restore        10.27244
time_this_iter_s          0.02389
time_total_s              10.27244
timestamp                 1652970152
timesteps_since_restore   0
train-rmse                0.03072
training_iteration        10
warmup_time               0.0069

Synced XGBoostTrainer_14a73_00000: https://wandb.ai/kaifricke/ray_air_example/runs/14a73_00000
Synced 5 W&B file(s), 0 media file(s), 21 artifact file(s) and 0 other file(s)
Find logs at: ./wandb/run-20220519_152218-14a73_00000/logs

Check out your W&B project to see the results!
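
The same final metrics are also available locally on the object returned by train_model. The following is a small sketch with a hypothetical helper, summarize_result, which assumes only that the result exposes a metrics dictionary and a checkpoint attribute. To keep it runnable without Ray, it is demonstrated on a stand-in object filled with the final values from the run above.

```python
from types import SimpleNamespace
from typing import Any, Dict


def summarize_result(result: Any, metric: str = "train-rmse") -> Dict[str, Any]:
    """Collect a few fields of interest from a training result.

    Hypothetical helper. Assumes `result` has a `metrics` dict and a
    `checkpoint` attribute, as the Result returned by `trainer.fit()` does.
    """
    metrics = result.metrics or {}
    return {
        "metric": metric,
        "value": metrics.get(metric),
        "training_iteration": metrics.get("training_iteration"),
        "has_checkpoint": result.checkpoint is not None,
    }


# Stand-in for the object returned by `train_model`, filled with the
# final values reported in the run above (no checkpoint attached here).
example = SimpleNamespace(
    metrics={"train-rmse": 0.030717, "training_iteration": 10},
    checkpoint=None,
)
print(summarize_result(example))
```

In the notebook itself, calling summarize_result(result) on the real result would replace the stand-in.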