Logging results and uploading models to Weights & Biases

In this example, we train a simple XGBoost model and log the training results to Weights & Biases. We also save the resulting model checkpoints as artifacts.

There are two ways to achieve this:

  1. Automatically using the ray.air.integrations.wandb.WandbLoggerCallback

  2. Manually using the wandb API

This tutorial will walk you through both options.

Let’s start with installing our dependencies:

!pip install -qU "ray[tune]" scikit-learn xgboost_ray wandb

Then we need some imports:

import ray

from ray.air.config import RunConfig, ScalingConfig
from ray.air.result import Result
from ray.air.integrations.wandb import WandbLoggerCallback

We define a simple function that returns our training dataset as a Ray Dataset:

def get_train_dataset() -> ray.data.Dataset:
    dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")
    return dataset

That covers the common parts. We now look at the two options for interacting with Weights & Biases.

Using the WandbLoggerCallback

The WandbLoggerCallback does all the logging and reporting for you. It is especially useful when you use an out-of-the-box trainer like XGBoostTrainer. Because you don’t define your own training loop in these trainers, the AIR W&B callback is the most direct way to log your results to Weights & Biases.

First we define a simple training function.

All the magic happens within the WandbLoggerCallback:

WandbLoggerCallback(
    project=wandb_project,
    save_checkpoints=True,
)

It automatically logs all results to Weights & Biases and uploads the checkpoints as artifacts. It assumes you’re already logged in to W&B, either via an API key or via wandb login.

from ray.train.xgboost import XGBoostTrainer


def train_model_xgboost(train_dataset: ray.data.Dataset, wandb_project: str) -> Result:
    """Train a simple XGBoost model and return the result."""
    trainer = XGBoostTrainer(
        scaling_config=ScalingConfig(num_workers=2),
        params={"tree_method": "auto"},
        label_column="target",
        datasets={"train": train_dataset},
        num_boost_round=10,
        run_config=RunConfig(
            callbacks=[
                # This is the part needed to enable logging to Weights & Biases.
                # It assumes you've logged in before, e.g. with `wandb login`.
                WandbLoggerCallback(
                    project=wandb_project,
                    save_checkpoints=True,
                )
            ]
        ),
    )
    result = trainer.fit()
    return result
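
Since the callback assumes you are already logged in, it can help to check for credentials before starting a run. The following is a minimal sketch using only the standard library. The helper name wandb_credentials_available is hypothetical and not part of Ray or wandb. It also assumes that wandb login stored the key in ~/.netrc under the host api.wandb.ai, which may differ on your platform or for self-hosted W&B servers.

```python
import netrc
import os
from pathlib import Path
from typing import Mapping, Optional


def wandb_credentials_available(
    environ: Optional[Mapping[str, str]] = None,
    netrc_path: Optional[Path] = None,
) -> bool:
    """Return True if a W&B API key appears to be configured locally.

    Hypothetical helper for illustration; not part of Ray or wandb.
    """
    environ = os.environ if environ is None else environ
    # The wandb client can read its API key from this environment variable.
    if environ.get("WANDB_API_KEY"):
        return True

    # Assumption: `wandb login` wrote the key to ~/.netrc under api.wandb.ai.
    netrc_path = Path.home() / ".netrc" if netrc_path is None else netrc_path
    try:
        entry = netrc.netrc(str(netrc_path)).authenticators("api.wandb.ai")
    except (OSError, netrc.NetrcParseError):
        # Missing or unreadable file: treat as "no credentials found".
        return False
    return entry is not None
```

You could call wandb_credentials_available() before train_model_xgboost(...) and run wandb login first if it returns False.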

Let’s kick off a run:

wandb_project = "ray_air_example_xgboost"

train_dataset = get_train_dataset()
result = train_model_xgboost(train_dataset=train_dataset, wandb_project=wandb_project)
2022-10-28 16:28:19,325	INFO worker.py:1524 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 
2022-10-28 16:28:22,993	WARNING read_api.py:297 -- ⚠️  The number of blocks in this dataset (1) limits its parallelism to 1 concurrent tasks. This is much less than the number of available CPU slots in the cluster. Use `.repartition(n)` to increase the number of dataset blocks.
2022-10-28 16:28:26,033	INFO wandb.py:267 -- Already logged into W&B.

Check out your W&B project to see the results!

Using the wandb API

When you define your own training loop, you sometimes want to manually interact with the Weights and Biases API. Ray AIR provides a setup_wandb() function that takes care of the initialization.

The main benefit here is that authentication to Weights and Biases is automatically set up for you, and sensible default names for your runs are set. Of course, you can override these.

Additionally, in distributed training you often only want to report the results of the rank 0 worker. This can also be done automatically using our setup.

Let’s define a distributed training loop. The important parts here are:

wandb = setup_wandb(config)

and later

wandb.log({"loss": loss.item()})

The call to setup_wandb() sets up your session, for instance by calling wandb.init() with sensible defaults. Because we are in a distributed training setting, this only happens on the rank 0 worker. All other workers get a mock object back, and any subsequent calls to wandb.XXX are no-ops for them.

You can then use the returned wandb object as usual:

from ray.air import session
from ray.air.integrations.wandb import setup_wandb
from ray.data.preprocessors import Concatenator

import numpy as np


import torch.optim as optim
import torch.nn as nn

def train_loop(config):
    wandb = setup_wandb(config)
    
    dataset = session.get_dataset_shard("train")

    model = nn.Linear(30, 2)

    optimizer = optim.SGD(
        model.parameters(),
        lr=config.get("lr", 0.01),
    )
    loss_fn = nn.CrossEntropyLoss()
    
    for batch in dataset.iter_torch_batches(batch_size=32):
        X = batch["data"]
        y = batch["target"]
        
        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        session.report({"loss": loss.item()})
        wandb.log({"loss": loss.item()})
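
To make the rank 0 behavior more concrete, here is a minimal, standard-library-only sketch of the pattern. It is not Ray's implementation: RecordingLogger, NoOpLogger, and setup_logger are hypothetical names used only for illustration, and the in-memory list stands in for a real W&B run.

```python
from typing import Any, Dict, List


class RecordingLogger:
    """Stand-in for a real W&B run: keeps logged metrics in a list."""

    def __init__(self) -> None:
        self.history: List[Dict[str, Any]] = []

    def log(self, metrics: Dict[str, Any]) -> None:
        self.history.append(dict(metrics))


class NoOpLogger:
    """Accepts any method call and silently does nothing."""

    def __getattr__(self, name: str) -> "NoOpLogger":
        # Leave dunder lookups alone so copy/pickle helpers behave normally.
        if name.startswith("__"):
            raise AttributeError(name)
        return self

    def __call__(self, *args: Any, **kwargs: Any) -> None:
        return None


def setup_logger(world_rank: int):
    """Hypothetical analogue of setup_wandb(): a real logger on rank 0 only."""
    return RecordingLogger() if world_rank == 0 else NoOpLogger()


# Simulate two workers running the same training code.
loggers = [setup_logger(rank) for rank in range(2)]
for logger in loggers:
    logger.log({"loss": 0.5})  # identical call on every worker
```

Only the rank 0 logger records the metric. The other worker runs the same line of code without any effect, which is why the training loop needs no rank checks of its own.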
    

Let’s define a function to kick off the training. Here, the Weights & Biases settings are passed through the config, but you could also pass them directly to the setup function, e.g. like this:

setup_wandb(config, project="my_project")

from ray.train.torch import TorchTrainer


def train_model_torch(train_dataset: ray.data.Dataset, wandb_project: str) -> Result:
    """Train a simple PyTorch model and return the result."""
    trainer = TorchTrainer(
        train_loop_per_worker=train_loop,
        scaling_config=ScalingConfig(num_workers=2),
        train_loop_config={"lr": 0.01, "wandb": {"project": wandb_project}},
        datasets={"train": train_dataset},
        preprocessor=Concatenator("data", dtype=np.float32, exclude=["target"]),
    )
    result = trainer.fit()
    return result
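
The train_loop_config above keeps hyperparameters at the top level and W&B settings under a nested "wandb" key. The sketch below only illustrates that layout. build_train_loop_config is a hypothetical helper, not a Ray API, and it makes no claim about which other keys setup_wandb() accepts under "wandb".

```python
from typing import Any, Dict


def build_train_loop_config(lr: float, wandb_project: str) -> Dict[str, Any]:
    """Hypothetical helper mirroring the config layout used in this tutorial."""
    return {
        "lr": lr,  # hyperparameter read by the training loop
        "wandb": {"project": wandb_project},  # settings for setup_wandb(config)
    }


config = build_train_loop_config(0.01, "ray_air_example_torch")

# The training loop reads hyperparameters from the top level of the config.
learning_rate = config.get("lr", 0.01)
```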

Let’s kick off this run:

wandb_project = "ray_air_example_torch"

train_dataset = get_train_dataset()
result = train_model_torch(train_dataset=train_dataset, wandb_project=wandb_project)

Check out your W&B project to see the results!