Using Weights & Biases with Tune

Weights & Biases (Wandb) is a tool for experiment tracking, model optimization, and dataset versioning. It is very popular in the machine learning and data science community for its superb visualization tools.


Ray Tune currently offers two lightweight integrations for Weights & Biases. One is the WandbLoggerCallback, which automatically logs metrics reported to Tune to the Wandb API.

The other is the @wandb_mixin decorator, which can be used with the function API. It automatically initializes the Wandb API with Tune's training information. You can then use the Wandb API as you normally would, e.g. calling wandb.log() to log your training process.

Running A Weights & Biases Example

In the following example we're going to use both of the above methods, the WandbLoggerCallback and the wandb_mixin decorator, to log metrics. Let's start with a few crucial imports:

import numpy as np
import wandb

from ray import air, tune
from ray.air import session
from ray.tune import Trainable
from ray.air.callbacks.wandb import WandbLoggerCallback
from ray.tune.integration.wandb import (
    WandbTrainableMixin,
    wandb_mixin,
)

Next, let's define an easy objective function (a Tune Trainable) that reports a random loss to Tune. The objective function itself is not important for this example, since we primarily want to focus on the Weights & Biases integration.

def objective(config, checkpoint_dir=None):
    for i in range(30):
        loss = config["mean"] + config["sd"] * np.random.randn()
        session.report({"loss": loss})
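
As a quick sanity check (independent of Tune and Wandb), note that each reported loss is drawn from a normal distribution centered at config["mean"], so the average loss over many iterations converges to that mean:

```python
import numpy as np

# Draw losses the same way the objective does and verify that their
# average is close to config["mean"]. The seed and sample count here
# are arbitrary choices for this check.
rng = np.random.default_rng(0)
config = {"mean": 3.0, "sd": 0.5}
losses = [config["mean"] + config["sd"] * rng.standard_normal() for _ in range(10_000)]
print(abs(np.mean(losses) - config["mean"]) < 0.05)  # → True
```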

Provided you have an api_key_file pointing to your Weights & Biases API key, you can define a simple grid-search Tune run using the WandbLoggerCallback as follows:

def tune_function(api_key_file):
    """Example for using a WandbLoggerCallback with the function API"""
    tuner = tune.Tuner(
        objective,
        tune_config=tune.TuneConfig(
            metric="loss",
            mode="min",
        ),
        run_config=air.RunConfig(
            callbacks=[
                WandbLoggerCallback(api_key_file=api_key_file, project="Wandb_example")
            ],
        ),
        param_space={
            "mean": tune.grid_search([1, 2, 3, 4, 5]),
            "sd": tune.uniform(0.2, 0.8),
        },
    )
    results = tuner.fit()

    return results.get_best_result().config

To use the wandb_mixin decorator, simply decorate the objective function from earlier. Note that we also call wandb.log(...) to log the loss to Weights & Biases as a dictionary. Otherwise, the decorated version of our objective is identical to the original.

@wandb_mixin
def decorated_objective(config, checkpoint_dir=None):
    for i in range(30):
        loss = config["mean"] + config["sd"] * np.random.randn()
        session.report({"loss": loss})
        wandb.log(dict(loss=loss))

With the decorated_objective defined, running a Tune experiment is as simple as providing this objective and passing the api_key_file under the wandb key of your param_space:

def tune_decorated(api_key_file):
    """Example for using the @wandb_mixin decorator with the function API"""
    tuner = tune.Tuner(
        decorated_objective,
        tune_config=tune.TuneConfig(
            metric="loss",
            mode="min",
        ),
        param_space={
            "mean": tune.grid_search([1, 2, 3, 4, 5]),
            "sd": tune.uniform(0.2, 0.8),
            "wandb": {"api_key_file": api_key_file, "project": "Wandb_example"},
        },
    )
    results = tuner.fit()

    return results.get_best_result().config

Finally, you can also define a class-based Tune Trainable by using the WandbTrainableMixin to define your objective:

class WandbTrainable(WandbTrainableMixin, Trainable):
    def step(self):
        for i in range(30):
            loss = self.config["mean"] + self.config["sd"] * np.random.randn()
            wandb.log({"loss": loss})
        return {"loss": loss, "done": True}

Running Tune with this WandbTrainable works exactly the same as with the function API. The below tune_trainable function differs from tune_decorated above only in the first argument we pass to Tuner():

def tune_trainable(api_key_file):
    """Example for using a WandbTrainableMixin with the class API"""
    tuner = tune.Tuner(
        WandbTrainable,
        tune_config=tune.TuneConfig(
            metric="loss",
            mode="min",
        ),
        param_space={
            "mean": tune.grid_search([1, 2, 3, 4, 5]),
            "sd": tune.uniform(0.2, 0.8),
            "wandb": {"api_key_file": api_key_file, "project": "Wandb_example"},
        },
    )
    results = tuner.fit()

    return results.get_best_result().config

Since you may not have a Wandb API key, we can mock the Wandb logger and test all three of our training functions as follows. If you do have an API key file, set mock_api to False and pass in the correct api_key_file below.

import tempfile
from unittest.mock import MagicMock

mock_api = True

api_key_file = "~/.wandb_api_key"

if mock_api:
    WandbLoggerCallback._logger_process_cls = MagicMock
    decorated_objective.__mixins__ = tuple()
    WandbTrainable._wandb = MagicMock()
    wandb = MagicMock()  # noqa: F811
    temp_file = tempfile.NamedTemporaryFile()
    temp_file.write(b"1234")
    temp_file.flush()
    api_key_file = temp_file.name

tune_function(api_key_file)
tune_decorated(api_key_file)
tune_trainable(api_key_file)

if mock_api:
    temp_file.close()
2022-07-22 15:39:38,323	INFO services.py:1483 -- View the Ray dashboard at http://127.0.0.1:8266
/Users/kai/coding/ray/python/ray/tune/trainable/function_trainable.py:643: DeprecationWarning: `checkpoint_dir` in `func(config, checkpoint_dir)` is being deprecated. To save and load checkpoint in trainable functions, please use the `ray.air.session` API:

from ray.air import session

def train(config):
    # ...
    session.report({"metric": metric}, checkpoint=checkpoint)

For more information please see https://docs.ray.io/en/master/ray-air/key-concepts.html#session

  DeprecationWarning,
== Status ==
Current time: 2022-07-22 15:39:47 (running for 00:00:06.01)
Memory usage on this node: 9.9/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/5.52 GiB heap, 0.0/2.0 GiB objects
Current best trial: 1e575_00000 with loss=0.6535282890948189 and parameters={'mean': 1, 'sd': 0.6540704916919089}
Result logdir: /Users/kai/ray_results/objective_2022-07-22_15-39-35
Number of trials: 5/5 (5 TERMINATED)
Trial name             status      loc              mean        sd  iter  total time (s)      loss
objective_1e575_00000  TERMINATED  127.0.0.1:47932     1  0.65407     30        0.203522  0.653528
objective_1e575_00001  TERMINATED  127.0.0.1:47941     2  0.72087     30        0.314281   1.14091
objective_1e575_00002  TERMINATED  127.0.0.1:47942     3  0.680016    30        0.43947    2.11278
objective_1e575_00003  TERMINATED  127.0.0.1:47943     4  0.296117    30        0.442453   4.33397
objective_1e575_00004  TERMINATED  127.0.0.1:47944     5  0.358219    30        0.362729   5.41971


2022-07-22 15:39:41,596	INFO plugin_schema_manager.py:52 -- Loading the default runtime env schemas: ['/Users/kai/coding/ray/python/ray/_private/runtime_env/../../runtime_env/schemas/working_dir_schema.json', '/Users/kai/coding/ray/python/ray/_private/runtime_env/../../runtime_env/schemas/pip_schema.json'].
Result for objective_1e575_00000:
  date: 2022-07-22_15-39-44
  done: false
  experiment_id: 60ffbe63fc834195a37fabc078985531
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 1
  loss: 0.4005309978356091
  node_ip: 127.0.0.1
  pid: 47932
  time_since_restore: 0.0001418590545654297
  time_this_iter_s: 0.0001418590545654297
  time_total_s: 0.0001418590545654297
  timestamp: 1658500784
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 1e575_00000
  warmup_time: 0.002913236618041992
  
Result for objective_1e575_00000:
  date: 2022-07-22_15-39-44
  done: true
  experiment_id: 60ffbe63fc834195a37fabc078985531
  experiment_tag: 0_mean=1,sd=0.6541
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 30
  loss: 0.6535282890948189
  node_ip: 127.0.0.1
  pid: 47932
  time_since_restore: 0.203521728515625
  time_this_iter_s: 0.003339052200317383
  time_total_s: 0.203521728515625
  timestamp: 1658500784
  timesteps_since_restore: 0
  training_iteration: 30
  trial_id: 1e575_00000
  warmup_time: 0.002913236618041992
  
Result for objective_1e575_00002:
  date: 2022-07-22_15-39-46
  done: false
  experiment_id: c812a92f07134341a2908abc6e315061
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 1
  loss: 2.7700164667438716
  node_ip: 127.0.0.1
  pid: 47942
  time_since_restore: 0.00013971328735351562
  time_this_iter_s: 0.00013971328735351562
  time_total_s: 0.00013971328735351562
  timestamp: 1658500786
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 1e575_00002
  warmup_time: 0.002918720245361328
  
Result for objective_1e575_00003:
  date: 2022-07-22_15-39-46
  done: false
  experiment_id: b97d28ec439342ae8dd7c7fa4ac4ccca
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 1
  loss: 3.895346250529465
  node_ip: 127.0.0.1
  pid: 47943
  time_since_restore: 0.00013494491577148438
  time_this_iter_s: 0.00013494491577148438
  time_total_s: 0.00013494491577148438
  timestamp: 1658500786
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 1e575_00003
  warmup_time: 0.0031499862670898438
  
Result for objective_1e575_00001:
  date: 2022-07-22_15-39-46
  done: false
  experiment_id: 7034e40ba23f495eb6974ad5bda1406d
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 1
  loss: 1.8250068029519693
  node_ip: 127.0.0.1
  pid: 47941
  time_since_restore: 0.00015974044799804688
  time_this_iter_s: 0.00015974044799804688
  time_total_s: 0.00015974044799804688
  timestamp: 1658500786
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 1e575_00001
  warmup_time: 0.0026862621307373047
  
Result for objective_1e575_00004:
  date: 2022-07-22_15-39-46
  done: false
  experiment_id: 6b7bf17ee7444b22b809897292864e19
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 1
  loss: 5.098807619369106
  node_ip: 127.0.0.1
  pid: 47944
  time_since_restore: 0.00012803077697753906
  time_this_iter_s: 0.00012803077697753906
  time_total_s: 0.00012803077697753906
  timestamp: 1658500786
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 1e575_00004
  warmup_time: 0.002666950225830078
  
Result for objective_1e575_00002:
  date: 2022-07-22_15-39-47
  done: true
  experiment_id: c812a92f07134341a2908abc6e315061
  experiment_tag: 2_mean=3,sd=0.6800
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 30
  loss: 2.1127773612837975
  node_ip: 127.0.0.1
  pid: 47942
  time_since_restore: 0.4394698143005371
  time_this_iter_s: 0.005173921585083008
  time_total_s: 0.4394698143005371
  timestamp: 1658500787
  timesteps_since_restore: 0
  training_iteration: 30
  trial_id: 1e575_00002
  warmup_time: 0.002918720245361328
  
Result for objective_1e575_00001:
  date: 2022-07-22_15-39-47
  done: true
  experiment_id: 7034e40ba23f495eb6974ad5bda1406d
  experiment_tag: 1_mean=2,sd=0.7209
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 30
  loss: 1.1409060371452806
  node_ip: 127.0.0.1
  pid: 47941
  time_since_restore: 0.31428098678588867
  time_this_iter_s: 0.008217096328735352
  time_total_s: 0.31428098678588867
  timestamp: 1658500787
  timesteps_since_restore: 0
  training_iteration: 30
  trial_id: 1e575_00001
  warmup_time: 0.0026862621307373047
  
Result for objective_1e575_00003:
  date: 2022-07-22_15-39-47
  done: true
  experiment_id: b97d28ec439342ae8dd7c7fa4ac4ccca
  experiment_tag: 3_mean=4,sd=0.2961
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 30
  loss: 4.333967406156947
  node_ip: 127.0.0.1
  pid: 47943
  time_since_restore: 0.44245290756225586
  time_this_iter_s: 0.005827903747558594
  time_total_s: 0.44245290756225586
  timestamp: 1658500787
  timesteps_since_restore: 0
  training_iteration: 30
  trial_id: 1e575_00003
  warmup_time: 0.0031499862670898438
  
Result for objective_1e575_00004:
  date: 2022-07-22_15-39-47
  done: true
  experiment_id: 6b7bf17ee7444b22b809897292864e19
  experiment_tag: 4_mean=5,sd=0.3582
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 30
  loss: 5.419707275520466
  node_ip: 127.0.0.1
  pid: 47944
  time_since_restore: 0.3627290725708008
  time_this_iter_s: 0.006065845489501953
  time_total_s: 0.3627290725708008
  timestamp: 1658500787
  timesteps_since_restore: 0
  training_iteration: 30
  trial_id: 1e575_00004
  warmup_time: 0.002666950225830078
  
2022-07-22 15:39:47,478	INFO tune.py:738 -- Total run time: 6.95 seconds (6.00 seconds for the tuning loop).
== Status ==
Current time: 2022-07-22 15:39:53 (running for 00:00:05.64)
Memory usage on this node: 9.8/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/5.52 GiB heap, 0.0/2.0 GiB objects
Current best trial: 227e1_00000 with loss=1.4158135642199134 and parameters={'mean': 1, 'sd': 0.35625806806413973, 'wandb': {'api_key_file': '/var/folders/b2/0_91bd757rz02lrmr920v0gw0000gn/T/tmp9qec20eq', 'project': 'Wandb_example'}}
Result logdir: /Users/kai/ray_results/objective_2022-07-22_15-39-47
Number of trials: 5/5 (5 TERMINATED)
Trial name             status      loc              mean        sd  iter  total time (s)     loss
objective_227e1_00000  TERMINATED  127.0.0.1:47968     1  0.356258    30       0.0869601  1.41581
objective_227e1_00001  TERMINATED  127.0.0.1:47973     2  0.411041    30       0.371924   2.9165
objective_227e1_00002  TERMINATED  127.0.0.1:47974     3  0.359191    30       0.305055   2.57809
objective_227e1_00003  TERMINATED  127.0.0.1:47975     4  0.543202    30       0.218044   5.06532
objective_227e1_00004  TERMINATED  127.0.0.1:47976     5  0.777638    30       0.287682   6.36554


Result for objective_227e1_00000:
  date: 2022-07-22_15-39-50
  done: false
  experiment_id: e80ef3e4843c41068c733322d48e0817
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 1
  loss: 0.27641082730463906
  node_ip: 127.0.0.1
  pid: 47968
  time_since_restore: 0.0001361370086669922
  time_this_iter_s: 0.0001361370086669922
  time_total_s: 0.0001361370086669922
  timestamp: 1658500790
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 227e1_00000
  warmup_time: 0.003004789352416992
  
Result for objective_227e1_00000:
  date: 2022-07-22_15-39-50
  done: true
  experiment_id: e80ef3e4843c41068c733322d48e0817
  experiment_tag: 0_mean=1,sd=0.3563
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 30
  loss: 1.4158135642199134
  node_ip: 127.0.0.1
  pid: 47968
  time_since_restore: 0.0869600772857666
  time_this_iter_s: 0.0022199153900146484
  time_total_s: 0.0869600772857666
  timestamp: 1658500790
  timesteps_since_restore: 0
  training_iteration: 30
  trial_id: 227e1_00000
  warmup_time: 0.003004789352416992
  
Result for objective_227e1_00001:
  date: 2022-07-22_15-39-52
  done: false
  experiment_id: bf0685a616354a02af154ac3601a2109
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 1
  loss: 2.058177604134134
  node_ip: 127.0.0.1
  pid: 47973
  time_since_restore: 0.00015783309936523438
  time_this_iter_s: 0.00015783309936523438
  time_total_s: 0.00015783309936523438
  timestamp: 1658500792
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 227e1_00001
  warmup_time: 0.0029697418212890625
  
Result for objective_227e1_00004:
  date: 2022-07-22_15-39-52
  done: false
  experiment_id: 1f45d26f052c443d8a4aef3279f4e29e
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 1
  loss: 5.383672927239436
  node_ip: 127.0.0.1
  pid: 47976
  time_since_restore: 0.00013184547424316406
  time_this_iter_s: 0.00013184547424316406
  time_total_s: 0.00013184547424316406
  timestamp: 1658500792
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 227e1_00004
  warmup_time: 0.0028159618377685547
  
Result for objective_227e1_00003:
  date: 2022-07-22_15-39-52
  done: false
  experiment_id: c4b18bff67ec45939614ad8b66cecb8c
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 1
  loss: 2.6242029842903367
  node_ip: 127.0.0.1
  pid: 47975
  time_since_restore: 0.00014901161193847656
  time_this_iter_s: 0.00014901161193847656
  time_total_s: 0.00014901161193847656
  timestamp: 1658500792
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 227e1_00003
  warmup_time: 0.0026941299438476562
  
Result for objective_227e1_00002:
  date: 2022-07-22_15-39-52
  done: false
  experiment_id: b84e7701625e49ef8056680eb616b611
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 1
  loss: 3.2091889147367088
  node_ip: 127.0.0.1
  pid: 47974
  time_since_restore: 0.00016427040100097656
  time_this_iter_s: 0.00016427040100097656
  time_total_s: 0.00016427040100097656
  timestamp: 1658500792
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 227e1_00002
  warmup_time: 0.0029571056365966797
  
Result for objective_227e1_00003:
  date: 2022-07-22_15-39-53
  done: true
  experiment_id: c4b18bff67ec45939614ad8b66cecb8c
  experiment_tag: 3_mean=4,sd=0.5432
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 30
  loss: 5.065320265027247
  node_ip: 127.0.0.1
  pid: 47975
  time_since_restore: 0.21804404258728027
  time_this_iter_s: 0.011553049087524414
  time_total_s: 0.21804404258728027
  timestamp: 1658500793
  timesteps_since_restore: 0
  training_iteration: 30
  trial_id: 227e1_00003
  warmup_time: 0.0026941299438476562
  
Result for objective_227e1_00002:
  date: 2022-07-22_15-39-53
  done: true
  experiment_id: b84e7701625e49ef8056680eb616b611
  experiment_tag: 2_mean=3,sd=0.3592
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 30
  loss: 2.578088712628635
  node_ip: 127.0.0.1
  pid: 47974
  time_since_restore: 0.3050551414489746
  time_this_iter_s: 0.005466938018798828
  time_total_s: 0.3050551414489746
  timestamp: 1658500793
  timesteps_since_restore: 0
  training_iteration: 30
  trial_id: 227e1_00002
  warmup_time: 0.0029571056365966797
  
Result for objective_227e1_00001:
  date: 2022-07-22_15-39-53
  done: true
  experiment_id: bf0685a616354a02af154ac3601a2109
  experiment_tag: 1_mean=2,sd=0.4110
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 30
  loss: 2.9165001549045844
  node_ip: 127.0.0.1
  pid: 47973
  time_since_restore: 0.37192392349243164
  time_this_iter_s: 0.007360935211181641
  time_total_s: 0.37192392349243164
  timestamp: 1658500793
  timesteps_since_restore: 0
  training_iteration: 30
  trial_id: 227e1_00001
  warmup_time: 0.0029697418212890625
  
Result for objective_227e1_00004:
  date: 2022-07-22_15-39-53
  done: true
  experiment_id: 1f45d26f052c443d8a4aef3279f4e29e
  experiment_tag: 4_mean=5,sd=0.7776
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 30
  loss: 6.365540480426036
  node_ip: 127.0.0.1
  pid: 47976
  time_since_restore: 0.28768181800842285
  time_this_iter_s: 0.003290891647338867
  time_total_s: 0.28768181800842285
  timestamp: 1658500793
  timesteps_since_restore: 0
  training_iteration: 30
  trial_id: 227e1_00004
  warmup_time: 0.0028159618377685547
  
2022-07-22 15:39:53,254	INFO tune.py:738 -- Total run time: 5.76 seconds (5.63 seconds for the tuning loop).
== Status ==
Current time: 2022-07-22 15:39:59 (running for 00:00:06.06)
Memory usage on this node: 10.1/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/5.52 GiB heap, 0.0/2.0 GiB objects
Current best trial: 25f04_00000 with loss=0.9941371354505734 and parameters={'mean': 1, 'sd': 0.5245309522439918}
Result logdir: /Users/kai/ray_results/WandbTrainable_2022-07-22_15-39-53
Number of trials: 5/5 (5 ERROR)
Trial name                  status  loc              mean        sd  iter  total time (s)      loss
WandbTrainable_25f04_00000  ERROR   127.0.0.1:47994     1  0.524531     1     0.000827789  0.994137
WandbTrainable_25f04_00001  ERROR   127.0.0.1:48005     2  0.515265     1     0.00108528   2.31254
WandbTrainable_25f04_00002  ERROR   127.0.0.1:48006     3  0.56327      1     0.00111198   3.43952
WandbTrainable_25f04_00003  ERROR   127.0.0.1:48007     4  0.507054     1     0.000993013  4.53341
WandbTrainable_25f04_00004  ERROR   127.0.0.1:48008     5  0.372142     1     0.000849962  5.13408

Number of errored trials: 5
Trial name                  # failures  error file
WandbTrainable_25f04_00000           1  /Users/kai/ray_results/WandbTrainable_2022-07-22_15-39-53/WandbTrainable_25f04_00000_0_mean=1,sd=0.5245_2022-07-22_15-39-53/error.txt
WandbTrainable_25f04_00001           1  /Users/kai/ray_results/WandbTrainable_2022-07-22_15-39-53/WandbTrainable_25f04_00001_1_mean=2,sd=0.5153_2022-07-22_15-39-56/error.txt
WandbTrainable_25f04_00002           1  /Users/kai/ray_results/WandbTrainable_2022-07-22_15-39-53/WandbTrainable_25f04_00002_2_mean=3,sd=0.5633_2022-07-22_15-39-56/error.txt
WandbTrainable_25f04_00003           1  /Users/kai/ray_results/WandbTrainable_2022-07-22_15-39-53/WandbTrainable_25f04_00003_3_mean=4,sd=0.5071_2022-07-22_15-39-56/error.txt
WandbTrainable_25f04_00004           1  /Users/kai/ray_results/WandbTrainable_2022-07-22_15-39-53/WandbTrainable_25f04_00004_4_mean=5,sd=0.3721_2022-07-22_15-39-56/error.txt

2022-07-22 15:39:56,146	ERROR trial_runner.py:921 -- Trial WandbTrainable_25f04_00000: Error processing event.
ray.exceptions.RayTaskError(NotImplementedError): ray::WandbTrainable.save() (pid=47994, ip=127.0.0.1, repr=<__main__.WandbTrainable object at 0x11052de10>)
  File "/Users/kai/coding/ray/python/ray/tune/trainable/trainable.py", line 449, in save
    checkpoint_dict_or_path = self.save_checkpoint(checkpoint_dir)
  File "/Users/kai/coding/ray/python/ray/tune/trainable/trainable.py", line 1014, in save_checkpoint
    raise NotImplementedError
NotImplementedError
Result for WandbTrainable_25f04_00000:
  date: 2022-07-22_15-39-56
  done: true
  experiment_id: c0ac6bf4f2af45368a3c5c3e14e47115
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 1
  loss: 0.9941371354505734
  node_ip: 127.0.0.1
  pid: 47994
  time_since_restore: 0.000827789306640625
  time_this_iter_s: 0.000827789306640625
  time_total_s: 0.000827789306640625
  timestamp: 1658500796
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 25f04_00000
  warmup_time: 0.0031821727752685547
  
Result for WandbTrainable_25f04_00000:
  date: 2022-07-22_15-39-56
  done: true
  experiment_id: c0ac6bf4f2af45368a3c5c3e14e47115
  experiment_tag: 0_mean=1,sd=0.5245
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 1
  loss: 0.9941371354505734
  node_ip: 127.0.0.1
  pid: 47994
  time_since_restore: 0.000827789306640625
  time_this_iter_s: 0.000827789306640625
  time_total_s: 0.000827789306640625
  timestamp: 1658500796
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 25f04_00000
  warmup_time: 0.0031821727752685547
  
Result for WandbTrainable_25f04_00002:
  date: 2022-07-22_15-39-59
  done: true
  experiment_id: b4174fe95248493e8dedfcbc67549339
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 1
  loss: 3.4395203958985836
  node_ip: 127.0.0.1
  pid: 48006
  time_since_restore: 0.0011119842529296875
  time_this_iter_s: 0.0011119842529296875
  time_total_s: 0.0011119842529296875
  timestamp: 1658500799
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 25f04_00002
  warmup_time: 0.004413127899169922
  
2022-07-22 15:39:59,299	ERROR trial_runner.py:921 -- Trial WandbTrainable_25f04_00002: Error processing event.
ray.exceptions.RayTaskError(NotImplementedError): ray::WandbTrainable.save() (pid=48006, ip=127.0.0.1, repr=<__main__.WandbTrainable object at 0x11a54c8d0>)
  File "/Users/kai/coding/ray/python/ray/tune/trainable/trainable.py", line 449, in save
    checkpoint_dict_or_path = self.save_checkpoint(checkpoint_dir)
  File "/Users/kai/coding/ray/python/ray/tune/trainable/trainable.py", line 1014, in save_checkpoint
    raise NotImplementedError
NotImplementedError
2022-07-22 15:39:59,305	ERROR trial_runner.py:921 -- Trial WandbTrainable_25f04_00004: Error processing event.
ray.exceptions.RayTaskError(NotImplementedError): ray::WandbTrainable.save() (pid=48008, ip=127.0.0.1, repr=<__main__.WandbTrainable object at 0x11c314d90>)
  File "/Users/kai/coding/ray/python/ray/tune/trainable/trainable.py", line 449, in save
    checkpoint_dict_or_path = self.save_checkpoint(checkpoint_dir)
  File "/Users/kai/coding/ray/python/ray/tune/trainable/trainable.py", line 1014, in save_checkpoint
    raise NotImplementedError
NotImplementedError
2022-07-22 15:39:59,310	ERROR trial_runner.py:921 -- Trial WandbTrainable_25f04_00001: Error processing event.
ray.exceptions.RayTaskError(NotImplementedError): ray::WandbTrainable.save() (pid=48005, ip=127.0.0.1, repr=<__main__.WandbTrainable object at 0x10e56fb90>)
  File "/Users/kai/coding/ray/python/ray/tune/trainable/trainable.py", line 449, in save
    checkpoint_dict_or_path = self.save_checkpoint(checkpoint_dir)
  File "/Users/kai/coding/ray/python/ray/tune/trainable/trainable.py", line 1014, in save_checkpoint
    raise NotImplementedError
NotImplementedError
2022-07-22 15:39:59,324	ERROR trial_runner.py:921 -- Trial WandbTrainable_25f04_00003: Error processing event.
ray.exceptions.RayTaskError(NotImplementedError): ray::WandbTrainable.save() (pid=48007, ip=127.0.0.1, repr=<__main__.WandbTrainable object at 0x10b49ee50>)
  File "/Users/kai/coding/ray/python/ray/tune/trainable/trainable.py", line 449, in save
    checkpoint_dict_or_path = self.save_checkpoint(checkpoint_dir)
  File "/Users/kai/coding/ray/python/ray/tune/trainable/trainable.py", line 1014, in save_checkpoint
    raise NotImplementedError
NotImplementedError
Result for WandbTrainable_25f04_00001:
  date: 2022-07-22_15-39-59
  done: true
  experiment_id: b0920f67a88f4993b7ec85dee2f78022
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 1
  loss: 2.3125440070079093
  node_ip: 127.0.0.1
  pid: 48005
  time_since_restore: 0.0010852813720703125
  time_this_iter_s: 0.0010852813720703125
  time_total_s: 0.0010852813720703125
  timestamp: 1658500799
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 25f04_00001
  warmup_time: 0.0049626827239990234
  
Result for WandbTrainable_25f04_00004:
  date: 2022-07-22_15-39-59
  done: true
  experiment_id: 4435b2105eb24fbaba4778e33ce2e1a9
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 1
  loss: 5.134083536061109
  node_ip: 127.0.0.1
  pid: 48008
  time_since_restore: 0.0008499622344970703
  time_this_iter_s: 0.0008499622344970703
  time_total_s: 0.0008499622344970703
  timestamp: 1658500799
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 25f04_00004
  warmup_time: 0.0031480789184570312
  
Result for WandbTrainable_25f04_00002:
  date: 2022-07-22_15-39-59
  done: true
  experiment_id: b4174fe95248493e8dedfcbc67549339
  experiment_tag: 2_mean=3,sd=0.5633
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 1
  loss: 3.4395203958985836
  node_ip: 127.0.0.1
  pid: 48006
  time_since_restore: 0.0011119842529296875
  time_this_iter_s: 0.0011119842529296875
  time_total_s: 0.0011119842529296875
  timestamp: 1658500799
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 25f04_00002
  warmup_time: 0.004413127899169922
  
Result for WandbTrainable_25f04_00004:
  date: 2022-07-22_15-39-59
  done: true
  experiment_id: 4435b2105eb24fbaba4778e33ce2e1a9
  experiment_tag: 4_mean=5,sd=0.3721
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 1
  loss: 5.134083536061109
  node_ip: 127.0.0.1
  pid: 48008
  time_since_restore: 0.0008499622344970703
  time_this_iter_s: 0.0008499622344970703
  time_total_s: 0.0008499622344970703
  timestamp: 1658500799
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 25f04_00004
  warmup_time: 0.0031480789184570312
  
Result for WandbTrainable_25f04_00001:
  date: 2022-07-22_15-39-59
  done: true
  experiment_id: b0920f67a88f4993b7ec85dee2f78022
  experiment_tag: 1_mean=2,sd=0.5153
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 1
  loss: 2.3125440070079093
  node_ip: 127.0.0.1
  pid: 48005
  time_since_restore: 0.0010852813720703125
  time_this_iter_s: 0.0010852813720703125
  time_total_s: 0.0010852813720703125
  timestamp: 1658500799
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 25f04_00001
  warmup_time: 0.0049626827239990234
  
Result for WandbTrainable_25f04_00003:
  date: 2022-07-22_15-39-59
  done: true
  experiment_id: a667aef035a1475a883c166a014b756c
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 1
  loss: 4.533407187147774
  node_ip: 127.0.0.1
  pid: 48007
  time_since_restore: 0.0009930133819580078
  time_this_iter_s: 0.0009930133819580078
  time_total_s: 0.0009930133819580078
  timestamp: 1658500799
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 25f04_00003
  warmup_time: 0.0036199092864990234
  
Result for WandbTrainable_25f04_00003:
  date: 2022-07-22_15-39-59
  done: true
  experiment_id: a667aef035a1475a883c166a014b756c
  experiment_tag: 3_mean=4,sd=0.5071
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 1
  loss: 4.533407187147774
  node_ip: 127.0.0.1
  pid: 48007
  time_since_restore: 0.0009930133819580078
  time_this_iter_s: 0.0009930133819580078
  time_total_s: 0.0009930133819580078
  timestamp: 1658500799
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: 25f04_00003
  warmup_time: 0.0036199092864990234
  
2022-07-22 15:39:59,455	ERROR tune.py:733 -- Trials did not complete: [WandbTrainable_25f04_00000, WandbTrainable_25f04_00001, WandbTrainable_25f04_00002, WandbTrainable_25f04_00003, WandbTrainable_25f04_00004]
2022-07-22 15:39:59,456	INFO tune.py:738 -- Total run time: 6.18 seconds (6.04 seconds for the tuning loop).

This completes our Tune and Wandb walk-through. In the following sections you can find more details on the API of the Tune-Wandb integration.

Tune Wandb API Reference

WandbLoggerCallback

class ray.air.callbacks.wandb.WandbLoggerCallback(project: Optional[str] = None, group: Optional[str] = None, api_key_file: Optional[str] = None, api_key: Optional[str] = None, excludes: Optional[List[str]] = None, log_config: bool = False, save_checkpoints: bool = False, **kwargs)[source]

Weights and Biases (https://www.wandb.ai/) is a tool for experiment tracking, model optimization, and dataset versioning. This Ray Tune LoggerCallback sends metrics to Wandb for automatic tracking and visualization.

Parameters
  • project – Name of the Wandb project. Mandatory.

  • group – Name of the Wandb group. Defaults to the trainable name.

  • api_key_file – Path to file containing the Wandb API KEY. This file only needs to be present on the node running the Tune script if using the WandbLogger.

  • api_key – Wandb API Key. Alternative to setting api_key_file.

  • excludes – List of metrics that should be excluded from the log.

  • log_config – Boolean indicating if the config parameter of the results dict should be logged. This makes sense if parameters will change during training, e.g. with PopulationBasedTraining. Defaults to False.

  • save_checkpoints – If True, model checkpoints will be saved to Wandb as artifacts. Defaults to False.

  • **kwargs – The keyword arguments will be passed to wandb.init().

Wandb’s group, run_id and run_name are automatically selected by Tune, but can be overwritten by filling out the respective configuration values.
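
For example, to place all runs of an experiment in a custom group, you can pass group explicitly (a sketch; the group name and file path here are made up):

```python
from ray.air.callbacks.wandb import WandbLoggerCallback

# "my_custom_group" overrides the trainable-name default; any extra
# keyword arguments are forwarded to wandb.init().
callback = WandbLoggerCallback(
    project="Wandb_example",
    group="my_custom_group",
    api_key_file="/path/to/file",
)
```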

Please see here for all other valid configuration settings: https://docs.wandb.ai/library/init

Example:

from ray.tune.logger import DEFAULT_LOGGERS
from ray.air.callbacks.wandb import WandbLoggerCallback
tune.run(
    train_fn,
    config={
        # define search space here
        "parameter_1": tune.choice([1, 2, 3]),
        "parameter_2": tune.choice([4, 5, 6]),
    },
    callbacks=[WandbLoggerCallback(
        project="Optimization_Project",
        api_key_file="/path/to/file",
        log_config=True)])

Wandb-Mixin

ray.tune.integration.wandb.wandb_mixin(func: Callable)[source]

Weights and Biases (https://www.wandb.ai/) is a tool for experiment tracking, model optimization, and dataset versioning. This Ray Tune Trainable mixin helps initialize the Wandb API for use with the Trainable class or with @wandb_mixin for the function API.

For basic usage, just prepend your training function with the @wandb_mixin decorator:

from ray.tune.integration.wandb import wandb_mixin

@wandb_mixin
def train_fn(config):
    wandb.log()

Wandb configuration is done by passing a wandb key to the param_space parameter of tune.Tuner() (see example below).

The content of the wandb config entry is passed to wandb.init() as keyword arguments. The exceptions are the following settings, which are used to configure the WandbTrainableMixin itself:

Parameters
  • api_key_file – Path to file containing the Wandb API KEY. This file must be on all nodes if using the wandb_mixin.

  • api_key – Wandb API Key. Alternative to setting api_key_file.

Wandb’s group, run_id and run_name are automatically selected by Tune, but can be overwritten by filling out the respective configuration values.
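
For instance, the group can be overridden through the wandb entry of the param_space (a sketch; the group name and file path are made up):

```python
from ray import tune

param_space = {
    "mean": tune.grid_search([1, 2, 3]),
    "wandb": {
        "api_key_file": "/path/to/file",
        "project": "Wandb_example",
        # Forwarded to wandb.init(), overriding Tune's auto-selected group:
        "group": "my_custom_group",
    },
}
```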

Please see here for all other valid configuration settings: https://docs.wandb.ai/library/init

Example:

from ray import tune
from ray.tune.integration.wandb import wandb_mixin

@wandb_mixin
def train_fn(config):
    for i in range(10):
        loss = config["a"] + config["b"]
        wandb.log({"loss": loss})
    tune.report(loss=loss, done=True)

tuner = tune.Tuner(
    train_fn,
    param_space={
        # define search space here
        "a": tune.choice([1, 2, 3]),
        "b": tune.choice([4, 5, 6]),
        # wandb configuration
        "wandb": {
            "project": "Optimization_Project",
            "api_key_file": "/path/to/file"
        }
    })
tuner.fit()