Logging results and uploading models to Comet ML

In this example, we train a simple XGBoost model and log the training results to Comet ML. We also save the resulting model checkpoints as artifacts.

Let’s start with installing our dependencies:

!pip install -qU "ray[tune]" scikit-learn xgboost_ray comet_ml

Then we need some imports:

import ray

from ray.air import RunConfig
from ray.air.result import Result
from ray.train.xgboost import XGBoostTrainer
from ray.air.callbacks.comet import CometLoggerCallback

We define a simple function that returns our training dataset as a Ray Dataset:

def get_train_dataset() -> ray.data.Dataset:
    dataset = ray.data.read_csv("s3://air-example-data/breast_cancer.csv")
    return dataset

Now we define a simple training function. All the magic happens within the CometLoggerCallback:

CometLoggerCallback(
    project_name=comet_project,
    save_checkpoints=True,
)

It will automatically log all results to Comet ML and upload the checkpoints as artifacts. It assumes you’re logged in to Comet via an API key or your ~/.comet.config file.
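Before starting a run, it can help to confirm that credentials are actually discoverable. The following is a minimal sketch, not part of Ray or Comet: the helper name `comet_credentials_available` is hypothetical, and it only checks the `COMET_API_KEY` environment variable and a `.comet.config` file in the home directory. It does not verify that the key itself is valid.

```python
import os
from pathlib import Path


def comet_credentials_available() -> bool:
    """Rough check: is an API key set, or does a config file exist?

    Hypothetical helper for illustration only. It does not contact Comet,
    so a present but invalid key still returns True.
    """
    has_env_key = bool(os.environ.get("COMET_API_KEY"))
    has_config_file = (Path.home() / ".comet.config").is_file()
    return has_env_key or has_config_file


print(comet_credentials_available())
```

If this prints `False`, set `COMET_API_KEY` in your environment or create the config file before calling the training function.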

def train_model(train_dataset: ray.data.Dataset, comet_project: str) -> Result:
    """Train a simple XGBoost model and return the result."""
    trainer = XGBoostTrainer(
        scaling_config={"num_workers": 2},
        params={"tree_method": "auto"},
        label_column="target",
        datasets={"train": train_dataset},
        num_boost_round=10,
        run_config=RunConfig(
            callbacks=[
                # This is the part needed to enable logging to Comet ML.
                # It assumes Comet ML can find a valid API key (e.g. by setting
                # the ``COMET_API_KEY`` environment variable).
                CometLoggerCallback(
                    project_name=comet_project,
                    save_checkpoints=True,
                )
            ]
        ),
    )
    result = trainer.fit()
    return result

Let’s kick off a run:

comet_project = "ray_air_example"

train_dataset = get_train_dataset()
result = train_model(train_dataset=train_dataset, comet_project=comet_project)
2022-05-19 15:19:17,237	INFO services.py:1483 -- View the Ray dashboard at http://127.0.0.1:8265
== Status ==
Current time: 2022-05-19 15:19:35 (running for 00:00:14.95)
Memory usage on this node: 10.2/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/5.12 GiB heap, 0.0/2.0 GiB objects
Result logdir: /Users/kai/ray_results/XGBoostTrainer_2022-05-19_15-19-19
Number of trials: 1/1 (1 TERMINATED)
Trial name                  status      loc              iter  total time (s)  train-rmse
XGBoostTrainer_ac544_00000  TERMINATED  127.0.0.1:19852  10    9.7203          0.030717


COMET WARNING: As you are running in a Jupyter environment, you will need to call `experiment.end()` when finished to ensure all metrics and code are logged before exiting.
(raylet) 2022-05-19 15:19:21,584	INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=61222 --object-store-name=/tmp/ray/session_2022-05-19_15-19-14_632568_19778/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_15-19-14_632568_19778/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=62873 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:61938 --redis-password=5241590000000000 --startup-token=16 --runtime-env-hash=-2010331134
COMET INFO: Experiment is live on comet.ml https://www.comet.ml/krfricke/ray-air-example/ecd3726ca127497ba7386003a249fad6

COMET WARNING: Failed to add tag(s) None to the experiment

COMET WARNING: Empty mapping given to log_params({}); ignoring
(GBDTTrainable pid=19852) UserWarning: Dataset 'train' has 1 blocks, which is less than the `num_workers` 2. This dataset will be automatically repartitioned to 2 blocks.
(raylet) 2022-05-19 15:19:24,628	INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=61222 --object-store-name=/tmp/ray/session_2022-05-19_15-19-14_632568_19778/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_15-19-14_632568_19778/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=62873 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:61938 --redis-password=5241590000000000 --startup-token=17 --runtime-env-hash=-2010331069
(GBDTTrainable pid=19852) 2022-05-19 15:19:25,961	INFO main.py:980 -- [RayXGBoost] Created 2 new actors (2 total actors). Waiting until actors are ready for training.
(raylet) 2022-05-19 15:19:26,830	INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=61222 --object-store-name=/tmp/ray/session_2022-05-19_15-19-14_632568_19778/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_15-19-14_632568_19778/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=62873 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:61938 --redis-password=5241590000000000 --startup-token=18 --runtime-env-hash=-2010331069
(raylet) 2022-05-19 15:19:26,918	INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=61222 --object-store-name=/tmp/ray/session_2022-05-19_15-19-14_632568_19778/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_15-19-14_632568_19778/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=62873 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:61938 --redis-password=5241590000000000 --startup-token=20 --runtime-env-hash=-2010331134
(raylet) 2022-05-19 15:19:26,922	INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=61222 --object-store-name=/tmp/ray/session_2022-05-19_15-19-14_632568_19778/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_15-19-14_632568_19778/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=62873 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:61938 --redis-password=5241590000000000 --startup-token=21 --runtime-env-hash=-2010331134
(raylet) 2022-05-19 15:19:26,922	INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=61222 --object-store-name=/tmp/ray/session_2022-05-19_15-19-14_632568_19778/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_15-19-14_632568_19778/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=62873 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:61938 --redis-password=5241590000000000 --startup-token=22 --runtime-env-hash=-2010331134
(raylet) 2022-05-19 15:19:26,923	INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=61222 --object-store-name=/tmp/ray/session_2022-05-19_15-19-14_632568_19778/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_15-19-14_632568_19778/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=62873 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:61938 --redis-password=5241590000000000 --startup-token=19 --runtime-env-hash=-2010331134
(GBDTTrainable pid=19852) 2022-05-19 15:19:29,272	INFO main.py:1025 -- [RayXGBoost] Starting XGBoost training.
(_RemoteRayXGBoostActor pid=19876) [15:19:29] task [xgboost.ray]:4505889744 got new rank 1
(_RemoteRayXGBoostActor pid=19875) [15:19:29] task [xgboost.ray]:6941849424 got new rank 0
COMET WARNING: The given value of the metric episodes_total was None; ignoring
COMET WARNING: The given value of the metric timesteps_total was None; ignoring
COMET INFO: Artifact 'checkpoint_XGBoostTrainer_ac544_00000' version 1.0.0 created
Result for XGBoostTrainer_ac544_00000:
  date: 2022-05-19_15-19-30
  done: false
  experiment_id: d3007bd6a2734b328fd90385485c5a8d
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 1
  node_ip: 127.0.0.1
  pid: 19852
  should_checkpoint: true
  time_since_restore: 6.529659032821655
  time_this_iter_s: 6.529659032821655
  time_total_s: 6.529659032821655
  timestamp: 1652969970
  timesteps_since_restore: 0
  train-rmse: 0.357284
  training_iteration: 1
  trial_id: ac544_00000
  warmup_time: 0.003961086273193359
  
COMET INFO: Scheduling the upload of 3 assets for a size of 2.48 KB, this can take some time
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:1.0.0' has started uploading asynchronously
COMET WARNING: The given value of the metric episodes_total was None; ignoring
COMET WARNING: The given value of the metric timesteps_total was None; ignoring
COMET INFO: Artifact 'checkpoint_XGBoostTrainer_ac544_00000' version 2.0.0 created (previous was: 1.0.0)
COMET INFO: Scheduling the upload of 3 assets for a size of 3.86 KB, this can take some time
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:2.0.0' has started uploading asynchronously
COMET WARNING: The given value of the metric episodes_total was None; ignoring
COMET WARNING: The given value of the metric timesteps_total was None; ignoring
COMET INFO: Artifact 'checkpoint_XGBoostTrainer_ac544_00000' version 3.0.0 created (previous was: 2.0.0)
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:1.0.0' has been fully uploaded successfully
COMET INFO: Scheduling the upload of 3 assets for a size of 5.31 KB, this can take some time
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:3.0.0' has started uploading asynchronously
COMET WARNING: The given value of the metric episodes_total was None; ignoring
COMET WARNING: The given value of the metric timesteps_total was None; ignoring
COMET INFO: Artifact 'checkpoint_XGBoostTrainer_ac544_00000' version 4.0.0 created (previous was: 3.0.0)
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:2.0.0' has been fully uploaded successfully
COMET INFO: Scheduling the upload of 3 assets for a size of 6.76 KB, this can take some time
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:4.0.0' has started uploading asynchronously
COMET WARNING: The given value of the metric episodes_total was None; ignoring
COMET WARNING: The given value of the metric timesteps_total was None; ignoring
COMET INFO: Artifact 'checkpoint_XGBoostTrainer_ac544_00000' version 5.0.0 created (previous was: 4.0.0)
COMET INFO: Scheduling the upload of 3 assets for a size of 8.21 KB, this can take some time
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:3.0.0' has been fully uploaded successfully
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:5.0.0' has started uploading asynchronously
COMET WARNING: The given value of the metric episodes_total was None; ignoring
COMET WARNING: The given value of the metric timesteps_total was None; ignoring
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:4.0.0' has been fully uploaded successfully
COMET INFO: Artifact 'checkpoint_XGBoostTrainer_ac544_00000' version 6.0.0 created (previous was: 5.0.0)
COMET INFO: Scheduling the upload of 3 assets for a size of 9.87 KB, this can take some time
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:6.0.0' has started uploading asynchronously
COMET WARNING: The given value of the metric episodes_total was None; ignoring
COMET WARNING: The given value of the metric timesteps_total was None; ignoring
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:5.0.0' has been fully uploaded successfully
COMET INFO: Artifact 'checkpoint_XGBoostTrainer_ac544_00000' version 7.0.0 created (previous was: 6.0.0)
COMET INFO: Scheduling the upload of 3 assets for a size of 11.46 KB, this can take some time
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:7.0.0' has started uploading asynchronously
COMET WARNING: The given value of the metric episodes_total was None; ignoring
COMET WARNING: The given value of the metric timesteps_total was None; ignoring
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:6.0.0' has been fully uploaded successfully
COMET INFO: Artifact 'checkpoint_XGBoostTrainer_ac544_00000' version 8.0.0 created (previous was: 7.0.0)
COMET INFO: Scheduling the upload of 3 assets for a size of 12.84 KB, this can take some time
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:8.0.0' has started uploading asynchronously
COMET WARNING: The given value of the metric episodes_total was None; ignoring
COMET WARNING: The given value of the metric timesteps_total was None; ignoring
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:7.0.0' has been fully uploaded successfully
COMET INFO: Artifact 'checkpoint_XGBoostTrainer_ac544_00000' version 9.0.0 created (previous was: 8.0.0)
COMET INFO: Scheduling the upload of 3 assets for a size of 14.36 KB, this can take some time
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:9.0.0' has started uploading asynchronously
COMET WARNING: The given value of the metric episodes_total was None; ignoring
COMET WARNING: The given value of the metric timesteps_total was None; ignoring
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:8.0.0' has been fully uploaded successfully
COMET INFO: Artifact 'checkpoint_XGBoostTrainer_ac544_00000' version 10.0.0 created (previous was: 9.0.0)
COMET INFO: Scheduling the upload of 3 assets for a size of 16.37 KB, this can take some time
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:10.0.0' has started uploading asynchronously
(GBDTTrainable pid=19852) 2022-05-19 15:19:33,890	INFO main.py:1519 -- [RayXGBoost] Finished XGBoost training on training data with total N=569 in 7.96 seconds (4.61 pure XGBoost training time).
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:9.0.0' has been fully uploaded successfully
COMET INFO: Artifact 'checkpoint_XGBoostTrainer_ac544_00000' version 11.0.0 created (previous was: 10.0.0)
COMET INFO: Scheduling the upload of 3 assets for a size of 16.39 KB, this can take some time
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:11.0.0' has started uploading asynchronously
COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/krfricke/ray-air-example/ecd3726ca127497ba7386003a249fad6
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     iterations_since_restore [10] : (1, 10)
COMET INFO:     time_since_restore [10]       : (6.529659032821655, 9.720295906066895)
COMET INFO:     time_this_iter_s [10]         : (0.3124058246612549, 6.529659032821655)
COMET INFO:     time_total_s [10]             : (6.529659032821655, 9.720295906066895)
COMET INFO:     timestamp [10]                : (1652969970, 1652969973)
COMET INFO:     timesteps_since_restore       : 0
COMET INFO:     train-rmse [10]               : (0.030717, 0.357284)
COMET INFO:     training_iteration [10]       : (1, 10)
COMET INFO:     warmup_time                   : 0.003961086273193359
COMET INFO:   Others:
COMET INFO:     Created from  : Ray
COMET INFO:     Name          : XGBoostTrainer_ac544_00000
COMET INFO:     experiment_id : d3007bd6a2734b328fd90385485c5a8d
COMET INFO:     trial_id      : ac544_00000
COMET INFO:   System Information:
COMET INFO:     date     : 2022-05-19_15-19-33
COMET INFO:     hostname : Kais-MacBook-Pro.local
COMET INFO:     node_ip  : 127.0.0.1
COMET INFO:     pid      : 19852
COMET INFO:   Uploads:
COMET INFO:     artifact assets     : 33 (107.92 KB)
COMET INFO:     artifacts           : 11
COMET INFO:     environment details : 1
COMET INFO:     filename            : 1
COMET INFO:     installed packages  : 1
COMET INFO:     notebook            : 1
COMET INFO:     source_code         : 1
COMET INFO: ---------------------------
COMET INFO: Uploading metrics, params, and assets to Comet before program termination (may take several seconds)
COMET INFO: The Python SDK has 3600 seconds to finish before aborting...
COMET INFO: Waiting for completion of the file uploads (may take several seconds)
COMET INFO: The Python SDK has 10800 seconds to finish before aborting...
COMET INFO: Still uploading 6 file(s), remaining 21.05 KB/116.69 KB
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:10.0.0' has been fully uploaded successfully
COMET INFO: Artifact 'krfricke/checkpoint_XGBoostTrainer_ac544_00000:11.0.0' has been fully uploaded successfully
Result for XGBoostTrainer_ac544_00000:
  date: 2022-05-19_15-19-33
  done: true
  experiment_id: d3007bd6a2734b328fd90385485c5a8d
  experiment_tag: '0'
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 10
  node_ip: 127.0.0.1
  pid: 19852
  should_checkpoint: true
  time_since_restore: 9.720295906066895
  time_this_iter_s: 0.39761900901794434
  time_total_s: 9.720295906066895
  timestamp: 1652969973
  timesteps_since_restore: 0
  train-rmse: 0.030717
  training_iteration: 10
  trial_id: ac544_00000
  warmup_time: 0.003961086273193359
  
2022-05-19 15:19:35,621	INFO tune.py:753 -- Total run time: 15.75 seconds (14.94 seconds for the tuning loop).

Check out your Comet ML project to see the results!
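The returned result can also be inspected locally. The sketch below assumes the result object exposes a `metrics` dictionary keyed like the final "Result for ..." block in the output above (`train-rmse`, `training_iteration`). The helper `summarize_result` is a hypothetical name, and a stand-in object is used here so the snippet runs without a Ray cluster.

```python
from types import SimpleNamespace


def summarize_result(result) -> str:
    """Format the final iteration count and training RMSE of a run.

    Hypothetical helper; assumes ``result.metrics`` is a dict (or None).
    """
    metrics = result.metrics or {}
    iterations = metrics.get("training_iteration")
    rmse = metrics.get("train-rmse")
    return f"iterations={iterations}, train-rmse={rmse}"


# Stand-in for the object returned by train_model(), using the final
# values reported in the run output.
fake_result = SimpleNamespace(
    metrics={"training_iteration": 10, "train-rmse": 0.030717}
)
print(summarize_result(fake_result))  # iterations=10, train-rmse=0.030717
```

In the notebook itself you would pass the `result` returned by `train_model` instead of the stand-in.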