Fine-tune a Hugging Face Transformers Model#

This notebook is based on an official Hugging Face example, How to fine-tune a model on text classification. This notebook shows the process of conversion from vanilla HF to Ray Train without changing the training logic unless necessary.

This notebook consists of the following steps:

Set up Ray
Load the dataset
Preprocess the dataset with Ray Data
Run the training with Ray Train
Optionally, share the model with the community

Uncomment and run the following line to install all the necessary dependencies. (This notebook is being tested with transformers==4.19.1.):

#! pip install "datasets" "transformers>=4.19.0" "torch>=1.10.0" "mlflow"

Set up Ray#

Use ray.init() to initialize a local cluster. By default, this cluster contains only the machine you are running this notebook on. You can also run this notebook on an Anyscale cluster.

from pprint import pprint
import ray

ray.init()

Check the resources our cluster is composed of. If you are running this notebook on your local machine or Google Colab, you should see the number of CPU cores and GPUs available on the your machine.

pprint(ray.cluster_resources())

{'CPU': 48.0,
 'GPU': 4.0,
 'accelerator_type:None': 1.0,
 'memory': 206158430208.0,
 'node:10.0.27.125': 1.0,
 'node:__internal_head__': 1.0,
 'object_store_memory': 59052625920.0}

This notebook fine-tunes a HF Transformers model for one of the text classification task of the GLUE Benchmark. It runs the training using Ray Train.

You can change these two variables to control whether the training, which happens later, uses CPUs or GPUs, and how many workers to spawn. Each worker claims one CPU or GPU. Make sure to not request more resources than the resources present. By default, the training runs with one GPU worker.

use_gpu = True  # set this to False to run on CPUs
num_workers = 1  # set this to number of GPUs or CPUs you want to use

Fine-tune a model on a text classification task#

The GLUE Benchmark is a group of nine classification tasks on sentences or pairs of sentences. To learn more, see the original notebook.

Each task has a name that is its acronym, with mnli-mm to indicate that it is a mismatched version of MNLI. Each one has the same training set as mnli but different validation and test sets.

GLUE_TASKS = [
    "cola",
    "mnli",
    "mnli-mm",
    "mrpc",
    "qnli",
    "qqp",
    "rte",
    "sst2",
    "stsb",
    "wnli",
]

This notebook runs on any of the tasks in the list above, with any model checkpoint from the Model Hub as long as that model has a version with a classification head. Depending on the model and the GPU you are using, you might need to adjust the batch size to avoid out-of-memory errors. Set these three parameters, and the rest of the notebook should run smoothly:

task = "cola"
model_checkpoint = "distilbert-base-uncased"
batch_size = 16

Loading the dataset#

Use the HF Datasets library to download the data and get the metric to use for evaluation and to compare your model to the benchmark. You can do this comparison easily with the load_dataset and load_metric functions.

Apart from mnli-mm being special code, you can directly pass the task name to those functions.

Run the normal HF Datasets code to load the dataset from the Hub.

from datasets import load_dataset

actual_task = "mnli" if task == "mnli-mm" else task
datasets = load_dataset("glue", actual_task)

Reusing dataset glue (/home/ray/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)

The dataset object itself is a DatasetDict, which contains one key for the training, validation, and test set, with more keys for the mismatched validation and test set in the special case of mnli.

Preprocessing the data with Ray Data#

Before you can feed these texts to the model, you need to preprocess them. Preprocess them with a HF Transformers’ Tokenizer, which tokenizes the inputs, including converting the tokens to their corresponding IDs in the pretrained vocabulary, and puts them in a format the model expects. It also generates the other inputs that the model requires.

To do all of this preprocessing, instantiate your tokenizer with the AutoTokenizer.from_pretrained method, which ensures that you:

Get a tokenizer that corresponds to the model architecture you want to use.
Download the vocabulary used when pretraining this specific checkpoint.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

Pass use_fast=True to the preceding call to use one of the fast tokenizers, backed by Rust, from the HF Tokenizers library. These fast tokenizers are available for almost all models, but if you get an error with the previous call, remove the argument.

To preprocess the dataset, you need the names of the columns containing the sentence(s). The following dictionary keeps track of the correspondence task to column names:

task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mnli-mm": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

Instead of using HF Dataset objects directly, convert them to Ray Data. Arrow tables back both of them, so the conversion is straightforward. Use the built-in from_huggingface() function.

import ray.data

ray_datasets = {
    "train": ray.data.from_huggingface(datasets["train"]),
    "validation": ray.data.from_huggingface(datasets["validation"]),
    "test": ray.data.from_huggingface(datasets["test"]),
}
ray_datasets

{'train': MaterializedDataset(
    num_blocks=1,
    num_rows=8551,
    schema={sentence: string, label: int64, idx: int32}
 ),
 'validation': MaterializedDataset(
    num_blocks=1,
    num_rows=1043,
    schema={sentence: string, label: int64, idx: int32}
 ),
 'test': MaterializedDataset(
    num_blocks=1,
    num_rows=1063,
    schema={sentence: string, label: int64, idx: int32}
 )}

You can then write the function that preprocesses the samples. Feed them to the tokenizer with the argument truncation=True. This configuration ensures that the tokenizer truncates and pads to the longest sequence in the batch, any input longer than what the model selected can handle.

import numpy as np
from typing import Dict


# Tokenize input sentences
def collate_fn(examples: Dict[str, np.array]):
    sentence1_key, sentence2_key = task_to_keys[task]
    if sentence2_key is None:
        outputs = tokenizer(
            list(examples[sentence1_key]),
            truncation=True,
            padding="longest",
            return_tensors="pt",
        )
    else:
        outputs = tokenizer(
            list(examples[sentence1_key]),
            list(examples[sentence2_key]),
            truncation=True,
            padding="longest",
            return_tensors="pt",
        )

    outputs["labels"] = torch.LongTensor(examples["label"])

    # Move all input tensors to GPU
    for key, value in outputs.items():
        outputs[key] = value.cuda()

    return outputs

Fine-tuning the model with Ray Train#

Now that the data is ready, download the pretrained model and fine-tune it.

Because all of the tasks involve sentence classification, use the AutoModelForSequenceClassification class. For more specifics about each individual training component, see the original notebook. The original notebook uses the same tokenizer used to encode the dataset in this notebook’s preceding example.

The main difference when using Ray Train is that you need to define the training logic as a function (train_func). You pass this training function to the TorchTrainer to on every Ray worker. The training then proceeds using PyTorch DDP.

Note

Be sure to initialize the model, metric, and tokenizer within the function. Otherwise, you may encounter serialization errors.

import torch
import numpy as np

from datasets import load_metric
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

import ray.train
from ray.train.huggingface.transformers import prepare_trainer, RayTrainReportCallback

num_labels = 3 if task.startswith("mnli") else 1 if task == "stsb" else 2
metric_name = (
    "pearson"
    if task == "stsb"
    else "matthews_correlation"
    if task == "cola"
    else "accuracy"
)
model_name = model_checkpoint.split("/")[-1]
validation_key = (
    "validation_mismatched"
    if task == "mnli-mm"
    else "validation_matched"
    if task == "mnli"
    else "validation"
)
name = f"{model_name}-finetuned-{task}"

# Calculate the maximum steps per epoch based on the number of rows in the training dataset.
# Make sure to scale by the total number of training workers and the per device batch size.
max_steps_per_epoch = ray_datasets["train"].count() // (batch_size * num_workers)


def train_func(config):
    print(f"Is CUDA available: {torch.cuda.is_available()}")

    metric = load_metric("glue", actual_task)
    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_checkpoint, num_labels=num_labels
    )

    train_ds = ray.train.get_dataset_shard("train")
    eval_ds = ray.train.get_dataset_shard("eval")

    train_ds_iterable = train_ds.iter_torch_batches(
        batch_size=batch_size, collate_fn=collate_fn
    )
    eval_ds_iterable = eval_ds.iter_torch_batches(
        batch_size=batch_size, collate_fn=collate_fn
    )

    print("max_steps_per_epoch: ", max_steps_per_epoch)

    args = TrainingArguments(
        name,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        logging_strategy="epoch",
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        learning_rate=config.get("learning_rate", 2e-5),
        num_train_epochs=config.get("epochs", 2),
        weight_decay=config.get("weight_decay", 0.01),
        push_to_hub=False,
        max_steps=max_steps_per_epoch * config.get("epochs", 2),
        disable_tqdm=True,  # declutter the output a little
        no_cuda=not use_gpu,  # you need to explicitly set no_cuda if you want CPUs
        report_to="none",
    )

    def compute_metrics(eval_pred):
        predictions, labels = eval_pred
        if task != "stsb":
            predictions = np.argmax(predictions, axis=1)
        else:
            predictions = predictions[:, 0]
        return metric.compute(predictions=predictions, references=labels)

    trainer = Trainer(
        model,
        args,
        train_dataset=train_ds_iterable,
        eval_dataset=eval_ds_iterable,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
    )

    trainer.add_callback(RayTrainReportCallback())

    trainer = prepare_trainer(trainer)

    print("Starting training")
    trainer.train()

2023-09-06 14:25:28.144428: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-09-06 14:25:28.284936: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-09-06 14:25:29.025734: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-09-06 14:25:29.025801: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-09-06 14:25:29.025807: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
comet_ml is installed but `COMET_API_KEY` is not set.

With your train_func complete, you can now instantiate the TorchTrainer. Aside from calling the function, set the scaling_config, which controls the amount of workers and resources used, and the datasets to use for training and evaluation.

from ray.train.torch import TorchTrainer
from ray.train import RunConfig, ScalingConfig, CheckpointConfig

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
    datasets={
        "train": ray_datasets["train"],
        "eval": ray_datasets["validation"],
    },
    run_config=RunConfig(
        checkpoint_config=CheckpointConfig(
            num_to_keep=1,
            checkpoint_score_attribute="eval_loss",
            checkpoint_score_order="min",
        ),
    ),
)

Finally, call the fit method to start training with Ray Train. Save the Result object to a variable so you can access metrics and checkpoints.

result = trainer.fit()

Tune Status

Current time:	2023-09-06 14:27:12
Running for:	00:01:40.12
Memory:	18.4/186.6 GiB

System Info

Using FIFO scheduling algorithm.
Logical resource usage: 1.0/48 CPUs, 1.0/4 GPUs (0.0/1.0 accelerator_type:None)

Trial Status

Trial name	status	loc	iter	total time (s)	loss	learning_rate	epoch
TorchTrainer_e8bd4_00000	TERMINATED	10.0.27.125:43821	2	76.6259	0.3866	0	1.5

(TrainTrainable pid=43821) 2023-09-06 14:25:35.638885: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
(TrainTrainable pid=43821) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(TrainTrainable pid=43821) 2023-09-06 14:25:35.782950: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
(TrainTrainable pid=43821) 2023-09-06 14:25:36.501583: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
(TrainTrainable pid=43821) 2023-09-06 14:25:36.501653: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
(TrainTrainable pid=43821) 2023-09-06 14:25:36.501660: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
(TrainTrainable pid=43821) comet_ml is installed but `COMET_API_KEY` is not set.
(TorchTrainer pid=43821) Starting distributed worker processes: ['43946 (10.0.27.125)']
(RayTrainWorker pid=43946) Setting up process group for: env:// [rank=0, world_size=1]
(RayTrainWorker pid=43946) 2023-09-06 14:25:42.756510: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
(RayTrainWorker pid=43946) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(RayTrainWorker pid=43946) 2023-09-06 14:25:42.903398: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
(SplitCoordinator pid=44017) Auto configuring locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026']
(RayTrainWorker pid=43946) 2023-09-06 14:25:43.737476: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
(RayTrainWorker pid=43946) 2023-09-06 14:25:43.737544: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
(RayTrainWorker pid=43946) 2023-09-06 14:25:43.737554: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
(RayTrainWorker pid=43946) comet_ml is installed but `COMET_API_KEY` is not set.

(RayTrainWorker pid=43946) Is CUDA available: True

(RayTrainWorker pid=43946) Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight']
(RayTrainWorker pid=43946) - This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
(RayTrainWorker pid=43946) - This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
(RayTrainWorker pid=43946) Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias', 'pre_classifier.bias', 'pre_classifier.weight']
(RayTrainWorker pid=43946) You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
(SplitCoordinator pid=44016) Auto configuring locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026']

(RayTrainWorker pid=43946) max_steps_per_epoch:  534

(RayTrainWorker pid=43946) max_steps is given, it will override any value given in num_train_epochs
(RayTrainWorker pid=43946) /home/ray/anaconda3/lib/python3.9/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
(RayTrainWorker pid=43946)   warnings.warn(

(RayTrainWorker pid=43946) Starting training

(RayTrainWorker pid=43946) ***** Running training *****
(RayTrainWorker pid=43946)   Num examples = 17088
(RayTrainWorker pid=43946)   Num Epochs = 9223372036854775807
(RayTrainWorker pid=43946)   Instantaneous batch size per device = 16
(RayTrainWorker pid=43946)   Total train batch size (w. parallel, distributed & accumulation) = 16
(RayTrainWorker pid=43946)   Gradient Accumulation steps = 1
(RayTrainWorker pid=43946)   Total optimization steps = 1068
(RayTrainWorker pid=43946) /tmp/ipykernel_43503/4088900328.py:23: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:206.)
(SplitCoordinator pid=44016) Executing DAG InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
(SplitCoordinator pid=44016) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(SplitCoordinator pid=44016) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`

(RayTrainWorker pid=43946) [W reducer.cpp:1300] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())

(RayTrainWorker pid=43946) {'loss': 0.5414, 'learning_rate': 9.9812734082397e-06, 'epoch': 0.5}

(RayTrainWorker pid=43946) ***** Running Evaluation *****
(RayTrainWorker pid=43946)   Num examples: Unknown
(RayTrainWorker pid=43946)   Batch size = 16
(SplitCoordinator pid=44017) Executing DAG InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
(SplitCoordinator pid=44017) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(SplitCoordinator pid=44017) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`

(RayTrainWorker pid=43946) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-535
(RayTrainWorker pid=43946) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/config.json

(RayTrainWorker pid=43946) {'eval_loss': 0.5018134117126465, 'eval_matthews_correlation': 0.4145623770066859, 'eval_runtime': 0.6595, 'eval_samples_per_second': 1581.584, 'eval_steps_per_second': 100.081, 'epoch': 0.5}

(RayTrainWorker pid=43946) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/pytorch_model.bin
(RayTrainWorker pid=43946) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/tokenizer_config.json
(RayTrainWorker pid=43946) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/special_tokens_map.json
(RayTrainWorker pid=43946) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/ray_results/TorchTrainer_2023-09-06_14-25-31/TorchTrainer_e8bd4_00000_0_2023-09-06_14-25-32/checkpoint_000000)
(SplitCoordinator pid=44016) Executing DAG InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
(SplitCoordinator pid=44016) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(SplitCoordinator pid=44016) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`

(RayTrainWorker pid=43946) {'loss': 0.3866, 'learning_rate': 0.0, 'epoch': 1.5}

(RayTrainWorker pid=43946) ***** Running Evaluation *****
(RayTrainWorker pid=43946)   Num examples: Unknown
(RayTrainWorker pid=43946)   Batch size = 16
(SplitCoordinator pid=44017) Executing DAG InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
(SplitCoordinator pid=44017) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(SplitCoordinator pid=44017) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`

(RayTrainWorker pid=43946) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-1068
(RayTrainWorker pid=43946) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-1068/config.json

(RayTrainWorker pid=43946) {'eval_loss': 0.5527923107147217, 'eval_matthews_correlation': 0.44860917123689154, 'eval_runtime': 0.6646, 'eval_samples_per_second': 1569.42, 'eval_steps_per_second': 99.311, 'epoch': 1.5}

(RayTrainWorker pid=43946) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-1068/pytorch_model.bin
(RayTrainWorker pid=43946) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1068/tokenizer_config.json
(RayTrainWorker pid=43946) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1068/special_tokens_map.json
(RayTrainWorker pid=43946) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/ray_results/TorchTrainer_2023-09-06_14-25-31/TorchTrainer_e8bd4_00000_0_2023-09-06_14-25-32/checkpoint_000001)
(RayTrainWorker pid=43946) 
(RayTrainWorker pid=43946) 
(RayTrainWorker pid=43946) Training completed. Do not forget to share your model on huggingface.co/models =)
(RayTrainWorker pid=43946) 
(RayTrainWorker pid=43946) 

(RayTrainWorker pid=43946) {'train_runtime': 66.0485, 'train_samples_per_second': 258.719, 'train_steps_per_second': 16.17, 'train_loss': 0.46413421630859375, 'epoch': 1.5}

2023-09-06 14:27:12,180	WARNING experiment_state.py:371 -- Experiment checkpoint syncing has been triggered multiple times in the last 30.0 seconds. A sync will be triggered whenever a trial has checkpointed more than `num_to_keep` times since last sync or if 300 seconds have passed since last sync. If you have set `num_to_keep` in your `CheckpointConfig`, consider increasing the checkpoint frequency or keeping more checkpoints. You can supress this warning by changing the `TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S` environment variable.
2023-09-06 14:27:12,184	INFO tune.py:1141 -- Total run time: 100.17 seconds (85.12 seconds for the tuning loop).

You can use the returned Result object to access metrics and the Ray Train Checkpoint associated with the last iteration.

result

Result(
  metrics={'loss': 0.3866, 'learning_rate': 0.0, 'epoch': 1.5, 'step': 1068, 'eval_loss': 0.5527923107147217, 'eval_matthews_correlation': 0.44860917123689154, 'eval_runtime': 0.6646, 'eval_samples_per_second': 1569.42, 'eval_steps_per_second': 99.311},
  path='/mnt/cluster_storage/ray_results/TorchTrainer_2023-09-06_14-25-31/TorchTrainer_e8bd4_00000_0_2023-09-06_14-25-32',
  filesystem='local',
  checkpoint=Checkpoint(filesystem=local, path=/mnt/cluster_storage/ray_results/TorchTrainer_2023-09-06_14-25-31/TorchTrainer_e8bd4_00000_0_2023-09-06_14-25-32/checkpoint_000001)
)

Tune hyperparameters with Ray Tune#

To tune any hyperparameters of the model, pass your TorchTrainer into a Tuner and define the search space.

You can also take advantage of the advanced search algorithms and schedulers from Ray Tune. This example uses an ASHAScheduler to aggresively terminate underperforming trials.

from ray import tune
from ray.tune import Tuner
from ray.tune.schedulers.async_hyperband import ASHAScheduler

tune_epochs = 4
tuner = Tuner(
    trainer,
    param_space={
        "train_loop_config": {
            "learning_rate": tune.grid_search([2e-5, 2e-4, 2e-3, 2e-2]),
            "epochs": tune_epochs,
        }
    },
    tune_config=tune.TuneConfig(
        metric="eval_loss",
        mode="min",
        num_samples=1,
        scheduler=ASHAScheduler(
            max_t=tune_epochs,
        ),
    ),
    run_config=RunConfig(
        name="tune_transformers",
        checkpoint_config=CheckpointConfig(
            num_to_keep=1,
            checkpoint_score_attribute="eval_loss",
            checkpoint_score_order="min",
        ),
    ),
)

2023-09-06 14:46:47,821	INFO tuner_internal.py:508 -- A `RunConfig` was passed to both the `Tuner` and the `TorchTrainer`. The run config passed to the `Tuner` is the one that will be used.

tune_results = tuner.fit()

Tune Status

Current time:	2023-09-06 14:49:04
Running for:	00:02:16.18
Memory:	19.6/186.6 GiB

System Info

Using AsyncHyperBand: num_stopped=4
Bracket: Iter 4.000: -0.6517604142427444 | Iter 1.000: -0.5936744660139084
Logical resource usage: 1.0/48 CPUs, 1.0/4 GPUs (0.0/1.0 accelerator_type:None)

Trial Status

Trial name	status	loc	train_loop_config/le arning_rate	iter	total time (s)	loss	learning_rate	epoch
TorchTrainer_e1825_00000	TERMINATED	10.0.27.125:57496	2e-05	4	128.443	0.1934	0	3.25
TorchTrainer_e1825_00001	TERMINATED	10.0.27.125:57497	0.0002	1	41.2486	0.616	0.000149906	0.25
TorchTrainer_e1825_00002	TERMINATED	10.0.27.125:57498	0.002	1	41.1336	0.6699	0.00149906	0.25
TorchTrainer_e1825_00003	TERMINATED	10.0.27.125:57499	0.02	4	126.699	0.6073	0	3.25

(TrainTrainable pid=57498) 2023-09-06 14:46:52.049839: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
(TrainTrainable pid=57498) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(TrainTrainable pid=57498) 2023-09-06 14:46:52.195780: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
(TrainTrainable pid=57498) 2023-09-06 14:46:52.944517: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
(TrainTrainable pid=57498) 2023-09-06 14:46:52.944590: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
(TrainTrainable pid=57498) 2023-09-06 14:46:52.944597: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
(TrainTrainable pid=57498) comet_ml is installed but `COMET_API_KEY` is not set.
(TorchTrainer pid=57498) Starting distributed worker processes: ['57731 (10.0.27.125)']
(TrainTrainable pid=57499) 2023-09-06 14:46:52.229406: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA [repeated 3x across cluster]
(TrainTrainable pid=57499) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. [repeated 3x across cluster]
(TrainTrainable pid=57499) 2023-09-06 14:46:52.378805: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`. [repeated 3x across cluster]
(RayTrainWorker pid=57741) Setting up process group for: env:// [rank=0, world_size=1]
(TrainTrainable pid=57499) 2023-09-06 14:46:53.174151: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 [repeated 6x across cluster]
(TrainTrainable pid=57499) 2023-09-06 14:46:53.174160: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. [repeated 3x across cluster]
(TrainTrainable pid=57499) comet_ml is installed but `COMET_API_KEY` is not set. [repeated 3x across cluster]
(SplitCoordinator pid=57927) Auto configuring locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026']

(RayTrainWorker pid=57741) Is CUDA available: True
(RayTrainWorker pid=57741) max_steps_per_epoch:  534

(RayTrainWorker pid=57741) Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_projector.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias']
(RayTrainWorker pid=57741) - This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
(RayTrainWorker pid=57741) - This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
(RayTrainWorker pid=57741) Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias']
(RayTrainWorker pid=57741) You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
(TorchTrainer pid=57499) Starting distributed worker processes: ['57746 (10.0.27.125)'] [repeated 3x across cluster]
(RayTrainWorker pid=57740) 2023-09-06 14:47:00.036649: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA [repeated 4x across cluster]
(RayTrainWorker pid=57740) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. [repeated 4x across cluster]
(RayTrainWorker pid=57740) 2023-09-06 14:47:00.198894: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`. [repeated 4x across cluster]
(RayTrainWorker pid=57746) Setting up process group for: env:// [rank=0, world_size=1] [repeated 3x across cluster]
(RayTrainWorker pid=57740) 2023-09-06 14:47:01.085704: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 [repeated 8x across cluster]
(RayTrainWorker pid=57740) 2023-09-06 14:47:01.085711: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. [repeated 4x across cluster]
(RayTrainWorker pid=57740) comet_ml is installed but `COMET_API_KEY` is not set. [repeated 4x across cluster]
(SplitCoordinator pid=57965) Auto configuring locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026'] [repeated 7x across cluster]

(RayTrainWorker pid=57741) Starting training

(RayTrainWorker pid=57741) max_steps is given, it will override any value given in num_train_epochs
(RayTrainWorker pid=57741) /home/ray/anaconda3/lib/python3.9/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
(RayTrainWorker pid=57741)   warnings.warn(
(RayTrainWorker pid=57746) Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_transform.bias']
(RayTrainWorker pid=57746) Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'classifier.bias', 'pre_classifier.bias']
(RayTrainWorker pid=57731) Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias']
(RayTrainWorker pid=57740) Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight']
(RayTrainWorker pid=57740) Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias', 'classifier.weight']
(RayTrainWorker pid=57741) ***** Running training *****
(RayTrainWorker pid=57741)   Num examples = 34176
(RayTrainWorker pid=57741)   Num Epochs = 9223372036854775807
(RayTrainWorker pid=57741)   Instantaneous batch size per device = 16
(RayTrainWorker pid=57741)   Total train batch size (w. parallel, distributed & accumulation) = 16
(RayTrainWorker pid=57741)   Gradient Accumulation steps = 1
(RayTrainWorker pid=57741)   Total optimization steps = 2136
(RayTrainWorker pid=57741) /tmp/ipykernel_43503/4088900328.py:23: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:206.)
(SplitCoordinator pid=57927) Executing DAG InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
(SplitCoordinator pid=57927) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(SplitCoordinator pid=57927) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`

(RayTrainWorker pid=57741) [W reducer.cpp:1300] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())

(RayTrainWorker pid=57741) {'loss': 0.5481, 'learning_rate': 1.4990636704119851e-05, 'epoch': 0.25}
(RayTrainWorker pid=57740) Is CUDA available: True [repeated 3x across cluster]
(RayTrainWorker pid=57740) max_steps_per_epoch:  534 [repeated 3x across cluster]
(RayTrainWorker pid=57740) Starting training [repeated 3x across cluster]

(RayTrainWorker pid=57741) ***** Running Evaluation *****
(RayTrainWorker pid=57741)   Num examples: Unknown
(RayTrainWorker pid=57741)   Batch size = 16
(RayTrainWorker pid=57740) - This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). [repeated 3x across cluster]
(RayTrainWorker pid=57740) - This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). [repeated 3x across cluster]
(RayTrainWorker pid=57731) Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias']
(RayTrainWorker pid=57740) You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. [repeated 3x across cluster]
(RayTrainWorker pid=57740) max_steps is given, it will override any value given in num_train_epochs [repeated 3x across cluster]
(RayTrainWorker pid=57740) /home/ray/anaconda3/lib/python3.9/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning [repeated 3x across cluster]
(RayTrainWorker pid=57740)   warnings.warn( [repeated 3x across cluster]
(RayTrainWorker pid=57740) ***** Running training ***** [repeated 3x across cluster]
(RayTrainWorker pid=57740)   Num examples = 34176 [repeated 3x across cluster]
(RayTrainWorker pid=57740)   Num Epochs = 9223372036854775807 [repeated 3x across cluster]
(RayTrainWorker pid=57740)   Instantaneous batch size per device = 16 [repeated 3x across cluster]
(RayTrainWorker pid=57740)   Total train batch size (w. parallel, distributed & accumulation) = 16 [repeated 3x across cluster]
(RayTrainWorker pid=57740)   Gradient Accumulation steps = 1 [repeated 3x across cluster]
(RayTrainWorker pid=57740)   Total optimization steps = 2136 [repeated 3x across cluster]
(RayTrainWorker pid=57740) /tmp/ipykernel_43503/4088900328.py:23: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:206.) [repeated 3x across cluster]
(SplitCoordinator pid=57965) Executing DAG InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)] [repeated 3x across cluster]
(SplitCoordinator pid=57965) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False) [repeated 3x across cluster]
(SplitCoordinator pid=57965) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True` [repeated 3x across cluster]
(RayTrainWorker pid=57740) [W reducer.cpp:1300] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator()) [repeated 3x across cluster]

(RayTrainWorker pid=57741) {'eval_loss': 0.5202918648719788, 'eval_matthews_correlation': 0.37321205597032797, 'eval_runtime': 0.7255, 'eval_samples_per_second': 1437.704, 'eval_steps_per_second': 90.976, 'epoch': 0.25}

(RayTrainWorker pid=57741) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-535
(RayTrainWorker pid=57741) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/config.json
(RayTrainWorker pid=57741) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/pytorch_model.bin
(RayTrainWorker pid=57741) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/tokenizer_config.json
(RayTrainWorker pid=57741) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/special_tokens_map.json
(RayTrainWorker pid=57741) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/ray/ray_results/tune_transformers/TorchTrainer_e1825_00000_0_learning_rate=0.0000_2023-09-06_14-46-48/checkpoint_000000)

(RayTrainWorker pid=57746) {'loss': 0.6064, 'learning_rate': 0.009981273408239701, 'epoch': 1.25} [repeated 4x across cluster]
(RayTrainWorker pid=57740) {'eval_loss': 0.6181353330612183, 'eval_matthews_correlation': 0.0, 'eval_runtime': 0.7543, 'eval_samples_per_second': 1382.828, 'eval_steps_per_second': 87.504, 'epoch': 0.25} [repeated 3x across cluster]

(RayTrainWorker pid=57746) ***** Running Evaluation ***** [repeated 4x across cluster]
(RayTrainWorker pid=57746)   Num examples: Unknown [repeated 4x across cluster]
(RayTrainWorker pid=57746)   Batch size = 16 [repeated 4x across cluster]
(SplitCoordinator pid=57954) Executing DAG InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)] [repeated 6x across cluster]
(SplitCoordinator pid=57954) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False) [repeated 6x across cluster]
(SplitCoordinator pid=57954) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True` [repeated 6x across cluster]
(RayTrainWorker pid=57740) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-535 [repeated 3x across cluster]
(RayTrainWorker pid=57740) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/config.json [repeated 3x across cluster]
(RayTrainWorker pid=57740) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/pytorch_model.bin [repeated 3x across cluster]
(RayTrainWorker pid=57740) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/tokenizer_config.json [repeated 3x across cluster]
(RayTrainWorker pid=57740) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/special_tokens_map.json [repeated 3x across cluster]
(RayTrainWorker pid=57740) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/ray/ray_results/tune_transformers/TorchTrainer_e1825_00001_1_learning_rate=0.0002_2023-09-06_14-46-48/checkpoint_000000) [repeated 3x across cluster]

(RayTrainWorker pid=57746) {'loss': 0.6061, 'learning_rate': 0.004971910112359551, 'epoch': 2.25} [repeated 2x across cluster]
(RayTrainWorker pid=57741) {'eval_loss': 0.5246258974075317, 'eval_matthews_correlation': 0.489934557943789, 'eval_runtime': 0.6462, 'eval_samples_per_second': 1614.032, 'eval_steps_per_second': 102.134, 'epoch': 1.25} [repeated 2x across cluster]

(RayTrainWorker pid=57746) ***** Running Evaluation ***** [repeated 2x across cluster]
(RayTrainWorker pid=57746)   Num examples: Unknown [repeated 2x across cluster]
(RayTrainWorker pid=57746)   Batch size = 16 [repeated 2x across cluster]
(SplitCoordinator pid=57927) Executing DAG InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)] [repeated 4x across cluster]
(SplitCoordinator pid=57927) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False) [repeated 4x across cluster]
(SplitCoordinator pid=57927) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True` [repeated 4x across cluster]
(RayTrainWorker pid=57741) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-1070 [repeated 2x across cluster]
(RayTrainWorker pid=57741) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/config.json [repeated 2x across cluster]
(RayTrainWorker pid=57741) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/pytorch_model.bin [repeated 2x across cluster]
(RayTrainWorker pid=57741) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/tokenizer_config.json [repeated 2x across cluster]
(RayTrainWorker pid=57741) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/special_tokens_map.json [repeated 2x across cluster]
(RayTrainWorker pid=57741) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/ray/ray_results/tune_transformers/TorchTrainer_e1825_00000_0_learning_rate=0.0000_2023-09-06_14-46-48/checkpoint_000001) [repeated 2x across cluster]

(RayTrainWorker pid=57746) {'loss': 0.6073, 'learning_rate': 0.0, 'epoch': 3.25} [repeated 2x across cluster]
(RayTrainWorker pid=57741) {'eval_loss': 0.6450843811035156, 'eval_matthews_correlation': 0.5259674254268325, 'eval_runtime': 0.6474, 'eval_samples_per_second': 1611.106, 'eval_steps_per_second': 101.949, 'epoch': 2.25} [repeated 2x across cluster]

(RayTrainWorker pid=57746) ***** Running Evaluation ***** [repeated 2x across cluster]
(RayTrainWorker pid=57746)   Num examples: Unknown [repeated 2x across cluster]
(RayTrainWorker pid=57746)   Batch size = 16 [repeated 2x across cluster]
(SplitCoordinator pid=57927) Executing DAG InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)] [repeated 4x across cluster]
(SplitCoordinator pid=57927) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False) [repeated 4x across cluster]
(SplitCoordinator pid=57927) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True` [repeated 4x across cluster]
(RayTrainWorker pid=57741) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-1605 [repeated 2x across cluster]
(RayTrainWorker pid=57741) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-1605/config.json [repeated 2x across cluster]
(RayTrainWorker pid=57741) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-1605/pytorch_model.bin [repeated 2x across cluster]
(RayTrainWorker pid=57741) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1605/tokenizer_config.json [repeated 2x across cluster]
(RayTrainWorker pid=57741) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1605/special_tokens_map.json [repeated 2x across cluster]
(RayTrainWorker pid=57741) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/ray/ray_results/tune_transformers/TorchTrainer_e1825_00000_0_learning_rate=0.0000_2023-09-06_14-46-48/checkpoint_000002) [repeated 2x across cluster]

(RayTrainWorker pid=57746) 
(RayTrainWorker pid=57746) 
(RayTrainWorker pid=57746) Training completed. Do not forget to share your model on huggingface.co/models =)
(RayTrainWorker pid=57746) 
(RayTrainWorker pid=57746) 

(RayTrainWorker pid=57746) {'train_runtime': 115.5377, 'train_samples_per_second': 295.8, 'train_steps_per_second': 18.487, 'train_loss': 0.6787891173630618, 'epoch': 3.25}

2023-09-06 14:49:04,574	INFO tune.py:1141 -- Total run time: 136.19 seconds (136.17 seconds for the tuning loop).
(RayTrainWorker pid=57741) 
(RayTrainWorker pid=57741) 
(RayTrainWorker pid=57741) Training completed. Do not forget to share your model on huggingface.co/models =)
(RayTrainWorker pid=57741) 
(RayTrainWorker pid=57741) 

(RayTrainWorker pid=57741) {'train_runtime': 117.6791, 'train_samples_per_second': 290.417, 'train_steps_per_second': 18.151, 'train_loss': 0.3468295286657212, 'epoch': 3.25}

View the results of the tuning run as a dataframe, and find the best result.

tune_results.get_dataframe().sort_values("eval_loss")

	loss	learning_rate	epoch	step	eval_loss	eval_matthews_correlation	eval_runtime	eval_samples_per_second	eval_steps_per_second	timestamp	...	time_total_s	pid	hostname	node_ip	time_since_restore	iterations_since_restore	checkpoint_dir_name	config/train_loop_config/learning_rate	config/train_loop_config/epochs	logdir
1	0.6160	0.000150	0.25	535	0.618135	0.000000	0.7543	1382.828	87.504	1694036857	...	41.248600	57497	ip-10-0-27-125	10.0.27.125	41.248600	1	checkpoint_000000	0.00020	4	e1825_00001
2	0.6699	0.001499	0.25	535	0.619657	0.000000	0.7449	1400.202	88.603	1694036856	...	41.133609	57498	ip-10-0-27-125	10.0.27.125	41.133609	1	checkpoint_000000	0.00200	4	e1825_00002
3	0.6073	0.000000	3.25	2136	0.619694	0.000000	0.6329	1648.039	104.286	1694036942	...	126.699238	57499	ip-10-0-27-125	10.0.27.125	126.699238	4	checkpoint_000003	0.02000	4	e1825_00003
0	0.1934	0.000000	3.25	2136	0.747960	0.520756	0.6530	1597.187	101.068	1694036944	...	128.443495	57496	ip-10-0-27-125	10.0.27.125	128.443495	4	checkpoint_000003	0.00002	4	e1825_00000

4 rows × 26 columns

best_result = tune_results.get_best_result()

Fine-tune a Hugging Face Transformers Model#

Set up Ray#

Fine-tune a model on a text classification task#

Loading the dataset#

Preprocessing the data with Ray Data#

Fine-tuning the model with Ray Train#

Tune Status

System Info

Trial Status

Tune hyperparameters with Ray Tune#

Tune Status

System Info

Trial Status

Share the model#

See also#