Fine-tune a πŸ€— Transformers model

This notebook is based on an official πŸ€— notebook - β€œHow to fine-tune a model on text classification”. Its main aim is to show how to convert a vanilla πŸ€— Transformers workflow to Ray AIR without changing the training logic unless necessary.

In this notebook, we will:

  1. Set up Ray

  2. Load the dataset

  3. Preprocess the dataset

  4. Run the training with Ray AIR

  5. Predict on test data with Ray AIR

  6. Optionally, share the model with the community

Uncomment and run the following line to install all the necessary dependencies (this notebook has been tested with transformers==4.19.1):

#! pip install "datasets" "transformers>=4.19.0" "torch>=1.10.0" "mlflow" "ray[air]>=1.13"

Set up Ray

We will use ray.init() to initialize a local cluster. By default, this cluster will consist of only the machine you are running this notebook on. You can also run this notebook on an Anyscale cluster.

This notebook will not run in Ray Client mode.

from pprint import pprint
import ray

ray.init()
RayContext(dashboard_url='', python_version='3.7.13', ray_version='2.0.0.dev0', ray_commit='e2ee2140f97ca08b70fd0f7561038b7f8d958d63', address_info={'node_ip_address': '172.28.0.2', 'raylet_ip_address': '172.28.0.2', 'redis_address': None, 'object_store_address': '/tmp/ray/session_2022-05-12_18-30-10_467499_75/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2022-05-12_18-30-10_467499_75/sockets/raylet', 'webui_url': '', 'session_dir': '/tmp/ray/session_2022-05-12_18-30-10_467499_75', 'metrics_export_port': 64840, 'gcs_address': '172.28.0.2:58661', 'address': '172.28.0.2:58661', 'node_id': '65d091b8f504ccd72024fd0b1a8445a8f9ea43e86bcbf67868c22ba7'})

We can check the resources our cluster is composed of. If you are running this notebook on your local machine or Google Colab, you should see the number of CPU cores and GPUs available on that machine.

pprint(ray.cluster_resources())
{'CPU': 2.0,
 'GPU': 1.0,
 'accelerator_type:T4': 1.0,
 'memory': 7855477556.0,
 'node:172.28.0.2': 1.0,
 'object_store_memory': 3927738777.0}

In this notebook, we will see how to fine-tune one of the πŸ€— Transformers models on a text classification task from the GLUE Benchmark. We will be running the training using Ray AIR.

You can change the two variables below to control whether the training (which we will get to later) uses CPUs or GPUs, and how many workers to spawn. Each worker will claim one CPU or GPU. Make sure not to request more resources than are available in the cluster!

By default, we will run the training with one GPU worker.

use_gpu = True  # set this to False to run on CPUs
num_workers = 1  # set this to number of GPUs/CPUs you want to use

Fine-tuning a model on a text classification task

The GLUE Benchmark is a group of nine classification tasks on sentences or pairs of sentences. If you would like to learn more, refer to the original notebook.

Each task is named by its acronym, with mnli-mm standing for the mismatched version of MNLI (so same training set as mnli but different validation and test sets):

GLUE_TASKS = ["cola", "mnli", "mnli-mm", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb", "wnli"]

This notebook is built to run on any of the tasks in the list above, with any model checkpoint from the Model Hub, as long as that model has a version with a classification head. Depending on your model and the GPU you are using, you might need to adjust the batch size to avoid out-of-memory errors. Set those three parameters, and the rest of the notebook should run smoothly:

task = "cola"
model_checkpoint = "distilbert-base-uncased"
batch_size = 16

Loading the dataset

We will use the πŸ€— Datasets library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions load_dataset and load_metric.

Apart from mnli-mm, which is a special code, we can directly pass our task name to those functions.

As Ray AIR doesn’t provide integrations for πŸ€— Datasets yet, we will simply run the normal πŸ€— Datasets code to load the dataset from the Hub.

from datasets import load_dataset

actual_task = "mnli" if task == "mnli-mm" else task
datasets = load_dataset("glue", actual_task)
Downloading and preparing dataset glue/cola (download: 368.14 KiB, generated: 596.73 KiB, post-processed: Unknown size, total: 964.86 KiB) to /root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...
Dataset glue downloaded and prepared to /root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.

The dataset object itself is a DatasetDict, which contains one key each for the training, validation, and test sets (with additional keys for the mismatched validation and test sets in the special case of mnli).
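
If you want to take a quick look at the splits yourself, a minimal check such as the following works (purely illustrative; the exact split names and sizes depend on the task):

# Inspect the available splits and their sizes (split names vary by task).
print(datasets)
print({split: ds.num_rows for split, ds in datasets.items()})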

We will also need the metric. To avoid serialization errors, we will load the metric inside the training workers later. For now, we just define the function that loads it.

from datasets import load_metric

def load_metric_fn():
    return load_metric('glue', actual_task)

The metric is an instance of datasets.Metric.
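
As a quick, purely illustrative look at the interface (the dummy predictions and references below are made up, and during training the metric is loaded inside the workers as described above), you can call compute() directly:

# Illustration only: compute the metric on dummy predictions.
# In the actual training, the metric is loaded inside each worker.
metric = load_metric_fn()
print(metric.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0]))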

Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a πŸ€— Transformers Tokenizer, which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary), put them in the format the model expects, and generate the other inputs the model requires.

To do all of this, we instantiate our tokenizer with the AutoTokenizer.from_pretrained method, which will ensure:

  • we get a tokenizer that corresponds to the model architecture we want to use,

  • we download the vocabulary used when pretraining this specific checkpoint.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

We pass along use_fast=True to the call above to use one of the fast tokenizers (backed by Rust) from the πŸ€— Tokenizers library. Those fast tokenizers are available for almost all models, but if you got an error with the previous call, remove that argument.
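
As a quick sanity check (the input sentences are arbitrary, and the exact keys in the output depend on the model), you can call the tokenizer on a pair of sentences:

# Sanity check: the tokenizer returns input IDs (and, for DistilBERT, an attention mask).
print(tokenizer("Hello, this is one sentence!", "And this sentence goes with it."))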

To preprocess our dataset, we will thus need the names of the columns containing the sentence(s). The following dictionary keeps track of the correspondence between tasks and column names:

task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mnli-mm": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

We can then write the function that will preprocess our samples. We just feed them to the tokenizer with the argument truncation=True. This ensures that any input longer than what the selected model can handle is truncated to the maximum length the model accepts.

def preprocess_function(examples, *, tokenizer):
    sentence1_key, sentence2_key = task_to_keys[task]
    if sentence2_key is None:
        return tokenizer(examples[sentence1_key], truncation=True)
    return tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)

To apply this function to all the sentences (or pairs of sentences) in our dataset, we just use the map method of the dataset object we created earlier. This will apply the function to all the elements of all the splits in the dataset, so our training, validation, and testing data will be preprocessed in a single command.

encoded_datasets = datasets.map(preprocess_function, batched=True, fn_kwargs=dict(tokenizer=tokenizer))
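
To verify the result, you can look at the column names of one of the splits; the tokenized features are added next to the original columns (a minimal check, output not shown):

# The tokenizer outputs (e.g. input_ids, attention_mask) appear as new columns.
print(encoded_datasets["train"].column_names)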

For Ray AIR, instead of using πŸ€— Dataset objects directly, we will convert them to Ray Datasets. Both are backed by Arrow tables, so the conversion is straightforward. We will use the built-in ray.data.from_huggingface function.

import ray.data

ray_datasets = ray.data.from_huggingface(encoded_datasets)
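
The result is a dictionary with one Ray Dataset per split, keyed by the same split names as the πŸ€— DatasetDict (as the indexing further below assumes). A minimal check, output not shown:

# One Ray Dataset per split, keyed like the original DatasetDict.
print(ray_datasets.keys())
print(ray_datasets["train"].count())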

Fine-tuning the model with Ray AIR

Now that our data is ready, we can download the pretrained model and fine-tune it.

Since all our tasks are about sentence classification, we use the AutoModelForSequenceClassification class.

We will not go into details about each specific component of the training (see the original notebook for that). The tokenizer is the same one we used to encode the dataset before.

The main difference when using Ray AIR is that we need to create our πŸ€— Transformers Trainer inside a function (trainer_init_per_worker) and return it. That function will be passed to the HuggingFaceTrainer and run on every Ray worker. The training will then proceed by means of PyTorch DDP.

Make sure that you initialize the model, metric and tokenizer inside that function. Otherwise, you may run into serialization errors.

Furthermore, push_to_hub=True is not yet supported. Ray will, however, checkpoint the model at every epoch, allowing you to push it to the Hub manually. We will do that after the training.

If you wish to use third-party logging libraries, such as MLflow or Weights & Biases, do not set them in TrainingArguments (they will be automatically disabled) - instead, pass Ray AIR callbacks to the HuggingFaceTrainer’s run_config. In this example, we will use MLflow.

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
import numpy as np
import torch

num_labels = 3 if task.startswith("mnli") else 1 if task=="stsb" else 2
metric_name = "pearson" if task == "stsb" else "matthews_correlation" if task == "cola" else "accuracy"
model_name = model_checkpoint.split("/")[-1]
validation_key = "validation_mismatched" if task == "mnli-mm" else "validation_matched" if task == "mnli" else "validation"
name = f"{model_name}-finetuned-{task}"

def trainer_init_per_worker(train_dataset, eval_dataset = None, **config):
    print(f"Is CUDA available: {torch.cuda.is_available()}")
    metric = load_metric_fn()
    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
    model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)
    args = TrainingArguments(
        name,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        logging_strategy="epoch",
        learning_rate=2e-5,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        num_train_epochs=5,
        weight_decay=0.01,
        push_to_hub=False,
        disable_tqdm=True,  # declutter the output a little
        no_cuda=not use_gpu,  # you need to explicitly set no_cuda if you want CPUs
    )

    def compute_metrics(eval_pred):
        predictions, labels = eval_pred
        if task != "stsb":
            predictions = np.argmax(predictions, axis=1)
        else:
            predictions = predictions[:, 0]
        return metric.compute(predictions=predictions, references=labels)

    trainer = Trainer(
        model,
        args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics
    )

    print("Starting training")
    return trainer

With our trainer_init_per_worker complete, we can now instantiate the HuggingFaceTrainer. Aside from the function, we set the scaling_config, which controls the number of workers and resources used, and the datasets we will use for training and evaluation.

We specify the MlflowLoggerCallback inside the run_config.

from ray.train.huggingface import HuggingFaceTrainer
from ray.air.config import RunConfig, ScalingConfig
from ray.air.callbacks.mlflow import MLflowLoggerCallback

trainer = HuggingFaceTrainer(
    trainer_init_per_worker=trainer_init_per_worker,
    scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
    datasets={"train": ray_datasets["train"], "evaluation": ray_datasets[validation_key]},
    run_config=RunConfig(callbacks=[MLflowLoggerCallback(experiment_name=name)])
)

Finally, we call the fit method to begin training with Ray AIR. We will save the Result object to a variable so we can access metrics and checkpoints.

result = trainer.fit()
== Status ==
Current time: 2022-05-12 18:35:14 (running for 00:03:48.08)
Memory usage on this node: 5.7/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.32 GiB heap, 0.0/3.66 GiB objects (0.0/1.0 accelerator_type:T4)
Result logdir: /root/ray_results/HuggingFaceTrainer_2022-05-12_18-31-26
Number of trials: 1/1 (1 TERMINATED)
Trial name                      status      loc             iter  total time (s)  loss    learning_rate  epoch
HuggingFaceTrainer_bb9dd_00000  TERMINATED  172.28.0.2:419  5     222.391         0.1575  1.30841e-06    5


(RayTrainWorker pid=455) 2022-05-12 18:31:33,158	INFO torch.py:347 -- Setting up process group for: env:// [rank=0, world_size=1]
(RayTrainWorker pid=455) Is CUDA available: True
Downloading builder script: 5.76kB [00:00, 6.35MB/s]                   
Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 256M/256M [00:04<00:00, 65.0MB/s]
(RayTrainWorker pid=455) Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias']
(RayTrainWorker pid=455) - This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
(RayTrainWorker pid=455) - This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
(RayTrainWorker pid=455) Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'pre_classifier.weight', 'classifier.weight']
(RayTrainWorker pid=455) You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
(RayTrainWorker pid=455) /usr/local/lib/python3.7/dist-packages/transformers/optimization.py:309: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
(RayTrainWorker pid=455)   FutureWarning,
(RayTrainWorker pid=455) Starting training
(RayTrainWorker pid=455) ***** Running training *****
(RayTrainWorker pid=455)   Num examples = 8551
(RayTrainWorker pid=455)   Num Epochs = 5
(RayTrainWorker pid=455)   Instantaneous batch size per device = 16
(RayTrainWorker pid=455)   Total train batch size (w. parallel, distributed & accumulation) = 16
(RayTrainWorker pid=455)   Gradient Accumulation steps = 1
(RayTrainWorker pid=455)   Total optimization steps = 2675
(RayTrainWorker pid=455) The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence. If idx, sentence are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
(RayTrainWorker pid=455) [W reducer.cpp:1289] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
(RayTrainWorker pid=455) {'loss': 0.5441, 'learning_rate': 1.6261682242990654e-05, 'epoch': 0.93}
(RayTrainWorker pid=455) ***** Running Evaluation *****
(RayTrainWorker pid=455)   Num examples = 1043
(RayTrainWorker pid=455)   Batch size = 16
(RayTrainWorker pid=455) The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence. If idx, sentence are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
(RayTrainWorker pid=455) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-535
(RayTrainWorker pid=455) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/config.json
(RayTrainWorker pid=455) {'eval_loss': 0.4999416470527649, 'eval_matthews_correlation': 0.3991733676966143, 'eval_runtime': 1.0378, 'eval_samples_per_second': 1004.976, 'eval_steps_per_second': 63.594, 'epoch': 1.0}
(RayTrainWorker pid=455) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/pytorch_model.bin
(RayTrainWorker pid=455) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/tokenizer_config.json
(RayTrainWorker pid=455) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/special_tokens_map.json
Trial HuggingFaceTrainer_bb9dd_00000 reported loss=0.5441,learning_rate=1.6261682242990654e-05,epoch=1.0,step=535,eval_loss=0.4999416470527649,eval_matthews_correlation=0.3991733676966143,eval_runtime=1.0378,eval_samples_per_second=1004.976,eval_steps_per_second=63.594,_timestamp=1652380362,_time_this_iter_s=66.77899646759033,_training_iteration=1,should_checkpoint=True with parameters={}.
(RayTrainWorker pid=455) {'loss': 0.3886, 'learning_rate': 1.2523364485981309e-05, 'epoch': 1.87}
(RayTrainWorker pid=455) ***** Running Evaluation *****
(RayTrainWorker pid=455)   Num examples = 1043
(RayTrainWorker pid=455)   Batch size = 16
(RayTrainWorker pid=455) The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence. If idx, sentence are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
(RayTrainWorker pid=455) {'eval_loss': 0.5397436618804932, 'eval_matthews_correlation': 0.5085739436587455, 'eval_runtime': 1.0792, 'eval_samples_per_second': 966.488, 'eval_steps_per_second': 61.158, 'epoch': 2.0}
(RayTrainWorker pid=455) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-1070
(RayTrainWorker pid=455) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/config.json
(RayTrainWorker pid=455) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/pytorch_model.bin
(RayTrainWorker pid=455) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/tokenizer_config.json
(RayTrainWorker pid=455) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/special_tokens_map.json
Trial HuggingFaceTrainer_bb9dd_00000 reported loss=0.3886,learning_rate=1.2523364485981309e-05,epoch=2.0,step=1070,eval_loss=0.5397436618804932,eval_matthews_correlation=0.5085739436587455,eval_runtime=1.0792,eval_samples_per_second=966.488,eval_steps_per_second=61.158,_timestamp=1652380400,_time_this_iter_s=37.84357762336731,_training_iteration=2,should_checkpoint=True with parameters={}.
(RayTrainWorker pid=455) {'loss': 0.2746, 'learning_rate': 8.785046728971963e-06, 'epoch': 2.8}
(RayTrainWorker pid=455) ***** Running Evaluation *****
(RayTrainWorker pid=455)   Num examples = 1043
(RayTrainWorker pid=455)   Batch size = 16
(RayTrainWorker pid=455) The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence. If idx, sentence are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
(RayTrainWorker pid=455) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-1605
(RayTrainWorker pid=455) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-1605/config.json
(RayTrainWorker pid=455) {'eval_loss': 0.6648283004760742, 'eval_matthews_correlation': 0.5141951979542654, 'eval_runtime': 1.1148, 'eval_samples_per_second': 935.563, 'eval_steps_per_second': 59.202, 'epoch': 3.0}
(RayTrainWorker pid=455) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-1605/pytorch_model.bin
(RayTrainWorker pid=455) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1605/tokenizer_config.json
(RayTrainWorker pid=455) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1605/special_tokens_map.json
Trial HuggingFaceTrainer_bb9dd_00000 reported loss=0.2746,learning_rate=8.785046728971963e-06,epoch=3.0,step=1605,eval_loss=0.6648283004760742,eval_matthews_correlation=0.5141951979542654,eval_runtime=1.1148,eval_samples_per_second=935.563,eval_steps_per_second=59.202,_timestamp=1652380437,_time_this_iter_s=36.976723432540894,_training_iteration=3,should_checkpoint=True with parameters={}.
(RayTrainWorker pid=455) {'loss': 0.196, 'learning_rate': 5.046728971962617e-06, 'epoch': 3.74}
(RayTrainWorker pid=455) ***** Running Evaluation *****
(RayTrainWorker pid=455)   Num examples = 1043
(RayTrainWorker pid=455)   Batch size = 16
(RayTrainWorker pid=455) The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence. If idx, sentence are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
(RayTrainWorker pid=455) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-2140
(RayTrainWorker pid=455) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-2140/config.json
(RayTrainWorker pid=455) {'eval_loss': 0.7566447854042053, 'eval_matthews_correlation': 0.5518326707011334, 'eval_runtime': 1.1113, 'eval_samples_per_second': 938.535, 'eval_steps_per_second': 59.39, 'epoch': 4.0}
(RayTrainWorker pid=455) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-2140/pytorch_model.bin
(RayTrainWorker pid=455) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-2140/tokenizer_config.json
(RayTrainWorker pid=455) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-2140/special_tokens_map.json
Trial HuggingFaceTrainer_bb9dd_00000 reported loss=0.196,learning_rate=5.046728971962617e-06,epoch=4.0,step=2140,eval_loss=0.7566447854042053,eval_matthews_correlation=0.5518326707011334,eval_runtime=1.1113,eval_samples_per_second=938.535,eval_steps_per_second=59.39,_timestamp=1652380474,_time_this_iter_s=36.68935775756836,_training_iteration=4,should_checkpoint=True with parameters={}.
(RayTrainWorker pid=455) {'loss': 0.1575, 'learning_rate': 1.308411214953271e-06, 'epoch': 4.67}
(RayTrainWorker pid=455) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-2675
(RayTrainWorker pid=455) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-2675/config.json
(RayTrainWorker pid=455) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-2675/pytorch_model.bin
(RayTrainWorker pid=455) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-2675/tokenizer_config.json
(RayTrainWorker pid=455) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-2675/special_tokens_map.json
(RayTrainWorker pid=455) ***** Running Evaluation *****
(RayTrainWorker pid=455)   Num examples = 1043
(RayTrainWorker pid=455)   Batch size = 16
(RayTrainWorker pid=455) The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence. If idx, sentence are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
(RayTrainWorker pid=455) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-2675
(RayTrainWorker pid=455) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-2675/config.json
(RayTrainWorker pid=455) {'eval_loss': 0.8616615533828735, 'eval_matthews_correlation': 0.5420036503219092, 'eval_runtime': 1.2577, 'eval_samples_per_second': 829.302, 'eval_steps_per_second': 52.477, 'epoch': 5.0}
(RayTrainWorker pid=455) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-2675/pytorch_model.bin
(RayTrainWorker pid=455) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-2675/tokenizer_config.json
(RayTrainWorker pid=455) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-2675/special_tokens_map.json
(RayTrainWorker pid=455) 
(RayTrainWorker pid=455) 
(RayTrainWorker pid=455) Training completed. Do not forget to share your model on huggingface.co/models =)
(RayTrainWorker pid=455) 
(RayTrainWorker pid=455) 
(RayTrainWorker pid=455) {'train_runtime': 187.8585, 'train_samples_per_second': 227.592, 'train_steps_per_second': 14.239, 'train_loss': 0.30010223103460865, 'epoch': 5.0}
Trial HuggingFaceTrainer_bb9dd_00000 reported loss=0.1575,learning_rate=1.308411214953271e-06,epoch=5.0,step=2675,eval_loss=0.8616615533828735,eval_matthews_correlation=0.5420036503219092,eval_runtime=1.2577,eval_samples_per_second=829.302,eval_steps_per_second=52.477,train_runtime=187.8585,train_samples_per_second=227.592,train_steps_per_second=14.239,train_loss=0.30010223103460865,_timestamp=1652380513,_time_this_iter_s=39.63672137260437,_training_iteration=5,should_checkpoint=True with parameters={}.
Trial HuggingFaceTrainer_bb9dd_00000 completed. Last result: loss=0.1575,learning_rate=1.308411214953271e-06,epoch=5.0,step=2675,eval_loss=0.8616615533828735,eval_matthews_correlation=0.5420036503219092,eval_runtime=1.2577,eval_samples_per_second=829.302,eval_steps_per_second=52.477,train_runtime=187.8585,train_samples_per_second=227.592,train_steps_per_second=14.239,train_loss=0.30010223103460865,_timestamp=1652380513,_time_this_iter_s=39.63672137260437,_training_iteration=5,should_checkpoint=True
2022-05-12 18:35:14,803	INFO tune.py:753 -- Total run time: 228.34 seconds (228.07 seconds for the tuning loop).

You can use the returned Result object to access metrics and the Ray AIR Checkpoint associated with the last iteration.

result
Result(metrics={'loss': 0.1575, 'learning_rate': 1.308411214953271e-06, 'epoch': 5.0, 'step': 2675, 'eval_loss': 0.8616615533828735, 'eval_matthews_correlation': 0.5420036503219092, 'eval_runtime': 1.2577, 'eval_samples_per_second': 829.302, 'eval_steps_per_second': 52.477, 'train_runtime': 187.8585, 'train_samples_per_second': 227.592, 'train_steps_per_second': 14.239, 'train_loss': 0.30010223103460865, '_timestamp': 1652380513, '_time_this_iter_s': 39.63672137260437, '_training_iteration': 5, 'time_this_iter_s': 39.64510202407837, 'should_checkpoint': True, 'done': True, 'timesteps_total': None, 'episodes_total': None, 'training_iteration': 5, 'trial_id': 'bb9dd_00000', 'experiment_id': 'db0c5ea784a44980819bf5e1bfb72c04', 'date': '2022-05-12_18-35-13', 'timestamp': 1652380513, 'time_total_s': 222.39091277122498, 'pid': 419, 'hostname': 'e618da00601e', 'node_ip': '172.28.0.2', 'config': {}, 'time_since_restore': 222.39091277122498, 'timesteps_since_restore': 0, 'iterations_since_restore': 5, 'warmup_time': 0.004034996032714844, 'experiment_tag': '0'}, checkpoint=<ray.air.checkpoint.Checkpoint object at 0x7f9ffd9d9c90>, error=None)
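
For example, you can pull individual metrics out of result.metrics and keep a handle on the checkpoint for the prediction step below:

# Access the final evaluation metric and the checkpoint from the Result object.
print(result.metrics["eval_matthews_correlation"])
print(result.checkpoint)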

Predict on test data with Ray AIR

You can now use the checkpoint to run prediction with HuggingFacePredictor, which wraps around πŸ€— Pipelines. In order to distribute prediction, we use BatchPredictor. While this is not necessary for the very small example we are using (you could use HuggingFacePredictor directly; see the sketch after the prediction output below), it will scale well to a large dataset.

from ray.train.huggingface import HuggingFacePredictor
from ray.train.batch_predictor import BatchPredictor
import pandas as pd

sentences = ['Bill whistled past the house.',
  'The car honked its way down the road.',
  'Bill pushed Harry off the sofa.',
  'the kittens yawned awake and played.',
  'I demand that the more John eats, the more he pay.']
predictor = BatchPredictor.from_checkpoint(
    checkpoint=result.checkpoint,
    predictor_cls=HuggingFacePredictor,
    task="text-classification",
)
data = ray.data.from_pandas(pd.DataFrame(sentences, columns=["sentence"]))
prediction = predictor.predict(data)
prediction.show()
Map Progress (2 actors 1 pending):   0%|          | 0/1 [00:12<?, ?it/s](BlockWorker pid=735) 2022-05-12 18:36:08.491769: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Map Progress (2 actors 1 pending): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:16<00:00, 16.63s/it]
   label      score
0  LABEL_1  0.998539
1  LABEL_1  0.997706
2  LABEL_1  0.998476
3  LABEL_1  0.998498
4  LABEL_0  0.533578
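
As mentioned above, for a handful of sentences you could also skip BatchPredictor and call HuggingFacePredictor directly. A minimal sketch, assuming from_checkpoint forwards pipeline keyword arguments such as task:

# Single-process alternative for small inputs (no distributed prediction involved).
local_predictor = HuggingFacePredictor.from_checkpoint(
    result.checkpoint, task="text-classification"
)
print(local_predictor.predict(pd.DataFrame(sentences, columns=["sentence"])))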

Share the model

To be able to share your model with the community, there are a few more steps to follow.

We have conducted the training on the Ray cluster, but we will share the model from the local environment - this will allow us to easily authenticate.

First, you have to store your authentication token from the Hugging Face website (sign up here if you haven’t already!). Then execute the following cell and input your username and password:

from huggingface_hub import notebook_login

notebook_login()

Then you need to install Git-LFS. Uncomment the following instructions:

# !apt install git-lfs

Now, load the trained model locally from the checkpoint returned by Ray AIR:

from ray.train.huggingface import HuggingFaceCheckpoint

# Convert the generic AIR checkpoint into a πŸ€—-specific checkpoint and retrieve the fine-tuned model.
checkpoint = HuggingFaceCheckpoint.from_checkpoint(result.checkpoint)
hf_model = checkpoint.get_model(model=AutoModelForSequenceClassification)

You can now upload the result of the training to the Hub; just execute this instruction:

hf_model.push_to_hub()

You can now share this model with all your friends, family, and favorite pets: they can all load it with the identifier "your-username/the-name-you-picked", for instance:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("sgugger/my-awesome-model")