Fine-tune a 🤗 Transformers model
This notebook is based on an official 🤗 notebook - "How to fine-tune a model on text classification". The main aim of this notebook is to show the process of conversion from vanilla 🤗 to Ray AIR 🤗 without changing the training logic unless necessary.
In this notebook, we will:
- Set up Ray
- Load the dataset
- Preprocess the dataset with Ray AIR
- Run the training with Ray AIR
- Tune hyperparameters with Ray AIR
- Predict on test data with Ray AIR
Uncomment and run the following line in order to install all the necessary dependencies (this notebook is being tested with transformers==4.19.1):
#! pip install "datasets" "transformers>=4.19.0" "torch>=1.10.0" "mlflow" "ray[air]>=1.13"
Set up Ray
We will use ray.init() to initialize a local cluster. By default, this cluster will consist of only the machine you are running this notebook on. You can also run this notebook on an Anyscale cluster.
from pprint import pprint
import ray
ray.init()
2022-08-25 10:09:51,282 INFO worker.py:1223 -- Using address localhost:9031 set in the environment variable RAY_ADDRESS
2022-08-25 10:09:51,697 INFO worker.py:1333 -- Connecting to existing Ray cluster at address: 172.31.80.117:9031...
2022-08-25 10:09:51,706 INFO worker.py:1509 -- Connected to Ray cluster. View the dashboard at https://session-i8ddtfaxhwypbvnyb9uzg7xs.i.anyscaleuserdata-staging.com/auth/?token=agh0_CkcwRQIhAJXwvxwq31GryaWthvXGCXZebsijbuqi7qL2pCa5uROOAiBGjzsyXAJFHLlaEI9zSlNI8ewtghKg5UV3t8NmlxuMcRJmEiCtvjcKE0VPiU7iQx51P9oPQjfpo5g1RJXccVSS5005cBgCIgNuL2E6DAj9xazjBhDwj4veAUIMCP3ClJgGEPCPi94B-gEeChxzZXNfaThERFRmQVhId1lwYlZueWI5dVpnN3hT&redirect_to=dashboard
2022-08-25 10:09:51,709 INFO packaging.py:342 -- Pushing file package 'gcs://_ray_pkg_3332f64b0a461fddc20be71129115d0a.zip' (0.34MiB) to Ray cluster...
2022-08-25 10:09:51,714 INFO packaging.py:351 -- Successfully pushed file package 'gcs://_ray_pkg_3332f64b0a461fddc20be71129115d0a.zip'.
We can check the resources our cluster is composed of. If you are running this notebook on your local machine or Google Colab, you should see the number of CPU cores and GPUs available on that machine.
pprint(ray.cluster_resources())
{'CPU': 208.0,
'GPU': 16.0,
'accelerator_type:T4': 4.0,
'memory': 616693614180.0,
'node:172.31.76.237': 1.0,
'node:172.31.80.117': 1.0,
'node:172.31.85.193': 1.0,
'node:172.31.85.32': 1.0,
'node:172.31.90.137': 1.0,
'object_store_memory': 259318055729.0}
In this notebook, we will see how to fine-tune one of the 🤗 Transformers models on a text classification task from the GLUE Benchmark. We will be running the training using Ray AIR.
You can change these two variables to control whether the training (which we will get to later) uses CPUs or GPUs, and how many workers should be spawned. Each worker will claim one CPU or GPU. Make sure not to request more resources than are available in your cluster!
By default, we will run the training with one GPU worker.
use_gpu = True # set this to False to run on CPUs
num_workers = 1 # set this to number of GPUs/CPUs you want to use
Fine-tuning a model on a text classification task
The GLUE Benchmark is a group of nine classification tasks on sentences or pairs of sentences. If you would like to learn more, refer to the original notebook.
Each task is named by its acronym, with mnli-mm standing for the mismatched version of MNLI (same training set as mnli, but different validation and test sets):
GLUE_TASKS = ["cola", "mnli", "mnli-mm", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb", "wnli"]
This notebook is built to run on any of the tasks in the list above, with any model checkpoint from the Model Hub as long as that model has a version with a classification head. Depending on your model and the GPU you are using, you might need to adjust the batch size to avoid out-of-memory errors. Set those three parameters, then the rest of the notebook should run smoothly:
task = "cola"
model_checkpoint = "distilbert-base-uncased"
batch_size = 16
Loading the dataset
We will use the 🤗 Datasets library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the load_dataset and load_metric functions.
Apart from mnli-mm being a special code, we can directly pass our task name to those functions.
As Ray AIR doesn't provide integrations for 🤗 Datasets yet, we will simply run the normal 🤗 Datasets code to load the dataset from the Hub.
from datasets import load_dataset
actual_task = "mnli" if task == "mnli-mm" else task
datasets = load_dataset("glue", actual_task)
The datasets object itself is a DatasetDict, which contains one key each for the training, validation, and test sets (with more keys for the mismatched validation and test sets in the special case of mnli).
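To get a quick feel for the data, you can index directly into the splits. This is just an optional sanity check and not required for the rest of the notebook; the second line simply counts the rows in each split.
# Peek at the first training example and the size of each split.
print(datasets["train"][0])
print({split: len(dataset) for split, dataset in datasets.items()})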
We will also need the metric. In order to avoid serialization errors, we will load the metric inside the training workers later. Therefore, now we will just define the function we will use.
from datasets import load_metric
def load_metric_fn():
return load_metric('glue', actual_task)
The metric is an instance of datasets.Metric.
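Although the metric will only be instantiated inside the training workers, you can call the loader function locally to see what the metric returns. Here is a quick sketch on random fake predictions (the numbers are meaningless, this just shows the output format):
import numpy as np
fake_preds = np.random.randint(0, 2, size=(64,))
fake_labels = np.random.randint(0, 2, size=(64,))
# For CoLA this returns a dict with the Matthews correlation coefficient.
load_metric_fn().compute(predictions=fake_preds, references=fake_labels)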
Preprocessing the data with Ray AIR
Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers Tokenizer, which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put them in a format the model expects, as well as generate the other inputs that the model requires.
To do all of this, we instantiate our tokenizer with the AutoTokenizer.from_pretrained method, which will ensure that:
we get a tokenizer that corresponds to the model architecture we want to use,
we download the vocabulary used when pretraining this specific checkpoint.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
We pass along use_fast=True to the call above to use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library. Those fast tokenizers are available for almost all models, but if you got an error with the previous call, remove that argument.
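You can call the tokenizer directly on one sentence or a pair of sentences to see what the model will actually receive (the text here is arbitrary, just for illustration):
# The output contains input_ids and attention_mask (and token_type_ids for some models).
tokenizer("Hello, this is one sentence!", "And this sentence goes with it.")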
To preprocess our dataset, we will thus need the names of the columns containing the sentence(s). The following dictionary keeps track of the correspondence between tasks and column names:
task_to_keys = {
"cola": ("sentence", None),
"mnli": ("premise", "hypothesis"),
"mnli-mm": ("premise", "hypothesis"),
"mrpc": ("sentence1", "sentence2"),
"qnli": ("question", "sentence"),
"qqp": ("question1", "question2"),
"rte": ("sentence1", "sentence2"),
"sst2": ("sentence", None),
"stsb": ("sentence1", "sentence2"),
"wnli": ("sentence1", "sentence2"),
}
For Ray AIR, instead of using 🤗 Dataset objects directly, we will convert them to Ray Datasets. Both are backed by Arrow tables, so the conversion is straightforward. We will use the built-in ray.data.from_huggingface function.
import ray.data
ray_datasets = ray.data.from_huggingface(datasets)
ray_datasets
{'train': Dataset(num_blocks=1, num_rows=8551, schema={sentence: string, label: int64, idx: int32}),
'validation': Dataset(num_blocks=1, num_rows=1043, schema={sentence: string, label: int64, idx: int32}),
'test': Dataset(num_blocks=1, num_rows=1063, schema={sentence: string, label: int64, idx: int32})}
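If you want to confirm that the conversion preserved the rows, a Ray Dataset can be inspected with take(). This is purely an optional sanity check:
# Fetch the first row of the converted training dataset.
ray_datasets["train"].take(1)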
We can then write the function that will preprocess our samples. We just feed them to the tokenizer with the argument truncation=True. This will ensure that an input longer than what the selected model can handle will be truncated to the maximum length accepted by the model.
We use a BatchMapper to create a Ray AIR preprocessor that will map the function over the dataset in a distributed fashion. It will run during training and prediction.
import pandas as pd
from ray.data.preprocessors import BatchMapper
def preprocess_function(examples: pd.DataFrame):
    # if we only have one column, we are inferring.
    # no need to tokenize in that case.
    if len(examples.columns) == 1:
        return examples
    examples = examples.to_dict("list")
    sentence1_key, sentence2_key = task_to_keys[task]
    if sentence2_key is None:
        ret = tokenizer(examples[sentence1_key], truncation=True)
    else:
        ret = tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)
    # Add back the original columns
    ret = {**examples, **ret}
    return pd.DataFrame.from_dict(ret)

batch_encoder = BatchMapper(preprocess_function, batch_format="pandas")
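Because preprocess_function operates on plain pandas DataFrames, you can sanity-check it locally on a couple of rows before handing it to the BatchMapper (a minimal sketch, not needed for training):
# Build a tiny pandas batch from the first two training examples and preprocess it.
sample_batch = pd.DataFrame(datasets["train"][:2])
preprocess_function(sample_batch).columns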
Fine-tuning the model with Ray AIR
Now that our data is ready, we can download the pretrained model and fine-tune it.
Since all our tasks are about sentence classification, we use the AutoModelForSequenceClassification class.
We will not go into details about each specific component of the training (see the original notebook for that). The tokenizer is the same one we used to encode the dataset before.
The main difference when using Ray AIR is that we need to create our 🤗 Transformers Trainer inside a function (trainer_init_per_worker) and return it. That function will be passed to the HuggingFaceTrainer and will run on every Ray worker. The training will then proceed by means of PyTorch DDP.
Make sure that you initialize the model, metric, and tokenizer inside that function. Otherwise, you may run into serialization errors.
Furthermore, push_to_hub=True is not yet supported. Ray will, however, checkpoint the model at every epoch, allowing you to push it to the Hub manually. We will do that after the training.
If you wish to use third-party logging libraries, such as MLflow or Weights & Biases, do not set them in TrainingArguments (they will be automatically disabled) - instead, you should pass Ray AIR callbacks to HuggingFaceTrainer's run_config. In this example, we will use MLflow.
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
import numpy as np
import torch
num_labels = 3 if task.startswith("mnli") else 1 if task=="stsb" else 2
metric_name = "pearson" if task == "stsb" else "matthews_correlation" if task == "cola" else "accuracy"
model_name = model_checkpoint.split("/")[-1]
validation_key = "validation_mismatched" if task == "mnli-mm" else "validation_matched" if task == "mnli" else "validation"
name = f"{model_name}-finetuned-{task}"
def trainer_init_per_worker(train_dataset, eval_dataset=None, **config):
    print(f"Is CUDA available: {torch.cuda.is_available()}")
    metric = load_metric_fn()
    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
    model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)
    args = TrainingArguments(
        name,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        logging_strategy="epoch",
        learning_rate=config.get("learning_rate", 2e-5),
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        num_train_epochs=config.get("epochs", 2),
        weight_decay=config.get("weight_decay", 0.01),
        push_to_hub=False,
        disable_tqdm=True,  # declutter the output a little
        no_cuda=not use_gpu,  # you need to explicitly set no_cuda if you want CPUs
    )

    def compute_metrics(eval_pred):
        predictions, labels = eval_pred
        if task != "stsb":
            predictions = np.argmax(predictions, axis=1)
        else:
            predictions = predictions[:, 0]
        return metric.compute(predictions=predictions, references=labels)

    trainer = Trainer(
        model,
        args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
    )

    print("Starting training")
    return trainer
With our trainer_init_per_worker complete, we can now instantiate the HuggingFaceTrainer. Aside from the function, we set the scaling_config, controlling the number of workers and resources used, and the datasets we will use for training and evaluation.
We specify the MLflowLoggerCallback inside the run_config, and pass the preprocessor we defined earlier as an argument. The preprocessor will be included with the returned Checkpoint, meaning it will also be applied during inference.
from ray.train.huggingface import HuggingFaceTrainer
from ray.air.config import RunConfig, ScalingConfig, CheckpointConfig
from ray.air.integrations.mlflow import MLflowLoggerCallback
trainer = HuggingFaceTrainer(
trainer_init_per_worker=trainer_init_per_worker,
scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
datasets={
"train": ray_datasets["train"],
"evaluation": ray_datasets[validation_key],
},
run_config=RunConfig(
callbacks=[MLflowLoggerCallback(experiment_name=name)],
checkpoint_config=CheckpointConfig(
num_to_keep=1,
checkpoint_score_attribute="eval_loss",
checkpoint_score_order="min",
),
),
preprocessor=batch_encoder,
)
Finally, we call the fit method to start training with Ray AIR. We will save the Result object to a variable so we can access metrics and checkpoints.
result = trainer.fit()
Current time: 2022-08-25 10:14:09 (running for 00:04:06.45)
Memory usage on this node: 4.3/62.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/208 CPUs, 0/16 GPUs, 0.0/574.34 GiB heap, 0.0/241.51 GiB objects (0.0/4.0 accelerator_type:T4)
Result logdir: /home/ray/ray_results/HuggingFaceTrainer_2022-08-25_10-10-02
Number of trials: 1/1 (1 TERMINATED)
Trial name | status | loc | iter | total time (s) | loss | learning_rate | epoch |
---|---|---|---|---|---|---|---|
HuggingFaceTrainer_c1ff5_00000 | TERMINATED | 172.31.90.137:947 | 2 | 200.217 | 0.3886 | 0 | 2 |
(RayTrainWorker pid=1114, ip=172.31.90.137) 2022-08-25 10:10:44,617 INFO config.py:71 -- Setting up process group for: env:// [rank=0, world_size=4]
(RayTrainWorker pid=1114, ip=172.31.90.137) Is CUDA available: True
(RayTrainWorker pid=1116, ip=172.31.90.137) Is CUDA available: True
(RayTrainWorker pid=1117, ip=172.31.90.137) Is CUDA available: True
(RayTrainWorker pid=1115, ip=172.31.90.137) Is CUDA available: True
Downloading builder script: 5.76kB [00:00, 6.45MB/s]
Downloading tokenizer_config.json: 100%|██████████| 28.0/28.0 [00:00<00:00, 30.5kB/s]
Downloading config.json: 100%|██████████| 483/483 [00:00<00:00, 817kB/s]
Downloading vocab.txt: 100%|██████████| 226k/226k [00:00<00:00, 773kB/s]
Downloading tokenizer.json: 100%|██████████| 455k/455k [00:00<00:00, 815kB/s]
Downloading pytorch_model.bin: 100%|██████████| 256M/256M [00:02<00:00, 91.6MB/s]
(RayTrainWorker pid=1117, ip=172.31.90.137) Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias']
(RayTrainWorker pid=1117, ip=172.31.90.137) - This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
(RayTrainWorker pid=1117, ip=172.31.90.137) - This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
(RayTrainWorker pid=1117, ip=172.31.90.137) Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.bias', 'classifier.weight', 'pre_classifier.weight']
(RayTrainWorker pid=1117, ip=172.31.90.137) You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
(RayTrainWorker pid=1114, ip=172.31.90.137) Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_transform.weight']
(RayTrainWorker pid=1114, ip=172.31.90.137) - This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
(RayTrainWorker pid=1114, ip=172.31.90.137) - This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
(RayTrainWorker pid=1114, ip=172.31.90.137) Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classifier.weight', 'classifier.bias']
(RayTrainWorker pid=1114, ip=172.31.90.137) You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
(RayTrainWorker pid=1116, ip=172.31.90.137) Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_projector.weight']
(RayTrainWorker pid=1116, ip=172.31.90.137) - This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
(RayTrainWorker pid=1116, ip=172.31.90.137) - This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
(RayTrainWorker pid=1116, ip=172.31.90.137) Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight']
(RayTrainWorker pid=1116, ip=172.31.90.137) You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
(RayTrainWorker pid=1115, ip=172.31.90.137) Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight']
(RayTrainWorker pid=1115, ip=172.31.90.137) - This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
(RayTrainWorker pid=1115, ip=172.31.90.137) - This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
(RayTrainWorker pid=1115, ip=172.31.90.137) Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'classifier.bias', 'pre_classifier.bias']
(RayTrainWorker pid=1115, ip=172.31.90.137) You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
(RayTrainWorker pid=1114, ip=172.31.90.137) Starting training
(RayTrainWorker pid=1116, ip=172.31.90.137) Starting training
(RayTrainWorker pid=1117, ip=172.31.90.137) Starting training
(RayTrainWorker pid=1115, ip=172.31.90.137) Starting training
(RayTrainWorker pid=1114, ip=172.31.90.137) ***** Running training *****
(RayTrainWorker pid=1114, ip=172.31.90.137) Num examples = 8551
(RayTrainWorker pid=1114, ip=172.31.90.137) Num Epochs = 2
(RayTrainWorker pid=1114, ip=172.31.90.137) Instantaneous batch size per device = 16
(RayTrainWorker pid=1114, ip=172.31.90.137) Total train batch size (w. parallel, distributed & accumulation) = 64
(RayTrainWorker pid=1114, ip=172.31.90.137) Gradient Accumulation steps = 1
(RayTrainWorker pid=1114, ip=172.31.90.137) Total optimization steps = 1070
(RayTrainWorker pid=1114, ip=172.31.90.137) The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: sentence, idx. If sentence, idx are not expected by `DistilBertForSequenceClassification.forward`, you can safely ignore this message.
(RayTrainWorker pid=1114, ip=172.31.90.137) {'loss': 0.5437, 'learning_rate': 1e-05, 'epoch': 1.0}
(RayTrainWorker pid=1114, ip=172.31.90.137) ***** Running Evaluation *****
(RayTrainWorker pid=1114, ip=172.31.90.137) Num examples = 1043
(RayTrainWorker pid=1114, ip=172.31.90.137) Batch size = 16
(RayTrainWorker pid=1114, ip=172.31.90.137) The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: sentence, idx. If sentence, idx are not expected by `DistilBertForSequenceClassification.forward`, you can safely ignore this message.
(RayTrainWorker pid=1114, ip=172.31.90.137) {'eval_loss': 0.5794203281402588, 'eval_matthews_correlation': 0.3293676852500821, 'eval_runtime': 0.9804, 'eval_samples_per_second': 277.441, 'eval_steps_per_second': 5.1, 'epoch': 1.0}
(RayTrainWorker pid=1114, ip=172.31.90.137) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-535
(RayTrainWorker pid=1114, ip=172.31.90.137) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/config.json
(RayTrainWorker pid=1114, ip=172.31.90.137) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/pytorch_model.bin
(RayTrainWorker pid=1114, ip=172.31.90.137) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/tokenizer_config.json
(RayTrainWorker pid=1114, ip=172.31.90.137) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/special_tokens_map.json
Result for HuggingFaceTrainer_c1ff5_00000:
_time_this_iter_s: 90.87123560905457
_timestamp: 1661447540
_training_iteration: 1
date: 2022-08-25_10-12-20
done: false
epoch: 1.0
eval_loss: 0.5794203281402588
eval_matthews_correlation: 0.3293676852500821
eval_runtime: 0.9804
eval_samples_per_second: 277.441
eval_steps_per_second: 5.1
experiment_id: 592e02b25b254bd1a3743904313dc85b
hostname: ip-172-31-90-137
iterations_since_restore: 1
learning_rate: 1.0e-05
loss: 0.5437
node_ip: 172.31.90.137
pid: 947
should_checkpoint: true
step: 535
time_since_restore: 103.24057936668396
time_this_iter_s: 103.24057936668396
time_total_s: 103.24057936668396
timestamp: 1661447540
timesteps_since_restore: 0
training_iteration: 1
trial_id: c1ff5_00000
warmup_time: 0.003858327865600586
(RayTrainWorker pid=1114, ip=172.31.90.137) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-1070
(RayTrainWorker pid=1114, ip=172.31.90.137) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/config.json
(RayTrainWorker pid=1114, ip=172.31.90.137) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/pytorch_model.bin
(RayTrainWorker pid=1114, ip=172.31.90.137) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/tokenizer_config.json
(RayTrainWorker pid=1114, ip=172.31.90.137) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/special_tokens_map.json
(RayTrainWorker pid=1114, ip=172.31.90.137) {'loss': 0.3886, 'learning_rate': 0.0, 'epoch': 2.0}
(RayTrainWorker pid=1114, ip=172.31.90.137) ***** Running Evaluation *****
(RayTrainWorker pid=1114, ip=172.31.90.137) Num examples = 1043
(RayTrainWorker pid=1114, ip=172.31.90.137) Batch size = 16
(RayTrainWorker pid=1114, ip=172.31.90.137) The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: sentence, idx. If sentence, idx are not expected by `DistilBertForSequenceClassification.forward`, you can safely ignore this message.
(RayTrainWorker pid=1114, ip=172.31.90.137) {'eval_loss': 0.6215357184410095, 'eval_matthews_correlation': 0.42957017514952434, 'eval_runtime': 0.9956, 'eval_samples_per_second': 273.204, 'eval_steps_per_second': 5.022, 'epoch': 2.0}
(RayTrainWorker pid=1114, ip=172.31.90.137) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-1070
(RayTrainWorker pid=1114, ip=172.31.90.137) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/config.json
(RayTrainWorker pid=1114, ip=172.31.90.137) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/pytorch_model.bin
(RayTrainWorker pid=1114, ip=172.31.90.137) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/tokenizer_config.json
(RayTrainWorker pid=1114, ip=172.31.90.137) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/special_tokens_map.json
(RayTrainWorker pid=1114, ip=172.31.90.137) {'train_runtime': 174.4696, 'train_samples_per_second': 98.023, 'train_steps_per_second': 6.133, 'train_loss': 0.4661755713346963, 'epoch': 2.0}
(RayTrainWorker pid=1114, ip=172.31.90.137)
(RayTrainWorker pid=1114, ip=172.31.90.137)
(RayTrainWorker pid=1114, ip=172.31.90.137) Training completed. Do not forget to share your model on huggingface.co/models =)
(RayTrainWorker pid=1114, ip=172.31.90.137)
(RayTrainWorker pid=1114, ip=172.31.90.137)
Result for HuggingFaceTrainer_c1ff5_00000:
_time_this_iter_s: 96.96447467803955
_timestamp: 1661447637
_training_iteration: 2
date: 2022-08-25_10-13-57
done: false
epoch: 2.0
eval_loss: 0.6215357184410095
eval_matthews_correlation: 0.42957017514952434
eval_runtime: 0.9956
eval_samples_per_second: 273.204
eval_steps_per_second: 5.022
experiment_id: 592e02b25b254bd1a3743904313dc85b
hostname: ip-172-31-90-137
iterations_since_restore: 2
learning_rate: 0.0
loss: 0.3886
node_ip: 172.31.90.137
pid: 947
should_checkpoint: true
step: 1070
time_since_restore: 200.21722102165222
time_this_iter_s: 96.97664165496826
time_total_s: 200.21722102165222
timestamp: 1661447637
timesteps_since_restore: 0
train_loss: 0.4661755713346963
train_runtime: 174.4696
train_samples_per_second: 98.023
train_steps_per_second: 6.133
training_iteration: 2
trial_id: c1ff5_00000
warmup_time: 0.003858327865600586
Result for HuggingFaceTrainer_c1ff5_00000:
_time_this_iter_s: 96.96447467803955
_timestamp: 1661447637
_training_iteration: 2
date: 2022-08-25_10-13-57
done: true
epoch: 2.0
eval_loss: 0.6215357184410095
eval_matthews_correlation: 0.42957017514952434
eval_runtime: 0.9956
eval_samples_per_second: 273.204
eval_steps_per_second: 5.022
experiment_id: 592e02b25b254bd1a3743904313dc85b
experiment_tag: '0'
hostname: ip-172-31-90-137
iterations_since_restore: 2
learning_rate: 0.0
loss: 0.3886
node_ip: 172.31.90.137
pid: 947
should_checkpoint: true
step: 1070
time_since_restore: 200.21722102165222
time_this_iter_s: 96.97664165496826
time_total_s: 200.21722102165222
timestamp: 1661447637
timesteps_since_restore: 0
train_loss: 0.4661755713346963
train_runtime: 174.4696
train_samples_per_second: 98.023
train_steps_per_second: 6.133
training_iteration: 2
trial_id: c1ff5_00000
warmup_time: 0.003858327865600586
2022-08-25 10:14:09,300 INFO tune.py:758 -- Total run time: 246.67 seconds (246.44 seconds for the tuning loop).
You can use the returned Result object to access metrics and the Ray AIR Checkpoint associated with the last iteration.
result
Result(metrics={'loss': 0.3886, 'learning_rate': 0.0, 'epoch': 2.0, 'step': 1070, 'eval_loss': 0.6215357184410095, 'eval_matthews_correlation': 0.42957017514952434, 'eval_runtime': 0.9956, 'eval_samples_per_second': 273.204, 'eval_steps_per_second': 5.022, 'train_runtime': 174.4696, 'train_samples_per_second': 98.023, 'train_steps_per_second': 6.133, 'train_loss': 0.4661755713346963, '_timestamp': 1661447637, '_time_this_iter_s': 96.96447467803955, '_training_iteration': 2, 'should_checkpoint': True, 'done': True, 'trial_id': 'c1ff5_00000', 'experiment_tag': '0'}, error=None, log_dir=PosixPath('/home/ray/ray_results/HuggingFaceTrainer_2022-08-25_10-10-02/HuggingFaceTrainer_c1ff5_00000_0_2022-08-25_10-10-04'))
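For example, the final evaluation metrics are available under result.metrics, and the checkpoint (with the preprocessor bundled in) under result.checkpoint. A short sketch of the accessors:
# Matthews correlation from the last evaluation and the checkpoint object.
print(result.metrics["eval_matthews_correlation"])
print(result.checkpoint)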
Tune hyperparameters with Ray AIR
If we would like to tune any hyperparameters of the model, we can do so by simply passing our HuggingFaceTrainer into a Tuner and defining the search space.
We can also take advantage of the advanced search algorithms and schedulers provided by Ray Tune. In this example, we will use an ASHAScheduler to aggressively terminate underperforming trials.
from ray import tune
from ray.tune import Tuner
from ray.tune.schedulers.async_hyperband import ASHAScheduler
tune_epochs = 4
tuner = Tuner(
trainer,
param_space={
"trainer_init_config": {
"learning_rate": tune.grid_search([2e-5, 2e-4, 2e-3, 2e-2]),
"epochs": tune_epochs,
}
},
tune_config=tune.TuneConfig(
metric="eval_loss",
mode="min",
num_samples=1,
scheduler=ASHAScheduler(
max_t=tune_epochs,
)
),
run_config=RunConfig(
checkpoint_config=CheckpointConfig(num_to_keep=1, checkpoint_score_attribute="eval_loss", checkpoint_score_order="min")
),
)
tune_results = tuner.fit()
Current time: 2022-08-25 10:20:13 (running for 00:06:01.75)
Memory usage on this node: 4.4/62.0 GiB
Using AsyncHyperBand: num_stopped=4 Bracket: Iter 4.000: -0.8064090609550476 | Iter 1.000: -0.6378736793994904
Resources requested: 0/208 CPUs, 0/16 GPUs, 0.0/574.34 GiB heap, 0.0/241.51 GiB objects (0.0/4.0 accelerator_type:T4)
Current best trial: 5654d_00001 with eval_loss=0.6492420434951782 and parameters={'trainer_init_config': {'learning_rate': 0.0002, 'epochs': 4}}
Result logdir: /home/ray/ray_results/HuggingFaceTrainer_2022-08-25_10-14-11
Number of trials: 4/4 (4 TERMINATED)
Trial name | status | loc | trainer_init_conf... | iter | total time (s) | loss | learning_rate | epoch |
---|---|---|---|---|---|---|---|---|
HuggingFaceTrainer_5654d_00000 | TERMINATED | 172.31.90.137:1729 | 2e-05 | 4 | 347.171 | 0.1958 | 0 | 4 |
HuggingFaceTrainer_5654d_00001 | TERMINATED | 172.31.76.237:1805 | 0.0002 | 1 | 95.2492 | 0.6225 | 0.00015 | 1 |
HuggingFaceTrainer_5654d_00002 | TERMINATED | 172.31.85.32:1322 | 0.002 | 1 | 93.7613 | 0.6463 | 0.0015 | 1 |
HuggingFaceTrainer_5654d_00003 | TERMINATED | 172.31.85.193:1060 | 0.02 | 1 | 99.3677 | 0.926 | 0.015 | 1 |
(RayTrainWorker pid=1789, ip=172.31.90.137) 2022-08-25 10:14:23,379 INFO config.py:71 -- Setting up process group for: env:// [rank=0, world_size=4]
(RayTrainWorker pid=1792, ip=172.31.90.137) Is CUDA available: True
(RayTrainWorker pid=1790, ip=172.31.90.137) Is CUDA available: True
(RayTrainWorker pid=1791, ip=172.31.90.137) Is CUDA available: True
(RayTrainWorker pid=1789, ip=172.31.90.137) Is CUDA available: True
(RayTrainWorker pid=1974, ip=172.31.76.237) 2022-08-25 10:14:29,354 INFO config.py:71 -- Setting up process group for: env:// [rank=0, world_size=4]
(RayTrainWorker pid=1977, ip=172.31.76.237) Is CUDA available: True
(RayTrainWorker pid=1976, ip=172.31.76.237) Is CUDA available: True
(RayTrainWorker pid=1975, ip=172.31.76.237) Is CUDA available: True
(RayTrainWorker pid=1974, ip=172.31.76.237) Is CUDA available: True
(RayTrainWorker pid=1483, ip=172.31.85.32) 2022-08-25 10:14:35,313 INFO config.py:71 -- Setting up process group for: env:// [rank=0, world_size=4]
(RayTrainWorker pid=1790, ip=172.31.90.137) Starting training
(RayTrainWorker pid=1792, ip=172.31.90.137) Starting training
(RayTrainWorker pid=1791, ip=172.31.90.137) Starting training
(RayTrainWorker pid=1789, ip=172.31.90.137) Starting training
(RayTrainWorker pid=1789, ip=172.31.90.137) ***** Running training *****
(RayTrainWorker pid=1789, ip=172.31.90.137) Num examples = 8551
(RayTrainWorker pid=1789, ip=172.31.90.137) Num Epochs = 4
(RayTrainWorker pid=1789, ip=172.31.90.137) Instantaneous batch size per device = 16
(RayTrainWorker pid=1789, ip=172.31.90.137) Total train batch size (w. parallel, distributed & accumulation) = 64
(RayTrainWorker pid=1789, ip=172.31.90.137) Gradient Accumulation steps = 1
(RayTrainWorker pid=1789, ip=172.31.90.137) Total optimization steps = 2140
(RayTrainWorker pid=1483, ip=172.31.85.32) Is CUDA available: True
(RayTrainWorker pid=1485, ip=172.31.85.32) Is CUDA available: True
(RayTrainWorker pid=1486, ip=172.31.85.32) Is CUDA available: True
(RayTrainWorker pid=1484, ip=172.31.85.32) Is CUDA available: True
(RayTrainWorker pid=1977, ip=172.31.76.237) Starting training
(RayTrainWorker pid=1976, ip=172.31.76.237) Starting training
(RayTrainWorker pid=1975, ip=172.31.76.237) Starting training
(RayTrainWorker pid=1974, ip=172.31.76.237) Starting training
(RayTrainWorker pid=1974, ip=172.31.76.237) ***** Running training *****
(RayTrainWorker pid=1974, ip=172.31.76.237) Num examples = 8551
(RayTrainWorker pid=1974, ip=172.31.76.237) Num Epochs = 4
(RayTrainWorker pid=1974, ip=172.31.76.237) Instantaneous batch size per device = 16
(RayTrainWorker pid=1974, ip=172.31.76.237) Total train batch size (w. parallel, distributed & accumulation) = 64
(RayTrainWorker pid=1974, ip=172.31.76.237) Gradient Accumulation steps = 1
(RayTrainWorker pid=1974, ip=172.31.76.237) Total optimization steps = 2140
(RayTrainWorker pid=1483, ip=172.31.85.32) Starting training
(RayTrainWorker pid=1485, ip=172.31.85.32) Starting training
(RayTrainWorker pid=1486, ip=172.31.85.32) Starting training
(RayTrainWorker pid=1484, ip=172.31.85.32) Starting training
(RayTrainWorker pid=1483, ip=172.31.85.32) ***** Running training *****
(RayTrainWorker pid=1483, ip=172.31.85.32) Num examples = 8551
(RayTrainWorker pid=1483, ip=172.31.85.32) Num Epochs = 4
(RayTrainWorker pid=1483, ip=172.31.85.32) Instantaneous batch size per device = 16
(RayTrainWorker pid=1483, ip=172.31.85.32) Total train batch size (w. parallel, distributed & accumulation) = 64
(RayTrainWorker pid=1483, ip=172.31.85.32) Gradient Accumulation steps = 1
(RayTrainWorker pid=1483, ip=172.31.85.32) Total optimization steps = 2140
(RayTrainWorker pid=1223, ip=172.31.85.193) 2022-08-25 10:14:48,193 INFO config.py:71 -- Setting up process group for: env:// [rank=0, world_size=4]
(RayTrainWorker pid=1223, ip=172.31.85.193) Is CUDA available: True
(RayTrainWorker pid=1224, ip=172.31.85.193) Is CUDA available: True
(RayTrainWorker pid=1226, ip=172.31.85.193) Is CUDA available: True
(RayTrainWorker pid=1225, ip=172.31.85.193) Is CUDA available: True
Downloading builder script: 5.76kB [00:00, 6.59MB/s]
Downloading tokenizer_config.json: 100%|██████████| 28.0/28.0 [00:00<00:00, 46.0kB/s]
Downloading config.json: 100%|██████████| 483/483 [00:00<00:00, 766kB/s]
Downloading vocab.txt: 100%|██████████| 226k/226k [00:00<00:00, 966kB/s]
Downloading tokenizer.json: 100%|██████████| 455k/455k [00:00<00:00, 1.44MB/s]
Downloading pytorch_model.bin: 100%|██████████| 256M/256M [00:02<00:00, 93.2MB/s]
(RayTrainWorker pid=1223, ip=172.31.85.193) Starting training
(RayTrainWorker pid=1226, ip=172.31.85.193) Starting training
(RayTrainWorker pid=1225, ip=172.31.85.193) Starting training
(RayTrainWorker pid=1224, ip=172.31.85.193) Starting training
(RayTrainWorker pid=1223, ip=172.31.85.193) ***** Running training *****
(RayTrainWorker pid=1223, ip=172.31.85.193) Num examples = 8551
(RayTrainWorker pid=1223, ip=172.31.85.193) Num Epochs = 4
(RayTrainWorker pid=1223, ip=172.31.85.193) Instantaneous batch size per device = 16
(RayTrainWorker pid=1223, ip=172.31.85.193) Total train batch size (w. parallel, distributed & accumulation) = 64
(RayTrainWorker pid=1223, ip=172.31.85.193) Gradient Accumulation steps = 1
(RayTrainWorker pid=1223, ip=172.31.85.193) Total optimization steps = 2140
(RayTrainWorker pid=1789, ip=172.31.90.137) ***** Running Evaluation *****
(RayTrainWorker pid=1789, ip=172.31.90.137) Num examples = 1043
(RayTrainWorker pid=1789, ip=172.31.90.137) Batch size = 16
(RayTrainWorker pid=1789, ip=172.31.90.137) {'loss': 0.5458, 'learning_rate': 1.5000000000000002e-05, 'epoch': 1.0}
(RayTrainWorker pid=1789, ip=172.31.90.137) The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: sentence, idx. If sentence, idx are not expected by `DistilBertForSequenceClassification.forward`, you can safely ignore this message.
(RayTrainWorker pid=1789, ip=172.31.90.137) {'eval_loss': 0.6037685871124268, 'eval_matthews_correlation': 0.3654892178274207, 'eval_runtime': 0.9847, 'eval_samples_per_second': 276.225, 'eval_steps_per_second': 5.078, 'epoch': 1.0}
(RayTrainWorker pid=1789, ip=172.31.90.137) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-535
(RayTrainWorker pid=1789, ip=172.31.90.137) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/config.json
(RayTrainWorker pid=1789, ip=172.31.90.137) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/pytorch_model.bin
(RayTrainWorker pid=1789, ip=172.31.90.137) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/tokenizer_config.json
(RayTrainWorker pid=1789, ip=172.31.90.137) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/special_tokens_map.json
Result for HuggingFaceTrainer_5654d_00000:
_time_this_iter_s: 85.01727724075317
_timestamp: 1661447753
_training_iteration: 1
date: 2022-08-25_10-15-53
done: false
epoch: 1.0
eval_loss: 0.6037685871124268
eval_matthews_correlation: 0.3654892178274207
eval_runtime: 0.9847
eval_samples_per_second: 276.225
eval_steps_per_second: 5.078
experiment_id: cee1b96afcf344e89482e3c5e298a412
hostname: ip-172-31-90-137
iterations_since_restore: 1
learning_rate: 1.5000000000000002e-05
loss: 0.5458
node_ip: 172.31.90.137
pid: 1729
should_checkpoint: true
step: 535
time_since_restore: 94.93232989311218
time_this_iter_s: 94.93232989311218
time_total_s: 94.93232989311218
timestamp: 1661447753
timesteps_since_restore: 0
training_iteration: 1
trial_id: 5654d_00000
warmup_time: 0.0037021636962890625
(RayTrainWorker pid=1974, ip=172.31.76.237) {'loss': 0.6225, 'learning_rate': 0.00015000000000000001, 'epoch': 1.0}
(RayTrainWorker pid=1974, ip=172.31.76.237) ***** Running Evaluation *****
(RayTrainWorker pid=1974, ip=172.31.76.237) Num examples = 1043
(RayTrainWorker pid=1974, ip=172.31.76.237) Batch size = 16
(RayTrainWorker pid=1974, ip=172.31.76.237) The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence. If idx, sentence are not expected by `DistilBertForSequenceClassification.forward`, you can safely ignore this message.
(RayTrainWorker pid=1974, ip=172.31.76.237) {'eval_loss': 0.6492420434951782, 'eval_matthews_correlation': 0.0, 'eval_runtime': 1.0157, 'eval_samples_per_second': 267.792, 'eval_steps_per_second': 4.923, 'epoch': 1.0}
(RayTrainWorker pid=1974, ip=172.31.76.237) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-535
(RayTrainWorker pid=1974, ip=172.31.76.237) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/config.json
(RayTrainWorker pid=1974, ip=172.31.76.237) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/pytorch_model.bin
(RayTrainWorker pid=1974, ip=172.31.76.237) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/tokenizer_config.json
(RayTrainWorker pid=1974, ip=172.31.76.237) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/special_tokens_map.json
Result for HuggingFaceTrainer_5654d_00001:
_time_this_iter_s: 84.79700112342834
_timestamp: 1661447759
_training_iteration: 1
date: 2022-08-25_10-16-00
done: true
epoch: 1.0
eval_loss: 0.6492420434951782
eval_matthews_correlation: 0.0
eval_runtime: 1.0157
eval_samples_per_second: 267.792
eval_steps_per_second: 4.923
experiment_id: 88145f9344584715a4bd7d018f751b12
hostname: ip-172-31-76-237
iterations_since_restore: 1
learning_rate: 0.00015000000000000001
loss: 0.6225
node_ip: 172.31.76.237
pid: 1805
should_checkpoint: true
step: 535
time_since_restore: 95.24916434288025
time_this_iter_s: 95.24916434288025
time_total_s: 95.24916434288025
timestamp: 1661447760
timesteps_since_restore: 0
training_iteration: 1
trial_id: 5654d_00001
warmup_time: 0.003660917282104492
(RayTrainWorker pid=1483, ip=172.31.85.32) {'loss': 0.6463, 'learning_rate': 0.0015, 'epoch': 1.0}
(RayTrainWorker pid=1483, ip=172.31.85.32) ***** Running Evaluation *****
(RayTrainWorker pid=1483, ip=172.31.85.32) Num examples = 1043
(RayTrainWorker pid=1483, ip=172.31.85.32) Batch size = 16
(RayTrainWorker pid=1483, ip=172.31.85.32) The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: sentence, idx. If sentence, idx are not expected by `DistilBertForSequenceClassification.forward`, you can safely ignore this message.
(RayTrainWorker pid=1483, ip=172.31.85.32) {'eval_loss': 0.6586529612541199, 'eval_matthews_correlation': 0.0, 'eval_runtime': 0.9576, 'eval_samples_per_second': 284.05, 'eval_steps_per_second': 5.222, 'epoch': 1.0}
(RayTrainWorker pid=1483, ip=172.31.85.32) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-535
(RayTrainWorker pid=1483, ip=172.31.85.32) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/config.json
(RayTrainWorker pid=1483, ip=172.31.85.32) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/pytorch_model.bin
(RayTrainWorker pid=1483, ip=172.31.85.32) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/tokenizer_config.json
(RayTrainWorker pid=1483, ip=172.31.85.32) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/special_tokens_map.json
Result for HuggingFaceTrainer_5654d_00002:
_time_this_iter_s: 84.01720070838928
_timestamp: 1661447764
_training_iteration: 1
date: 2022-08-25_10-16-04
done: true
epoch: 1.0
eval_loss: 0.6586529612541199
eval_matthews_correlation: 0.0
eval_runtime: 0.9576
eval_samples_per_second: 284.05
eval_steps_per_second: 5.222
experiment_id: 5f8ab183779d40379d59ea615f9d5411
hostname: ip-172-31-85-32
iterations_since_restore: 1
learning_rate: 0.0015
loss: 0.6463
node_ip: 172.31.85.32
pid: 1322
should_checkpoint: true
step: 535
time_since_restore: 93.76131749153137
time_this_iter_s: 93.76131749153137
time_total_s: 93.76131749153137
timestamp: 1661447764
timesteps_since_restore: 0
training_iteration: 1
trial_id: 5654d_00002
warmup_time: 0.004533290863037109
(RayTrainWorker pid=1223, ip=172.31.85.193) ***** Running Evaluation *****
(RayTrainWorker pid=1223, ip=172.31.85.193) Num examples = 1043
(RayTrainWorker pid=1223, ip=172.31.85.193) Batch size = 16
(RayTrainWorker pid=1223, ip=172.31.85.193) The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence. If idx, sentence are not expected by `DistilBertForSequenceClassification.forward`, you can safely ignore this message.
(RayTrainWorker pid=1223, ip=172.31.85.193) {'loss': 0.926, 'learning_rate': 0.015, 'epoch': 1.0}
(RayTrainWorker pid=1223, ip=172.31.85.193) {'eval_loss': 0.6529427766799927, 'eval_matthews_correlation': 0.0, 'eval_runtime': 0.9428, 'eval_samples_per_second': 288.51, 'eval_steps_per_second': 5.303, 'epoch': 1.0}
(RayTrainWorker pid=1223, ip=172.31.85.193) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-535
(RayTrainWorker pid=1223, ip=172.31.85.193) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/config.json
(RayTrainWorker pid=1223, ip=172.31.85.193) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/pytorch_model.bin
(RayTrainWorker pid=1223, ip=172.31.85.193) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/tokenizer_config.json
(RayTrainWorker pid=1223, ip=172.31.85.193) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/special_tokens_map.json
Result for HuggingFaceTrainer_5654d_00003:
_time_this_iter_s: 89.4301290512085
_timestamp: 1661447782
_training_iteration: 1
date: 2022-08-25_10-16-22
done: true
epoch: 1.0
eval_loss: 0.6529427766799927
eval_matthews_correlation: 0.0
eval_runtime: 0.9428
eval_samples_per_second: 288.51
eval_steps_per_second: 5.303
experiment_id: 8495977eeefd405fa4d9c1ea8fa735e1
hostname: ip-172-31-85-193
iterations_since_restore: 1
learning_rate: 0.015
loss: 0.926
node_ip: 172.31.85.193
pid: 1060
should_checkpoint: true
step: 535
time_since_restore: 99.36774587631226
time_this_iter_s: 99.36774587631226
time_total_s: 99.36774587631226
timestamp: 1661447782
timesteps_since_restore: 0
training_iteration: 1
trial_id: 5654d_00003
warmup_time: 0.004132509231567383
(RayTrainWorker pid=1789, ip=172.31.90.137) ***** Running Evaluation *****
(RayTrainWorker pid=1789, ip=172.31.90.137) Num examples = 1043
(RayTrainWorker pid=1789, ip=172.31.90.137) Batch size = 16
(RayTrainWorker pid=1789, ip=172.31.90.137) The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: sentence, idx. If sentence, idx are not expected by `DistilBertForSequenceClassification.forward`, you can safely ignore this message.
(RayTrainWorker pid=1789, ip=172.31.90.137) {'loss': 0.3841, 'learning_rate': 1e-05, 'epoch': 2.0}
(RayTrainWorker pid=1789, ip=172.31.90.137) {'eval_loss': 0.5994958281517029, 'eval_matthews_correlation': 0.4573244914254411, 'eval_runtime': 0.9442, 'eval_samples_per_second': 288.066, 'eval_steps_per_second': 5.295, 'epoch': 2.0}
(RayTrainWorker pid=1789, ip=172.31.90.137) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-1070
(RayTrainWorker pid=1789, ip=172.31.90.137) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/config.json
(RayTrainWorker pid=1789, ip=172.31.90.137) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/pytorch_model.bin
(RayTrainWorker pid=1789, ip=172.31.90.137) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/tokenizer_config.json
(RayTrainWorker pid=1789, ip=172.31.90.137) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/special_tokens_map.json
Result for HuggingFaceTrainer_5654d_00000:
_time_this_iter_s: 76.82565689086914
_timestamp: 1661447830
_training_iteration: 2
date: 2022-08-25_10-17-10
done: false
epoch: 2.0
eval_loss: 0.5994958281517029
eval_matthews_correlation: 0.4573244914254411
eval_runtime: 0.9442
eval_samples_per_second: 288.066
eval_steps_per_second: 5.295
experiment_id: cee1b96afcf344e89482e3c5e298a412
hostname: ip-172-31-90-137
iterations_since_restore: 2
learning_rate: 1.0e-05
loss: 0.3841
node_ip: 172.31.90.137
pid: 1729
should_checkpoint: true
step: 1070
time_since_restore: 171.76071190834045
time_this_iter_s: 76.82838201522827
time_total_s: 171.76071190834045
timestamp: 1661447830
timesteps_since_restore: 0
training_iteration: 2
trial_id: 5654d_00000
warmup_time: 0.0037021636962890625
(RayTrainWorker pid=1789, ip=172.31.90.137) ***** Running Evaluation *****
(RayTrainWorker pid=1789, ip=172.31.90.137) Num examples = 1043
(RayTrainWorker pid=1789, ip=172.31.90.137) Batch size = 16
(RayTrainWorker pid=1789, ip=172.31.90.137) The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: sentence, idx. If sentence, idx are not expected by `DistilBertForSequenceClassification.forward`, you can safely ignore this message.
(RayTrainWorker pid=1789, ip=172.31.90.137) {'loss': 0.2687, 'learning_rate': 5e-06, 'epoch': 3.0}
(RayTrainWorker pid=1789, ip=172.31.90.137) {'eval_loss': 0.6935313940048218, 'eval_matthews_correlation': 0.5300538425561, 'eval_runtime': 1.0176, 'eval_samples_per_second': 267.305, 'eval_steps_per_second': 4.914, 'epoch': 3.0}
(RayTrainWorker pid=1789, ip=172.31.90.137) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-1605
(RayTrainWorker pid=1789, ip=172.31.90.137) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-1605/config.json
(RayTrainWorker pid=1789, ip=172.31.90.137) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-1605/pytorch_model.bin
(RayTrainWorker pid=1789, ip=172.31.90.137) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1605/tokenizer_config.json
(RayTrainWorker pid=1789, ip=172.31.90.137) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1605/special_tokens_map.json
Result for HuggingFaceTrainer_5654d_00000:
_time_this_iter_s: 76.47252488136292
_timestamp: 1661447906
_training_iteration: 3
date: 2022-08-25_10-18-26
done: false
epoch: 3.0
eval_loss: 0.6935313940048218
eval_matthews_correlation: 0.5300538425561
eval_runtime: 1.0176
eval_samples_per_second: 267.305
eval_steps_per_second: 4.914
experiment_id: cee1b96afcf344e89482e3c5e298a412
hostname: ip-172-31-90-137
iterations_since_restore: 3
learning_rate: 5.0e-06
loss: 0.2687
node_ip: 172.31.90.137
pid: 1729
should_checkpoint: true
step: 1605
time_since_restore: 248.23273348808289
time_this_iter_s: 76.47202157974243
time_total_s: 248.23273348808289
timestamp: 1661447906
timesteps_since_restore: 0
training_iteration: 3
trial_id: 5654d_00000
warmup_time: 0.0037021636962890625
(RayTrainWorker pid=1789, ip=172.31.90.137) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-2140
(RayTrainWorker pid=1789, ip=172.31.90.137) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-2140/config.json
(RayTrainWorker pid=1789, ip=172.31.90.137) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-2140/pytorch_model.bin
(RayTrainWorker pid=1789, ip=172.31.90.137) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-2140/tokenizer_config.json
(RayTrainWorker pid=1789, ip=172.31.90.137) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-2140/special_tokens_map.json
(RayTrainWorker pid=1789, ip=172.31.90.137) ***** Running Evaluation *****
(RayTrainWorker pid=1789, ip=172.31.90.137) Num examples = 1043
(RayTrainWorker pid=1789, ip=172.31.90.137) Batch size = 16
(RayTrainWorker pid=1789, ip=172.31.90.137) The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: sentence, idx. If sentence, idx are not expected by `DistilBertForSequenceClassification.forward`, you can safely ignore this message.
(RayTrainWorker pid=1789, ip=172.31.90.137) {'loss': 0.1958, 'learning_rate': 0.0, 'epoch': 4.0}
(RayTrainWorker pid=1789, ip=172.31.90.137) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-2140
(RayTrainWorker pid=1789, ip=172.31.90.137) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-2140/config.json
(RayTrainWorker pid=1789, ip=172.31.90.137) {'eval_loss': 0.8064090609550476, 'eval_matthews_correlation': 0.5322860764824153, 'eval_runtime': 1.0006, 'eval_samples_per_second': 271.827, 'eval_steps_per_second': 4.997, 'epoch': 4.0}
(RayTrainWorker pid=1789, ip=172.31.90.137) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-2140/pytorch_model.bin
(RayTrainWorker pid=1789, ip=172.31.90.137) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-2140/tokenizer_config.json
(RayTrainWorker pid=1789, ip=172.31.90.137) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-2140/special_tokens_map.json
(RayTrainWorker pid=1789, ip=172.31.90.137)
(RayTrainWorker pid=1789, ip=172.31.90.137)
(RayTrainWorker pid=1789, ip=172.31.90.137) Training completed. Do not forget to share your model on huggingface.co/models =)
(RayTrainWorker pid=1789, ip=172.31.90.137)
(RayTrainWorker pid=1789, ip=172.31.90.137)
(RayTrainWorker pid=1789, ip=172.31.90.137) {'train_runtime': 329.1948, 'train_samples_per_second': 103.902, 'train_steps_per_second': 6.501, 'train_loss': 0.34860724689804506, 'epoch': 4.0}
Result for HuggingFaceTrainer_5654d_00000:
_time_this_iter_s: 98.92064905166626
_timestamp: 1661448005
_training_iteration: 4
date: 2022-08-25_10-20-05
done: true
epoch: 4.0
eval_loss: 0.8064090609550476
eval_matthews_correlation: 0.5322860764824153
eval_runtime: 1.0006
eval_samples_per_second: 271.827
eval_steps_per_second: 4.997
experiment_id: cee1b96afcf344e89482e3c5e298a412
hostname: ip-172-31-90-137
iterations_since_restore: 4
learning_rate: 0.0
loss: 0.1958
node_ip: 172.31.90.137
pid: 1729
should_checkpoint: true
step: 2140
time_since_restore: 347.1705844402313
time_this_iter_s: 98.93785095214844
time_total_s: 347.1705844402313
timestamp: 1661448005
timesteps_since_restore: 0
train_loss: 0.34860724689804506
train_runtime: 329.1948
train_samples_per_second: 103.902
train_steps_per_second: 6.501
training_iteration: 4
trial_id: 5654d_00000
warmup_time: 0.0037021636962890625
2022-08-25 10:20:13,409 INFO tune.py:758 -- Total run time: 361.90 seconds (361.74 seconds for the tuning loop).
We can view the results of the tuning run as a dataframe, and obtain the best result.
tune_results.get_dataframe().sort_values("eval_loss")
loss | learning_rate | epoch | step | eval_loss | eval_matthews_correlation | eval_runtime | eval_samples_per_second | eval_steps_per_second | _timestamp | ... | pid | hostname | node_ip | time_since_restore | timesteps_since_restore | iterations_since_restore | warmup_time | config/trainer_init_config/epochs | config/trainer_init_config/learning_rate | logdir | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0.6225 | 0.00015 | 1.0 | 535 | 0.649242 | 0.000000 | 1.0157 | 267.792 | 4.923 | 1661447759 | ... | 1805 | ip-172-31-76-237 | 172.31.76.237 | 95.249164 | 0 | 1 | 0.003661 | 4 | 0.00020 | /home/ray/ray_results/HuggingFaceTrainer_2022-... |
3 | 0.9260 | 0.01500 | 1.0 | 535 | 0.652943 | 0.000000 | 0.9428 | 288.510 | 5.303 | 1661447782 | ... | 1060 | ip-172-31-85-193 | 172.31.85.193 | 99.367746 | 0 | 1 | 0.004133 | 4 | 0.02000 | /home/ray/ray_results/HuggingFaceTrainer_2022-... |
2 | 0.6463 | 0.00150 | 1.0 | 535 | 0.658653 | 0.000000 | 0.9576 | 284.050 | 5.222 | 1661447764 | ... | 1322 | ip-172-31-85-32 | 172.31.85.32 | 93.761317 | 0 | 1 | 0.004533 | 4 | 0.00200 | /home/ray/ray_results/HuggingFaceTrainer_2022-... |
0 | 0.1958 | 0.00000 | 4.0 | 2140 | 0.806409 | 0.532286 | 1.0006 | 271.827 | 4.997 | 1661448005 | ... | 1729 | ip-172-31-90-137 | 172.31.90.137 | 347.170584 | 0 | 4 | 0.003702 | 4 | 0.00002 | /home/ray/ray_results/HuggingFaceTrainer_2022-... |
4 rows Γ 33 columns
best_result = tune_results.get_best_result()
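As with the single training run, best_result is a Result object, so you can take a quick look at its metrics and checkpoint before running prediction (a brief sketch):
# Best trial's evaluation loss and the checkpoint we will use for prediction.
print(best_result.metrics["eval_loss"])
print(best_result.checkpoint)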
Predict on test data with Ray AIR
You can now use the checkpoint to run prediction with HuggingFacePredictor, which wraps around 🤗 Pipelines. In order to distribute prediction, we use BatchPredictor. While this is not necessary for the very small example we are using (you could use HuggingFacePredictor directly), it will scale well to a large dataset.
from ray.train.huggingface import HuggingFacePredictor
from ray.train.batch_predictor import BatchPredictor
import pandas as pd
predictor = BatchPredictor.from_checkpoint(
checkpoint=best_result.checkpoint,
predictor_cls=HuggingFacePredictor,
task="text-classification",
device=0 if use_gpu else -1, # -1 is CPU, otherwise device index
)
prediction = predictor.predict(ray_datasets["test"].map_batches(lambda x: x[["sentence"]]), num_gpus_per_worker=int(use_gpu))
prediction.show()
Map_Batches: 100%|██████████| 1/1 [00:00<00:00, 12.41it/s]
Map_Batches: 100%|██████████| 1/1 [00:00<00:00, 7.46it/s]
Map Progress (1 actors 1 pending): 100%|██████████| 1/1 [00:18<00:00, 18.46s/it]
{'label': 'LABEL_1', 'score': 0.6822417974472046}
{'label': 'LABEL_1', 'score': 0.6822402477264404}
{'label': 'LABEL_1', 'score': 0.6822407841682434}
{'label': 'LABEL_1', 'score': 0.6822386980056763}
{'label': 'LABEL_1', 'score': 0.6822428107261658}
{'label': 'LABEL_1', 'score': 0.6822453737258911}
{'label': 'LABEL_1', 'score': 0.6822437047958374}
{'label': 'LABEL_1', 'score': 0.6822428703308105}
{'label': 'LABEL_1', 'score': 0.6822431683540344}
{'label': 'LABEL_1', 'score': 0.6822426915168762}
{'label': 'LABEL_1', 'score': 0.6822447776794434}
{'label': 'LABEL_1', 'score': 0.6822456121444702}
{'label': 'LABEL_1', 'score': 0.6822471022605896}
{'label': 'LABEL_1', 'score': 0.6822477579116821}
{'label': 'LABEL_1', 'score': 0.682244598865509}
{'label': 'LABEL_1', 'score': 0.6822422742843628}
{'label': 'LABEL_1', 'score': 0.6822470426559448}
{'label': 'LABEL_1', 'score': 0.6822417378425598}
{'label': 'LABEL_1', 'score': 0.6822449564933777}
{'label': 'LABEL_1', 'score': 0.682239294052124}
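The prediction output is itself a Ray Dataset, so if you want to analyze the scores further you can convert it to pandas (a minimal sketch):
# Convert the predictions to a pandas DataFrame and count the predicted labels.
pred_df = prediction.to_pandas()
pred_df["label"].value_counts()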