{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Fine-tune a 🤗 Transformers model"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "VaFMt6AIhYbK"
},
"source": [
"This notebook is based on [an official 🤗 notebook - \"How to fine-tune a model on text classification\"](https://github.com/huggingface/notebooks/blob/6ca682955173cc9d36ffa431ddda505a048cbe80/examples/text_classification.ipynb). The main aim of this notebook is to show the process of conversion from vanilla 🤗 to [Ray AIR](https://docs.ray.io/en/latest/ray-air/getting-started.html) 🤗 without changing the training logic unless necessary.\n",
"\n",
"In this notebook, we will:\n",
"1. [Set up Ray](#setup)\n",
"2. [Load the dataset](#load)\n",
"3. [Preprocess the dataset with Ray AIR](#preprocess)\n",
"4. [Run the training with Ray AIR](#train)\n",
"5. [Predict on test data with Ray AIR](#predict)\n",
"6. [Optionally, share the model with the community](#share)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "sQbdfyWQhYbO"
},
"source": [
"Uncomment and run the following line in order to install all the necessary dependencies (this notebook is being tested with `transformers==4.19.1`):"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"id": "YajFzmkthYbO"
},
"outputs": [],
"source": [
"#! pip install \"datasets\" \"transformers>=4.19.0\" \"torch>=1.10.0\" \"mlflow\" \"ray[air]>=1.13\""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "pvSRaEHChYbP"
},
"source": [
"## Set up Ray "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "LRdL3kWBhYbQ"
},
"source": [
"We will use `ray.init()` to initialize a local cluster. By default, this cluster will be compromised of only the machine you are running this notebook on. You can also run this notebook on an Anyscale cluster.\n",
"\n",
"Note: this notebook *will not* run in Ray Client mode."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "MOsHUjgdIrIW",
"outputId": "e527bdbb-2f28-4142-cca0-762e0566cbcd"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2022-08-25 10:09:51,282\tINFO worker.py:1223 -- Using address localhost:9031 set in the environment variable RAY_ADDRESS\n",
"2022-08-25 10:09:51,697\tINFO worker.py:1333 -- Connecting to existing Ray cluster at address: 172.31.80.117:9031...\n",
"2022-08-25 10:09:51,706\tINFO worker.py:1509 -- Connected to Ray cluster. View the dashboard at \u001b[1m\u001b[32mhttps://session-i8ddtfaxhwypbvnyb9uzg7xs.i.anyscaleuserdata-staging.com/auth/?token=agh0_CkcwRQIhAJXwvxwq31GryaWthvXGCXZebsijbuqi7qL2pCa5uROOAiBGjzsyXAJFHLlaEI9zSlNI8ewtghKg5UV3t8NmlxuMcRJmEiCtvjcKE0VPiU7iQx51P9oPQjfpo5g1RJXccVSS5005cBgCIgNuL2E6DAj9xazjBhDwj4veAUIMCP3ClJgGEPCPi94B-gEeChxzZXNfaThERFRmQVhId1lwYlZueWI5dVpnN3hT&redirect_to=dashboard \u001b[39m\u001b[22m\n",
"2022-08-25 10:09:51,709\tINFO packaging.py:342 -- Pushing file package 'gcs://_ray_pkg_3332f64b0a461fddc20be71129115d0a.zip' (0.34MiB) to Ray cluster...\n",
"2022-08-25 10:09:51,714\tINFO packaging.py:351 -- Successfully pushed file package 'gcs://_ray_pkg_3332f64b0a461fddc20be71129115d0a.zip'.\n"
]
},
{
"data": {
"text/html": [
"
\n"
],
"text/plain": [
"RayContext(dashboard_url='session-i8ddtfaxhwypbvnyb9uzg7xs.i.anyscaleuserdata-staging.com/auth/?token=agh0_CkcwRQIhAJXwvxwq31GryaWthvXGCXZebsijbuqi7qL2pCa5uROOAiBGjzsyXAJFHLlaEI9zSlNI8ewtghKg5UV3t8NmlxuMcRJmEiCtvjcKE0VPiU7iQx51P9oPQjfpo5g1RJXccVSS5005cBgCIgNuL2E6DAj9xazjBhDwj4veAUIMCP3ClJgGEPCPi94B-gEeChxzZXNfaThERFRmQVhId1lwYlZueWI5dVpnN3hT&redirect_to=dashboard', python_version='3.8.5', ray_version='2.0.0', ray_commit='cba26cc83f6b5b8a2ff166594a65cb74c0ec8740', address_info={'node_ip_address': '172.31.80.117', 'raylet_ip_address': '172.31.80.117', 'redis_address': None, 'object_store_address': '/tmp/ray/session_2022-08-25_09-57-39_455459_216/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2022-08-25_09-57-39_455459_216/sockets/raylet', 'webui_url': 'session-i8ddtfaxhwypbvnyb9uzg7xs.i.anyscaleuserdata-staging.com/auth/?token=agh0_CkcwRQIhAJXwvxwq31GryaWthvXGCXZebsijbuqi7qL2pCa5uROOAiBGjzsyXAJFHLlaEI9zSlNI8ewtghKg5UV3t8NmlxuMcRJmEiCtvjcKE0VPiU7iQx51P9oPQjfpo5g1RJXccVSS5005cBgCIgNuL2E6DAj9xazjBhDwj4veAUIMCP3ClJgGEPCPi94B-gEeChxzZXNfaThERFRmQVhId1lwYlZueWI5dVpnN3hT&redirect_to=dashboard', 'session_dir': '/tmp/ray/session_2022-08-25_09-57-39_455459_216', 'metrics_export_port': 55366, 'gcs_address': '172.31.80.117:9031', 'address': '172.31.80.117:9031', 'dashboard_agent_listen_port': 52365, 'node_id': '422ff33444fd0f870aa6e718628407400a0ec9483a637c3026c3f9a3'})"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from pprint import pprint\n",
"import ray\n",
"\n",
"ray.init()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "oJiSdWy2hYbR"
},
"source": [
"We can check the resources our cluster is composed of. If you are running this notebook on your local machine or Google Colab, you should see the number of CPU cores and GPUs available on the said machine."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "KlMz0dt9hYbS",
"outputId": "2d485449-ee69-4334-fcba-47e0ceb63078"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'CPU': 208.0,\n",
" 'GPU': 16.0,\n",
" 'accelerator_type:T4': 4.0,\n",
" 'memory': 616693614180.0,\n",
" 'node:172.31.76.237': 1.0,\n",
" 'node:172.31.80.117': 1.0,\n",
" 'node:172.31.85.193': 1.0,\n",
" 'node:172.31.85.32': 1.0,\n",
" 'node:172.31.90.137': 1.0,\n",
" 'object_store_memory': 259318055729.0}\n"
]
}
],
"source": [
"pprint(ray.cluster_resources())"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "uS6oeJELhYbS"
},
"source": [
"In this notebook, we will see how to fine-tune one of the [🤗 Transformers](https://github.com/huggingface/transformers) model to a text classification task of the [GLUE Benchmark](https://gluebenchmark.com/). We will be running the training using [Ray AIR](https://docs.ray.io/en/latest/ray-air/getting-started.html).\n",
"\n",
"You can change those two variables to control whether the training (which we will get to later) uses CPUs or GPUs, and how many workers should be spawned. Each worker will claim one CPU or GPU. Make sure not to request more resources than the resources present!\n",
"\n",
"By default, we will run the training with one GPU worker."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"id": "gAbhv9OqhYbT"
},
"outputs": [],
"source": [
"use_gpu = True # set this to False to run on CPUs\n",
"num_workers = 1 # set this to number of GPUs/CPUs you want to use"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "rEJBSTyZIrIb"
},
"source": [
"## Fine-tuning a model on a text classification task"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kTCFado4IrIc"
},
"source": [
"The GLUE Benchmark is a group of nine classification tasks on sentences or pairs of sentences. If you would like to learn more, refer to the [original notebook](https://github.com/huggingface/notebooks/blob/6ca682955173cc9d36ffa431ddda505a048cbe80/examples/text_classification.ipynb).\n",
"\n",
"Each task is named by its acronym, with `mnli-mm` standing for the mismatched version of MNLI (so same training set as `mnli` but different validation and test sets):"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"id": "YZbiBDuGIrId"
},
"outputs": [],
"source": [
"GLUE_TASKS = [\"cola\", \"mnli\", \"mnli-mm\", \"mrpc\", \"qnli\", \"qqp\", \"rte\", \"sst2\", \"stsb\", \"wnli\"]"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4RRkXuteIrIh"
},
"source": [
"This notebook is built to run on any of the tasks in the list above, with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a version with a classification head. Depending on your model and the GPU you are using, you might need to adjust the batch size to avoid out-of-memory errors. Set those three parameters, then the rest of the notebook should run smoothly:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"id": "zVvslsfMIrIh"
},
"outputs": [],
"source": [
"task = \"cola\"\n",
"model_checkpoint = \"distilbert-base-uncased\"\n",
"batch_size = 16"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "whPRbBNbIrIl"
},
"source": [
"### Loading the dataset "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "W7QYTpxXIrIl"
},
"source": [
"We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`.\n",
"\n",
"Apart from `mnli-mm` being a special code, we can directly pass our task name to those functions.\n",
"\n",
"As Ray AIR doesn't provide integrations for 🤗 Datasets yet, we will simply run the normal 🤗 Datasets code to load the dataset from the Hub."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 200
},
"id": "MwhAeEOuhYbV",
"outputId": "3aff8c73-d6eb-4784-890a-a419403b5bda"
},
"outputs": [],
"source": [
"from datasets import load_dataset\n",
"\n",
"actual_task = \"mnli\" if task == \"mnli-mm\" else task\n",
"datasets = load_dataset(\"glue\", actual_task)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "RzfPtOMoIrIu"
},
"source": [
"The `dataset` object itself is [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key for the training, validation, and test set (with more keys for the mismatched validation and test set in the special case of `mnli`)."
]
},
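{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get a feel for the data, we can peek at one example from the training split with standard 🤗 Datasets indexing (purely illustrative; this cell is not required for the rest of the notebook):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Purely illustrative: inspect a single raw example from the training split.\n",
"datasets[\"train\"][0]"
]
},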
{
"cell_type": "markdown",
"metadata": {
"id": "_TOee7nohYbW"
},
"source": [
"We will also need the metric. In order to avoid serialization errors, we will load the metric inside the training workers later. Therefore, now we will just define the function we will use."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"id": "FNE583uBhYbW"
},
"outputs": [],
"source": [
"from datasets import load_metric\n",
"\n",
"def load_metric_fn():\n",
" return load_metric('glue', actual_task)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "lnjDIuQ3IrI-"
},
"source": [
"The metric is an instance of [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric)."
]
},
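{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration (adapted from the original notebook and not required for training), you can call the metric's `compute` method on random predictions and labels; the keys in the returned dictionary depend on the task:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Illustrative only: score random predictions against random labels.\n",
"metric = load_metric_fn()\n",
"fake_preds = np.random.randint(0, 2, size=(64,))\n",
"fake_labels = np.random.randint(0, 2, size=(64,))\n",
"metric.compute(predictions=fake_preds, references=fake_labels)"
]
},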
{
"cell_type": "markdown",
"metadata": {
"id": "n9qywopnIrJH"
},
"source": [
"### Preprocessing the data with Ray AIR "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "YVx71GdAIrJH"
},
"source": [
"Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers' `Tokenizer`, which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put it in a format the model expects, as well as generate the other inputs that model requires.\n",
"\n",
"To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure that:\n",
"\n",
"- we get a tokenizer that corresponds to the model architecture we want to use,\n",
"- we download the vocabulary used when pretraining this specific checkpoint."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 145
},
"id": "eXNLu_-nIrJI",
"outputId": "f545a7a5-f341-4315-cd89-9942a657aa31"
},
"outputs": [],
"source": [
"from transformers import AutoTokenizer\n",
"\n",
"tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Vl6IidfdIrJK"
},
"source": [
"We pass along `use_fast=True` to the call above to use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library. Those fast tokenizers are available for almost all models, but if you got an error with the previous call, remove that argument."
]
},
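{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can call this tokenizer directly on one sentence or a pair of sentences (an illustrative check, not needed for the rest of the notebook):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Returns input_ids and attention_mask ready for the model.\n",
"tokenizer(\"Hello, this is one sentence!\", \"And this sentence goes with it.\")"
]
},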
{
"cell_type": "markdown",
"metadata": {
"id": "qo_0B1M2IrJM"
},
"source": [
"To preprocess our dataset, we will thus need the names of the columns containing the sentence(s). The following dictionary keeps track of the correspondence task to column names:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"id": "fyGdtK9oIrJM"
},
"outputs": [],
"source": [
"task_to_keys = {\n",
" \"cola\": (\"sentence\", None),\n",
" \"mnli\": (\"premise\", \"hypothesis\"),\n",
" \"mnli-mm\": (\"premise\", \"hypothesis\"),\n",
" \"mrpc\": (\"sentence1\", \"sentence2\"),\n",
" \"qnli\": (\"question\", \"sentence\"),\n",
" \"qqp\": (\"question1\", \"question2\"),\n",
" \"rte\": (\"sentence1\", \"sentence2\"),\n",
" \"sst2\": (\"sentence\", None),\n",
" \"stsb\": (\"sentence1\", \"sentence2\"),\n",
" \"wnli\": (\"sentence1\", \"sentence2\"),\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "256fOuzjhYbY"
},
"source": [
"For Ray AIR, instead of using 🤗 Dataset objects directly, we will convert them to [Ray Datasets](https://docs.ray.io/en/latest/data/dataset.html). Both are backed by Arrow tables, so the conversion is straightforward. We will use the built-in `ray.data.from_huggingface` function."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'train': Dataset(num_blocks=1, num_rows=8551, schema={sentence: string, label: int64, idx: int32}),\n",
" 'validation': Dataset(num_blocks=1, num_rows=1043, schema={sentence: string, label: int64, idx: int32}),\n",
" 'test': Dataset(num_blocks=1, num_rows=1063, schema={sentence: string, label: int64, idx: int32})}"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import ray.data\n",
"\n",
"ray_datasets = ray.data.from_huggingface(datasets)\n",
"ray_datasets"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2C0hcmp9IrJQ"
},
"source": [
"We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the argument `truncation=True`. This will ensure that an input longer than what the model selected can handle will be truncated to the maximum length accepted by the model.\n",
"\n",
"We use a `BatchMapper` to create a Ray AIR preprocessor that will map the function to the dataset in a distributed fashion. It will run during training and prediction."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"id": "vc0BSBLIIrJQ"
},
"outputs": [],
"source": [
"import pandas as pd\n",
"from ray.data.preprocessors import BatchMapper\n",
"\n",
"def preprocess_function(examples: pd.DataFrame):\n",
" # if we only have one column, we are inferring.\n",
" # no need to tokenize in that case. \n",
" if len(examples.columns) == 1:\n",
" return examples\n",
" examples = examples.to_dict(\"list\")\n",
" sentence1_key, sentence2_key = task_to_keys[task]\n",
" if sentence2_key is None:\n",
" ret = tokenizer(examples[sentence1_key], truncation=True)\n",
" else:\n",
" ret = tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)\n",
" # Add back the original columns\n",
" ret = {**examples, **ret}\n",
" return pd.DataFrame.from_dict(ret)\n",
"\n",
"batch_encoder = BatchMapper(preprocess_function, batch_format=\"pandas\")"
]
},
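{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sanity check (illustrative only, assuming a tiny sample comfortably fits in memory), we can pull a two-row pandas batch out of the Ray Dataset and run the preprocessing function on it to inspect the columns it adds:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative only: run the preprocessing function on a 2-row pandas batch.\n",
"sample_batch = ray_datasets[\"train\"].limit(2).to_pandas()\n",
"list(preprocess_function(sample_batch).columns)"
]
},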
{
"cell_type": "markdown",
"metadata": {
"id": "545PP3o8IrJV"
},
"source": [
"### Fine-tuning the model with Ray AIR "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "FBiW8UpKIrJW"
},
"source": [
"Now that our data is ready, we can download the pretrained model and fine-tune it.\n",
"\n",
"Since all our tasks are about sentence classification, we use the `AutoModelForSequenceClassification` class.\n",
"\n",
"We will not go into details about each specific component of the training (see the [original notebook](https://github.com/huggingface/notebooks/blob/6ca682955173cc9d36ffa431ddda505a048cbe80/examples/text_classification.ipynb) for that). The tokenizer is the same as we have used to encoded the dataset before.\n",
"\n",
"The main difference when using the Ray AIR is that we need to create our 🤗 Transformers `Trainer` inside a function (`trainer_init_per_worker`) and return it. That function will be passed to the `HuggingFaceTrainer` and will run on every Ray worker. The training will then proceed by the means of PyTorch DDP.\n",
"\n",
"Make sure that you initialize the model, metric, and tokenizer inside that function. Otherwise, you may run into serialization errors.\n",
"\n",
"Furthermore, `push_to_hub=True` is not yet supported. Ray will, however, checkpoint the model at every epoch, allowing you to push it to hub manually. We will do that after the training.\n",
"\n",
"If you wish to use thrid party logging libraries, such as MLflow or Weights&Biases, do not set them in `TrainingArguments` (they will be automatically disabled) - instead, you should pass Ray AIR callbacks to `HuggingFaceTrainer`'s `run_config`. In this example, we will use MLflow."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"id": "TlqNaB8jIrJW"
},
"outputs": [],
"source": [
"from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer\n",
"import numpy as np\n",
"import torch\n",
"\n",
"num_labels = 3 if task.startswith(\"mnli\") else 1 if task==\"stsb\" else 2\n",
"metric_name = \"pearson\" if task == \"stsb\" else \"matthews_correlation\" if task == \"cola\" else \"accuracy\"\n",
"model_name = model_checkpoint.split(\"/\")[-1]\n",
"validation_key = \"validation_mismatched\" if task == \"mnli-mm\" else \"validation_matched\" if task == \"mnli\" else \"validation\"\n",
"name = f\"{model_name}-finetuned-{task}\"\n",
"\n",
"def trainer_init_per_worker(train_dataset, eval_dataset = None, **config):\n",
" print(f\"Is CUDA available: {torch.cuda.is_available()}\")\n",
" metric = load_metric_fn()\n",
" tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)\n",
" model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)\n",
" args = TrainingArguments(\n",
" name,\n",
" evaluation_strategy=\"epoch\",\n",
" save_strategy=\"epoch\",\n",
" logging_strategy=\"epoch\",\n",
" learning_rate=config.get(\"learning_rate\", 2e-5),\n",
" per_device_train_batch_size=batch_size,\n",
" per_device_eval_batch_size=batch_size,\n",
" num_train_epochs=config.get(\"epochs\", 2),\n",
" weight_decay=config.get(\"weight_decay\", 0.01),\n",
" push_to_hub=False,\n",
" disable_tqdm=True, # declutter the output a little\n",
" no_cuda=not use_gpu, # you need to explicitly set no_cuda if you want CPUs\n",
" )\n",
"\n",
" def compute_metrics(eval_pred):\n",
" predictions, labels = eval_pred\n",
" if task != \"stsb\":\n",
" predictions = np.argmax(predictions, axis=1)\n",
" else:\n",
" predictions = predictions[:, 0]\n",
" return metric.compute(predictions=predictions, references=labels)\n",
"\n",
" trainer = Trainer(\n",
" model,\n",
" args,\n",
" train_dataset=train_dataset,\n",
" eval_dataset=eval_dataset,\n",
" tokenizer=tokenizer,\n",
" compute_metrics=compute_metrics\n",
" )\n",
"\n",
" print(\"Starting training\")\n",
" return trainer"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "CdzABDVcIrJg"
},
"source": [
"With our `trainer_init_per_worker` complete, we can now instantiate the `HuggingFaceTrainer`. Aside from the function, we set the `scaling_config`, controlling the amount of workers and resources used, and the `datasets` we will use for training and evaluation.\n",
"\n",
"We specify the `MLflowLoggerCallback` inside the `run_config`, and pass the preprocessor we have defined earlier as an argument. The preprocessor will be included with the returned `Checkpoint`, meaning it will also be applied during inference."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"id": "RElw7OgLhYba"
},
"outputs": [],
"source": [
"from ray.train.huggingface import HuggingFaceTrainer\n",
"from ray.air.config import RunConfig, ScalingConfig, CheckpointConfig\n",
"from ray.air.integrations.mlflow import MLflowLoggerCallback\n",
"\n",
"trainer = HuggingFaceTrainer(\n",
" trainer_init_per_worker=trainer_init_per_worker,\n",
" scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),\n",
" datasets={\"train\": ray_datasets[\"train\"], \"evaluation\": ray_datasets[validation_key]},\n",
" run_config=RunConfig(\n",
" callbacks=[MLflowLoggerCallback(experiment_name=name)],\n",
" checkpoint_config=CheckpointConfig(num_to_keep=1, checkpoint_score_attribute=\"eval_loss\", checkpoint_score_order=\"min\"),\n",
" ),\n",
" preprocessor=batch_encoder,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "XvS136zKhYba"
},
"source": [
"Finally, we call the `fit` method to start training with Ray AIR. We will save the `Result` object to a variable so we can access metrics and checkpoints."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
},
"id": "uNx5pyRlIrJh",
"outputId": "8496fe4f-f1c3-48ad-a6d3-b16a65716135"
},
"outputs": [
{
"data": {
"text/html": [
"== Status == Current time: 2022-08-25 10:14:09 (running for 00:04:06.45) Memory usage on this node: 4.3/62.0 GiB Using FIFO scheduling algorithm. Resources requested: 0/208 CPUs, 0/16 GPUs, 0.0/574.34 GiB heap, 0.0/241.51 GiB objects (0.0/4.0 accelerator_type:T4) Result logdir: /home/ray/ray_results/HuggingFaceTrainer_2022-08-25_10-10-02 Number of trials: 1/1 (1 TERMINATED)
\n",
"\n",
"
Trial name
status
loc
iter
total time (s)
loss
learning_rate
epoch
\n",
"\n",
"\n",
"
HuggingFaceTrainer_c1ff5_00000
TERMINATED
172.31.90.137:947
2
200.217
0.3886
0
2
\n",
"\n",
"
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1114, ip=172.31.90.137) 2022-08-25 10:10:44,617\tINFO config.py:71 -- Setting up process group for: env:// [rank=0, world_size=4]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1114, ip=172.31.90.137) Is CUDA available: True\n",
"(RayTrainWorker pid=1116, ip=172.31.90.137) Is CUDA available: True\n",
"(RayTrainWorker pid=1117, ip=172.31.90.137) Is CUDA available: True\n",
"(RayTrainWorker pid=1115, ip=172.31.90.137) Is CUDA available: True\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Downloading builder script: 5.76kB [00:00, 6.45MB/s] \n",
"Downloading builder script: 5.76kB [00:00, 6.91MB/s] \n",
"Downloading builder script: 5.76kB [00:00, 6.44MB/s] \n",
"Downloading builder script: 5.76kB [00:00, 6.94MB/s] \n",
"Downloading tokenizer_config.json: 100%|██████████| 28.0/28.0 [00:00<00:00, 30.5kB/s]\n",
"Downloading config.json: 100%|██████████| 483/483 [00:00<00:00, 817kB/s]\n",
"Downloading vocab.txt: 0%| | 0.00/226k [00:00, ?B/s]\n",
"Downloading vocab.txt: 18%|█▊ | 41.0k/226k [00:00<00:00, 353kB/s]\n",
"Downloading vocab.txt: 100%|██████████| 226k/226k [00:00<00:00, 773kB/s] \n",
"Downloading tokenizer.json: 0%| | 0.00/455k [00:00, ?B/s]\n",
"Downloading tokenizer.json: 6%|▌ | 28.0k/455k [00:00<00:01, 227kB/s]\n",
"Downloading tokenizer.json: 24%|██▍ | 111k/455k [00:00<00:00, 488kB/s] \n",
"Downloading tokenizer.json: 42%|████▏ | 191k/455k [00:00<00:00, 559kB/s]\n",
"Downloading tokenizer.json: 67%|██████▋ | 303k/455k [00:00<00:00, 694kB/s]\n",
"Downloading tokenizer.json: 100%|██████████| 455k/455k [00:00<00:00, 815kB/s]\n",
"Downloading pytorch_model.bin: 0%| | 0.00/256M [00:00, ?B/s]\n",
"Downloading pytorch_model.bin: 0%| | 1.20M/256M [00:00<00:21, 12.6MB/s]\n",
"Downloading pytorch_model.bin: 2%|▏ | 6.02M/256M [00:00<00:07, 34.9MB/s]\n",
"Downloading pytorch_model.bin: 6%|▌ | 15.0M/256M [00:00<00:04, 62.0MB/s]\n",
"Downloading pytorch_model.bin: 9%|▉ | 24.0M/256M [00:00<00:03, 74.8MB/s]\n",
"Downloading pytorch_model.bin: 13%|█▎ | 33.1M/256M [00:00<00:02, 82.3MB/s]\n",
"Downloading pytorch_model.bin: 17%|█▋ | 42.2M/256M [00:00<00:02, 86.7MB/s]\n",
"Downloading pytorch_model.bin: 20%|██ | 51.4M/256M [00:00<00:02, 89.8MB/s]\n",
"Downloading pytorch_model.bin: 24%|██▎ | 60.6M/256M [00:00<00:02, 91.8MB/s]\n",
"Downloading pytorch_model.bin: 27%|██▋ | 69.8M/256M [00:00<00:02, 93.3MB/s]\n",
"Downloading pytorch_model.bin: 31%|███ | 78.9M/256M [00:01<00:01, 94.2MB/s]\n",
"Downloading pytorch_model.bin: 34%|███▍ | 88.0M/256M [00:01<00:01, 94.6MB/s]\n",
"Downloading pytorch_model.bin: 38%|███▊ | 97.2M/256M [00:01<00:01, 95.1MB/s]\n",
"Downloading pytorch_model.bin: 42%|████▏ | 106M/256M [00:01<00:01, 95.6MB/s] \n",
"Downloading pytorch_model.bin: 45%|████▌ | 116M/256M [00:01<00:01, 96.0MB/s]\n",
"Downloading pytorch_model.bin: 49%|████▉ | 125M/256M [00:01<00:01, 96.2MB/s]\n",
"Downloading pytorch_model.bin: 52%|█████▏ | 134M/256M [00:01<00:01, 96.0MB/s]\n",
"Downloading pytorch_model.bin: 56%|█████▌ | 143M/256M [00:01<00:01, 96.1MB/s]\n",
"Downloading pytorch_model.bin: 60%|█████▉ | 152M/256M [00:01<00:01, 96.0MB/s]\n",
"Downloading pytorch_model.bin: 63%|██████▎ | 162M/256M [00:01<00:01, 96.2MB/s]\n",
"Downloading pytorch_model.bin: 67%|██████▋ | 171M/256M [00:02<00:00, 96.1MB/s]\n",
"Downloading pytorch_model.bin: 70%|███████ | 180M/256M [00:02<00:00, 96.2MB/s]\n",
"Downloading pytorch_model.bin: 74%|███████▍ | 189M/256M [00:02<00:00, 96.2MB/s]\n",
"Downloading pytorch_model.bin: 78%|███████▊ | 198M/256M [00:02<00:00, 96.2MB/s]\n",
"Downloading pytorch_model.bin: 81%|████████ | 208M/256M [00:02<00:00, 95.9MB/s]\n",
"Downloading pytorch_model.bin: 85%|████████▍ | 217M/256M [00:02<00:00, 95.9MB/s]\n",
"Downloading pytorch_model.bin: 88%|████████▊ | 226M/256M [00:02<00:00, 96.2MB/s]\n",
"Downloading pytorch_model.bin: 92%|█████████▏| 235M/256M [00:02<00:00, 96.1MB/s]\n",
"Downloading pytorch_model.bin: 96%|█████████▌| 244M/256M [00:02<00:00, 96.1MB/s]\n",
"Downloading pytorch_model.bin: 100%|██████████| 256M/256M [00:02<00:00, 91.6MB/s]\n",
"(RayTrainWorker pid=1117, ip=172.31.90.137) Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias']\n",
"(RayTrainWorker pid=1117, ip=172.31.90.137) - This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
"(RayTrainWorker pid=1117, ip=172.31.90.137) - This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
"(RayTrainWorker pid=1117, ip=172.31.90.137) Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.bias', 'classifier.weight', 'pre_classifier.weight']\n",
"(RayTrainWorker pid=1117, ip=172.31.90.137) You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n",
"(RayTrainWorker pid=1114, ip=172.31.90.137) Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_transform.weight']\n",
"(RayTrainWorker pid=1114, ip=172.31.90.137) - This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
"(RayTrainWorker pid=1114, ip=172.31.90.137) - This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
"(RayTrainWorker pid=1114, ip=172.31.90.137) Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classifier.weight', 'classifier.bias']\n",
"(RayTrainWorker pid=1114, ip=172.31.90.137) You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n",
"(RayTrainWorker pid=1116, ip=172.31.90.137) Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_projector.weight']\n",
"(RayTrainWorker pid=1116, ip=172.31.90.137) - This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
"(RayTrainWorker pid=1116, ip=172.31.90.137) - This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
"(RayTrainWorker pid=1116, ip=172.31.90.137) Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight']\n",
"(RayTrainWorker pid=1116, ip=172.31.90.137) You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n",
"(RayTrainWorker pid=1115, ip=172.31.90.137) Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight']\n",
"(RayTrainWorker pid=1115, ip=172.31.90.137) - This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
"(RayTrainWorker pid=1115, ip=172.31.90.137) - This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
"(RayTrainWorker pid=1115, ip=172.31.90.137) Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'classifier.bias', 'pre_classifier.bias']\n",
"(RayTrainWorker pid=1115, ip=172.31.90.137) You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1114, ip=172.31.90.137) Starting training\n",
"(RayTrainWorker pid=1116, ip=172.31.90.137) Starting training\n",
"(RayTrainWorker pid=1117, ip=172.31.90.137) Starting training\n",
"(RayTrainWorker pid=1115, ip=172.31.90.137) Starting training\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1114, ip=172.31.90.137) ***** Running training *****\n",
"(RayTrainWorker pid=1114, ip=172.31.90.137) Num examples = 8551\n",
"(RayTrainWorker pid=1114, ip=172.31.90.137) Num Epochs = 2\n",
"(RayTrainWorker pid=1114, ip=172.31.90.137) Instantaneous batch size per device = 16\n",
"(RayTrainWorker pid=1114, ip=172.31.90.137) Total train batch size (w. parallel, distributed & accumulation) = 64\n",
"(RayTrainWorker pid=1114, ip=172.31.90.137) Gradient Accumulation steps = 1\n",
"(RayTrainWorker pid=1114, ip=172.31.90.137) Total optimization steps = 1070\n",
"(RayTrainWorker pid=1114, ip=172.31.90.137) The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: sentence, idx. If sentence, idx are not expected by `DistilBertForSequenceClassification.forward`, you can safely ignore this message.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1114, ip=172.31.90.137) {'loss': 0.5437, 'learning_rate': 1e-05, 'epoch': 1.0}\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1114, ip=172.31.90.137) ***** Running Evaluation *****\n",
"(RayTrainWorker pid=1114, ip=172.31.90.137) Num examples = 1043\n",
"(RayTrainWorker pid=1114, ip=172.31.90.137) Batch size = 16\n",
"(RayTrainWorker pid=1114, ip=172.31.90.137) The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: sentence, idx. If sentence, idx are not expected by `DistilBertForSequenceClassification.forward`, you can safely ignore this message.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1114, ip=172.31.90.137) {'eval_loss': 0.5794203281402588, 'eval_matthews_correlation': 0.3293676852500821, 'eval_runtime': 0.9804, 'eval_samples_per_second': 277.441, 'eval_steps_per_second': 5.1, 'epoch': 1.0}\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1114, ip=172.31.90.137) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-535\n",
"(RayTrainWorker pid=1114, ip=172.31.90.137) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/config.json\n",
"(RayTrainWorker pid=1114, ip=172.31.90.137) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/pytorch_model.bin\n",
"(RayTrainWorker pid=1114, ip=172.31.90.137) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/tokenizer_config.json\n",
"(RayTrainWorker pid=1114, ip=172.31.90.137) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/special_tokens_map.json\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Result for HuggingFaceTrainer_c1ff5_00000:\n",
" _time_this_iter_s: 90.87123560905457\n",
" _timestamp: 1661447540\n",
" _training_iteration: 1\n",
" date: 2022-08-25_10-12-20\n",
" done: false\n",
" epoch: 1.0\n",
" eval_loss: 0.5794203281402588\n",
" eval_matthews_correlation: 0.3293676852500821\n",
" eval_runtime: 0.9804\n",
" eval_samples_per_second: 277.441\n",
" eval_steps_per_second: 5.1\n",
" experiment_id: 592e02b25b254bd1a3743904313dc85b\n",
" hostname: ip-172-31-90-137\n",
" iterations_since_restore: 1\n",
" learning_rate: 1.0e-05\n",
" loss: 0.5437\n",
" node_ip: 172.31.90.137\n",
" pid: 947\n",
" should_checkpoint: true\n",
" step: 535\n",
" time_since_restore: 103.24057936668396\n",
" time_this_iter_s: 103.24057936668396\n",
" time_total_s: 103.24057936668396\n",
" timestamp: 1661447540\n",
" timesteps_since_restore: 0\n",
" training_iteration: 1\n",
" trial_id: c1ff5_00000\n",
" warmup_time: 0.003858327865600586\n",
" \n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1114, ip=172.31.90.137) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-1070\n",
"(RayTrainWorker pid=1114, ip=172.31.90.137) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/config.json\n",
"(RayTrainWorker pid=1114, ip=172.31.90.137) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/pytorch_model.bin\n",
"(RayTrainWorker pid=1114, ip=172.31.90.137) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/tokenizer_config.json\n",
"(RayTrainWorker pid=1114, ip=172.31.90.137) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/special_tokens_map.json\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1114, ip=172.31.90.137) {'loss': 0.3886, 'learning_rate': 0.0, 'epoch': 2.0}\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1114, ip=172.31.90.137) ***** Running Evaluation *****\n",
"(RayTrainWorker pid=1114, ip=172.31.90.137) Num examples = 1043\n",
"(RayTrainWorker pid=1114, ip=172.31.90.137) Batch size = 16\n",
"(RayTrainWorker pid=1114, ip=172.31.90.137) The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: sentence, idx. If sentence, idx are not expected by `DistilBertForSequenceClassification.forward`, you can safely ignore this message.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1114, ip=172.31.90.137) {'eval_loss': 0.6215357184410095, 'eval_matthews_correlation': 0.42957017514952434, 'eval_runtime': 0.9956, 'eval_samples_per_second': 273.204, 'eval_steps_per_second': 5.022, 'epoch': 2.0}\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1114, ip=172.31.90.137) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-1070\n",
"(RayTrainWorker pid=1114, ip=172.31.90.137) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/config.json\n",
"(RayTrainWorker pid=1114, ip=172.31.90.137) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/pytorch_model.bin\n",
"(RayTrainWorker pid=1114, ip=172.31.90.137) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/tokenizer_config.json\n",
"(RayTrainWorker pid=1114, ip=172.31.90.137) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/special_tokens_map.json\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1114, ip=172.31.90.137) {'train_runtime': 174.4696, 'train_samples_per_second': 98.023, 'train_steps_per_second': 6.133, 'train_loss': 0.4661755713346963, 'epoch': 2.0}\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1114, ip=172.31.90.137) \n",
"(RayTrainWorker pid=1114, ip=172.31.90.137) \n",
"(RayTrainWorker pid=1114, ip=172.31.90.137) Training completed. Do not forget to share your model on huggingface.co/models =)\n",
"(RayTrainWorker pid=1114, ip=172.31.90.137) \n",
"(RayTrainWorker pid=1114, ip=172.31.90.137) \n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Result for HuggingFaceTrainer_c1ff5_00000:\n",
" _time_this_iter_s: 96.96447467803955\n",
" _timestamp: 1661447637\n",
" _training_iteration: 2\n",
" date: 2022-08-25_10-13-57\n",
" done: false\n",
" epoch: 2.0\n",
" eval_loss: 0.6215357184410095\n",
" eval_matthews_correlation: 0.42957017514952434\n",
" eval_runtime: 0.9956\n",
" eval_samples_per_second: 273.204\n",
" eval_steps_per_second: 5.022\n",
" experiment_id: 592e02b25b254bd1a3743904313dc85b\n",
" hostname: ip-172-31-90-137\n",
" iterations_since_restore: 2\n",
" learning_rate: 0.0\n",
" loss: 0.3886\n",
" node_ip: 172.31.90.137\n",
" pid: 947\n",
" should_checkpoint: true\n",
" step: 1070\n",
" time_since_restore: 200.21722102165222\n",
" time_this_iter_s: 96.97664165496826\n",
" time_total_s: 200.21722102165222\n",
" timestamp: 1661447637\n",
" timesteps_since_restore: 0\n",
" train_loss: 0.4661755713346963\n",
" train_runtime: 174.4696\n",
" train_samples_per_second: 98.023\n",
" train_steps_per_second: 6.133\n",
" training_iteration: 2\n",
" trial_id: c1ff5_00000\n",
" warmup_time: 0.003858327865600586\n",
" \n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Result for HuggingFaceTrainer_c1ff5_00000:\n",
" _time_this_iter_s: 96.96447467803955\n",
" _timestamp: 1661447637\n",
" _training_iteration: 2\n",
" date: 2022-08-25_10-13-57\n",
" done: true\n",
" epoch: 2.0\n",
" eval_loss: 0.6215357184410095\n",
" eval_matthews_correlation: 0.42957017514952434\n",
" eval_runtime: 0.9956\n",
" eval_samples_per_second: 273.204\n",
" eval_steps_per_second: 5.022\n",
" experiment_id: 592e02b25b254bd1a3743904313dc85b\n",
" experiment_tag: '0'\n",
" hostname: ip-172-31-90-137\n",
" iterations_since_restore: 2\n",
" learning_rate: 0.0\n",
" loss: 0.3886\n",
" node_ip: 172.31.90.137\n",
" pid: 947\n",
" should_checkpoint: true\n",
" step: 1070\n",
" time_since_restore: 200.21722102165222\n",
" time_this_iter_s: 96.97664165496826\n",
" time_total_s: 200.21722102165222\n",
" timestamp: 1661447637\n",
" timesteps_since_restore: 0\n",
" train_loss: 0.4661755713346963\n",
" train_runtime: 174.4696\n",
" train_samples_per_second: 98.023\n",
" train_steps_per_second: 6.133\n",
" training_iteration: 2\n",
" trial_id: c1ff5_00000\n",
" warmup_time: 0.003858327865600586\n",
" \n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"2022-08-25 10:14:09,300\tINFO tune.py:758 -- Total run time: 246.67 seconds (246.44 seconds for the tuning loop).\n"
]
}
],
"source": [
"result = trainer.fit()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4cnWqUWmhYba"
},
"source": [
"You can use the returned `Result` object to access metrics and the Ray AIR `Checkpoint` associated with the last iteration."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "AMN5qjUwhYba",
"outputId": "7b754c36-c58b-4ff4-d7a8-63ec9764bd0c"
},
"outputs": [
{
"data": {
"text/plain": [
"Result(metrics={'loss': 0.3886, 'learning_rate': 0.0, 'epoch': 2.0, 'step': 1070, 'eval_loss': 0.6215357184410095, 'eval_matthews_correlation': 0.42957017514952434, 'eval_runtime': 0.9956, 'eval_samples_per_second': 273.204, 'eval_steps_per_second': 5.022, 'train_runtime': 174.4696, 'train_samples_per_second': 98.023, 'train_steps_per_second': 6.133, 'train_loss': 0.4661755713346963, '_timestamp': 1661447637, '_time_this_iter_s': 96.96447467803955, '_training_iteration': 2, 'should_checkpoint': True, 'done': True, 'trial_id': 'c1ff5_00000', 'experiment_tag': '0'}, error=None, log_dir=PosixPath('/home/ray/ray_results/HuggingFaceTrainer_2022-08-25_10-10-02/HuggingFaceTrainer_c1ff5_00000_0_2022-08-25_10-10-04'))"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result"
]
},
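{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, you can pull out individual metrics or the checkpoint itself (a quick illustration of the `Result` API):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative only: individual metrics and the checkpoint from the last iteration.\n",
"print(result.metrics[\"eval_matthews_correlation\"])\n",
"result.checkpoint"
]
},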
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Tune hyperparameters with Ray AIR "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we would like to tune any hyperparameters of the model, we can do so by simply passing our `HuggingFaceTrainer` into a `Tuner` and defining the search space.\n",
"\n",
"We can also take advantage of the advanced search algorithms and schedulers provided by Ray Tune. In this example, we will use an `ASHAScheduler` to aggresively terminate underperforming trials."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"from ray import tune\n",
"from ray.tune import Tuner\n",
"from ray.tune.schedulers.async_hyperband import ASHAScheduler\n",
"\n",
"tune_epochs = 4\n",
"tuner = Tuner(\n",
" trainer,\n",
" param_space={\n",
" \"trainer_init_config\": {\n",
" \"learning_rate\": tune.grid_search([2e-5, 2e-4, 2e-3, 2e-2]),\n",
" \"epochs\": tune_epochs,\n",
" }\n",
" },\n",
" tune_config=tune.TuneConfig(\n",
" metric=\"eval_loss\",\n",
" mode=\"min\",\n",
" num_samples=1,\n",
" scheduler=ASHAScheduler(\n",
" max_t=tune_epochs,\n",
" )\n",
" ),\n",
" run_config=RunConfig(\n",
" checkpoint_config=CheckpointConfig(num_to_keep=1, checkpoint_score_attribute=\"eval_loss\", checkpoint_score_order=\"min\")\n",
" ),\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"== Status == Current time: 2022-08-25 10:20:13 (running for 00:06:01.75) Memory usage on this node: 4.4/62.0 GiB Using AsyncHyperBand: num_stopped=4\n",
"Bracket: Iter 4.000: -0.8064090609550476 | Iter 1.000: -0.6378736793994904 Resources requested: 0/208 CPUs, 0/16 GPUs, 0.0/574.34 GiB heap, 0.0/241.51 GiB objects (0.0/4.0 accelerator_type:T4) Current best trial: 5654d_00001 with eval_loss=0.6492420434951782 and parameters={'trainer_init_config': {'learning_rate': 0.0002, 'epochs': 4}} Result logdir: /home/ray/ray_results/HuggingFaceTrainer_2022-08-25_10-14-11 Number of trials: 4/4 (4 TERMINATED)
\n",
"\n",
"
Trial name
status
loc
trainer_init_conf...
iter
total time (s)
loss
learning_rate
epoch
\n",
"\n",
"\n",
"
HuggingFaceTrainer_5654d_00000
TERMINATED
172.31.90.137:1729
2e-05
4
347.171
0.1958
0
4
\n",
"
HuggingFaceTrainer_5654d_00001
TERMINATED
172.31.76.237:1805
0.0002
1
95.2492
0.6225
0.00015
1
\n",
"
HuggingFaceTrainer_5654d_00002
TERMINATED
172.31.85.32:1322
0.002
1
93.7613
0.6463
0.0015
1
\n",
"
HuggingFaceTrainer_5654d_00003
TERMINATED
172.31.85.193:1060
0.02
1
99.3677
0.926
0.015
1
\n",
"\n",
"
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1789, ip=172.31.90.137) 2022-08-25 10:14:23,379\tINFO config.py:71 -- Setting up process group for: env:// [rank=0, world_size=4]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1792, ip=172.31.90.137) Is CUDA available: True\n",
"(RayTrainWorker pid=1790, ip=172.31.90.137) Is CUDA available: True\n",
"(RayTrainWorker pid=1791, ip=172.31.90.137) Is CUDA available: True\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) Is CUDA available: True\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1974, ip=172.31.76.237) 2022-08-25 10:14:29,354\tINFO config.py:71 -- Setting up process group for: env:// [rank=0, world_size=4]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1977, ip=172.31.76.237) Is CUDA available: True\n",
"(RayTrainWorker pid=1976, ip=172.31.76.237) Is CUDA available: True\n",
"(RayTrainWorker pid=1975, ip=172.31.76.237) Is CUDA available: True\n",
"(RayTrainWorker pid=1974, ip=172.31.76.237) Is CUDA available: True\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1483, ip=172.31.85.32) 2022-08-25 10:14:35,313\tINFO config.py:71 -- Setting up process group for: env:// [rank=0, world_size=4]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1790, ip=172.31.90.137) Starting training\n",
"(RayTrainWorker pid=1792, ip=172.31.90.137) Starting training\n",
"(RayTrainWorker pid=1791, ip=172.31.90.137) Starting training\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) Starting training\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1789, ip=172.31.90.137) ***** Running training *****\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) Num examples = 8551\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) Num Epochs = 4\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) Instantaneous batch size per device = 16\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) Total train batch size (w. parallel, distributed & accumulation) = 64\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) Gradient Accumulation steps = 1\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) Total optimization steps = 2140\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1483, ip=172.31.85.32) Is CUDA available: True\n",
"(RayTrainWorker pid=1485, ip=172.31.85.32) Is CUDA available: True\n",
"(RayTrainWorker pid=1486, ip=172.31.85.32) Is CUDA available: True\n",
"(RayTrainWorker pid=1484, ip=172.31.85.32) Is CUDA available: True\n",
"(RayTrainWorker pid=1977, ip=172.31.76.237) Starting training\n",
"(RayTrainWorker pid=1976, ip=172.31.76.237) Starting training\n",
"(RayTrainWorker pid=1975, ip=172.31.76.237) Starting training\n",
"(RayTrainWorker pid=1974, ip=172.31.76.237) Starting training\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1974, ip=172.31.76.237) ***** Running training *****\n",
"(RayTrainWorker pid=1974, ip=172.31.76.237) Num examples = 8551\n",
"(RayTrainWorker pid=1974, ip=172.31.76.237) Num Epochs = 4\n",
"(RayTrainWorker pid=1974, ip=172.31.76.237) Instantaneous batch size per device = 16\n",
"(RayTrainWorker pid=1974, ip=172.31.76.237) Total train batch size (w. parallel, distributed & accumulation) = 64\n",
"(RayTrainWorker pid=1974, ip=172.31.76.237) Gradient Accumulation steps = 1\n",
"(RayTrainWorker pid=1974, ip=172.31.76.237) Total optimization steps = 2140\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1483, ip=172.31.85.32) Starting training\n",
"(RayTrainWorker pid=1485, ip=172.31.85.32) Starting training\n",
"(RayTrainWorker pid=1486, ip=172.31.85.32) Starting training\n",
"(RayTrainWorker pid=1484, ip=172.31.85.32) Starting training\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1483, ip=172.31.85.32) ***** Running training *****\n",
"(RayTrainWorker pid=1483, ip=172.31.85.32) Num examples = 8551\n",
"(RayTrainWorker pid=1483, ip=172.31.85.32) Num Epochs = 4\n",
"(RayTrainWorker pid=1483, ip=172.31.85.32) Instantaneous batch size per device = 16\n",
"(RayTrainWorker pid=1483, ip=172.31.85.32) Total train batch size (w. parallel, distributed & accumulation) = 64\n",
"(RayTrainWorker pid=1483, ip=172.31.85.32) Gradient Accumulation steps = 1\n",
"(RayTrainWorker pid=1483, ip=172.31.85.32) Total optimization steps = 2140\n",
"(RayTrainWorker pid=1223, ip=172.31.85.193) 2022-08-25 10:14:48,193\tINFO config.py:71 -- Setting up process group for: env:// [rank=0, world_size=4]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1223, ip=172.31.85.193) Is CUDA available: True\n",
"(RayTrainWorker pid=1224, ip=172.31.85.193) Is CUDA available: True\n",
"(RayTrainWorker pid=1226, ip=172.31.85.193) Is CUDA available: True\n",
"(RayTrainWorker pid=1225, ip=172.31.85.193) Is CUDA available: True\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Downloading builder script: 5.76kB [00:00, 6.59MB/s] \n",
"Downloading builder script: 5.76kB [00:00, 6.52MB/s] \n",
"Downloading builder script: 5.76kB [00:00, 6.07MB/s] \n",
"Downloading builder script: 5.76kB [00:00, 6.81MB/s] \n",
"Downloading tokenizer_config.json: 100%|██████████| 28.0/28.0 [00:00<00:00, 46.0kB/s]\n",
"Downloading config.json: 100%|██████████| 483/483 [00:00<00:00, 766kB/s]\n",
"Downloading vocab.txt: 0%| | 0.00/226k [00:00, ?B/s]\n",
"Downloading vocab.txt: 32%|███▏ | 72.0k/226k [00:00<00:00, 624kB/s]\n",
"Downloading vocab.txt: 100%|██████████| 226k/226k [00:00<00:00, 966kB/s] \n",
"Downloading tokenizer.json: 0%| | 0.00/455k [00:00, ?B/s]\n",
"Downloading tokenizer.json: 6%|▋ | 29.0k/455k [00:00<00:01, 233kB/s]\n",
"Downloading tokenizer.json: 30%|██▉ | 136k/455k [00:00<00:00, 600kB/s] \n",
"Downloading tokenizer.json: 100%|██████████| 455k/455k [00:00<00:00, 1.44MB/s]\n",
"Downloading pytorch_model.bin: 0%| | 0.00/256M [00:00, ?B/s]\n",
"Downloading pytorch_model.bin: 1%| | 2.32M/256M [00:00<00:10, 24.4MB/s]\n",
"Downloading pytorch_model.bin: 4%|▍ | 11.0M/256M [00:00<00:04, 63.4MB/s]\n",
"Downloading pytorch_model.bin: 8%|▊ | 20.0M/256M [00:00<00:03, 77.7MB/s]\n",
"Downloading pytorch_model.bin: 11%|█▏ | 29.1M/256M [00:00<00:02, 84.8MB/s]\n",
"Downloading pytorch_model.bin: 15%|█▍ | 38.2M/256M [00:00<00:02, 88.5MB/s]\n",
"Downloading pytorch_model.bin: 18%|█▊ | 47.3M/256M [00:00<00:02, 90.7MB/s]\n",
"Downloading pytorch_model.bin: 22%|██▏ | 56.4M/256M [00:00<00:02, 92.4MB/s]\n",
"Downloading pytorch_model.bin: 26%|██▌ | 65.5M/256M [00:00<00:02, 93.4MB/s]\n",
"Downloading pytorch_model.bin: 29%|██▉ | 74.7M/256M [00:00<00:02, 94.2MB/s]\n",
"Downloading pytorch_model.bin: 33%|███▎ | 83.8M/256M [00:01<00:01, 94.8MB/s]\n",
"Downloading pytorch_model.bin: 36%|███▋ | 93.0M/256M [00:01<00:01, 95.1MB/s]\n",
"Downloading pytorch_model.bin: 40%|███▉ | 102M/256M [00:01<00:01, 95.4MB/s] \n",
"Downloading pytorch_model.bin: 44%|████▎ | 111M/256M [00:01<00:01, 95.6MB/s]\n",
"Downloading pytorch_model.bin: 47%|████▋ | 120M/256M [00:01<00:01, 95.7MB/s]\n",
"Downloading pytorch_model.bin: 51%|█████ | 130M/256M [00:01<00:01, 95.8MB/s]\n",
"Downloading pytorch_model.bin: 54%|█████▍ | 139M/256M [00:01<00:01, 95.8MB/s]\n",
"Downloading pytorch_model.bin: 58%|█████▊ | 148M/256M [00:01<00:01, 95.9MB/s]\n",
"Downloading pytorch_model.bin: 61%|██████▏ | 157M/256M [00:01<00:01, 96.1MB/s]\n",
"Downloading pytorch_model.bin: 65%|██████▌ | 166M/256M [00:01<00:00, 96.1MB/s]\n",
"Downloading pytorch_model.bin: 69%|██████▊ | 175M/256M [00:02<00:00, 96.1MB/s]\n",
"Downloading pytorch_model.bin: 72%|███████▏ | 185M/256M [00:02<00:00, 96.2MB/s]\n",
"Downloading pytorch_model.bin: 76%|███████▌ | 194M/256M [00:02<00:00, 96.2MB/s]\n",
"Downloading pytorch_model.bin: 79%|███████▉ | 203M/256M [00:02<00:00, 96.1MB/s]\n",
"Downloading pytorch_model.bin: 83%|████████▎ | 212M/256M [00:02<00:00, 96.1MB/s]\n",
"Downloading pytorch_model.bin: 87%|████████▋ | 221M/256M [00:02<00:00, 96.2MB/s]\n",
"Downloading pytorch_model.bin: 90%|█████████ | 231M/256M [00:02<00:00, 96.2MB/s]\n",
"Downloading pytorch_model.bin: 94%|█████████▍| 240M/256M [00:02<00:00, 96.1MB/s]\n",
"Downloading pytorch_model.bin: 97%|█████████▋| 249M/256M [00:02<00:00, 96.0MB/s]\n",
"Downloading pytorch_model.bin: 100%|██████████| 256M/256M [00:02<00:00, 93.2MB/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1223, ip=172.31.85.193) Starting training\n",
"(RayTrainWorker pid=1226, ip=172.31.85.193) Starting training\n",
"(RayTrainWorker pid=1225, ip=172.31.85.193) Starting training\n",
"(RayTrainWorker pid=1224, ip=172.31.85.193) Starting training\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1223, ip=172.31.85.193) ***** Running training *****\n",
"(RayTrainWorker pid=1223, ip=172.31.85.193) Num examples = 8551\n",
"(RayTrainWorker pid=1223, ip=172.31.85.193) Num Epochs = 4\n",
"(RayTrainWorker pid=1223, ip=172.31.85.193) Instantaneous batch size per device = 16\n",
"(RayTrainWorker pid=1223, ip=172.31.85.193) Total train batch size (w. parallel, distributed & accumulation) = 64\n",
"(RayTrainWorker pid=1223, ip=172.31.85.193) Gradient Accumulation steps = 1\n",
"(RayTrainWorker pid=1223, ip=172.31.85.193) Total optimization steps = 2140\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) ***** Running Evaluation *****\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) Num examples = 1043\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) Batch size = 16\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1789, ip=172.31.90.137) {'loss': 0.5458, 'learning_rate': 1.5000000000000002e-05, 'epoch': 1.0}\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1789, ip=172.31.90.137) The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: sentence, idx. If sentence, idx are not expected by `DistilBertForSequenceClassification.forward`, you can safely ignore this message.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1789, ip=172.31.90.137) {'eval_loss': 0.6037685871124268, 'eval_matthews_correlation': 0.3654892178274207, 'eval_runtime': 0.9847, 'eval_samples_per_second': 276.225, 'eval_steps_per_second': 5.078, 'epoch': 1.0}\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1789, ip=172.31.90.137) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-535\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/config.json\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/pytorch_model.bin\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/tokenizer_config.json\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/special_tokens_map.json\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Result for HuggingFaceTrainer_5654d_00000:\n",
" _time_this_iter_s: 85.01727724075317\n",
" _timestamp: 1661447753\n",
" _training_iteration: 1\n",
" date: 2022-08-25_10-15-53\n",
" done: false\n",
" epoch: 1.0\n",
" eval_loss: 0.6037685871124268\n",
" eval_matthews_correlation: 0.3654892178274207\n",
" eval_runtime: 0.9847\n",
" eval_samples_per_second: 276.225\n",
" eval_steps_per_second: 5.078\n",
" experiment_id: cee1b96afcf344e89482e3c5e298a412\n",
" hostname: ip-172-31-90-137\n",
" iterations_since_restore: 1\n",
" learning_rate: 1.5000000000000002e-05\n",
" loss: 0.5458\n",
" node_ip: 172.31.90.137\n",
" pid: 1729\n",
" should_checkpoint: true\n",
" step: 535\n",
" time_since_restore: 94.93232989311218\n",
" time_this_iter_s: 94.93232989311218\n",
" time_total_s: 94.93232989311218\n",
" timestamp: 1661447753\n",
" timesteps_since_restore: 0\n",
" training_iteration: 1\n",
" trial_id: 5654d_00000\n",
" warmup_time: 0.0037021636962890625\n",
" \n",
"(RayTrainWorker pid=1974, ip=172.31.76.237) {'loss': 0.6225, 'learning_rate': 0.00015000000000000001, 'epoch': 1.0}\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1974, ip=172.31.76.237) ***** Running Evaluation *****\n",
"(RayTrainWorker pid=1974, ip=172.31.76.237) Num examples = 1043\n",
"(RayTrainWorker pid=1974, ip=172.31.76.237) Batch size = 16\n",
"(RayTrainWorker pid=1974, ip=172.31.76.237) The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence. If idx, sentence are not expected by `DistilBertForSequenceClassification.forward`, you can safely ignore this message.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1974, ip=172.31.76.237) {'eval_loss': 0.6492420434951782, 'eval_matthews_correlation': 0.0, 'eval_runtime': 1.0157, 'eval_samples_per_second': 267.792, 'eval_steps_per_second': 4.923, 'epoch': 1.0}\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1974, ip=172.31.76.237) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-535\n",
"(RayTrainWorker pid=1974, ip=172.31.76.237) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/config.json\n",
"(RayTrainWorker pid=1974, ip=172.31.76.237) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/pytorch_model.bin\n",
"(RayTrainWorker pid=1974, ip=172.31.76.237) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/tokenizer_config.json\n",
"(RayTrainWorker pid=1974, ip=172.31.76.237) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/special_tokens_map.json\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Result for HuggingFaceTrainer_5654d_00001:\n",
" _time_this_iter_s: 84.79700112342834\n",
" _timestamp: 1661447759\n",
" _training_iteration: 1\n",
" date: 2022-08-25_10-16-00\n",
" done: true\n",
" epoch: 1.0\n",
" eval_loss: 0.6492420434951782\n",
" eval_matthews_correlation: 0.0\n",
" eval_runtime: 1.0157\n",
" eval_samples_per_second: 267.792\n",
" eval_steps_per_second: 4.923\n",
" experiment_id: 88145f9344584715a4bd7d018f751b12\n",
" hostname: ip-172-31-76-237\n",
" iterations_since_restore: 1\n",
" learning_rate: 0.00015000000000000001\n",
" loss: 0.6225\n",
" node_ip: 172.31.76.237\n",
" pid: 1805\n",
" should_checkpoint: true\n",
" step: 535\n",
" time_since_restore: 95.24916434288025\n",
" time_this_iter_s: 95.24916434288025\n",
" time_total_s: 95.24916434288025\n",
" timestamp: 1661447760\n",
" timesteps_since_restore: 0\n",
" training_iteration: 1\n",
" trial_id: 5654d_00001\n",
" warmup_time: 0.003660917282104492\n",
" \n",
"(RayTrainWorker pid=1483, ip=172.31.85.32) {'loss': 0.6463, 'learning_rate': 0.0015, 'epoch': 1.0}\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1483, ip=172.31.85.32) ***** Running Evaluation *****\n",
"(RayTrainWorker pid=1483, ip=172.31.85.32) Num examples = 1043\n",
"(RayTrainWorker pid=1483, ip=172.31.85.32) Batch size = 16\n",
"(RayTrainWorker pid=1483, ip=172.31.85.32) The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: sentence, idx. If sentence, idx are not expected by `DistilBertForSequenceClassification.forward`, you can safely ignore this message.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1483, ip=172.31.85.32) {'eval_loss': 0.6586529612541199, 'eval_matthews_correlation': 0.0, 'eval_runtime': 0.9576, 'eval_samples_per_second': 284.05, 'eval_steps_per_second': 5.222, 'epoch': 1.0}\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1483, ip=172.31.85.32) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-535\n",
"(RayTrainWorker pid=1483, ip=172.31.85.32) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/config.json\n",
"(RayTrainWorker pid=1483, ip=172.31.85.32) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/pytorch_model.bin\n",
"(RayTrainWorker pid=1483, ip=172.31.85.32) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/tokenizer_config.json\n",
"(RayTrainWorker pid=1483, ip=172.31.85.32) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/special_tokens_map.json\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Result for HuggingFaceTrainer_5654d_00002:\n",
" _time_this_iter_s: 84.01720070838928\n",
" _timestamp: 1661447764\n",
" _training_iteration: 1\n",
" date: 2022-08-25_10-16-04\n",
" done: true\n",
" epoch: 1.0\n",
" eval_loss: 0.6586529612541199\n",
" eval_matthews_correlation: 0.0\n",
" eval_runtime: 0.9576\n",
" eval_samples_per_second: 284.05\n",
" eval_steps_per_second: 5.222\n",
" experiment_id: 5f8ab183779d40379d59ea615f9d5411\n",
" hostname: ip-172-31-85-32\n",
" iterations_since_restore: 1\n",
" learning_rate: 0.0015\n",
" loss: 0.6463\n",
" node_ip: 172.31.85.32\n",
" pid: 1322\n",
" should_checkpoint: true\n",
" step: 535\n",
" time_since_restore: 93.76131749153137\n",
" time_this_iter_s: 93.76131749153137\n",
" time_total_s: 93.76131749153137\n",
" timestamp: 1661447764\n",
" timesteps_since_restore: 0\n",
" training_iteration: 1\n",
" trial_id: 5654d_00002\n",
" warmup_time: 0.004533290863037109\n",
" \n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1223, ip=172.31.85.193) ***** Running Evaluation *****\n",
"(RayTrainWorker pid=1223, ip=172.31.85.193) Num examples = 1043\n",
"(RayTrainWorker pid=1223, ip=172.31.85.193) Batch size = 16\n",
"(RayTrainWorker pid=1223, ip=172.31.85.193) The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence. If idx, sentence are not expected by `DistilBertForSequenceClassification.forward`, you can safely ignore this message.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1223, ip=172.31.85.193) {'loss': 0.926, 'learning_rate': 0.015, 'epoch': 1.0}\n",
"(RayTrainWorker pid=1223, ip=172.31.85.193) {'eval_loss': 0.6529427766799927, 'eval_matthews_correlation': 0.0, 'eval_runtime': 0.9428, 'eval_samples_per_second': 288.51, 'eval_steps_per_second': 5.303, 'epoch': 1.0}\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1223, ip=172.31.85.193) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-535\n",
"(RayTrainWorker pid=1223, ip=172.31.85.193) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/config.json\n",
"(RayTrainWorker pid=1223, ip=172.31.85.193) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/pytorch_model.bin\n",
"(RayTrainWorker pid=1223, ip=172.31.85.193) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/tokenizer_config.json\n",
"(RayTrainWorker pid=1223, ip=172.31.85.193) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/special_tokens_map.json\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Result for HuggingFaceTrainer_5654d_00003:\n",
" _time_this_iter_s: 89.4301290512085\n",
" _timestamp: 1661447782\n",
" _training_iteration: 1\n",
" date: 2022-08-25_10-16-22\n",
" done: true\n",
" epoch: 1.0\n",
" eval_loss: 0.6529427766799927\n",
" eval_matthews_correlation: 0.0\n",
" eval_runtime: 0.9428\n",
" eval_samples_per_second: 288.51\n",
" eval_steps_per_second: 5.303\n",
" experiment_id: 8495977eeefd405fa4d9c1ea8fa735e1\n",
" hostname: ip-172-31-85-193\n",
" iterations_since_restore: 1\n",
" learning_rate: 0.015\n",
" loss: 0.926\n",
" node_ip: 172.31.85.193\n",
" pid: 1060\n",
" should_checkpoint: true\n",
" step: 535\n",
" time_since_restore: 99.36774587631226\n",
" time_this_iter_s: 99.36774587631226\n",
" time_total_s: 99.36774587631226\n",
" timestamp: 1661447782\n",
" timesteps_since_restore: 0\n",
" training_iteration: 1\n",
" trial_id: 5654d_00003\n",
" warmup_time: 0.004132509231567383\n",
" \n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1789, ip=172.31.90.137) ***** Running Evaluation *****\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) Num examples = 1043\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) Batch size = 16\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: sentence, idx. If sentence, idx are not expected by `DistilBertForSequenceClassification.forward`, you can safely ignore this message.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1789, ip=172.31.90.137) {'loss': 0.3841, 'learning_rate': 1e-05, 'epoch': 2.0}\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) {'eval_loss': 0.5994958281517029, 'eval_matthews_correlation': 0.4573244914254411, 'eval_runtime': 0.9442, 'eval_samples_per_second': 288.066, 'eval_steps_per_second': 5.295, 'epoch': 2.0}\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1789, ip=172.31.90.137) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-1070\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/config.json\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/pytorch_model.bin\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/tokenizer_config.json\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/special_tokens_map.json\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Result for HuggingFaceTrainer_5654d_00000:\n",
" _time_this_iter_s: 76.82565689086914\n",
" _timestamp: 1661447830\n",
" _training_iteration: 2\n",
" date: 2022-08-25_10-17-10\n",
" done: false\n",
" epoch: 2.0\n",
" eval_loss: 0.5994958281517029\n",
" eval_matthews_correlation: 0.4573244914254411\n",
" eval_runtime: 0.9442\n",
" eval_samples_per_second: 288.066\n",
" eval_steps_per_second: 5.295\n",
" experiment_id: cee1b96afcf344e89482e3c5e298a412\n",
" hostname: ip-172-31-90-137\n",
" iterations_since_restore: 2\n",
" learning_rate: 1.0e-05\n",
" loss: 0.3841\n",
" node_ip: 172.31.90.137\n",
" pid: 1729\n",
" should_checkpoint: true\n",
" step: 1070\n",
" time_since_restore: 171.76071190834045\n",
" time_this_iter_s: 76.82838201522827\n",
" time_total_s: 171.76071190834045\n",
" timestamp: 1661447830\n",
" timesteps_since_restore: 0\n",
" training_iteration: 2\n",
" trial_id: 5654d_00000\n",
" warmup_time: 0.0037021636962890625\n",
" \n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1789, ip=172.31.90.137) ***** Running Evaluation *****\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) Num examples = 1043\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) Batch size = 16\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: sentence, idx. If sentence, idx are not expected by `DistilBertForSequenceClassification.forward`, you can safely ignore this message.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1789, ip=172.31.90.137) {'loss': 0.2687, 'learning_rate': 5e-06, 'epoch': 3.0}\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) {'eval_loss': 0.6935313940048218, 'eval_matthews_correlation': 0.5300538425561, 'eval_runtime': 1.0176, 'eval_samples_per_second': 267.305, 'eval_steps_per_second': 4.914, 'epoch': 3.0}\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1789, ip=172.31.90.137) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-1605\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-1605/config.json\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-1605/pytorch_model.bin\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1605/tokenizer_config.json\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1605/special_tokens_map.json\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Result for HuggingFaceTrainer_5654d_00000:\n",
" _time_this_iter_s: 76.47252488136292\n",
" _timestamp: 1661447906\n",
" _training_iteration: 3\n",
" date: 2022-08-25_10-18-26\n",
" done: false\n",
" epoch: 3.0\n",
" eval_loss: 0.6935313940048218\n",
" eval_matthews_correlation: 0.5300538425561\n",
" eval_runtime: 1.0176\n",
" eval_samples_per_second: 267.305\n",
" eval_steps_per_second: 4.914\n",
" experiment_id: cee1b96afcf344e89482e3c5e298a412\n",
" hostname: ip-172-31-90-137\n",
" iterations_since_restore: 3\n",
" learning_rate: 5.0e-06\n",
" loss: 0.2687\n",
" node_ip: 172.31.90.137\n",
" pid: 1729\n",
" should_checkpoint: true\n",
" step: 1605\n",
" time_since_restore: 248.23273348808289\n",
" time_this_iter_s: 76.47202157974243\n",
" time_total_s: 248.23273348808289\n",
" timestamp: 1661447906\n",
" timesteps_since_restore: 0\n",
" training_iteration: 3\n",
" trial_id: 5654d_00000\n",
" warmup_time: 0.0037021636962890625\n",
" \n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1789, ip=172.31.90.137) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-2140\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-2140/config.json\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-2140/pytorch_model.bin\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-2140/tokenizer_config.json\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-2140/special_tokens_map.json\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) ***** Running Evaluation *****\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) Num examples = 1043\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) Batch size = 16\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: sentence, idx. If sentence, idx are not expected by `DistilBertForSequenceClassification.forward`, you can safely ignore this message.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1789, ip=172.31.90.137) {'loss': 0.1958, 'learning_rate': 0.0, 'epoch': 4.0}\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1789, ip=172.31.90.137) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-2140\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-2140/config.json\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1789, ip=172.31.90.137) {'eval_loss': 0.8064090609550476, 'eval_matthews_correlation': 0.5322860764824153, 'eval_runtime': 1.0006, 'eval_samples_per_second': 271.827, 'eval_steps_per_second': 4.997, 'epoch': 4.0}\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1789, ip=172.31.90.137) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-2140/pytorch_model.bin\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-2140/tokenizer_config.json\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-2140/special_tokens_map.json\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) \n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) \n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) Training completed. Do not forget to share your model on huggingface.co/models =)\n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) \n",
"(RayTrainWorker pid=1789, ip=172.31.90.137) \n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"(RayTrainWorker pid=1789, ip=172.31.90.137) {'train_runtime': 329.1948, 'train_samples_per_second': 103.902, 'train_steps_per_second': 6.501, 'train_loss': 0.34860724689804506, 'epoch': 4.0}\n",
"Result for HuggingFaceTrainer_5654d_00000:\n",
" _time_this_iter_s: 98.92064905166626\n",
" _timestamp: 1661448005\n",
" _training_iteration: 4\n",
" date: 2022-08-25_10-20-05\n",
" done: true\n",
" epoch: 4.0\n",
" eval_loss: 0.8064090609550476\n",
" eval_matthews_correlation: 0.5322860764824153\n",
" eval_runtime: 1.0006\n",
" eval_samples_per_second: 271.827\n",
" eval_steps_per_second: 4.997\n",
" experiment_id: cee1b96afcf344e89482e3c5e298a412\n",
" hostname: ip-172-31-90-137\n",
" iterations_since_restore: 4\n",
" learning_rate: 0.0\n",
" loss: 0.1958\n",
" node_ip: 172.31.90.137\n",
" pid: 1729\n",
" should_checkpoint: true\n",
" step: 2140\n",
" time_since_restore: 347.1705844402313\n",
" time_this_iter_s: 98.93785095214844\n",
" time_total_s: 347.1705844402313\n",
" timestamp: 1661448005\n",
" timesteps_since_restore: 0\n",
" train_loss: 0.34860724689804506\n",
" train_runtime: 329.1948\n",
" train_samples_per_second: 103.902\n",
" train_steps_per_second: 6.501\n",
" training_iteration: 4\n",
" trial_id: 5654d_00000\n",
" warmup_time: 0.0037021636962890625\n",
" \n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"2022-08-25 10:20:13,409\tINFO tune.py:758 -- Total run time: 361.90 seconds (361.74 seconds for the tuning loop).\n"
]
}
],
"source": [
"tune_results = tuner.fit()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can view the results of the tuning run as a dataframe, and obtain the best result."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
loss
\n",
"
learning_rate
\n",
"
epoch
\n",
"
step
\n",
"
eval_loss
\n",
"
eval_matthews_correlation
\n",
"
eval_runtime
\n",
"
eval_samples_per_second
\n",
"
eval_steps_per_second
\n",
"
_timestamp
\n",
"
...
\n",
"
pid
\n",
"
hostname
\n",
"
node_ip
\n",
"
time_since_restore
\n",
"
timesteps_since_restore
\n",
"
iterations_since_restore
\n",
"
warmup_time
\n",
"
config/trainer_init_config/epochs
\n",
"
config/trainer_init_config/learning_rate
\n",
"
logdir
\n",
"
\n",
" \n",
" \n",
"
\n",
"
1
\n",
"
0.6225
\n",
"
0.00015
\n",
"
1.0
\n",
"
535
\n",
"
0.649242
\n",
"
0.000000
\n",
"
1.0157
\n",
"
267.792
\n",
"
4.923
\n",
"
1661447759
\n",
"
...
\n",
"
1805
\n",
"
ip-172-31-76-237
\n",
"
172.31.76.237
\n",
"
95.249164
\n",
"
0
\n",
"
1
\n",
"
0.003661
\n",
"
4
\n",
"
0.00020
\n",
"
/home/ray/ray_results/HuggingFaceTrainer_2022-...
\n",
"
\n",
"
\n",
"
3
\n",
"
0.9260
\n",
"
0.01500
\n",
"
1.0
\n",
"
535
\n",
"
0.652943
\n",
"
0.000000
\n",
"
0.9428
\n",
"
288.510
\n",
"
5.303
\n",
"
1661447782
\n",
"
...
\n",
"
1060
\n",
"
ip-172-31-85-193
\n",
"
172.31.85.193
\n",
"
99.367746
\n",
"
0
\n",
"
1
\n",
"
0.004133
\n",
"
4
\n",
"
0.02000
\n",
"
/home/ray/ray_results/HuggingFaceTrainer_2022-...
\n",
"
\n",
"
\n",
"
2
\n",
"
0.6463
\n",
"
0.00150
\n",
"
1.0
\n",
"
535
\n",
"
0.658653
\n",
"
0.000000
\n",
"
0.9576
\n",
"
284.050
\n",
"
5.222
\n",
"
1661447764
\n",
"
...
\n",
"
1322
\n",
"
ip-172-31-85-32
\n",
"
172.31.85.32
\n",
"
93.761317
\n",
"
0
\n",
"
1
\n",
"
0.004533
\n",
"
4
\n",
"
0.00200
\n",
"
/home/ray/ray_results/HuggingFaceTrainer_2022-...
\n",
"
\n",
"
\n",
"
0
\n",
"
0.1958
\n",
"
0.00000
\n",
"
4.0
\n",
"
2140
\n",
"
0.806409
\n",
"
0.532286
\n",
"
1.0006
\n",
"
271.827
\n",
"
4.997
\n",
"
1661448005
\n",
"
...
\n",
"
1729
\n",
"
ip-172-31-90-137
\n",
"
172.31.90.137
\n",
"
347.170584
\n",
"
0
\n",
"
4
\n",
"
0.003702
\n",
"
4
\n",
"
0.00002
\n",
"
/home/ray/ray_results/HuggingFaceTrainer_2022-...
\n",
"
\n",
" \n",
"
\n",
"
4 rows × 33 columns
\n",
"
"
],
"text/plain": [
" loss learning_rate epoch step eval_loss eval_matthews_correlation \\\n",
"1 0.6225 0.00015 1.0 535 0.649242 0.000000 \n",
"3 0.9260 0.01500 1.0 535 0.652943 0.000000 \n",
"2 0.6463 0.00150 1.0 535 0.658653 0.000000 \n",
"0 0.1958 0.00000 4.0 2140 0.806409 0.532286 \n",
"\n",
" eval_runtime eval_samples_per_second eval_steps_per_second _timestamp \\\n",
"1 1.0157 267.792 4.923 1661447759 \n",
"3 0.9428 288.510 5.303 1661447782 \n",
"2 0.9576 284.050 5.222 1661447764 \n",
"0 1.0006 271.827 4.997 1661448005 \n",
"\n",
" ... pid hostname node_ip time_since_restore \\\n",
"1 ... 1805 ip-172-31-76-237 172.31.76.237 95.249164 \n",
"3 ... 1060 ip-172-31-85-193 172.31.85.193 99.367746 \n",
"2 ... 1322 ip-172-31-85-32 172.31.85.32 93.761317 \n",
"0 ... 1729 ip-172-31-90-137 172.31.90.137 347.170584 \n",
"\n",
" timesteps_since_restore iterations_since_restore warmup_time \\\n",
"1 0 1 0.003661 \n",
"3 0 1 0.004133 \n",
"2 0 1 0.004533 \n",
"0 0 4 0.003702 \n",
"\n",
" config/trainer_init_config/epochs config/trainer_init_config/learning_rate \\\n",
"1 4 0.00020 \n",
"3 4 0.02000 \n",
"2 4 0.00200 \n",
"0 4 0.00002 \n",
"\n",
" logdir \n",
"1 /home/ray/ray_results/HuggingFaceTrainer_2022-... \n",
"3 /home/ray/ray_results/HuggingFaceTrainer_2022-... \n",
"2 /home/ray/ray_results/HuggingFaceTrainer_2022-... \n",
"0 /home/ray/ray_results/HuggingFaceTrainer_2022-... \n",
"\n",
"[4 rows x 33 columns]"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tune_results.get_dataframe().sort_values(\"eval_loss\")"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"best_result = tune_results.get_best_result()"
]
},
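  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With no arguments, `get_best_result` uses the metric and mode defined in the `Tuner`'s `TuneConfig`, if any. You can also select the best trial explicitly - a minimal sketch, assuming we rank trials by evaluation loss:\n",
    "\n",
    "```python\n",
    "# Pick the trial with the lowest evaluation loss explicitly.\n",
    "best_result = tune_results.get_best_result(metric=\"eval_loss\", mode=\"min\")\n",
    "print(best_result.metrics[\"eval_matthews_correlation\"])\n",
    "```"
   ]
  },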
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Predict on test data with Ray AIR "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Tfoyu1q7hYbb"
},
"source": [
"You can now use the checkpoint to run prediction with `HuggingFacePredictor`, which wraps around [🤗 Pipelines](https://huggingface.co/docs/transformers/main_classes/pipelines). In order to distribute prediction, we use `BatchPredictor`. While this is not necessary for the very small example we are using (you could use `HuggingFacePredictor` directly), it will scale well to a large dataset."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 262
},
"id": "UOUcBkX8IrJi",
"outputId": "4dc16812-1400-482d-8c3f-85991ce4b081"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Map_Batches: 100%|██████████| 1/1 [00:00<00:00, 12.41it/s]\n",
"Map_Batches: 100%|██████████| 1/1 [00:00<00:00, 7.46it/s]\n",
"Map Progress (1 actors 1 pending): 100%|██████████| 1/1 [00:18<00:00, 18.46s/it]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'label': 'LABEL_1', 'score': 0.6822417974472046}\n",
"{'label': 'LABEL_1', 'score': 0.6822402477264404}\n",
"{'label': 'LABEL_1', 'score': 0.6822407841682434}\n",
"{'label': 'LABEL_1', 'score': 0.6822386980056763}\n",
"{'label': 'LABEL_1', 'score': 0.6822428107261658}\n",
"{'label': 'LABEL_1', 'score': 0.6822453737258911}\n",
"{'label': 'LABEL_1', 'score': 0.6822437047958374}\n",
"{'label': 'LABEL_1', 'score': 0.6822428703308105}\n",
"{'label': 'LABEL_1', 'score': 0.6822431683540344}\n",
"{'label': 'LABEL_1', 'score': 0.6822426915168762}\n",
"{'label': 'LABEL_1', 'score': 0.6822447776794434}\n",
"{'label': 'LABEL_1', 'score': 0.6822456121444702}\n",
"{'label': 'LABEL_1', 'score': 0.6822471022605896}\n",
"{'label': 'LABEL_1', 'score': 0.6822477579116821}\n",
"{'label': 'LABEL_1', 'score': 0.682244598865509}\n",
"{'label': 'LABEL_1', 'score': 0.6822422742843628}\n",
"{'label': 'LABEL_1', 'score': 0.6822470426559448}\n",
"{'label': 'LABEL_1', 'score': 0.6822417378425598}\n",
"{'label': 'LABEL_1', 'score': 0.6822449564933777}\n",
"{'label': 'LABEL_1', 'score': 0.682239294052124}\n"
]
}
],
"source": [
"from ray.train.huggingface import HuggingFacePredictor\n",
"from ray.train.batch_predictor import BatchPredictor\n",
"import pandas as pd\n",
"\n",
"predictor = BatchPredictor.from_checkpoint(\n",
" checkpoint=best_result.checkpoint,\n",
" predictor_cls=HuggingFacePredictor,\n",
" task=\"text-classification\",\n",
" device=0 if use_gpu else -1, # -1 is CPU, otherwise device index\n",
")\n",
"prediction = predictor.predict(ray_datasets[\"test\"].map_batches(lambda x: x[[\"sentence\"]]), num_gpus_per_worker=int(use_gpu))\n",
"prediction.show()"
]
},
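  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The pipeline returns the raw `LABEL_0`/`LABEL_1` identifiers. Below is a minimal sketch of mapping them to human-readable class names, assuming the CoLA convention (label 0 = unacceptable, label 1 = acceptable):\n",
    "\n",
    "```python\n",
    "# Map the generic pipeline labels to the CoLA class names.\n",
    "label_names = {\"LABEL_0\": \"unacceptable\", \"LABEL_1\": \"acceptable\"}\n",
    "\n",
    "df = prediction.to_pandas()  # prediction is a Ray Dataset\n",
    "df[\"label\"] = df[\"label\"].map(label_names)\n",
    "print(df.head())\n",
    "```"
   ]
  },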
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Share the model "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "mS8PId_NhYbb"
},
"source": [
"To be able to share your model with the community, there are a few more steps to follow.\n",
"\n",
"We have conducted the training on the Ray cluster, but share the model from the local enviroment - this will allow us to easily authenticate.\n",
"\n",
"First you have to store your authentication token from the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!) then execute the following cell and input your username and password:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "2LClXkN8hYbb",
"tags": [
"remove-cell-ci"
]
},
"outputs": [],
"source": [
"from huggingface_hub import notebook_login\n",
"\n",
"notebook_login()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "SybKUDryhYbb"
},
"source": [
"Then you need to install Git-LFS. Uncomment the following instructions:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "_wF6aT-0hYbb",
"tags": [
"remove-cell-ci"
]
},
"outputs": [],
"source": [
"# !apt install git-lfs"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5fr6E0e8hYbb"
},
"source": [
"Now, load the model and tokenizer locally, and recreate the 🤗 Transformers `Trainer`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "cjH2A8m6hYbc",
"tags": [
"remove-cell-ci"
]
},
"outputs": [],
"source": [
"from ray.train.huggingface import HuggingFaceCheckpoint\n",
"\n",
"checkpoint = HuggingFaceCheckpoint.from_checkpoint(result.checkpoint)\n",
"hf_trainer = checkpoint.get_model(model=AutoModelForSequenceClassification)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "tgV2xKfFhYbc"
},
"source": [
"You can now upload the result of the training to the Hub, just execute this instruction:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "XSkfJe3nhYbc",
"tags": [
"remove-cell-ci"
]
},
"outputs": [],
"source": [
"hf_trainer.push_to_hub()"
]
},
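  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Fine-tuning did not modify the tokenizer, so if you want the repository to be directly usable with `AutoTokenizer`, you can push the original tokenizer as well - a sketch, assuming the `distilbert-base-uncased` checkpoint used throughout this notebook and a hypothetical repository name:\n",
    "\n",
    "```python\n",
    "from transformers import AutoTokenizer\n",
    "\n",
    "# Reload the unchanged tokenizer from the base checkpoint and push it to the\n",
    "# same repository (\"the-name-you-picked\" is a hypothetical placeholder).\n",
    "tokenizer = AutoTokenizer.from_pretrained(\"distilbert-base-uncased\")\n",
    "tokenizer.push_to_hub(\"the-name-you-picked\")\n",
    "```"
   ]
  },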
{
"cell_type": "markdown",
"metadata": {
"id": "UL-Boc4dhYbc"
},
"source": [
"You can now share this model with all your friends, family, favorite pets: they can all load it with the identifier `\"your-username/the-name-you-picked\"` so for instance:\n",
"\n",
"```python\n",
"from transformers import AutoModelForSequenceClassification\n",
"\n",
"model = AutoModelForSequenceClassification.from_pretrained(\"sgugger/my-awesome-model\")\n",
"```"
]
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"collapsed_sections": [],
"name": "huggingface_text_classification.ipynb",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3.8.9 64-bit",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.6"
},
"vscode": {
"interpreter": {
"hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
}
}
},
"nbformat": 4,
"nbformat_minor": 0
}