{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Fine-tune a Hugging Face Transformers Model" ] }, { "cell_type": "markdown", "metadata": { "id": "VaFMt6AIhYbK" }, "source": [ "This notebook is based on an official Hugging Face example, [How to fine-tune a model on text classification](https://github.com/huggingface/notebooks/blob/6ca682955173cc9d36ffa431ddda505a048cbe80/examples/text_classification.ipynb). This notebook shows the process of conversion from vanilla HF to Ray Train without changing the training logic unless necessary.\n", "\n", "This notebook consists of the following steps:\n", "1. [Set up Ray](#hf-setup)\n", "2. [Load the dataset](#hf-load)\n", "3. [Preprocess the dataset with Ray Data](#hf-preprocess)\n", "4. [Run the training with Ray Train](#hf-train)\n", "5. [Optionally, share the model with the community](#hf-share)" ] }, { "cell_type": "markdown", "metadata": { "id": "sQbdfyWQhYbO" }, "source": [ "Uncomment and run the following line to install all the necessary dependencies. (This notebook is being tested with `transformers==4.19.1`.):" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "YajFzmkthYbO" }, "outputs": [], "source": [ "#! pip install \"datasets\" \"transformers>=4.19.0\" \"torch>=1.10.0\" \"mlflow\"" ] }, { "cell_type": "markdown", "metadata": { "id": "pvSRaEHChYbP" }, "source": [ "(hf-setup)=\n", "## Set up Ray" ] }, { "cell_type": "markdown", "metadata": { "id": "LRdL3kWBhYbQ" }, "source": [ "Use `ray.init()` to initialize a local cluster. By default, this cluster contains only the machine you are running this notebook on. You can also run this notebook on an [Anyscale](https://www.anyscale.com/) cluster." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "MOsHUjgdIrIW", "outputId": "e527bdbb-2f28-4142-cca0-762e0566cbcd", "tags": [] }, "outputs": [], "source": [ "from pprint import pprint\n", "import ray\n", "\n", "ray.init()" ] }, { "cell_type": "markdown", "metadata": { "id": "oJiSdWy2hYbR" }, "source": [ "Check the resources our cluster is composed of. If you are running this notebook on your local machine or Google Colab, you should see the number of CPU cores and GPUs available on the your machine." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "KlMz0dt9hYbS", "outputId": "2d485449-ee69-4334-fcba-47e0ceb63078", "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'CPU': 48.0,\n", " 'GPU': 4.0,\n", " 'accelerator_type:None': 1.0,\n", " 'memory': 206158430208.0,\n", " 'node:10.0.27.125': 1.0,\n", " 'node:__internal_head__': 1.0,\n", " 'object_store_memory': 59052625920.0}\n" ] } ], "source": [ "pprint(ray.cluster_resources())" ] }, { "cell_type": "markdown", "metadata": { "id": "uS6oeJELhYbS" }, "source": [ "This notebook fine-tunes a [HF Transformers](https://github.com/huggingface/transformers) model for one of the text classification task of the [GLUE Benchmark](https://gluebenchmark.com/). It runs the training using Ray Train.\n", "\n", "You can change these two variables to control whether the training, which happens later, uses CPUs or GPUs, and how many workers to spawn. Each worker claims one CPU or GPU. Make sure to not request more resources than the resources present. By default, the training runs with one GPU worker." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "id": "gAbhv9OqhYbT", "tags": [] }, "outputs": [], "source": [ "use_gpu = True # set this to False to run on CPUs\n", "num_workers = 1 # set this to number of GPUs or CPUs you want to use" ] }, { "cell_type": "markdown", "metadata": { "id": "rEJBSTyZIrIb" }, "source": [ "## Fine-tune a model on a text classification task" ] }, { "cell_type": "markdown", "metadata": { "id": "kTCFado4IrIc" }, "source": [ "The GLUE Benchmark is a group of nine classification tasks on sentences or pairs of sentences. To learn more, see the [original notebook](https://github.com/huggingface/notebooks/blob/6ca682955173cc9d36ffa431ddda505a048cbe80/examples/text_classification.ipynb).\n", "\n", "Each task has a name that is its acronym, with `mnli-mm` to indicate that it is a mismatched version of MNLI. Each one has the same training set as `mnli` but different validation and test sets." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "id": "YZbiBDuGIrId", "tags": [] }, "outputs": [], "source": [ "GLUE_TASKS = [\n", " \"cola\",\n", " \"mnli\",\n", " \"mnli-mm\",\n", " \"mrpc\",\n", " \"qnli\",\n", " \"qqp\",\n", " \"rte\",\n", " \"sst2\",\n", " \"stsb\",\n", " \"wnli\",\n", "]" ] }, { "cell_type": "markdown", "metadata": { "id": "4RRkXuteIrIh" }, "source": [ "This notebook runs on any of the tasks in the list above, with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a version with a classification head. Depending on the model and the GPU you are using, you might need to adjust the batch size to avoid out-of-memory errors. Set these three parameters, and the rest of the notebook should run smoothly:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "id": "zVvslsfMIrIh", "tags": [] }, "outputs": [], "source": [ "task = \"cola\"\n", "model_checkpoint = \"distilbert-base-uncased\"\n", "batch_size = 16" ] }, { "cell_type": "markdown", "metadata": { "id": "whPRbBNbIrIl" }, "source": [ "(hf-load)=\n", "### Loading the dataset" ] }, { "cell_type": "markdown", "metadata": { "id": "W7QYTpxXIrIl" }, "source": [ "Use the [HF Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric to use for evaluation and to compare your model to the benchmark. You can do this comparison easily with the `load_dataset` and `load_metric` functions.\n", "\n", "Apart from `mnli-mm` being special code, you can directly pass the task name to those functions.\n", "\n", "Run the normal HF Datasets code to load the dataset from the Hub." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 200 }, "id": "MwhAeEOuhYbV", "outputId": "3aff8c73-d6eb-4784-890a-a419403b5bda", "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Reusing dataset glue (/home/ray/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "8217c4d4e1e7402c92477b3e2cf8961c", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/3 [00:00\n", "
\n", "
\n", "

Tune Status

\n", " \n", "\n", "\n", "\n", "\n", "\n", "
Current time:2023-09-06 14:27:12
Running for: 00:01:40.12
Memory: 18.4/186.6 GiB
\n", "
\n", "
\n", "
\n", "

System Info

\n", " Using FIFO scheduling algorithm.
Logical resource usage: 1.0/48 CPUs, 1.0/4 GPUs (0.0/1.0 accelerator_type:None)\n", "
\n", " \n", "
\n", "
\n", "
\n", "

Trial Status

\n", " \n", "\n", "\n", "\n", "\n", "\n", "\n", "
Trial name status loc iter total time (s) loss learning_rate epoch
TorchTrainer_e8bd4_00000TERMINATED10.0.27.125:43821 2 76.62590.3866 0 1.5
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(TrainTrainable pid=43821)\u001b[0m 2023-09-06 14:25:35.638885: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA\n", "\u001b[2m\u001b[36m(TrainTrainable pid=43821)\u001b[0m To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.\n", "\u001b[2m\u001b[36m(TrainTrainable pid=43821)\u001b[0m 2023-09-06 14:25:35.782950: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.\n", "\u001b[2m\u001b[36m(TrainTrainable pid=43821)\u001b[0m 2023-09-06 14:25:36.501583: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64\n", "\u001b[2m\u001b[36m(TrainTrainable pid=43821)\u001b[0m 2023-09-06 14:25:36.501653: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64\n", "\u001b[2m\u001b[36m(TrainTrainable pid=43821)\u001b[0m 2023-09-06 14:25:36.501660: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.\n", "\u001b[2m\u001b[36m(TrainTrainable pid=43821)\u001b[0m comet_ml is installed but `COMET_API_KEY` is not set.\n", "\u001b[2m\u001b[36m(TorchTrainer pid=43821)\u001b[0m Starting distributed worker processes: ['43946 (10.0.27.125)']\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=43946)\u001b[0m Setting up process group for: env:// [rank=0, world_size=1]\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=43946)\u001b[0m 2023-09-06 14:25:42.756510: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=43946)\u001b[0m To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=43946)\u001b[0m 2023-09-06 14:25:42.903398: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.\n", "\u001b[2m\u001b[36m(SplitCoordinator pid=44017)\u001b[0m Auto configuring locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026']\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=43946)\u001b[0m 2023-09-06 14:25:43.737476: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=43946)\u001b[0m 2023-09-06 14:25:43.737544: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=43946)\u001b[0m 2023-09-06 14:25:43.737554: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=43946)\u001b[0m comet_ml is installed but `COMET_API_KEY` is not set.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(RayTrainWorker pid=43946)\u001b[0m Is CUDA available: True\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(RayTrainWorker pid=43946)\u001b[0m Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight']\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=43946)\u001b[0m - This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=43946)\u001b[0m - This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=43946)\u001b[0m Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias', 'pre_classifier.bias', 'pre_classifier.weight']\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=43946)\u001b[0m You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n", "\u001b[2m\u001b[36m(SplitCoordinator pid=44016)\u001b[0m Auto configuring locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026']\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(RayTrainWorker pid=43946)\u001b[0m max_steps_per_epoch: 534\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(RayTrainWorker pid=43946)\u001b[0m max_steps is given, it will override any value given in num_train_epochs\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=43946)\u001b[0m /home/ray/anaconda3/lib/python3.9/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=43946)\u001b[0m warnings.warn(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(RayTrainWorker pid=43946)\u001b[0m Starting training\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(RayTrainWorker pid=43946)\u001b[0m ***** Running training *****\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=43946)\u001b[0m Num examples = 17088\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=43946)\u001b[0m Num Epochs = 9223372036854775807\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=43946)\u001b[0m Instantaneous batch size per device = 16\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=43946)\u001b[0m Total train batch size (w. parallel, distributed & accumulation) = 16\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=43946)\u001b[0m Gradient Accumulation steps = 1\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=43946)\u001b[0m Total optimization steps = 1068\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=43946)\u001b[0m /tmp/ipykernel_43503/4088900328.py:23: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:206.)\n", "\u001b[2m\u001b[36m(SplitCoordinator pid=44016)\u001b[0m Executing DAG InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]\n", "\u001b[2m\u001b[36m(SplitCoordinator pid=44016)\u001b[0m Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)\n", "\u001b[2m\u001b[36m(SplitCoordinator pid=44016)\u001b[0m Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "(pid=44016) Running 0: 0%| | 0/1 [00:00 OutputSplitter[split(1, equal=True)]\n", "\u001b[2m\u001b[36m(SplitCoordinator pid=44017)\u001b[0m Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)\n", "\u001b[2m\u001b[36m(SplitCoordinator pid=44017)\u001b[0m Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "(pid=44017) Running 0: 0%| | 0/1 [00:00 OutputSplitter[split(1, equal=True)]\n", "\u001b[2m\u001b[36m(SplitCoordinator pid=44016)\u001b[0m Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)\n", "\u001b[2m\u001b[36m(SplitCoordinator pid=44016)\u001b[0m Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "(pid=44016) Running 0: 0%| | 0/1 [00:00 OutputSplitter[split(1, equal=True)]\n", "\u001b[2m\u001b[36m(SplitCoordinator pid=44017)\u001b[0m Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)\n", "\u001b[2m\u001b[36m(SplitCoordinator pid=44017)\u001b[0m Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "(pid=44017) Running 0: 0%| | 0/1 [00:00\n", "
\n", "
\n", "

Tune Status

\n", " \n", "\n", "\n", "\n", "\n", "\n", "
Current time:2023-09-06 14:49:04
Running for: 00:02:16.18
Memory: 19.6/186.6 GiB
\n", "
\n", "
\n", "
\n", "

System Info

\n", " Using AsyncHyperBand: num_stopped=4
Bracket: Iter 4.000: -0.6517604142427444 | Iter 1.000: -0.5936744660139084
Logical resource usage: 1.0/48 CPUs, 1.0/4 GPUs (0.0/1.0 accelerator_type:None)\n", "
\n", " \n", "
\n", "
\n", "
\n", "

Trial Status

\n", " \n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Trial name status loc train_loop_config/le\n", "arning_rate iter total time (s) loss learning_rate epoch
TorchTrainer_e1825_00000TERMINATED10.0.27.125:574962e-05 4 128.443 0.1934 0 3.25
TorchTrainer_e1825_00001TERMINATED10.0.27.125:574970.0002 1 41.24860.616 0.000149906 0.25
TorchTrainer_e1825_00002TERMINATED10.0.27.125:574980.002 1 41.13360.6699 0.00149906 0.25
TorchTrainer_e1825_00003TERMINATED10.0.27.125:574990.02 4 126.699 0.6073 0 3.25
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(TrainTrainable pid=57498)\u001b[0m 2023-09-06 14:46:52.049839: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA\n", "\u001b[2m\u001b[36m(TrainTrainable pid=57498)\u001b[0m To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.\n", "\u001b[2m\u001b[36m(TrainTrainable pid=57498)\u001b[0m 2023-09-06 14:46:52.195780: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.\n", "\u001b[2m\u001b[36m(TrainTrainable pid=57498)\u001b[0m 2023-09-06 14:46:52.944517: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64\n", "\u001b[2m\u001b[36m(TrainTrainable pid=57498)\u001b[0m 2023-09-06 14:46:52.944590: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64\n", "\u001b[2m\u001b[36m(TrainTrainable pid=57498)\u001b[0m 2023-09-06 14:46:52.944597: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.\n", "\u001b[2m\u001b[36m(TrainTrainable pid=57498)\u001b[0m comet_ml is installed but `COMET_API_KEY` is not set.\n", "\u001b[2m\u001b[36m(TorchTrainer pid=57498)\u001b[0m Starting distributed worker processes: ['57731 (10.0.27.125)']\n", "\u001b[2m\u001b[36m(TrainTrainable pid=57499)\u001b[0m 2023-09-06 14:46:52.229406: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA\u001b[32m [repeated 3x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(TrainTrainable pid=57499)\u001b[0m To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.\u001b[32m [repeated 3x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(TrainTrainable pid=57499)\u001b[0m 2023-09-06 14:46:52.378805: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.\u001b[32m [repeated 3x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57741)\u001b[0m Setting up process group for: env:// [rank=0, world_size=1]\n", "\u001b[2m\u001b[36m(TrainTrainable pid=57499)\u001b[0m 2023-09-06 14:46:53.174151: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64\u001b[32m [repeated 6x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(TrainTrainable pid=57499)\u001b[0m 2023-09-06 14:46:53.174160: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.\u001b[32m [repeated 3x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(TrainTrainable pid=57499)\u001b[0m comet_ml is installed but `COMET_API_KEY` is not set.\u001b[32m [repeated 3x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(SplitCoordinator pid=57927)\u001b[0m Auto configuring locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026']\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(RayTrainWorker pid=57741)\u001b[0m Is CUDA available: True\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57741)\u001b[0m max_steps_per_epoch: 534\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(RayTrainWorker pid=57741)\u001b[0m Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_projector.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias']\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57741)\u001b[0m - This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57741)\u001b[0m - This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57741)\u001b[0m Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias']\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57741)\u001b[0m You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n", "\u001b[2m\u001b[36m(TorchTrainer pid=57499)\u001b[0m Starting distributed worker processes: ['57746 (10.0.27.125)']\u001b[32m [repeated 3x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57740)\u001b[0m 2023-09-06 14:47:00.036649: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA\u001b[32m [repeated 4x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57740)\u001b[0m To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.\u001b[32m [repeated 4x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57740)\u001b[0m 2023-09-06 14:47:00.198894: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.\u001b[32m [repeated 4x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57746)\u001b[0m Setting up process group for: env:// [rank=0, world_size=1]\u001b[32m [repeated 3x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57740)\u001b[0m 2023-09-06 14:47:01.085704: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64\u001b[32m [repeated 8x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57740)\u001b[0m 2023-09-06 14:47:01.085711: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.\u001b[32m [repeated 4x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57740)\u001b[0m comet_ml is installed but `COMET_API_KEY` is not set.\u001b[32m [repeated 4x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(SplitCoordinator pid=57965)\u001b[0m Auto configuring locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026']\u001b[32m [repeated 7x across cluster]\u001b[0m\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(RayTrainWorker pid=57741)\u001b[0m Starting training\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(RayTrainWorker pid=57741)\u001b[0m max_steps is given, it will override any value given in num_train_epochs\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57741)\u001b[0m /home/ray/anaconda3/lib/python3.9/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57741)\u001b[0m warnings.warn(\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57746)\u001b[0m Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_transform.bias']\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57746)\u001b[0m Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'classifier.bias', 'pre_classifier.bias']\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57731)\u001b[0m Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias']\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57740)\u001b[0m Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight']\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57740)\u001b[0m Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias', 'classifier.weight']\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57741)\u001b[0m ***** Running training *****\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57741)\u001b[0m Num examples = 34176\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57741)\u001b[0m Num Epochs = 9223372036854775807\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57741)\u001b[0m Instantaneous batch size per device = 16\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57741)\u001b[0m Total train batch size (w. parallel, distributed & accumulation) = 16\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57741)\u001b[0m Gradient Accumulation steps = 1\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57741)\u001b[0m Total optimization steps = 2136\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57741)\u001b[0m /tmp/ipykernel_43503/4088900328.py:23: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:206.)\n", "\u001b[2m\u001b[36m(SplitCoordinator pid=57927)\u001b[0m Executing DAG InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]\n", "\u001b[2m\u001b[36m(SplitCoordinator pid=57927)\u001b[0m Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)\n", "\u001b[2m\u001b[36m(SplitCoordinator pid=57927)\u001b[0m Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "(pid=57927) Running 0: 0%| | 0/1 [00:00 OutputSplitter[split(1, equal=True)]\u001b[32m [repeated 3x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(SplitCoordinator pid=57965)\u001b[0m Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)\u001b[32m [repeated 3x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(SplitCoordinator pid=57965)\u001b[0m Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`\u001b[32m [repeated 3x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57740)\u001b[0m [W reducer.cpp:1300] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())\u001b[32m [repeated 3x across cluster]\u001b[0m\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "(pid=57928) Running 0: 0%| | 0/1 [00:00 OutputSplitter[split(1, equal=True)]\u001b[32m [repeated 6x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(SplitCoordinator pid=57954)\u001b[0m Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)\u001b[32m [repeated 6x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(SplitCoordinator pid=57954)\u001b[0m Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`\u001b[32m [repeated 6x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57740)\u001b[0m Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-535\u001b[32m [repeated 3x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57740)\u001b[0m Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/config.json\u001b[32m [repeated 3x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57740)\u001b[0m Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/pytorch_model.bin\u001b[32m [repeated 3x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57740)\u001b[0m tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/tokenizer_config.json\u001b[32m [repeated 3x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57740)\u001b[0m Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/special_tokens_map.json\u001b[32m [repeated 3x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57740)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/ray/ray_results/tune_transformers/TorchTrainer_e1825_00001_1_learning_rate=0.0002_2023-09-06_14-46-48/checkpoint_000000)\u001b[32m [repeated 3x across cluster]\u001b[0m\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "(pid=57955) Running 0: 0%| | 0/1 [00:00 OutputSplitter[split(1, equal=True)]\u001b[32m [repeated 4x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(SplitCoordinator pid=57927)\u001b[0m Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)\u001b[32m [repeated 4x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(SplitCoordinator pid=57927)\u001b[0m Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`\u001b[32m [repeated 4x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57741)\u001b[0m Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-1070\u001b[32m [repeated 2x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57741)\u001b[0m Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/config.json\u001b[32m [repeated 2x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57741)\u001b[0m Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/pytorch_model.bin\u001b[32m [repeated 2x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57741)\u001b[0m tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/tokenizer_config.json\u001b[32m [repeated 2x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57741)\u001b[0m Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/special_tokens_map.json\u001b[32m [repeated 2x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57741)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/ray/ray_results/tune_transformers/TorchTrainer_e1825_00000_0_learning_rate=0.0000_2023-09-06_14-46-48/checkpoint_000001)\u001b[32m [repeated 2x across cluster]\u001b[0m\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "(pid=57955) Running 0: 0%| | 0/1 [00:00 OutputSplitter[split(1, equal=True)]\u001b[32m [repeated 4x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(SplitCoordinator pid=57927)\u001b[0m Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)\u001b[32m [repeated 4x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(SplitCoordinator pid=57927)\u001b[0m Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`\u001b[32m [repeated 4x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57741)\u001b[0m Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-1605\u001b[32m [repeated 2x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57741)\u001b[0m Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-1605/config.json\u001b[32m [repeated 2x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57741)\u001b[0m Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-1605/pytorch_model.bin\u001b[32m [repeated 2x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57741)\u001b[0m tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1605/tokenizer_config.json\u001b[32m [repeated 2x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57741)\u001b[0m Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1605/special_tokens_map.json\u001b[32m [repeated 2x across cluster]\u001b[0m\n", "\u001b[2m\u001b[36m(RayTrainWorker pid=57741)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/ray/ray_results/tune_transformers/TorchTrainer_e1825_00000_0_learning_rate=0.0000_2023-09-06_14-46-48/checkpoint_000002)\u001b[32m [repeated 2x across cluster]\u001b[0m\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "(pid=57955) Running 0: 0%| | 0/1 [00:00\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
losslearning_rateepochstepeval_losseval_matthews_correlationeval_runtimeeval_samples_per_secondeval_steps_per_secondtimestamp...time_total_spidhostnamenode_iptime_since_restoreiterations_since_restorecheckpoint_dir_nameconfig/train_loop_config/learning_rateconfig/train_loop_config/epochslogdir
10.61600.0001500.255350.6181350.0000000.75431382.82887.5041694036857...41.24860057497ip-10-0-27-12510.0.27.12541.2486001checkpoint_0000000.000204e1825_00001
20.66990.0014990.255350.6196570.0000000.74491400.20288.6031694036856...41.13360957498ip-10-0-27-12510.0.27.12541.1336091checkpoint_0000000.002004e1825_00002
30.60730.0000003.2521360.6196940.0000000.63291648.039104.2861694036942...126.69923857499ip-10-0-27-12510.0.27.125126.6992384checkpoint_0000030.020004e1825_00003
00.19340.0000003.2521360.7479600.5207560.65301597.187101.0681694036944...128.44349557496ip-10-0-27-12510.0.27.125128.4434954checkpoint_0000030.000024e1825_00000
\n", "

4 rows × 26 columns

\n", "" ], "text/plain": [ " loss learning_rate epoch step eval_loss eval_matthews_correlation \\\n", "1 0.6160 0.000150 0.25 535 0.618135 0.000000 \n", "2 0.6699 0.001499 0.25 535 0.619657 0.000000 \n", "3 0.6073 0.000000 3.25 2136 0.619694 0.000000 \n", "0 0.1934 0.000000 3.25 2136 0.747960 0.520756 \n", "\n", " eval_runtime eval_samples_per_second eval_steps_per_second timestamp \\\n", "1 0.7543 1382.828 87.504 1694036857 \n", "2 0.7449 1400.202 88.603 1694036856 \n", "3 0.6329 1648.039 104.286 1694036942 \n", "0 0.6530 1597.187 101.068 1694036944 \n", "\n", " ... time_total_s pid hostname node_ip time_since_restore \\\n", "1 ... 41.248600 57497 ip-10-0-27-125 10.0.27.125 41.248600 \n", "2 ... 41.133609 57498 ip-10-0-27-125 10.0.27.125 41.133609 \n", "3 ... 126.699238 57499 ip-10-0-27-125 10.0.27.125 126.699238 \n", "0 ... 128.443495 57496 ip-10-0-27-125 10.0.27.125 128.443495 \n", "\n", " iterations_since_restore checkpoint_dir_name \\\n", "1 1 checkpoint_000000 \n", "2 1 checkpoint_000000 \n", "3 4 checkpoint_000003 \n", "0 4 checkpoint_000003 \n", "\n", " config/train_loop_config/learning_rate config/train_loop_config/epochs \\\n", "1 0.00020 4 \n", "2 0.00200 4 \n", "3 0.02000 4 \n", "0 0.00002 4 \n", "\n", " logdir \n", "1 e1825_00001 \n", "2 e1825_00002 \n", "3 e1825_00003 \n", "0 e1825_00000 \n", "\n", "[4 rows x 26 columns]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tune_results.get_dataframe().sort_values(\"eval_loss\")" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "best_result = tune_results.get_best_result()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(hf-share)=\n", "### Share the model" ] }, { "cell_type": "markdown", "metadata": { "id": "mS8PId_NhYbb" }, "source": [ "To share the model with the community, a few more steps follow.\n", "\n", "You conducted the training on the Ray cluster, but want share the model from the local enviroment. This configuration allows you to easily authenticate.\n", "\n", "First, store your authentication token from the Hugging Face website. Sign up [here](https://huggingface.co/join) if you haven't already. Then execute the following cell and input your username and password:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "2LClXkN8hYbb", "tags": [ "remove-cell-ci" ] }, "outputs": [], "source": [ "from huggingface_hub import notebook_login\n", "\n", "notebook_login()" ] }, { "cell_type": "markdown", "metadata": { "id": "SybKUDryhYbb" }, "source": [ "Then you need to install Git-LFS. Uncomment the following instructions:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "id": "_wF6aT-0hYbb", "tags": [ "remove-cell-ci" ] }, "outputs": [], "source": [ "# !apt install git-lfs" ] }, { "cell_type": "markdown", "metadata": { "id": "5fr6E0e8hYbb" }, "source": [ "Load the model with the best-performing checkpoint:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "cjH2A8m6hYbc", "tags": [] }, "outputs": [], "source": [ "import os\n", "from ray.train import Checkpoint\n", "\n", "checkpoint: Checkpoint = best_result.checkpoint\n", "\n", "with checkpoint.as_directory() as checkpoint_dir:\n", " checkpoint_path = os.path.join(checkpoint_dir, \"checkpoint\")\n", " model = AutoModelForSequenceClassification.from_pretrained(checkpoint_path)" ] }, { "cell_type": "markdown", "metadata": { "id": "tgV2xKfFhYbc" }, "source": [ "You can now upload the result of the training to the Hub. Execute this instruction:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "XSkfJe3nhYbc", "tags": [ "remove-cell-ci" ] }, "outputs": [], "source": [ "model.push_to_hub()" ] }, { "cell_type": "markdown", "metadata": { "id": "UL-Boc4dhYbc" }, "source": [ "You can now share this model. Others can load it with the identifier `\"your-username/the-name-you-picked\"`. For example:\n", "\n", "```python\n", "from transformers import AutoModelForSequenceClassification\n", "\n", "model = AutoModelForSequenceClassification.from_pretrained(\"sgugger/my-awesome-model\")\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## See also\n", "\n", "* {doc}`Ray Train Examples <../../examples>` for more use cases\n", "* {ref}`Ray Train User Guides ` for how-to guides\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "accelerator": "GPU", "colab": { "collapsed_sections": [], "name": "huggingface_text_classification.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.8" }, "orphan": true, "vscode": { "interpreter": { "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" } } }, "nbformat": 4, "nbformat_minor": 4 }