Fine-tune a Hugging Face Transformers Model#
This notebook is based on an official Hugging Face example, How to fine-tune a model on text classification. This notebook shows the process of conversion from vanilla HF to Ray Train without changing the training logic unless necessary.
This notebook consists of the following steps:
Uncomment and run the following line to install all the necessary dependencies (this notebook is tested with transformers==4.19.1):
#! pip install "datasets" "evaluate" "transformers>=4.19.0" "torch>=1.10.0" "mlflow"
Set up Ray#
Use ray.init() to initialize a local cluster. By default, this cluster contains only the machine you are running this notebook on. You can also run this notebook on an Anyscale cluster.
from pprint import pprint
import ray
ray.init()
Check the resources that make up the cluster. If you are running this notebook on your local machine or Google Colab, you should see the number of CPU cores and GPUs available on your machine.
pprint(ray.cluster_resources())
{'CPU': 48.0,
'GPU': 4.0,
'accelerator_type:T4': 1.0,
'anyscale/accelerator_shape:4xT4': 1.0,
'anyscale/node-group:head': 1.0,
'anyscale/provider:aws': 1.0,
'anyscale/region:us-west-2': 1.0,
'memory': 206158430208.0,
'node:10.0.114.132': 1.0,
'node:__internal_head__': 1.0,
'object_store_memory': 58913938636.0}
This notebook fine-tunes a HF Transformers model for one of the text classification tasks of the GLUE Benchmark. It runs the training using Ray Train.
You can change these two variables to control whether the training, which happens later, uses CPUs or GPUs, and how many workers to spawn. Each worker claims one CPU or GPU. Make sure not to request more resources than the cluster has. By default, the training runs with one GPU worker.
use_gpu = True # set this to False to run on CPUs
num_workers = 1 # set this to number of GPUs or CPUs you want to use
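As a rough sanity check for the two variables above, the following sketch compares a worker request against a resource dictionary shaped like the output of ray.cluster_resources(). The helper name check_worker_request is hypothetical and not part of Ray, and the sample values are copied from the output shown earlier.

```python
from typing import Mapping


def check_worker_request(
    resources: Mapping[str, float], use_gpu: bool, num_workers: int
) -> None:
    """Raise ValueError if the request exceeds what the cluster reports."""
    # Each worker claims one GPU when use_gpu is True, otherwise one CPU.
    key = "GPU" if use_gpu else "CPU"
    available = int(resources.get(key, 0))
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    if num_workers > available:
        raise ValueError(
            f"Requested {num_workers} {key} workers, "
            f"but the cluster reports only {available}"
        )


# Sample values taken from the cluster output above. In the notebook you
# could pass ray.cluster_resources() instead of this dictionary.
sample_resources = {"CPU": 48.0, "GPU": 4.0}
check_worker_request(sample_resources, use_gpu=True, num_workers=1)
print("request fits")
```

The check only covers the one-device-per-worker rule stated above; it does not account for resources that other jobs on the cluster already hold.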
Fine-tune a model on a text classification task#
The GLUE Benchmark is a group of nine classification tasks on sentences or pairs of sentences. To learn more, see the original notebook.
Each task has a name that is its acronym, with mnli-mm indicating the mismatched version of MNLI. It has the same training set as mnli but different validation and test sets.
GLUE_TASKS = [
"cola",
"mnli",
"mnli-mm",
"mrpc",
"qnli",
"qqp",
"rte",
"sst2",
"stsb",
"wnli",
]
This notebook runs on any of the tasks in the list above, with any model checkpoint from the Model Hub as long as that model has a version with a classification head. Depending on the model and the GPU you are using, you might need to adjust the batch size to avoid out-of-memory errors. Set these three parameters, and the rest of the notebook should run smoothly:
task = "cola"
model_checkpoint = "distilbert-base-uncased"
batch_size = 16
Loading the dataset#
Use the HF Datasets library to download the data, and the HF Evaluate library to get the metric to use for evaluation and to compare your model to the benchmark. You can do both easily with the load_dataset function from Datasets and the load function from Evaluate.
Apart from mnli-mm being a special code, you can directly pass the task name to those functions.
Run the normal HF Datasets code to load the dataset from the Hub.
from datasets import load_dataset
actual_task = "mnli" if task == "mnli-mm" else task
datasets = load_dataset("glue", actual_task)
The datasets object itself is a DatasetDict, which contains one key each for the training, validation, and test sets, with more keys for the mismatched validation and test sets in the special case of mnli.
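To make the mnli special case concrete, here is a small sketch that maps a task name to the DatasetDict keys you would look up. The helper glue_split_names is hypothetical, and the _matched and _mismatched suffixes are an assumption based on how the GLUE dataset on the Hub names its MNLI splits.

```python
from typing import Dict


def glue_split_names(task: str) -> Dict[str, str]:
    """Return the dataset name to load and the split keys for a GLUE task."""
    if task == "mnli":
        suffix = "_matched"
    elif task == "mnli-mm":
        suffix = "_mismatched"
    else:
        suffix = ""
    return {
        # mnli-mm is not a dataset of its own; it reuses the mnli data.
        "load_name": "mnli" if task == "mnli-mm" else task,
        "train": "train",
        "validation": f"validation{suffix}",
        "test": f"test{suffix}",
    }


print(glue_split_names("cola"))
print(glue_split_names("mnli-mm"))
```

With the default task of cola, the keys are simply train, validation, and test. For the two MNLI variants, a plain "validation" lookup would likely fail, so pick the keys with a mapping like this one.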
Preprocessing the data with Ray Data#
Before you can feed these texts to the model, you need to preprocess them. Preprocess them with a HF Transformers’ Tokenizer, which tokenizes the inputs, including converting the tokens to their corresponding IDs in the pretrained vocabulary, and puts them in a format the model expects. It also generates the other inputs that the model requires.
To do all of this preprocessing, instantiate your tokenizer with the AutoTokenizer.from_pretrained method, which ensures that you:
Get a tokenizer that corresponds to the model architecture you want to use.
Download the vocabulary used when pretraining this specific checkpoint.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
/home/ray/anaconda3/lib/python3.9/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
/home/ray/anaconda3/lib/python3.9/site-packages/huggingface_hub/file_download.py:795: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
The preceding call passes use_fast=True to use one of the fast tokenizers, backed by Rust, from the HF Tokenizers library. These fast tokenizers are available for almost all models, but if you get an error with the previous call, remove the argument.
To preprocess the dataset, you need the names of the columns containing the sentence(s). The following dictionary maps each task to its column names:
task_to_keys = {
"cola": ("sentence", None),
"mnli": ("premise", "hypothesis"),
"mnli-mm": ("premise", "hypothesis"),
"mrpc": ("sentence1", "sentence2"),
"qnli": ("question", "sentence"),
"qqp": ("question1", "question2"),
"rte": ("sentence1", "sentence2"),
"sst2": ("sentence", None),
"stsb": ("sentence1", "sentence2"),
"wnli": ("sentence1", "sentence2"),
}
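The sketch below shows one way such a mapping can drive preprocessing: it picks one or two text columns out of a batch, depending on whether the second key is None. The function select_text_columns and the two-entry mapping are illustrative stand-ins, and the sample sentences are made up.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# A two-entry excerpt of the mapping above, repeated so the sketch runs alone.
EXAMPLE_TASK_TO_KEYS: Dict[str, Tuple[str, Optional[str]]] = {
    "cola": ("sentence", None),
    "mrpc": ("sentence1", "sentence2"),
}


def select_text_columns(
    batch: Dict[str, Sequence], task: str
) -> Tuple[List[str], ...]:
    """Return the text columns a tokenizer would receive for this task."""
    first_key, second_key = EXAMPLE_TASK_TO_KEYS[task]
    if second_key is None:
        # Single-sentence task: one positional argument.
        return (list(batch[first_key]),)
    # Sentence-pair task: two positional arguments.
    return (list(batch[first_key]), list(batch[second_key]))


single = select_text_columns({"sentence": ["A made-up sentence."]}, "cola")
pair = select_text_columns(
    {"sentence1": ["First text."], "sentence2": ["Second text."]}, "mrpc"
)
print(len(single), len(pair))  # 1 2
```

The resulting tuple could then be unpacked into a tokenizer call, so that single-sentence and sentence-pair tasks share one code path.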
Instead of using HF Dataset objects directly, convert them to Ray Data. Both are backed by Arrow tables, so the conversion is straightforward. Use the built-in from_huggingface() function.
import ray.data
ray_datasets = {
"train": ray.data.from_huggingface(datasets["train"]),
"validation": ray.data.from_huggingface(datasets["validation"]),
"test": ray.data.from_huggingface(datasets["test"]),
}
ray_datasets
{'train': Dataset(num_rows=8551, schema={sentence: string, label: int64, idx: int32}),
'validation': Dataset(num_rows=1043, schema={sentence: string, label: int64, idx: int32}),
'test': Dataset(num_rows=1063, schema={sentence: string, label: int64, idx: int32})}
You can then write the function that preprocesses the samples. Feed them to the tokenizer with the arguments truncation=True and padding="longest". This configuration ensures that the tokenizer truncates any input longer than what the selected model can handle, and pads the remaining inputs to the longest sequence in the batch.
import numpy as np
import torch
from typing import Dict
# Tokenize input sentences
def collate_fn(examples: Dict[str, np.ndarray]):
    sentence1_key, sentence2_key = task_to_keys[task]
    if sentence2_key is None:
        # Single-sentence task
        outputs = tokenizer(
            list(examples[sentence1_key]),
            truncation=True,
            padding="longest",
            return_tensors="pt",
        )
    else:
        # Sentence-pair task
        outputs = tokenizer(
            list(examples[sentence1_key]),
            list(examples[sentence2_key]),
            truncation=True,
            padding="longest",
            return_tensors="pt",
        )
    outputs["labels"] = torch.LongTensor(examples["label"])
    # Move all input tensors to GPU (skipped when training on CPUs)
    if use_gpu:
        for key, value in outputs.items():
            outputs[key] = value.cuda()
    return outputs
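To illustrate what truncating and padding to the longest sequence means, here is a toy version that works on plain lists of token IDs. This is not the tokenizer's implementation: truncate_and_pad is a hypothetical helper, the token IDs are made up, and a real tokenizer also handles special tokens when it truncates.

```python
from typing import Dict, List


def truncate_and_pad(
    sequences: List[List[int]], max_length: int, pad_id: int = 0
) -> Dict[str, List[List[int]]]:
    """Cut each sequence to max_length, then pad to the longest one left."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max((len(seq) for seq in truncated), default=0)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = truncate_and_pad(
    [[101, 7, 8, 102], [101, 9, 102], [101, 1, 2, 3, 4, 5, 102]],
    max_length=5,
)
print(batch["input_ids"])
# [[101, 7, 8, 102, 0], [101, 9, 102, 0, 0], [101, 1, 2, 3, 4]]
```

Because padding happens per batch, each batch is only as wide as its own longest sequence, which avoids padding every example to the model's maximum length.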
Fine-tuning the model with Ray Train#
Now that the data is ready, download the pretrained model and fine-tune it.
Because all of the tasks involve sentence classification, use the AutoModelForSequenceClassification class. For more specifics about each individual training component, see the original notebook. The tokenizer is the same one used to encode the dataset in the preceding section.
The main difference when using Ray Train is that you need to define the training logic as a function (train_func). You pass this training function to the TorchTrainer to run on every Ray worker. The training then proceeds using PyTorch DDP.
Note
Be sure to initialize the model, metric, and tokenizer within the function. Otherwise, you may encounter serialization errors.
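As a loose analogy for why this matters, the sketch below uses only the standard library's pickle module. Ray uses its own serializer, so the exact errors differ, and HeavyResource is a hypothetical stand-in for an object that cannot be shipped to workers. The idea is to pass plain configuration such as a checkpoint name and build the heavy object inside the function that runs on the worker.

```python
import pickle
import threading


class HeavyResource:
    """Hypothetical stand-in for an object that cannot be serialized."""

    def __init__(self) -> None:
        # Thread locks are a common example of unpicklable state.
        self.lock = threading.Lock()


def can_pickle(obj: object) -> bool:
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, AttributeError, pickle.PicklingError):
        return False


# Plain configuration values serialize without trouble.
print(can_pickle({"model_checkpoint": "distilbert-base-uncased"}))  # True
# An already-built object holding unpicklable state does not.
print(can_pickle(HeavyResource()))  # False
```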
import torch
import numpy as np
from evaluate import load
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
import ray.train
from ray.train.huggingface.transformers import prepare_trainer, RayTrainReportCallback
num_labels = 3 if task.startswith("mnli") else 1 if task == "stsb" else 2
metric_name = (
"pearson"
if task == "stsb"
else "matthews_correlation"
if task == "cola"
else "accuracy"
)
model_name = model_checkpoint.split("/")[-1]
validation_key = (
"validation_mismatched"
if task == "mnli-mm"
else "validation_matched"
if task == "mnli"
else "validation"
)
name = f"{model_name}-finetuned-{task}"
# Calculate the maximum steps per epoch based on the number of rows in the training dataset.
# Make sure to scale by the total number of training workers and the per device batch size.
max_steps_per_epoch = ray_datasets["train"].count() // (batch_size * num_workers)
def train_func(config):
    print(f"Is CUDA available: {torch.cuda.is_available()}")
    metric = load("glue", actual_task)
    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_checkpoint, num_labels=num_labels
    )
    train_ds = ray.train.get_dataset_shard("train")
    eval_ds = ray.train.get_dataset_shard("eval")
    train_ds_iterable = train_ds.iter_torch_batches(
        batch_size=batch_size, collate_fn=collate_fn
    )
    eval_ds_iterable = eval_ds.iter_torch_batches(
        batch_size=batch_size, collate_fn=collate_fn
    )
    print("max_steps_per_epoch: ", max_steps_per_epoch)
    args = TrainingArguments(
        name,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        logging_strategy="epoch",
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        learning_rate=config.get("learning_rate", 2e-5),
        num_train_epochs=config.get("epochs", 2),
        weight_decay=config.get("weight_decay", 0.01),
        push_to_hub=False,
        max_steps=max_steps_per_epoch * config.get("epochs", 2),
        disable_tqdm=True,  # declutter the output a little
        no_cuda=not use_gpu,  # you need to explicitly set no_cuda if you want CPUs
        report_to="none",
    )
    def compute_metrics(eval_pred):
        predictions, labels = eval_pred
        if task != "stsb":
            predictions = np.argmax(predictions, axis=1)
        else:
            predictions = predictions[:, 0]
        return metric.compute(predictions=predictions, references=labels)
    trainer = Trainer(
        model,
        args,
        train_dataset=train_ds_iterable,
        eval_dataset=eval_ds_iterable,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
    )
    trainer.add_callback(RayTrainReportCallback())
    trainer = prepare_trainer(trainer)
    print("Starting training")
    trainer.train()
2025-07-09 15:56:28.075767: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-07-09 15:56:28.124864: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-07-09 15:56:28.124884: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-07-09 15:56:28.126125: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-07-09 15:56:28.133567: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-07-09 15:56:29.219640: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/home/ray/anaconda3/lib/python3.9/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
/home/ray/anaconda3/lib/python3.9/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
comet_ml is installed but `COMET_API_KEY` is not set.
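The training function sets max_steps explicitly. A plausible reason is that the batch iterables have no length for the Trainer to inspect. The arithmetic from the cell above can be checked in isolation with the row count of the cola training set (8551) and the default settings. The function names are illustrative only.

```python
def steps_per_epoch(num_rows: int, batch_size: int, num_workers: int) -> int:
    """Full batches each worker processes in one pass over the data."""
    # Every step consumes batch_size rows on each of the num_workers workers.
    return num_rows // (batch_size * num_workers)


def total_steps(
    num_rows: int, batch_size: int, num_workers: int, epochs: int
) -> int:
    """Value passed as max_steps for the whole run."""
    return steps_per_epoch(num_rows, batch_size, num_workers) * epochs


print(steps_per_epoch(8551, batch_size=16, num_workers=1))  # 534
print(total_steps(8551, batch_size=16, num_workers=1, epochs=2))  # 1068
print(steps_per_epoch(8551, batch_size=16, num_workers=4))  # 133
```

The first value matches the max_steps_per_epoch that the worker prints in the training log. Because of the integer division, the rows left over after the last full batch are not counted toward a step.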
With your train_func complete, you can now instantiate the TorchTrainer. Aside from passing the function, set the scaling_config, which controls the number of workers and resources used, and the datasets to use for training and evaluation.
from ray.train.torch import TorchTrainer
from ray.train import RunConfig, ScalingConfig, CheckpointConfig
trainer = TorchTrainer(
train_func,
scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
datasets={
"train": ray_datasets["train"],
"eval": ray_datasets["validation"],
},
run_config=RunConfig(
checkpoint_config=CheckpointConfig(
num_to_keep=1,
checkpoint_score_attribute="eval_loss",
checkpoint_score_order="min",
),
),
)
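The CheckpointConfig above asks Ray Train to keep one checkpoint, ranked by the lowest eval_loss. The following toy function sketches that retention policy on plain dictionaries. It is not Ray's implementation, and select_checkpoints and the "evaluation" labels are made up for illustration. The eval_loss values are the two reported in this run's training log.

```python
from typing import Dict, List, Union

Report = Dict[str, Union[str, float]]


def select_checkpoints(
    reports: List[Report],
    num_to_keep: int,
    score_attribute: str,
    score_order: str = "min",
) -> List[Report]:
    """Rank reports by score_attribute and keep the best num_to_keep."""
    ranked = sorted(
        reports,
        key=lambda report: report[score_attribute],
        reverse=(score_order == "max"),
    )
    return ranked[:num_to_keep]


reports = [
    {"evaluation": "first", "eval_loss": 0.51453697681427},
    {"evaluation": "second", "eval_loss": 0.5683005452156067},
]
kept = select_checkpoints(
    reports, num_to_keep=1, score_attribute="eval_loss", score_order="min"
)
print(kept[0]["evaluation"])  # first
```

In this run the second evaluation has the higher Matthews correlation but also the higher eval_loss, so ranking by eval_loss favors the first checkpoint. Ranking by a different attribute would change which one survives.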
Finally, call the fit method to start training with Ray Train. Save the Result object to a variable so you can access metrics and checkpoints.
result = trainer.fit()
2025-07-09 15:56:32,564 INFO tune.py:616 -- [output] This uses the legacy output and progress reporter, as Jupyter notebooks are not supported by the new engine, yet. For more information, please see https://github.com/ray-project/ray/issues/36949
== Status ==
Current time: 2025-07-09 15:56:32 (running for 00:00:00.11)
Using FIFO scheduling algorithm.
Logical resource usage: 1.0/48 CPUs, 1.0/4 GPUs (0.0/1.0 anyscale/accelerator_shape:4xT4, 0.0/1.0 anyscale/node-group:head, 0.0/1.0 accelerator_type:T4, 0.0/1.0 anyscale/provider:aws, 0.0/1.0 anyscale/region:us-west-2)
Result logdir: /tmp/ray/session_2025-07-09_15-09-59_163606_3385/artifacts/2025-07-09_15-56-32/TorchTrainer_2025-07-09_15-56-32/driver_artifacts
Number of trials: 1/1 (1 PENDING)
(TrainTrainable pid=41390) /home/ray/anaconda3/lib/python3.9/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
(TrainTrainable pid=41390) _torch_pytree._register_pytree_node(
(TrainTrainable pid=41390) 2025-07-09 15:56:36.371154: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
(TrainTrainable pid=41390) 2025-07-09 15:56:36.418819: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
(TrainTrainable pid=41390) 2025-07-09 15:56:36.418845: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
(TrainTrainable pid=41390) 2025-07-09 15:56:36.420083: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
(TrainTrainable pid=41390) 2025-07-09 15:56:36.427078: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
(TrainTrainable pid=41390) To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
(TrainTrainable pid=41390) 2025-07-09 15:56:37.464124: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
== Status ==
Current time: 2025-07-09 15:56:37 (running for 00:00:05.13)
Using FIFO scheduling algorithm.
Logical resource usage: 1.0/48 CPUs, 1.0/4 GPUs (0.0/1.0 anyscale/accelerator_shape:4xT4, 0.0/1.0 anyscale/node-group:head, 0.0/1.0 accelerator_type:T4, 0.0/1.0 anyscale/provider:aws, 0.0/1.0 anyscale/region:us-west-2)
Result logdir: /tmp/ray/session_2025-07-09_15-09-59_163606_3385/artifacts/2025-07-09_15-56-32/TorchTrainer_2025-07-09_15-56-32/driver_artifacts
Number of trials: 1/1 (1 PENDING)
(TrainTrainable pid=41390) /home/ray/anaconda3/lib/python3.9/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
(TrainTrainable pid=41390) _torch_pytree._register_pytree_node(
(TrainTrainable pid=41390) /home/ray/anaconda3/lib/python3.9/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
(TrainTrainable pid=41390) _torch_pytree._register_pytree_node(
(TrainTrainable pid=41390) comet_ml is installed but `COMET_API_KEY` is not set.
== Status ==
Current time: 2025-07-09 15:56:42 (running for 00:00:10.18)
Using FIFO scheduling algorithm.
Logical resource usage: 1.0/48 CPUs, 1.0/4 GPUs (0.0/1.0 anyscale/node-group:head, 0.0/1.0 anyscale/provider:aws, 0.0/1.0 accelerator_type:T4, 0.0/1.0 anyscale/region:us-west-2, 0.0/1.0 anyscale/accelerator_shape:4xT4)
Result logdir: /tmp/ray/session_2025-07-09_15-09-59_163606_3385/artifacts/2025-07-09_15-56-32/TorchTrainer_2025-07-09_15-56-32/driver_artifacts
Number of trials: 1/1 (1 RUNNING)
(RayTrainWorker pid=41521) Setting up process group for: env:// [rank=0, world_size=1]
(TorchTrainer pid=41390) Started distributed worker processes:
(TorchTrainer pid=41390) - (node_id=f67b5f412a227b4c6b3ddd85d6f5b1eecd0bd0917efa8f9cd4b5e4da, ip=10.0.114.132, pid=41521) world_rank=0, local_rank=0, node_rank=0
(RayTrainWorker pid=41521) /home/ray/anaconda3/lib/python3.9/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
(RayTrainWorker pid=41521) _torch_pytree._register_pytree_node(
(RayTrainWorker pid=41521) 2025-07-09 15:56:44.730942: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
(RayTrainWorker pid=41521) 2025-07-09 15:56:44.779207: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
(RayTrainWorker pid=41521) 2025-07-09 15:56:44.779230: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
(RayTrainWorker pid=41521) 2025-07-09 15:56:44.780437: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
(RayTrainWorker pid=41521) 2025-07-09 15:56:44.787541: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
(RayTrainWorker pid=41521) To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
(RayTrainWorker pid=41521) 2025-07-09 15:56:45.863740: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
(RayTrainWorker pid=41521) /home/ray/anaconda3/lib/python3.9/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
(RayTrainWorker pid=41521) _torch_pytree._register_pytree_node(
== Status ==
Current time: 2025-07-09 15:56:47 (running for 00:00:15.21)
Using FIFO scheduling algorithm.
Logical resource usage: 1.0/48 CPUs, 1.0/4 GPUs (0.0/1.0 anyscale/node-group:head, 0.0/1.0 anyscale/provider:aws, 0.0/1.0 accelerator_type:T4, 0.0/1.0 anyscale/region:us-west-2, 0.0/1.0 anyscale/accelerator_shape:4xT4)
Result logdir: /tmp/ray/session_2025-07-09_15-09-59_163606_3385/artifacts/2025-07-09_15-56-32/TorchTrainer_2025-07-09_15-56-32/driver_artifacts
Number of trials: 1/1 (1 RUNNING)
(RayTrainWorker pid=41521) /home/ray/anaconda3/lib/python3.9/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
(RayTrainWorker pid=41521) _torch_pytree._register_pytree_node(
(RayTrainWorker pid=41521) comet_ml is installed but `COMET_API_KEY` is not set.
(RayTrainWorker pid=41521) Is CUDA available: True
(RayTrainWorker pid=41521) /home/ray/anaconda3/lib/python3.9/site-packages/huggingface_hub/file_download.py:795: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
(RayTrainWorker pid=41521) warnings.warn(
(RayTrainWorker pid=41521) Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
(RayTrainWorker pid=41521) You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
(RayTrainWorker pid=41521) /home/ray/anaconda3/lib/python3.9/site-packages/ray/data/iterator.py:436: RayDeprecationWarning: Passing a function to `iter_torch_batches(collate_fn)` is deprecated in Ray 2.47. Please switch to using a callable class that inherits from `ArrowBatchCollateFn`, `NumpyBatchCollateFn`, or `PandasBatchCollateFn`.
(RayTrainWorker pid=41521) warnings.warn(
(RayTrainWorker pid=41521) /home/ray/anaconda3/lib/python3.9/site-packages/accelerate/accelerator.py:432: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches']). Please pass an `accelerate.DataLoaderConfiguration` instead:
(RayTrainWorker pid=41521) dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False)
(RayTrainWorker pid=41521) warnings.warn(
(RayTrainWorker pid=41521) max_steps_per_epoch: 534
(RayTrainWorker pid=41521) Starting training
(SplitCoordinator pid=41621) Registered dataset logger for dataset train_23_0
(SplitCoordinator pid=41621) Starting execution of Dataset train_23_0. Full logs are in /tmp/ray/session_2025-07-09_15-09-59_163606_3385/logs/ray-data
(SplitCoordinator pid=41621) Execution plan of Dataset train_23_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadParquet] -> OutputSplitter[split(1, equal=True)]
== Status ==
Current time: 2025-07-09 15:56:52 (running for 00:00:20.23)
Using FIFO scheduling algorithm.
Logical resource usage: 1.0/48 CPUs, 1.0/4 GPUs (0.0/1.0 anyscale/accelerator_shape:4xT4, 0.0/1.0 anyscale/provider:aws, 0.0/1.0 anyscale/region:us-west-2, 0.0/1.0 accelerator_type:T4, 0.0/1.0 anyscale/node-group:head)
Result logdir: /tmp/ray/session_2025-07-09_15-09-59_163606_3385/artifacts/2025-07-09_15-56-32/TorchTrainer_2025-07-09_15-56-32/driver_artifacts
Number of trials: 1/1 (1 RUNNING)
(RayTrainWorker pid=41521) /tmp/ipykernel_40967/133795194.py:24: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:206.)
(RayTrainWorker pid=41521) [rank0]:[W reducer.cpp:1389] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
== Status ==
Current time: 2025-07-09 15:56:57 (running for 00:00:25.25)
Using FIFO scheduling algorithm.
Logical resource usage: 1.0/48 CPUs, 1.0/4 GPUs (0.0/1.0 anyscale/accelerator_shape:4xT4, 0.0/1.0 anyscale/provider:aws, 0.0/1.0 anyscale/region:us-west-2, 0.0/1.0 accelerator_type:T4, 0.0/1.0 anyscale/node-group:head)
Result logdir: /tmp/ray/session_2025-07-09_15-09-59_163606_3385/artifacts/2025-07-09_15-56-32/TorchTrainer_2025-07-09_15-56-32/driver_artifacts
Number of trials: 1/1 (1 RUNNING)
== Status ==
Current time: 2025-07-09 15:57:02 (running for 00:00:30.27)
Using FIFO scheduling algorithm.
Logical resource usage: 1.0/48 CPUs, 1.0/4 GPUs (0.0/1.0 anyscale/provider:aws, 0.0/1.0 anyscale/region:us-west-2, 0.0/1.0 anyscale/node-group:head, 0.0/1.0 anyscale/accelerator_shape:4xT4, 0.0/1.0 accelerator_type:T4)
Result logdir: /tmp/ray/session_2025-07-09_15-09-59_163606_3385/artifacts/2025-07-09_15-56-32/TorchTrainer_2025-07-09_15-56-32/driver_artifacts
Number of trials: 1/1 (1 RUNNING)
== Status ==
Current time: 2025-07-09 15:57:07 (running for 00:00:35.29)
Using FIFO scheduling algorithm.
Logical resource usage: 1.0/48 CPUs, 1.0/4 GPUs (0.0/1.0 anyscale/provider:aws, 0.0/1.0 anyscale/region:us-west-2, 0.0/1.0 anyscale/node-group:head, 0.0/1.0 anyscale/accelerator_shape:4xT4, 0.0/1.0 accelerator_type:T4)
Result logdir: /tmp/ray/session_2025-07-09_15-09-59_163606_3385/artifacts/2025-07-09_15-56-32/TorchTrainer_2025-07-09_15-56-32/driver_artifacts
Number of trials: 1/1 (1 RUNNING)
== Status ==
Current time: 2025-07-09 15:57:12 (running for 00:00:40.32)
Using FIFO scheduling algorithm.
Logical resource usage: 1.0/48 CPUs, 1.0/4 GPUs (0.0/1.0 anyscale/provider:aws, 0.0/1.0 anyscale/accelerator_shape:4xT4, 0.0/1.0 anyscale/node-group:head, 0.0/1.0 anyscale/region:us-west-2, 0.0/1.0 accelerator_type:T4)
Result logdir: /tmp/ray/session_2025-07-09_15-09-59_163606_3385/artifacts/2025-07-09_15-56-32/TorchTrainer_2025-07-09_15-56-32/driver_artifacts
Number of trials: 1/1 (1 RUNNING)
== Status ==
Current time: 2025-07-09 15:57:17 (running for 00:00:45.34)
Using FIFO scheduling algorithm.
Logical resource usage: 1.0/48 CPUs, 1.0/4 GPUs (0.0/1.0 anyscale/provider:aws, 0.0/1.0 anyscale/accelerator_shape:4xT4, 0.0/1.0 anyscale/node-group:head, 0.0/1.0 anyscale/region:us-west-2, 0.0/1.0 accelerator_type:T4)
Result logdir: /tmp/ray/session_2025-07-09_15-09-59_163606_3385/artifacts/2025-07-09_15-56-32/TorchTrainer_2025-07-09_15-56-32/driver_artifacts
Number of trials: 1/1 (1 RUNNING)
(SplitCoordinator pid=41621) ✔️ Dataset train_23_0 execution finished in 28.21 seconds
(RayTrainWorker pid=41521) {'loss': 0.5441, 'learning_rate': 9.9812734082397e-06, 'epoch': 0.5}
(SplitCoordinator pid=41622) Registered dataset logger for dataset eval_24_0
(SplitCoordinator pid=41622) Starting execution of Dataset eval_24_0. Full logs are in /tmp/ray/session_2025-07-09_15-09-59_163606_3385/logs/ray-data
(SplitCoordinator pid=41622) Execution plan of Dataset eval_24_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadParquet] -> OutputSplitter[split(1, equal=True)]
== Status ==
Current time: 2025-07-09 15:57:22 (running for 00:00:50.36)
Using FIFO scheduling algorithm.
Logical resource usage: 1.0/48 CPUs, 1.0/4 GPUs (0.0/1.0 accelerator_type:T4, 0.0/1.0 anyscale/region:us-west-2, 0.0/1.0 anyscale/provider:aws, 0.0/1.0 anyscale/accelerator_shape:4xT4, 0.0/1.0 anyscale/node-group:head)
Result logdir: /tmp/ray/session_2025-07-09_15-09-59_163606_3385/artifacts/2025-07-09_15-56-32/TorchTrainer_2025-07-09_15-56-32/driver_artifacts
Number of trials: 1/1 (1 RUNNING)
(RayTrainWorker pid=41521) {'eval_loss': 0.51453697681427, 'eval_matthews_correlation': 0.37793570732654813, 'eval_runtime': 1.8456, 'eval_samples_per_second': 565.126, 'eval_steps_per_second': 35.761, 'epoch': 0.5}
2025-07-09 15:57:26,970 WARNING experiment_state.py:206 -- Experiment state snapshotting has been triggered multiple times in the last 5.0 seconds and may become a bottleneck. A snapshot is forced if `CheckpointConfig(num_to_keep)` is set, and a trial has checkpointed >= `num_to_keep` times since the last snapshot.
You may want to consider increasing the `CheckpointConfig(num_to_keep)` or decreasing the frequency of saving checkpoints.
You can suppress this warning by setting the environment variable TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S to a smaller value than the current threshold (5.0). Set it to 0 to completely suppress this warning.
(RayTrainWorker pid=41521) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/ray/ray_results/TorchTrainer_2025-07-09_15-56-32/TorchTrainer_f5114_00000_0_2025-07-09_15-56-32/checkpoint_000000)
(SplitCoordinator pid=41622) ✔️ Dataset eval_24_0 execution finished in 1.73 seconds
== Status ==
Current time: 2025-07-09 15:57:27 (running for 00:00:55.36)
Using FIFO scheduling algorithm.
Logical resource usage: 1.0/48 CPUs, 1.0/4 GPUs (0.0/1.0 accelerator_type:T4, 0.0/1.0 anyscale/region:us-west-2, 0.0/1.0 anyscale/provider:aws, 0.0/1.0 anyscale/accelerator_shape:4xT4, 0.0/1.0 anyscale/node-group:head)
Result logdir: /tmp/ray/session_2025-07-09_15-09-59_163606_3385/artifacts/2025-07-09_15-56-32/TorchTrainer_2025-07-09_15-56-32/driver_artifacts
Number of trials: 1/1 (1 RUNNING)
== Status ==
Current time: 2025-07-09 15:57:32 (running for 00:01:00.38)
Using FIFO scheduling algorithm.
Logical resource usage: 1.0/48 CPUs, 1.0/4 GPUs (0.0/1.0 anyscale/region:us-west-2, 0.0/1.0 anyscale/provider:aws, 0.0/1.0 accelerator_type:T4, 0.0/1.0 anyscale/accelerator_shape:4xT4, 0.0/1.0 anyscale/node-group:head)
Result logdir: /tmp/ray/session_2025-07-09_15-09-59_163606_3385/artifacts/2025-07-09_15-56-32/TorchTrainer_2025-07-09_15-56-32/driver_artifacts
Number of trials: 1/1 (1 RUNNING)
== Status ==
Current time: 2025-07-09 15:57:38 (running for 00:01:05.41)
Using FIFO scheduling algorithm.
Logical resource usage: 1.0/48 CPUs, 1.0/4 GPUs (0.0/1.0 anyscale/region:us-west-2, 0.0/1.0 anyscale/provider:aws, 0.0/1.0 accelerator_type:T4, 0.0/1.0 anyscale/accelerator_shape:4xT4, 0.0/1.0 anyscale/node-group:head)
Result logdir: /tmp/ray/session_2025-07-09_15-09-59_163606_3385/artifacts/2025-07-09_15-56-32/TorchTrainer_2025-07-09_15-56-32/driver_artifacts
Number of trials: 1/1 (1 RUNNING)
== Status ==
Current time: 2025-07-09 15:57:43 (running for 00:01:10.43)
Using FIFO scheduling algorithm.
Logical resource usage: 1.0/48 CPUs, 1.0/4 GPUs (0.0/1.0 anyscale/node-group:head, 0.0/1.0 anyscale/region:us-west-2, 0.0/1.0 accelerator_type:T4, 0.0/1.0 anyscale/accelerator_shape:4xT4, 0.0/1.0 anyscale/provider:aws)
Result logdir: /tmp/ray/session_2025-07-09_15-09-59_163606_3385/artifacts/2025-07-09_15-56-32/TorchTrainer_2025-07-09_15-56-32/driver_artifacts
Number of trials: 1/1 (1 RUNNING)
== Status ==
Current time: 2025-07-09 15:57:48 (running for 00:01:15.45)
Using FIFO scheduling algorithm.
Logical resource usage: 1.0/48 CPUs, 1.0/4 GPUs (0.0/1.0 anyscale/node-group:head, 0.0/1.0 anyscale/region:us-west-2, 0.0/1.0 accelerator_type:T4, 0.0/1.0 anyscale/accelerator_shape:4xT4, 0.0/1.0 anyscale/provider:aws)
Result logdir: /tmp/ray/session_2025-07-09_15-09-59_163606_3385/artifacts/2025-07-09_15-56-32/TorchTrainer_2025-07-09_15-56-32/driver_artifacts
Number of trials: 1/1 (1 RUNNING)
== Status ==
Current time: 2025-07-09 15:57:53 (running for 00:01:20.47)
Using FIFO scheduling algorithm.
Logical resource usage: 1.0/48 CPUs, 1.0/4 GPUs (0.0/1.0 anyscale/region:us-west-2, 0.0/1.0 anyscale/provider:aws, 0.0/1.0 anyscale/accelerator_shape:4xT4, 0.0/1.0 anyscale/node-group:head, 0.0/1.0 accelerator_type:T4)
Result logdir: /tmp/ray/session_2025-07-09_15-09-59_163606_3385/artifacts/2025-07-09_15-56-32/TorchTrainer_2025-07-09_15-56-32/driver_artifacts
Number of trials: 1/1 (1 RUNNING)
(SplitCoordinator pid=41621) ✔️ Dataset train_23_1 execution finished in 26.58 seconds
(SplitCoordinator pid=41621) Registered dataset logger for dataset train_23_1
(SplitCoordinator pid=41621) Starting execution of Dataset train_23_1. Full logs are in /tmp/ray/session_2025-07-09_15-09-59_163606_3385/logs/ray-data
(SplitCoordinator pid=41621) Execution plan of Dataset train_23_1: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadParquet] -> OutputSplitter[split(1, equal=True)]
(RayTrainWorker pid=41521) {'loss': 0.3864, 'learning_rate': 0.0, 'epoch': 1.5}
(SplitCoordinator pid=41622) Registered dataset logger for dataset eval_24_1
(SplitCoordinator pid=41622) Starting execution of Dataset eval_24_1. Full logs are in /tmp/ray/session_2025-07-09_15-09-59_163606_3385/logs/ray-data
(SplitCoordinator pid=41622) Execution plan of Dataset eval_24_1: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadParquet] -> OutputSplitter[split(1, equal=True)]
(RayTrainWorker pid=41521) {'eval_loss': 0.5683005452156067, 'eval_matthews_correlation': 0.45115517656589194, 'eval_runtime': 1.6027, 'eval_samples_per_second': 650.77, 'eval_steps_per_second': 41.18, 'epoch': 1.5}
== Status ==
Current time: 2025-07-09 15:57:58 (running for 00:01:25.49)
Using FIFO scheduling algorithm.
Logical resource usage: 1.0/48 CPUs, 1.0/4 GPUs (0.0/1.0 anyscale/region:us-west-2, 0.0/1.0 anyscale/provider:aws, 0.0/1.0 anyscale/accelerator_shape:4xT4, 0.0/1.0 anyscale/node-group:head, 0.0/1.0 accelerator_type:T4)
Result logdir: /tmp/ray/session_2025-07-09_15-09-59_163606_3385/artifacts/2025-07-09_15-56-32/TorchTrainer_2025-07-09_15-56-32/driver_artifacts
Number of trials: 1/1 (1 RUNNING)
2025-07-09 15:57:59,354 WARNING experiment_state.py:206 -- Experiment state snapshotting has been triggered multiple times in the last 5.0 seconds and may become a bottleneck. A snapshot is forced if `CheckpointConfig(num_to_keep)` is set, and a trial has checkpointed >= `num_to_keep` times since the last snapshot.
You may want to consider increasing the `CheckpointConfig(num_to_keep)` or decreasing the frequency of saving checkpoints.
You can suppress this warning by setting the environment variable TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S to a smaller value than the current threshold (5.0). Set it to 0 to completely suppress this warning.
(RayTrainWorker pid=41521) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/ray/ray_results/TorchTrainer_2025-07-09_15-56-32/TorchTrainer_f5114_00000_0_2025-07-09_15-56-32/checkpoint_000001)
(SplitCoordinator pid=41622) ✔️ Dataset eval_24_1 execution finished in 1.49 seconds
(RayTrainWorker pid=41521) {'train_runtime': 66.7725, 'train_samples_per_second': 255.914, 'train_steps_per_second': 15.995, 'train_loss': 0.4653928092356478, 'epoch': 1.5}
2025-07-09 15:58:00,649 INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/home/ray/ray_results/TorchTrainer_2025-07-09_15-56-32' in 0.0022s.
2025-07-09 15:58:00,651 INFO tune.py:1041 -- Total run time: 88.09 seconds (88.03 seconds for the tuning loop).
== Status ==
Current time: 2025-07-09 15:58:00 (running for 00:01:28.04)
Using FIFO scheduling algorithm.
Logical resource usage: 1.0/48 CPUs, 1.0/4 GPUs (0.0/1.0 anyscale/region:us-west-2, 0.0/1.0 anyscale/provider:aws, 0.0/1.0 anyscale/accelerator_shape:4xT4, 0.0/1.0 anyscale/node-group:head, 0.0/1.0 accelerator_type:T4)
Result logdir: /tmp/ray/session_2025-07-09_15-09-59_163606_3385/artifacts/2025-07-09_15-56-32/TorchTrainer_2025-07-09_15-56-32/driver_artifacts
Number of trials: 1/1 (1 TERMINATED)
You can use the returned Result object to access metrics and the Ray Train Checkpoint associated with the last iteration.
result
Result(
metrics={'loss': 0.3864, 'learning_rate': 0.0, 'epoch': 1.5, 'step': 1068, 'eval_loss': 0.5683005452156067, 'eval_matthews_correlation': 0.45115517656589194, 'eval_runtime': 1.6027, 'eval_samples_per_second': 650.77, 'eval_steps_per_second': 41.18},
path='/home/ray/ray_results/TorchTrainer_2025-07-09_15-56-32/TorchTrainer_f5114_00000_0_2025-07-09_15-56-32',
filesystem='local',
checkpoint=Checkpoint(filesystem=local, path=/home/ray/ray_results/TorchTrainer_2025-07-09_15-56-32/TorchTrainer_f5114_00000_0_2025-07-09_15-56-32/checkpoint_000001)
)
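As a minimal sketch of reading those metrics, the snippet below copies the values from the Result printed above into a plain dictionary named metrics. The dictionary is a stand-in so the example runs without a cluster; in the notebook you would assign metrics = result.metrics instead. Filtering on the eval_ prefix is only an illustration, not part of the Ray API.

```python
# Stand-in for `result.metrics`; values copied from the Result printed above.
# In the notebook, use `metrics = result.metrics` instead of this literal.
metrics = {
    "loss": 0.3864,
    "learning_rate": 0.0,
    "epoch": 1.5,
    "step": 1068,
    "eval_loss": 0.5683005452156067,
    "eval_matthews_correlation": 0.45115517656589194,
    "eval_runtime": 1.6027,
    "eval_samples_per_second": 650.77,
    "eval_steps_per_second": 41.18,
}

# Keep only the evaluation metrics reported by the Hugging Face Trainer.
eval_metrics = {k: v for k, v in metrics.items() if k.startswith("eval_")}

# Build a one-line summary of the last reported evaluation.
summary = (
    f"step {metrics['step']}: "
    f"eval_loss={eval_metrics['eval_loss']:.4f}, "
    f"matthews_correlation={eval_metrics['eval_matthews_correlation']:.4f}"
)
print(summary)
```

The checkpoint is exposed the same way through result.checkpoint, and its path matches the checkpoint_000001 directory shown in the output.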
Tune hyperparameters with Ray Tune#
To tune any hyperparameters of the model, pass your TorchTrainer into a Tuner and define the search space.
You can also take advantage of the advanced search algorithms and schedulers from Ray Tune. This example uses an ASHAScheduler to aggressively terminate underperforming trials.
from ray import tune
from ray.tune import Tuner
from ray.tune.schedulers.async_hyperband import ASHAScheduler

tune_epochs = 4
tuner = Tuner(
    trainer,
    param_space={
        "train_loop_config": {
            "learning_rate": tune.grid_search([2e-5, 2e-4, 2e-3, 2e-2]),
            "epochs": tune_epochs,
        }
    },
    tune_config=tune.TuneConfig(
        metric="eval_loss",
        mode="min",
        num_samples=1,
        scheduler=ASHAScheduler(
            max_t=tune_epochs,
        ),
    ),
    run_config=RunConfig(
        name="tune_transformers",
        checkpoint_config=CheckpointConfig(
            num_to_keep=1,
            checkpoint_score_attribute="eval_loss",
            checkpoint_score_order="min",
        ),
    ),
)
/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/impl/tuner_internal.py:108: RayDeprecationWarning: The Ray Train + Ray Tune integration has been reworked. Passing a Trainer to the Tuner is deprecated and will be removed in a future release. See this issue for more context and migration options: https://github.com/ray-project/ray/issues/49454. Disable these warnings by setting the environment variable: RAY_TRAIN_ENABLE_V2_MIGRATION_WARNINGS=0
_log_deprecation_warning(
2025-07-09 15:58:50,737 INFO tuner_internal.py:427 -- A `RunConfig` was passed to both the `Tuner` and the `TorchTrainer`. The run config passed to the `Tuner` is the one that will be used.
/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/impl/tuner_internal.py:144: RayDeprecationWarning: The `RunConfig` class should be imported from `ray.tune` when passing it to the Tuner. Please update your imports. See this issue for more context and migration options: https://github.com/ray-project/ray/issues/49454. Disable these warnings by setting the environment variable: RAY_TRAIN_ENABLE_V2_MIGRATION_WARNINGS=0
_log_deprecation_warning(
tune_results = tuner.fit()
Tune Status
Current time: | 2025-07-09 16:01:22 |
Running for: | 00:02:31.82 |
Memory: | 21.8/186.7 GiB |
System Info
Using AsyncHyperBand: num_stopped=4
Bracket: Iter 4.000: -0.6557375341653824 | Iter 1.000: -0.5925458520650864
Logical resource usage: 1.0/48 CPUs, 1.0/4 GPUs (0.0/1.0 anyscale/accelerator_shape:4xT4, 0.0/1.0 anyscale/region:us-west-2, 0.0/1.0 anyscale/provider:aws, 0.0/1.0 anyscale/node-group:head, 0.0/1.0 accelerator_type:T4)
Trial Status
Trial name | status | loc | train_loop_config/learning_rate | iter | total time (s) | loss | learning_rate | epoch |
---|---|---|---|---|---|---|---|---|
TorchTrainer_4776a_00000 | TERMINATED | 10.0.114.132:42556 | 2e-05 | 4 | 142.984 | 0.1999 | 0 | 3.25 |
TorchTrainer_4776a_00001 | TERMINATED | 10.0.114.132:42555 | 0.0002 | 4 | 140.012 | 0.6062 | 0 | 3.25 |
TorchTrainer_4776a_00002 | TERMINATED | 10.0.114.132:42554 | 0.002 | 1 | 45.3344 | 0.6338 | 0.00149906 | 0.25 |
TorchTrainer_4776a_00003 | TERMINATED | 10.0.114.132:42557 | 0.02 | 1 | 44.3268 | 1.0524 | 0.0149906 | 0.25 |
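The two trials that stopped after one iteration reflect how successive halving works: trials are compared at a small set of milestone iterations (rungs), and only the better-performing ones continue past each rung. The sketch below computes those milestones, assuming a grace period of 1 and a reduction factor of 4; the example only sets max_t, so these two values are assumptions about the scheduler's defaults. The helper rung_milestones is hypothetical and is a simplified illustration, not Ray Tune's implementation.

```python
def rung_milestones(max_t, grace_period=1, reduction_factor=4):
    """Return the iterations at which successive halving compares trials.

    Hypothetical helper for illustration; not part of Ray Tune.
    grace_period and reduction_factor are assumed values.
    """
    milestones = []
    t = grace_period
    # Each rung is `reduction_factor` times later than the previous one.
    while t <= max_t:
        milestones.append(t)
        t *= reduction_factor
    return milestones


tune_epochs = 4  # same value as in the Tuner configuration above
print(rung_milestones(tune_epochs))
```

Under these assumptions, max_t=4 gives rungs at iterations 1 and 4, which lines up with the "Iter 1.000" and "Iter 4.000" entries in the AsyncHyperBand status line and with the iter column of the trial table.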
(TrainTrainable pid=42555) /home/ray/anaconda3/lib/python3.9/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
(TrainTrainable pid=42555) _torch_pytree._register_pytree_node(
(TrainTrainable pid=42555) 2025-07-09 15:58:54.742632: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
(TrainTrainable pid=42555) 2025-07-09 15:58:54.791129: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
(TrainTrainable pid=42555) 2025-07-09 15:58:54.791160: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
(TrainTrainable pid=42555) 2025-07-09 15:58:54.792360: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
(TrainTrainable pid=42555) 2025-07-09 15:58:54.799462: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
(TrainTrainable pid=42555) To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
(TrainTrainable pid=42555) 2025-07-09 15:58:55.891590: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
(TrainTrainable pid=42555) comet_ml is installed but `COMET_API_KEY` is not set.
(TrainTrainable pid=42557) /home/ray/anaconda3/lib/python3.9/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead. [repeated 11x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(TrainTrainable pid=42557) _torch_pytree._register_pytree_node( [repeated 11x across cluster]
(RayTrainWorker pid=42930) Setting up process group for: env:// [rank=0, world_size=1]
(TrainTrainable pid=42557) 2025-07-09 15:58:54.846302: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`. [repeated 3x across cluster]
(TrainTrainable pid=42557) 2025-07-09 15:58:54.894258: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered [repeated 3x across cluster]
(TrainTrainable pid=42557) 2025-07-09 15:58:54.894288: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered [repeated 3x across cluster]
(TrainTrainable pid=42557) 2025-07-09 15:58:54.895511: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered [repeated 3x across cluster]
(TrainTrainable pid=42557) 2025-07-09 15:58:54.902693: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. [repeated 3x across cluster]
(TrainTrainable pid=42557) To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags. [repeated 3x across cluster]
(TrainTrainable pid=42557) 2025-07-09 15:58:55.983418: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT [repeated 3x across cluster]
(TorchTrainer pid=42555) Started distributed worker processes:
(TorchTrainer pid=42555) - (node_id=f67b5f412a227b4c6b3ddd85d6f5b1eecd0bd0917efa8f9cd4b5e4da, ip=10.0.114.132, pid=42936) world_rank=0, local_rank=0, node_rank=0
(TrainTrainable pid=42557) comet_ml is installed but `COMET_API_KEY` is not set. [repeated 3x across cluster]
(RayTrainWorker pid=42945) /home/ray/anaconda3/lib/python3.9/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead. [repeated 4x across cluster]
(RayTrainWorker pid=42945) _torch_pytree._register_pytree_node( [repeated 4x across cluster]
(RayTrainWorker pid=42945) Setting up process group for: env:// [rank=0, world_size=1] [repeated 3x across cluster]
(RayTrainWorker pid=42945) 2025-07-09 15:59:04.120920: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`. [repeated 4x across cluster]
(RayTrainWorker pid=42945) 2025-07-09 15:59:04.169996: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered [repeated 4x across cluster]
(RayTrainWorker pid=42945) 2025-07-09 15:59:04.170027: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered [repeated 4x across cluster]
(RayTrainWorker pid=42945) 2025-07-09 15:59:04.171251: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered [repeated 4x across cluster]
(RayTrainWorker pid=42945) 2025-07-09 15:59:04.178492: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. [repeated 4x across cluster]
(RayTrainWorker pid=42945) To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags. [repeated 4x across cluster]
(RayTrainWorker pid=42945) 2025-07-09 15:59:05.298407: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT [repeated 4x across cluster]
(TorchTrainer pid=42554) Started distributed worker processes: [repeated 3x across cluster]
(TorchTrainer pid=42554) - (node_id=f67b5f412a227b4c6b3ddd85d6f5b1eecd0bd0917efa8f9cd4b5e4da, ip=10.0.114.132, pid=42945) world_rank=0, local_rank=0, node_rank=0 [repeated 3x across cluster]
(RayTrainWorker pid=42936) Is CUDA available: True
(RayTrainWorker pid=42936) /home/ray/anaconda3/lib/python3.9/site-packages/huggingface_hub/file_download.py:795: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
(RayTrainWorker pid=42936) warnings.warn(
(RayTrainWorker pid=42936) max_steps_per_epoch: 534
(RayTrainWorker pid=42936) Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classifier.weight', 'classifier.bias']
(RayTrainWorker pid=42936) You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
(RayTrainWorker pid=42936) /home/ray/anaconda3/lib/python3.9/site-packages/ray/data/iterator.py:436: RayDeprecationWarning: Passing a function to `iter_torch_batches(collate_fn)` is deprecated in Ray 2.47. Please switch to using a callable class that inherits from `ArrowBatchCollateFn`, `NumpyBatchCollateFn`, or `PandasBatchCollateFn`.
(RayTrainWorker pid=42936) /home/ray/anaconda3/lib/python3.9/site-packages/accelerate/accelerator.py:432: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches']). Please pass an `accelerate.DataLoaderConfiguration` instead:
(RayTrainWorker pid=42936) dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False)
(RayTrainWorker pid=42931) Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.bias', 'classifier.weight', 'pre_classifier.weight']
(RayTrainWorker pid=42930) Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'classifier.bias', 'pre_classifier.bias']
(RayTrainWorker pid=42945) comet_ml is installed but `COMET_API_KEY` is not set. [repeated 4x across cluster]
(RayTrainWorker pid=42945) /home/ray/anaconda3/lib/python3.9/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead. [repeated 8x across cluster]
(RayTrainWorker pid=42945) _torch_pytree._register_pytree_node( [repeated 8x across cluster]
(RayTrainWorker pid=42936) Starting training
(RayTrainWorker pid=42945) Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
(SplitCoordinator pid=43278) Registered dataset logger for dataset train_25_0
(SplitCoordinator pid=43278) Starting execution of Dataset train_25_0. Full logs are in /tmp/ray/session_2025-07-09_15-09-59_163606_3385/logs/ray-data
(SplitCoordinator pid=43278) Execution plan of Dataset train_25_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadParquet] -> OutputSplitter[split(1, equal=True)]
(RayTrainWorker pid=42936) /tmp/ipykernel_40967/133795194.py:24: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:206.)
(RayTrainWorker pid=42936) [rank0]:[W reducer.cpp:1389] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
(RayTrainWorker pid=42945) /home/ray/anaconda3/lib/python3.9/site-packages/huggingface_hub/file_download.py:795: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`. [repeated 3x across cluster]
(RayTrainWorker pid=42945) warnings.warn( [repeated 11x across cluster]
(RayTrainWorker pid=42945) Is CUDA available: True [repeated 3x across cluster]
(RayTrainWorker pid=42945) max_steps_per_epoch: 534 [repeated 3x across cluster]
(RayTrainWorker pid=42945) Starting training [repeated 3x across cluster]
(SplitCoordinator pid=43278) ✔️ Dataset train_25_0 execution finished in 26.65 seconds
(RayTrainWorker pid=42945) You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. [repeated 3x across cluster]
(RayTrainWorker pid=42945) /home/ray/anaconda3/lib/python3.9/site-packages/ray/data/iterator.py:436: RayDeprecationWarning: Passing a function to `iter_torch_batches(collate_fn)` is deprecated in Ray 2.47. Please switch to using a callable class that inherits from `ArrowBatchCollateFn`, `NumpyBatchCollateFn`, or `PandasBatchCollateFn`. [repeated 3x across cluster]
(RayTrainWorker pid=42945) /home/ray/anaconda3/lib/python3.9/site-packages/accelerate/accelerator.py:432: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches']). Please pass an `accelerate.DataLoaderConfiguration` instead: [repeated 3x across cluster]
(RayTrainWorker pid=42945) dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False) [repeated 3x across cluster]
(SplitCoordinator pid=43310) Registered dataset logger for dataset train_31_0 [repeated 3x across cluster]
(SplitCoordinator pid=43310) Starting execution of Dataset train_31_0. Full logs are in /tmp/ray/session_2025-07-09_15-09-59_163606_3385/logs/ray-data [repeated 3x across cluster]
(SplitCoordinator pid=43310) Execution plan of Dataset train_31_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadParquet] -> OutputSplitter[split(1, equal=True)] [repeated 3x across cluster]
(RayTrainWorker pid=42945) /tmp/ipykernel_40967/133795194.py:24: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:206.) [repeated 3x across cluster]
(RayTrainWorker pid=42945) [rank0]:[W reducer.cpp:1389] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator()) [repeated 3x across cluster]
(RayTrainWorker pid=42936) {'loss': 0.6202, 'learning_rate': 0.0001499063670411985, 'epoch': 0.25}
(RayTrainWorker pid=42936) {'eval_loss': 0.6168375611305237, 'eval_matthews_correlation': 0.0, 'eval_runtime': 1.73, 'eval_samples_per_second': 602.874, 'eval_steps_per_second': 38.149, 'epoch': 0.25}
(SplitCoordinator pid=43293) ✔️ Dataset eval_30_0 execution finished in 1.44 seconds [repeated 7x across cluster]
(SplitCoordinator pid=43293) Registered dataset logger for dataset eval_30_0 [repeated 4x across cluster]
(SplitCoordinator pid=43293) Starting execution of Dataset eval_30_0. Full logs are in /tmp/ray/session_2025-07-09_15-09-59_163606_3385/logs/ray-data [repeated 4x across cluster]
(SplitCoordinator pid=43293) Execution plan of Dataset eval_30_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadParquet] -> OutputSplitter[split(1, equal=True)] [repeated 4x across cluster]
2025-07-09 15:59:43,414 WARNING experiment_state.py:206 -- Experiment state snapshotting has been triggered multiple times in the last 5.0 seconds and may become a bottleneck. A snapshot is forced if `CheckpointConfig(num_to_keep)` is set, and a trial has checkpointed >= `num_to_keep` times since the last snapshot.
You may want to consider increasing the `CheckpointConfig(num_to_keep)` or decreasing the frequency of saving checkpoints.
You can suppress this warning by setting the environment variable TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S to a smaller value than the current threshold (5.0). Set it to 0 to completely suppress this warning.
(RayTrainWorker pid=42936) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/ray/ray_results/tune_transformers/TorchTrainer_4776a_00001_1_learning_rate=0.0002_2025-07-09_15-58-50/checkpoint_000000)
2025-07-09 15:59:43,850 WARNING experiment_state.py:206 -- Experiment state snapshotting has been triggered multiple times in the last 5.0 seconds and may become a bottleneck. A snapshot is forced if `CheckpointConfig(num_to_keep)` is set, and a trial has checkpointed >= `num_to_keep` times since the last snapshot.
You may want to consider increasing the `CheckpointConfig(num_to_keep)` or decreasing the frequency of saving checkpoints.
You can suppress this warning by setting the environment variable TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S to a smaller value than the current threshold (5.0). Set it to 0 to completely suppress this warning.
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff767201a67d246bccb2cfe99f04000000 Worker ID: 473432f94d2e6c055386344cc7ef057e60305c3612b2110043c15195 Node ID: f67b5f412a227b4c6b3ddd85d6f5b1eecd0bd0917efa8f9cd4b5e4da Worker IP address: 10.0.114.132 Worker port: 10235 Worker PID: 43285 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker exits unexpectedly by a signal. SystemExit is raised (sys.exit is called). Exit code: 1. The process receives a SIGTERM.
(RayTrainWorker pid=42930) {'loss': 0.5474, 'learning_rate': 1.4990636704119851e-05, 'epoch': 0.25} [repeated 3x across cluster]
2025-07-09 15:59:44,961 WARNING experiment_state.py:206 -- Experiment state snapshotting has been triggered multiple times in the last 5.0 seconds and may become a bottleneck. A snapshot is forced if `CheckpointConfig(num_to_keep)` is set, and a trial has checkpointed >= `num_to_keep` times since the last snapshot.
You may want to consider increasing the `CheckpointConfig(num_to_keep)` or decreasing the frequency of saving checkpoints.
You can suppress this warning by setting the environment variable TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S to a smaller value than the current threshold (5.0). Set it to 0 to completely suppress this warning.
(RayTrainWorker pid=42930) {'eval_loss': 0.5196707248687744, 'eval_matthews_correlation': 0.38334289753241174, 'eval_runtime': 1.5564, 'eval_samples_per_second': 670.155, 'eval_steps_per_second': 42.407, 'epoch': 0.25} [repeated 3x across cluster]
(SplitCoordinator pid=43278) ✔️ Dataset train_25_1 execution finished in 26.01 seconds
(SplitCoordinator pid=43292) Registered dataset logger for dataset train_29_1 [repeated 2x across cluster]
(SplitCoordinator pid=43292) Starting execution of Dataset train_29_1. Full logs are in /tmp/ray/session_2025-07-09_15-09-59_163606_3385/logs/ray-data [repeated 2x across cluster]
(SplitCoordinator pid=43292) Execution plan of Dataset train_29_1: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadParquet] -> OutputSplitter[split(1, equal=True)] [repeated 2x across cluster]
(RayTrainWorker pid=42930) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/ray/ray_results/tune_transformers/TorchTrainer_4776a_00000_0_learning_rate=0.0000_2025-07-09_15-58-50/checkpoint_000000) [repeated 3x across cluster]
(RayTrainWorker pid=42936) {'loss': 0.6118, 'learning_rate': 9.981273408239701e-05, 'epoch': 1.25}
(SplitCoordinator pid=43292) ✔️ Dataset train_29_1 execution finished in 26.43 seconds
(RayTrainWorker pid=42936) {'eval_loss': 0.6183397769927979, 'eval_matthews_correlation': 0.0, 'eval_runtime': 1.6008, 'eval_samples_per_second': 651.532, 'eval_steps_per_second': 41.228, 'epoch': 1.25}
(SplitCoordinator pid=43293) Registered dataset logger for dataset eval_30_1 [repeated 2x across cluster]
(SplitCoordinator pid=43293) Starting execution of Dataset eval_30_1. Full logs are in /tmp/ray/session_2025-07-09_15-09-59_163606_3385/logs/ray-data [repeated 2x across cluster]
(SplitCoordinator pid=43293) Execution plan of Dataset eval_30_1: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadParquet] -> OutputSplitter[split(1, equal=True)] [repeated 2x across cluster]
(RayTrainWorker pid=42936) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/ray/ray_results/tune_transformers/TorchTrainer_4776a_00001_1_learning_rate=0.0002_2025-07-09_15-58-50/checkpoint_000001)
(RayTrainWorker pid=42930) {'loss': 0.3907, 'learning_rate': 9.9812734082397e-06, 'epoch': 1.25}
2025-07-09 16:00:17,783 WARNING experiment_state.py:206 -- Experiment state snapshotting has been triggered multiple times in the last 5.0 seconds and may become a bottleneck. A snapshot is forced if `CheckpointConfig(num_to_keep)` is set, and a trial has checkpointed >= `num_to_keep` times since the last snapshot.
You may want to consider increasing the `CheckpointConfig(num_to_keep)` or decreasing the frequency of saving checkpoints.
You can suppress this warning by setting the environment variable TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S to a smaller value than the current threshold (5.0). Set it to 0 to completely suppress this warning.
(SplitCoordinator pid=43293) ✔️ Dataset eval_30_1 execution finished in 1.42 seconds [repeated 2x across cluster]
(RayTrainWorker pid=42930) {'eval_loss': 0.5574285387992859, 'eval_matthews_correlation': 0.4857615494749571, 'eval_runtime': 1.5327, 'eval_samples_per_second': 680.485, 'eval_steps_per_second': 43.06, 'epoch': 1.25}
(SplitCoordinator pid=43292) Registered dataset logger for dataset train_29_2 [repeated 2x across cluster]
(SplitCoordinator pid=43292) Starting execution of Dataset train_29_2. Full logs are in /tmp/ray/session_2025-07-09_15-09-59_163606_3385/logs/ray-data [repeated 2x across cluster]
(SplitCoordinator pid=43292) Execution plan of Dataset train_29_2: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadParquet] -> OutputSplitter[split(1, equal=True)] [repeated 2x across cluster]
(RayTrainWorker pid=42930) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/ray/ray_results/tune_transformers/TorchTrainer_4776a_00000_0_learning_rate=0.0000_2025-07-09_15-58-50/checkpoint_000001)
(SplitCoordinator pid=43278) ✔️ Dataset train_25_2 execution finished in 26.23 seconds
(RayTrainWorker pid=42936) {'loss': 0.6084, 'learning_rate': 4.971910112359551e-05, 'epoch': 2.25}
(SplitCoordinator pid=43292) ✔️ Dataset train_29_2 execution finished in 26.62 seconds
(RayTrainWorker pid=42936) {'eval_loss': 0.6190042495727539, 'eval_matthews_correlation': 0.0, 'eval_runtime': 1.74, 'eval_samples_per_second': 599.435, 'eval_steps_per_second': 37.932, 'epoch': 2.25}
(RayTrainWorker pid=42936) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/ray/ray_results/tune_transformers/TorchTrainer_4776a_00001_1_learning_rate=0.0002_2025-07-09_15-58-50/checkpoint_000002)
(SplitCoordinator pid=43293) Registered dataset logger for dataset eval_30_2 [repeated 2x across cluster]
(SplitCoordinator pid=43293) Starting execution of Dataset eval_30_2. Full logs are in /tmp/ray/session_2025-07-09_15-09-59_163606_3385/logs/ray-data [repeated 2x across cluster]
(SplitCoordinator pid=43293) Execution plan of Dataset eval_30_2: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadParquet] -> OutputSplitter[split(1, equal=True)] [repeated 2x across cluster]
(RayTrainWorker pid=42930) {'loss': 0.2658, 'learning_rate': 4.971910112359551e-06, 'epoch': 2.25}
(SplitCoordinator pid=43293) ✔️ Dataset eval_30_2 execution finished in 1.39 seconds [repeated 2x across cluster]
2025-07-09 16:00:50,387 WARNING experiment_state.py:206 -- Experiment state snapshotting has been triggered multiple times in the last 5.0 seconds and may become a bottleneck. A snapshot is forced if `CheckpointConfig(num_to_keep)` is set, and a trial has checkpointed >= `num_to_keep` times since the last snapshot.
You may want to consider increasing the `CheckpointConfig(num_to_keep)` or decreasing the frequency of saving checkpoints.
You can suppress this warning by setting the environment variable TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S to a smaller value than the current threshold (5.0). Set it to 0 to completely suppress this warning.
(RayTrainWorker pid=42930) {'eval_loss': 0.6665876507759094, 'eval_matthews_correlation': 0.5282217682774969, 'eval_runtime': 1.5007, 'eval_samples_per_second': 695.026, 'eval_steps_per_second': 43.981, 'epoch': 2.25}
(RayTrainWorker pid=42930) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/ray/ray_results/tune_transformers/TorchTrainer_4776a_00000_0_learning_rate=0.0000_2025-07-09_15-58-50/checkpoint_000002)
(SplitCoordinator pid=43292) Registered dataset logger for dataset train_29_3 [repeated 2x across cluster]
(SplitCoordinator pid=43292) Starting execution of Dataset train_29_3. Full logs are in /tmp/ray/session_2025-07-09_15-09-59_163606_3385/logs/ray-data [repeated 2x across cluster]
(SplitCoordinator pid=43292) Execution plan of Dataset train_29_3: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadParquet] -> OutputSplitter[split(1, equal=True)] [repeated 2x across cluster]
(SplitCoordinator pid=43278) ✔️ Dataset train_25_3 execution finished in 26.13 seconds
(RayTrainWorker pid=42936) {'loss': 0.6062, 'learning_rate': 0.0, 'epoch': 3.25}
(SplitCoordinator pid=43292) ✔️ Dataset train_29_3 execution finished in 26.48 seconds
(RayTrainWorker pid=42936) {'eval_loss': 0.6288657784461975, 'eval_matthews_correlation': 0.0, 'eval_runtime': 1.5236, 'eval_samples_per_second': 684.579, 'eval_steps_per_second': 43.319, 'epoch': 3.25}
(RayTrainWorker pid=42936) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/ray/ray_results/tune_transformers/TorchTrainer_4776a_00001_1_learning_rate=0.0002_2025-07-09_15-58-50/checkpoint_000003)
(SplitCoordinator pid=43293) Registered dataset logger for dataset eval_30_3 [repeated 2x across cluster]
(SplitCoordinator pid=43293) Starting execution of Dataset eval_30_3. Full logs are in /tmp/ray/session_2025-07-09_15-09-59_163606_3385/logs/ray-data [repeated 2x across cluster]
(SplitCoordinator pid=43293) Execution plan of Dataset eval_30_3: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadParquet] -> OutputSplitter[split(1, equal=True)] [repeated 2x across cluster]
(RayTrainWorker pid=42936) {'train_runtime': 129.007, 'train_samples_per_second': 264.916, 'train_steps_per_second': 16.557, 'train_loss': 0.6116742623432745, 'epoch': 3.25}
(SplitCoordinator pid=43293) ✔️ Dataset eval_30_3 execution finished in 1.46 seconds [repeated 2x across cluster]
2025-07-09 16:01:22,622 WARNING experiment_state.py:206 -- Experiment state snapshotting has been triggered multiple times in the last 5.0 seconds and may become a bottleneck. A snapshot is forced if `CheckpointConfig(num_to_keep)` is set, and a trial has checkpointed >= `num_to_keep` times since the last snapshot.
You may want to consider increasing the `CheckpointConfig(num_to_keep)` or decreasing the frequency of saving checkpoints.
You can suppress this warning by setting the environment variable TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S to a smaller value than the current threshold (5.0). Set it to 0 to completely suppress this warning.
2025-07-09 16:01:22,626 INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/home/ray/ray_results/tune_transformers' in 0.0024s.
2025-07-09 16:01:22,631 INFO tune.py:1041 -- Total run time: 151.83 seconds (151.81 seconds for the tuning loop).
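The snapshotting warning repeated in the output above names an environment variable that silences it. A minimal sketch of setting that variable from Python follows. It assumes the variable is set before the tuning run starts, ideally before Ray is imported, because the output doesn't show when Ray reads it.

```python
import os

# Variable name and the value "0" are taken verbatim from the warning text above.
# Assumption: this runs early in the notebook, before the tuning run is launched.
os.environ["TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S"] = "0"
```

This only hides the message. The warning itself suggests increasing `CheckpointConfig(num_to_keep)` or checkpointing less often if snapshotting becomes a real bottleneck.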
View the results of the tuning run as a dataframe, and find the best result.
tune_results.get_dataframe().sort_values("eval_loss")
 | loss | learning_rate | epoch | step | eval_loss | eval_matthews_correlation | eval_runtime | eval_samples_per_second | eval_steps_per_second | timestamp | ... | time_this_iter_s | time_total_s | pid | hostname | node_ip | time_since_restore | iterations_since_restore | config/train_loop_config/learning_rate | config/train_loop_config/epochs | logdir |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 0.6338 | 0.001499 | 0.25 | 535 | 0.618490 | 0.000000 | 1.5122 | 689.707 | 43.644 | 1752101984 | ... | 45.334411 | 45.334411 | 42554 | ip-10-0-114-132 | 10.0.114.132 | 45.334411 | 1 | 0.00200 | 4 | 4776a_00002 |
3 | 1.0524 | 0.014991 | 0.25 | 535 | 0.618516 | 0.000000 | 1.5102 | 690.648 | 43.704 | 1752101983 | ... | 44.326816 | 44.326816 | 42557 | ip-10-0-114-132 | 10.0.114.132 | 44.326816 | 1 | 0.02000 | 4 | 4776a_00003 |
1 | 0.6062 | 0.000000 | 3.25 | 2136 | 0.628866 | 0.000000 | 1.5236 | 684.579 | 43.319 | 1752102079 | ... | 31.721999 | 140.012268 | 42555 | ip-10-0-114-132 | 10.0.114.132 | 140.012268 | 4 | 0.00020 | 4 | 4776a_00001 |
0 | 0.1999 | 0.000000 | 3.25 | 2136 | 0.736353 | 0.536455 | 1.5675 | 665.375 | 42.104 | 1752102082 | ... | 32.129375 | 142.983678 | 42556 | ip-10-0-114-132 | 10.0.114.132 | 142.983678 | 4 | 0.00002 | 4 | 4776a_00000 |
4 rows × 26 columns
best_result = tune_results.get_best_result()
(RayTrainWorker pid=42930) {'train_runtime': 131.5118, 'train_samples_per_second': 259.87, 'train_steps_per_second': 16.242, 'train_loss': 0.35124670521596846, 'epoch': 3.25}
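In the table above, the trial with the lowest `eval_loss` is not the one with the highest `eval_matthews_correlation`. The sketch below illustrates this with the four rows copied by hand from the table. It uses plain Python rather than the Ray API, and the `trials` list and its keys are illustrative names, not part of the notebook.

```python
# Final reported metrics per trial, copied from the results table above.
trials = [
    {"trial": "4776a_00000", "learning_rate": 2e-05,
     "eval_loss": 0.736353, "eval_matthews_correlation": 0.536455},
    {"trial": "4776a_00001", "learning_rate": 2e-04,
     "eval_loss": 0.628866, "eval_matthews_correlation": 0.0},
    {"trial": "4776a_00002", "learning_rate": 2e-03,
     "eval_loss": 0.618490, "eval_matthews_correlation": 0.0},
    {"trial": "4776a_00003", "learning_rate": 2e-02,
     "eval_loss": 0.618516, "eval_matthews_correlation": 0.0},
]

# Rank the same trials by two different criteria.
best_by_loss = min(trials, key=lambda t: t["eval_loss"])
best_by_mcc = max(trials, key=lambda t: t["eval_matthews_correlation"])

print("Lowest eval_loss:", best_by_loss["trial"], best_by_loss["learning_rate"])
print("Highest Matthews correlation:", best_by_mcc["trial"], best_by_mcc["learning_rate"])
```

If the tuner ranks trials by `eval_loss`, `get_best_result()` would pick a trial whose Matthews correlation is 0.0 in this run. `ResultGrid.get_best_result` also accepts explicit `metric` and `mode` arguments, so a call such as `tune_results.get_best_result(metric="eval_matthews_correlation", mode="max")` may be a better fit when the task metric matters more than the loss.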
See also#
Ray Train Examples for more use cases
Ray Train User Guides for how-to guides