GPT-J-6B Fine-Tuning with Ray Train and DeepSpeed#

This example showcases how to use Ray Train for GPT-J fine-tuning. GPT-J is a GPT-2-like causal language model trained on the Pile dataset. This particular model has 6 billion parameters. For more information, see GPT-J.

This example uses the Ray Train 🤗 Transformers integration and a pre-trained model from the Hugging Face Hub. Note that this example is adaptable to other similar models.

This is an advanced example that focuses on the performance and distributed computing aspects of Ray Train. For a beginner-friendly introduction to the Ray Train 🤗 Transformers integration, see Basic Example for HuggingFace Transformers.

Read Ray Train Key Concepts and Ray Data Integration User Guides before starting this example.

Note

To run this example, make sure your Ray cluster has access to at least one GPU with 16 GB or more of memory. The required amount of memory depends on the model. This notebook was tested with 16 g4dn.4xlarge instances (including the head node).

This notebook has the following steps:

Set up Ray
Load the dataset
Preprocess the dataset with Ray Data
Run the training with Ray Train
Generate text from prompt

Run the following line to install all the necessary dependencies (this notebook was tested with accelerate==0.18.0, transformers==4.26.0, deepspeed==0.12.3):

! pip install -q "datasets" "evaluate" "accelerate==0.18.0" "transformers==4.26.0" "torch>=1.12.0" "deepspeed==0.12.3"

import numpy as np
import pandas as pd
import os

Set up Ray#

First, let’s set some global variables. We will use 16 workers, each being assigned 1 GPU and 8 CPUs.

model_name = "EleutherAI/gpt-j-6B"
use_gpu = True
num_workers = 16
cpus_per_worker = 8

We will use ray.init() to initialize a local cluster. By default, this cluster consists of only the machine you are running this notebook on. You can also run this notebook on an Anyscale cluster.

We define a runtime environment to ensure that the Ray workers have access to all the necessary packages. You can omit the runtime_env argument if you have all of the packages already installed on each node in your cluster.

import ray

ray.init(
    runtime_env={
        "pip": [
            "datasets",
            "evaluate",
            # The latest combination accelerate==0.25.0, transformers==4.36.0, deepspeed==0.12.4
            # has issues with DeepSpeed process group initialization,
            # and will result in a batch_size validation problem.
            # TODO(ml-team): get rid of the pins once the issue is fixed.
            "accelerate==0.18.0",
            "transformers==4.26.0",
            "torch>=1.12.0",
            "deepspeed==0.12.3",
        ],
    },
)

Loading the dataset#

We will fine-tune the model on the tiny_shakespeare dataset, which consists of 40,000 lines from a variety of Shakespeare’s plays. The aim is to make the GPT-J model better at generating text in the style of Shakespeare.

from datasets import load_dataset

print("Loading tiny_shakespeare dataset")
current_dataset = load_dataset("tiny_shakespeare")
current_dataset

We will use Ray Data for distributed preprocessing and data ingestion. We can easily convert the dataset obtained from Hugging Face Hub to Ray Data by using ray.data.from_huggingface().

import ray.data

ray_datasets = {
    "train": ray.data.from_huggingface(current_dataset["train"]),
    "validation": ray.data.from_huggingface(current_dataset["validation"]),
}

ray_datasets

{'train': MaterializedDataset(num_blocks=1, num_rows=1, schema={text: string}),
 'validation': MaterializedDataset(num_blocks=1, num_rows=1, schema={text: string})}

Note that each dataset is represented by a single row containing one large string, so it needs some preprocessing. To do this, use the map_batches() API to apply transformation functions to batches of data.

The split_text function takes the single string and splits it into separate lines, removing empty lines and character names ending with ‘:’ (e.g., ‘ROMEO:’). The tokenize function takes the lines and tokenizes them using the 🤗 Tokenizer associated with the model, ensuring each entry has the same length (block_size) by padding and truncating. This preprocessing is necessary for training.

Note

This preprocessing can be done in other ways. A common pattern is to tokenize first, and then split the obtained tokens into equally-sized blocks.

block_size = 512

from transformers import AutoTokenizer


def split_text(batch: pd.DataFrame) -> pd.DataFrame:
    text = list(batch["text"])
    flat_text = "".join(text)
    # Keep non-empty lines and drop speaker names such as "ROMEO:".
    lines = [
        x.strip()
        for x in flat_text.split("\n")
        if x.strip() and not x.strip().endswith(":")
    ]
    return pd.DataFrame(lines, columns=["text"])


def tokenize(batch: pd.DataFrame) -> dict:
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
    tokenizer.pad_token = tokenizer.eos_token
    ret = tokenizer(
        list(batch["text"]),
        truncation=True,
        max_length=block_size,
        padding="max_length",
        return_tensors="np",
    )
    ret["labels"] = ret["input_ids"].copy()
    return dict(ret)


processed_datasets = {
    key: (
        ds.map_batches(split_text, batch_format="pandas")
        .map_batches(tokenize, batch_format="pandas")
    )
    for key, ds in ray_datasets.items()
}
processed_datasets

{'train': MapBatches(tokenize)
 +- MapBatches(split_text)
    +- Dataset(num_blocks=1, num_rows=1, schema={text: string}),
 'validation': MapBatches(tokenize)
 +- MapBatches(split_text)
    +- Dataset(num_blocks=1, num_rows=1, schema={text: string})}
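
The alternative pattern mentioned in the note above (tokenize first, then cut the token stream into equally sized blocks) can be sketched as follows. This is a minimal, library-free illustration and not the preprocessing used in this example: chunk_tokens is a hypothetical helper, and the integer list stands in for the token IDs that a real tokenizer would produce.

```python
from typing import Dict, List


def chunk_tokens(token_ids: List[int], block_size: int) -> Dict[str, List[List[int]]]:
    """Split one long token stream into equally sized blocks.

    The trailing remainder shorter than block_size is dropped,
    so no padding is needed.
    """
    usable = (len(token_ids) // block_size) * block_size
    blocks = [token_ids[i : i + block_size] for i in range(0, usable, block_size)]
    # For causal language modeling, the labels are a copy of the inputs.
    return {"input_ids": blocks, "labels": [list(b) for b in blocks]}


# Stand-in for the concatenated token IDs of the whole corpus (assumption).
example_ids = list(range(10))
chunks = chunk_tokens(example_ids, block_size=4)
print(chunks["input_ids"])  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Compared with padding every line to block_size, this approach wastes no positions on padding tokens, at the cost of blocks that may start or end in the middle of a sentence.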

Fine-tuning the model with Ray Train#

Configure Ray Train’s TorchTrainer to perform distributed fine-tuning of the model. Specify a train_loop_per_worker function that defines the training logic. Ray distributes this function across the workers using Distributed Data Parallelism (DDP), which relies on the PyTorch Distributed backend internally. Each worker has its own copy of the model but operates on different data. At the end of each step, all the workers synchronize their gradients.
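
To make the gradient synchronization step concrete, the following library-free sketch simulates data parallelism for a model with a single parameter. The helper names (mse_grad, all_reduce_mean) and the toy data are illustrative assumptions and not part of Ray Train or PyTorch; in real DDP the averaging is performed by an all-reduce over the PyTorch Distributed backend.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y)


def mse_grad(w: float, shard: List[Sample]) -> float:
    """Gradient of mean((w * x - y) ** 2) with respect to w on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values: List[float]) -> float:
    """Stand-in for the all-reduce that averages gradients across workers."""
    return sum(values) / len(values)


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
shards = [data[:2], data[2:]]  # two workers with equally sized shards
w = 0.0  # every worker holds the same copy of the parameter

local_grads = [mse_grad(w, shard) for shard in shards]
synced_grad = all_reduce_mean(local_grads)

# With equally sized shards, the averaged gradient matches the full-batch gradient.
full_grad = mse_grad(w, data)
print(synced_grad, full_grad)
```

Because every worker applies the same synchronized gradient to the same starting weights, the model copies stay identical after each step.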

Because GPT-J is a relatively large model, it may not fit on smaller GPU types (<=16 GB of GPU memory). To deal with that issue, this example uses DeepSpeed, a library that optimizes the training process and can offload and partition optimizer and parameter states, reducing GPU memory usage. Furthermore, DeepSpeed ZeRO Stage 3 can load large models without running out of memory.
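
A back-of-the-envelope estimate illustrates why partitioning matters for a model of this size. The sketch below assumes the commonly cited accounting for mixed-precision Adam training (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer states) and takes the parameter count from the DeepSpeed log of this run. It ignores activations, communication buffers, and temporarily gathered parameters, so treat the numbers as rough lower bounds rather than measurements.

```python
NUM_PARAMS = 6.05e9  # parameter count reported in the DeepSpeed log of this run
NUM_WORKERS = 16
GIB = 1024**3

# Assumed bytes per parameter for mixed-precision Adam training.
BYTES_FP16_PARAMS = 2
BYTES_FP16_GRADS = 2
BYTES_OPTIMIZER_STATES = 12  # fp32 master weights + Adam momentum + Adam variance


def gib(num_bytes: float) -> float:
    """Convert a byte count to GiB."""
    return num_bytes / GIB


# Plain DDP: every GPU holds a full replica of weights, gradients, and optimizer states.
ddp_per_gpu = gib(
    NUM_PARAMS * (BYTES_FP16_PARAMS + BYTES_FP16_GRADS + BYTES_OPTIMIZER_STATES)
)

# ZeRO Stage 3 with the optimizer offloaded to CPU: weights and gradients are
# partitioned across GPUs, optimizer states are partitioned across host RAM.
zero3_per_gpu = gib(NUM_PARAMS * (BYTES_FP16_PARAMS + BYTES_FP16_GRADS) / NUM_WORKERS)
zero3_per_host = gib(NUM_PARAMS * BYTES_OPTIMIZER_STATES / NUM_WORKERS)

print(f"DDP, per GPU:               {ddp_per_gpu:.1f} GiB")
print(f"ZeRO-3, per GPU:            {zero3_per_gpu:.1f} GiB")
print(f"ZeRO-3, per worker (RAM):   {zero3_per_host:.1f} GiB")
```

Under these assumptions, a full replica needs roughly 90 GiB per GPU, far more than a 16 GB device offers, while the partitioned model states take only a small fraction of that on each worker.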

🤗 Transformers and Ray Train’s integrations allow you to easily configure and use DDP and DeepSpeed. All you need to do is specify the DeepSpeed configuration in the TrainingArguments object.

Tip

There are many DeepSpeed settings that allow you to trade off speed for memory usage. The settings used below are tailored to the cluster setup used (16 g4dn.4xlarge nodes) and a per-device batch size of 16. Some things to keep in mind:

If your GPUs support bfloat16, use that instead of float16 mixed precision to get better performance and prevent overflows. Replace fp16=True with bf16=True in TrainingArguments.
If you are running out of GPU memory, try reducing the batch size (defined in the cell below the next one) or set "overlap_comm": False in the DeepSpeed config.
If you are running out of RAM, add more nodes to your cluster, use nodes with more RAM, set "pin_memory": False in the DeepSpeed config, reduce the batch size, and remove "offload_param" from the DeepSpeed config.
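
The config-related tips above can be expressed as small edits to the DeepSpeed config dictionary. The following is a minimal sketch: apply_memory_tips and its two flags are hypothetical names introduced for illustration, and base_config is an abbreviated stand-in for the full config used in this example.

```python
import copy
from typing import Any, Dict


def apply_memory_tips(
    ds_config: Dict[str, Any],
    low_gpu_memory: bool = False,
    low_host_memory: bool = False,
) -> Dict[str, Any]:
    """Return a modified copy of a DeepSpeed config; the input is left untouched."""
    cfg = copy.deepcopy(ds_config)
    zero = cfg.setdefault("zero_optimization", {})
    if low_gpu_memory:
        # Running out of GPU memory: disable communication/computation overlap.
        zero["overlap_comm"] = False
    if low_host_memory:
        # Running out of RAM: stop pinning the offloaded optimizer buffers
        # and do not offload parameters to the CPU.
        if "offload_optimizer" in zero:
            zero["offload_optimizer"]["pin_memory"] = False
        zero.pop("offload_param", None)
    return cfg


# Abbreviated stand-in for the config used in this example.
base_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
    },
}

tuned_config = apply_memory_tips(base_config, low_gpu_memory=True, low_host_memory=True)
print(tuned_config["zero_optimization"])
```

Returning a copy keeps the original settings available, which makes it easier to compare runs with and without the memory-saving changes.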

For more information on DeepSpeed configuration, refer to Hugging Face documentation and DeepSpeed documentation.

Additionally, if you prefer a lower-level API, the logic below can be expressed as an Accelerate training loop distributed by a Ray Train TorchTrainer.

Training speed#

As this example uses data parallelism, each worker operates on its own shard of the data. The batch size set in train_ds.iter_torch_batches is the per device batch size (per worker batch size). By changing the number of workers, you can change the effective batch size and thus the time needed for training to complete. Calculate the effective batch size as per device batch size * number of workers * number of gradient accumulation steps. As you add more workers, the effective batch size rises and thus less time is needed to complete a full epoch. While the speedup is not exactly linear due to extra communication overheads, in many cases it can be close to linear.

The preprocessed dataset has 1348 examples. We have set the per-device batch size to 16.

With 16 g4dn.4xlarge nodes, the effective batch size was 256, which corresponds to 85 steps per epoch. One epoch took ~2440 seconds (including initialization time).
With 32 g4dn.4xlarge nodes, the effective batch size was 512, which corresponds to 43 steps per epoch. One epoch took ~1280 seconds (including initialization time).
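
The effective batch size formula and the observed scaling can be checked with a few lines of arithmetic. This sketch uses only the numbers reported above; effective_batch_size is a hypothetical helper name.

```python
def effective_batch_size(
    per_device_batch_size: int, num_workers: int, grad_accum_steps: int = 1
) -> int:
    """per-device batch size * number of workers * gradient accumulation steps."""
    return per_device_batch_size * num_workers * grad_accum_steps


for workers in (16, 32):
    print(workers, "workers ->", effective_batch_size(16, workers))

# Speedup observed when doubling the number of workers,
# based on the epoch times reported above (~2440 s vs. ~1280 s).
speedup = 2440 / 1280
print(f"speedup: {speedup:.2f}x")
```

Doubling the workers roughly halves the epoch time here, which is consistent with the near-linear scaling described above.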

import evaluate
import torch
from transformers import (
    Trainer,
    TrainingArguments,
    GPTJForCausalLM,
    AutoTokenizer,
    default_data_collator,
)
from transformers.utils.logging import disable_progress_bar, enable_progress_bar

from ray import train
from ray.train.huggingface.transformers import prepare_trainer, RayTrainReportCallback


def train_func(config):
    # Use the actual number of CPUs assigned by Ray
    os.environ["OMP_NUM_THREADS"] = str(
        train.get_context().get_trial_resources().bundles[-1].get("CPU", 1)
    )
    # Enable tf32 for better performance
    torch.backends.cuda.matmul.allow_tf32 = True

    batch_size = config.get("batch_size", 4)
    epochs = config.get("epochs", 2)
    warmup_steps = config.get("warmup_steps", 0)
    learning_rate = config.get("learning_rate", 0.00002)
    weight_decay = config.get("weight_decay", 0.01)
    steps_per_epoch = config.get("steps_per_epoch")

    deepspeed = {
        "fp16": {
            "enabled": "auto",
            "initial_scale_power": 8,
            "hysteresis": 4,
            "consecutive_hysteresis": True,
        },
        "bf16": {"enabled": "auto"},
        "optimizer": {
            "type": "AdamW",
            "params": {
                "lr": "auto",
                "betas": "auto",
                "eps": "auto",
            },
        },
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {
                "device": "cpu",
                "pin_memory": True,
            },
            "overlap_comm": True,
            "contiguous_gradients": True,
            "reduce_bucket_size": "auto",
            "stage3_prefetch_bucket_size": "auto",
            "stage3_param_persistence_threshold": "auto",
            "gather_16bit_weights_on_model_save": True,
            "round_robin_gradients": True,
        },
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
        "steps_per_print": 10,
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "wall_clock_breakdown": False,
    }

    print("Preparing training arguments")
    training_args = TrainingArguments(
        "output",
        logging_steps=1,
        save_strategy="steps",
        save_steps=steps_per_epoch,
        max_steps=steps_per_epoch * epochs,
        per_device_train_batch_size=batch_size,
        gradient_accumulation_steps=1,
        learning_rate=learning_rate,
        weight_decay=weight_decay,
        warmup_steps=warmup_steps,
        label_names=["input_ids", "attention_mask"],
        push_to_hub=False,
        report_to="none",
        disable_tqdm=True,  # declutter the output a little
        fp16=True,
        gradient_checkpointing=True,
        deepspeed=deepspeed,
    )
    disable_progress_bar()

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token

    print("Loading model")

    model = GPTJForCausalLM.from_pretrained(model_name, use_cache=False)
    model.resize_token_embeddings(len(tokenizer))

    print("Model loaded")

    enable_progress_bar()

    metric = evaluate.load("accuracy")

    train_ds = train.get_dataset_shard("train")
    eval_ds = train.get_dataset_shard("validation")

    train_ds_iterable = train_ds.iter_torch_batches(
        batch_size=batch_size,
        local_shuffle_buffer_size=train.get_context().get_world_size() * batch_size,
    )
    eval_ds_iterable = eval_ds.iter_torch_batches(batch_size=batch_size)

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        predictions = np.argmax(logits, axis=-1)
        return metric.compute(predictions=predictions, references=labels)

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_ds_iterable,
        eval_dataset=eval_ds_iterable,
        compute_metrics=compute_metrics,
        tokenizer=tokenizer,
        data_collator=default_data_collator,
    )

    # Add callback to report checkpoints to Ray Train
    trainer.add_callback(RayTrainReportCallback())
    trainer = prepare_trainer(trainer)
    trainer.train()

After defining the training function, instantiate the TorchTrainer. Aside from the function, set the scaling_config to control the number of workers and the amount of resources to use, and datasets (the preprocessed Ray Datasets) to use for training and evaluation.

Note

Running with multiple nodes requires persisting checkpoints and other outputs to external storage so that you can access them after training has completed. You should set up cloud storage or NFS, then replace storage_path with your own cloud bucket URI or NFS path.

See Configuration and Persistent Storage for more details.

storage_path = "s3://your-bucket-here"  # TODO: Set up cloud storage
# storage_path="/mnt/path/to/nfs"     # TODO: Alternatively, set up NFS

batch_size = 16
train_ds_size = processed_datasets["train"].count()
steps_per_epoch = train_ds_size // (batch_size * num_workers)

from ray.train.torch import TorchTrainer
from ray.train import RunConfig, ScalingConfig

trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    train_loop_config={
        "epochs": 1,
        "batch_size": batch_size,  # per device
        "steps_per_epoch": steps_per_epoch,
    },
    scaling_config=ScalingConfig(
        num_workers=num_workers,
        use_gpu=use_gpu,
        resources_per_worker={"GPU": 1, "CPU": cpus_per_worker},
    ),
    datasets=processed_datasets,
    run_config=RunConfig(storage_path=storage_path),
)

Finally, call the fit() method to start training with Ray Train. Save the Result object to a variable to access metrics and checkpoints.

results = trainer.fit()


Tune Status

Current time:	2023-08-18 18:54:02
Running for:	00:44:50.37
Memory:	10.2/62.0 GiB

System Info

Using FIFO scheduling algorithm.
Logical resource usage: 129.0/256 CPUs, 16.0/16 GPUs

Trial Status

Trial name	status	loc	iter	total time (s)	loss	learning_rate	epoch
TorchTrainer_01ea5_00000	TERMINATED	10.0.60.59:8839	1	2663.78	0.069	2.38095e-07	1

(TrainTrainable pid=8839, ip=10.0.60.59) 2023-08-18 18:09:16.315108: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
(TrainTrainable pid=8839, ip=10.0.60.59) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(TrainTrainable pid=8839, ip=10.0.60.59) 2023-08-18 18:09:16.462944: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
(TrainTrainable pid=8839, ip=10.0.60.59) 2023-08-18 18:09:17.336229: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
(TrainTrainable pid=8839, ip=10.0.60.59) 2023-08-18 18:09:17.336299: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
(TrainTrainable pid=8839, ip=10.0.60.59) 2023-08-18 18:09:17.336306: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
(TrainTrainable pid=8839, ip=10.0.60.59) --------------------------------------------------------------------------
(TrainTrainable pid=8839, ip=10.0.60.59)                  Aim collects anonymous usage analytics.                 
(TrainTrainable pid=8839, ip=10.0.60.59)                         Read how to opt-out here:                         
(TrainTrainable pid=8839, ip=10.0.60.59)     https://aimstack.readthedocs.io/en/latest/community/telemetry.html    
(TrainTrainable pid=8839, ip=10.0.60.59) --------------------------------------------------------------------------
(TrainTrainable pid=8839, ip=10.0.60.59) comet_ml is installed but `COMET_API_KEY` is not set.
(TorchTrainer pid=8839, ip=10.0.60.59) Starting distributed worker processes: ['8911 (10.0.60.59)', '36675 (10.0.13.222)', '8880 (10.0.63.99)', '8867 (10.0.49.236)', '49329 (10.0.40.253)', '8845 (10.0.18.195)', '36249 (10.0.11.26)', '8858 (10.0.0.119)', '8857 (10.0.44.114)', '8885 (10.0.47.209)', '36311 (10.0.27.53)', '8830 (10.0.30.35)', '8875 (10.0.0.80)', '8851 (10.0.43.240)', '9631 (10.0.57.153)', '36262 (10.0.52.191)']
(RayTrainWorker pid=8911, ip=10.0.60.59) Setting up process group for: env:// [rank=0, world_size=16]
(RayTrainWorker pid=8911, ip=10.0.60.59) 2023-08-18 18:09:25.209122: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
(RayTrainWorker pid=8911, ip=10.0.60.59) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(RayTrainWorker pid=8911, ip=10.0.60.59) 2023-08-18 18:09:25.358493: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
(RayTrainWorker pid=8911, ip=10.0.60.59) 2023-08-18 18:09:26.095161: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
(RayTrainWorker pid=8911, ip=10.0.60.59) 2023-08-18 18:09:26.095229: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
(RayTrainWorker pid=8911, ip=10.0.60.59) 2023-08-18 18:09:26.095236: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
(SplitCoordinator pid=8980, ip=10.0.60.59) Auto configuring locality_with_output=['6002ded0aaa53ce9a0351d22a72b344ef411a422919132f41d9f937a', 'd3bbd390b6fe73f26202f96d75998946cf3e8b457528d426db0c6e07', 'fe6aaf54317ee630a02d23e0d49581b57b5cd51316eaf769e28bb045', 'f7de4694a4f764c05a9c51a6a4bd40ac33f3fced3b25127b25cd4ac3', '42866a2fba4ce2ab4b6645c4d731d486b762e2b23ac24cafccba7096', '8a7272830662c7e756a656de0a9b433a3a1f9b990768f692b6fe11a7', 'bba62e8b57552509c62a6b6b7fd67c1a2280b9d81b3d9c41eb4d1b9b', 'b40764f303538c24bc439106f2e7b2144d382bfed6c9fdec15ab828e', 'd1de4d4b6d44eff93857026df4ef0f70e24e3dc91e15d87015f2ed32', '4d6a9dc1aa7bfc80cb73d9f66f4e28041807f12769391f5643bce143', '8bcc7235f459b61be21fe158d0bae4fef2ec6de013ec60e7aaf7897a', '73c50b995811afa0ece70fd3d4466b7fd0dc85a97d6807128b2c47da', '03bf3d374a9f857b1cd1aebdbe028208f7904b077fb151790e03e9fe', '9f7fc101a7d6b3e17b72e57ca1c92f91d13aa385a6740f99d58ec016', '867844d104a8e9351a1dcc8bbd61d99906a8dc5b53e220c2ae2efbe1', '7677b344c59d6b30c3db451f48e346d61bb60cc798e5567aa4e0a1ea']
(RayTrainWorker pid=49329) comet_ml is installed but `COMET_API_KEY` is not set.
(RayTrainWorker pid=8867, ip=10.0.49.236) --------------------------------------------------------------------------
(RayTrainWorker pid=8867, ip=10.0.49.236)                  Aim collects anonymous usage analytics.                 
(RayTrainWorker pid=8867, ip=10.0.49.236)                         Read how to opt-out here:                         
(RayTrainWorker pid=8867, ip=10.0.49.236)     https://aimstack.readthedocs.io/en/latest/community/telemetry.html    
(RayTrainWorker pid=8867, ip=10.0.49.236) --------------------------------------------------------------------------
(SplitCoordinator pid=8980, ip=10.0.60.59) 2023-08-18 18:09:26.534936: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA [repeated 16x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(SplitCoordinator pid=8980, ip=10.0.60.59) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. [repeated 16x across cluster]
(SplitCoordinator pid=8980, ip=10.0.60.59) 2023-08-18 18:09:26.667181: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`. [repeated 16x across cluster]

(RayTrainWorker pid=8885, ip=10.0.47.209) Preparing training arguments
(RayTrainWorker pid=36675, ip=10.0.13.222) Loading model
(autoscaler +3m53s) [workspace snapshot] New snapshot created successfully (size: 172.52 MB).
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:12:01,852] [INFO] [partition_parameters.py:454:__exit__] finished initializing model with 6.05B parameters
(RayTrainWorker pid=36675, ip=10.0.13.222) Preparing training arguments [repeated 15x across cluster]
(RayTrainWorker pid=8880, ip=10.0.63.99) Loading model [repeated 15x across cluster]
(RayTrainWorker pid=8851, ip=10.0.43.240) Model loaded

Downloading builder script: 100%|██████████| 4.20k/4.20k [00:00<00:00, 22.1MB/s]
(SplitCoordinator pid=8980, ip=10.0.60.59) 2023-08-18 18:09:27.424862: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 [repeated 32x across cluster]
(SplitCoordinator pid=8980, ip=10.0.60.59) 2023-08-18 18:09:27.424869: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. [repeated 16x across cluster]
(RayTrainWorker pid=36311, ip=10.0.27.53) comet_ml is installed but `COMET_API_KEY` is not set. [repeated 15x across cluster]
(RayTrainWorker pid=36262, ip=10.0.52.191) -------------------------------------------------------------------------- [repeated 26x across cluster]
(RayTrainWorker pid=36262, ip=10.0.52.191)                  Aim collects anonymous usage analytics.                  [repeated 13x across cluster]
(RayTrainWorker pid=36262, ip=10.0.52.191)                         Read how to opt-out here:                          [repeated 13x across cluster]
(RayTrainWorker pid=36262, ip=10.0.52.191)     https://aimstack.readthedocs.io/en/latest/community/telemetry.html     [repeated 13x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) max_steps is given, it will override any value given in num_train_epochs
(RayTrainWorker pid=8911, ip=10.0.60.59) Using cuda_amp half precision backend

(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:12:36,256] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.9.2, git-hash=unknown, git-branch=unknown
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:12:36,373] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False

(RayTrainWorker pid=8858, ip=10.0.0.119) Using /home/ray/.cache/torch_extensions/py39_cu118 as PyTorch extensions root...
(RayTrainWorker pid=8858, ip=10.0.0.119) Creating extension directory /home/ray/.cache/torch_extensions/py39_cu118/cpu_adam...
Downloading builder script: 100%|██████████| 4.20k/4.20k [00:00<00:00, 19.8MB/s] [repeated 15x across cluster]
(RayTrainWorker pid=8857, ip=10.0.44.114) max_steps is given, it will override any value given in num_train_epochs [repeated 15x across cluster]
(RayTrainWorker pid=8857, ip=10.0.44.114) Using cuda_amp half precision backend [repeated 15x across cluster]
(RayTrainWorker pid=49329) Detected CUDA files, patching ldflags
(RayTrainWorker pid=49329) Emitting ninja build file /home/ray/.cache/torch_extensions/py39_cu118/cpu_adam/build.ninja...
(RayTrainWorker pid=49329) Building extension module cpu_adam...
(RayTrainWorker pid=49329) Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)

(RayTrainWorker pid=8858, ip=10.0.0.119) [1/3] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include/TH -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/ray/anaconda3/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_75,code=compute_75 -gencode=arch=compute_75,code=sm_75 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_75,code=compute_75 -c /home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o 
(RayTrainWorker pid=8830, ip=10.0.30.35) Model loaded [repeated 15x across cluster]
(RayTrainWorker pid=49329) [2/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include/TH -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/ray/anaconda3/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++14 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX512__ -D__ENABLE_CUDA__ -c /home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o 
(RayTrainWorker pid=36675, ip=10.0.13.222) [1/3] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include/TH -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/ray/anaconda3/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_75,code=compute_75 -gencode=arch=compute_75,code=sm_75 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_75,code=compute_75 -c /home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o  [repeated 15x across cluster]
(RayTrainWorker pid=49329) [3/3] c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared -lcurand -L/home/ray/anaconda3/lib/python3.9/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so
(RayTrainWorker pid=49329) Time to load cpu_adam op: 31.202290058135986 seconds

(RayTrainWorker pid=49329) Loading extension module cpu_adam...
(RayTrainWorker pid=36675, ip=10.0.13.222) Using /home/ray/.cache/torch_extensions/py39_cu118 as PyTorch extensions root... [repeated 15x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) Creating extension directory /home/ray/.cache/torch_extensions/py39_cu118/cpu_adam... [repeated 15x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) Detected CUDA files, patching ldflags [repeated 15x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) Emitting ninja build file /home/ray/.cache/torch_extensions/py39_cu118/cpu_adam/build.ninja... [repeated 15x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) Building extension module cpu_adam... [repeated 15x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) [repeated 15x across cluster]

(RayTrainWorker pid=49329) Adam Optimizer #0 is created with AVX512 arithmetic capability.
(RayTrainWorker pid=49329) Config: alpha=0.000020, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1

(RayTrainWorker pid=49329) Building extension module utils...

(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:13,196] [INFO] [logging.py:96:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:13,212] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:13,212] [INFO] [utils.py:54:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:13,212] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:13,212] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 3 optimizer
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:13,520] [INFO] [utils.py:785:see_memory_usage] Stage 3 initialize beginning
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:13,521] [INFO] [utils.py:786:see_memory_usage] MA 0.11 GB         Max_MA 1.26 GB         CA 1.54 GB         Max_CA 2 GB 
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:13,521] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 8.96 GB, percent = 14.4%
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:13,523] [INFO] [stage3.py:113:__init__] Reduce bucket size 16777216
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:13,523] [INFO] [stage3.py:114:__init__] Prefetch bucket size 15099494
(RayTrainWorker pid=49329) [1/2] c++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include/TH -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include/THC -isystem /home/ray/anaconda3/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -c /home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o 
(RayTrainWorker pid=36675, ip=10.0.13.222) [2/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include/TH -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/ray/anaconda3/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++14 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX512__ -D__ENABLE_CUDA__ -c /home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o  [repeated 15x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) [3/3] c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared -lcurand -L/home/ray/anaconda3/lib/python3.9/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so [repeated 15x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) Time to load cpu_adam op: 34.29589319229126 seconds [repeated 15x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) Adam Optimizer #0 is created with AVX512 arithmetic capability. [repeated 15x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) Config: alpha=0.000020, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1 [repeated 15x across cluster]
(RayTrainWorker pid=49329) [2/2] c++ flatten_unflatten.o -shared -L/home/ray/anaconda3/lib/python3.9/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so
(RayTrainWorker pid=49329) Time to load utils op: 15.381849527359009 seconds

(RayTrainWorker pid=49329) Loading extension module utils...
(RayTrainWorker pid=36675, ip=10.0.13.222) Loading extension module cpu_adam... [repeated 15x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) Using /home/ray/.cache/torch_extensions/py39_cu118 as PyTorch extensions root... [repeated 16x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) Creating extension directory /home/ray/.cache/torch_extensions/py39_cu118/utils... [repeated 16x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) Emitting ninja build file /home/ray/.cache/torch_extensions/py39_cu118/utils/build.ninja... [repeated 16x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) [repeated 16x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) Building extension module utils... [repeated 15x across cluster]

(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:29,490] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:29,491] [INFO] [utils.py:786:see_memory_usage] MA 0.11 GB         Max_MA 0.11 GB         CA 1.54 GB         Max_CA 2 GB 
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:29,491] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 8.96 GB, percent = 14.5%
(RayTrainWorker pid=8911, ip=10.0.60.59) Parameter Offload: Total persistent parameters: 811008 in 114 params
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:29,763] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:29,764] [INFO] [utils.py:786:see_memory_usage] MA 0.11 GB         Max_MA 0.11 GB         CA 1.54 GB         Max_CA 2 GB 
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:29,764] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 8.96 GB, percent = 14.5%
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:30,012] [INFO] [utils.py:785:see_memory_usage] Before creating fp16 partitions
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:30,013] [INFO] [utils.py:786:see_memory_usage] MA 0.11 GB         Max_MA 0.11 GB         CA 1.54 GB         Max_CA 2 GB 
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:30,013] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 8.96 GB, percent = 14.5%
(RayTrainWorker pid=36675, ip=10.0.13.222) [1/2] c++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include/TH -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include/THC -isystem /home/ray/anaconda3/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -c /home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o  [repeated 15x across cluster]

(RayTrainWorker pid=36675, ip=10.0.13.222) Loading extension module utils... [repeated 15x across cluster]

(RayTrainWorker pid=36675, ip=10.0.13.222) [2/2] c++ flatten_unflatten.o -shared -L/home/ray/anaconda3/lib/python3.9/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so [repeated 15x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) Time to load utils op: 16.94431161880493 seconds [repeated 15x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:31,872] [INFO] [utils.py:785:see_memory_usage] After creating fp16 partitions: 1
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:31,873] [INFO] [utils.py:786:see_memory_usage] MA 0.11 GB         Max_MA 0.11 GB         CA 1.54 GB         Max_CA 2 GB 
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:31,873] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 9.98 GB, percent = 16.1%
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:32,120] [INFO] [utils.py:785:see_memory_usage] Before creating fp32 partitions
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:32,121] [INFO] [utils.py:786:see_memory_usage] MA 0.11 GB         Max_MA 0.11 GB         CA 1.54 GB         Max_CA 2 GB 
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:32,121] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 9.98 GB, percent = 16.1%
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:32,624] [INFO] [utils.py:785:see_memory_usage] After creating fp32 partitions
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:32,624] [INFO] [utils.py:786:see_memory_usage] MA 0.11 GB         Max_MA 0.11 GB         CA 1.54 GB         Max_CA 2 GB 
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:32,625] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 11.39 GB, percent = 18.4%
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:32,870] [INFO] [utils.py:785:see_memory_usage] Before initializing optimizer states
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:32,870] [INFO] [utils.py:786:see_memory_usage] MA 0.11 GB         Max_MA 0.11 GB         CA 1.54 GB         Max_CA 2 GB 
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:32,871] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 11.39 GB, percent = 18.4%
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:34,834] [INFO] [utils.py:785:see_memory_usage] After initializing optimizer states
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:34,835] [INFO] [utils.py:786:see_memory_usage] MA 0.11 GB         Max_MA 0.11 GB         CA 1.54 GB         Max_CA 2 GB 
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:34,835] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 16.25 GB, percent = 26.2%
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:34,835] [INFO] [stage3.py:392:_setup_for_real_optimizer] optimizer state initialized

(RayTrainWorker pid=8830, ip=10.0.30.35) Using /home/ray/.cache/torch_extensions/py39_cu118 as PyTorch extensions root...
(RayTrainWorker pid=8830, ip=10.0.30.35) No modifications detected for re-loaded extension module utils, skipping build step...
(RayTrainWorker pid=8830, ip=10.0.30.35) Loading extension module utils...
(RayTrainWorker pid=9631, ip=10.0.57.153) Loading extension module utils...
(RayTrainWorker pid=9631, ip=10.0.57.153) ***** Running training *****
(RayTrainWorker pid=9631, ip=10.0.57.153)   Num examples = 10752
(RayTrainWorker pid=9631, ip=10.0.57.153)   Num Epochs = 9223372036854775807
(RayTrainWorker pid=9631, ip=10.0.57.153)   Instantaneous batch size per device = 8
(RayTrainWorker pid=9631, ip=10.0.57.153)   Total train batch size (w. parallel, distributed & accumulation) = 128
(RayTrainWorker pid=9631, ip=10.0.57.153)   Gradient Accumulation steps = 1
(RayTrainWorker pid=9631, ip=10.0.57.153)   Total optimization steps = 84
(RayTrainWorker pid=9631, ip=10.0.57.153)   Number of trainable parameters = 0

(RayTrainWorker pid=8830, ip=10.0.30.35) Time to load utils op: 0.0005006790161132812 seconds
(RayTrainWorker pid=9631, ip=10.0.57.153) Time to load utils op: 0.0005137920379638672 seconds
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,692] [INFO] [utils.py:785:see_memory_usage] After initializing ZeRO optimizer
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,693] [INFO] [utils.py:786:see_memory_usage] MA 0.14 GB         Max_MA 0.91 GB         CA 1.54 GB         Max_CA 2 GB 
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,693] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 17.3 GB, percent = 27.9%
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,694] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,694] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client callable to create LR scheduler
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,694] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7f50b45fbfd0>
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,694] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[2e-05], mom=[[0.9, 0.999]]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,695] [INFO] [config.py:955:print] DeepSpeedEngine configuration:
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,696] [INFO] [config.py:959:print]   activation_checkpointing_config  {
(RayTrainWorker pid=8911, ip=10.0.60.59)     "partition_activations": false, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "contiguous_memory_optimization": false, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "cpu_checkpointing": false, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "number_checkpoints": null, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "synchronize_checkpoint_boundary": false, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "profile": false
(RayTrainWorker pid=8911, ip=10.0.60.59) }
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,696] [INFO] [config.py:959:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,696] [INFO] [config.py:959:print]   amp_enabled .................. False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,696] [INFO] [config.py:959:print]   amp_params ................... False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,696] [INFO] [config.py:959:print]   autotuning_config ............ {
(RayTrainWorker pid=8911, ip=10.0.60.59)     "enabled": false, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "start_step": null, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "end_step": null, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "metric_path": null, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "arg_mappings": null, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "metric": "throughput", 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "model_info": null, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "results_dir": "autotuning_results", 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "exps_dir": "autotuning_exps", 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "overwrite": true, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "fast": true, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "start_profile_step": 3, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "end_profile_step": 5, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "tuner_type": "gridsearch", 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "tuner_early_stopping": 5, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "tuner_num_trials": 50, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "model_info_path": null, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "mp_size": 1, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "max_train_batch_size": null, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "min_train_batch_size": 1, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "min_train_micro_batch_size_per_gpu": 1, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "num_tuning_micro_batch_sizes": 3
(RayTrainWorker pid=8911, ip=10.0.60.59) }
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,696] [INFO] [config.py:959:print]   bfloat16_enabled ............. False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,696] [INFO] [config.py:959:print]   checkpoint_parallel_write_pipeline  False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,696] [INFO] [config.py:959:print]   checkpoint_tag_validation_enabled  True
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,696] [INFO] [config.py:959:print]   checkpoint_tag_validation_fail  False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,696] [INFO] [config.py:959:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f50c6da2370>
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,696] [INFO] [config.py:959:print]   communication_data_type ...... None
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,696] [INFO] [config.py:959:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,696] [INFO] [config.py:959:print]   curriculum_enabled_legacy .... False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,696] [INFO] [config.py:959:print]   curriculum_params_legacy ..... False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,696] [INFO] [config.py:959:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,696] [INFO] [config.py:959:print]   data_efficiency_enabled ...... False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,696] [INFO] [config.py:959:print]   dataloader_drop_last ......... False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   disable_allgather ............ False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   dump_state ................... False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   dynamic_loss_scale_args ...... {'init_scale': 256, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   eigenvalue_enabled ........... False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   eigenvalue_gas_boundary_resolution  1
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   eigenvalue_layer_name ........ bert.encoder.layer
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   eigenvalue_layer_num ......... 0
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   eigenvalue_max_iter .......... 100
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   eigenvalue_stability ......... 1e-06
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   eigenvalue_tol ............... 0.01
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   eigenvalue_verbose ........... False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   elasticity_enabled ........... False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   flops_profiler_config ........ {
(RayTrainWorker pid=8911, ip=10.0.60.59)     "enabled": false, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "profile_step": 1, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "module_depth": -1, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "top_modules": 1, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "detailed": true, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "output_file": null
(RayTrainWorker pid=8911, ip=10.0.60.59) }
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   fp16_auto_cast ............... False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   fp16_enabled ................. True
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   fp16_master_weights_and_gradients  False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   global_rank .................. 0
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   grad_accum_dtype ............. None
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   gradient_accumulation_steps .. 1
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   gradient_clipping ............ 1.0
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   gradient_predivide_factor .... 1.0
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   initial_dynamic_scale ........ 256
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   load_universal_checkpoint .... False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   loss_scale ................... 0
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   memory_breakdown ............. False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   mics_hierarchial_params_gather  False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   mics_shard_size .............. -1
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   nebula_config ................ {
(RayTrainWorker pid=8911, ip=10.0.60.59)     "enabled": false, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "persistent_storage_path": null, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "persistent_time_interval": 100, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "num_of_version_in_retention": 2, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "enable_nebula_load": true, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "load_path": null
(RayTrainWorker pid=8911, ip=10.0.60.59) }
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   optimizer_legacy_fusion ...... False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   optimizer_name ............... adamw
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   optimizer_params ............. {'lr': 2e-05, 'betas': [0.9, 0.999], 'eps': 1e-08}
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,698] [INFO] [config.py:959:print]   pld_enabled .................. False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,698] [INFO] [config.py:959:print]   pld_params ................... False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,698] [INFO] [config.py:959:print]   prescale_gradients ........... False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,698] [INFO] [config.py:959:print]   scheduler_name ............... None
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,698] [INFO] [config.py:959:print]   scheduler_params ............. None
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,698] [INFO] [config.py:959:print]   sparse_attention ............. None
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,698] [INFO] [config.py:959:print]   sparse_gradients_enabled ..... False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,698] [INFO] [config.py:959:print]   steps_per_print .............. 10
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,698] [INFO] [config.py:959:print]   train_batch_size ............. 128
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,698] [INFO] [config.py:959:print]   train_micro_batch_size_per_gpu  8
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,698] [INFO] [config.py:959:print]   use_node_local_storage ....... False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,698] [INFO] [config.py:959:print]   wall_clock_breakdown ......... False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,698] [INFO] [config.py:959:print]   world_size ................... 16
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,698] [INFO] [config.py:959:print]   zero_allow_untested_optimizer  False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,698] [INFO] [config.py:959:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=16777216 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=True) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=True, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=15099494 param_persistence_threshold=40960 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=True stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=True mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,698] [INFO] [config.py:959:print]   zero_enabled ................. True
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,698] [INFO] [config.py:959:print]   zero_force_ds_cpu_optimizer .. True
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,698] [INFO] [config.py:959:print]   zero_optimization_stage ...... 3
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,698] [INFO] [config.py:945:print_user_config]   json = {
(RayTrainWorker pid=8911, ip=10.0.60.59)     "fp16": {
(RayTrainWorker pid=8911, ip=10.0.60.59)         "enabled": true, 
(RayTrainWorker pid=8911, ip=10.0.60.59)         "initial_scale_power": 8
(RayTrainWorker pid=8911, ip=10.0.60.59)     }, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "bf16": {
(RayTrainWorker pid=8911, ip=10.0.60.59)         "enabled": false
(RayTrainWorker pid=8911, ip=10.0.60.59)     }, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "optimizer": {
(RayTrainWorker pid=8911, ip=10.0.60.59)         "type": "AdamW", 
(RayTrainWorker pid=8911, ip=10.0.60.59)         "params": {
(RayTrainWorker pid=8911, ip=10.0.60.59)             "lr": 2e-05, 
(RayTrainWorker pid=8911, ip=10.0.60.59)             "betas": [0.9, 0.999], 
(RayTrainWorker pid=8911, ip=10.0.60.59)             "eps": 1e-08
(RayTrainWorker pid=8911, ip=10.0.60.59)         }
(RayTrainWorker pid=8911, ip=10.0.60.59)     }, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "zero_optimization": {
(RayTrainWorker pid=8911, ip=10.0.60.59)         "stage": 3, 
(RayTrainWorker pid=8911, ip=10.0.60.59)         "offload_optimizer": {
(RayTrainWorker pid=8911, ip=10.0.60.59)             "device": "cpu", 
(RayTrainWorker pid=8911, ip=10.0.60.59)             "pin_memory": true
(RayTrainWorker pid=8911, ip=10.0.60.59)         }, 
(RayTrainWorker pid=8911, ip=10.0.60.59)         "offload_param": {
(RayTrainWorker pid=8911, ip=10.0.60.59)             "device": "cpu", 
(RayTrainWorker pid=8911, ip=10.0.60.59)             "pin_memory": true
(RayTrainWorker pid=8911, ip=10.0.60.59)         }, 
(RayTrainWorker pid=8911, ip=10.0.60.59)         "overlap_comm": true, 
(RayTrainWorker pid=8911, ip=10.0.60.59)         "contiguous_gradients": true, 
(RayTrainWorker pid=8911, ip=10.0.60.59)         "reduce_bucket_size": 1.677722e+07, 
(RayTrainWorker pid=8911, ip=10.0.60.59)         "stage3_prefetch_bucket_size": 1.509949e+07, 
(RayTrainWorker pid=8911, ip=10.0.60.59)         "stage3_param_persistence_threshold": 4.096000e+04, 
(RayTrainWorker pid=8911, ip=10.0.60.59)         "gather_16bit_weights_on_model_save": true, 
(RayTrainWorker pid=8911, ip=10.0.60.59)         "round_robin_gradients": true
(RayTrainWorker pid=8911, ip=10.0.60.59)     }, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "gradient_accumulation_steps": 1, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "gradient_clipping": 1.0, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "steps_per_print": 10, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "train_batch_size": 128, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "train_micro_batch_size_per_gpu": 8, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "wall_clock_breakdown": false
(RayTrainWorker pid=8911, ip=10.0.60.59) }

(SplitCoordinator pid=8980, ip=10.0.60.59) Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(split_text)->MapBatches(tokenize)] -> OutputSplitter[split(16, equal=True)]
(SplitCoordinator pid=8980, ip=10.0.60.59) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=2000000000.0), locality_with_output=['6002ded0aaa53ce9a0351d22a72b344ef411a422919132f41d9f937a', 'd3bbd390b6fe73f26202f96d75998946cf3e8b457528d426db0c6e07', 'fe6aaf54317ee630a02d23e0d49581b57b5cd51316eaf769e28bb045', 'f7de4694a4f764c05a9c51a6a4bd40ac33f3fced3b25127b25cd4ac3', '42866a2fba4ce2ab4b6645c4d731d486b762e2b23ac24cafccba7096', '8a7272830662c7e756a656de0a9b433a3a1f9b990768f692b6fe11a7', 'bba62e8b57552509c62a6b6b7fd67c1a2280b9d81b3d9c41eb4d1b9b', 'b40764f303538c24bc439106f2e7b2144d382bfed6c9fdec15ab828e', 'd1de4d4b6d44eff93857026df4ef0f70e24e3dc91e15d87015f2ed32', '4d6a9dc1aa7bfc80cb73d9f66f4e28041807f12769391f5643bce143', '8bcc7235f459b61be21fe158d0bae4fef2ec6de013ec60e7aaf7897a', '73c50b995811afa0ece70fd3d4466b7fd0dc85a97d6807128b2c47da', '03bf3d374a9f857b1cd1aebdbe028208f7904b077fb151790e03e9fe', '9f7fc101a7d6b3e17b72e57ca1c92f91d13aa385a6740f99d58ec016', '867844d104a8e9351a1dcc8bbd61d99906a8dc5b53e220c2ae2efbe1', '7677b344c59d6b30c3db451f48e346d61bb60cc798e5567aa4e0a1ea'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(SplitCoordinator pid=8980, ip=10.0.60.59) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
(MapBatches(split_text)->MapBatches(tokenize) pid=10097, ip=10.0.60.59) 2023-08-18 18:13:42.547741: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
(MapBatches(split_text)->MapBatches(tokenize) pid=10097, ip=10.0.60.59) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(MapBatches(split_text)->MapBatches(tokenize) pid=10097, ip=10.0.60.59) 2023-08-18 18:13:42.685843: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
(MapBatches(split_text)->MapBatches(tokenize) pid=10097, ip=10.0.60.59) 2023-08-18 18:13:43.506819: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
(MapBatches(split_text)->MapBatches(tokenize) pid=10097, ip=10.0.60.59) 2023-08-18 18:13:43.506880: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
(MapBatches(split_text)->MapBatches(tokenize) pid=10097, ip=10.0.60.59) 2023-08-18 18:13:43.506887: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.

(RayTrainWorker pid=8911, ip=10.0.60.59) Time to load utils op: 0.0003864765167236328 seconds [repeated 14x across cluster]
(RayTrainWorker pid=8851, ip=10.0.43.240) {'loss': 12.1235, 'learning_rate': 1.9761904761904763e-05, 'epoch': 0.01}
(RayTrainWorker pid=8851, ip=10.0.43.240) {'loss': 6.7834, 'learning_rate': 1.9523809523809524e-05, 'epoch': 0.02} [repeated 16x across cluster]
(RayTrainWorker pid=8857, ip=10.0.44.114) {'loss': 2.2151, 'learning_rate': 1.928571428571429e-05, 'epoch': 0.04} [repeated 16x across cluster]
(RayTrainWorker pid=9631, ip=10.0.57.153) {'loss': 0.1739, 'learning_rate': 1.904761904761905e-05, 'epoch': 0.05} [repeated 16x across cluster]
(autoscaler +8m53s) [workspace snapshot] New snapshot created successfully (size: 172.58 MB).
(RayTrainWorker pid=8858, ip=10.0.0.119) {'loss': 0.121, 'learning_rate': 1.880952380952381e-05, 'epoch': 0.06} [repeated 16x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) {'loss': 0.1422, 'learning_rate': 1.8571428571428575e-05, 'epoch': 0.07} [repeated 16x across cluster]
(RayTrainWorker pid=36249, ip=10.0.11.26) {'loss': 0.1007, 'learning_rate': 1.8333333333333333e-05, 'epoch': 0.08} [repeated 16x across cluster]
(RayTrainWorker pid=8867, ip=10.0.49.236) {'loss': 0.1082, 'learning_rate': 1.8095238095238097e-05, 'epoch': 0.1} [repeated 16x across cluster]
(RayTrainWorker pid=8858, ip=10.0.0.119) {'loss': 0.094, 'learning_rate': 1.785714285714286e-05, 'epoch': 0.11} [repeated 16x across cluster]
(RayTrainWorker pid=8880, ip=10.0.63.99) {'loss': 0.0936, 'learning_rate': 1.761904761904762e-05, 'epoch': 0.12} [repeated 16x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:18:36,553] [INFO] [logging.py:96:log_dist] [Rank 0] step=10, skipped=0, lr=[1.761904761904762e-05], mom=[[0.9, 0.999]]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:18:36,554] [INFO] [timer.py:199:stop] epoch=0/micro_step=10/global_step=10, RunningAvgSamplesPerSec=4.768458258762969, CurrSamplesPerSec=4.833942877725304, MemAllocated=0.16GB, MaxMemAllocated=8.93GB
(RayTrainWorker pid=8857, ip=10.0.44.114) {'loss': 0.0921, 'learning_rate': 1.7380952380952384e-05, 'epoch': 0.13} [repeated 16x across cluster]
(RayTrainWorker pid=8851, ip=10.0.43.240) {'loss': 0.0915, 'learning_rate': 1.7142857142857142e-05, 'epoch': 0.14} [repeated 16x across cluster]
(RayTrainWorker pid=49329) {'loss': 0.0883, 'learning_rate': 1.6904761904761906e-05, 'epoch': 0.15} [repeated 16x across cluster]
(RayTrainWorker pid=36311, ip=10.0.27.53) {'loss': 0.0868, 'learning_rate': 1.6666666666666667e-05, 'epoch': 0.17} [repeated 16x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) {'loss': 0.0815, 'learning_rate': 1.642857142857143e-05, 'epoch': 0.18} [repeated 16x across cluster]
(autoscaler +13m58s) [workspace snapshot] New snapshot created successfully (size: 172.58 MB).
(RayTrainWorker pid=8875, ip=10.0.0.80) {'loss': 0.0825, 'learning_rate': 1.6190476190476193e-05, 'epoch': 0.19} [repeated 16x across cluster]
(RayTrainWorker pid=49329) {'loss': 0.0813, 'learning_rate': 1.5952380952380954e-05, 'epoch': 0.2} [repeated 16x across cluster]
(RayTrainWorker pid=8880, ip=10.0.63.99) {'loss': 0.0816, 'learning_rate': 1.5714285714285715e-05, 'epoch': 0.21} [repeated 16x across cluster]
(RayTrainWorker pid=8851, ip=10.0.43.240) {'loss': 0.0813, 'learning_rate': 1.5476190476190476e-05, 'epoch': 0.23} [repeated 16x across cluster]
(RayTrainWorker pid=8885, ip=10.0.47.209) {'loss': 0.0765, 'learning_rate': 1.523809523809524e-05, 'epoch': 0.24} [repeated 16x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:23:03,756] [INFO] [logging.py:96:log_dist] [Rank 0] step=20, skipped=0, lr=[1.523809523809524e-05], mom=[[0.9, 0.999]]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:23:03,756] [INFO] [timer.py:199:stop] epoch=0/micro_step=20/global_step=20, RunningAvgSamplesPerSec=4.781402482813706, CurrSamplesPerSec=4.7832870646183325, MemAllocated=0.16GB, MaxMemAllocated=8.93GB
(RayTrainWorker pid=8858, ip=10.0.0.119) {'loss': 0.0833, 'learning_rate': 1.5000000000000002e-05, 'epoch': 0.25} [repeated 16x across cluster]
(RayTrainWorker pid=8857, ip=10.0.44.114) {'loss': 0.084, 'learning_rate': 1.4761904761904763e-05, 'epoch': 0.26} [repeated 16x across cluster]
(RayTrainWorker pid=49329) {'loss': 0.0839, 'learning_rate': 1.4523809523809524e-05, 'epoch': 0.27} [repeated 16x across cluster]
(RayTrainWorker pid=8867, ip=10.0.49.236) {'loss': 0.0825, 'learning_rate': 1.4285714285714287e-05, 'epoch': 0.29} [repeated 16x across cluster]
(RayTrainWorker pid=36262, ip=10.0.52.191) {'loss': 0.0838, 'learning_rate': 1.4047619047619048e-05, 'epoch': 0.3} [repeated 16x across cluster]
(RayTrainWorker pid=8867, ip=10.0.49.236) {'loss': 0.0847, 'learning_rate': 1.3809523809523811e-05, 'epoch': 0.31} [repeated 16x across cluster]
(autoscaler +18m58s) [workspace snapshot] New snapshot created successfully (size: 172.58 MB).
(RayTrainWorker pid=8851, ip=10.0.43.240) {'loss': 0.0788, 'learning_rate': 1.3571428571428574e-05, 'epoch': 0.32} [repeated 16x across cluster]
(RayTrainWorker pid=8875, ip=10.0.0.80) {'loss': 0.0832, 'learning_rate': 1.3333333333333333e-05, 'epoch': 0.33} [repeated 16x across cluster]
(RayTrainWorker pid=8875, ip=10.0.0.80) {'loss': 0.0811, 'learning_rate': 1.3095238095238096e-05, 'epoch': 0.35} [repeated 16x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) {'loss': 0.0759, 'learning_rate': 1.2857142857142859e-05, 'epoch': 0.36} [repeated 16x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:27:35,516] [INFO] [logging.py:96:log_dist] [Rank 0] step=30, skipped=0, lr=[1.2857142857142859e-05], mom=[[0.9, 0.999]]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:27:35,517] [INFO] [timer.py:199:stop] epoch=0/micro_step=30/global_step=30, RunningAvgSamplesPerSec=4.756191577689035, CurrSamplesPerSec=4.775146730091594, MemAllocated=0.16GB, MaxMemAllocated=8.93GB
(RayTrainWorker pid=8875, ip=10.0.0.80) {'loss': 0.0774, 'learning_rate': 1.261904761904762e-05, 'epoch': 0.37} [repeated 16x across cluster]
(RayTrainWorker pid=8858, ip=10.0.0.119) {'loss': 0.0751, 'learning_rate': 1.2380952380952383e-05, 'epoch': 0.38} [repeated 16x across cluster]
(RayTrainWorker pid=36311, ip=10.0.27.53) {'loss': 0.0744, 'learning_rate': 1.2142857142857142e-05, 'epoch': 0.39} [repeated 16x across cluster]
(RayTrainWorker pid=8845, ip=10.0.18.195) {'loss': 0.0722, 'learning_rate': 1.1904761904761905e-05, 'epoch': 0.4} [repeated 16x across cluster]
(RayTrainWorker pid=8880, ip=10.0.63.99) {'loss': 0.0742, 'learning_rate': 1.1666666666666668e-05, 'epoch': 0.42} [repeated 16x across cluster]
(RayTrainWorker pid=8857, ip=10.0.44.114) {'loss': 0.0764, 'learning_rate': 1.1428571428571429e-05, 'epoch': 0.43} [repeated 16x across cluster]
(RayTrainWorker pid=8851, ip=10.0.43.240) {'loss': 0.0786, 'learning_rate': 1.1190476190476192e-05, 'epoch': 0.44} [repeated 16x across cluster]
(autoscaler +24m4s) [workspace snapshot] New snapshot created successfully (size: 172.58 MB).
(RayTrainWorker pid=9631, ip=10.0.57.153) {'loss': 0.0738, 'learning_rate': 1.0952380952380955e-05, 'epoch': 0.45} [repeated 16x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) {'loss': 0.0784, 'learning_rate': 1.0714285714285714e-05, 'epoch': 0.46} [repeated 16x across cluster]
(RayTrainWorker pid=8885, ip=10.0.47.209) {'loss': 0.0786, 'learning_rate': 1.0476190476190477e-05, 'epoch': 0.48} [repeated 16x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:32:06,009] [INFO] [logging.py:96:log_dist] [Rank 0] step=40, skipped=0, lr=[1.0476190476190477e-05], mom=[[0.9, 0.999]]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:32:06,009] [INFO] [timer.py:199:stop] epoch=0/micro_step=40/global_step=40, RunningAvgSamplesPerSec=4.750214082000028, CurrSamplesPerSec=4.781755388354574, MemAllocated=0.16GB, MaxMemAllocated=8.93GB
(RayTrainWorker pid=8845, ip=10.0.18.195) {'loss': 0.0714, 'learning_rate': 1.0238095238095238e-05, 'epoch': 0.49} [repeated 16x across cluster]
(RayTrainWorker pid=36311, ip=10.0.27.53) {'loss': 0.0739, 'learning_rate': 1e-05, 'epoch': 0.5} [repeated 16x across cluster]
(RayTrainWorker pid=8851, ip=10.0.43.240) {'loss': 0.0767, 'learning_rate': 9.761904761904762e-06, 'epoch': 0.51} [repeated 16x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) {'loss': 0.0827, 'learning_rate': 9.523809523809525e-06, 'epoch': 0.52} [repeated 16x across cluster]
(RayTrainWorker pid=36262, ip=10.0.52.191) {'loss': 0.0751, 'learning_rate': 9.285714285714288e-06, 'epoch': 0.54} [repeated 16x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) {'loss': 0.0737, 'learning_rate': 9.047619047619049e-06, 'epoch': 0.55} [repeated 16x across cluster]
(RayTrainWorker pid=8885, ip=10.0.47.209) {'loss': 0.0755, 'learning_rate': 8.80952380952381e-06, 'epoch': 0.56} [repeated 16x across cluster]
(RayTrainWorker pid=49329) {'loss': 0.0745, 'learning_rate': 8.571428571428571e-06, 'epoch': 0.57} [repeated 16x across cluster]
(RayTrainWorker pid=9631, ip=10.0.57.153) {'loss': 0.0753, 'learning_rate': 8.333333333333334e-06, 'epoch': 0.58} [repeated 16x across cluster]
(autoscaler +29m9s) [workspace snapshot] New snapshot created successfully (size: 172.59 MB).
(RayTrainWorker pid=8845, ip=10.0.18.195) {'loss': 0.0739, 'learning_rate': 8.095238095238097e-06, 'epoch': 0.6} [repeated 16x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:36:34,033] [INFO] [logging.py:96:log_dist] [Rank 0] step=50, skipped=0, lr=[8.095238095238097e-06], mom=[[0.9, 0.999]]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:36:34,033] [INFO] [timer.py:199:stop] epoch=0/micro_step=50/global_step=50, RunningAvgSamplesPerSec=4.75579745222066, CurrSamplesPerSec=4.705258125568294, MemAllocated=0.16GB, MaxMemAllocated=8.93GB
(RayTrainWorker pid=8885, ip=10.0.47.209) {'loss': 0.073, 'learning_rate': 7.857142857142858e-06, 'epoch': 0.61} [repeated 16x across cluster]
(RayTrainWorker pid=8830, ip=10.0.30.35) {'loss': 0.0721, 'learning_rate': 7.61904761904762e-06, 'epoch': 0.62} [repeated 16x across cluster]
(RayTrainWorker pid=36311, ip=10.0.27.53) {'loss': 0.0729, 'learning_rate': 7.380952380952382e-06, 'epoch': 0.63} [repeated 16x across cluster]
(RayTrainWorker pid=8880, ip=10.0.63.99) {'loss': 0.0714, 'learning_rate': 7.1428571428571436e-06, 'epoch': 0.64} [repeated 16x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) {'loss': 0.0745, 'learning_rate': 6.9047619047619055e-06, 'epoch': 0.65} [repeated 16x across cluster]
(RayTrainWorker pid=9631, ip=10.0.57.153) {'loss': 0.0726, 'learning_rate': 6.666666666666667e-06, 'epoch': 0.67} [repeated 16x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) {'loss': 0.0699, 'learning_rate': 6.4285714285714295e-06, 'epoch': 0.68} [repeated 16x across cluster]
(RayTrainWorker pid=36262, ip=10.0.52.191) {'loss': 0.0732, 'learning_rate': 6.1904761904761914e-06, 'epoch': 0.69} [repeated 16x across cluster]
(RayTrainWorker pid=49329) {'loss': 0.0714, 'learning_rate': 5.9523809523809525e-06, 'epoch': 0.7} [repeated 16x across cluster]
(RayTrainWorker pid=8851, ip=10.0.43.240) {'loss': 0.0709, 'learning_rate': 5.7142857142857145e-06, 'epoch': 0.71} [repeated 16x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:41:07,338] [INFO] [logging.py:96:log_dist] [Rank 0] step=60, skipped=0, lr=[5.7142857142857145e-06], mom=[[0.9, 0.999]]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:41:07,338] [INFO] [timer.py:199:stop] epoch=0/micro_step=60/global_step=60, RunningAvgSamplesPerSec=4.74341422313603, CurrSamplesPerSec=4.640637786972311, MemAllocated=0.16GB, MaxMemAllocated=8.93GB
(autoscaler +34m9s) [workspace snapshot] New snapshot created successfully (size: 172.59 MB).
(RayTrainWorker pid=8875, ip=10.0.0.80) {'loss': 0.071, 'learning_rate': 5.476190476190477e-06, 'epoch': 0.73} [repeated 16x across cluster]
(RayTrainWorker pid=36311, ip=10.0.27.53) {'loss': 0.0714, 'learning_rate': 5.2380952380952384e-06, 'epoch': 0.74} [repeated 16x across cluster]
(RayTrainWorker pid=8875, ip=10.0.0.80) {'loss': 0.0703, 'learning_rate': 5e-06, 'epoch': 0.75} [repeated 16x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) {'loss': 0.0733, 'learning_rate': 4.761904761904762e-06, 'epoch': 0.76} [repeated 16x across cluster]
(RayTrainWorker pid=8845, ip=10.0.18.195) {'loss': 0.0686, 'learning_rate': 4.523809523809524e-06, 'epoch': 0.77} [repeated 16x across cluster]
(RayTrainWorker pid=8851, ip=10.0.43.240) {'loss': 0.068, 'learning_rate': 4.2857142857142855e-06, 'epoch': 0.79} [repeated 16x across cluster]
(RayTrainWorker pid=36311, ip=10.0.27.53) {'loss': 0.071, 'learning_rate': 4.047619047619048e-06, 'epoch': 0.8} [repeated 16x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) {'loss': 0.0708, 'learning_rate': 3.80952380952381e-06, 'epoch': 0.81} [repeated 16x across cluster]
(RayTrainWorker pid=36311, ip=10.0.27.53) {'loss': 0.0766, 'learning_rate': 3.5714285714285718e-06, 'epoch': 0.82} [repeated 16x across cluster]
(RayTrainWorker pid=8858, ip=10.0.0.119) {'loss': 0.0743, 'learning_rate': 3.3333333333333333e-06, 'epoch': 0.83} [repeated 16x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:45:31,965] [INFO] [logging.py:96:log_dist] [Rank 0] step=70, skipped=0, lr=[3.3333333333333333e-06], mom=[[0.9, 0.999]]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:45:31,965] [INFO] [timer.py:199:stop] epoch=0/micro_step=70/global_step=70, RunningAvgSamplesPerSec=4.757168325507401, CurrSamplesPerSec=4.8146031804109555, MemAllocated=0.16GB, MaxMemAllocated=8.93GB
(RayTrainWorker pid=8830, ip=10.0.30.35) {'loss': 0.0752, 'learning_rate': 3.3333333333333333e-06, 'epoch': 0.85} [repeated 16x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:45:58,184] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 256, but hysteresis is 2. Reducing hysteresis to 1
(autoscaler +39m14s) [workspace snapshot] New snapshot created successfully (size: 172.59 MB).
(RayTrainWorker pid=8845, ip=10.0.18.195) {'loss': 0.0717, 'learning_rate': 3.0952380952380957e-06, 'epoch': 0.86} [repeated 16x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:46:26,433] [WARNING] [stage3.py:1826:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
(RayTrainWorker pid=36675, ip=10.0.13.222) {'loss': 0.0695, 'learning_rate': 2.8571428571428573e-06, 'epoch': 0.87} [repeated 16x across cluster]
(RayTrainWorker pid=36249, ip=10.0.11.26) {'loss': 0.0709, 'learning_rate': 2.6190476190476192e-06, 'epoch': 0.88} [repeated 16x across cluster]
(RayTrainWorker pid=8885, ip=10.0.47.209) {'loss': 0.0729, 'learning_rate': 2.380952380952381e-06, 'epoch': 0.89} [repeated 16x across cluster]
(RayTrainWorker pid=8880, ip=10.0.63.99) {'loss': 0.0752, 'learning_rate': 2.1428571428571427e-06, 'epoch': 0.9} [repeated 16x across cluster]
(RayTrainWorker pid=8845, ip=10.0.18.195) {'loss': 0.0712, 'learning_rate': 1.904761904761905e-06, 'epoch': 0.92} [repeated 16x across cluster]
(RayTrainWorker pid=9631, ip=10.0.57.153) {'loss': 0.0708, 'learning_rate': 1.6666666666666667e-06, 'epoch': 0.93} [repeated 16x across cluster]
(RayTrainWorker pid=36249, ip=10.0.11.26) {'loss': 0.0723, 'learning_rate': 1.4285714285714286e-06, 'epoch': 0.94} [repeated 16x across cluster]
(RayTrainWorker pid=8845, ip=10.0.18.195) {'loss': 0.0689, 'learning_rate': 1.1904761904761906e-06, 'epoch': 0.95} [repeated 16x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:50:01,494] [INFO] [logging.py:96:log_dist] [Rank 0] step=80, skipped=1, lr=[1.1904761904761906e-06], mom=[[0.9, 0.999]]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:50:01,494] [INFO] [timer.py:199:stop] epoch=0/micro_step=80/global_step=80, RunningAvgSamplesPerSec=4.756310378443122, CurrSamplesPerSec=4.758170892979721, MemAllocated=0.16GB, MaxMemAllocated=8.93GB
(RayTrainWorker pid=36675, ip=10.0.13.222) {'loss': 0.0715, 'learning_rate': 9.523809523809525e-07, 'epoch': 0.96} [repeated 16x across cluster]
(RayTrainWorker pid=8880, ip=10.0.63.99) {'loss': 0.07, 'learning_rate': 7.142857142857143e-07, 'epoch': 0.98} [repeated 16x across cluster]
(RayTrainWorker pid=9631, ip=10.0.57.153) {'loss': 0.0716, 'learning_rate': 4.7619047619047623e-07, 'epoch': 0.99} [repeated 16x across cluster]
(autoscaler +44m19s) [workspace snapshot] New snapshot created successfully (size: 172.60 MB).
(RayTrainWorker pid=8880, ip=10.0.63.99) {'loss': 0.069, 'learning_rate': 2.3809523809523811e-07, 'epoch': 1.0} [repeated 16x across cluster]

(RayTrainWorker pid=8911, ip=10.0.60.59) Saving model checkpoint to output/checkpoint-84
(RayTrainWorker pid=8911, ip=10.0.60.59) Configuration saved in output/checkpoint-84/config.json
(RayTrainWorker pid=8911, ip=10.0.60.59) Configuration saved in output/checkpoint-84/generation_config.json
(RayTrainWorker pid=8911, ip=10.0.60.59) Using /home/ray/.cache/torch_extensions/py39_cu118 as PyTorch extensions root... [repeated 15x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) No modifications detected for re-loaded extension module utils, skipping build step... [repeated 15x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) Loading extension module utils... [repeated 14x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) ***** Running training ***** [repeated 15x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59)   Num examples = 10752 [repeated 15x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59)   Num Epochs = 9223372036854775807 [repeated 15x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59)   Instantaneous batch size per device = 8 [repeated 15x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59)   Total train batch size (w. parallel, distributed & accumulation) = 128 [repeated 15x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59)   Gradient Accumulation steps = 1 [repeated 15x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59)   Total optimization steps = 84 [repeated 15x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59)   Number of trainable parameters = 0 [repeated 15x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) Model weights saved in output/checkpoint-84/pytorch_model.bin
(RayTrainWorker pid=8911, ip=10.0.60.59) tokenizer config file saved in output/checkpoint-84/tokenizer_config.json
(RayTrainWorker pid=8911, ip=10.0.60.59) Special tokens file saved in output/checkpoint-84/special_tokens_map.json

(RayTrainWorker pid=49329) [2023-08-18 18:52:12,213] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step84 is ready now!
(RayTrainWorker pid=36249, ip=10.0.11.26) {'loss': 0.069, 'learning_rate': 2.3809523809523811e-07, 'epoch': 1.0} [repeated 15x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:52:12,213] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step84 is about to be saved!
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:52:12,213] [INFO] [engine.py:3337:save_16bit_model] Saving model weights to output/checkpoint-84/pytorch_model.bin, tag: global_step84
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:52:12,213] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving output/checkpoint-84/pytorch_model.bin...

(RayTrainWorker pid=49329) /home/ray/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
(RayTrainWorker pid=49329)   warnings.warn(

(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:52:27,660] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved output/checkpoint-84/pytorch_model.bin.
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:52:27,673] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step84 is about to be saved!
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:52:27,684] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: output/checkpoint-84/global_step84/zero_pp_rank_0_mp_rank_00_model_states.pt
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:52:27,685] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving output/checkpoint-84/global_step84/zero_pp_rank_0_mp_rank_00_model_states.pt...
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:52:27,660] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step84 is ready now! [repeated 15x across cluster]
(RayTrainWorker pid=36262, ip=10.0.52.191) [2023-08-18 18:52:27,685] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving output/checkpoint-84/global_step84/zero_pp_rank_15_mp_rank_00_model_states.pt...
(RayTrainWorker pid=9631, ip=10.0.57.153) [2023-08-18 18:52:32,337] [INFO] [engine.py:3228:_save_zero_checkpoint] zero checkpoint saved output/checkpoint-84/global_step84/zero_pp_rank_14_mp_rank_00_optim_states.pt
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:52:36,011] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved output/checkpoint-84/global_step84/zero_pp_rank_0_mp_rank_00_optim_states.pt. [repeated 32x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) [2023-08-18 18:52:27,684] [INFO] [logging.py:96:log_dist] [Rank 1] Saving model checkpoint: output/checkpoint-84/global_step84/zero_pp_rank_1_mp_rank_00_model_states.pt
(RayTrainWorker pid=8867, ip=10.0.49.236) [2023-08-18 18:52:27,873] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving output/checkpoint-84/global_step84/zero_pp_rank_3_mp_rank_00_optim_states.pt... [repeated 30x across cluster]
(RayTrainWorker pid=36311, ip=10.0.27.53) [2023-08-18 18:52:36,193] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step84 is ready now!

(RayTrainWorker pid=8885, ip=10.0.47.209) 
(RayTrainWorker pid=8885, ip=10.0.47.209) 
(RayTrainWorker pid=8885, ip=10.0.47.209) Training completed. Do not forget to share your model on huggingface.co/models =)
(RayTrainWorker pid=8885, ip=10.0.47.209) 
(RayTrainWorker pid=8885, ip=10.0.47.209) 
(RayTrainWorker pid=8867, ip=10.0.49.236) /home/ray/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details. [repeated 15x across cluster]
(RayTrainWorker pid=8867, ip=10.0.49.236)   warnings.warn( [repeated 15x across cluster]
2023-08-18 18:53:44,782	WARNING syncer.py:853 -- Ray AIR no longer supports the synchronization of checkpoints and other artifacts from worker nodes to the head node. This means that the checkpoints and artifacts saved by trials scheduled on worker nodes will not be accessible during the run (e.g., resuming from a checkpoint after a failure) or after the run (e.g., loading the checkpoint of a trial that ran on an already terminated worker node).

To fix this issue, configure AIR to use either:
(1) Cloud storage: `RunConfig(storage_path='s3://your/bucket')`
(2) A network filesystem mounted on all nodes: `RunConfig(storage_path='/mnt/path/to/nfs_storage')`
See this Github issue for more details on transitioning to cloud storage/NFS as well as an explanation on why this functionality is being removed: https://github.com/ray-project/ray/issues/37177
If you are already using NFS, you can ignore this warning message.

Other temporary workarounds:
- If you want to avoid errors/warnings and continue running with syncing explicitly turned off, set `RunConfig(SyncConfig(syncer=None))`
- Or, to re-enable the head node syncing behavior, set the environment variable RAY_AIR_REENABLE_DEPRECATED_SYNC_TO_HEAD_NODE=1
  - **Note that this functionality will tentatively be hard-deprecated in Ray 2.7.** See the linked issue for the latest information.

(RayTrainWorker pid=36262, ip=10.0.52.191) {'train_runtime': 2355.3551, 'train_samples_per_second': 4.565, 'train_steps_per_second': 0.036, 'train_loss': 0.32820896875290645, 'epoch': 1.0}
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:52:36,012] [INFO] [engine.py:3228:_save_zero_checkpoint] zero checkpoint saved output/checkpoint-84/global_step84/zero_pp_rank_0_mp_rank_00_optim_states.pt [repeated 15x across cluster]
(RayTrainWorker pid=8875, ip=10.0.0.80) [2023-08-18 18:52:36,193] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step84 is ready now! [repeated 15x across cluster]

(RayTrainWorker pid=8911, ip=10.0.60.59)  [repeated 60x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) Training completed. Do not forget to share your model on huggingface.co/models =) [repeated 15x across cluster]
2023-08-18 18:54:02,594	INFO tune.py:1146 -- Total run time: 2691.03 seconds (2676.82 seconds for the tuning loop).
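
The figures reported in the log above can be cross-checked with a little arithmetic. The following is a minimal sketch, not part of the training script: it copies the worker count, per-device batch size, example count, and runtime from the log. The base learning rate of 2e-5 and the linear decay without warmup are assumptions inferred from the logged `learning_rate` values; this section does not state them.

```python
import math

# Values copied from the training log above.
num_workers = 16
per_device_batch_size = 8
gradient_accumulation_steps = 1
num_examples = 10752
train_runtime_s = 2355.3551

# Global batch size and number of optimizer steps in one epoch.
global_batch_size = num_workers * per_device_batch_size * gradient_accumulation_steps
total_steps = math.ceil(num_examples / global_batch_size)

# Throughput as reported in the final summary line.
samples_per_second = num_examples / train_runtime_s
steps_per_second = total_steps / train_runtime_s

print(global_batch_size)             # 128
print(total_steps)                   # 84
print(round(samples_per_second, 3))  # 4.565
print(round(steps_per_second, 3))    # 0.036

# Assumption: linear decay from an inferred base rate of 2e-5 to 0, no warmup.
assumed_base_lr = 2e-5


def lr_after_step(step: int) -> float:
    """Learning rate after `step` scheduler updates under linear decay."""
    return assumed_base_lr * (total_steps - step) / total_steps


for step in (1, 10, 70):
    print(step, f"{lr_after_step(step):.4e}")
# 1 1.9762e-05
# 10 1.7619e-05
# 70 3.3333e-06
```

The log also shows one skipped step after a loss-scale overflow (`skipped=1`), which is why the learning rate 3.33e-06 appears on two consecutive log lines and the final logged rate is one decrement above zero.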

Use the returned Result object to access metrics and the Ray Train Checkpoint associated with the last iteration.

checkpoint = results.checkpoint
checkpoint

Checkpoint(filesystem=<pyarrow._s3fs.S3FileSystem object at 0x7f8c59d311b0>, path=anyscale-staging-data-cld-kvedzwag2qa8i5bjxuevf5i7/org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage/yunxuan__xiao/gptj-deepspeed-finetune/TorchTrainer_2023-08-18_18-09-11/TorchTrainer_01ea5_00000_0_2023-08-18_18-09-12/checkpoint_000000)

Generate text from prompt#

First, download the persistent Ray Train checkpoint locally and load the fine-tuned model weights and tokenizer from the checkpoint. Then use the 🤗 Transformers pipeline to generate predictions from the fine-tuned model.

Tip

For large-scale batch inference, see End-to-end: Offline Batch Inference.

import os

os.system(f"aws s3 sync s3://{checkpoint.path} /mnt/local_storage/")
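
Note that `os.system` only returns the exit status and does not raise if the sync fails. As a hedged alternative, the sketch below runs the same `aws s3 sync` command through `subprocess.run` with `check=True`, so a failed download stops the notebook instead of surfacing later as a missing checkpoint. The helper names `build_sync_command` and `sync_checkpoint` are hypothetical and only for illustration.

```python
import subprocess
from typing import List


def build_sync_command(s3_path: str, local_dir: str) -> List[str]:
    """Build the argv for `aws s3 sync`; `s3_path` is given without the s3:// scheme."""
    return ["aws", "s3", "sync", f"s3://{s3_path}", local_dir]


def sync_checkpoint(s3_path: str, local_dir: str) -> None:
    """Download the checkpoint and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(build_sync_command(s3_path, local_dir), check=True)


# Usage in this notebook (requires the AWS CLI and credentials):
# sync_checkpoint(checkpoint.path, "/mnt/local_storage/")
```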

Set the task to "text-generation", and also set device_map="auto" so that 🤗 Accelerate, which Transformers uses for device placement, automatically places the model on the available devices.

import torch
from transformers import pipeline, AutoTokenizer, GPTJForCausalLM

# Load the fine-tuned weights and tokenizer from the downloaded checkpoint.
model = GPTJForCausalLM.from_pretrained("/mnt/local_storage/checkpoint")
tokenizer = AutoTokenizer.from_pretrained("/mnt/local_storage/checkpoint")

# Build a text-generation pipeline around the loaded model.
pipe = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Generate from prompts!
for sentence in pipe(
    ["Romeo and Juliet", "Romeo", "Juliet"], do_sample=True, min_length=20
):
    print(sentence)

[{'generated_text': 'Romeo and Juliet. This very night shall they come. A word with you, sir.'}]
[{'generated_text': 'Romeo! I know thee not. Lord Mercutio, is it you! Signior Montague.'}]
[{'generated_text': 'Juliet, look up in the vault, and there shalt find a grave; within the monument there is a table:'}]
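
As the output above shows, the pipeline returns one list of candidate dictionaries per prompt, and by default each `generated_text` starts with the prompt itself. The following sketch shows one way to pair each prompt with its first candidate and separate the continuation from the prompt. It uses the sample output above as literal data so that it runs without the model; `first_generations` is a hypothetical helper name.

```python
# Sample data copied from the output above: one list of candidates per prompt.
prompts = ["Romeo and Juliet", "Romeo", "Juliet"]
outputs = [
    [{"generated_text": "Romeo and Juliet. This very night shall they come. A word with you, sir."}],
    [{"generated_text": "Romeo! I know thee not. Lord Mercutio, is it you! Signior Montague."}],
    [{"generated_text": "Juliet, look up in the vault, and there shalt find a grave; within the monument there is a table:"}],
]


def first_generations(prompts, outputs):
    """Map each prompt to the text of its first generated candidate."""
    return {
        prompt: candidates[0]["generated_text"]
        for prompt, candidates in zip(prompts, outputs)
    }


completions = first_generations(prompts, outputs)
for prompt, text in completions.items():
    # Drop the echoed prompt to keep only the newly generated continuation.
    continuation = text[len(prompt):]
    print(f"{prompt!r} -> {continuation!r}")
```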