{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Fine-tune `vicuna-13b` with Lightning and DeepSpeed\n",
    "\n",
    "<a id=\"try-anyscale-quickstart-vicuna_13b_lightning_deepspeed_finetune\" href=\"https://console.anyscale.com/register/ha?render_flow=ray&utm_source=ray_docs&utm_medium=docs&utm_campaign=vicuna_13b_lightning_deepspeed_finetune\">\n",
    "    <img src=\"../../../_static/img/run-on-anyscale.svg\" alt=\"try-anyscale-quickstart\">\n",
    "</a>\n",
    "<br></br>\n",
    "\n",
    "In this example, we will demonstrate how to perform full fine-tuning for a [`vicuna-13b-v1.3`](https://huggingface.co/lmsys/vicuna-13b-v1.3) model using Ray Train PyTorch Lightning integrations with the DeepSpeed ZeRO-3 strategy.\n",
    "\n",
    "- [DeepSpeed](<https://github.com/microsoft/DeepSpeed>) is an open-source deep learning optimization library for PyTorch. It's designed to reduce computing power and memory usage, and to train large distributed models by leveraging state-of-the-art innovations like ZeRO, 3D-Parallelism, DeepSpeed-MoE, and ZeRO-Infinity. \n",
    "- PyTorch Lightning offers a [DeepSpeed integration](https://lightning.ai/docs/pytorch/stable/api/pytorch_lightning.strategies.DeepSpeedStrategy.html), which provides a simple interface to configure the knobs for DeepSpeed and automatically trigger your training process with the DeepSpeed Engine.\n",
    "- {class}`Ray TorchTrainer <ray.train.torch.TorchTrainer>` allows you to easily scale your PyTorch Lightning job across multiple nodes in a Ray cluster, without worrying about the underlying cluster management, autoscaling, and distributed process group settings.\n",
    "\n",
    "This demo shows how to combine these three tools to fine-tune the Vicuna-13B model, using the strengths of each to build an efficient, high-performance training setup.\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{note}\n",
    "This is an advanced example of Large Language Model fine-tuning with Ray Train. If you're a beginner or new to the concepts of Ray Train and our Lightning integrations, it would be beneficial to first explore the introductory documentation below to build a foundational understanding. \n",
    "- [Ray Train Key Concepts](train-key-concepts) \n",
    "- [Ray Data Quickstart](data_quickstart)\n",
    "- {doc}`[Basic] Image Classification with PyTorch Lightning and Ray Train <lightning_mnist_example>`\n",
    "- {doc}`[Intermediate] Fine-tuning Lightning Modules with Ray Data <lightning_cola_advanced>`\n",
    "```\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Cluster Setting\n",
    "\n",
    "\n",
    "### Compute instances\n",
    "In this example, we set up a Ray cluster on AWS with the following settings:\n",
    "\n",
    "|  | num | instance type | GPU per node | GPU Memory | CPU Memory |\n",
    "|-|-|-|-|-|-|\n",
    "|Head node|1|g5.16xlarge|1 x A10G | 24 GB | 256 GB|\n",
    "|Worker node|15|g5.4xlarge|1 x A10G | 24 GB | 64 GB|\n",
    "\n",
    "```{note}\n",
    "In this example, we used 16 A10G GPUs for model training and tuned the DeepSpeed configurations for this setup. If you have a different cluster setup or GPUs with lower memory capacities, you may need to modify the DeepSpeed configurations and batch size to fit the model into the GPUs.\n",
    "```\n",
    "\n",
    "```{tip}\n",
    "We selected a GPU instance with additional CPU memory for the head node to demonstrate single-node offline inference. If you are training only, you can still opt for the g5.4xlarge instance for the head node.\n",
    "```\n",
    "\n",
    "\n",
    "### Cloud Storage\n",
    "\n",
    "Additionally, since the checkpoint size for this 13B parameter model can be large (~140GB), we choose to store the checkpoints in AWS S3. Thanks to the newly introduced distributed checkpointing feature in Ray 2.5, each worker can upload its own shards individually to the S3 bucket, greatly reducing the latency and network traffic of checkpoint syncing.\n",
    "\n",
    "### Local Storage\n",
    "To demonstrate offline inference, we need to download and consolidate the model checkpoint onto the head node. This action requires around 200GB disk storage. Therefore, we mounted the NVMe SSD provided by g5 instances at `/dev/nvme1n1` to `/mnt/local_storage`, and we will save the checkpoints in this folder.\n",
    "\n",
    "For more details, see [Amazon EBS and NVMe on Linux instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/nvme-ebs-volumes.html).\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup Ray Environment\n",
    "\n",
    "We define a runtime environment to ensure that the Ray workers have access to all necessary packages. If you have already included these dependencies in your Docker image or installed them on each node, you can ignore the `runtime_env` argument.\n",
    "\n",
    "```{note}\n",
    "The codebases of `transformers`, `accelerate`, and `deepspeed` change rapidly, so we pinned the package versions here for testing stability. You can try other version combinations; feel free to report any issues you encounter.\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "os.environ[\"RAY_TRAIN_V2_ENABLED\"] = \"1\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import ray\n",
    "\n",
    "NUM_WORKERS = 16\n",
    "BATCH_SIZE_PER_WORKER = 8\n",
    "MODEL_NAME = \"lmsys/vicuna-13b-v1.3\"\n",
    "\n",
    "ray.init(\n",
    "    runtime_env={\n",
    "        \"pip\": [\n",
    "            \"datasets\",\n",
    "            \"torch>=1.13.0\",\n",
    "            \"deepspeed==0.12.3\",\n",
    "            \"accelerate==0.20.3\",\n",
    "            \"transformers==4.30.2\",\n",
    "            \"lightning==2.0.3\",\n",
    "        ],\n",
    "    }\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load and preprocess datasets\n",
    "\n",
    "LLMs show impressive zero-shot text generation, but some perform poorly at code generation because their training corpora contain little code. The CMU [CoNaLa](https://conala-corpus.github.io/) (The Code/Natural Language Challenge) dataset was designed to test systems that generate program snippets from natural language. Each data record contains an intent sentence and a one-line code snippet. The goal is to fine-tune the Vicuna model on this dataset so that it generates correct, runnable code snippets that fulfill a natural-language intent. Here are some examples:\n",
    "\n",
    "| intent | code snippet |\n",
    "| - | - |\n",
    "| \"convert a list of integers into a single integer\" | `r = int(''.join(map(str, x)))`|\n",
    "| \"normalize a pandas dataframe `df` by row\" | `df.div(df.sum(axis=1), axis=0)` | \n",
    "| \"Convert string '03:55' into datetime.time object\" | `datetime.datetime.strptime('03:55', '%H:%M').time()` |\n",
    "\n",
    "The CoNaLa team has released a dataset crawled from Stack Overflow, automatically filtered, then curated by annotators, and split into 2,379 training and 500 test examples. They also provide an automatically mined dataset with 600k examples. In this demo, we fine-tune on all the curated training data plus the top 5,000 mined examples."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we preprocess the CoNaLa dataset with Ray Data. You can also use Hugging Face Datasets and load the data inside the training function with a regular PyTorch `DataLoader`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dataset({\n",
      "    features: ['question_id', 'intent', 'rewritten_intent', 'snippet', 'parent_answer_post_id', 'prob', 'id'],\n",
      "    num_rows: 7379\n",
      "})\n"
     ]
    }
   ],
   "source": [
    "import ray\n",
    "from transformers import AutoTokenizer\n",
    "from huggingface_hub import HfFileSystem\n",
    "fs = HfFileSystem()\n",
    "\n",
    "path = \"hf://datasets/neulab/conala/data\"\n",
    "mined_path = path + \"/conala-mined.json\"\n",
    "curated_path = path + \"/conala-paired-train.json\"\n",
    "\n",
    "curated = ray.data.read_json(curated_path, filesystem=fs)\n",
    "mined = ray.data.read_json(mined_path, filesystem=fs).limit(5000).materialize()\n",
    "\n",
    "ray_ds = mined.union(curated)\n",
    "\n",
    "# Build a prompt template for Vicuna-13b model\n",
    "PROMPT_TEMPLATE = \"Intent: {intent}\\nOne-line code snippet: {snippet}\"\n",
    "\n",
    "\n",
    "def fill_prompt(batch):\n",
    "    batch[\"input_sentence\"] = batch.apply(\n",
    "        lambda row: PROMPT_TEMPLATE.format(\n",
    "            intent=row[\"rewritten_intent\"]\n",
    "            if row[\"rewritten_intent\"]\n",
    "            else row[\"intent\"],\n",
    "            snippet=f\"`{row['snippet']}`\",\n",
    "        )\n",
    "        + \"</s>\",\n",
    "        axis=1,\n",
    "    )\n",
    "    return batch[[\"input_sentence\"]]\n",
    "\n",
    "\n",
    "# Tokenize input sentences to tensors\n",
    "def tokenize(batch):\n",
    "    tokenizer = AutoTokenizer.from_pretrained(\n",
    "        MODEL_NAME, padding_side=\"left\", use_fast=False\n",
    "    )\n",
    "    tokenizer.pad_token = tokenizer.eos_token\n",
    "    ret = tokenizer(\n",
    "        list(batch[\"input_sentence\"]),\n",
    "        truncation=True,\n",
    "        max_length=128,\n",
    "        padding=\"max_length\",\n",
    "        return_tensors=\"np\",\n",
    "    )\n",
    "    ret[\"labels\"] = ret[\"input_ids\"].copy()\n",
    "    return dict(ret)\n",
    "\n",
    "# Preprocess train dataset\n",
    "processed_ds = ray_ds.map_batches(fill_prompt, batch_format=\"pandas\").map_batches(tokenize, batch_format=\"pandas\")"
   ]
  },
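  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the prompt format concrete, the sketch below applies the same template to a single record in plain Python. `build_prompt` and `sample` are hypothetical names used only for illustration; the notebook itself does this per batch in `fill_prompt`.\n",
    "\n",
    "```python\n",
    "# Same template string as in the preprocessing cell above.\n",
    "PROMPT_TEMPLATE = \"Intent: {intent}\\nOne-line code snippet: {snippet}\"\n",
    "\n",
    "\n",
    "def build_prompt(record: dict) -> str:\n",
    "    \"\"\"Hypothetical per-record helper that mirrors `fill_prompt`.\"\"\"\n",
    "    # Prefer the curated rewrite of the intent when it is present.\n",
    "    intent = record.get(\"rewritten_intent\") or record[\"intent\"]\n",
    "    snippet = f\"`{record['snippet']}`\"\n",
    "    # Append the end-of-sequence marker so the model learns where to stop.\n",
    "    return PROMPT_TEMPLATE.format(intent=intent, snippet=snippet) + \"</s>\"\n",
    "\n",
    "\n",
    "sample = {\n",
    "    \"intent\": \"convert a list of integers into a single integer\",\n",
    "    \"rewritten_intent\": None,\n",
    "    \"snippet\": \"r = int(''.join(map(str, x)))\",\n",
    "}\n",
    "prompt = build_prompt(sample)\n",
    "print(prompt)\n",
    "```\n",
    "\n",
    "Each training example is therefore a two-line string: the intent, then the snippet wrapped in backticks and followed by `</s>`."
   ]
  },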
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "# To accelerate release tests\n",
    "processed_ds = processed_ds.limit(16 * 8 * 1)  # each worker has 1 batch"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define a Lightning Module\n",
    "\n",
    "Here we load the pre-trained model weights from HuggingFace Model Hub, and wrap them into `pl.LightningModule`. We adopted the efficient model initialization techniques introduced in [Lightning-transformers](https://github.com/Lightning-Universe/lightning-transformers) to avoid unnecessary full weights loading."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "import transformers\n",
    "import lightning.pytorch as pl\n",
    "from transformers import AutoTokenizer, AutoModelForCausalLM\n",
    "from deepspeed.ops.adam import DeepSpeedCPUAdam\n",
    "\n",
    "\n",
    "class ZeRO3Config:\n",
    "    def __init__(self, pl_module):\n",
    "        self.config = pl_module.trainer.strategy.config\n",
    "\n",
    "    def __call__(self, *args, **kwargs):\n",
    "        return self\n",
    "\n",
    "    def is_zero3(self) -> bool:\n",
    "        return True\n",
    "\n",
    "\n",
    "def enable_transformers_pretrained_deepspeed_sharding(\n",
    "    pl_module: \"pl.LightningModule\",\n",
    ") -> None:\n",
    "    transformers.deepspeed._hf_deepspeed_config_weak_ref = ZeRO3Config(pl_module)\n",
    "\n",
    "\n",
    "class Vicuna13BModel(pl.LightningModule):\n",
    "    def __init__(self):\n",
    "        super().__init__()\n",
    "        # Enable tf32 for better performance\n",
    "        torch.backends.cuda.matmul.allow_tf32 = True\n",
    "\n",
    "    def setup(self, stage) -> None:\n",
    "        # Defer model initialization to inject deepspeed configs to HF.\n",
    "        # During initialization, HF transformers can immediately partition\n",
    "        # the model across all GPUs to avoid the time and memory overhead\n",
    "        # of copying it to the CPU or to each GPU first.\n",
    "        enable_transformers_pretrained_deepspeed_sharding(self)\n",
    "        self.model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)\n",
    "        if self.global_rank == 0:\n",
    "            print(\"DeepSpeed Configs: \", self.trainer.strategy.config)\n",
    "            print(\"Model Architecture: \", self.model)\n",
    "\n",
    "    def forward(self, batch):\n",
    "        outputs = self.model(\n",
    "            batch[\"input_ids\"],\n",
    "            labels=batch[\"labels\"],\n",
    "            attention_mask=batch[\"attention_mask\"],\n",
    "        )\n",
    "        return outputs.loss\n",
    "\n",
    "    def training_step(self, batch, batch_idx):\n",
    "        loss = self.forward(batch)\n",
    "        self.log(\"train_loss\", loss, prog_bar=True, on_step=True, sync_dist=True)\n",
    "        return loss\n",
    "\n",
    "    def configure_optimizers(self):\n",
    "        return DeepSpeedCPUAdam(self.parameters(), lr=2e-5, weight_decay=0.01)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## DeepSpeed Configurations\n",
    "\n",
    "Before training, let's calculate the memory usage of fine-tuning a `vicuna-13b` model. Assume 16-bit mixed-precision training (this example uses BF16) and an Adam optimizer with FP32 states.\n",
    "\n",
    "- Model parameters: 13 (billion parameters) * 2 bytes (16-bit) ≈ 26 GB\n",
    "- Optimizer states: 13 (billion parameters) * 2 (Adam moments per parameter) * 4 bytes (FP32) ≈ 104 GB\n",
    "\n",
    "The model parameters alone require 26 GB, which doesn't fit on a single 24 GB A10G GPU, let alone the activations and optimizer states. Here, we use ZeRO stage 3 to partition the model, gradients, and optimizer states across 16 GPUs, one per node. Additionally, we offload the optimizer to CPU to reduce GPU memory usage, which allows larger batch sizes and higher throughput. We also disable parameter offloading and activation checkpointing to improve the training speed.\n",
    "\n",
    "Regarding other knobs such as `reduce_bucket_size`, `stage3_prefetch_bucket_size` and `stage3_param_persistence_threshold`, we kept them as the [default values in HuggingFace](https://huggingface.co/docs/transformers/main_classes/deepspeed#zero3-config). Feel free to further adjust them to speed up the training process."
   ]
  },
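  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a back-of-the-envelope check, the sketch below reproduces this kind of estimate in plain Python. It counts 2 bytes per 16-bit weight and 8 bytes per parameter for the two FP32 Adam moments, then divides by the number of GPUs to approximate the ZeRO-3 shard sizes. The function name and the 16-bit gradient assumption are illustrative, and the estimate ignores activations, FP32 master weights, and communication buffers.\n",
    "\n",
    "```python\n",
    "NUM_PARAMS = 13e9  # approximate parameter count of vicuna-13b\n",
    "NUM_GPUS = 16      # one A10G (24 GB) per worker in this cluster\n",
    "GB = 1e9           # decimal gigabytes\n",
    "\n",
    "\n",
    "def zero3_memory_gb(num_params: float, num_gpus: int) -> dict:\n",
    "    \"\"\"Rough memory estimate in GB (hypothetical helper, not a DeepSpeed API).\"\"\"\n",
    "    params_16bit = num_params * 2 / GB     # 2 bytes per 16-bit weight\n",
    "    grads_16bit = num_params * 2 / GB      # assumption: 16-bit gradients\n",
    "    adam_states = num_params * 2 * 4 / GB  # two FP32 moments per parameter\n",
    "    return {\n",
    "        \"params_total\": params_16bit,\n",
    "        \"adam_states_total\": adam_states,\n",
    "        # ZeRO-3 shards parameters and gradients evenly across GPUs.\n",
    "        \"gpu_per_worker\": (params_16bit + grads_16bit) / num_gpus,\n",
    "        # With optimizer offloading, each worker keeps its Adam shard in CPU RAM.\n",
    "        \"cpu_adam_per_worker\": adam_states / num_gpus,\n",
    "    }\n",
    "\n",
    "\n",
    "estimate = zero3_memory_gb(NUM_PARAMS, NUM_GPUS)\n",
    "for name, value in estimate.items():\n",
    "    print(f\"{name}: {value:.2f} GB\")\n",
    "```\n",
    "\n",
    "Under these assumptions, each GPU holds only a few GB of sharded weights and gradients, which leaves most of the 24 GB for activations and temporarily gathered parameters."
   ]
  },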
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
   ],
   "source": [
    "from transformers import AutoConfig\n",
    "\n",
    "config = AutoConfig.from_pretrained(MODEL_NAME)\n",
    "HIDDEN_SIZE = config.hidden_size\n",
    "\n",
    "deepspeed_configs = {\n",
    "    \"zero_allow_untested_optimizer\": True,\n",
    "    \"bf16\": {\"enabled\": True},\n",
    "    \"zero_optimization\": {\n",
    "        \"stage\": 3,\n",
    "        \"offload_optimizer\": {\"device\": \"cpu\", \"pin_memory\": True},\n",
    "        \"overlap_comm\": True,\n",
    "        \"contiguous_gradients\": True,\n",
    "        \"reduce_bucket_size\": HIDDEN_SIZE * HIDDEN_SIZE,\n",
    "        \"stage3_prefetch_bucket_size\": 0.9 * HIDDEN_SIZE * HIDDEN_SIZE,\n",
    "        \"stage3_param_persistence_threshold\": 10 * HIDDEN_SIZE,\n",
    "    },\n",
    "}"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define your training function\n",
    "\n",
    "Finally, define the training function that Ray Train launches on each worker. The training function is largely the same as plain PyTorch Lightning training code, with a few additional Ray Train utilities:\n",
    "\n",
    "- {class}`~ray.train.lightning.RayDeepSpeedStrategy`: Same argument list as Lightning DeepSpeedStrategy but integrated with Ray Train.\n",
    "- {class}`~ray.train.lightning.RayLightningEnvironment`: Lightning environments for Ray cluster.\n",
    "- {class}`~ray.train.lightning.RayTrainReportCallback`: At the end of each epoch, reports the checkpoint from each worker to Ray Train (distributed checkpointing).\n",
    "- {meth}`~ray.train.lightning.prepare_trainer`: Validates your Lightning Trainer configurations.\n",
    "\n",
    "For Ray Data ingestion, we fetched the preprocessed and sharded dataset with {meth}`~ray.train.get_dataset_shard`, and created a dataloader with {meth}`~ray.data.Dataset.iter_torch_batches`. It returns a custom iterator that replaces the Torch DataLoader.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "import ray.train\n",
    "from ray.train import CheckpointConfig, RunConfig, ScalingConfig\n",
    "from ray.train.torch import TorchTrainer\n",
    "from ray.train.lightning import (\n",
    "    prepare_trainer,\n",
    "    RayDeepSpeedStrategy, \n",
    "    RayLightningEnvironment, \n",
    "    RayTrainReportCallback\n",
    ")\n",
    "\n",
    "\n",
    "def train_func(config):\n",
    "    \"\"\"Training function for each worker.\"\"\"\n",
    "\n",
    "    # Unpack the `train_loop_config`\n",
    "    max_epochs = config[\"max_epochs\"]\n",
    "    batch_size = config[\"batch_size\"]\n",
    "    accumulate_grad_batches = config[\"accumulate_grad_batches\"]\n",
    "\n",
    "    model = Vicuna13BModel()\n",
    "    \n",
    "    # Prepare Ray Data Ingestion\n",
    "    train_ds = ray.train.get_dataset_shard(\"train\")\n",
    "    train_dataloader = train_ds.iter_torch_batches(batch_size=batch_size)\n",
    "    \n",
    "    pl_trainer = pl.Trainer(\n",
    "        devices=\"auto\",\n",
    "        accelerator=\"auto\",\n",
    "        default_root_dir=\"/mnt/local_storage\",\n",
    "        strategy=RayDeepSpeedStrategy(config=deepspeed_configs),\n",
    "        plugins=[RayLightningEnvironment()],\n",
    "        callbacks=[RayTrainReportCallback()],\n",
    "        enable_checkpointing=False, # RayTrainReportCallback will save the checkpoints\n",
    "        max_epochs=max_epochs,\n",
    "        precision=\"bf16-mixed\",\n",
    "        accumulate_grad_batches=accumulate_grad_batches,\n",
    "    )\n",
    "    pl_trainer = prepare_trainer(pl_trainer)\n",
    "\n",
    "    pl_trainer.fit(model, train_dataloaders=train_dataloader)\n",
    "    \n",
    "\n",
    "trainer = TorchTrainer(\n",
    "    train_loop_per_worker=train_func,\n",
    "    train_loop_config={\n",
    "        \"max_epochs\": 1,\n",
    "        \"batch_size\": BATCH_SIZE_PER_WORKER,\n",
    "        \"accumulate_grad_batches\": 2,\n",
    "    },\n",
    "    run_config=RunConfig(\n",
    "        name=\"vicuna-13b-finetune\",\n",
    "        storage_path=\"/mnt/cluster_storage\",\n",
    "        checkpoint_config=CheckpointConfig(num_to_keep=1),\n",
    "    ),\n",
    "    scaling_config=ScalingConfig(\n",
    "        num_workers=NUM_WORKERS,\n",
    "        use_gpu=True,\n",
    "        resources_per_worker={\"CPU\": 15, \"GPU\": 1},\n",
    "    ),\n",
    "    datasets={\"train\": processed_ds},\n",
    ")"
   ]
  },
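  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The scaling and batching settings above determine the effective global batch size. The sketch below works through the arithmetic for the full 7,379-row training set. It assumes Ray Data splits the rows roughly evenly across workers, so treat the step count as an approximation.\n",
    "\n",
    "```python\n",
    "import math\n",
    "\n",
    "NUM_WORKERS = 16\n",
    "BATCH_SIZE_PER_WORKER = 8\n",
    "ACCUMULATE_GRAD_BATCHES = 2\n",
    "NUM_TRAIN_ROWS = 7379  # 5000 mined + 2379 curated examples\n",
    "\n",
    "# Sequences consumed per optimizer step across the whole cluster.\n",
    "global_batch_size = NUM_WORKERS * BATCH_SIZE_PER_WORKER * ACCUMULATE_GRAD_BATCHES\n",
    "\n",
    "# Assumption: each worker receives an approximately equal share of the rows.\n",
    "rows_per_worker = math.ceil(NUM_TRAIN_ROWS / NUM_WORKERS)\n",
    "batches_per_worker = math.ceil(rows_per_worker / BATCH_SIZE_PER_WORKER)\n",
    "optimizer_steps = math.ceil(batches_per_worker / ACCUMULATE_GRAD_BATCHES)\n",
    "\n",
    "print(f\"global batch size: {global_batch_size}\")\n",
    "print(f\"approx. optimizer steps per epoch: {optimizer_steps}\")\n",
    "```\n",
    "\n",
    "With these settings, one epoch over the full dataset takes on the order of 30 optimizer steps, each covering 256 sequences."
   ]
  },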
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Model Fine-tuning\n",
    "\n",
    "Once everything is configured in TorchTrainer, training becomes easy. Simply call `trainer.fit()`, and your workload will be scaled to the Ray cluster, initiating ZeRO-3 parallel training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[36m(TrainController pid=17559)\u001b[0m [State Transition] INITIALIZING -> SCHEDULING.\n",
      "\u001b[36m(TrainController pid=17559)\u001b[0m Attempting to start training worker group of size 16 with the following resources: [{'CPU': 15, 'GPU': 1}] * 16\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m Setting up process group for: env:// [rank=0, world_size=16]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m [2025-10-15 15:51:07,627] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[36m(RayTrainWorker pid=4048, ip=10.0.130.188)\u001b[0m 2025-10-15 15:51:09.458702: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
      "\u001b[36m(RayTrainWorker pid=4048, ip=10.0.130.188)\u001b[0m 2025-10-15 15:51:09.458741: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
      "\u001b[36m(RayTrainWorker pid=4048, ip=10.0.130.188)\u001b[0m 2025-10-15 15:51:09.460080: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
      "\u001b[36m(RayTrainWorker pid=4048, ip=10.0.130.188)\u001b[0m 2025-10-15 15:51:09.467398: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n",
      "\u001b[36m(RayTrainWorker pid=4048, ip=10.0.130.188)\u001b[0m To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n",
      "\u001b[36m(RayTrainWorker pid=4048, ip=10.0.130.188)\u001b[0m 2025-10-15 15:51:10.359839: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n",
      "\u001b[36m(RayTrainWorker pid=4048, ip=10.0.130.188)\u001b[0m INFO: initializing deepspeed distributed: GLOBAL_RANK: 5, MEMBER: 6/16\n",
      "\u001b[36m(RayTrainWorker pid=4048, ip=10.0.130.188)\u001b[0m initializing deepspeed distributed: GLOBAL_RANK: 5, MEMBER: 6/16\n",
      "\u001b[36m(RayTrainWorker pid=4048, ip=10.0.130.188)\u001b[0m WARNING: Missing logger folder: /tmp/ray/session_2025-10-15_15-40-01_399241_4076/artifacts/vicuna-13b-finetune/lightning_logs\n",
      "\u001b[36m(RayTrainWorker pid=4048, ip=10.0.130.188)\u001b[0m Missing logger folder: /tmp/ray/session_2025-10-15_15-40-01_399241_4076/artifacts/vicuna-13b-finetune/lightning_logs\n",
      "\u001b[36m(TrainController pid=17559)\u001b[0m Started training worker group of size 16: \n",
      "\u001b[36m(TrainController pid=17559)\u001b[0m - (ip=10.0.171.127, pid=17770) world_rank=0, local_rank=0, node_rank=0\n",
      "\u001b[36m(TrainController pid=17559)\u001b[0m - (ip=10.0.155.201, pid=4224) world_rank=1, local_rank=0, node_rank=1\n",
      "\u001b[36m(TrainController pid=17559)\u001b[0m - (ip=10.0.130.65, pid=4187) world_rank=2, local_rank=0, node_rank=2\n",
      "\u001b[36m(TrainController pid=17559)\u001b[0m - (ip=10.0.178.75, pid=4182) world_rank=3, local_rank=0, node_rank=3\n",
      "\u001b[36m(TrainController pid=17559)\u001b[0m - (ip=10.0.167.159, pid=5417) world_rank=4, local_rank=0, node_rank=4\n",
      "\u001b[36m(TrainController pid=17559)\u001b[0m - (ip=10.0.130.188, pid=4048) world_rank=5, local_rank=0, node_rank=5\n",
      "\u001b[36m(TrainController pid=17559)\u001b[0m - (ip=10.0.134.47, pid=4191) world_rank=6, local_rank=0, node_rank=6\n",
      "\u001b[36m(TrainController pid=17559)\u001b[0m - (ip=10.0.173.126, pid=4079) world_rank=7, local_rank=0, node_rank=7\n",
      "\u001b[36m(TrainController pid=17559)\u001b[0m - (ip=10.0.166.0, pid=4053) world_rank=8, local_rank=0, node_rank=8\n",
      "\u001b[36m(TrainController pid=17559)\u001b[0m - (ip=10.0.183.211, pid=5448) world_rank=9, local_rank=0, node_rank=9\n",
      "\u001b[36m(TrainController pid=17559)\u001b[0m - (ip=10.0.138.121, pid=4069) world_rank=10, local_rank=0, node_rank=10\n",
      "\u001b[36m(TrainController pid=17559)\u001b[0m - (ip=10.0.129.201, pid=5418) world_rank=11, local_rank=0, node_rank=11\n",
      "\u001b[36m(TrainController pid=17559)\u001b[0m - (ip=10.0.184.103, pid=4038) world_rank=12, local_rank=0, node_rank=12\n",
      "\u001b[36m(TrainController pid=17559)\u001b[0m - (ip=10.0.164.99, pid=4075) world_rank=13, local_rank=0, node_rank=13\n",
      "\u001b[36m(TrainController pid=17559)\u001b[0m - (ip=10.0.136.125, pid=4040) world_rank=14, local_rank=0, node_rank=14\n",
      "\u001b[36m(TrainController pid=17559)\u001b[0m - (ip=10.0.161.115, pid=4057) world_rank=15, local_rank=0, node_rank=15\n",
      "\u001b[36m(TrainController pid=17559)\u001b[0m [State Transition] SCHEDULING -> RUNNING.\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m INFO: GPU available: True (cuda), used: True\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m GPU available: True (cuda), used: True\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m INFO: TPU available: False, using: 0 TPU cores\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m TPU available: False, using: 0 TPU cores\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m INFO: IPU available: False, using: 0 IPUs\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m IPU available: False, using: 0 IPUs\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m INFO: HPU available: False, using: 0 HPUs\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m HPU available: False, using: 0 HPUs\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m /home/ray/anaconda3/lib/python3.10/site-packages/huggingface_hub/file_download.py:795: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m   warnings.warn(\n",
      "Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]\n",
      "\u001b[36m(RayTrainWorker pid=5418, ip=10.0.129.201)\u001b[0m 2025-10-15 15:51:09.590755: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\u001b[32m [repeated 15x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=5418, ip=10.0.129.201)\u001b[0m 2025-10-15 15:51:09.590792: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\u001b[32m [repeated 15x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=5418, ip=10.0.129.201)\u001b[0m 2025-10-15 15:51:09.592129: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\u001b[32m [repeated 15x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=5418, ip=10.0.129.201)\u001b[0m 2025-10-15 15:51:09.599431: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\u001b[32m [repeated 15x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=5418, ip=10.0.129.201)\u001b[0m To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\u001b[32m [repeated 15x across cluster]\u001b[0m\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[36m(autoscaler +35s)\u001b[0m Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Downloading shards:  33%|███▎      | 1/3 [00:08<00:16,  8.45s/it]\n",
      "\u001b[36m(RayTrainWorker pid=5418, ip=10.0.129.201)\u001b[0m 2025-10-15 15:51:10.532071: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\u001b[32m [repeated 15x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=5418, ip=10.0.129.201)\u001b[0m INFO: initializing deepspeed distributed: GLOBAL_RANK: 11, MEMBER: 12/16\u001b[32m [repeated 15x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=5418, ip=10.0.129.201)\u001b[0m initializing deepspeed distributed: GLOBAL_RANK: 11, MEMBER: 12/16\u001b[32m [repeated 15x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=5418, ip=10.0.129.201)\u001b[0m WARNING: Missing logger folder: /tmp/ray/session_2025-10-15_15-40-01_399241_4076/artifacts/vicuna-13b-finetune/lightning_logs\u001b[32m [repeated 15x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=5418, ip=10.0.129.201)\u001b[0m Missing logger folder: /tmp/ray/session_2025-10-15_15-40-01_399241_4076/artifacts/vicuna-13b-finetune/lightning_logs\u001b[32m [repeated 15x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=5418, ip=10.0.129.201)\u001b[0m /home/ray/anaconda3/lib/python3.10/site-packages/huggingface_hub/file_download.py:795: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\u001b[32m [repeated 15x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=5418, ip=10.0.129.201)\u001b[0m   warnings.warn(\u001b[32m [repeated 15x across cluster]\u001b[0m\n",
      "Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]\u001b[32m [repeated 15x across cluster]\u001b[0m\n",
      "Downloading shards:  33%|███▎      | 1/3 [00:14<00:29, 14.64s/it]\u001b[32m [repeated 10x across cluster]\u001b[0m\n",
      "Downloading shards:  33%|███▎      | 1/3 [00:24<00:48, 24.42s/it]\u001b[32m [repeated 5x across cluster]\u001b[0m\n",
      "Downloading shards:  67%|██████▋   | 2/3 [00:32<00:17, 17.90s/it]\n",
      "Downloading shards:  67%|██████▋   | 2/3 [00:36<00:19, 19.52s/it]\n",
      "Downloading shards:  67%|██████▋   | 2/3 [00:47<00:24, 24.79s/it]\u001b[32m [repeated 9x across cluster]\u001b[0m\n",
      "Downloading shards: 100%|██████████| 3/3 [00:49<00:00, 16.55s/it]\n",
      "Downloading shards:  67%|██████▋   | 2/3 [00:51<00:27, 27.69s/it]\n",
      "Downloading shards: 100%|██████████| 3/3 [00:54<00:00, 18.33s/it]\u001b[32m [repeated 8x across cluster]\u001b[0m\n",
      "Downloading shards:  67%|██████▋   | 2/3 [01:00<00:33, 33.57s/it]\n",
      "Downloading shards: 100%|██████████| 3/3 [00:55<00:00, 18.63s/it]\n",
      "Downloading shards: 100%|██████████| 3/3 [01:03<00:00, 21.30s/it]\n",
      "Downloading shards:  67%|██████▋   | 2/3 [01:05<00:35, 35.56s/it]\u001b[32m [repeated 2x across cluster]\u001b[0m\n",
      "Downloading shards: 100%|██████████| 3/3 [01:09<00:00, 23.31s/it]\n",
      "Downloading shards:  67%|██████▋   | 2/3 [01:12<00:38, 38.09s/it]\n",
      "Downloading shards: 100%|██████████| 3/3 [01:30<00:00, 30.22s/it]\n",
      "Downloading shards: 100%|██████████| 3/3 [01:36<00:00, 32.00s/it]\u001b[32m [repeated 2x across cluster]\u001b[0m\n",
      "Downloading shards: 100%|██████████| 3/3 [01:41<00:00, 33.94s/it]\n",
      "Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]\n",
      "Loading checkpoint shards:  33%|███▎      | 1/3 [00:17<00:35, 17.89s/it]\n",
      "Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]\u001b[32m [repeated 15x across cluster]\u001b[0m\n",
      "Loading checkpoint shards:  33%|███▎      | 1/3 [00:23<00:47, 23.70s/it]\u001b[32m [repeated 15x across cluster]\u001b[0m\n",
      "Loading checkpoint shards:  67%|██████▋   | 2/3 [00:39<00:19, 19.88s/it]\n",
      "Loading checkpoint shards:  67%|██████▋   | 2/3 [00:39<00:19, 19.89s/it]\n",
      "Loading checkpoint shards:  67%|██████▋   | 2/3 [00:44<00:22, 22.24s/it]\u001b[32m [repeated 14x across cluster]\u001b[0m\n",
      "Loading checkpoint shards: 100%|██████████| 3/3 [00:52<00:00, 17.38s/it]\n",
      "Loading checkpoint shards: 100%|██████████| 3/3 [00:57<00:00, 19.26s/it]\u001b[32m [repeated 15x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=4040, ip=10.0.136.125)\u001b[0m INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]\n",
      "\u001b[36m(RayTrainWorker pid=4040, ip=10.0.136.125)\u001b[0m LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m DeepSpeed Configs:  {'zero_allow_untested_optimizer': True, 'bf16': {'enabled': True}, 'zero_optimization': {'stage': 3, 'offload_optimizer': {'device': 'cpu', 'pin_memory': True}, 'overlap_comm': True, 'contiguous_gradients': True, 'reduce_bucket_size': 26214400, 'stage3_prefetch_bucket_size': 23592960.0, 'stage3_param_persistence_threshold': 51200}, 'gradient_accumulation_steps': 2, 'train_micro_batch_size_per_gpu': 1, 'gradient_clipping': 0.0}\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m Model Archetecture:  LlamaForCausalLM(\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m   (model): LlamaModel(\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m     (embed_tokens): Embedding(32000, 5120, padding_idx=0)\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m     (layers): ModuleList(\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m       (0-39): 40 x LlamaDecoderLayer(\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m         (self_attn): LlamaAttention(\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m           (q_proj): Linear(in_features=5120, out_features=5120, bias=False)\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m           (k_proj): Linear(in_features=5120, out_features=5120, bias=False)\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m           (v_proj): Linear(in_features=5120, out_features=5120, bias=False)\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m           (o_proj): Linear(in_features=5120, out_features=5120, bias=False)\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m           (rotary_emb): LlamaRotaryEmbedding()\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m         )\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m         (mlp): LlamaMLP(\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m           (gate_proj): Linear(in_features=5120, out_features=13824, bias=False)\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m           (down_proj): Linear(in_features=13824, out_features=5120, bias=False)\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m           (up_proj): Linear(in_features=5120, out_features=13824, bias=False)\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m           (act_fn): SiLUActivation()\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m         )\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m         (input_layernorm): LlamaRMSNorm()\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m         (post_attention_layernorm): LlamaRMSNorm()\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m       )\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m     )\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m     (norm): LlamaRMSNorm()\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m   )\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m   (lm_head): Linear(in_features=5120, out_features=32000, bias=False)\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m )\n",
      "\u001b[36m(RayTrainWorker pid=4038, ip=10.0.184.103)\u001b[0m Using /home/ray/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...\n",
      "\u001b[36m(RayTrainWorker pid=4038, ip=10.0.184.103)\u001b[0m Creating extension directory /home/ray/.cache/torch_extensions/py310_cu121/cpu_adam...\n",
      "\u001b[36m(RayTrainWorker pid=4224, ip=10.0.155.201)\u001b[0m INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]\u001b[32m [repeated 15x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=4224, ip=10.0.155.201)\u001b[0m LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]\u001b[32m [repeated 15x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m Detected CUDA files, patching ldflags\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m Emitting ninja build file /home/ray/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m /home/ray/anaconda3/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. \n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m   warnings.warn(\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m Building extension module cpu_adam...\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[36m(RayTrainWorker pid=4187, ip=10.0.130.65)\u001b[0m [1/4] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output custom_cuda_kernel.cuda.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\\\"_gcc\\\" -DPYBIND11_STDLIB=\\\"_libstdcpp\\\" -DPYBIND11_BUILD_ABI=\\\"_cxxabi1011\\\" -I/home/ray/anaconda3/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /home/ray/anaconda3/lib/python3.10/site-packages/torch/include -isystem /home/ray/anaconda3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/ray/anaconda3/lib/python3.10/site-packages/torch/include/TH -isystem /home/ray/anaconda3/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/ray/anaconda3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -c /home/ray/anaconda3/lib/python3.10/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o \n",
      "\u001b[36m(RayTrainWorker pid=5418, ip=10.0.129.201)\u001b[0m [2025-10-15 15:51:07,681] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)\u001b[32m [repeated 15x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m [2/4] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\\\"_gcc\\\" -DPYBIND11_STDLIB=\\\"_libstdcpp\\\" -DPYBIND11_BUILD_ABI=\\\"_cxxabi1011\\\" -I/home/ray/anaconda3/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /home/ray/anaconda3/lib/python3.10/site-packages/torch/include -isystem /home/ray/anaconda3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/ray/anaconda3/lib/python3.10/site-packages/torch/include/TH -isystem /home/ray/anaconda3/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/ray/anaconda3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /home/ray/anaconda3/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o \n",
      "\u001b[36m(RayTrainWorker pid=5418, ip=10.0.129.201)\u001b[0m [1/4] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output custom_cuda_kernel.cuda.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\\\"_gcc\\\" -DPYBIND11_STDLIB=\\\"_libstdcpp\\\" -DPYBIND11_BUILD_ABI=\\\"_cxxabi1011\\\" -I/home/ray/anaconda3/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /home/ray/anaconda3/lib/python3.10/site-packages/torch/include -isystem /home/ray/anaconda3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/ray/anaconda3/lib/python3.10/site-packages/torch/include/TH -isystem /home/ray/anaconda3/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/ray/anaconda3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -c /home/ray/anaconda3/lib/python3.10/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o \u001b[32m [repeated 15x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=4048, ip=10.0.130.188)\u001b[0m [2/4] c++ -MMD -MF cpu_adam_impl.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\\\"_gcc\\\" -DPYBIND11_STDLIB=\\\"_libstdcpp\\\" -DPYBIND11_BUILD_ABI=\\\"_cxxabi1011\\\" -I/home/ray/anaconda3/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /home/ray/anaconda3/lib/python3.10/site-packages/torch/include -isystem /home/ray/anaconda3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/ray/anaconda3/lib/python3.10/site-packages/torch/include/TH -isystem /home/ray/anaconda3/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/ray/anaconda3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /home/ray/anaconda3/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam_impl.cpp -o cpu_adam_impl.o \n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[36m(RayTrainWorker pid=4048, ip=10.0.130.188)\u001b[0m Loading extension module cpu_adam...\n",
      "\u001b[36m(RayTrainWorker pid=4048, ip=10.0.130.188)\u001b[0m Time to load cpu_adam op: 28.735835075378418 seconds\n",
      "\u001b[36m(RayTrainWorker pid=5418, ip=10.0.129.201)\u001b[0m Using /home/ray/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...\u001b[32m [repeated 15x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=5418, ip=10.0.129.201)\u001b[0m Creating extension directory /home/ray/.cache/torch_extensions/py310_cu121/cpu_adam...\u001b[32m [repeated 15x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=5418, ip=10.0.129.201)\u001b[0m Detected CUDA files, patching ldflags\u001b[32m [repeated 15x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=5418, ip=10.0.129.201)\u001b[0m Emitting ninja build file /home/ray/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...\u001b[32m [repeated 15x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=5418, ip=10.0.129.201)\u001b[0m /home/ray/anaconda3/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. \u001b[32m [repeated 15x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=5418, ip=10.0.129.201)\u001b[0m If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].\u001b[32m [repeated 15x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=5418, ip=10.0.129.201)\u001b[0m   warnings.warn(\u001b[32m [repeated 15x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=5418, ip=10.0.129.201)\u001b[0m Building extension module cpu_adam...\u001b[32m [repeated 15x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=5418, ip=10.0.129.201)\u001b[0m Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)\u001b[32m [repeated 15x across cluster]\u001b[0m\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[36m(RayTrainWorker pid=4048, ip=10.0.130.188)\u001b[0m [4/4] c++ cpu_adam.o cpu_adam_impl.o custom_cuda_kernel.cuda.o -shared -lcurand -L/home/ray/anaconda3/lib/python3.10/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m Parameter Offload: Total persistent parameters: 414720 in 81 params\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m INFO: \n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m   | Name  | Type             | Params | Params per Device\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m ---------------------------------------------------------------\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m 0 | model | LlamaForCausalLM | 13.0 B | 813 M            \n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m ---------------------------------------------------------------\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m 13.0 B    Trainable params\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m 0         Non-trainable params\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m 13.0 B    Total params\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m 52,063.457Total estimated model params size (MB)\n",
      "\u001b[36m(RayTrainWorker pid=5418, ip=10.0.129.201)\u001b[0m Loading extension module cpu_adam...\u001b[32m [repeated 15x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=5418, ip=10.0.129.201)\u001b[0m Time to load cpu_adam op: 31.185880184173584 seconds\u001b[32m [repeated 15x across cluster]\u001b[0m\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 0: : 0it [00:00, ?it/s]\n",
      "\u001b[36m(RayTrainWorker pid=5418, ip=10.0.129.201)\u001b[0m [2/4] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\\\"_gcc\\\" -DPYBIND11_STDLIB=\\\"_libstdcpp\\\" -DPYBIND11_BUILD_ABI=\\\"_cxxabi1011\\\" -I/home/ray/anaconda3/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /home/ray/anaconda3/lib/python3.10/site-packages/torch/include -isystem /home/ray/anaconda3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/ray/anaconda3/lib/python3.10/site-packages/torch/include/TH -isystem /home/ray/anaconda3/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/ray/anaconda3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /home/ray/anaconda3/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o \u001b[32m [repeated 15x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=5418, ip=10.0.129.201)\u001b[0m [3/4] c++ -MMD -MF cpu_adam_impl.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\\\"_gcc\\\" -DPYBIND11_STDLIB=\\\"_libstdcpp\\\" -DPYBIND11_BUILD_ABI=\\\"_cxxabi1011\\\" -I/home/ray/anaconda3/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /home/ray/anaconda3/lib/python3.10/site-packages/torch/include -isystem /home/ray/anaconda3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/ray/anaconda3/lib/python3.10/site-packages/torch/include/TH -isystem /home/ray/anaconda3/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/ray/anaconda3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /home/ray/anaconda3/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam_impl.cpp -o cpu_adam_impl.o \u001b[32m [repeated 15x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=5418, ip=10.0.129.201)\u001b[0m [4/4] c++ cpu_adam.o cpu_adam_impl.o custom_cuda_kernel.cuda.o -shared -lcurand -L/home/ray/anaconda3/lib/python3.10/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so\u001b[32m [repeated 15x across cluster]\u001b[0m\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "2a3cf444199946fa9760cd89e1e8d198",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "(pid=17972) Running 0: 0.00 row [00:00, ? row/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "e0810cbe81cb4f418cd6a45728187e66",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "(pid=17972) - MapBatches(fill_prompt)->MapBatches(tokenize) 1: 0.00 row [00:00, ? row/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "27c3f884506944d1b3825a1104412c6c",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "(pid=17972) - limit=2048 2: 0.00 row [00:00, ? row/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "029aff619c7644bcb70086a01f3c15e5",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "(pid=17972) - split(16, equal=True) 3: 0.00 row [00:00, ? row/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[36m(SplitCoordinator pid=17972)\u001b[0m Registered dataset logger for dataset train_16_0\n",
      "\u001b[36m(SplitCoordinator pid=17972)\u001b[0m Starting execution of Dataset train_16_0. Full logs are in /tmp/ray/session_2025-10-15_15-40-01_399241_4076/logs/ray-data\n",
      "\u001b[36m(SplitCoordinator pid=17972)\u001b[0m Execution plan of Dataset train_16_0: InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(fill_prompt)->MapBatches(tokenize)] -> LimitOperator[limit=2048] -> OutputSplitter[split(16, equal=True)]\n",
      "\u001b[36m(SplitCoordinator pid=17972)\u001b[0m ⚠️  Ray's object store is configured to use only 28.0% of available memory (341.1GiB out of 1216.0GiB total). For optimal Ray Data performance, we recommend setting the object store to at least 50% of available memory. You can do this by setting the 'object_store_memory' parameter when calling ray.init() or by setting the RAY_DEFAULT_OBJECT_STORE_MEMORY_PROPORTION environment variable.\n",
      "\u001b[36m(MapBatches(fill_prompt)->MapBatches(tokenize) pid=4600, ip=10.0.166.0)\u001b[0m /home/ray/anaconda3/lib/python3.10/site-packages/huggingface_hub/file_download.py:795: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
      "\u001b[36m(MapBatches(fill_prompt)->MapBatches(tokenize) pid=4600, ip=10.0.166.0)\u001b[0m   warnings.warn(\n",
      "\u001b[36m(MapBatches(fill_prompt)->MapBatches(tokenize) pid=4600, ip=10.0.166.0)\u001b[0m normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.\n",
      "\u001b[36m(SplitCoordinator pid=17972)\u001b[0m ✔️  Dataset train_16_0 execution finished in 5.69 seconds\n",
      "\u001b[36m(RayTrainWorker pid=4048, ip=10.0.130.188)\u001b[0m /home/ray/anaconda3/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::broadcast_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)\n",
      "\u001b[36m(RayTrainWorker pid=4048, ip=10.0.130.188)\u001b[0m   return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 0: : 1it [00:52, 52.00s/it, v_num=0, train_loss=9.190]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[36m(RayTrainWorker pid=5418, ip=10.0.129.201)\u001b[0m /home/ray/anaconda3/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::broadcast_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)\u001b[32m [repeated 16x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=5418, ip=10.0.129.201)\u001b[0m   return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass\u001b[32m [repeated 16x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=4224, ip=10.0.155.201)\u001b[0m /home/ray/anaconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py:1330: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:78.)\n",
      "\u001b[36m(RayTrainWorker pid=4224, ip=10.0.155.201)\u001b[0m   total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 0: : 2it [01:28, 44.47s/it, v_num=0, train_loss=9.250]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[36m(RayTrainWorker pid=4075, ip=10.0.164.99)\u001b[0m /home/ray/anaconda3/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::broadcast_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)\u001b[32m [repeated 16x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=4075, ip=10.0.164.99)\u001b[0m   return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass\u001b[32m [repeated 16x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m /home/ray/anaconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py:1330: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:78.)\u001b[32m [repeated 15x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m   total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])\u001b[32m [repeated 15x across cluster]\u001b[0m\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 0: : 3it [01:59, 39.68s/it, v_num=0, train_loss=1.160]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 0: : 4it [02:34, 38.54s/it, v_num=0, train_loss=1.120]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 0: : 5it [03:05, 37.12s/it, v_num=0, train_loss=0.957]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 0: : 6it [03:40, 36.73s/it, v_num=0, train_loss=0.941]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[36m(RayTrainWorker pid=4224, ip=10.0.155.201)\u001b[0m /home/ray/anaconda3/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::broadcast_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)\u001b[32m [repeated 16x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=4224, ip=10.0.155.201)\u001b[0m   return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass\u001b[32m [repeated 16x across cluster]\u001b[0m\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 0: : 7it [04:10, 35.84s/it, v_num=0, train_loss=0.793]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[36m(RayTrainWorker pid=4182, ip=10.0.178.75)\u001b[0m /home/ray/anaconda3/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::broadcast_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)\u001b[32m [repeated 16x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=4182, ip=10.0.178.75)\u001b[0m   return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass\u001b[32m [repeated 16x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=4182, ip=10.0.178.75)\u001b[0m Exiting prefetcher's background thread\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 0: : 8it [04:46, 35.78s/it, v_num=0, train_loss=0.777]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[36m(RayTrainWorker pid=4040, ip=10.0.136.125)\u001b[0m /home/ray/anaconda3/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::broadcast_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)\u001b[32m [repeated 16x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=4040, ip=10.0.136.125)\u001b[0m   return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass\u001b[32m [repeated 16x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=4079, ip=10.0.173.126)\u001b[0m Exiting prefetcher's background thread\u001b[32m [repeated 15x across cluster]\u001b[0m\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 0: : 9it [05:19, 35.48s/it, v_num=0, train_loss=0.629]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[36m(RayTrainWorker pid=4224, ip=10.0.155.201)\u001b[0m /home/ray/anaconda3/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::broadcast_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)\u001b[32m [repeated 16x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=4224, ip=10.0.155.201)\u001b[0m   return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass\u001b[32m [repeated 16x across cluster]\u001b[0m\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 0: : 10it [05:57, 35.70s/it, v_num=0, train_loss=0.672]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[36m(RayTrainWorker pid=4182, ip=10.0.178.75)\u001b[0m /home/ray/anaconda3/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::broadcast_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)\u001b[32m [repeated 16x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=4182, ip=10.0.178.75)\u001b[0m   return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass\u001b[32m [repeated 16x across cluster]\u001b[0m\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 0: : 11it [06:30, 35.54s/it, v_num=0, train_loss=0.562]\n",
      "Epoch 0: : 12it [07:04, 35.41s/it, v_num=0, train_loss=0.562]\n",
      "Epoch 0: : 13it [07:36, 35.09s/it, v_num=0, train_loss=0.559]\n",
      "Epoch 0: : 14it [08:11, 35.13s/it, v_num=0, train_loss=0.582]\n",
      "Epoch 0: : 15it [08:43, 34.89s/it, v_num=0, train_loss=0.535]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[36m(RayTrainWorker pid=4224, ip=10.0.155.201)\u001b[0m /home/ray/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py:1898: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.\n",
      "\u001b[36m(RayTrainWorker pid=4224, ip=10.0.155.201)\u001b[0m   warnings.warn(\n",
      "\u001b[36m(RayTrainWorker pid=4048, ip=10.0.130.188)\u001b[0m /home/ray/anaconda3/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::broadcast_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)\u001b[32m [repeated 15x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=4048, ip=10.0.130.188)\u001b[0m   return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass\u001b[32m [repeated 15x across cluster]\u001b[0m\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 0: : 16it [09:19, 34.98s/it, v_num=0, train_loss=0.551]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[36m(RayTrainWorker pid=4048, ip=10.0.130.188)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/vicuna-13b-finetune/checkpoint_2025-10-15_16-04-29.037536)\n",
      "\u001b[36m(RayTrainWorker pid=4048, ip=10.0.130.188)\u001b[0m Reporting training result 1: TrainingReport(checkpoint=Checkpoint(filesystem=local, path=/mnt/cluster_storage/vicuna-13b-finetune/checkpoint_2025-10-15_16-04-29.037536), metrics={'train_loss': 0.55078125, 'epoch': 0, 'step': 8}, validation_spec=None)\n",
      "\u001b[36m(RayTrainWorker pid=4075, ip=10.0.164.99)\u001b[0m /home/ray/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py:1898: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.\u001b[32m [repeated 15x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=4075, ip=10.0.164.99)\u001b[0m   warnings.warn(\u001b[32m [repeated 15x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=5417, ip=10.0.167.159)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/vicuna-13b-finetune/checkpoint_2025-10-15_16-04-29.037536)\u001b[32m [repeated 2x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=5417, ip=10.0.167.159)\u001b[0m Reporting training result 1: TrainingReport(checkpoint=Checkpoint(filesystem=local, path=/mnt/cluster_storage/vicuna-13b-finetune/checkpoint_2025-10-15_16-04-29.037536), metrics={'train_loss': 0.55078125, 'epoch': 0, 'step': 8}, validation_spec=None)\u001b[32m [repeated 2x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=4191, ip=10.0.134.47)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/vicuna-13b-finetune/checkpoint_2025-10-15_16-04-29.037536)\u001b[32m [repeated 8x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=4191, ip=10.0.134.47)\u001b[0m Reporting training result 1: TrainingReport(checkpoint=Checkpoint(filesystem=local, path=/mnt/cluster_storage/vicuna-13b-finetune/checkpoint_2025-10-15_16-04-29.037536), metrics={'train_loss': 0.55078125, 'epoch': 0, 'step': 8}, validation_spec=None)\u001b[32m [repeated 8x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=4075, ip=10.0.164.99)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/vicuna-13b-finetune/checkpoint_2025-10-15_16-04-29.037536)\u001b[32m [repeated 3x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=4075, ip=10.0.164.99)\u001b[0m Reporting training result 1: TrainingReport(checkpoint=Checkpoint(filesystem=local, path=/mnt/cluster_storage/vicuna-13b-finetune/checkpoint_2025-10-15_16-04-29.037536), metrics={'train_loss': 0.55078125, 'epoch': 0, 'step': 8}, validation_spec=None)\u001b[32m [repeated 3x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=4053, ip=10.0.166.0)\u001b[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/vicuna-13b-finetune/checkpoint_2025-10-15_16-04-29.037536)\u001b[32m [repeated 2x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=4053, ip=10.0.166.0)\u001b[0m Reporting training result 1: TrainingReport(checkpoint=Checkpoint(filesystem=local, path=/mnt/cluster_storage/vicuna-13b-finetune/checkpoint_2025-10-15_16-04-29.037536), metrics={'train_loss': 0.55078125, 'epoch': 0, 'step': 8}, validation_spec=None)\u001b[32m [repeated 2x across cluster]\u001b[0m\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m INFO: `Trainer.fit` stopped: `max_epochs=1` reached.\n",
      "\u001b[36m(RayTrainWorker pid=17770)\u001b[0m `Trainer.fit` stopped: `max_epochs=1` reached.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 0: : 16it [10:25, 39.09s/it, v_num=0, train_loss=0.551]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[36m(TrainController pid=17559)\u001b[0m [State Transition] RUNNING -> FINISHED.\n"
     ]
    }
   ],
   "source": [
    "result = trainer.fit()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## LLM Inference\n",
    "\n",
    "Now, it's time to play with our fine-tuned Vicuna code generator!"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The DeepSpeed ZeRO-3 checkpoint is a directory containing k shards (k=16 in our case). Each shard k consists of two files:\n",
    "\n",
    "- `zero_pp_rank_k_mp_rank_00_model_states.pt`: contains the model parameter skeleton of shard k.\n",
    "- `bf16_zero_pp_rank_k_mp_rank_00_optim_states.pt`: contains the actual flattened model parameters and optimizer states of shard k.\n",
    "\n",
    "Next, we remove the optimizer states and consolidate the checkpoint into a single binary file using DeepSpeed utilities. Also, since we wrapped vicuna-13b within a `LightningModule`, we need to remove the prefix `_forward_module.model.` from the parameter names so that we can directly load the checkpoint into a HF vicuna model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Processing zero checkpoint '/mnt/cluster_storage/vicuna-13b-finetune/checkpoint_2025-10-15_16-04-29.037536/checkpoint.ckpt/checkpoint'\n",
      "Detected checkpoint of type zero stage 3, world_size: 16\n",
      "Parsing checkpoint created by deepspeed==0.12.3\n",
      "Reconstructed Trainable fp32 state dict with 363 params 13015864320 elements\n"
     ]
    }
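   ],
   "source": [
    "import os\n",
    "import torch\n",
    "from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint\n",
    "\n",
    "def extract_fp32_ckpt_from_zero(zero_ckpt_dir):\n",
    "    # Merge the ZeRO-3 shards into a single fp32 state dict; optimizer states are not included\n",
    "    state_dict = get_fp32_state_dict_from_zero_checkpoint(zero_ckpt_dir)\n",
    "    # Strip the wrapper prefix so that the keys match the HF vicuna model's parameter names\n",
    "    vicuna_state_dict = {\n",
    "        k.replace(\"_forward_module.model.\", \"\"): v for k, v in state_dict.items()\n",
    "    }\n",
    "    # Save the consolidated weights next to the original shards\n",
    "    torch.save(vicuna_state_dict, os.path.join(zero_ckpt_dir, \"full_model.pt\"))\n"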
   ]
  },
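  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the key renaming concrete, here is a minimal sketch on a toy state dict. The parameter names below are illustrative assumptions, not keys read from the actual checkpoint, and the values are placeholders instead of tensors. It applies the same renaming rule as `extract_fp32_ckpt_from_zero`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Toy example: hypothetical keys, with placeholder values instead of tensors\n",
    "PREFIX = \"_forward_module.model.\"\n",
    "\n",
    "toy_state_dict = {\n",
    "    \"_forward_module.model.model.embed_tokens.weight\": 0,\n",
    "    \"_forward_module.model.lm_head.weight\": 1,\n",
    "}\n",
    "\n",
    "# Same renaming rule as in extract_fp32_ckpt_from_zero\n",
    "renamed = {k.replace(PREFIX, \"\"): v for k, v in toy_state_dict.items()}\n",
    "\n",
    "# The remaining keys no longer carry the wrapper prefix\n",
    "print(list(renamed))"
   ]
  },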
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Initialize Generation Pipeline\n",
    "\n",
    "Here, we leverage the Accelerate library to efficiently load the model onto suitable devices (GPU and CPU) and create a HF text generation pipeline.\n",
    "\n",
    "- Initialize an empty model on the meta device\n",
    "- Create valid device mappings for the vicuna-13b model\n",
    "- Load and distribute model weights to target devices\n",
    "\n",
    "This ensures that only 1x model size of RAM is used for model initialization."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import shutil\n",
    "import torch\n",
    "import ray\n",
    "import lightning.pytorch as pl\n",
    "from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM, pipeline\n",
    "from accelerate import (\n",
    "    init_empty_weights,\n",
    "    infer_auto_device_map,\n",
    "    load_checkpoint_and_dispatch,\n",
    ")\n",
    "\n",
    "\n",
    "def generate_sample_outputs(model_checkpoint_path, prompts):\n",
    "    # Initialize a model on meta device\n",
    "    with init_empty_weights():\n",
    "        config = AutoConfig.from_pretrained(MODEL_NAME)\n",
    "        meta_model = AutoModelForCausalLM.from_config(config)\n",
    "    meta_model.tie_weights()\n",
    "\n",
    "    # Define the device mapping\n",
    "    device_map = infer_auto_device_map(\n",
    "        meta_model,\n",
    "        max_memory={0: \"15GB\", \"cpu\": \"60GB\"},\n",
    "        no_split_module_classes=[\"LlamaDecoderLayer\"],\n",
    "    )\n",
    "\n",
    "    # Copy the checkpoint to local storage; dirs_exist_ok lets the cell be re-run\n",
    "    local_checkpoint_path = \"/mnt/local_storage/vicuna_ckpt\"\n",
    "    shutil.copytree(model_checkpoint_path, local_checkpoint_path, dirs_exist_ok=True)\n",
    "\n",
    "    extract_fp32_ckpt_from_zero(local_checkpoint_path)\n",
    "\n",
    "    full_model_ckpt_path = os.path.join(local_checkpoint_path, \"full_model.pt\")\n",
    "\n",
    "    # Load the model parameters\n",
    "    model = load_checkpoint_and_dispatch(\n",
    "        meta_model,\n",
    "        checkpoint=full_model_ckpt_path,\n",
    "        device_map=device_map,\n",
    "    )\n",
    "\n",
    "    generator = pipeline(\n",
    "        \"text-generation\",\n",
    "        model=model,\n",
    "        device_map=device_map,\n",
    "        tokenizer=AutoTokenizer.from_pretrained(\n",
    "            MODEL_NAME, padding_side=\"left\", use_fast=False\n",
    "        ),\n",
    "    )\n",
    "\n",
    "    for sample_prompt in prompts:\n",
    "        prompt = PROMPT_TEMPLATE.format(intent=sample_prompt[\"intent\"], snippet=\"\")\n",
    "        output = generator(prompt, max_new_tokens=30, do_sample=True)\n",
    "        print(output[0][\"generated_text\"])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Case Study\n",
    "\n",
    "We take three examples from the CoNaLa test split for the demo:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "testcases = [\n",
    "    {\n",
    "        \"intent\": \"replace white spaces in colunm 'col' of dataframe `df` with '_'\",\n",
    "    },\n",
    "    {\n",
    "        \"intent\": \"search for occurrences of regex pattern '>.*<' in xml string `line`\",\n",
    "    },\n",
    "    {\n",
    "        \"intent\": \"send a signal `signal.SIGUSR1` to the current process\",\n",
    "    },\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "generate_sample_outputs(os.path.join(result.checkpoint.path, \"checkpoint.ckpt\"), testcases)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Test the Generated Code Snippets\n",
    "\n",
    "The generated code snippets look pretty reasonable. They cover pandas operations, regular expressions, and Linux process signals. Let's test them one by one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df = pd.DataFrame.from_dict({\"col\": [\"abc def ghi\", \" 12 3 456\", \"     \"]})\n",
    "print(\"Before\\n\", df)\n",
    "\n",
    "df[\"col\"] = df[\"col\"].str.replace(\" \", \"_\")\n",
    "print(\"After\\n\", df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "line = \"\"\"\n",
    "<bookstore>\n",
    "  <book category=\"fiction\">\n",
    "    <title>The Great Gatsby</title>\n",
    "    <author>F. Scott Fitzgerald</author>\n",
    "    <year>1925</year>\n",
    "  </book>\n",
    "  <book category=\"non-fiction\">\n",
    "    <title>Sapiens: A Brief History of Humankind</title>\n",
    "    <author>Yuval Noah Harari</author>\n",
    "    <year>2011</year>\n",
    "  </book>\n",
    "</bookstore>\n",
    "\"\"\"\n",
    "re.findall(\">.*<\", line)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, let's hand it over to the LLM and let it wrap up the demo:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os, signal\n",
    "\n",
    "# Don't actually kill the process, it's just for demo :D\n",
    "# os.kill(os.getpid(), signal.SIGUSR1)  # Terminate the current process~"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References:\n",
    "\n",
    "- [CoNaLa: The Code/Natural Language Challenge](https://conala-corpus.github.io/)\n",
    "- [HuggingFace: DeepSpeed Integration](https://huggingface.co/docs/transformers/main_classes/deepspeed#deepspeed-integration)\n",
    "- [HuggingFace: Handling big models for inference](https://huggingface.co/docs/accelerate/main/usage_guides/big_modeling)\n",
    "- [Lightning Transformers: DeepSpeed Training with Big Transformer Models](https://lightning-transformers.readthedocs.io/en/latest/)\n",
    "- Rajbhandari, S., Rasley, J., et al. (2020). ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. [arXiv:1910.02054](https://arxiv.org/abs/1910.02054)\n",
    "- Zheng, L., Chiang, W-L., Sheng, Y., et al. (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. [arXiv:2306.05685](https://arxiv.org/abs/2306.05685)\n",
    "\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.18"
  },
  "orphan": true
 },
 "nbformat": 4,
 "nbformat_minor": 4
}