Incremental Learning with Ray AIR
This example is adapted from the Continual AI Avalanche quick start: https://avalanche.continualai.org/
In this example, we show how to use Ray AIR to incrementally train a simple image classification PyTorch model on a stream of incoming tasks.
Each task is a random permutation of the MNIST Dataset, which is a common benchmark used for continual training. After training on all the tasks, the model is expected to be able to make predictions on data from any task.
In this example, we use a naive fine-tuning strategy, in which the model is trained on each task in sequence without any special methods to prevent catastrophic forgetting. As a result, model performance is expected to be poor.
More precisely, this example showcases domain incremental training: at prediction/testing time, the model is asked to predict on data from any of the tasks trained on so far, without being given the task ID. This is in contrast to task incremental training, where the task ID is provided at prediction/testing time.
For more information on the three different categories of incremental/continual learning, please see "Three scenarios for continual learning" by van de Ven and Tolias.
This example will cover the following:
Loading a PyTorch Dataset into a Ray Dataset.
Creating an Iterator[ray.data.Dataset] abstraction to represent a stream of data to train on for incremental training.
Implementing a custom Ray AIR preprocessor to preprocess the Dataset.
Incrementally training a model using data parallel training.
Using our trained model to perform batch prediction on test data.
Incrementally deploying our trained model with Ray Serve and performing online prediction queries.
Step 1: Installations and Initializing Ray
To get started, let's first install the necessary packages: Ray AIR, torch, and torchvision. Uncomment the lines below and run the cell to install them.
# !pip install -q "ray[air]"
# !pip install -q torch
# !pip install -q torchvision
Then, let's initialize Ray! We can just import and call ray.init(). If you are running on a Ray cluster, you can call ray.init("auto") to connect to the cluster instead of initializing a new local Ray instance.
import ray
ray.init()
# If running on a cluster, use the below line instead.
# ray.init("auto")
2022-09-23 16:31:18,554 INFO worker.py:1509 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
Ray
Python version: 3.10.6
Ray version: 3.0.0.dev0
Dashboard: http://127.0.0.1:8265
Step 2: Define our PyTorch Model
Now that we have the necessary installations, let's define our PyTorch model. To classify MNIST images in this example, we will use a simple multi-layer perceptron.
import torch.nn as nn
class SimpleMLP(nn.Module):
def __init__(self, num_classes=10, input_size=28 * 28):
super(SimpleMLP, self).__init__()
self.features = nn.Sequential(
nn.Linear(input_size, 512),
nn.ReLU(inplace=True),
nn.Dropout(),
)
self.classifier = nn.Linear(512, num_classes)
self._input_size = input_size
def forward(self, x):
x = x.contiguous()
x = x.view(-1, self._input_size)
x = self.features(x)
x = self.classifier(x)
return x
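As a quick, optional sanity check (not part of the original example), we can run a dummy batch through the model to confirm the shapes: the forward pass flattens each 28x28 image into 784 features and outputs one logit per class.
import torch

# Hypothetical smoke test: 8 random "images" of shape 28x28.
model = SimpleMLP(num_classes=10)
dummy_batch = torch.randn(8, 28, 28)
logits = model(dummy_batch)
print(logits.shape)  # torch.Size([8, 10])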
Step 3: Create the Stream of Tasks
We can now create a stream of tasks (where each task contains a dataset to train on). For this example, we will create an artificial stream of tasks consisting of permuted variations of MNIST, which is a classic benchmark in continual learning research.
For real-world scenarios, this step is not necessary as fresh data will already be arriving as a stream of tasks. It does not need to be artificially created.
3a: Load the MNIST Dataset into a Ray Dataset
Let's first define a simple function that returns the original MNIST dataset as a distributed Ray Dataset. Ray Datasets are the standard way to load and exchange data in Ray libraries and applications; you can read more about them in the Ray documentation.
The function in the code snippet below does the following:
Downloads the MNIST dataset from torchvision in memory.
Loads the in-memory Torch Dataset into a Ray Dataset.
Converts the Ray Dataset into NumPy format. Instead of iterating over tuples, the Ray Dataset will have two columns, "image" and "label". This allows us to apply built-in preprocessors to the Ray Dataset and to use the Ray Dataset with Ray AIR Predictors.
For this example, since we are just working with the MNIST dataset, which is small, we use from_torch, which loads the full MNIST dataset into memory.
For loading larger datasets in a parallel fashion, you should use Ray Dataset's additional read APIs to load data from Parquet, CSV, image files, and more!
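As a rough illustration (the bucket paths below are placeholders, not part of this example), such parallel reads look like the following:
import ray

# Hypothetical parallel reads with Ray Data; the paths are placeholders.
parquet_ds = ray.data.read_parquet("s3://my-bucket/my-dataset/")
csv_ds = ray.data.read_csv("s3://my-bucket/my-table.csv")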
import numpy as np
import pandas as pd
import torchvision
from torchvision.transforms import RandomCrop
import ray
def get_mnist_dataset(train: bool = True) -> ray.data.Dataset:
"""Returns MNIST Dataset as a ray.data.Dataset.
Args:
train: Whether to return the train dataset or test dataset.
"""
if train:
# Only perform random cropping on the Train dataset.
transform = RandomCrop(28, padding=4)
else:
transform = None
mnist_dataset = torchvision.datasets.MNIST("./data", download=True, train=train, transform=transform)
mnist_dataset = ray.data.from_torch(mnist_dataset)
def convert_batch_to_numpy(batch):
images = np.array([np.array(item[0]) for item in batch])
labels = np.array([item[1] for item in batch])
return {"image": images, "label": labels}
mnist_dataset = mnist_dataset.map_batches(convert_batch_to_numpy).cache()
return mnist_dataset
3b: Create our Stream abstraction
Now we can create our "stream" abstraction. This abstraction provides two methods (generate_train_stream and generate_test_stream) that each return an Iterator over Ray Datasets. Each item in this iterator contains a unique permutation of MNIST and is one task that we want to train on.
In this example, the "stream of tasks" is contrived, since all the data for all tasks already exists in an offline setting. For true online continual learning, you would want to implement a custom dataset iterator that reads from some streaming datasource to produce new tasks. The only abstraction that's needed is Iterator[ray.data.Dataset].
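As a hedged sketch of what that might look like in a real deployment (the helper name and paths here are hypothetical), each new task's data could simply be read from storage as it arrives:
from typing import Iterator, List

import ray

def stream_from_datasource(task_paths: List[str]) -> Iterator[ray.data.Dataset]:
    """Hypothetical stream: yield one Ray Dataset per newly arrived task."""
    for path in task_paths:
        # Each path could point at, e.g., a Parquet directory written by an upstream pipeline.
        yield ray.data.read_parquet(path)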
Note that the test dataset stream has the same permutations that are used for the training dataset stream. In general for continual learning, it is expected that the data distribution of the test/prediction data follows what the model was trained on. If you notice that the distribution of new prediction queries is changing compared to the distribution of the training data, then you should probably trigger training of a new task.
from typing import Dict, Iterator, List
import random
import numpy as np
from ray.data import ActorPoolStrategy
class PermutedMNISTStream:
"""Generates streams of permuted MNIST Datasets.
Example:
permuted_mnist = PermutedMNISTStream(n_tasks=3)
train_stream = permuted_mnist.generate_train_stream()
# Iterate through the train_stream
for train_dataset in train_stream:
...
Args:
n_tasks: The number of tasks to generate.
"""
def __init__(self, n_tasks: int = 3):
self.n_tasks = n_tasks
self.permutations = [
np.random.permutation(28 * 28) for _ in range(self.n_tasks)
]
self.train_mnist_dataset = get_mnist_dataset(train=True)
self.test_mnist_dataset = get_mnist_dataset(train=False)
def random_permute_dataset(
self, dataset: ray.data.Dataset, permutation: np.ndarray
):
"""Randomly permutes the pixels for each image in the dataset."""
class PixelsPermutation(object):
def __call__(self, batch):
batch["image"] = batch["image"].map(lambda image: image.reshape(-1)[permutation].reshape(28, 28))
return batch
return dataset.map_batches(PixelsPermutation, compute=ActorPoolStrategy(), batch_format="pandas")
def generate_train_stream(self) -> Iterator[ray.data.Dataset]:
for permutation in self.permutations:
permuted_mnist_dataset = self.random_permute_dataset(
self.train_mnist_dataset, permutation
)
yield permuted_mnist_dataset
def generate_test_stream(self) -> Iterator[ray.data.Dataset]:
for permutation in self.permutations:
permuted_mnist_dataset = self.random_permute_dataset(
self.test_mnist_dataset, permutation
)
yield permuted_mnist_dataset
def generate_test_samples(self, num_samples: int = 10) -> List[np.ndarray]:
"""Generates num_samples permuted MNIST images."""
random_permutation = random.choice(self.permutations)
return list(self.random_permute_dataset(
self.test_mnist_dataset.random_shuffle().limit(num_samples),
random_permutation,
).to_pandas()["image"].to_numpy())
Step 4: Define the logic for Training and Inference/Prediction
Now that we can get an Iterator over Ray Datasets, we can incrementally train our model in a data parallel fashion via Ray Train, while incrementally deploying our model via Ray Serve. Let's define some helper functions to allow us to do this!
If you are not familiar with data parallel training, it is a distributed training strategy in which we have multiple model replicas, and each replica trains on a different batch of data. After each batch, the gradients are synchronized across the replicas. This effectively allows us to train on more data in a shorter amount of time.
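With Ray Train, the degree of data parallelism is controlled by the ScalingConfig passed to the Trainer. For instance, a hypothetical configuration requesting four workers, each with a GPU, looks like this (this example itself uses the smaller configuration defined in Step 5):
from ray.air.config import ScalingConfig

# Hypothetical scaling: 4 data parallel workers, each holding a model replica
# and training on its own shard of the dataset.
scaling_config = ScalingConfig(num_workers=4, use_gpu=True)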
4a: Define our training logic for each Data Parallel worker
The first thing we need to do is define the training loop that will be run on each training worker.
The training loop takes in a config dict as an argument that we can use to pass in any configurations for training.
This is just standard PyTorch training, with the difference being that we can leverage Ray Train's utility functions and the Ray AIR Session:
ray.train.torch.prepare_model(...): Prepares the model for distributed training by wrapping it in either PyTorch DistributedDataParallel or FullyShardedDataParallel and moving it to the correct accelerator device.
ray.air.session.get_dataset_shard(...): Gets the Ray Dataset shard for this particular data parallel worker.
ray.air.session.report({}, checkpoint=...): Tells Ray Train to persist the provided Checkpoint object.
ray.air.session.get_checkpoint(): Returns a checkpoint to resume from. This is useful either for fault tolerance purposes or, for our purposes, to continue training the same model on a new incoming dataset.
from ray import train
from ray.air import session, Checkpoint
from torch.optim import SGD
from torch.nn import CrossEntropyLoss
def train_loop_per_worker(config: dict):
num_epochs = config["num_epochs"]
learning_rate = config["learning_rate"]
momentum = config["momentum"]
batch_size = config["batch_size"]
model = SimpleMLP(num_classes=10)
# Load model from checkpoint if there is a checkpoint to load from.
checkpoint_to_load = session.get_checkpoint()
if checkpoint_to_load:
state_dict_to_resume_from = checkpoint_to_load.to_dict()["model"]
model.load_state_dict(state_dict=state_dict_to_resume_from)
model = train.torch.prepare_model(model)
optimizer = SGD(model.parameters(), lr=learning_rate, momentum=momentum)
criterion = CrossEntropyLoss()
# Get the Ray Dataset shard for this data parallel worker, and convert it to a PyTorch Dataset.
dataset_shard = session.get_dataset_shard("train").iter_torch_batches(
batch_size=batch_size,
)
for epoch_idx in range(num_epochs):
running_loss = 0
for iteration, batch in enumerate(dataset_shard):
optimizer.zero_grad()
train_mb_x, train_mb_y = batch["image"], batch["label"]
train_mb_x = train_mb_x.to(train.torch.get_device())
train_mb_y = train_mb_y.to(train.torch.get_device())
# Forward
logits = model(train_mb_x)
# Loss
loss = criterion(logits, train_mb_y)
# Backward
loss.backward()
# Update
optimizer.step()
running_loss += loss.item()
if session.get_world_rank() == 0 and iteration % 500 == 0:
print(f"loss: {loss.item():>7f}, epoch: {epoch_idx}, iteration: {iteration}")
# Checkpoint model after every epoch.
state_dict = model.state_dict()
checkpoint = Checkpoint.from_dict(dict(model=state_dict))
session.report({"loss": running_loss}, checkpoint=checkpoint)
4b: Define our Preprocessor
Next, we define our Preprocessor to preprocess our data before training and prediction. Our preprocessor will normalize the MNIST images by the mean and standard deviation of the MNIST training dataset. This is a common operation to do on MNIST to improve training: https://discuss.pytorch.org/t/normalization-in-the-mnist-example/457
from typing import Dict
import numpy as np
import torch
from torchvision import transforms
from ray.data.preprocessors import TorchVisionPreprocessor
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
])
mnist_normalize_preprocessor = TorchVisionPreprocessor(columns=["image"], transform=transform)
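As a quick, optional check (using a dummy all-black image rather than real MNIST data, and reusing the transform defined above), ToTensor scales uint8 pixels into [0, 1] and adds a channel dimension, and Normalize then applies (x - 0.1307) / 0.3081:
# Dummy 28x28 uint8 image; every pixel becomes (0 - 0.1307) / 0.3081, roughly -0.4242.
dummy_image = np.zeros((28, 28), dtype=np.uint8)
normalized = transform(dummy_image)
print(normalized.shape)  # torch.Size([1, 28, 28])
print(float(normalized.min()))  # roughly -0.4242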
4c: Define logic for Batch/Offline Prediction
After training on each task, we want to use our trained model to do batch (i.e., offline) inference on a test dataset.
To do this, we leverage the built-in BatchPredictor. We define a batch_predict function that takes in a Checkpoint and a test dataset and outputs the accuracy our model achieves on that test dataset.
from ray.train.batch_predictor import BatchPredictor
from ray.train.torch import TorchPredictor
def batch_predict(checkpoint: ray.air.Checkpoint, test_dataset: ray.data.Dataset) -> float:
"""Perform batch prediction on the provided test dataset, and return accuracy results."""
batch_predictor = BatchPredictor.from_checkpoint(checkpoint, predictor_cls=TorchPredictor, model=SimpleMLP(num_classes=10))
model_output = batch_predictor.predict(
data=test_dataset, feature_columns=["image"], keep_columns=["label"]
)
# Postprocess model outputs.
# Convert logits outputted from model into actual class predictions.
def convert_logits_to_classes(df):
best_class = df["predictions"].map(lambda x: np.array(x).argmax())
df["predictions"] = best_class
return df
prediction_results = model_output.map_batches(convert_logits_to_classes, batch_format="pandas")
# Then, for each prediction output, see if it matches with the ground truth
# label.
def calculate_prediction_scores(df):
return pd.DataFrame({"correct": df["predictions"] == df["label"]})
correct_dataset = prediction_results.map_batches(
calculate_prediction_scores, batch_format="pandas"
)
return correct_dataset.sum(on="correct") / correct_dataset.count()
4d: Define logic for Deploying and Querying our model
In addition to batch inference, we also want to deploy our model so that we can submit live queries to it for online inference. We use Ray Serve's PredictorDeployment utility to deploy our trained model.
Once we deploy the model, we can send HTTP requests to our deployment.
from typing import List
import requests
from requests import Response
import numpy as np
from ray.serve.http_adapters import json_to_ndarray
def deploy_model(checkpoint: ray.air.Checkpoint) -> str:
"""Deploys the model from the provided Checkpoint and returns the URL for the endpoint of the model deployment."""
serve.run(
PredictorDeployment.options(
name="mnist_model",
route_prefix="/mnist_predict",
num_replicas=2,
).bind(
http_adapter=json_to_ndarray,
predictor_cls=TorchPredictor,
            checkpoint=checkpoint,
model=SimpleMLP(num_classes=10),
)
)
return "http://localhost:8000/mnist_predict"
# Function that queries our deployed model
def query_deployment(test_samples: List[np.ndarray], endpoint_uri: str) -> List[Response]:
"""Given a set of test samples, queries the model deployment at the provided endpoint and returns the results."""
results = []
# Convert to Python List since Numpy arrays are not Json serializable.
for sample in test_samples:
results.append(requests.post(endpoint_uri, json={"array": sample.tolist(), "dtype": "float32"}))
return results
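Once the deployment from Step 5 is live, the returned Response objects can be inspected along these lines. This is only a sketch: the exact JSON structure of the response body depends on the Serve and Predictor versions, so treat the interpretation as an assumption.
# Hypothetical inspection of online inference results; assumes the model is deployed (see Step 5).
responses = query_deployment(test_samples, endpoint_uri)
for response in responses:
    # Each response body holds the model output for one permuted test image;
    # taking an argmax over the returned logits would give the predicted digit.
    print(response.status_code, response.json())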
Step 5: Putting it all together
Once we have defined our training logic and our preprocessor, we can put everything together!
For each dataset in our stream, we do the following:
Train on the dataset in a data parallel fashion. We create a TorchTrainer, specify the config for the training loop we defined above, the dataset to train on, and how much we want to scale. TorchTrainer also accepts a resume_from_checkpoint argument to continue training from a previously saved checkpoint.
Get the saved checkpoint from the training run.
Test our trained model on a test set containing test data from all the tasks trained on so far.
After training on each task, we deploy our model so we can query it for predictions.
In this example, the training and test data for each task is well-defined beforehand by the benchmark. For real-world scenarios, this probably will not be the case. It is very likely that the prediction requests after training on one task will become the training data for the next task.
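As a purely hypothetical sketch of that hand-off (the helper function and its inputs are not part of this example), collected prediction requests could be labeled and assembled into the next task's training Dataset:
from typing import List

import numpy as np
import ray

def build_next_task_dataset(
    collected_images: List[np.ndarray], collected_labels: List[int]
) -> ray.data.Dataset:
    """Hypothetical: turn logged prediction requests plus labels into a new training Dataset."""
    return ray.data.from_items(
        [{"image": image, "label": label} for image, label in zip(collected_images, collected_labels)]
    )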
from ray.train.torch import TorchTrainer
from ray.air.config import ScalingConfig
from ray.train.torch import TorchPredictor
from ray import serve
from ray.serve import PredictorDeployment
from ray.serve.http_adapters import json_to_ndarray
# The number of tasks (i.e. datasets in our stream) that we want to use for this example.
n_tasks = 3
# Number of epochs to train each task for.
num_epochs = 4
# Batch size.
batch_size = 32
# Optimizer args.
learning_rate = 0.001
momentum = 0.9
# Number of data parallel workers to use for training.
num_workers = 1
# Whether to use GPU or not.
use_gpu = ray.available_resources().get("GPU", 0) > 0
permuted_mnist = PermutedMNISTStream(n_tasks=n_tasks)
train_stream = permuted_mnist.generate_train_stream()
test_stream = permuted_mnist.generate_test_stream()
latest_checkpoint = None
accuracy_for_all_tasks = []
task_idx = 0
all_test_datasets_seen_so_far = []
for train_dataset, test_dataset in zip(train_stream, test_stream):
print(f"Starting training for task: {task_idx}")
task_idx += 1
# *********Training*****************
trainer = TorchTrainer(
train_loop_per_worker=train_loop_per_worker,
train_loop_config={
"num_epochs": num_epochs,
"learning_rate": learning_rate,
"momentum": momentum,
"batch_size": batch_size,
},
# Have to specify trainer_resources as 0 so that the example works on Colab.
scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu, trainer_resources={"CPU": 0}),
datasets={"train": train_dataset},
preprocessor=mnist_normalize_preprocessor,
resume_from_checkpoint=latest_checkpoint,
)
result = trainer.fit()
latest_checkpoint = result.checkpoint
# **************Batch Prediction**************************
# We can do batch prediction on the test data for the tasks seen so far.
# TODO: Fix type signature in Ray Datasets
# TODO: Fix dataset.union when used with empty list.
if len(all_test_datasets_seen_so_far) > 0:
full_test_dataset = test_dataset.union(*all_test_datasets_seen_so_far)
else:
full_test_dataset = test_dataset
all_test_datasets_seen_so_far.append(test_dataset)
accuracy_for_this_task = batch_predict(latest_checkpoint, full_test_dataset)
print(f"Accuracy for task {task_idx}: {accuracy_for_this_task}")
accuracy_for_all_tasks.append(accuracy_for_this_task)
# *************Model Deployment & Online Inference***************************
# We can also deploy our model to do online inference with Ray Serve.
# Start Ray Serve.
test_samples = permuted_mnist.generate_test_samples()
endpoint_uri = deploy_model(latest_checkpoint)
online_inference_results = query_deployment(test_samples, endpoint_uri)
if ray.available_resources().get("CPU", 0) < num_workers+1:
# If there are no more CPUs left, then shutdown the Serve replicas so we can continue training on the next task.
serve.shutdown()
serve.shutdown()
Read->Map_Batches: 100%|██████████| 1/1 [00:03<00:00, 3.42s/it]
Read->Map_Batches: 100%|██████████| 1/1 [00:00<00:00, 5.27it/s]
Map Progress (1 actors 1 pending): 100%|██████████| 1/1 [00:00<00:00, 1.40it/s]
Read->Map_Batches: 100%|██████████| 1/1 [00:00<00:00, 4.17it/s]
Map Progress (1 actors 1 pending): 100%|██████████| 1/1 [00:00<00:00, 1.78it/s]
Starting training for task: 0
Tune Status
Current time: | 2022-09-23 16:31:51 |
Running for: | 00:00:20.79 |
Memory: | 17.1/62.7 GiB |
System Info
Using FIFO scheduling algorithm.Resources requested: 0/24 CPUs, 0/0 GPUs, 0.0/32.53 GiB heap, 0.0/16.26 GiB objects
Trial Status
Trial name | status | loc | iter | total time (s) | loss | _timestamp | _time_this_iter_s |
---|---|---|---|---|---|---|---|
TorchTrainer_da157_00000 | TERMINATED | 10.109.175.190:856770 | 4 | 17.0121 | 0 | 1663975908 | 0.0839479 |
(RayTrainWorker pid=856836) 2022-09-23 16:31:37,847 INFO config.py:71 -- Setting up process group for: env:// [rank=0, world_size=1]
(RayTrainWorker pid=856836) 2022-09-23 16:31:38,047 INFO train_loop_utils.py:354 -- Moving model to device: cuda:0
(RayTrainWorker pid=856836) loss: 2.436360, epoch: 0, iteration: 0
(RayTrainWorker pid=856836) loss: 1.608793, epoch: 0, iteration: 500
(RayTrainWorker pid=856836) loss: 1.285775, epoch: 0, iteration: 1000
(RayTrainWorker pid=856836) loss: 0.785092, epoch: 0, iteration: 1500
Trial Progress
Trial name | _time_this_iter_s | _timestamp | _training_iteration | date | done | episodes_total | experiment_id | experiment_tag | hostname | iterations_since_restore | loss | node_ip | pid | should_checkpoint | time_since_restore | time_this_iter_s | time_total_s | timestamp | timesteps_since_restore | timesteps_total | training_iteration | trial_id | warmup_time |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TorchTrainer_da157_00000 | 0.0839479 | 1663975908 | 4 | 2022-09-23_16-31-49 | True | 96c794a64d6f43d79b87130a76d21f1f | 0 | corvus | 4 | 0 | 10.109.175.190 | 856770 | True | 17.0121 | 0.11111 | 17.0121 | 1663975909 | 0 | 4 | da157_00000 | 0.00297165 |
2022-09-23 16:31:51,231 INFO tune.py:762 -- Total run time: 20.91 seconds (20.79 seconds for the tuning loop).
Map_Batches: 0%| | 0/1 [00:00<?, ?it/s](BlockWorker pid=857028) 2022-09-23 16:31:52,652 WARNING torch_predictor.py:53 -- You have `use_gpu` as False but there are 1 GPUs detected on host where prediction will only use CPU. Please consider explicitly setting `TorchPredictor(use_gpu=True)` or `batch_predictor.predict(ds, num_gpus_per_worker=1)` to enable GPU prediction.
Map Progress (1 actors 1 pending): 100%|██████████| 1/1 [00:02<00:00, 2.17s/it]
Map_Batches: 100%|██████████| 1/1 [00:00<00:00, 40.09it/s]
Map_Batches: 100%|██████████| 1/1 [00:00<00:00, 116.17it/s]
Shuffle Map: 100%|██████████| 1/1 [00:00<00:00, 141.72it/s]
Shuffle Reduce: 100%|██████████| 1/1 [00:00<00:00, 220.51it/s]
Accuracy for task 1: 0.8678
Shuffle Map: 100%|██████████| 1/1 [00:00<00:00, 58.32it/s]
Shuffle Reduce: 100%|██████████| 1/1 [00:00<00:00, 79.30it/s]
Map Progress (1 actors 1 pending): 0%| | 0/1 [00:00<?, ?it/s](BlockWorker pid=857062) 2022-09-23 16:31:54,055 WARNING torch_predictor.py:53 -- You have `use_gpu` as False but there are 1 GPUs detected on host where prediction will only use CPU. Please consider explicitly setting `TorchPredictor(use_gpu=True)` or `batch_predictor.predict(ds, num_gpus_per_worker=1)` to enable GPU prediction.
Map Progress (1 actors 1 pending): 100%|██████████| 1/1 [00:00<00:00, 1.77it/s]
(ServeController pid=857134) INFO 2022-09-23 16:31:54,643 controller 857134 http_state.py:129 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-610d4158d56aeda61abd25d5751611d23ba1aa97eddb34d2ee4e6020' on node '610d4158d56aeda61abd25d5751611d23ba1aa97eddb34d2ee4e6020' listening on '127.0.0.1:8000'
(HTTPProxyActor pid=857184) INFO: Started server process [857184]
(ServeController pid=857134) INFO 2022-09-23 16:31:55,258 controller 857134 deployment_state.py:1277 - Adding 2 replicas to deployment 'mnist_model'.
(ServeReplica:mnist_model pid=857227) 2022-09-23 16:31:56,857 WARNING torch_predictor.py:53 -- You have `use_gpu` as False but there are 1 GPUs detected on host where prediction will only use CPU. Please consider explicitly setting `TorchPredictor(use_gpu=True)` or `batch_predictor.predict(ds, num_gpus_per_worker=1)` to enable GPU prediction.
(ServeReplica:mnist_model pid=857234) 2022-09-23 16:31:56,871 WARNING torch_predictor.py:53 -- You have `use_gpu` as False but there are 1 GPUs detected on host where prediction will only use CPU. Please consider explicitly setting `TorchPredictor(use_gpu=True)` or `batch_predictor.predict(ds, num_gpus_per_worker=1)` to enable GPU prediction.
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:31:57,276 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 307 3.3ms
(ServeReplica:mnist_model pid=857227) INFO 2022-09-23 16:31:57,275 mnist_model mnist_model#QckEDj replica.py:505 - HANDLE __call__ OK 0.2ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:32:02,313 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 200 5035.8ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:32:02,360 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 307 2.7ms
(ServeReplica:mnist_model pid=857227) INFO 2022-09-23 16:32:02,359 mnist_model mnist_model#QckEDj replica.py:505 - HANDLE __call__ OK 0.2ms
(ServeReplica:mnist_model pid=857234) INFO 2022-09-23 16:32:02,312 mnist_model mnist_model#PEPxlw replica.py:505 - HANDLE __call__ OK 5031.9ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:32:07,340 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 200 4978.8ms
(ServeReplica:mnist_model pid=857234) INFO 2022-09-23 16:32:07,339 mnist_model mnist_model#PEPxlw replica.py:505 - HANDLE __call__ OK 4975.8ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:32:07,391 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 307 3.8ms
(ServeReplica:mnist_model pid=857227) INFO 2022-09-23 16:32:07,390 mnist_model mnist_model#QckEDj replica.py:505 - HANDLE __call__ OK 0.2ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:32:12,367 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 200 4974.3ms
(ServeReplica:mnist_model pid=857234) INFO 2022-09-23 16:32:12,364 mnist_model mnist_model#PEPxlw replica.py:505 - HANDLE __call__ OK 4970.9ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:32:12,414 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 307 3.4ms
(ServeReplica:mnist_model pid=857227) INFO 2022-09-23 16:32:12,413 mnist_model mnist_model#QckEDj replica.py:505 - HANDLE __call__ OK 0.3ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:32:17,394 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 200 4977.8ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:32:17,444 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 307 4.6ms
(ServeReplica:mnist_model pid=857227) INFO 2022-09-23 16:32:17,443 mnist_model mnist_model#QckEDj replica.py:505 - HANDLE __call__ OK 0.2ms
(ServeReplica:mnist_model pid=857234) INFO 2022-09-23 16:32:17,392 mnist_model mnist_model#PEPxlw replica.py:505 - HANDLE __call__ OK 4973.3ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:32:22,419 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 200 4972.7ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:32:22,471 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 307 4.0ms
(ServeReplica:mnist_model pid=857227) INFO 2022-09-23 16:32:22,470 mnist_model mnist_model#QckEDj replica.py:505 - HANDLE __call__ OK 0.4ms
(ServeReplica:mnist_model pid=857234) INFO 2022-09-23 16:32:22,417 mnist_model mnist_model#PEPxlw replica.py:505 - HANDLE __call__ OK 4969.1ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:32:27,440 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 200 4966.7ms
(ServeReplica:mnist_model pid=857234) INFO 2022-09-23 16:32:27,439 mnist_model mnist_model#PEPxlw replica.py:505 - HANDLE __call__ OK 4963.6ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:32:27,490 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 307 3.0ms
(ServeReplica:mnist_model pid=857227) INFO 2022-09-23 16:32:27,489 mnist_model mnist_model#QckEDj replica.py:505 - HANDLE __call__ OK 0.3ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:32:32,469 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 200 4977.7ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:32:32,520 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 307 3.2ms
(ServeReplica:mnist_model pid=857227) INFO 2022-09-23 16:32:32,519 mnist_model mnist_model#QckEDj replica.py:505 - HANDLE __call__ OK 0.2ms
(ServeReplica:mnist_model pid=857234) INFO 2022-09-23 16:32:32,467 mnist_model mnist_model#PEPxlw replica.py:505 - HANDLE __call__ OK 4974.2ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:32:37,496 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 200 4974.4ms
(ServeReplica:mnist_model pid=857234) INFO 2022-09-23 16:32:37,495 mnist_model mnist_model#PEPxlw replica.py:505 - HANDLE __call__ OK 4971.3ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:32:37,544 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 307 3.2ms
(ServeReplica:mnist_model pid=857227) INFO 2022-09-23 16:32:37,543 mnist_model mnist_model#QckEDj replica.py:505 - HANDLE __call__ OK 0.2ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:32:42,522 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 200 4975.8ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:32:42,570 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 307 2.8ms
(ServeReplica:mnist_model pid=857234) INFO 2022-09-23 16:32:42,520 mnist_model mnist_model#PEPxlw replica.py:505 - HANDLE __call__ OK 4972.9ms
(ServeReplica:mnist_model pid=857234) INFO 2022-09-23 16:32:42,569 mnist_model mnist_model#PEPxlw replica.py:505 - HANDLE __call__ OK 0.2ms
Map_Batches: 0%| | 0/1 [00:00<?, ?it/s](HTTPProxyActor pid=857184) INFO 2022-09-23 16:32:47,614 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 200 5042.0ms
(ServeReplica:mnist_model pid=857227) INFO 2022-09-23 16:32:47,612 mnist_model mnist_model#QckEDj replica.py:505 - HANDLE __call__ OK 5039.0ms
Map Progress (1 actors 1 pending): 100%|██████████| 1/1 [00:00<00:00, 1.34it/s]
Read->Map_Batches: 100%|██████████| 1/1 [00:00<00:00, 4.26it/s]
Map Progress (1 actors 1 pending): 100%|██████████| 1/1 [00:00<00:00, 1.72it/s]
Starting training for task: 1
Tune Status
Current time: | 2022-09-23 16:33:08 |
Running for: | 00:00:19.49 |
Memory: | 18.2/62.7 GiB |
System Info
Using FIFO scheduling algorithm.Resources requested: 0/24 CPUs, 0/0 GPUs, 0.0/32.53 GiB heap, 0.0/16.26 GiB objects
Trial Status
Trial name | status | loc | iter | total time (s) | loss | _timestamp | _time_this_iter_s |
---|---|---|---|---|---|---|---|
TorchTrainer_09424_00000 | TERMINATED | 10.109.175.190:857781 | 4 | 15.3611 | 0 | 1663975986 | 0.0699804 |
(RayTrainWorker pid=857818) 2022-09-23 16:32:55,672 INFO config.py:71 -- Setting up process group for: env:// [rank=0, world_size=1]
(RayTrainWorker pid=857818) 2022-09-23 16:32:55,954 INFO train_loop_utils.py:354 -- Moving model to device: cuda:0
(RayTrainWorker pid=857818) loss: 2.457292, epoch: 0, iteration: 0
(RayTrainWorker pid=857818) loss: 1.339169, epoch: 0, iteration: 500
(RayTrainWorker pid=857818) loss: 1.032746, epoch: 0, iteration: 1000
(RayTrainWorker pid=857818) loss: 0.707931, epoch: 0, iteration: 1500
Trial Progress
Trial name | _time_this_iter_s | _timestamp | _training_iteration | date | done | episodes_total | experiment_id | experiment_tag | hostname | iterations_since_restore | loss | node_ip | pid | should_checkpoint | time_since_restore | time_this_iter_s | time_total_s | timestamp | timesteps_since_restore | timesteps_total | training_iteration | trial_id | warmup_time |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TorchTrainer_09424_00000 | 0.0699804 | 1663975986 | 4 | 2022-09-23_16-33-06 | True | 77c9c5f109fa4a47b459b0afadf3ba33 | 0 | corvus | 4 | 0 | 10.109.175.190 | 857781 | True | 15.3611 | 0.0725608 | 15.3611 | 1663975986 | 0 | 4 | 09424_00000 | 0.00418878 |
2022-09-23 16:33:09,072 INFO tune.py:762 -- Total run time: 19.62 seconds (19.49 seconds for the tuning loop).
Map Progress (1 actors 1 pending): 0%| | 0/2 [00:01<?, ?it/s](BlockWorker pid=857874) 2022-09-23 16:33:10,528 WARNING torch_predictor.py:53 -- You have `use_gpu` as False but there are 1 GPUs detected on host where prediction will only use CPU. Please consider explicitly setting `TorchPredictor(use_gpu=True)` or `batch_predictor.predict(ds, num_gpus_per_worker=1)` to enable GPU prediction.
Map Progress (2 actors 1 pending): 50%|█████     | 1/2 [00:02<00:02, 2.23s/it](BlockWorker pid=857902) 2022-09-23 16:33:11,882 WARNING torch_predictor.py:53 -- You have `use_gpu` as False but there are 1 GPUs detected on host where prediction will only use CPU. Please consider explicitly setting `TorchPredictor(use_gpu=True)` or `batch_predictor.predict(ds, num_gpus_per_worker=1)` to enable GPU prediction.
Map Progress (2 actors 1 pending): 100%|██████████| 2/2 [00:03<00:00, 1.53s/it]
Map_Batches: 100%|██████████| 2/2 [00:00<00:00, 7.46it/s]
Map_Batches: 100%|██████████| 2/2 [00:00<00:00, 125.99it/s]
Shuffle Map: 100%|██████████| 2/2 [00:00<00:00, 269.85it/s]
Shuffle Reduce: 100%|██████████| 1/1 [00:00<00:00, 261.75it/s]
Accuracy for task 2: 0.86465
Shuffle Map: 100%|██████████| 1/1 [00:00<00:00, 97.22it/s]
Shuffle Reduce: 100%|██████████| 1/1 [00:00<00:00, 96.18it/s]
Map Progress (1 actors 1 pending): 100%|██████████| 1/1 [00:00<00:00, 1.83it/s]
(ServeController pid=857134) INFO 2022-09-23 16:33:13,164 controller 857134 deployment_state.py:1234 - Stopping 1 replicas of deployment 'mnist_model' with outdated versions.
(BlockWorker pid=857930) 2022-09-23 16:33:13,290 WARNING torch_predictor.py:53 -- You have `use_gpu` as False but there are 1 GPUs detected on host where prediction will only use CPU. Please consider explicitly setting `TorchPredictor(use_gpu=True)` or `batch_predictor.predict(ds, num_gpus_per_worker=1)` to enable GPU prediction.
(ServeController pid=857134) INFO 2022-09-23 16:33:15,301 controller 857134 deployment_state.py:1277 - Adding 1 replica to deployment 'mnist_model'.
(ServeReplica:mnist_model pid=858036) 2022-09-23 16:33:16,792 WARNING torch_predictor.py:53 -- You have `use_gpu` as False but there are 1 GPUs detected on host where prediction will only use CPU. Please consider explicitly setting `TorchPredictor(use_gpu=True)` or `batch_predictor.predict(ds, num_gpus_per_worker=1)` to enable GPU prediction.
(ServeController pid=857134) INFO 2022-09-23 16:33:16,946 controller 857134 deployment_state.py:1234 - Stopping 1 replicas of deployment 'mnist_model' with outdated versions.
(ServeController pid=857134) INFO 2022-09-23 16:33:19,087 controller 857134 deployment_state.py:1277 - Adding 1 replica to deployment 'mnist_model'.
(ServeReplica:mnist_model pid=858081) 2022-09-23 16:33:20,575 WARNING torch_predictor.py:53 -- You have `use_gpu` as False but there are 1 GPUs detected on host where prediction will only use CPU. Please consider explicitly setting `TorchPredictor(use_gpu=True)` or `batch_predictor.predict(ds, num_gpus_per_worker=1)` to enable GPU prediction.
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:33:21,138 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 307 4.4ms
(ServeReplica:mnist_model pid=858036) INFO 2022-09-23 16:33:21,137 mnist_model mnist_model#JcKoby replica.py:505 - HANDLE __call__ OK 0.3ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:33:26,162 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 200 5021.8ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:33:26,210 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 307 2.8ms
(ServeReplica:mnist_model pid=858036) INFO 2022-09-23 16:33:26,209 mnist_model mnist_model#JcKoby replica.py:505 - HANDLE __call__ OK 0.3ms
(ServeReplica:mnist_model pid=858081) INFO 2022-09-23 16:33:26,160 mnist_model mnist_model#BpvmYM replica.py:505 - HANDLE __call__ OK 5017.8ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:33:31,190 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 200 4979.0ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:33:31,237 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 307 3.2ms
(ServeReplica:mnist_model pid=858036) INFO 2022-09-23 16:33:31,236 mnist_model mnist_model#JcKoby replica.py:505 - HANDLE __call__ OK 0.3ms
(ServeReplica:mnist_model pid=858081) INFO 2022-09-23 16:33:31,189 mnist_model mnist_model#BpvmYM replica.py:505 - HANDLE __call__ OK 4975.9ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:33:36,219 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 200 4980.6ms
(ServeReplica:mnist_model pid=858081) INFO 2022-09-23 16:33:36,218 mnist_model mnist_model#BpvmYM replica.py:505 - HANDLE __call__ OK 4977.7ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:33:36,266 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 307 2.6ms
(ServeReplica:mnist_model pid=858036) INFO 2022-09-23 16:33:36,265 mnist_model mnist_model#JcKoby replica.py:505 - HANDLE __call__ OK 0.2ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:33:41,246 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 200 4979.9ms
(ServeReplica:mnist_model pid=858081) INFO 2022-09-23 16:33:41,245 mnist_model mnist_model#BpvmYM replica.py:505 - HANDLE __call__ OK 4977.0ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:33:41,293 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 307 3.1ms
(ServeReplica:mnist_model pid=858036) INFO 2022-09-23 16:33:41,292 mnist_model mnist_model#JcKoby replica.py:505 - HANDLE __call__ OK 0.3ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:33:46,274 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 200 4979.4ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:33:46,320 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 307 3.1ms
(ServeReplica:mnist_model pid=858036) INFO 2022-09-23 16:33:46,319 mnist_model mnist_model#JcKoby replica.py:505 - HANDLE __call__ OK 0.3ms
(ServeReplica:mnist_model pid=858081) INFO 2022-09-23 16:33:46,272 mnist_model mnist_model#BpvmYM replica.py:505 - HANDLE __call__ OK 4976.2ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:33:51,292 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 200 4970.2ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:33:51,340 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 307 3.2ms
(ServeReplica:mnist_model pid=858036) INFO 2022-09-23 16:33:51,339 mnist_model mnist_model#JcKoby replica.py:505 - HANDLE __call__ OK 0.3ms
(ServeReplica:mnist_model pid=858081) INFO 2022-09-23 16:33:51,290 mnist_model mnist_model#BpvmYM replica.py:505 - HANDLE __call__ OK 4966.7ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:33:56,315 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 200 4973.0ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:33:56,364 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 307 3.0ms
(ServeReplica:mnist_model pid=858036) INFO 2022-09-23 16:33:56,363 mnist_model mnist_model#JcKoby replica.py:505 - HANDLE __call__ OK 0.3ms
(ServeReplica:mnist_model pid=858081) INFO 2022-09-23 16:33:56,314 mnist_model mnist_model#BpvmYM replica.py:505 - HANDLE __call__ OK 4969.9ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:34:01,344 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 200 4978.3ms
(ServeReplica:mnist_model pid=858081) INFO 2022-09-23 16:34:01,342 mnist_model mnist_model#BpvmYM replica.py:505 - HANDLE __call__ OK 4975.1ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:34:01,390 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 307 3.2ms
(ServeReplica:mnist_model pid=858036) INFO 2022-09-23 16:34:01,389 mnist_model mnist_model#JcKoby replica.py:505 - HANDLE __call__ OK 0.3ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:34:06,367 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 200 4975.1ms
(ServeReplica:mnist_model pid=858081) INFO 2022-09-23 16:34:06,366 mnist_model mnist_model#BpvmYM replica.py:505 - HANDLE __call__ OK 4972.3ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:34:06,413 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 307 3.0ms
(ServeReplica:mnist_model pid=858036) INFO 2022-09-23 16:34:06,412 mnist_model mnist_model#JcKoby replica.py:505 - HANDLE __call__ OK 0.3ms
Map_Batches: 0%| | 0/1 [00:00<?, ?it/s](HTTPProxyActor pid=857184) INFO 2022-09-23 16:34:11,392 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 200 4977.5ms
(ServeReplica:mnist_model pid=858081) INFO 2022-09-23 16:34:11,391 mnist_model mnist_model#BpvmYM replica.py:505 - HANDLE __call__ OK 4975.0ms
Map Progress (1 actors 1 pending): 100%|██████████| 1/1 [00:00<00:00, 1.37it/s]
Read->Map_Batches: 100%|██████████| 1/1 [00:00<00:00, 5.31it/s]
Map Progress (1 actors 1 pending): 100%|██████████| 1/1 [00:00<00:00, 1.76it/s]
Starting training for task: 2
Tune Status
Current time: | 2022-09-23 16:34:33 |
Running for: | 00:00:19.45 |
Memory: | 18.4/62.7 GiB |
System Info
Using FIFO scheduling algorithm.Resources requested: 0/24 CPUs, 0/0 GPUs, 0.0/32.53 GiB heap, 0.0/16.26 GiB objects
Trial Status
Trial name | status | loc | iter | total time (s) | loss | _timestamp | _time_this_iter_s |
---|---|---|---|---|---|---|---|
TorchTrainer_3b7e3_00000 | TERMINATED | 10.109.175.190:858536 | 4 | 15.3994 | 0 | 1663976070 | 0.0710998 |
(RayTrainWorker pid=858579) 2022-09-23 16:34:19,902 INFO config.py:71 -- Setting up process group for: env:// [rank=0, world_size=1]
(RayTrainWorker pid=858579) 2022-09-23 16:34:20,191 INFO train_loop_utils.py:354 -- Moving model to device: cuda:0
(RayTrainWorker pid=858579) loss: 2.515887, epoch: 0, iteration: 0
(RayTrainWorker pid=858579) loss: 1.260738, epoch: 0, iteration: 500
(RayTrainWorker pid=858579) loss: 0.892560, epoch: 0, iteration: 1000
(RayTrainWorker pid=858579) loss: 0.497198, epoch: 0, iteration: 1500
Trial Progress
Trial name | _time_this_iter_s | _timestamp | _training_iteration | date | done | episodes_total | experiment_id | experiment_tag | hostname | iterations_since_restore | loss | node_ip | pid | should_checkpoint | time_since_restore | time_this_iter_s | time_total_s | timestamp | timesteps_since_restore | timesteps_total | training_iteration | trial_id | warmup_time |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TorchTrainer_3b7e3_00000 | 0.0710998 | 1663976070 | 4 | 2022-09-23_16-34-30 | True | c9312be01e964b958b931d1796623509 | 0 | corvus | 4 | 0 | 10.109.175.190 | 858536 | True | 15.3994 | 0.0705044 | 15.3994 | 1663976070 | 0 | 4 | 3b7e3_00000 | 0.00414133 |
2022-09-23 16:34:33,315 INFO tune.py:762 -- Total run time: 19.59 seconds (19.45 seconds for the tuning loop).
Map Progress (1 actors 1 pending): 0%| | 0/3 [00:01<?, ?it/s](BlockWorker pid=858662) 2022-09-23 16:34:34,737 WARNING torch_predictor.py:53 -- You have `use_gpu` as False but there are 1 GPUs detected on host where prediction will only use CPU. Please consider explicitly setting `TorchPredictor(use_gpu=True)` or `batch_predictor.predict(ds, num_gpus_per_worker=1)` to enable GPU prediction.
Map Progress (2 actors 1 pending): 33%|███       | 1/3 [00:02<00:04, 2.18s/it](BlockWorker pid=858688) 2022-09-23 16:34:36,116 WARNING torch_predictor.py:53 -- You have `use_gpu` as False but there are 1 GPUs detected on host where prediction will only use CPU. Please consider explicitly setting `TorchPredictor(use_gpu=True)` or `batch_predictor.predict(ds, num_gpus_per_worker=1)` to enable GPU prediction.
Map Progress (2 actors 1 pending): 100%|██████████| 3/3 [00:03<00:00, 1.25s/it]
Map_Batches: 100%|██████████| 3/3 [00:00<00:00, 10.84it/s]
Map_Batches: 100%|██████████| 3/3 [00:00<00:00, 165.80it/s]
Shuffle Map: 100%|██████████| 3/3 [00:00<00:00, 350.61it/s]
Shuffle Reduce: 100%|██████████| 1/1 [00:00<00:00, 186.97it/s]
Accuracy for task 3: 0.8439
Shuffle Map: 100%|██████████| 1/1 [00:00<00:00, 114.31it/s]
Shuffle Reduce: 100%|██████████| 1/1 [00:00<00:00, 102.29it/s]
Map_Batches: 0%| | 0/1 [00:00<?, ?it/s](BlockWorker pid=858715) 2022-09-23 16:34:37,520 WARNING torch_predictor.py:53 -- You have `use_gpu` as False but there are 1 GPUs detected on host where prediction will only use CPU. Please consider explicitly setting `TorchPredictor(use_gpu=True)` or `batch_predictor.predict(ds, num_gpus_per_worker=1)` to enable GPU prediction.
Map Progress (1 actors 1 pending): 100%|██████████| 1/1 [00:00<00:00, 1.83it/s]
(ServeController pid=857134) INFO 2022-09-23 16:34:38,052 controller 857134 deployment_state.py:1234 - Stopping 1 replicas of deployment 'mnist_model' with outdated versions.
(ServeController pid=857134) INFO 2022-09-23 16:34:40,199 controller 857134 deployment_state.py:1277 - Adding 1 replica to deployment 'mnist_model'.
(ServeReplica:mnist_model pid=858821) 2022-09-23 16:34:41,756 WARNING torch_predictor.py:53 -- You have `use_gpu` as False but there are 1 GPUs detected on host where prediction will only use CPU. Please consider explicitly setting `TorchPredictor(use_gpu=True)` or `batch_predictor.predict(ds, num_gpus_per_worker=1)` to enable GPU prediction.
(ServeController pid=857134) INFO 2022-09-23 16:34:41,943 controller 857134 deployment_state.py:1234 - Stopping 1 replicas of deployment 'mnist_model' with outdated versions.
(ServeController pid=857134) INFO 2022-09-23 16:34:44,087 controller 857134 deployment_state.py:1277 - Adding 1 replica to deployment 'mnist_model'.
(ServeReplica:mnist_model pid=858865) 2022-09-23 16:34:45,635 WARNING torch_predictor.py:53 -- You have `use_gpu` as False but there are 1 GPUs detected on host where prediction will only use CPU. Please consider explicitly setting `TorchPredictor(use_gpu=True)` or `batch_predictor.predict(ds, num_gpus_per_worker=1)` to enable GPU prediction.
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:34:46,091 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 307 3.5ms
(ServeReplica:mnist_model pid=858821) INFO 2022-09-23 16:34:46,091 mnist_model mnist_model#moUXYX replica.py:505 - HANDLE __call__ OK 0.3ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:34:51,133 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 200 5039.8ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:34:51,181 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 307 3.3ms
(ServeReplica:mnist_model pid=858821) INFO 2022-09-23 16:34:51,180 mnist_model mnist_model#moUXYX replica.py:505 - HANDLE __call__ OK 0.2ms
(ServeReplica:mnist_model pid=858865) INFO 2022-09-23 16:34:51,131 mnist_model mnist_model#UYagxG replica.py:505 - HANDLE __call__ OK 5035.4ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:34:56,160 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 200 4977.5ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:34:56,207 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 307 2.8ms
(ServeReplica:mnist_model pid=858821) INFO 2022-09-23 16:34:56,206 mnist_model mnist_model#moUXYX replica.py:505 - HANDLE __call__ OK 0.2ms
(ServeReplica:mnist_model pid=858865) INFO 2022-09-23 16:34:56,158 mnist_model mnist_model#UYagxG replica.py:505 - HANDLE __call__ OK 4974.3ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:35:01,188 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 200 4979.3ms
(ServeReplica:mnist_model pid=858865) INFO 2022-09-23 16:35:01,186 mnist_model mnist_model#UYagxG replica.py:505 - HANDLE __call__ OK 4976.3ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:35:01,237 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 307 2.9ms
(ServeReplica:mnist_model pid=858821) INFO 2022-09-23 16:35:01,236 mnist_model mnist_model#moUXYX replica.py:505 - HANDLE __call__ OK 0.3ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:35:06,210 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 200 4970.7ms
(ServeReplica:mnist_model pid=858865) INFO 2022-09-23 16:35:06,208 mnist_model mnist_model#UYagxG replica.py:505 - HANDLE __call__ OK 4967.5ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:35:06,257 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 307 3.0ms
(ServeReplica:mnist_model pid=858821) INFO 2022-09-23 16:35:06,256 mnist_model mnist_model#moUXYX replica.py:505 - HANDLE __call__ OK 0.3ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:35:11,236 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 200 4978.0ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:35:11,291 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 307 10.3ms
(ServeReplica:mnist_model pid=858821) INFO 2022-09-23 16:35:11,283 mnist_model mnist_model#moUXYX replica.py:505 - HANDLE __call__ OK 0.5ms
(ServeReplica:mnist_model pid=858865) INFO 2022-09-23 16:35:11,235 mnist_model mnist_model#UYagxG replica.py:505 - HANDLE __call__ OK 4974.9ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:35:16,259 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 200 4966.0ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:35:16,307 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 307 3.0ms
(ServeReplica:mnist_model pid=858821) INFO 2022-09-23 16:35:16,306 mnist_model mnist_model#moUXYX replica.py:505 - HANDLE __call__ OK 0.2ms
(ServeReplica:mnist_model pid=858865) INFO 2022-09-23 16:35:16,258 mnist_model mnist_model#UYagxG replica.py:505 - HANDLE __call__ OK 4962.2ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:35:21,284 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 200 4975.8ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:35:21,330 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 307 2.9ms
(ServeReplica:mnist_model pid=858821) INFO 2022-09-23 16:35:21,329 mnist_model mnist_model#moUXYX replica.py:505 - HANDLE __call__ OK 0.3ms
(ServeReplica:mnist_model pid=858865) INFO 2022-09-23 16:35:21,283 mnist_model mnist_model#UYagxG replica.py:505 - HANDLE __call__ OK 4972.7ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:35:26,312 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 200 4980.6ms
(ServeReplica:mnist_model pid=858865) INFO 2022-09-23 16:35:26,311 mnist_model mnist_model#UYagxG replica.py:505 - HANDLE __call__ OK 4977.5ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:35:26,363 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 307 2.9ms
(ServeReplica:mnist_model pid=858821) INFO 2022-09-23 16:35:26,362 mnist_model mnist_model#moUXYX replica.py:505 - HANDLE __call__ OK 0.2ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:35:31,337 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 200 4971.5ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:35:31,383 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 307 3.0ms
(ServeReplica:mnist_model pid=858821) INFO 2022-09-23 16:35:31,382 mnist_model mnist_model#moUXYX replica.py:505 - HANDLE __call__ OK 0.3ms
(ServeReplica:mnist_model pid=858865) INFO 2022-09-23 16:35:31,335 mnist_model mnist_model#UYagxG replica.py:505 - HANDLE __call__ OK 4968.3ms
(HTTPProxyActor pid=857184) INFO 2022-09-23 16:35:36,366 http_proxy 10.109.175.190 http_proxy.py:315 - POST /mnist_predict 200 4981.2ms
(ServeReplica:mnist_model pid=858865) INFO 2022-09-23 16:35:36,364 mnist_model mnist_model#UYagxG replica.py:505 - HANDLE __call__ OK 4977.9ms
(ServeController pid=857134) INFO 2022-09-23 16:35:36,511 controller 857134 deployment_state.py:1303 - Removing 2 replicas from deployment 'mnist_model'.
Now that we have finished all of our training, let's see the accuracy of our model after training on each task.
We should see the accuracy decrease over time. This is expected, since we are using just a naive fine-tuning strategy, so our model is prone to catastrophic forgetting.
As we increase the number of tasks, the model's performance on all the tasks trained on so far should decrease.
accuracy_for_all_tasks
[0.8678, 0.86465, 0.8439]
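To make this trend easier to see, we can optionally plot the per-task accuracies. This is just a small illustrative sketch; it assumes matplotlib is installed, which is not otherwise required by this example.
import matplotlib.pyplot as plt

# Plot accuracy on all tasks seen so far, after training on each successive task.
plt.plot(range(1, len(accuracy_for_all_tasks) + 1), accuracy_for_all_tasks, marker="o")
plt.xlabel("Number of tasks trained on so far")
plt.ylabel("Accuracy on all tasks seen so far")
plt.title("Naive fine-tuning accuracy per task")
plt.show()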
[Optional] Step 6: Compare against full training#
We have now incrementally trained our simple multi-layer perceptron. Let's compare this incrementally fine-tuned model against a model that is trained on all the tasks up front.
Since we are using a naive fine-tuning strategy, we should expect our incrementally trained model to perform worse than the fully trained one! However, there are various other strategies, developed and actively being researched, that improve accuracy for incremental training; one simple replay-based idea is sketched below. Overall, incremental/continual learning lets you train in many real-world settings where the entire dataset is not available up front and new data arrives at a relatively high rate.
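As a rough illustration of one such strategy, the snippet below sketches naive experience replay: keep a small random sample of each past task and mix it into the training data for every new task. This is only a sketch and not part of this example; it assumes `Dataset.random_sample` is available in your version of Ray Datasets, and `replay_fraction` is an arbitrary choice.
# Sketch of a simple replay strategy (not used elsewhere in this example).
replay_fraction = 0.1  # assumption: retain ~10% of each past task for replay
replay_datasets = []

for train_dataset in permuted_mnist.generate_train_stream():
    # Sample the current task *before* mixing in replay data from earlier tasks.
    replay_sample = train_dataset.random_sample(replay_fraction)
    if replay_datasets:
        # Union the retained samples of all previous tasks into the current task's data.
        train_dataset = train_dataset.union(*replay_datasets)
    # ... train on `train_dataset` exactly as in the incremental training loop earlier ...
    replay_datasets.append(replay_sample)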
Let's first combine the datasets for all of the tasks into a single, unified Dataset.
train_stream = permuted_mnist.generate_train_stream()
# Collect all datasets in the stream into a single dataset.
all_training_datasets = []
for train_dataset in train_stream:
all_training_datasets.append(train_dataset)
combined_training_dataset = all_training_datasets[0].union(*all_training_datasets[1:])
combined_training_dataset = combined_training_dataset.random_shuffle()
Map Progress (1 actors 1 pending): 100%|██████████| 1/1 [00:00<00:00, 1.37it/s]
Map Progress (1 actors 1 pending): 100%|██████████| 1/1 [00:00<00:00, 1.37it/s]
Map Progress (1 actors 1 pending): 100%|██████████| 1/1 [00:00<00:00, 1.40it/s]
Shuffle Map: 100%|██████████| 3/3 [00:00<00:00, 40.34it/s]
Shuffle Reduce: 100%|██████████| 3/3 [00:00<00:00, 28.99it/s]
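Before training on the unified Dataset, we can optionally sanity-check that it contains the rows from every task. This is just a quick illustrative check using Dataset.count().
# Optional sanity check: the combined dataset should hold the examples from all tasks.
print("Rows in combined training dataset:", combined_training_dataset.count())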
Then, we train a new model on the unified Dataset using the same configurations as before.
# Now we do training with the same configurations as before
trainer = TorchTrainer(
train_loop_per_worker=train_loop_per_worker,
train_loop_config={
"num_epochs": num_epochs,
"learning_rate": learning_rate,
"momentum": momentum,
"batch_size": batch_size,
},
# Have to specify trainer_resources as 0 so that the example works on Colab.
scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu, trainer_resources={"CPU": 0}),
datasets={"train": combined_training_dataset},
preprocessor=mnist_normalize_preprocessor,
)
result = trainer.fit()
full_training_checkpoint = result.checkpoint
Tune Status
Current time: | 2022-09-23 16:37:13 |
Running for: | 00:00:25.97 |
Memory: | 19.4/62.7 GiB |
System Info
Using FIFO scheduling algorithm. Resources requested: 0/24 CPUs, 0/0 GPUs, 0.0/32.53 GiB heap, 0.0/16.26 GiB objects
Trial Status
Trial name | status | loc | iter | total time (s) | loss | _timestamp | _time_this_iter_s |
---|---|---|---|---|---|---|---|
TorchTrainer_971af_00000 | TERMINATED | 10.109.175.190:860035 | 4 | 22.1282 | 0 | 1663976231 | 0.0924587 |
(RayTrainWorker pid=860154) 2022-09-23 16:36:55,188 INFO config.py:71 -- Setting up process group for: env:// [rank=0, world_size=1]
(RayTrainWorker pid=860154) 2022-09-23 16:36:55,399 INFO train_loop_utils.py:354 -- Moving model to device: cuda:0
(RayTrainWorker pid=860154) loss: 2.301066, epoch: 0, iteration: 0
(RayTrainWorker pid=860154) loss: 1.869080, epoch: 0, iteration: 500
(RayTrainWorker pid=860154) loss: 1.489264, epoch: 0, iteration: 1000
(RayTrainWorker pid=860154) loss: 1.646756, epoch: 0, iteration: 1500
(RayTrainWorker pid=860154) loss: 1.582330, epoch: 0, iteration: 2000
(RayTrainWorker pid=860154) loss: 1.246018, epoch: 0, iteration: 2500
(RayTrainWorker pid=860154) loss: 1.035204, epoch: 0, iteration: 3000
(RayTrainWorker pid=860154) loss: 0.872962, epoch: 0, iteration: 3500
(RayTrainWorker pid=860154) loss: 1.138829, epoch: 0, iteration: 4000
(RayTrainWorker pid=860154) loss: 0.753354, epoch: 0, iteration: 4500
(RayTrainWorker pid=860154) loss: 0.991935, epoch: 0, iteration: 5000
(RayTrainWorker pid=860154) loss: 0.928292, epoch: 0, iteration: 5500
Trial Progress
Trial name | _time_this_iter_s | _timestamp | _training_iteration | date | done | episodes_total | experiment_id | experiment_tag | hostname | iterations_since_restore | loss | node_ip | pid | should_checkpoint | time_since_restore | time_this_iter_s | time_total_s | timestamp | timesteps_since_restore | timesteps_total | training_iteration | trial_id | warmup_time |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TorchTrainer_971af_00000 | 0.0924587 | 1663976231 | 4 | 2022-09-23_16-37-11 | True | | 26d685b2612a4752b7d062d1ebfb89f0 | 0 | corvus | 4 | 0 | 10.109.175.190 | 860035 | True | 22.1282 | 0.0941384 | 22.1282 | 1663976231 | 0 | | 4 | 971af_00000 | 0.0034101 |
2022-09-23 16:37:13,525 INFO tune.py:762 -- Total run time: 26.08 seconds (25.96 seconds for the tuning loop).
Then, let's test the model that was trained on all the tasks up front.
# Then, we use the fully trained model to do batch prediction on the entire test set.
# `full_test_dataset` should already contain the combined test datasets.
fully_trained_accuracy = batch_predict(full_training_checkpoint, full_test_dataset)
Map Progress (1 actors 1 pending): 0%| | 0/3 [00:01<?, ?it/s](BlockWorker pid=860261) 2022-09-23 16:37:15,152 WARNING torch_predictor.py:53 -- You have `use_gpu` as False but there are 1 GPUs detected on host where prediction will only use CPU. Please consider explicitly setting `TorchPredictor(use_gpu=True)` or `batch_predictor.predict(ds, num_gpus_per_worker=1)` to enable GPU prediction.
Map Progress (2 actors 1 pending):  33%|███▎      | 1/3 [00:03<00:04, 2.45s/it](BlockWorker pid=860289) 2022-09-23 16:37:16,696 WARNING torch_predictor.py:53 -- You have `use_gpu` as False but there are 1 GPUs detected on host where prediction will only use CPU. Please consider explicitly setting `TorchPredictor(use_gpu=True)` or `batch_predictor.predict(ds, num_gpus_per_worker=1)` to enable GPU prediction.
Map Progress (2 actors 1 pending): 100%|██████████| 3/3 [00:04<00:00, 1.37s/it]
Map_Batches: 100%|██████████| 3/3 [00:00<00:00, 74.29it/s]
Map_Batches: 100%|██████████| 3/3 [00:00<00:00, 134.64it/s]
Shuffle Map: 100%|██████████| 3/3 [00:00<00:00, 304.26it/s]
Shuffle Reduce: 100%|██████████| 1/1 [00:00<00:00, 108.41it/s]
Finally, let's compare the accuracies of the incrementally trained model and the fully trained model. We should see that the fully trained model performs better.
print("Fully trained model accuracy: ", fully_trained_accuracy)
print("Incrementally trained model accuracy: ", accuracy_for_all_tasks[-1])
Fully trained model accuracy: 0.8888666666666667
Incrementally trained model accuracy: 0.8439
(BlockWorker pid=860324) 2022-09-23 16:37:18,256 WARNING torch_predictor.py:53 -- You have `use_gpu` as False but there are 1 GPUs detected on host where prediction will only use CPU. Please consider explicitly setting `TorchPredictor(use_gpu=True)` or `batch_predictor.predict(ds, num_gpus_per_worker=1)` to enable GPU prediction.
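As a quick follow-up, we can quantify the gap between the two models directly from the variables above; this small snippet is just for illustration.
# Difference between the fully trained model and the final incrementally trained model.
accuracy_gap = fully_trained_accuracy - accuracy_for_all_tasks[-1]
print(f"Accuracy gap from naive incremental fine-tuning: {accuracy_gap:.4f}")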
Next Steps#
Once you've completed this notebook, you should be set to experiment with scalable incremental training using Ray, either by trying more sophisticated incremental learning algorithms than naive fine-tuning, or by scaling out to larger datasets!
If you run into any issues or have any feature requests, please file an issue on the Ray GitHub.