RaySGD Hyperparameter Tuning
RaySGD integrates with Ray Tune to easily run distributed hyperparameter tuning experiments with your RaySGD Trainer.
PyTorch
Tip
If you want to use multi-node data-parallel training with PyTorch and Ray Tune without RaySGD, check out the Tune PyTorch user guide and Tune’s lightweight distributed PyTorch integrations.
TorchTrainer naturally integrates with Tune via the BaseTorchTrainable interface. Without changing any arguments, you can call TorchTrainer.as_trainable(...) to create a Tune-compatible class.
Then, you can simply pass the returned Trainable class to tune.run. The config used for each Trainable in Tune will automatically be passed down to the TorchTrainer.
Therefore, each trial will have its own TorchTrainable that holds an instance of the TorchTrainer with its own unique hyperparameter configuration.
See the documentation (BaseTorchTrainable) for more info.
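from ray import tune
from ray.util.sgd.torch import TorchTrainer
from ray.util.sgd.utils import BATCH_SIZE


def tune_example(operator_cls, num_workers=1, use_gpu=False):
    # Wrap the TorchTrainer arguments in a Tune-compatible Trainable class.
    TorchTrainable = TorchTrainer.as_trainable(
        training_operator_cls=operator_cls,
        num_workers=num_workers,
        use_gpu=use_gpu,
        config={BATCH_SIZE: 128}
    )

    # Each trial receives one "lr" value from the grid in its config.
    analysis = tune.run(
        TorchTrainable,
        num_samples=3,
        config={"lr": tune.grid_search([1e-4, 1e-3])},
        stop={"training_iteration": 2},
        verbose=1)

    return analysis.get_best_config(metric="val_loss", mode="min")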
By default, the training step for the returned Trainable will run one epoch of training and one epoch of validation, and will report the combined result dictionaries to Tune.
By combining RaySGD with Tune, each individual trial runs in a distributed fashion with num_workers workers, and multiple trials can also run in parallel.
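To get a feel for how many trials an experiment produces and how many of them can run side by side, the following back-of-the-envelope sketch may help. It does not call Ray; expand_grid and max_concurrent_trials are hypothetical helpers written for this illustration. It assumes that the grid is repeated num_samples times and that each worker reserves one CPU, and it ignores any resources used by the trial driver itself.

```python
from itertools import product


def expand_grid(grid, num_samples=1):
    """List one config per trial: every grid combination, repeated num_samples times."""
    keys = sorted(grid)
    combos = [dict(zip(keys, values))
              for values in product(*(grid[k] for k in keys))]
    return [dict(combo) for _ in range(num_samples) for combo in combos]


def max_concurrent_trials(cluster_cpus, num_workers, cpus_per_worker=1):
    """Upper bound on trials that fit at once (assumes one CPU per worker)."""
    return cluster_cpus // (num_workers * cpus_per_worker)


# Two learning rates repeated three times give six trials.
trials = expand_grid({"lr": [1e-4, 1e-3]}, num_samples=3)
print(len(trials))

# With 16 CPUs and 4 workers per trial, at most four trials fit at once.
print(max_concurrent_trials(cluster_cpus=16, num_workers=4))
```

Under these assumptions, raising num_workers speeds up each trial but lowers the number of trials that can run concurrently on a fixed-size cluster.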
Custom Training Step
Sometimes it is necessary to provide a custom training step, for example if you want to run more than one epoch of training for each Tune iteration, or you need to manually update the scheduler after validation. Custom training steps can be provided by passing an override_tune_step function to TorchTrainer.as_trainable(...).
from ray import tune
from ray.tune.utils import merge_dicts
from ray.util.sgd.torch import TorchTrainer
from ray.util.sgd.utils import BATCH_SIZE


def tune_example_manual(operator_cls, num_workers=1, use_gpu=False):
    def step(trainer, info: dict):
        """Define a custom training loop for tune.

        This is needed because we want to manually update our scheduler.
        """
        train_stats = trainer.train(profile=True)
        validation_stats = trainer.validate(profile=True)
        # Manually update our scheduler with the given metric.
        trainer.update_scheduler(metric=validation_stats["val_loss"])
        all_stats = merge_dicts(train_stats, validation_stats)
        return all_stats

    # scheduler_step_freq="manual" leaves scheduler updates to the step above.
    TorchTrainable = TorchTrainer.as_trainable(
        override_tune_step=step,
        training_operator_cls=operator_cls,
        num_workers=num_workers,
        use_gpu=use_gpu,
        scheduler_step_freq="manual",
        config={BATCH_SIZE: 128}
    )

    analysis = tune.run(
        TorchTrainable,
        num_samples=3,
        config={"lr": tune.grid_search([1e-4, 1e-3])},
        stop={"training_iteration": 2},
        verbose=1)

    return analysis.get_best_config(metric="val_loss", mode="min")
Your custom step function should take two arguments: an instance of the TorchTrainer and an info dict containing other potentially necessary information.
The info dict contains the following values:
# The current Tune iteration.
# This may be different than the number of epochs trained if each tune step does more than one epoch of training.
iteration
If you would like any other information to be available in the info dict, please file a feature request on GitHub Issues!
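As a rough illustration of why the Tune iteration and the number of epochs trained can diverge, here is a minimal sketch of a step function that runs several epochs per Tune iteration. FakeTrainer is a hypothetical stand-in that only mimics train() and validate() so the snippet runs without a cluster. EPOCHS_PER_STEP and the reported keys tune_iteration and epochs_trained are names invented for this example, and the starting value of iteration used in the driver loop is an assumption.

```python
class FakeTrainer:
    """Hypothetical stand-in for a TorchTrainer; not part of RaySGD."""

    def __init__(self):
        self.epochs_trained = 0

    def train(self):
        # Pretend to train for one epoch and return some statistics.
        self.epochs_trained += 1
        return {"train_loss": 1.0 / self.epochs_trained}

    def validate(self):
        # Pretend to run one validation pass.
        return {"val_loss": 1.5 / self.epochs_trained}


EPOCHS_PER_STEP = 3  # hypothetical setting for this sketch


def step(trainer, info):
    """Run several training epochs, then validate once, per Tune iteration."""
    for _ in range(EPOCHS_PER_STEP):
        train_stats = trainer.train()
    validation_stats = trainer.validate()
    stats = {**train_stats, **validation_stats}
    # Report both counters so the difference is visible in the results.
    stats["tune_iteration"] = info["iteration"]
    stats["epochs_trained"] = trainer.epochs_trained
    return stats


# Simulate two Tune iterations (the starting index is an assumption).
trainer = FakeTrainer()
results = [step(trainer, {"iteration": i}) for i in (0, 1)]
for result in results:
    print(result["tune_iteration"], result["epochs_trained"])
```

With a real TorchTrainer, a function shaped like step would be passed as override_tune_step to TorchTrainer.as_trainable(...), as in the example above.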
You can see the Tune example script for an end-to-end example.