RaySGD Hyperparameter Tuning

RaySGD integrates with Ray Tune to easily run distributed hyperparameter tuning experiments with your RaySGD Trainer.

PyTorch

Tip

If you want to leverage multi-node data parallel training with PyTorch and Ray Tune without using RaySGD, check out the Tune PyTorch user guide and Tune's lightweight distributed PyTorch integrations.

TorchTrainer integrates naturally with Tune via the BaseTorchTrainable interface. Without changing any arguments, you can call TorchTrainer.as_trainable(...) to create a Tune-compatible class, and then pass the returned Trainable class to tune.run. The config used for each Trainable in Tune is automatically passed down to the TorchTrainer. As a result, each trial gets its own TorchTrainable holding a TorchTrainer instance with that trial's unique hyperparameter configuration. See the documentation (BaseTorchTrainable) for more info.

from ray import tune
from ray.util.sgd import TorchTrainer
from ray.util.sgd.utils import BATCH_SIZE


def tune_example(operator_cls, num_workers=1, use_gpu=False):
    TorchTrainable = TorchTrainer.as_trainable(
        training_operator_cls=operator_cls,
        num_workers=num_workers,
        use_gpu=use_gpu,
        config={BATCH_SIZE: 128}
    )

    analysis = tune.run(
        TorchTrainable,
        num_samples=3,
        config={"lr": tune.grid_search([1e-4, 1e-3])},
        stop={"training_iteration": 2},
        verbose=1)

    return analysis.get_best_config(metric="val_loss", mode="min")
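
Since the config for each trial is passed down to the TorchTrainer, the operator class you supply as operator_cls can read the sampled hyperparameters inside its setup method. Below is a minimal sketch of such an operator, assuming the self.register / self.register_data TrainingOperator API and a made-up toy dataset; the point is only how config["lr"] and config[BATCH_SIZE] are consumed.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

from ray.util.sgd.torch import TrainingOperator
from ray.util.sgd.utils import BATCH_SIZE


class MyTrainingOperator(TrainingOperator):
    def setup(self, config):
        # "lr" comes from each Tune trial's config; BATCH_SIZE was set in as_trainable.
        model = nn.Linear(1, 1)
        optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
        criterion = nn.MSELoss()

        # Hypothetical toy dataset, purely for illustration.
        data = TensorDataset(torch.randn(1024, 1), torch.randn(1024, 1))
        train_loader = DataLoader(data, batch_size=config[BATCH_SIZE])
        val_loader = DataLoader(data, batch_size=config[BATCH_SIZE])

        self.register(models=model, optimizers=optimizer, criterion=criterion)
        self.register_data(train_loader=train_loader, validation_loader=val_loader)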

By default, the training step for the returned Trainable runs one epoch of training and one epoch of validation, and reports the combined result dictionaries to Tune.
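
Conceptually, that default step behaves roughly like the sketch below (an illustration of the behavior, not the actual implementation):

def default_step(trainer, info):
    # One epoch of training and one epoch of validation; the merged stats
    # become the result dict that Tune receives for this iteration.
    train_stats = trainer.train()
    validation_stats = trainer.validate()
    return {**train_stats, **validation_stats}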

When you combine RaySGD with Tune, each individual trial runs its training in a distributed fashion across num_workers workers, and multiple trials can also run in parallel.
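
For example, assuming the MyTrainingOperator sketched above and an existing Ray cluster, the helper defined earlier could be launched as follows; with the grid of two learning rates and num_samples=3, Tune creates six trials, each doing 2-worker distributed training, and runs as many of them concurrently as resources allow.

import ray

# Illustrative: connect to a running cluster (omit the address to start Ray locally).
ray.init(address="auto")

# 3 samples x 2 grid points = 6 trials, each training across 2 workers.
best_config = tune_example(MyTrainingOperator, num_workers=2, use_gpu=False)
print(best_config)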

Custom Training Step

Sometimes it is necessary to provide a custom training step, for example if you want to run more than one epoch of training per Tune iteration, or if you need to manually update the scheduler after validation. Custom training steps can easily be provided by passing an override_tune_step function to TorchTrainer.as_trainable(...).

from ray.tune.utils import merge_dicts


def tune_example_manual(operator_cls, num_workers=1, use_gpu=False):
    def step(trainer, info: dict):
        """Define a custom training loop for tune.
         This is needed because we want to manually update our scheduler.
         """
        train_stats = trainer.train(profile=True)
        validation_stats = trainer.validate(profile=True)
        # Manually update our scheduler with the given metric.
        trainer.update_scheduler(metric=validation_stats["val_loss"])
        all_stats = merge_dicts(train_stats, validation_stats)
        return all_stats

    TorchTrainable = TorchTrainer.as_trainable(
        override_tune_step=step,
        training_operator_cls=operator_cls,
        num_workers=num_workers,
        use_gpu=use_gpu,
        scheduler_step_freq="manual",
        config={BATCH_SIZE: 128}
    )

    analysis = tune.run(
        TorchTrainable,
        num_samples=3,
        config={"lr": tune.grid_search([1e-4, 1e-3])},
        stop={"training_iteration": 2},
        verbose=1)

    return analysis.get_best_config(metric="val_loss", mode="min")

Your custom step function should take in two arguments: an instance of the TorchTrainer and an info dict containing other potentially necessary information.

The info dict contains the following values:

# The current Tune iteration.
# This may differ from the number of epochs trained if each Tune step
# runs more than one epoch of training.
iteration
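
For instance, a hypothetical step passed via override_tune_step that trains for two epochs per Tune iteration could use info["iteration"] purely for logging (merge_dicts as imported in the example above):

def two_epoch_step(trainer, info: dict):
    # info["iteration"] counts Tune steps; here two epochs are trained per step.
    print("Starting Tune iteration:", info["iteration"])
    trainer.train(profile=True)
    train_stats = trainer.train(profile=True)
    validation_stats = trainer.validate(profile=True)
    return merge_dicts(train_stats, validation_stats)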

If you would like any other information to be available in the info dict, please file a feature request on GitHub Issues!

You can see the Tune example script for an end-to-end example.