Configuring Hyperparameter Tuning#

The Ray AIR Tuner is the recommended way to tune hyperparameters in Ray AIR.

Figure: The Tuner takes in a Trainer and executes multiple training runs, each with a different hyperparameter configuration.

As part of Ray Tune, the Tuner provides an interface that works with AIR Trainers to perform distributed hyperparameter tuning. It provides a variety of state-of-the-art hyperparameter tuning algorithms for optimizing model performance.

The rest of this guide covers what a Tuner is and how to use it in basic examples. For more detail, see the Ray Tune documentation.

Key Concepts#

There are a number of key concepts that dictate proper use of a Tuner:

  • A set of hyperparameters you want to tune in a search space.

  • A search algorithm to optimize those hyperparameters effectively and, optionally, a scheduler to stop unpromising trials early and speed up your experiments.

  • The search space, search algorithm, scheduler, and Trainer are passed to a Tuner, which runs the hyperparameter tuning workload by evaluating multiple hyperparameter configurations in parallel.

  • Each individual hyperparameter evaluation run is called a trial.

  • The Tuner returns its results in a ResultGrid.
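
To make these terms concrete, here is a minimal pure-Python sketch of the same workflow. It is not Ray's implementation: plain random search stands in for a real search algorithm, the trials run sequentially, and the names `toy_objective`, `sample_config`, and `run_trials` are hypothetical.

```python
import random

# Search space: each hyperparameter maps to a function that draws one value.
search_space = {
    "max_depth": lambda rng: rng.choice([4, 5, 6]),
    "learning_rate": lambda rng: rng.uniform(0.01, 0.3),
}


def sample_config(space, rng):
    """Draw one hyperparameter configuration from the search space."""
    return {name: draw(rng) for name, draw in space.items()}


def toy_objective(config):
    """Hypothetical stand-in for a training run; returns a loss to minimize."""
    return (config["max_depth"] - 5) ** 2 + abs(config["learning_rate"] - 0.1)


def run_trials(space, num_samples, seed=0):
    """Run one trial per sampled configuration and collect the results."""
    rng = random.Random(seed)
    results = []
    for trial_id in range(num_samples):
        config = sample_config(space, rng)
        loss = toy_objective(config)
        results.append({"trial_id": trial_id, "config": config, "loss": loss})
    return results


# The list of per-trial records plays the role of the ResultGrid.
results = run_trials(search_space, num_samples=8)
best = min(results, key=lambda r: r["loss"])
print("Best trial:", best["trial_id"], best["config"])
```

A real Tuner adds what this sketch leaves out: parallel execution across a cluster, smarter search algorithms, and early stopping through schedulers.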

Note

Tuners can also be used to launch hyperparameter tuning without using Ray AIR Trainers. See the Ray Tune documentation for more guides and examples.

Basic usage#

Below, we demonstrate how you can use a Trainer object with a Tuner.

import ray
from ray import tune
from ray.tune import Tuner
from ray.train.xgboost import XGBoostTrainer

dataset = ray.data.read_csv("s3://[email protected]/breast_cancer.csv")

trainer = XGBoostTrainer(
    label_column="target",
    params={
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
        "max_depth": 4,
    },
    datasets={"train": dataset},
)

# Create Tuner
tuner = Tuner(
    trainer,
    # Add some parameters to tune
    param_space={"params": {"max_depth": tune.choice([4, 5, 6])}},
    # Specify tuning behavior
    tune_config=tune.TuneConfig(metric="train-logloss", mode="min", num_samples=2),
)
# Run tuning job
tuner.fit()

How to configure a search space?#

A Tuner takes in a param_space argument where you can define the search space from which hyperparameter configurations will be sampled.

Depending on the model and dataset, you may want to tune:

  • The training batch size

  • The learning rate for deep learning training (e.g., image classification)

  • The maximum depth for tree-based models (e.g., XGBoost)

The following examples show how to specify the param_space, first for an XGBoostTrainer and then for a TorchTrainer.

import ray
from ray import tune
from ray.tune import Tuner
from ray.train.xgboost import XGBoostTrainer
from ray.air.config import ScalingConfig, RunConfig

dataset = ray.data.read_csv("s3://[email protected]/breast_cancer.csv")

# Create an XGBoost trainer
trainer = XGBoostTrainer(
    label_column="target",
    params={
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
        "max_depth": 4,
    },
    num_boost_round=10,
    datasets={"train": dataset},
)

param_space = {
    # Tune parameters directly passed into the XGBoostTrainer
    "num_boost_round": tune.randint(5, 20),
    # `params` will be merged with the `params` defined in the above XGBoostTrainer
    "params": {
        "min_child_weight": tune.uniform(0.8, 1.0),
        # Below will overwrite the XGBoostTrainer setting
        "max_depth": tune.randint(1, 5),
    },
    # Tune the number of distributed workers
    "scaling_config": ScalingConfig(num_workers=tune.grid_search([1, 2])),
}

tuner = Tuner(
    trainable=trainer,
    run_config=RunConfig(name="test_tuner"),
    param_space=param_space,
    tune_config=tune.TuneConfig(
        mode="min", metric="train-logloss", num_samples=2, max_concurrent_trials=2
    ),
)
result_grid = tuner.fit()

The same approach works with a TorchTrainer:

from ray import tune
from ray.air.config import RunConfig, ScalingConfig
from ray.tune import Tuner
from ray.train.examples.pytorch.torch_linear_example import (
    train_func as linear_train_func,
)
from ray.train.torch import TorchTrainer

trainer = TorchTrainer(
    train_loop_per_worker=linear_train_func,
    train_loop_config={"lr": 1e-2, "batch_size": 4, "epochs": 10},
    scaling_config=ScalingConfig(num_workers=1, use_gpu=False),
)

param_space = {
    # The params will be merged with the ones defined in the TorchTrainer
    "train_loop_config": {
        # This is a parameter that hasn't been set in the TorchTrainer
        "hidden_size": tune.randint(1, 4),
        # This will overwrite whatever was set when TorchTrainer was instantiated
        "batch_size": tune.choice([4, 8]),
    },
    # Tune the number of distributed workers
    "scaling_config": ScalingConfig(num_workers=tune.grid_search([1, 2])),
}

tuner = Tuner(
    trainable=trainer,
    run_config=RunConfig(name="test_tuner", local_dir="~/ray_results"),
    param_space=param_space,
    tune_config=tune.TuneConfig(
        mode="min", metric="loss", num_samples=2, max_concurrent_trials=2
    ),
)
result_grid = tuner.fit()

Read more about Tune search spaces in the Ray Tune documentation.
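
One consequence of mixing `tune.grid_search` with sampled parameters is worth spelling out: in Ray Tune, `num_samples` repeats the whole grid, so each example above launches 2 × 2 = 4 trials. The sketch below mimics that expansion in plain Python. The function `expand_trials` and its dictionary-based encoding of the space are hypothetical and are not Ray's internals.

```python
import itertools
import random


def expand_trials(grid_axes, sampled_axes, num_samples, seed=0):
    """Return one config per (repetition, grid point) pair.

    grid_axes: dict mapping a name to the list of values to try exhaustively.
    sampled_axes: dict mapping a name to a function rng -> value.
    """
    rng = random.Random(seed)
    names = list(grid_axes)
    configs = []
    for _ in range(num_samples):
        # Every repetition walks the full Cartesian product of the grid axes.
        for values in itertools.product(*(grid_axes[n] for n in names)):
            config = dict(zip(names, values))
            # Non-grid parameters are drawn fresh for each trial.
            for name, draw in sampled_axes.items():
                config[name] = draw(rng)
            configs.append(config)
    return configs


configs = expand_trials(
    grid_axes={"num_workers": [1, 2]},
    sampled_axes={"batch_size": lambda rng: rng.choice([4, 8])},
    num_samples=2,
)
print(len(configs))  # 4 trials: 2 grid values x 2 repetitions
```

Keep this multiplication in mind when sizing experiments, because adding a second grid axis multiplies the trial count again.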

You can use a Tuner to tune most arguments and configurations in Ray AIR, including but not limited to:

  • Ray Datasets

  • Preprocessors

  • Scaling configurations

  • and other hyperparameters.

There are a couple of gotchas about parameter specification when using Tuners with Trainers:

  • By default, configuration dictionaries and config objects will be deep-merged.

  • Parameters that are duplicated in the Trainer and Tuner will be overwritten by the Tuner param_space.

  • Exception: all arguments of the RunConfig and TuneConfig are inherently un-tunable.

See Getting Data in and out of Tune for an example.
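
The merge rules above can be sketched in plain Python. This illustrates the semantics only, under the assumption that nested dictionaries are merged key by key and that the Tuner's value wins on conflict. `deep_merge` is a hypothetical helper, not the function Ray uses internally.

```python
import copy


def deep_merge(base, override):
    """Return a new dict where `override` is merged into `base` recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge them instead of replacing wholesale.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Otherwise the override (the Tuner's value) wins.
            merged[key] = copy.deepcopy(value)
    return merged


# Values set when the Trainer was created.
trainer_params = {
    "num_boost_round": 10,
    "params": {"objective": "binary:logistic", "max_depth": 4},
}
# One concrete configuration sampled from the Tuner's param_space.
sampled_params = {
    "num_boost_round": 12,
    "params": {"min_child_weight": 0.9, "max_depth": 3},
}

merged = deep_merge(trainer_params, sampled_params)
print(merged)
```

Here `objective` survives from the Trainer, `min_child_weight` is added by the Tuner, and `max_depth` and `num_boost_round` take the Tuner's values.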

How to configure a Tuner?#

There are two main configuration objects that can be passed into a Tuner: the TuneConfig and the RunConfig.

The TuneConfig contains tuning-specific settings, including:

  • the tuning algorithm to use

  • the metric and mode to rank results

  • the amount of parallelism to use

Here are some common configurations for TuneConfig:

from ray.tune import TuneConfig
from ray.tune.search.bayesopt import BayesOptSearch

tune_config = TuneConfig(
    metric="loss",
    mode="min",
    max_concurrent_trials=10,
    num_samples=100,
    search_alg=BayesOptSearch(),
)

See the TuneConfig API reference for more details.

The RunConfig contains settings that are more general than the tuning-specific ones, including:

  • failure/retry configurations

  • verbosity levels

  • the name of the experiment

  • the logging directory

  • checkpoint configurations

  • custom callbacks

  • integration with cloud storage

Below we showcase some common configurations of RunConfig.

from ray import air, tune
from ray.air.config import RunConfig

run_config = RunConfig(
    name="MyExperiment",
    local_dir="./your_log_directory/",
    verbose=2,
    sync_config=tune.SyncConfig(upload_dir="s3://..."),
    checkpoint_config=air.CheckpointConfig(checkpoint_frequency=2),
)

See the RunConfig API reference for more details.

How to specify parallelism?#

You can specify parallelism via the TuneConfig by setting the following flags:

  • num_samples which specifies the number of trials to run in total

  • max_concurrent_trials which specifies the max number of trials to run concurrently

Note that actual parallelism can be less than max_concurrent_trials, because it is determined by how many trials fit in the cluster at once. For example, if each trial requires 16 GPUs, your cluster has 32 GPUs, and max_concurrent_trials=10, the Tuner can only run 2 trials concurrently.

from ray.tune import TuneConfig

config = TuneConfig(
    # ...
    num_samples=100,
    max_concurrent_trials=10,
)

Read more about this in the A Guide To Parallelism and Resources for Ray Tune section.
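
The GPU example above reduces to simple arithmetic. The sketch below is a simplified model that considers only one resource type (GPUs) and ignores CPUs, memory, and placement constraints. `effective_concurrency` is a hypothetical helper, not part of Ray.

```python
def effective_concurrency(num_samples, max_concurrent_trials, cluster_gpus, gpus_per_trial):
    """Upper bound on trials running at once under a GPU-only resource model."""
    # How many trials physically fit in the cluster at the same time.
    fits_in_cluster = cluster_gpus // gpus_per_trial
    # The tightest of the three limits determines the actual parallelism.
    return min(num_samples, max_concurrent_trials, fits_in_cluster)


concurrency = effective_concurrency(
    num_samples=100,
    max_concurrent_trials=10,
    cluster_gpus=32,
    gpus_per_trial=16,
)
print(concurrency)  # 2
```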

How to specify an optimization algorithm?#

You can specify your hyperparameter optimization method via the TuneConfig by setting the following flags:

  • search_alg which provides an optimizer for selecting the optimal hyperparameters

  • scheduler which provides a scheduling/resource allocation algorithm for accelerating the search process

from ray.tune.search.bayesopt import BayesOptSearch
from ray.tune.schedulers import HyperBandScheduler
from ray.tune import TuneConfig

config = TuneConfig(
    # ...
    search_alg=BayesOptSearch(),
    scheduler=HyperBandScheduler(),
)

Read more about this in the Search Algorithm and Scheduler section.
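
To give a feel for what a scheduler does, the sketch below implements successive halving, the building block behind HyperBand-style schedulers, in plain Python. It is a simplified illustration and not Ray's `HyperBandScheduler`. The names `toy_loss` and `successive_halving` are hypothetical, and the loss function is a made-up stand-in for partial training.

```python
def toy_loss(config, budget):
    """Hypothetical training curve: loss shrinks as the budget grows."""
    return abs(config["lr"] - 0.1) + 1.0 / budget


def successive_halving(configs, min_budget=1, reduction_factor=2):
    """Repeatedly evaluate survivors with more budget and keep the best fraction."""
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        # Evaluate each surviving config at the current budget, best first.
        scored = sorted(survivors, key=lambda c: toy_loss(c, budget))
        # Keep the top 1/reduction_factor (at least one); the rest stop early.
        keep = max(1, len(scored) // reduction_factor)
        survivors = scored[:keep]
        # Survivors earn a larger budget in the next round.
        budget *= reduction_factor
    return survivors[0]


candidates = [{"lr": lr} for lr in (0.001, 0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 1.0)]
winner = successive_halving(candidates)
print(winner)  # {'lr': 0.1}
```

The search algorithm decides which configurations to try, while the scheduler decides how much compute each one receives before it is stopped.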

How to analyze results?#

Tuner.fit() generates a ResultGrid object. This object contains metrics, results, and checkpoints of each trial. Below is a simple example:

from ray.tune import Tuner, TuneConfig

# `trainer` and `param_space` are defined as in the TorchTrainer example above.

tuner = Tuner(
    trainable=trainer,
    param_space=param_space,
    tune_config=TuneConfig(mode="min", metric="loss", num_samples=5),
)
result_grid = tuner.fit()

num_results = len(result_grid)

# Check if there have been errors
if result_grid.errors:
    print("At least one trial failed.")

# Get the best result
best_result = result_grid.get_best_result()

# And the best checkpoint
best_checkpoint = best_result.checkpoint

# And the best metrics
best_metric = best_result.metrics

# Or a dataframe for further analysis
results_df = result_grid.get_dataframe()
print("Shortest training time:", results_df["time_total_s"].min())

# Iterate over results
for result in result_grid:
    if result.error:
        print("The trial had an error:", result.error)
        continue

    print("The trial finished successfully with the metrics:", result.metrics["loss"])

See Analyzing Tune Experiment Results for more usage examples.
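
Conceptually, picking the best result means filtering out failed trials and then taking the minimum or maximum of the chosen metric. The pure-Python sketch below shows that idea on hand-written records. `pick_best` and the record layout are hypothetical, and the exact filtering that `ResultGrid.get_best_result()` applies may differ.

```python
import math


def pick_best(trials, metric, mode):
    """Return the successful trial with the best `metric` under `mode`."""
    if mode not in ("min", "max"):
        raise ValueError("mode must be 'min' or 'max'")
    # Ignore errored trials and trials without a finite value for the metric.
    usable = [
        t
        for t in trials
        if t["error"] is None
        and metric in t["metrics"]
        and math.isfinite(t["metrics"][metric])
    ]
    if not usable:
        raise RuntimeError("no successful trial reported the metric")
    chooser = min if mode == "min" else max
    return chooser(usable, key=lambda t: t["metrics"][metric])


trials = [
    {"metrics": {"loss": 0.42}, "error": None},
    {"metrics": {}, "error": "worker crashed"},
    {"metrics": {"loss": 0.17}, "error": None},
    {"metrics": {"loss": float("nan")}, "error": None},
]
best = pick_best(trials, metric="loss", mode="min")
print(best["metrics"]["loss"])  # 0.17
```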

Advanced Tuning#

Tuners also offer the ability to tune different data preprocessing steps, as shown in the following snippet.

from ray import tune
from ray.data.preprocessors import StandardScaler
from ray.tune import Tuner

prep_v1 = StandardScaler(["worst radius", "worst area"])
prep_v2 = StandardScaler(["worst concavity", "worst smoothness"])
tuner = Tuner(
    trainer,
    param_space={
        "preprocessor": tune.grid_search([prep_v1, prep_v2]),
        # Your other parameters go here
    },
)

Additionally, you can sample different train/validation datasets:

import ray
from ray import tune


def get_dataset():
    return ray.data.read_csv("s3://[email protected]/breast_cancer.csv")


def get_another_dataset():
    # imagine this is a different dataset
    return ray.data.read_csv("s3://[email protected]/breast_cancer.csv")


dataset_1 = get_dataset()
dataset_2 = get_another_dataset()

tuner = tune.Tuner(
    trainer,
    param_space={
        "datasets": {
            "train": tune.grid_search([dataset_1, dataset_2]),
        }
        # Your other parameters go here
    },
)

Restoring and resuming#

A Tuner regularly saves its state, so that a tuning run can be resumed after being interrupted.

Additionally, if trials fail during a tuning run, they can be retried, either from scratch or from the latest available checkpoint.

To restore the Tuner state, pass the path to the experiment directory as an argument to Tuner.restore(...).

The path is printed in the output of a tuning run as “Result logdir”. If you specify a name in the RunConfig and keep the default results directory, the experiment directory is ~/ray_results/<name>.

from ray.tune import Tuner

tuner = Tuner.restore("~/ray_results/test_tuner", restart_errored=True)
tuner.fit()

For more resume options, please see the documentation of Tuner.restore().