Stopping and Resuming Tune Trials

Ray Tune periodically checkpoints the experiment state so that it can be restarted when it fails or stops. The checkpointing period is dynamically adjusted so that at least 95% of the time is used for handling training results and scheduling.

If you send a SIGINT signal to the process running tune.run() (which is usually what happens when you press Ctrl+C in the console), Ray Tune shuts down training gracefully and saves a final experiment-level checkpoint.

How to resume a Tune run?

If you’ve stopped a run and want to resume from where you left off, call tune.run() with resume=True:

from ray import tune

tune.run(
    train,
    # other configuration
    name="my_experiment",
    resume=True
)

You must pass a name when using resume=True so that Ray Tune can locate the experiment folder (usually stored at a path such as ~/ray_results/my_experiment). If you did not pass a name in the first call, you can still pass one when you resume the run. In that case the experiment name most likely carries a date suffix: if you ran tune.run(my_trainable), the name might look like my_trainable_2021-01-29_10-16-44.

You can see which name you need to pass by looking at the "Result logdir" line in the status output of your original tuning run:

== Status ==
Memory usage on this node: 11.0/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 1/16 CPUs, 0/0 GPUs, 0.0/4.69 GiB heap, 0.0/1.61 GiB objects
Result logdir: /Users/ray/ray_results/my_trainable_2021-01-29_10-16-44
Number of trials: 1/1 (1 RUNNING)
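If the status output is no longer available, the name can also be recovered from disk. The following is a minimal sketch, assuming the results live under ~/ray_results and that folder names carry the date suffix shown above. The helper find_experiment_names is hypothetical and not part of Ray Tune.

```python
import re
from pathlib import Path

# Date suffix in the form shown above, e.g. "_2021-01-29_10-16-44".
_SUFFIX = re.compile(r"_\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2}")


def find_experiment_names(trainable_name, results_dir="~/ray_results"):
    """Return experiment folder names for `trainable_name`, oldest first.

    Hypothetical helper, not part of Ray Tune. It assumes the results
    location and the date-suffix naming described in the text.
    """
    root = Path(results_dir).expanduser()
    if not root.is_dir():
        return []
    names = [
        p.name
        for p in root.iterdir()
        if p.is_dir()
        and p.name.startswith(trainable_name + "_")
        and _SUFFIX.fullmatch(p.name[len(trainable_name):])
    ]
    # The zero-padded timestamp sorts chronologically as plain text.
    return sorted(names)
```

The last element of the returned list is the most recent run, and that string is what you would pass as name when resuming.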

Another useful option to know about is resume="AUTO", which will attempt to resume the experiment if possible, and otherwise will start a new experiment. For more details and other options for resume, see the Tune run API documentation.

How to stop Tune runs programmatically?

We’ve just covered the case in which you manually interrupt a Tune run. You can also control when trials are stopped early by passing the stop argument to tune.run. This argument accepts a dictionary, a function, or a Stopper class.

If a dictionary is passed in, the keys may be any field in the result reported through tune.report in the Function API or returned by step() in the Class API, including auto-filled metrics.

Stopping with a dictionary

In the example below, each trial will be stopped either when it completes 10 iterations or when it reaches a mean accuracy of 0.98. These metrics are assumed to be increasing.

# training_iteration is a metric that Tune fills in automatically.
tune.run(
    my_trainable,
    stop={"training_iteration": 10, "mean_accuracy": 0.98}
)
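To make the "either ... or" semantics concrete, here is a small model of how such a dictionary can be evaluated against a single result. It is a simplified illustration under the assumption that a trial stops once any listed metric reaches its threshold, not Tune's actual implementation. The function should_stop and the sample results are hypothetical.

```python
def should_stop(result, stop):
    """Return True once any metric in `stop` has reached its threshold.

    Simplified model for illustration. A metric that is missing from
    `result` raises KeyError in this sketch.
    """
    return any(result[metric] >= threshold for metric, threshold in stop.items())


stop = {"training_iteration": 10, "mean_accuracy": 0.98}

# Hypothetical results reported by a trial at different points in time.
early = {"training_iteration": 3, "mean_accuracy": 0.50}
accurate = {"training_iteration": 4, "mean_accuracy": 0.99}
out_of_budget = {"training_iteration": 10, "mean_accuracy": 0.70}

print(should_stop(early, stop))          # neither threshold reached
print(should_stop(accurate, stop))       # accuracy threshold reached
print(should_stop(out_of_budget, stop))  # iteration threshold reached
```

Because the comparison is "greater than or equal", this style only suits metrics that grow over time. For a metric that should decrease, such as a loss, a stopping function is the more natural fit.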

Stopping with a function

For more flexibility, you can pass in a function instead. If a function is passed in, it must take (trial_id, result) as arguments and return a boolean (True if trial should be stopped and False otherwise).

def stopper(trial_id, result):
    return result["mean_accuracy"] / result["training_iteration"] > 5

tune.run(my_trainable, stop=stopper)
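A stopping function can also be built by a small factory, which keeps thresholds configurable. The sketch below assumes the result contains mean_accuracy and the auto-filled training_iteration. The factory make_stopper and its default values are hypothetical, chosen only for illustration.

```python
def make_stopper(metric="mean_accuracy", threshold=0.98, min_iterations=5):
    """Build a (trial_id, result) -> bool stopping function.

    Hypothetical helper: the trial is stopped only after it has run for
    at least `min_iterations` and `metric` has reached `threshold`.
    """

    def stopper(trial_id, result):
        return (
            result["training_iteration"] >= min_iterations
            and result[metric] >= threshold
        )

    return stopper


stopper = make_stopper(threshold=0.9, min_iterations=3)
```

The resulting function has the same signature as the one above, so it could be passed in the same way, for example tune.run(my_trainable, stop=stopper).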

Stopping with a class

Finally, you can implement the Stopper abstract class to stop entire experiments. The following example stops all trials once any individual trial fulfills the criterion, and prevents new ones from starting:

from ray import tune
from ray.tune import Stopper

class CustomStopper(Stopper):
    def __init__(self):
        self.should_stop = False

    def __call__(self, trial_id, result):
        if not self.should_stop and result['foo'] > 10:
            self.should_stop = True
        return self.should_stop

    def stop_all(self):
        """Returns whether to stop trials and prevent new ones from starting."""
        return self.should_stop

stopper = CustomStopper()
tune.run(my_trainable, stop=stopper)

Note that in the above example the currently running trials will not stop immediately but will do so once their current iterations are complete.
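A Stopper can also keep state per trial instead of per experiment. The sketch below stops an individual trial after a number of consecutive results without improvement and never stops the experiment as a whole. The class name PatienceStopper, the metric name, and the default patience are assumptions made for illustration. The fallback base class exists only so the example runs without Ray installed.

```python
try:
    from ray.tune import Stopper
except ImportError:
    # Stand-in so the sketch can be run without Ray installed.
    class Stopper:
        pass


class PatienceStopper(Stopper):
    """Stop a trial after `patience` consecutive results without improvement.

    Hypothetical example class; metric name and defaults are assumptions.
    """

    def __init__(self, metric="mean_accuracy", patience=3):
        self.metric = metric
        self.patience = patience
        self.best = {}   # trial_id -> best metric value seen so far
        self.stale = {}  # trial_id -> consecutive results without improvement

    def __call__(self, trial_id, result):
        value = result[self.metric]
        if trial_id not in self.best or value > self.best[trial_id]:
            self.best[trial_id] = value
            self.stale[trial_id] = 0
        else:
            self.stale[trial_id] += 1
        return self.stale[trial_id] >= self.patience

    def stop_all(self):
        # Only individual trials are stopped; the experiment keeps running.
        return False
```

Keying the state by trial_id matters because a single stopper instance receives the results of every trial in the experiment.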

Ray Tune comes with a set of out-of-the-box stopper classes. See the Stopper documentation.

Stopping after the first failure

By default, tune.run will continue executing until all trials have terminated or errored. To stop the entire Tune run as soon as any trial errors:

tune.run(trainable, fail_fast=True)

This is useful when you are setting up a large hyperparameter experiment.