Execution (tune.run, tune.Experiment)

tune.run

ray.tune.run(run_or_experiment, name=None, metric=None, mode=None, stop=None, time_budget_s=None, config=None, resources_per_trial=None, num_samples=1, local_dir=None, search_alg=None, scheduler=None, keep_checkpoints_num=None, checkpoint_score_attr=None, checkpoint_freq=0, checkpoint_at_end=False, verbose=2, progress_reporter=None, loggers=None, log_to_file=False, trial_name_creator=None, trial_dirname_creator=None, sync_config=None, export_formats=None, max_failures=0, fail_fast=False, restore=None, server_port=None, resume=False, queue_trials=False, reuse_actors=False, trial_executor=None, raise_on_failed_trial=True, callbacks=None, ray_auto_init=None, run_errored_only=None, global_checkpoint_period=None, with_server=None, upload_dir=None, sync_to_cloud=None, sync_to_driver=None, sync_on_checkpoint=None)[source]

Executes training.

Examples:

# Run 10 trials (each trial is one instance of a Trainable). Tune runs
# in parallel and automatically determines concurrency.
tune.run(trainable, num_samples=10)

# Run 1 trial, stop when trial has reached 10 iterations
tune.run(my_trainable, stop={"training_iteration": 10})

# automatically retry failed trials up to 3 times
tune.run(my_trainable, stop={"training_iteration": 10}, max_failures=3)

# Run 1 trial, search over hyperparameters, stop after 10 iterations.
space = {"lr": tune.uniform(0, 1), "momentum": tune.uniform(0, 1)}
tune.run(my_trainable, config=space, stop={"training_iteration": 10})

# Resumes training if a previous machine crashed
tune.run(my_trainable, config=space,
         local_dir=<path/to/dir>, resume=True)

# Rerun ONLY failed trials after an experiment is finished.
tune.run(my_trainable, config=space,
         local_dir=<path/to/dir>, resume="ERRORED_ONLY")
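
# Several of the arguments documented below can be combined in one call.
# A minimal sketch (assuming my_trainable reports a "mean_loss" value via
# tune.report()): optimize mean_loss over 20 samples from the search space,
# with a global time budget of 10 minutes across all trials.
analysis = tune.run(
    my_trainable,
    config=space,
    metric="mean_loss",
    mode="min",
    num_samples=20,
    time_budget_s=600)
print(analysis.get_best_config(metric="mean_loss", mode="min"))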
Parameters
  • run_or_experiment (function | class | str | Experiment) – If function|class|str, this is the algorithm or model to train. This may refer to the name of a built-in algorithm (e.g. RLlib’s DQN or PPO), a user-defined trainable function or class, or the string identifier of a trainable function or class registered in the tune registry. If Experiment, then Tune will execute training based on Experiment.spec. If you want to pass in a Python lambda, you will need to first register the function: tune.register_trainable("lambda_id", lambda x: ...). You can then use tune.run("lambda_id").

  • metric (str) – Metric to optimize. This metric should be reported with tune.report(). If set, will be passed to the search algorithm and scheduler.

  • mode (str) – Must be one of [min, max]. Determines whether objective is minimizing or maximizing the metric attribute. If set, will be passed to the search algorithm and scheduler.

  • name (str) – Name of experiment.

  • stop (dict | callable | Stopper) – Stopping criteria. If dict, the keys may be any field in the return result of ‘train()’, and a trial is stopped as soon as any of the criteria is reached. If function, it must take (trial_id, result) as arguments and return a boolean (True if the trial should be stopped, False otherwise). This can also be a subclass of ray.tune.Stopper, which allows users to implement custom experiment-wide stopping (i.e., stopping an entire Tune run based on some time constraint). A callable stop criterion is shown in the sketch after this parameter list.

  • time_budget_s (int|float|datetime.timedelta) – Global time budget in seconds after which all trials are stopped. Can also be a datetime.timedelta object.

  • config (dict) – Algorithm-specific configuration for Tune variant generation (e.g. env, hyperparams). Defaults to empty dict. Custom search algorithms may ignore this.

  • resources_per_trial (dict) – Machine resources to allocate per trial, e.g. {"cpu": 64, "gpu": 8}. Note that GPUs will not be assigned unless you specify them here. Defaults to 1 CPU and 0 GPUs in Trainable.default_resource_request().

  • num_samples (int) – Number of times to sample from the hyperparameter space. Defaults to 1. If grid_search is provided as an argument, the grid will be repeated num_samples times. If this is -1, (virtually) infinite samples are generated until a stopping condition is met.

  • local_dir (str) – Local dir to save training results to. Defaults to ~/ray_results.

  • search_alg (Searcher) – Search algorithm for optimization.

  • scheduler (TrialScheduler) – Scheduler for executing the experiment. Choose among FIFO (default), MedianStopping, AsyncHyperBand, HyperBand and PopulationBasedTraining. Refer to ray.tune.schedulers for more options.

  • keep_checkpoints_num (int) – Number of checkpoints to keep. A value of None keeps all checkpoints. Defaults to None. If set, need to provide checkpoint_score_attr.

  • checkpoint_score_attr (str) – Specifies by which attribute to rank the best checkpoint. Default is increasing order. If the attribute starts with min- it will rank the attribute in decreasing order, e.g. min-validation_loss.

  • checkpoint_freq (int) – How many training iterations between checkpoints. A value of 0 (default) disables checkpointing. This has no effect when using the Functional Training API.

  • checkpoint_at_end (bool) – Whether to checkpoint at the end of the experiment regardless of the checkpoint_freq. Default is False. This has no effect when using the Functional Training API.

  • verbose (int) – 0, 1, or 2. Verbosity mode. 0 = silent, 1 = only status updates, 2 = status and trial results.

  • progress_reporter (ProgressReporter) – Progress reporter for reporting intermediate experiment progress. Defaults to CLIReporter if running in command-line, or JupyterNotebookReporter if running in a Jupyter notebook.

  • loggers (list) – List of logger creators to be used with each Trial. If None, defaults to ray.tune.logger.DEFAULT_LOGGERS. See ray/tune/logger.py.

  • log_to_file (bool|str|Sequence) – Log stdout and stderr to files in Tune’s trial directories. If this is False (default), no files are written. If True, outputs are written to trialdir/stdout and trialdir/stderr, respectively. If this is a single string, it is interpreted as a file relative to the trialdir, to which both streams are written. If this is a Sequence (e.g. a Tuple), it has to have length 2 and the elements indicate the files to which stdout and stderr are written, respectively.

  • trial_name_creator (Callable[[Trial], str]) – Optional function for generating the trial string representation.

  • trial_dirname_creator (Callable[[Trial], str]) – Function for generating the trial dirname. This function should take in a Trial object and return a string representing the name of the directory. The return value cannot be a path.

  • sync_config (SyncConfig) – Configuration object for syncing. See tune.SyncConfig.

  • export_formats (list) – List of formats to export at the end of the experiment. Default is None.

  • max_failures (int) – Try to recover a trial at least this many times. Ray will recover from the latest checkpoint if present. Setting to -1 will lead to infinite recovery retries. Setting to 0 will disable retries. Defaults to 0.

  • fail_fast (bool | str) – Whether to fail upon the first error. If fail_fast=’raise’ is provided, Tune will automatically raise the exception received by the Trainable. fail_fast=’raise’ can easily leak resources and should be used with caution (it is best used with ray.init(local_mode=True)).

  • restore (str) – Path to checkpoint. Only makes sense to set if running 1 trial. Defaults to None.

  • server_port (int) – Port number for launching TuneServer.

  • resume (str|bool) – One of “LOCAL”, “REMOTE”, “PROMPT”, “ERRORED_ONLY”, or bool. LOCAL/True restores the checkpoint from the local_checkpoint_dir, determined by name and local_dir. REMOTE restores the checkpoint from remote_checkpoint_dir. PROMPT provides CLI feedback. False forces a new experiment. ERRORED_ONLY resets and reruns ERRORED trials upon resume - previous trial artifacts will be left untouched. If resume is set but checkpoint does not exist, ValueError will be thrown.

  • queue_trials (bool) – Whether to queue trials when the cluster does not currently have enough resources to launch one. This should be set to True when running on an autoscaling cluster to enable automatic scale-up.

  • reuse_actors (bool) – Whether to reuse actors between different trials when possible. This can drastically speed up experiments that start and stop actors often (e.g., PBT in time-multiplexing mode). This requires trials to have the same resource requirements.

  • trial_executor (TrialExecutor) – Manage the execution of trials.

  • raise_on_failed_trial (bool) – Raise TuneError if any trial fails (i.e., is in ERROR state) when the experiment completes.

  • callbacks (list) – List of callbacks that will be called at different times in the training loop. Must be instances of the ray.tune.trial_runner.Callback class.
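
As a combined illustration of the callable stop criterion, resources_per_trial, and the checkpoint-related arguments above, a minimal sketch, assuming my_trainable is a Trainable class whose results include mean_accuracy:

# Stopping function: end a trial once it reports mean_accuracy >= 0.95
# or has completed 100 training iterations.
def stop_fn(trial_id, result):
    return (result["mean_accuracy"] >= 0.95
            or result["training_iteration"] >= 100)

tune.run(
    my_trainable,
    stop=stop_fn,
    resources_per_trial={"cpu": 2, "gpu": 0},
    checkpoint_freq=5,                      # checkpoint every 5 iterations (class API only)
    keep_checkpoints_num=3,                 # retain only the 3 best checkpoints ...
    checkpoint_score_attr="mean_accuracy")  # ... ranked by reported mean_accuracy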

Returns

Object for experiment analysis.

Return type

ExperimentAnalysis

Raises

TuneError – Any trials failed and raise_on_failed_trial is True.

tune.run_experiments

ray.tune.run_experiments(experiments, scheduler=None, server_port=None, verbose=2, progress_reporter=None, resume=False, queue_trials=False, reuse_actors=False, trial_executor=None, raise_on_failed_trial=True, concurrent=True)[source]

Runs and blocks until all trials finish.

Examples

>>> experiment_spec = Experiment("experiment", my_func)
>>> run_experiments(experiments=experiment_spec)
>>> experiment_spec = {"experiment": {"run": my_func}}
>>> run_experiments(experiments=experiment_spec)
Returns

List of Trial objects, holding data for each executed trial.
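
Multiple experiments can also be given as a dict keyed by experiment name; a minimal sketch, reusing my_func from the example above:

experiments = {
    "exp_a": {"run": my_func, "stop": {"training_iteration": 10}},
    "exp_b": {"run": my_func, "num_samples": 5},
}
trials = tune.run_experiments(experiments)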

tune.Experiment

ray.tune.Experiment(name, run, stop=None, time_budget_s=None, config=None, resources_per_trial=None, num_samples=1, local_dir=None, upload_dir=None, trial_name_creator=None, trial_dirname_creator=None, loggers=None, log_to_file=False, sync_to_driver=None, checkpoint_freq=0, checkpoint_at_end=False, sync_on_checkpoint=True, keep_checkpoints_num=None, checkpoint_score_attr=None, export_formats=None, max_failures=0, restore=None)[source]

Tracks experiment specifications.

Implicitly registers the Trainable if needed. The args here take the same meaning as the arguments defined in tune.py:run.

experiment_spec = Experiment(
    "my_experiment_name",
    my_func,
    stop={"mean_accuracy": 100},
    config={
        "alpha": tune.grid_search([0.2, 0.4, 0.6]),
        "beta": tune.grid_search([1, 2]),
    },
    resources_per_trial={
        "cpu": 1,
        "gpu": 0
    },
    num_samples=10,
    local_dir="~/ray_results",
    checkpoint_freq=10,
    max_failures=2)
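
The resulting spec can be passed directly to tune.run (see run_or_experiment above) or to tune.run_experiments:

tune.run(experiment_spec)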

tune.with_parameters

ray.tune.with_parameters(fn, **kwargs)[source]

Wrapper for function trainables to pass arbitrarily large data objects.

This wrapper function will store all passed parameters in the Ray object store and retrieve them when calling the function. It can thus be used to pass arbitrary data, even datasets, to Tune trainable functions.

This can also be used as an alternative to functools.partial to pass default arguments to trainables.

Parameters
  • fn – Function to wrap.

  • **kwargs – Parameters to store in the object store.

from ray import tune

def train(config, data=None):
    for sample in data:
        # ...
        tune.report(loss=loss)

data = HugeDataset(download=True)

tune.run(
    tune.with_parameters(train, data=data),
    #...
)
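
As noted above, the same mechanism can replace functools.partial for fixing default arguments; a minimal sketch, using a hypothetical num_epochs keyword:

def train(config, num_epochs=1, data=None):
    for epoch in range(num_epochs):
        # ... train one epoch on data ...
        tune.report(epoch=epoch)

tune.run(
    tune.with_parameters(train, num_epochs=10, data=data))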

Stopper (tune.Stopper)

class ray.tune.Stopper[source]

Base class for implementing a Tune experiment stopper.

Allows users to implement experiment-level stopping via stop_all. By default, this class does not stop any trials. Subclasses need to implement __call__ and stop_all.

import time
from ray import tune
from ray.tune import Stopper

class TimeStopper(Stopper):
    def __init__(self):
        self._start = time.time()
        self._deadline = 300

    def __call__(self, trial_id, result):
        return False

    def stop_all(self):
        return time.time() - self._start > self._deadline

tune.run(Trainable, num_samples=200, stop=TimeStopper())

__call__(trial_id, result)[source]

Returns true if the trial should be terminated given the result.

stop_all()[source]

Returns true if the experiment should be terminated.
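
A stopper can also end the entire run based on reported results; a minimal sketch of a hypothetical threshold-based stopper, assuming trials report mean_accuracy:

from ray import tune
from ray.tune import Stopper

class AccuracyStopper(Stopper):
    def __init__(self, threshold=0.95):
        self._threshold = threshold
        self._reached = False

    def __call__(self, trial_id, result):
        # Record when the threshold is reached; this also stops the reporting trial.
        if result["mean_accuracy"] >= self._threshold:
            self._reached = True
        return self._reached

    def stop_all(self):
        # Once any trial has reached the threshold, stop the whole experiment.
        return self._reached

tune.run(my_trainable, num_samples=20, stop=AccuracyStopper())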

tune.SyncConfig

ray.tune.SyncConfig(upload_dir: str = None, sync_to_cloud: Any = None, sync_to_driver: Any = None, sync_on_checkpoint: bool = True, node_sync_period: int = 300, cloud_sync_period: int = 300) → None[source]

Configuration object for syncing.

Parameters
  • upload_dir (str) – Optional URI to sync training results and checkpoints to (e.g. s3://bucket, gs://bucket or hdfs://path).

  • sync_to_cloud (func|str) – Function for syncing the local_dir to and from upload_dir. If string, then it must be a string template that includes {source} and {target} for the syncer to run. If not provided, the sync command defaults to standard S3, gsutil or HDFS sync commands. By default the local_dir is synced to the upload_dir every 300 seconds. To change this, set the TUNE_CLOUD_SYNC_S environment variable on the driver machine.

  • sync_to_driver (func|str|bool) – Function for syncing trial logdir from remote node to local. If string, then it must be a string template that includes {source} and {target} for the syncer to run. If True or not provided, it defaults to using rsync. If False, syncing to driver is disabled.

  • sync_on_checkpoint (bool) – Force sync-down of trial checkpoint to driver. If set to False, checkpoint syncing from worker to driver is asynchronous and best-effort. This does not affect persistent storage syncing. Defaults to True.

  • node_sync_period (int) – Syncing period for syncing worker logs to driver. Defaults to 300.

  • cloud_sync_period (int) – Syncing period for syncing local checkpoints to cloud. Defaults to 300.
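
A SyncConfig is passed to tune.run through the sync_config argument; a minimal sketch, assuming an S3 bucket you control:

from ray import tune

sync_config = tune.SyncConfig(
    upload_dir="s3://my-bucket/tune-results",  # hypothetical bucket URI
    sync_on_checkpoint=True,
    cloud_sync_period=600)  # sync local checkpoints to cloud every 10 minutes

tune.run(my_trainable, sync_config=sync_config)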