ray.tune.run#
- ray.tune.run(run_or_experiment: str | Callable | Type, *, name: str | None = None, metric: str | None = None, mode: str | None = None, stop: Mapping | Stopper | Callable[[str, Mapping], bool] | None = None, time_budget_s: int | float | timedelta | None = None, config: Dict[str, Any] | None = None, resources_per_trial: None | Mapping[str, float | int | Mapping] | PlacementGroupFactory = None, num_samples: int = 1, storage_path: str | None = None, storage_filesystem: pyarrow.fs.FileSystem | None = None, search_alg: Searcher | SearchAlgorithm | str | None = None, scheduler: TrialScheduler | str | None = None, checkpoint_config: CheckpointConfig | None = None, verbose: int | AirVerbosity | Verbosity | None = None, progress_reporter: ProgressReporter | None = None, log_to_file: bool = False, trial_name_creator: Callable[[Trial], str] | None = None, trial_dirname_creator: Callable[[Trial], str] | None = None, sync_config: SyncConfig | None = None, export_formats: Sequence | None = None, max_failures: int = 0, fail_fast: bool = False, restore: str | None = None, resume: bool | str | None = None, resume_config: ResumeConfig | None = None, reuse_actors: bool = False, raise_on_failed_trial: bool = True, callbacks: Sequence[Callback] | None = None, max_concurrent_trials: int | None = None, keep_checkpoints_num: int | None = None, checkpoint_score_attr: str | None = None, checkpoint_freq: int = 0, checkpoint_at_end: bool = False, chdir_to_trial_dir: bool = 'DEPRECATED', local_dir: str | None = None, _remote: bool | None = None, _remote_string_queue: Queue | None = None, _entrypoint: AirEntrypoint = AirEntrypoint.TUNE_RUN) ExperimentAnalysis [source]#
Executes training.
When a SIGINT signal is received (e.g. through Ctrl+C), the tuning run will gracefully shut down and checkpoint the latest experiment state. Sending SIGINT again (or SIGKILL/SIGTERM instead) will skip this step.
Many aspects of Tune, such as the frequency of global checkpointing, the maximum number of pending placement group trials, and the path of the result directory, can be configured through environment variables. Refer to Environment variables used by Ray Tune for a list of the environment variables available.
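For example, the global checkpoint frequency can be bounded via an environment variable set before calling tune.run. This is a minimal sketch; TUNE_GLOBAL_CHECKPOINT_S is one of the variables listed in the Tune environment variable reference, and the 60-second value is an arbitrary choice:

import os

# Illustrative only: checkpoint the global experiment state at most
# every 60 seconds. Must be set before tune.run() starts.
os.environ["TUNE_GLOBAL_CHECKPOINT_S"] = "60"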
Examples:
# Run 10 trials (each trial is one instance of a Trainable). Tune runs
# in parallel and automatically determines concurrency.
tune.run(trainable, num_samples=10)

# Run 1 trial, stop when trial has reached 10 iterations
tune.run(my_trainable, stop={"training_iteration": 10})

# automatically retry failed trials up to 3 times
tune.run(my_trainable, stop={"training_iteration": 10}, max_failures=3)

# Run 1 trial, search over hyperparameters, stop after 10 iterations.
space = {"lr": tune.uniform(0, 1), "momentum": tune.uniform(0, 1)}
tune.run(my_trainable, config=space, stop={"training_iteration": 10})

# Resumes training if a previous machine crashed
tune.run(
    my_trainable, config=space,
    storage_path=<path/to/dir>, name=<exp_name>, resume=True
)
- Parameters:
run_or_experiment – If function|class|str, this is the algorithm or model to train. This may refer to the name of a built-in algorithm (e.g. RLlib’s DQN or PPO), a user-defined trainable function or class, or the string identifier of a trainable function or class registered in the tune registry. If Experiment, then Tune will execute training based on Experiment.spec. If you want to pass in a Python lambda, you will need to first register the function: tune.register_trainable("lambda_id", lambda x: ...). You can then use tune.run("lambda_id").
metric – Metric to optimize. This metric should be reported with tune.report(). If set, will be passed to the search algorithm and scheduler (see the sketch below).
mode – Must be one of [min, max]. Determines whether the objective is minimizing or maximizing the metric attribute. If set, will be passed to the search algorithm and scheduler.
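A rough sketch of how metric and mode work together with a reporting trainable. The exact tune.report signature varies across Ray versions (the dict-style call is assumed here), and the metric name mean_loss is an arbitrary choice:

from ray import tune

def my_trainable(config):
    # Report the value that `metric` refers to (dict-style API assumed).
    tune.report({"mean_loss": (config["lr"] - 0.1) ** 2})

tune.run(
    my_trainable,
    config={"lr": tune.uniform(0, 1)},
    metric="mean_loss",  # optimize this reported key...
    mode="min",          # ...by minimizing it
    num_samples=4,
)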
name – Name of experiment.
stop – Stopping criteria. If dict, the keys may be any field in the return result of ‘train()’, whichever is reached first. If function, it must take (trial_id, result) as arguments and return a boolean (True if trial should be stopped, False otherwise). This can also be a subclass of ray.tune.Stopper, which allows users to implement custom experiment-wide stopping (i.e., stopping an entire Tune run based on some time constraint); see the sketch below.
time_budget_s – Global time budget in seconds after which all trials are stopped. Can also be a datetime.timedelta object.
config – Algorithm-specific configuration for Tune variant generation (e.g. env, hyperparams). Defaults to empty dict. Custom search algorithms may ignore this.
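A minimal sketch of a custom Stopper for the stop parameter above; the trainable, metric name, and threshold are assumptions for illustration:

from ray import tune

class LossThresholdStopper(tune.Stopper):
    def __call__(self, trial_id, result):
        # Per-result hook: return True to stop this particular trial.
        return result["mean_loss"] < 0.01

    def stop_all(self):
        # Experiment-wide hook: return True to stop the whole run.
        return False

tune.run(my_trainable, stop=LossThresholdStopper())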
resources_per_trial – Machine resources to allocate per trial, e.g. {"cpu": 64, "gpu": 8}. Note that GPUs will not be assigned unless you specify them here. Defaults to 1 CPU and 0 GPUs in Trainable.default_resource_request(). This can also be a PlacementGroupFactory object wrapping arguments to create a per-trial placement group (see the sketch below).
num_samples – Number of times to sample from the hyperparameter space. Defaults to 1. If grid_search is provided as an argument, the grid will be repeated num_samples times. If this is -1, (virtually) infinite samples are generated until a stopping condition is met.
storage_path – Path to store results at. Can be a local directory or a destination on cloud storage. Defaults to the local ~/ray_results directory.
search_alg – Search algorithm for optimization. You can also use the name of the algorithm.
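A sketch of the two resources_per_trial forms; the bundle shapes below are arbitrary choices for illustration:

from ray import tune
from ray.tune import PlacementGroupFactory

# Plain dict form: 2 CPUs and 1 GPU reserved for each trial.
tune.run(my_trainable, resources_per_trial={"cpu": 2, "gpu": 1})

# PlacementGroupFactory form: one bundle for the trainable itself plus
# two worker bundles, grouped into a per-trial placement group.
tune.run(
    my_trainable,
    resources_per_trial=PlacementGroupFactory(
        [{"CPU": 1}, {"CPU": 2, "GPU": 1}, {"CPU": 2, "GPU": 1}]
    ),
)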
scheduler – Scheduler for executing the experiment. Choose among FIFO (default), MedianStopping, AsyncHyperBand, HyperBand and PopulationBasedTraining. Refer to ray.tune.schedulers for more options. You can also use the name of the scheduler.
verbose – 0, 1, or 2. Verbosity mode. 0 = silent, 1 = default, 2 = verbose. Defaults to 1. If the RAY_AIR_NEW_OUTPUT=0 environment variable is set, uses the old verbosity settings: 0 = silent, 1 = only status updates, 2 = status and brief results, 3 = status and detailed results.
progress_reporter – Progress reporter for reporting intermediate experiment progress. Defaults to CLIReporter if running in command-line, or JupyterNotebookReporter if running in a Jupyter notebook.
log_to_file – Log stdout and stderr to files in Tune’s trial directories. If this is False (default), no files are written. If True, outputs are written to trialdir/stdout and trialdir/stderr, respectively. If this is a single string, this is interpreted as a file relative to the trialdir, to which both streams are written. If this is a Sequence (e.g. a Tuple), it has to have length 2 and the elements indicate the files to which stdout and stderr are written, respectively.
trial_name_creator – Optional function that takes in a Trial and returns its name (i.e. its string representation). Be sure to include some unique identifier (such as Trial.trial_id) in each trial’s name.
trial_dirname_creator – Optional function that takes in a trial and generates its trial directory name as a string. Be sure to include some unique identifier (such as Trial.trial_id) in each trial’s directory name; otherwise, trials could overwrite artifacts and checkpoints of other trials. The return value cannot be a path (see the sketch below).
chdir_to_trial_dir – Deprecated. Set the RAY_CHDIR_TO_TRIAL_DIR environment variable instead.
sync_config – Configuration object for syncing. See train.SyncConfig.
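A short trial_dirname_creator sketch; the name format is an arbitrary choice, the only requirement being that Trial.trial_id keeps each directory name unique:

def trial_dirname_creator(trial):
    # Return a directory *name* (not a path) that is unique per trial.
    return f"trial_{trial.trial_id}"

tune.run(my_trainable, trial_dirname_creator=trial_dirname_creator)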
export_formats – List of formats in which to export trained models at the end of the experiment. Defaults to None.
max_failures – Try to recover a trial at least this many times. Ray will recover from the latest checkpoint if present. Setting to -1 will lead to infinite recovery retries. Setting to 0 will disable retries. Defaults to 0.
fail_fast – Whether to fail upon the first error. If fail_fast=’raise’ is provided, Tune will automatically raise the exception received by the Trainable. fail_fast=’raise’ can easily leak resources and should be used with caution (it is best used with ray.init(local_mode=True)).
restore – Path to checkpoint. Only makes sense to set if running 1 trial. Defaults to None.
resume – One of [True, False, “AUTO”]. Can be suffixed with one or more of [“+ERRORED”, “+ERRORED_ONLY”, “+RESTART_ERRORED”, “+RESTART_ERRORED_ONLY”] (e.g. AUTO+ERRORED). resume=True and resume="AUTO" will attempt to resume from a checkpoint and otherwise start a new experiment. The suffix “+ERRORED” resets and reruns errored trials upon resume; previous trial artifacts will be left untouched, and Tune will try to continue from the last observed checkpoint. The suffix “+RESTART_ERRORED” will instead start the errored trials from scratch. “+ERRORED_ONLY” and “+RESTART_ERRORED_ONLY” will disable resuming non-errored trials; they will be added as finished instead. New trials can still be generated by the search algorithm. An example follows below.
resume_config – [Experimental] Config object that controls how to resume trials of different statuses. Can be used as a substitute for the resume suffixes described above.
reuse_actors – Whether to reuse actors between different trials when possible. This can drastically speed up experiments that start and stop actors often (e.g., PBT in time-multiplexing mode). This requires trials to have the same resource requirements. Defaults to False.
raise_on_failed_trial – Raise TuneError if any trial fails (i.e., ends in the ERROR state) when the experiment completes.
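For example, to resume an experiment and restart its errored trials from scratch (the path and name are placeholders, matching the earlier examples):

tune.run(
    my_trainable,
    storage_path="<path/to/dir>",  # placeholder: experiment storage location
    name="<exp_name>",             # placeholder: name of the interrupted experiment
    resume="AUTO+RESTART_ERRORED",
)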
callbacks – List of callbacks that will be called at different times in the training loop. Must be instances of the ray.tune.callback.Callback class (see the sketch below). If not passed, LoggerCallback (json/csv/tensorboard) callbacks are automatically added.
max_concurrent_trials – Maximum number of trials to run concurrently. Must be non-negative. If None or 0, no limit will be applied. This is achieved by wrapping the search_alg in a ConcurrencyLimiter, and thus setting this argument will raise an exception if the search_alg is already a ConcurrencyLimiter. Defaults to None.
_remote – Whether to run the Tune driver in a remote function. This is disabled automatically if a custom trial executor is passed in. This is enabled by default in Ray client mode.
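A minimal Callback sketch overriding one of its hooks; the printed format is an arbitrary choice:

from ray import tune

class PrintResultCallback(tune.Callback):
    def on_trial_result(self, iteration, trials, trial, result, **info):
        # Invoked each time any trial reports a result.
        print(f"Trial {trial} reported a result at iteration {iteration}")

tune.run(my_trainable, callbacks=[PrintResultCallback()])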
local_dir – Deprecated. Use storage_path instead.
keep_checkpoints_num – Deprecated. Use checkpoint_config instead.
checkpoint_score_attr – Deprecated. Use checkpoint_config instead.
checkpoint_freq – Deprecated. Use checkpoint_config instead.
checkpoint_at_end – Deprecated. Use checkpoint_config instead.
checkpoint_keep_all_ranks – Deprecated. Use checkpoint_config instead.
checkpoint_upload_from_workers – Deprecated. Use checkpoint_config instead.
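The deprecated checkpoint arguments map roughly onto fields of ray.train.CheckpointConfig. The mapping below is a sketch rather than an exact equivalence, and the metric name and values are arbitrary choices:

from ray.train import CheckpointConfig

tune.run(
    my_trainable,
    checkpoint_config=CheckpointConfig(
        num_to_keep=3,                           # replaces keep_checkpoints_num
        checkpoint_score_attribute="mean_loss",  # replaces checkpoint_score_attr
        checkpoint_score_order="min",
        checkpoint_frequency=10,                 # replaces checkpoint_freq
        checkpoint_at_end=True,                  # replaces checkpoint_at_end
    ),
)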
- Returns:
Object for experiment analysis.
- Return type:
ExperimentAnalysis
- Raises:
TuneError – If any trial fails and raise_on_failed_trial is True.