Tune Internals#

TunerInternal#

class ray.tune.impl.tuner_internal.TunerInternal(restore_path: str = None, resume_config: Optional[ray.tune.execution.trial_runner._ResumeConfig] = None, trainable: Optional[Union[str, Callable, Type[ray.tune.trainable.trainable.Trainable], BaseTrainer]] = None, param_space: Optional[Dict[str, Any]] = None, tune_config: Optional[ray.tune.tune_config.TuneConfig] = None, run_config: Optional[ray.air.config.RunConfig] = None, _tuner_kwargs: Optional[Dict] = None)[source]#

The real implementation behind the external-facing Tuner.

The external-facing Tuner multiplexes between a local Tuner and a remote Tuner, depending on whether Ray client mode is in use.

In Ray client mode, the external Tuner wraps TunerInternal in a remote actor, which is guaranteed to be placed on the head node.

TunerInternal can be constructed from scratch, in which case a trainable must be provided, together with the optional param_space, tune_config, and run_config.

It can also be restored from a previous failed run (given restore_path).

Parameters
  • restore_path – The path from which the Tuner can be restored. If provided, none of the other arguments are needed.

  • resume_config – Resume config to configure which trials to continue.

  • trainable – The trainable to be tuned.

  • param_space – Search space of the tuning job. Note that both the preprocessor and the dataset can be tuned here.

  • tune_config – Tuning algorithm specific configs. Refer to ray.tune.tune_config.TuneConfig for more info.

  • run_config – Runtime configuration that is specific to individual trials. If passed, this will overwrite the run config passed to the Trainer, if applicable. Refer to ray.air.config.RunConfig for more info.
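
For orientation, here is a minimal sketch of how these arguments are typically supplied through the public Tuner facade rather than by constructing TunerInternal directly. The trainable, metric, and experiment name below are placeholders; restoring a previously failed run via tune.Tuner.restore maps to the restore_path argument above.

    from ray import tune
    from ray.air import session
    from ray.air.config import RunConfig

    def objective(config):
        # Placeholder trainable: report one metric derived from the sampled config.
        session.report({"score": config["x"] ** 2})

    tuner = tune.Tuner(
        objective,                                          # trainable
        param_space={"x": tune.uniform(0.0, 1.0)},          # param_space
        tune_config=tune.TuneConfig(num_samples=4),         # tune_config
        run_config=RunConfig(name="tuner_internal_demo"),   # run_config
    )
    results = tuner.fit()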

RayTrialExecutor#

class ray.tune.execution.ray_trial_executor.RayTrialExecutor(resource_manager: Optional[ray.air.execution.resources.resource_manager.ResourceManager] = None, reuse_actors: bool = False, result_buffer_length: Optional[int] = None, refresh_period: Optional[float] = None, chdir_to_trial_dir: bool = False)[source]#

An implementation of TrialExecutor based on Ray.

DeveloperAPI: This API may change across minor Ray releases.
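
As a rough illustration only (Tune normally constructs this executor itself), the constructor arguments above could be used as follows; the values are arbitrary:

    from ray.tune.execution.ray_trial_executor import RayTrialExecutor

    # Arbitrary illustrative values: reuse actors between trials and
    # buffer up to 10 results at a time.
    executor = RayTrialExecutor(
        reuse_actors=True,
        result_buffer_length=10,
    )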

set_status(trial: ray.tune.experiment.trial.Trial, status: str) None[source]#

Sets status and checkpoints metadata if needed.

Metadata is only checkpointed if the trial status is a terminal condition. Checkpointing for switches to PENDING, PAUSED, and RUNNING is handled by the TrialRunner.

Parameters
  • trial – Trial to checkpoint.

  • status – Status to set trial to.

get_checkpoints() Dict[str, str][source]#

Returns a copy of the mapping from trial ID to pickled metadata.

get_ready_trial() Optional[ray.tune.experiment.trial.Trial][source]#

Get a trial whose resources are ready and that can thus be started.

Can also return None if no trial is available.

Returns

Trial object or None.

start_trial(trial: ray.tune.experiment.trial.Trial) bool[source]#

Starts the trial.

Will not return resources if trial repeatedly fails on start.

Parameters

trial – Trial to be started.

Returns

True if the remote runner has been started. False if trial was not started (e.g. because of lacking resources/pending PG).

stop_trial(trial: ray.tune.experiment.trial.Trial, error: bool = False, exc: Optional[Union[ray.tune.error.TuneError, ray.exceptions.RayTaskError]] = None) None[source]#

Stops the trial, releasing held resources and removing futures related to this trial from the execution queue.

Parameters
  • trial – Trial to stop.

  • error – Whether to mark this trial as terminated in error. The trial status will be set to either Trial.ERROR or Trial.TERMINATED based on this. Defaults to False.

  • exc – Optional exception to log (as a reason for stopping). Defaults to None.

continue_training(trial: ray.tune.experiment.trial.Trial) None[source]#

Continues the training of this trial.

pause_trial(trial: ray.tune.experiment.trial.Trial, should_checkpoint: bool = True) None[source]#

Pauses the trial, releasing resources (specifically GPUs).

We do this by:

  1. Checkpointing the trial (if should_checkpoint) in memory, to allow us to resume from this state in the future. We may not always want to checkpoint, if we know that the checkpoint will not be used.

  2. Stopping the trial and releasing resources (see RayTrialExecutor.stop_trial above).

  3. Setting the trial status to Trial.PAUSED, which is similar to Trial.TERMINATED, except that we intend to resume the trial.

Parameters
  • trial – Trial to pause.

  • should_checkpoint – Whether to save an in-memory checkpoint before stopping.

reset_trial(trial: ray.tune.experiment.trial.Trial, new_config: Dict, new_experiment_tag: str, logger_creator: Optional[Callable[[Dict], ray.tune.Logger]] = None) bool[source]#

Tries to invoke Trainable.reset() to reset the trial.

Parameters
  • trial – Trial to be reset.

  • new_config – New configuration for Trial trainable.

  • new_experiment_tag – New experiment name for trial.

  • logger_creator – Function that instantiates a logger on the actor process.

Returns

True if reset_config is successful else False.

has_resources_for_trial(trial: ray.tune.experiment.trial.Trial) bool[source]#

Returns whether there are resources available for this trial.

This will return True as long as we have not reached the maximum number of pending trials. It will also return True if the trial placement group is already staged.

Parameters

trial – Trial object which should be scheduled.

Returns

True if resources are available for the trial, False otherwise.

debug_string() str[source]#

Returns a human-readable message for printing to the console.

on_step_begin() None[source]#

Before step() is called, update the available resources.

save(trial: ray.tune.experiment.trial.Trial, storage: ray.air._internal.checkpoint_manager.CheckpointStorage = CheckpointStorage.PERSISTENT, result: Optional[Dict] = None) ray.air._internal.checkpoint_manager._TrackedCheckpoint[source]#

Saves the trial’s state to a checkpoint asynchronously.

Parameters
  • trial – The trial to be saved.

  • storage – Where to store the checkpoint. Defaults to PERSISTENT.

  • result – The state of this trial as a dictionary to be saved. If result is None, the trial’s last result will be used.

Returns

Checkpoint object, or None if an Exception occurs.

restore(trial: ray.tune.experiment.trial.Trial) None[source]#

Restores training state from a given model checkpoint.

Parameters

trial – The trial to be restored.

Raises
  • RuntimeError – This error is raised if no runner is found.

  • AbortTrialExecution – This error is raised if the trial is ineligible for restoration, given the Tune input arguments.

export_trial_if_needed(trial: ray.tune.experiment.trial.Trial) Dict[source]#

Exports model of this trial based on trial.export_formats.

Returns

A dict that maps ExportFormats to successfully exported models.

get_next_executor_event(live_trials: Set[ray.tune.experiment.trial.Trial], next_trial_exists: bool) ray.tune.execution.ray_trial_executor._ExecutorEvent[source]#

Get the next executor event to be processed in TrialRunner.

In case there are multiple events available for handling, the next event is determined by the following priority:

  1. If next_trial_exists and there are cached resources to use, PG_READY is emitted.

  2. If next_trial_exists and there are no cached resources to use, wait on the PG future and the randomized other futures. If multiple futures are ready, the PG future takes priority and is handled first.

  3. If next_trial_exists is not set, wait on just the randomized other futures.

An example of #3 would be synchronous hyperband. Although there are PGs ready, the scheduler is holding back scheduling new trials, since the whole band of trials is waiting for the slowest trial to finish. In this case, we prioritize handling training results to avoid a deadlock.

This is a blocking wait with a timeout (specified via an environment variable). The reason for the timeout is that we still want to print status info periodically in TrialRunner for a better user experience.

The handling of ExecutorEvent.STOP_RESULT is purely internal to RayTrialExecutor itself. All other future results are handled by TrialRunner.

In the future we may want to do most of the handling of ExecutorEvent.RESTORE_RESULT and SAVING_RESULT in RayTrialExecutor itself and only notify TrialRunner to invoke the corresponding callbacks. This view is more consistent with our goal of making TrialRunner responsible for external-facing Trial state transitions, while RayTrialExecutor is responsible for internal-facing transitions, namely is_saving, is_restoring, etc.

Also, you may notice that the boundary between RayTrialExecutor and PlacementGroupManager is currently quite blurry. This will be improved once we move to an ActorPool abstraction.

next_trial_exists means that there is a trial to run; returning PG_READY is prioritized in this case.
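
The decision order above can be summarized with a small, self-contained sketch. The function name, arguments, and return values below are hypothetical and only mirror the priority rules; they are not the actual implementation of get_next_executor_event:

    import random
    from typing import List, Tuple

    def choose_wait_set(
        next_trial_exists: bool,
        has_cached_resources: bool,
        pg_futures: List,
        other_futures: List,
    ) -> Tuple[str, List]:
        # Rule 1: a trial is waiting and cached resources can be reused right away.
        if next_trial_exists and has_cached_resources:
            return "PG_READY", []
        shuffled = list(other_futures)
        random.shuffle(shuffled)  # randomize the non-PG futures
        if next_trial_exists:
            # Rule 2: wait on the PG futures plus the randomized rest;
            # PG futures are handled first if several futures are ready.
            return "WAIT", pg_futures + shuffled
        # Rule 3: no trial to start, so wait on the randomized other futures only.
        return "WAIT", shuffled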

TrialRunner#

class ray.tune.execution.trial_runner.TrialRunner(search_alg: Optional[ray.tune.search.search_algorithm.SearchAlgorithm] = None, scheduler: Optional[ray.tune.schedulers.trial_scheduler.TrialScheduler] = None, local_checkpoint_dir: Optional[str] = None, sync_config: Optional[ray.tune.syncer.SyncConfig] = None, experiment_dir_name: Optional[str] = None, stopper: Optional[ray.tune.stopper.stopper.Stopper] = None, resume: Union[str, bool] = False, server_port: Optional[int] = None, fail_fast: bool = False, checkpoint_period: Optional[Union[str, int]] = None, trial_executor: Optional[ray.tune.execution.ray_trial_executor.RayTrialExecutor] = None, callbacks: Optional[List[ray.tune.callback.Callback]] = None, metric: Optional[str] = None, trial_checkpoint_config: Optional[ray.air.config.CheckpointConfig] = None, driver_sync_trial_checkpoints: bool = False)[source]#

A TrialRunner implements the event loop for scheduling trials on Ray.

The main job of TrialRunner is scheduling trials to efficiently use cluster resources, without overloading the cluster.

While Ray itself provides resource management for tasks and actors, this is not sufficient when scheduling trials that may instantiate multiple actors. This is because if insufficient resources are available, concurrent trials could deadlock waiting for new resources to become available. Furthermore, oversubscribing the cluster could degrade training performance, leading to misleading benchmark results.

Parameters
  • search_alg – SearchAlgorithm for generating Trial objects.

  • scheduler – Defaults to FIFOScheduler.

  • local_checkpoint_dir – Path where global experiment state checkpoints are saved and restored from.

  • sync_config – See SyncConfig. Within sync config, the upload_dir specifies cloud storage, and experiment state checkpoints will be synced to the remote_checkpoint_dir: {sync_config.upload_dir}/{experiment_name}.

  • experiment_dir_name – Experiment directory name. See Experiment.

  • stopper – Custom class for stopping whole experiments. See Stopper.

  • resume – see tune.py:run.

  • server_port – Port number for launching TuneServer.

  • fail_fast – Finishes as soon as a trial fails if True. If fail_fast='raise' is provided, Tune will automatically raise the exception received by the Trainable. fail_fast='raise' can easily leak resources and should be used with caution.

  • checkpoint_period – Trial runner checkpoint periodicity in seconds. Defaults to "auto", which adjusts checkpointing time so that at most 5% of the time is spent on writing checkpoints.

  • trial_executor – Defaults to RayTrialExecutor.

  • callbacks – List of callbacks that will be called at different times in the training loop. Must be instances of the ray.tune.callback.Callback class.

  • metric – Metric used to check received results. If a result is reported without this metric, an error will be raised. The error can be avoided by not providing a metric or by setting the environment variable TUNE_DISABLE_STRICT_METRIC_CHECKING=1.

DeveloperAPI: This API may change across minor Ray releases.

setup_experiments(experiments: List[ray.tune.experiment.experiment.Experiment], total_num_samples: int) None[source]#

Obtains any necessary information from experiments.

Mainly used to set up callbacks.

Parameters
  • experiments – List of Experiments to use.

  • total_num_samples – Total number of samples factoring in grid search samplers.

end_experiment_callbacks() None[source]#

Calls on_experiment_end method in callbacks.

checkpoint(force: bool = False)[source]#

Saves execution state to self._local_checkpoint_dir.

Overwrites the current session checkpoint, which starts when self is instantiated. Throttling depends on self._checkpoint_period.

Also automatically saves the search algorithm to the local checkpoint dir.

Parameters

force – Forces a checkpoint despite checkpoint_period.

resume(resume_unfinished: bool = True, resume_errored: bool = False, restart_errored: bool = False)[source]#

Resumes all checkpointed trials from a previous run.

Requires the user to manually re-register their objects. Also stops all ongoing trials.

update_pending_trial_resources(resources: Union[dict, ray.tune.execution.placement_groups.PlacementGroupFactory])[source]#

Update trial resources when resuming from checkpoint.

Only pending trials are updated.

is_finished()[source]#

Returns whether all trials have finished running.

step()[source]#

Runs one step of the trial event loop.

Callers should typically run this method repeatedly in a loop. They may inspect or modify the runner’s state in between calls to step().
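
A minimal sketch of the driver loop implied here, using only methods documented on this page. How the Trial objects are produced (normally by a SearchAlgorithm) is assumed and elided:

    from ray.tune.execution.ray_trial_executor import RayTrialExecutor
    from ray.tune.execution.trial_runner import TrialRunner

    def run_trials(trials):
        # trials: pre-built ray.tune.experiment.trial.Trial objects.
        runner = TrialRunner(trial_executor=RayTrialExecutor())
        for trial in trials:
            runner.add_trial(trial)
        while not runner.is_finished():
            runner.step()  # one iteration of the trial event loop
        runner.cleanup()
        return runner.get_trials()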

get_trials()[source]#

Returns the list of trials managed by this TrialRunner.

Note that the caller usually should not mutate trial state directly.

get_live_trials()[source]#

Returns the set of trials that are not in Trial.TERMINATED state.

add_trial(trial: ray.tune.experiment.trial.Trial)[source]#

Adds a new trial to this TrialRunner.

Trials may be added at any time.

Parameters

trial – Trial to queue.

pause_trial(trial: ray.tune.experiment.trial.Trial, should_checkpoint: bool = True)[source]#

Pause a trial and reset the necessary state variables for resuming later.

Parameters
  • trial – Trial to pause.

  • should_checkpoint – Whether or not an in-memory checkpoint should be created for this paused trial. Defaults to True.

stop_trial(trial)[source]#

The canonical implementation of stopping a trial.

Trials may be in any external status when this function is called. If the trial is in state PENDING or PAUSED, this calls on_trial_remove for the scheduler and on_trial_complete() for the search_alg. If the trial is in state RUNNING, it calls on_trial_complete for both the scheduler and the search_alg. The caller must ensure that there is no outstanding future to be handled for the trial; if there is, the future will be discarded.

cleanup()[source]#

Cleans up trials and callbacks.

Trial#

class ray.tune.experiment.trial.Trial(trainable_name: str, *, config: Optional[Dict] = None, trial_id: Optional[str] = None, local_dir: Optional[str] = '/home/docs/ray_results', evaluated_params: Optional[Dict] = None, experiment_tag: str = '', resources: Optional[ray.tune.resources.Resources] = None, placement_group_factory: Optional[ray.tune.execution.placement_groups.PlacementGroupFactory] = None, stopping_criterion: Optional[Dict[str, float]] = None, experiment_dir_name: Optional[str] = None, sync_config: Optional[ray.tune.syncer.SyncConfig] = None, checkpoint_config: Optional[ray.air.config.CheckpointConfig] = None, export_formats: Optional[List[str]] = None, restore_path: Optional[str] = None, trial_name_creator: Optional[Callable[[ray.tune.experiment.trial.Trial], str]] = None, trial_dirname_creator: Optional[Callable[[ray.tune.experiment.trial.Trial], str]] = None, log_to_file: Union[str, None, Tuple[Optional[str], Optional[str]]] = None, max_failures: int = 0, stub: bool = False, _setup_default_resource: bool = True)[source]#

A trial object holds the state for one model training run.

Trials are themselves managed by the TrialRunner class, which implements the event loop for submitting trial runs to a Ray cluster.

Trials start in the PENDING state and transition to RUNNING once started. On error, a trial transitions to ERROR; on success, to TERMINATED.

There are resources allocated to each trial. These should be specified using PlacementGroupFactory.
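
For example, a trial that requests one CPU and one GPU for its trainable could be constructed as sketched below; "my_trainable" is assumed to have been registered beforehand via ray.tune.register_trainable:

    from ray.tune.execution.placement_groups import PlacementGroupFactory
    from ray.tune.experiment.trial import Trial

    # A sketch: one resource bundle with 1 CPU and 1 GPU for the trainable.
    trial = Trial(
        "my_trainable",
        config={"lr": 1e-3},
        placement_group_factory=PlacementGroupFactory([{"CPU": 1, "GPU": 1}]),
    )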

trainable_name#

Name of the trainable object to be executed.

config#

Provided configuration dictionary with evaluated params.

trial_id#

Unique identifier for the trial.

local_dir#

The local_dir as passed to air.RunConfig(), joined with the name of the experiment.

logdir#

Directory where the trial logs are saved.

relative_logdir#

Same as logdir, but relative to the parent of the local_dir (equal to local_dir argument passed to air.RunConfig()).

evaluated_params#

Parameters evaluated by the search algorithm.

experiment_tag#

Identifying trial name to show in the console.

status#

One of PENDING, RUNNING, PAUSED, TERMINATED, or ERROR.

error_file#

Path to the errors that this trial has raised.

DeveloperAPI: This API may change across minor Ray releases.

property checkpoint#

Returns the most recent checkpoint.

If the trial is in ERROR state, the most recent PERSISTENT checkpoint is returned.

property remote_checkpoint_dir#

This is the per-trial remote checkpoint dir.

This is different from the per-experiment remote checkpoint dir.

init_logdir()[source]#

Initializes the trial's log directory.

update_resources(resources: Union[Dict, ray.tune.resources.Resources, ray.tune.execution.placement_groups.PlacementGroupFactory])[source]#

EXPERIMENTAL: Updates the resource requirements.

Should only be called when the trial is not running.

Raises

ValueError – If the trial status is RUNNING.

refresh_default_resource_request()[source]#

Update trial resources according to the trainable’s default resource request, if it is provided.

set_location(location)[source]#

Sets the location of the trial.

set_status(status)[source]#

Sets the status of the trial.

should_stop(result)[source]#

Whether the given result meets this trial’s stopping criteria.

should_checkpoint()[source]#

Whether this trial is due for checkpointing.

on_checkpoint(checkpoint: ray.air._internal.checkpoint_manager._TrackedCheckpoint)[source]#

Hook for handling checkpoints taken by the Trainable.

Parameters

checkpoint – Checkpoint taken.

on_restore()[source]#

Handles restoration completion.

should_recover()[source]#

Returns whether the trial qualifies for retrying.

This is the case if the trial has not failed more than max_failures times. Note that this may return True even when there is no checkpoint, either because self.checkpoint_freq is 0 or because the trial failed before a checkpoint was made.

FunctionTrainable#

class ray.tune.trainable.function_trainable.FunctionTrainable(config: Dict[str, Any] = None, logger_creator: Callable[[Dict[str, Any]], Logger] = None, remote_checkpoint_dir: Optional[str] = None, custom_syncer: Optional[ray.tune.syncer.Syncer] = None, sync_timeout: Optional[int] = None)[source]#

Trainable that runs a user function reporting results.

This mode of execution does not support checkpoint/restore.

DeveloperAPI: This API may change across minor Ray releases.

ray.tune.trainable.function_trainable.wrap_function(train_func: Callable[[Any], Any], warn: bool = True, name: Optional[str] = None) Type[ray.tune.trainable.function_trainable.FunctionTrainable][source]#

DeveloperAPI: This API may change across minor Ray releases.
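
A minimal sketch of what wrap_function does, assuming the AIR-era session.report API: it turns a plain training function into a FunctionTrainable subclass, which is what Tune does internally when a function is passed as the trainable.

    from ray.air import session
    from ray.tune.trainable.function_trainable import wrap_function

    def train_func(config):
        # Report a single result for illustration.
        session.report({"loss": config["x"] ** 2})

    # Returns a Type[FunctionTrainable] that Tune can schedule like any Trainable.
    TrainFuncTrainable = wrap_function(train_func)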

Registry#

ray.tune.register_trainable(name: str, trainable: Union[Callable, Type], warn: bool = True)[source]#

Register a trainable function or class.

This enables a class or function to be accessed on every Ray process in the cluster.

Parameters
  • name – Name to register.

  • trainable – Function or tune.Trainable class. Functions must take (config, status_reporter) as arguments and will be automatically converted into a class during registration.

DeveloperAPI: This API may change across minor Ray releases.
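
A minimal usage sketch, assuming a config-only training function and the AIR-era session.report API; once registered, the string name can be used wherever a trainable is expected:

    from ray import tune
    from ray.air import session

    def my_objective(config):
        # Report one metric derived from the sampled config.
        session.report({"score": config["x"] ** 2})

    tune.register_trainable("my_objective", my_objective)

    # The registered name can now be referenced by string, e.g. from a Tuner:
    tuner = tune.Tuner("my_objective", param_space={"x": tune.uniform(0.0, 1.0)})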

ray.tune.register_env(name: str, env_creator: Callable)[source]#

Register a custom environment for use with RLlib.

This enables the environment to be accessed on every Ray process in the cluster.

Parameters
  • name – Name to register.

  • env_creator – Callable that creates an env.

DeveloperAPI: This API may change across minor Ray releases.
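
A minimal usage sketch, assuming the gym package is installed; the registered name is arbitrary:

    import gym

    from ray.tune import register_env

    def env_creator(env_config):
        # env_config is the dict RLlib passes through from its configuration.
        return gym.make("CartPole-v1")

    register_env("my_cartpole", env_creator)
    # RLlib algorithms can then reference the environment by name, e.g. env="my_cartpole".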