Tune Internals#

TunerInternal#

class ray.tune.impl.tuner_internal.TunerInternal(restore_path: str = None, storage_filesystem: pyarrow.fs.FileSystem | None = None, resume_config: ResumeConfig | None = None, trainable: str | Callable | Type[Trainable] | BaseTrainer | None = None, param_space: Dict[str, Any] | None = None, tune_config: TuneConfig | None = None, run_config: RunConfig | None = None, _tuner_kwargs: Dict | None = None, _entrypoint: AirEntrypoint = AirEntrypoint.TUNER)[source]#

The real implementation behind the external-facing Tuner.

The external-facing Tuner multiplexes between a local Tuner and a remote Tuner depending on whether it is running in Ray client mode.

In Ray client mode, the external Tuner wraps TunerInternal in a remote actor, which is guaranteed to be placed on the head node.

TunerInternal can be constructed from scratch, in which case a trainable must be provided, together with optional param_space, tune_config, and run_config.

It can also be restored from a previous failed run (given restore_path).

Parameters:
  • restore_path – The path from which the Tuner can be restored. If provided, none of the other arguments are needed.

  • resume_config – Resume config to configure which trials to continue.

  • trainable – The trainable to be tuned.

  • param_space – Search space of the tuning job. Note that both the preprocessor and the dataset can be tuned here.

  • tune_config – Tuning algorithm specific configs. Refer to ray.tune.tune_config.TuneConfig for more info.

  • run_config – Runtime configuration that is specific to individual trials. If passed, this will overwrite the run config passed to the Trainer, if applicable. Refer to ray.train.RunConfig for more info.
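
TunerInternal is not meant to be constructed directly; the public Tuner creates it. A minimal sketch of the corresponding user-facing flow, with an illustrative trainable and restore path:

from ray import tune

def train_fn(config):
    # A function trainable may simply return a dict of final metrics.
    return {"loss": (config["x"] - 3) ** 2}

# Constructing a Tuner builds a TunerInternal under the hood.
tuner = tune.Tuner(
    train_fn,
    param_space={"x": tune.uniform(0, 10)},
    tune_config=tune.TuneConfig(num_samples=4),
)
results = tuner.fit()

# A failed or interrupted run can be restored later, which corresponds to
# constructing TunerInternal with restore_path set (the path is illustrative).
# restored = tune.Tuner.restore("/tmp/my_experiment", trainable=train_fn)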

Trial#

class ray.tune.experiment.trial.Trial(trainable_name: str, *, config: Dict | None = None, trial_id: str | None = None, storage: StorageContext | None = None, evaluated_params: Dict | None = None, experiment_tag: str = '', placement_group_factory: PlacementGroupFactory | None = None, stopping_criterion: Dict[str, float] | None = None, checkpoint_config: CheckpointConfig | None = None, export_formats: List[str] | None = None, restore_path: str | None = None, trial_name_creator: Callable[[Trial], str] | None = None, trial_dirname_creator: Callable[[Trial], str] | None = None, log_to_file: str | None | Tuple[str | None, str | None] = None, max_failures: int = 0, stub: bool = False, _setup_default_resource: bool = True)[source]#

A trial object holds the state for one model training run.

Trials are themselves managed by the TrialRunner class, which implements the event loop for submitting trial runs to a Ray cluster.

Trials start in the PENDING state and transition to RUNNING once started. On error, a trial transitions to ERROR; on success, to TERMINATED.

Resources are allocated to each trial; these should be specified using a PlacementGroupFactory.
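
As an illustrative sketch (the trainable and bundle shapes below are assumptions, not part of this API), per-trial resources are usually attached from user code with tune.with_resources and a PlacementGroupFactory:

from ray import tune
from ray.tune.execution.placement_groups import PlacementGroupFactory

def train_fn(config):
    return {"score": config["x"] ** 2}

# Two bundles per trial: one CPU for the trainable itself and one CPU for an
# extra worker it may launch.
trainable = tune.with_resources(
    train_fn,
    PlacementGroupFactory([{"CPU": 1}, {"CPU": 1}]),
)

tuner = tune.Tuner(trainable, param_space={"x": tune.grid_search([1, 2, 3])})
results = tuner.fit()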

trainable_name#

Name of the trainable object to be executed.

config#

Provided configuration dictionary with evaluated params.

trial_id#

Unique identifier for the trial.

path#

Path where results for this trial are stored. Can be on the local node or on cloud storage.

local_path#

Path on the local disk where results are stored.

remote_path#

Path on cloud storage where results are stored, or None if not set.

relative_logdir#

Directory of the trial relative to its experiment directory.

evaluated_params#

Parameters evaluated by the search algorithm.

experiment_tag#

Identifying trial name to show in the console.

status#

One of PENDING, RUNNING, PAUSED, TERMINATED, or ERROR.

error_file#

Path to the errors that this trial has raised.

DeveloperAPI: This API may change across minor Ray releases.

create_placement_group_factory()[source]#

Compute placement group factory if needed.

Note: this must be called after all the placeholders in self.config are resolved.

property local_dir#

Warning

DEPRECATED: This API is deprecated and may be removed in future Ray releases.

property logdir: str | None#

Warning

DEPRECATED: This API is deprecated and may be removed in future Ray releases.

property checkpoint: Checkpoint | None#

Returns the most recent checkpoint if one has been saved.

init_logdir()[source]#

Warning

DEPRECATED: This API is deprecated and may be removed in future Ray releases.

init_local_path()[source]#

Initialize the trial's local results directory (logdir).

update_resources(resources: dict | PlacementGroupFactory)[source]#

EXPERIMENTAL: Updates the resource requirements.

Should only be called when the trial is not running.

Raises:

ValueError – If the trial status is RUNNING.

set_location(location)[source]#

Sets the location of the trial.

set_status(status)[source]#

Sets the status of the trial.

set_storage(new_storage: StorageContext)[source]#

Updates the storage context of the trial.

If the storage_path or experiment_dir_name has changed, then this setter also updates the paths of all checkpoints tracked by the checkpoint manager. This enables restoration from a checkpoint if the user moves the directory.

get_pickled_error() Exception | None[source]#

Returns the pickled error object if it exists in storage.

This is a pickled version of the latest error that the trial encountered.

get_error() TuneError | None[source]#

Returns the error text file trace as a TuneError object if it exists in storage.

This is a text trace of the latest error that the trial encountered, which is used in the case that the error is not picklable.
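
These methods are internal; from user code, the same error typically surfaces on the Result objects of a ResultGrid. A hedged sketch (the failing trainable is illustrative):

from ray import tune

def flaky_fn(config):
    if config["x"] > 0.5:
        raise RuntimeError("bad hyperparameter")
    return {"score": config["x"]}

tuner = tune.Tuner(flaky_fn, param_space={"x": tune.grid_search([0.1, 0.9])})
results = tuner.fit()

# Errored trials expose their exception on the Result; internally this is
# backed by Trial.get_pickled_error() / Trial.get_error().
for result in results:
    if result.error:
        print(result.error)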

should_stop(result)[source]#

Whether the given result meets this trial’s stopping criteria.
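
The stopping criteria checked here are populated from user code, typically via the stop argument of the run config. A minimal sketch, assuming a RunConfig that accepts a stop dictionary and the illustrative trainable below:

from ray import tune
from ray.train import RunConfig

class MyTrainable(tune.Trainable):
    def setup(self, config):
        self.x = config["x"]

    def step(self):
        # Each step's result dict is what Trial.should_stop() evaluates
        # against the stopping criteria.
        return {"score": self.x * self.training_iteration}

tuner = tune.Tuner(
    MyTrainable,
    param_space={"x": tune.uniform(0, 1)},
    run_config=RunConfig(stop={"training_iteration": 5}),
)
results = tuner.fit()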

should_checkpoint()[source]#

Whether this trial is due for checkpointing.

on_checkpoint(checkpoint_result: _TrainingResult)[source]#

Hook for handling checkpoints taken by the Trainable.

Parameters:

checkpoint_result – The checkpoint result taken by the Trainable.

on_restore()[source]#

Handles restoration completion.

should_recover()[source]#

Returns whether the trial qualifies for retrying.

num_failures should represent the number of times the trial has failed up to the moment this method is called. If we’ve failed 5 times and max_failures=5, then we should recover, since we only pass the limit on the 6th failure.

Note that this may return True even when there is no checkpoint, either because self.checkpoint_freq is 0 or because the trial failed before a checkpoint was made.
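
From user code, the retry limit that should_recover() consults is normally set through the failure config. A hedged sketch, assuming the illustrative trainable below:

from ray import tune
from ray.train import FailureConfig, RunConfig

def train_fn(config):
    return {"score": config["x"]}

tuner = tune.Tuner(
    train_fn,
    param_space={"x": tune.uniform(0, 1)},
    # Each trial may be retried up to 5 times after a failure before it is
    # marked ERROR; this value populates Trial.max_failures.
    run_config=RunConfig(failure_config=FailureConfig(max_failures=5)),
)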

FunctionTrainable#

class ray.tune.trainable.function_trainable.FunctionTrainable(config: Dict[str, Any] = None, logger_creator: Callable[[Dict[str, Any]], Logger] = None, storage: StorageContext | None = None)[source]#

Trainable that runs a user function reporting results.

This mode of execution does not support checkpoint/restore.

DeveloperAPI: This API may change across minor Ray releases.

ray.tune.trainable.function_trainable.wrap_function(train_func: Callable[[Any], Any], name: str | None = None) Type[FunctionTrainable][source]#

DeveloperAPI: This API may change across minor Ray releases.
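
A brief sketch of this developer API: wrap_function turns a plain training function into a FunctionTrainable subclass, which is what Tune does internally when a function is passed as the trainable (the function below is illustrative):

from ray.tune import Trainable
from ray.tune.trainable.function_trainable import wrap_function

def train_fn(config):
    return {"loss": config["lr"] * 2}

# Returns a dynamically created FunctionTrainable subclass that Tune
# instantiates as an actor for each trial.
MyTrainable = wrap_function(train_fn, name="train_fn")
assert issubclass(MyTrainable, Trainable)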

Registry#

ray.tune.register_trainable(name: str, trainable: Callable | Type, warn: bool = True)[source]#

Register a trainable function or class.

This enables a class or function to be accessed on every Ray process in the cluster.

Parameters:
  • name – Name to register.

  • trainable – Function or tune.Trainable class. Functions must take (config, status_reporter) as arguments and will be automatically converted into a class during registration.

DeveloperAPI: This API may change across minor Ray releases.
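
A minimal sketch (names are illustrative): after registration, the trainable can be referenced by its string name, for example when constructing a Tuner:

from ray import tune

def my_trainable(config):
    return {"score": config["x"] ** 2}

# Make the trainable addressable by name on every Ray process.
tune.register_trainable("my_trainable", my_trainable)

tuner = tune.Tuner("my_trainable", param_space={"x": tune.uniform(0, 1)})
results = tuner.fit()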

ray.tune.register_env(name: str, env_creator: Callable)[source]#

Register a custom environment for use with RLlib.

This enables the environment to be accessed on every Ray process in the cluster.

Parameters:
  • name – Name to register.

  • env_creator – Callable that creates an env.

DeveloperAPI: This API may change across minor Ray releases.
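
A minimal sketch, assuming the gymnasium package and an RLlib algorithm config (both illustrative here):

import gymnasium as gym
from ray import tune

def env_creator(env_config):
    # env_config is the dict passed through from the RLlib algorithm config.
    return gym.make("CartPole-v1")

# RLlib algorithms can now refer to the environment by this name, e.g.
# via AlgorithmConfig.environment("my_cartpole").
tune.register_env("my_cartpole", env_creator)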