Tune Internals#
TunerInternal#
- class ray.tune.impl.tuner_internal.TunerInternal(restore_path: str = None, storage_filesystem: pyarrow.fs.FileSystem | None = None, resume_config: ResumeConfig | None = None, trainable: str | Callable | Type[Trainable] | BaseTrainer | None = None, param_space: Dict[str, Any] | None = None, tune_config: TuneConfig | None = None, run_config: RunConfig | None = None, _tuner_kwargs: Dict | None = None, _entrypoint: AirEntrypoint = AirEntrypoint.TUNER)[source]#
The real implementation behind the external-facing Tuner.
The external-facing Tuner multiplexes between a local Tuner and a remote Tuner depending on whether it is in Ray client mode.
In Ray client mode, the external Tuner wraps TunerInternal into a remote actor, which is guaranteed to be placed on the head node.
TunerInternal can be constructed fresh, in which case trainable needs to be provided, together with the optional param_space, tune_config, and run_config.
It can also be restored from a previous failed run (given restore_path).
- Parameters:
restore_path – The path from which the Tuner can be restored. If provided, none of the remaining arguments are needed.
resume_config – Resume config to configure which trials to continue.
trainable – The trainable to be tuned.
param_space – Search space of the tuning job. One thing to note is that both preprocessor and dataset can be tuned here.
tune_config – Tuning algorithm specific configs. Refer to ray.tune.tune_config.TuneConfig for more info.
run_config – Runtime configuration that is specific to individual trials. If passed, this will overwrite the run config passed to the Trainer, if applicable. Refer to ray.tune.RunConfig for more info.
Trial#
- class ray.tune.experiment.trial.Trial(trainable_name: str, *, config: Dict | None = None, trial_id: str | None = None, storage: StorageContext | None = None, evaluated_params: Dict | None = None, experiment_tag: str = '', placement_group_factory: PlacementGroupFactory | None = None, stopping_criterion: Dict[str, float] | None = None, checkpoint_config: CheckpointConfig | None = None, export_formats: List[str] | None = None, restore_path: str | None = None, trial_name_creator: Callable[[Trial], str] | None = None, trial_dirname_creator: Callable[[Trial], str] | None = None, log_to_file: str | None | Tuple[str | None, str | None] = None, max_failures: int = 0, stub: bool = False, _setup_default_resource: bool = True)[source]#
A trial object holds the state for one model training run.
Trials are themselves managed by the TrialRunner class, which implements the event loop for submitting trial runs to a Ray cluster.
Trials start in the PENDING state and transition to RUNNING once started. On error, a trial transitions to ERROR; on success, it transitions to TERMINATED.
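The lifecycle described above can be sketched as a small state machine. This is an illustrative model only, not Ray's implementation: ToyTrial and TrialStatus are hypothetical names, and transitions into or out of PAUSED are left out because this page does not describe them.

```python
from enum import Enum
from typing import Optional


class TrialStatus(str, Enum):
    # Status names as listed for Trial.status on this page.
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    PAUSED = "PAUSED"
    TERMINATED = "TERMINATED"
    ERROR = "ERROR"


class ToyTrial:
    """Hypothetical stand-in for a trial; it tracks only the status."""

    def __init__(self) -> None:
        # Every trial begins in PENDING.
        self.status = TrialStatus.PENDING

    def start(self) -> None:
        # PENDING -> RUNNING once the trial has been started.
        self.status = TrialStatus.RUNNING

    def finish(self, error: Optional[Exception] = None) -> None:
        # RUNNING -> ERROR on failure, RUNNING -> TERMINATED on success.
        if self.status is not TrialStatus.RUNNING:
            raise RuntimeError("only a running trial can finish")
        if error is not None:
            self.status = TrialStatus.ERROR
        else:
            self.status = TrialStatus.TERMINATED
```

Calling start() and then finish() leaves the toy trial in TERMINATED; passing an exception to finish() leaves it in ERROR.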
There are resources allocated to each trial. These should be specified using PlacementGroupFactory.
- trainable_name#
Name of the trainable object to be executed.
- config#
Provided configuration dictionary with evaluated params.
- trial_id#
Unique identifier for the trial.
- path#
Path where results for this trial are stored. Can be on the local node or on cloud storage.
- local_path#
Path on the local disk where results are stored.
- remote_path#
Path on cloud storage where results are stored, or None if not set.
- relative_logdir#
Directory of the trial relative to its experiment directory.
- evaluated_params#
Parameters evaluated by the search algorithm.
- experiment_tag#
Identifying trial name to show in the console.
- status#
One of PENDING, RUNNING, PAUSED, TERMINATED, ERROR.
- error_file#
Path to the errors that this trial has raised.
DeveloperAPI: This API may change across minor Ray releases.
- create_placement_group_factory()[source]#
Compute placement group factory if needed.
Note: this must be called after all the placeholders in self.config are resolved.
- property local_dir#
Warning
DEPRECATED: This API is deprecated and may be removed in future Ray releases.
- property logdir: str | None#
Warning
DEPRECATED: This API is deprecated and may be removed in future Ray releases.
- property checkpoint: Checkpoint | None#
Returns the most recent checkpoint if one has been saved.
- init_logdir()[source]#
Warning
DEPRECATED: This API is deprecated and may be removed in future Ray releases.
- update_resources(resources: dict | PlacementGroupFactory)[source]#
EXPERIMENTAL: Updates the resource requirements.
Should only be called when the trial is not running.
- Raises:
ValueError – if trial status is running.
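A minimal sketch of the guard described for update_resources, assuming a plain dict of resources and a string status. ResourceHolder is a hypothetical class used for illustration and does not reflect how Ray stores resource requests internally.

```python
from typing import Dict


class ResourceHolder:
    """Hypothetical object that refuses resource updates while running."""

    def __init__(self, resources: Dict[str, float]) -> None:
        self.status = "PENDING"
        self.resources = dict(resources)

    def update_resources(self, resources: Dict[str, float]) -> None:
        # Mirror the documented contract: reject updates for a running trial.
        if self.status == "RUNNING":
            raise ValueError("Cannot update resources while the trial is running.")
        self.resources = dict(resources)
```

The check runs before the assignment, so a rejected update leaves the previous resource request unchanged.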
- set_storage(new_storage: StorageContext)[source]#
Updates the storage context of the trial.
If the storage_path or experiment_dir_name has changed, then this setter also updates the paths of all checkpoints tracked by the checkpoint manager. This enables restoration from a checkpoint if the user moves the directory.
- get_pickled_error() → Exception | None[source]#
Returns the pickled error object if it exists in storage.
This is a pickled version of the latest error that the trial encountered.
- get_error() → TuneError | None[source]#
Returns the error text file trace as a TuneError object if it exists in storage.
This is a text trace of the latest error that the trial encountered, which is used in the case that the error is not picklable.
- on_checkpoint(checkpoint_result: _TrainingResult)[source]#
Hook for handling checkpoints taken by the Trainable.
- Parameters:
checkpoint_result – Checkpoint taken.
- should_recover()[source]#
Returns whether the trial qualifies for retrying.
num_failures should represent the number of times the trial has failed up to the moment this method is called. If we’ve failed 5 times and max_failures=5, then we should recover, since we only pass the limit on the 6th failure.
Note this may return true even when there is no checkpoint, either because self.checkpoint_freq is 0 or because the trial failed before a checkpoint has been made.
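The counting rule above can be written as a single comparison. The function below is a sketch of that rule only, taking both values as plain arguments. It is not the Trial method itself, and any further conditions Ray may apply are not modeled.

```python
def should_recover(num_failures: int, max_failures: int) -> bool:
    """Return True while the failure count has not passed the limit.

    With max_failures=5, the fifth failure is still recoverable and the
    sixth is not, matching the example in the description above.
    """
    return num_failures <= max_failures
```

Note that the comparison is inclusive: a trial that has failed exactly max_failures times still qualifies for a retry.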
FunctionTrainable#
- class ray.tune.trainable.function_trainable.FunctionTrainable(config: Dict[str, Any] = None, logger_creator: Callable[[Dict[str, Any]], Logger] = None, storage: StorageContext | None = None)[source]#
Trainable that runs a user function reporting results.
This mode of execution does not support checkpoint/restore.
DeveloperAPI: This API may change across minor Ray releases.
Registry#
- ray.tune.register_trainable(name: str, trainable: Callable | Type, warn: bool = True)[source]#
Register a trainable function or class.
This enables a class or function to be accessed on every Ray process in the cluster.
- Parameters:
name – Name to register.
trainable – Function or tune.Trainable class. Functions must take (config, status_reporter) as arguments and will be automatically converted into a class during registration.
DeveloperAPI: This API may change across minor Ray releases.
- ray.tune.register_env(name: str, env_creator: Callable)[source]#
Register a custom environment for use with RLlib.
This enables the environment to be accessed on every Ray process in the cluster.
- Parameters:
name – Name to register.
env_creator – Callable that creates an env.
DeveloperAPI: This API may change across minor Ray releases.
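Conceptually, both registration functions map a name to a callable that can be looked up later. The sketch below shows that idea with an in-process dictionary. toy_register, toy_lookup, and make_env are hypothetical names, and unlike Ray's registry this toy version does not make anything available to other processes in a cluster.

```python
from typing import Any, Callable, Dict

# Hypothetical in-process registry: name -> creator callable.
_REGISTRY: Dict[str, Callable[..., Any]] = {}


def toy_register(name: str, creator: Callable[..., Any]) -> None:
    """Store a creator callable under a string name."""
    if not isinstance(name, str):
        raise TypeError("name must be a string")
    if not callable(creator):
        raise TypeError("creator must be callable")
    _REGISTRY[name] = creator


def toy_lookup(name: str) -> Callable[..., Any]:
    """Return the creator registered under the given name."""
    return _REGISTRY[name]


def make_env(env_config: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for an environment creator; returns a plain dict.
    return {"kind": "toy-env", **env_config}


toy_register("my_env", make_env)
env = toy_lookup("my_env")({"size": 4})
```

Registering by name lets configuration refer to a trainable or environment with a string instead of passing the object itself.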
Output#
- class ray.tune.experimental.output.ProgressReporter(verbosity: AirVerbosity, progress_metrics: List[str] | List[Dict[str, str]] | None = None)[source]#
Periodically prints out status updates.
- class ray.tune.experimental.output.TrainReporter(verbosity: AirVerbosity, progress_metrics: List[str] | List[Dict[str, str]] | None = None)[source]#