Tune Internals

RayTrialExecutor

class ray.tune.ray_trial_executor.RayTrialExecutor(reuse_actors: bool = False, result_buffer_length: Optional[int] = None, refresh_period: Optional[float] = None)[source]

Bases: ray.tune.trial_executor.TrialExecutor

An implementation of TrialExecutor based on Ray. DeveloperAPI: This API may change across minor Ray releases.
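
For illustration, a minimal sketch of constructing the executor with the arguments shown above; in practice Tune creates this object for you (e.g. inside tune.run()), so the values here are purely hypothetical.

from ray.tune.ray_trial_executor import RayTrialExecutor

# Reuse actors between trials and buffer up to 10 results per training call.
# Both keyword arguments mirror the constructor signature shown above.
executor = RayTrialExecutor(reuse_actors=True, result_buffer_length=10)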

set_max_pending_trials(max_pending: int) None[source]

Set the maximum number of allowed pending trials.

get_staged_trial()[source]

Get a trial whose placement group was successfully staged.

Can also return None if no trial is available.

Returns

Trial object or None.

start_trial(trial: ray.tune.trial.Trial) bool[source]

Starts the trial.

Will not return resources if trial repeatedly fails on start.

Parameters

trial – Trial to be started.

Returns

True if the remote runner has been started. False if trial was not started (e.g. because of lacking resources/pending PG).

stop_trial(trial: ray.tune.trial.Trial, error: bool = False, exc: Optional[Union[ray.tune.error.TuneError, ray.exceptions.RayTaskError]] = None) None[source]

Stops the trial.

Stops this trial, releasing all allocated resources. If stopping the trial fails, the run will be marked as terminated in error, but no exception will be thrown.

Parameters
  • trial – Trial to be stopped.

  • error – Whether to mark this trial as terminated in error.

  • exc – Optional exception.

continue_training(trial: ray.tune.trial.Trial) None[source]

Continues the training of this trial.

reset_trial(trial: ray.tune.trial.Trial, new_config: Dict, new_experiment_tag: str, logger_creator: Optional[Callable[[Dict], ray.tune.Logger]] = None) bool[source]

Tries to invoke Trainable.reset() to reset trial.

Parameters
  • trial – Trial to be reset.

  • new_config – New configuration for Trial trainable.

  • new_experiment_tag – New experiment name for trial.

  • logger_creator – Function that instantiates a logger on the actor process.

Returns

True if reset_config is successful else False.

has_resources_for_trial(trial: ray.tune.trial.Trial) bool[source]

Returns whether there are resources available for this trial.

This will return True as long as we didn’t reach the maximum number of pending trials. It will also return True if the trial placement group is already staged.

Parameters

trial – Trial object which should be scheduled.

Returns

True if the trial can be scheduled (the maximum number of pending trials has not been reached, or the trial’s placement group is already staged), False otherwise.
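
A hedged sketch of how a driver loop might combine this check with start_trial(); the helper function and its arguments are illustrative, not part of the API.

def maybe_start(executor, trial):
    # Only start the trial if the executor reports capacity: either we are
    # below the pending-trial limit or the trial's placement group is staged.
    if executor.has_resources_for_trial(trial):
        return executor.start_trial(trial)
    return False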

debug_string() str[source]

Returns a human readable message for printing to the console.

on_step_begin(trials: List[ray.tune.trial.Trial]) None[source]

Before step() is called, update the available resources.

on_step_end(trials: List[ray.tune.trial.Trial]) None[source]

A hook called after running one step of the trial event loop.

Parameters

trials – The list of trials. Note, refrain from providing TrialRunner directly here.

save(trial: ray.tune.trial.Trial, storage: str = 'persistent', result: Optional[Dict] = None) ray.tune.checkpoint_manager._TuneCheckpoint[source]

Saves the trial’s state to a checkpoint asynchronously.

Parameters
  • trial – The trial to be saved.

  • storage – Where to store the checkpoint. Defaults to PERSISTENT.

  • result – The state of this trial as a dictionary to be saved. If result is None, the trial’s last result will be used.

Returns

Checkpoint object, or None if an Exception occurs.

restore(trial: ray.tune.trial.Trial) None[source]

Restores training state from a given model checkpoint.

Parameters

trial – The trial to be restored.

Raises
  • RuntimeError – This error is raised if no runner is found.

  • AbortTrialExecution – This error is raised if the trial is ineligible for restoration, given the Tune input arguments.

export_trial_if_needed(trial: ray.tune.trial.Trial) Dict[source]

Exports model of this trial based on trial.export_formats.

Returns

A dict that maps ExportFormats to successfully exported models.

has_gpus() bool[source]

Returns True if GPUs are detected on the cluster.

cleanup(trials: List[ray.tune.trial.Trial]) None[source]

Ensures that trials are cleaned up after stopping.

Parameters

trials – The list of trials. Note, refrain from providing TrialRunner directly here.

get_next_executor_event(live_trials: Set[ray.tune.trial.Trial], next_trial_exists: bool) ray.tune.ray_trial_executor.ExecutorEvent[source]

Get the next executor event to be processed in TrialRunner.

In case there are multiple events available for handling, the next event is determined by the following priority:

  1. If next_trial_exists and there are cached resources to use, PG_READY is emitted.

  2. If next_trial_exists but there are no cached resources to use, wait on the placement group future and the other (randomized) futures. If multiple futures are ready, the placement group future takes priority and is handled first.

  3. If there is no next trial to run, wait on just the other (randomized) futures.

An example of #3 would be synchronous Hyperband. Although there are placement groups ready, the scheduler is holding back scheduling new trials since the whole band of trials is waiting for the slowest trial to finish. In this case, we prioritize handling training results to avoid a deadlock situation.

This is a blocking wait with a timeout (specified via an environment variable). The reason for the timeout is that we still want to print status info periodically in TrialRunner for a better user experience.

The handling of ExecutorEvent.STOP_RESULT is purely internal to RayTrialExecutor itself. All other future results are handled by TrialRunner.

In the future we may want to do most of the handling of ExecutorEvent.RESTORE_RESULT and SAVING_RESULT in RayTrialExecutor itself and only notify TrialRunner to invoke the corresponding callbacks. This view is more consistent with our goal of making TrialRunner responsible for external-facing Trial state transitions, while RayTrialExecutor is responsible for internal-facing transitions, namely is_saving, is_restoring, etc.

Also note that the boundary between RayTrialExecutor and PlacementGroupManager is currently quite blurry. This will be improved once we move to an ActorPool abstraction.

Parameters

next_trial_exists – Whether there is a trial to run; returning PG_READY is prioritized in this case.
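
A rough, hedged sketch of the dispatch described above. The helper function and its arguments are illustrative, and the assumption that the returned event exposes a type attribute comparable against the PG_READY constant mentioned above may not hold across Ray versions; the real dispatch lives in TrialRunner.

from ray.tune.ray_trial_executor import ExecutorEvent

def handle_next_event(executor, live_trials, next_trial):
    # Blocking wait (with a timeout) for the next executor event.
    event = executor.get_next_executor_event(
        live_trials, next_trial_exists=next_trial is not None)
    if event.type == ExecutorEvent.PG_READY:
        # A placement group is ready: start the queued trial.
        executor.start_trial(next_trial)
    # Other event types (training/saving/restoring results) are handled by
    # TrialRunner and omitted from this sketch.
    return event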

TrialExecutor

class ray.tune.trial_executor.TrialExecutor[source]

Base class for interacting with remote trainables.

Manages platform-specific details such as resource handling and starting/stopping trials.

DeveloperAPI: This API may change across minor Ray releases.
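
As a hedged sketch, a custom executor subclasses this base class and overrides its abstract methods. Only a few methods are shown below with placeholder bodies; a real implementation must override all abstract methods documented in this section.

from ray.tune.trial import Trial
from ray.tune.trial_executor import TrialExecutor


class MyTrialExecutor(TrialExecutor):
    # Illustrative skeleton only.
    def start_trial(self, trial: Trial) -> bool:
        # Launch the trainable for this trial on your own backend.
        return True

    def stop_trial(self, trial: Trial, error: bool = False, exc=None) -> None:
        # Tear down the trainable and release its resources.
        pass

    def debug_string(self) -> str:
        return "MyTrialExecutor: no trials running"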

set_status(trial: ray.tune.trial.Trial, status: str) None[source]

Sets status and checkpoints metadata if needed.

Only checkpoints metadata if the trial status is a terminal condition. Checkpointing for switches to PENDING, PAUSED, and RUNNING is handled by the TrialRunner.

Parameters
  • trial – Trial to checkpoint.

  • status – Status to set trial to.

get_checkpoints() Dict[str, str][source]

Returns a copy of the mapping from trial ID to pickled metadata.

abstract start_trial(trial: ray.tune.trial.Trial) bool[source]

Starts the trial, restoring from a checkpoint if one is provided.

Parameters

trial – Trial to be started.

Returns

True if trial started successfully, False otherwise.

abstract stop_trial(trial: ray.tune.trial.Trial, error: bool = False, exc: Optional[Union[ray.tune.error.TuneError, ray.exceptions.RayTaskError]] = None) None[source]

Stops the trial.

Stops this trial, releasing all allocated resources. If stopping the trial fails, the run will be marked as terminated in error, but no exception will be thrown.

Parameters
  • trial – Trial to be stopped.

  • error – Whether to mark this trial as terminated in error.

  • exc – Optional exception.

continue_training(trial: ray.tune.trial.Trial) None[source]

Continues the training of this trial.

pause_trial(trial: ray.tune.trial.Trial) None[source]

Pauses the trial.

We want to release resources (specifically GPUs) when pausing an experiment. This results in a PAUSED state that is similar to TERMINATED.

abstract reset_trial(trial: ray.tune.trial.Trial, new_config: Dict, new_experiment_tag: str) bool[source]

Tries to invoke Trainable.reset() to reset trial.

Parameters
  • trial – Trial to be reset.

  • new_config – New configuration for Trial trainable.

  • new_experiment_tag – New experiment name for trial.

Returns

True if reset is successful else False.

on_step_begin(trials: List[ray.tune.trial.Trial]) None[source]

A hook called before running one step of the trial event loop.

Parameters

trials – The list of trials. Note, refrain from providing TrialRunner directly here.

on_step_end(trials: List[ray.tune.trial.Trial]) None[source]

A hook called after running one step of the trial event loop.

Parameters

trials – The list of trials. Note, refrain from providing TrialRunner directly here.

abstract debug_string() str[source]

Returns a human readable message for printing to the console.

abstract restore(trial: ray.tune.trial.Trial) None[source]

Restores training state from a checkpoint.

If checkpoint is None, try to restore from trial.checkpoint. If restoring fails, the trial status will be set to ERROR.

Parameters

trial – Trial to be restored.

Returns

False if an error occurred, otherwise True.

abstract save(trial: ray.tune.trial.Trial, storage: str = 'persistent', result: Optional[Dict] = None) ray.tune.checkpoint_manager._TuneCheckpoint[source]

Saves training state of this trial to a checkpoint.

If result is None, this trial’s last result will be used.

Parameters
  • trial – The trial to be saved.

  • storage – Where to store the checkpoint. Defaults to PERSISTENT.

  • result – The state of this trial as a dictionary to be saved.

Returns

A Checkpoint object.

abstract export_trial_if_needed(trial: ray.tune.trial.Trial) Dict[source]

Exports model of this trial based on trial.export_formats.

Parameters

trial – The trial whose model should be exported.

Returns

A dict that maps ExportFormats to successfully exported models.

has_gpus() bool[source]

Returns True if GPUs are detected on the cluster.

cleanup(trials: List[ray.tune.trial.Trial]) None[source]

Ensures that trials are cleaned up after stopping.

Parameters

trials – The list of trials. Note, refrain from providing TrialRunner directly here.

set_max_pending_trials(max_pending: int) None[source]

Set the maximum number of allowed pending trials.

TrialRunner

class ray.tune.trial_runner.TrialRunner(search_alg: Optional[ray.tune.suggest.search.SearchAlgorithm] = None, scheduler: Optional[ray.tune.schedulers.trial_scheduler.TrialScheduler] = None, local_checkpoint_dir: Optional[str] = None, remote_checkpoint_dir: Optional[str] = None, sync_config: Optional[ray.tune.syncer.SyncConfig] = None, stopper: Optional[ray.tune.stopper.Stopper] = None, resume: Union[str, bool] = False, server_port: Optional[int] = None, fail_fast: bool = False, checkpoint_period: Optional[Union[str, int]] = None, trial_executor: Optional[ray.tune.ray_trial_executor.RayTrialExecutor] = None, callbacks: Optional[List[ray.tune.callback.Callback]] = None, metric: Optional[str] = None, driver_sync_trial_checkpoints: bool = False)[source]

A TrialRunner implements the event loop for scheduling trials on Ray.

The main job of TrialRunner is scheduling trials to efficiently use cluster resources, without overloading the cluster.

While Ray itself provides resource management for tasks and actors, this is not sufficient when scheduling trials that may instantiate multiple actors. This is because if insufficient resources are available, concurrent trials could deadlock waiting for new resources to become available. Furthermore, oversubscribing the cluster could degrade training performance, leading to misleading benchmark results.

Parameters
  • search_alg – SearchAlgorithm for generating Trial objects.

  • scheduler – Defaults to FIFOScheduler.

  • local_checkpoint_dir – Path where global checkpoints are stored and restored from.

  • remote_checkpoint_dir – Remote path where global checkpoints are stored and restored from. Used if resume == REMOTE.

  • sync_config – See tune.py:run.

  • stopper – Custom class for stopping whole experiments. See Stopper.

  • resume – see tune.py:run.

  • server_port – Port number for launching TuneServer.

  • fail_fast – Finishes as soon as a trial fails if True. If fail_fast=’raise’ is provided, Tune will automatically raise the exception received by the Trainable. fail_fast=’raise’ can easily leak resources and should be used with caution.

  • checkpoint_period – Trial runner checkpoint periodicity in seconds. Defaults to "auto", which adjusts checkpointing time so that at most 5% of the time is spent on writing checkpoints.

  • trial_executor – Defaults to RayTrialExecutor.

  • callbacks – List of callbacks that will be called at different times in the training loop. Must be instances of the ray.tune.trial_runner.Callback class.

  • metric – Metric used to check received results. If a result is reported without this metric, an error will be raised. The error can be avoided by not providing a metric or by setting the env variable TUNE_DISABLE_STRICT_METRIC_CHECKING=1.
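
For orientation, a hedged sketch of driving the runner’s event loop directly; normally tune.run() constructs the TrialRunner, adds trials through the search algorithm, and drives this loop for you, so the snippet below finishes immediately because no trials are added.

import ray

from ray.tune.ray_trial_executor import RayTrialExecutor
from ray.tune.trial_runner import TrialRunner

ray.init()

# Minimal sketch of the event loop shape; trials would normally be supplied
# by a SearchAlgorithm or via runner.add_trial().
runner = TrialRunner(trial_executor=RayTrialExecutor())
while not runner.is_finished():
    runner.step()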

Trial

class ray.tune.trial.Trial(trainable_name: str, config: Optional[Dict] = None, trial_id: Optional[str] = None, local_dir: Optional[str] = '/home/docs/ray_results', evaluated_params: Optional[Dict] = None, experiment_tag: str = '', resources: Optional[ray.tune.resources.Resources] = None, placement_group_factory: Optional[ray.tune.utils.placement_groups.PlacementGroupFactory] = None, stopping_criterion: Optional[Dict[str, float]] = None, remote_checkpoint_dir: Optional[str] = None, sync_function_tpl: Optional[str] = None, checkpoint_freq: int = 0, checkpoint_at_end: bool = False, sync_on_checkpoint: bool = True, keep_checkpoints_num: Optional[int] = None, checkpoint_score_attr: str = 'training_iteration', export_formats: Optional[List[str]] = None, restore_path: Optional[str] = None, trial_name_creator: Optional[Callable[[ray.tune.trial.Trial], str]] = None, trial_dirname_creator: Optional[Callable[[ray.tune.trial.Trial], str]] = None, log_to_file: Optional[str] = None, max_failures: int = 0, stub: bool = False, _setup_default_resource: bool = True)[source]

A trial object holds the state for one model training run.

Trials are themselves managed by the TrialRunner class, which implements the event loop for submitting trial runs to a Ray cluster.

Trials start in the PENDING state, and transition to RUNNING once started. On error it transitions to ERROR, otherwise TERMINATED on success.

There are resources allocated to each trial. These should be specified using PlacementGroupFactory.

trainable_name

Name of the trainable object to be executed.

config

Provided configuration dictionary with evaluated params.

trial_id

Unique identifier for the trial.

local_dir

Local_dir as passed to tune.run.

logdir

Directory where the trial logs are saved.

evaluated_params

Parameters evaluated by the search algorithm.

experiment_tag

Identifying trial name to show in the console.

status

One of PENDING, RUNNING, PAUSED, TERMINATED, ERROR.

error_file

Path to the errors that this trial has raised.

DeveloperAPI: This API may change across minor Ray releases.
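
The attributes above can be inspected on the Trial objects after a run; a small sketch (the trainable and metric names are illustrative):

from ray import tune


def train(config):
    for i in range(10):
        tune.report(metric=i)


analysis = tune.run(train, num_samples=2)

# analysis.trials holds the Trial objects that the TrialRunner managed.
for trial in analysis.trials:
    print(trial.trial_id, trial.status, trial.logdir)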

Callbacks

class ray.tune.callback.Callback[source]

Tune base callback that can be extended and passed to a TrialRunner.

Tune callbacks are called from within the TrialRunner class. There are several hooks that can be used, all of which are found in the submethod definitions of this base class.

The arguments passed via the **info dict vary between hooks and are described in the docstrings of the methods.

This example will print a metric each time a result is received:

from ray import tune
from ray.tune import Callback


class MyCallback(Callback):
    def on_trial_result(self, iteration, trials, trial, result,
                        **info):
        print(f"Got result: {result['metric']}")


def train(config):
    for i in range(10):
        tune.report(metric=i)


tune.run(
    train,
    callbacks=[MyCallback()])

PublicAPI (beta): This API is in beta and may change before becoming stable.

setup(stop: Optional[Stopper] = None, num_samples: Optional[int] = None, total_num_samples: Optional[int] = None, **info)[source]

Called once at the very beginning of training.

Any Callback setup should be added here (setting environment variables, etc.)

Parameters
  • stop – Stopping criteria. If time_budget_s was passed to tune.run, a TimeoutStopper will be passed here, either by itself or as a part of a CombinedStopper.

  • num_samples – Number of times to sample from the hyperparameter space. Defaults to 1. If grid_search is provided as an argument, the grid will be repeated num_samples times. If this is -1, (virtually) infinite samples are generated until a stopping condition is met.

  • total_num_samples – Total number of samples factoring in grid search samplers.

  • **info – Kwargs dict for forward compatibility.

on_step_begin(iteration: int, trials: List[Trial], **info)[source]

Called at the start of each tuning loop step.

Parameters
  • iteration – Number of iterations of the tuning loop.

  • trials – List of trials.

  • **info – Kwargs dict for forward compatibility.

on_step_end(iteration: int, trials: List[Trial], **info)[source]

Called at the end of each tuning loop step.

The iteration counter is increased before this hook is called.

Parameters
  • iteration – Number of iterations of the tuning loop.

  • trials – List of trials.

  • **info – Kwargs dict for forward compatibility.

on_trial_start(iteration: int, trials: List[Trial], trial: Trial, **info)[source]

Called after starting a trial instance.

Parameters
  • iteration – Number of iterations of the tuning loop.

  • trials – List of trials.

  • trial – Trial that just has been started.

  • **info – Kwargs dict for forward compatibility.

on_trial_restore(iteration: int, trials: List[Trial], trial: Trial, **info)[source]

Called after restoring a trial instance.

Parameters
  • iteration – Number of iterations of the tuning loop.

  • trials – List of trials.

  • trial – Trial that just has been restored.

  • **info – Kwargs dict for forward compatibility.

on_trial_save(iteration: int, trials: List[Trial], trial: Trial, **info)[source]

Called after receiving a checkpoint from a trial.

Parameters
  • iteration – Number of iterations of the tuning loop.

  • trials – List of trials.

  • trial – Trial that just saved a checkpoint.

  • **info – Kwargs dict for forward compatibility.

on_trial_result(iteration: int, trials: List[Trial], trial: Trial, result: Dict, **info)[source]

Called after receiving a result from a trial.

The search algorithm and scheduler are notified before this hook is called.

Parameters
  • iteration – Number of iterations of the tuning loop.

  • trials – List of trials.

  • trial – Trial that just sent a result.

  • result – Result that the trial sent.

  • **info – Kwargs dict for forward compatibility.

on_trial_complete(iteration: int, trials: List[Trial], trial: Trial, **info)[source]

Called after a trial instance completed.

The search algorithm and scheduler are notified before this hook is called.

Parameters
  • iteration – Number of iterations of the tuning loop.

  • trials – List of trials.

  • trial – Trial that just has been completed.

  • **info – Kwargs dict for forward compatibility.

on_trial_error(iteration: int, trials: List[Trial], trial: Trial, **info)[source]

Called after a trial instance failed (errored).

The search algorithm and scheduler are notified before this hook is called.

Parameters
  • iteration – Number of iterations of the tuning loop.

  • trials – List of trials.

  • trial – Trial that just has errored.

  • **info – Kwargs dict for forward compatibility.

on_checkpoint(iteration: int, trials: List[Trial], trial: Trial, checkpoint: ray.tune.checkpoint_manager._TuneCheckpoint, **info)[source]

Called after a trial saved a checkpoint with Tune.

Parameters
  • iteration – Number of iterations of the tuning loop.

  • trials – List of trials.

  • trial – Trial that just saved a checkpoint.

  • checkpoint – Checkpoint object that has been saved by the trial.

  • **info – Kwargs dict for forward compatibility.
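
As a hedged illustration of this hook, the callback below prints each checkpoint it receives; the checkpoint.value attribute is an assumption about the internal _TuneCheckpoint object and may differ between Ray versions.

from ray.tune import Callback


class CheckpointLogger(Callback):
    def on_checkpoint(self, iteration, trials, trial, checkpoint, **info):
        # checkpoint.value is assumed to hold the checkpoint path or data.
        print(f"Iteration {iteration}: trial {trial.trial_id} "
              f"saved checkpoint {checkpoint.value}")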

on_experiment_end(trials: List[Trial], **info)[source]

Called after experiment is over and all trials have concluded.

Parameters
  • trials – List of trials.

  • **info – Kwargs dict for forward compatibility.

PlacementGroupFactory

class ray.tune.utils.placement_groups.PlacementGroupFactory(bundles: List[Dict[str, Union[int, float]]], strategy: str = 'PACK', *args, **kwargs)[source]

Wrapper class that creates placement groups for trials.

This class should be used to define resource requests for Ray Tune trials. It holds the parameters to create placement groups. At a minimum, this will hold at least one bundle specifying the resource requirements for each trial:

from ray import tune

tune.run(
    train,
    resources_per_trial=tune.PlacementGroupFactory([
        {"CPU": 1, "GPU": 0.5, "custom_resource": 2}
    ]))

If the trial itself schedules further remote workers, the resource requirements should be specified in additional bundles. You can also pass the placement strategy for these bundles, e.g. to enforce co-located placement:

from ray import tune

tune.run(
    train,
    resources_per_trial=tune.PlacementGroupFactory([
        {"CPU": 1, "GPU": 0.5, "custom_resource": 2},
        {"CPU": 2},
        {"CPU": 2},
    ], strategy="PACK"))

The example above will reserve 1 CPU, 0.5 GPUs and 2 custom_resources for the trainable itself, and reserve another 2 bundles of 2 CPUs each. The trial will only start when all these resources are available. This could be used e.g. if you had one learner running in the main trainable that schedules two remote workers that need access to 2 CPUs each.

If the trainable itself doesn’t require resources, you can specify this as follows:

from ray import tune

tune.run(
    train,
    resources_per_trial=tune.PlacementGroupFactory([
        {},
        {"CPU": 2},
        {"CPU": 2},
    ], strategy="PACK"))

Parameters
  • bundles (List[Dict]) – A list of bundles which represent the resource requirements.

  • strategy (str) –

    The strategy to create the placement group.

    • "PACK": Packs Bundles into as few nodes as possible.

    • "SPREAD": Places Bundles across distinct nodes as evenly as possible.

    • "STRICT_PACK": Packs Bundles into one node. The group is not allowed to span multiple nodes.

    • "STRICT_SPREAD": Packs Bundles across distinct nodes.

  • *args – Passed to the call of placement_group()

  • **kwargs – Passed to the call of placement_group()

PublicAPI (beta): This API is in beta and may change before becoming stable.

Registry

ray.tune.register_trainable(name: str, trainable: Union[Callable, Type], warn: bool = True)[source]

Register a trainable function or class.

This enables a class or function to be accessed on every Ray process in the cluster.

Parameters
  • name – Name to register.

  • trainable – Function or tune.Trainable class. Functions must take (config, status_reporter) as arguments and will be automatically converted into a class during registration.
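
A minimal sketch of registering a function trainable and referring to it by name (the names used here are illustrative):

from ray import tune


def my_train(config):
    tune.report(score=config["x"] ** 2)


tune.register_trainable("my_trainable", my_train)

# The registered name can now be used anywhere a trainable is expected.
tune.run("my_trainable", config={"x": tune.grid_search([1, 2, 3])})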

ray.tune.register_env(name: str, env_creator: Callable)[source]

Register a custom environment for use with RLlib.

This enables the environment to be accessed on every Ray process in the cluster.

Parameters
  • name – Name to register.

  • env_creator – Callable that creates an env.
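
A minimal sketch of registering an environment creator for RLlib; the environment name and the gym dependency are illustrative:

import gym

from ray.tune import register_env


def env_creator(env_config):
    # env_config is the "env_config" dict from the RLlib trainer config.
    return gym.make("CartPole-v1")


register_env("my_cartpole", env_creator)

# The registered name can then be used as the "env" in an RLlib config,
# e.g. tune.run("PPO", config={"env": "my_cartpole"}).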