ray.train.trainer.BaseTrainer.restore

classmethod BaseTrainer.restore(path: str | PathLike, storage_filesystem: pyarrow.fs.FileSystem | None = None, datasets: Dict[str, Dataset | Callable[[], Dataset]] | None = None, scaling_config: ScalingConfig | None = None, **kwargs) → BaseTrainer

Restores a Train experiment from a previously interrupted/failed run.

Restore should be used for experiment-level fault tolerance in the event that the head node crashes (e.g., OOM or some other runtime error) or the entire cluster goes down (e.g., network error affecting all nodes).

A run that has already completed successfully cannot be resumed with this API. To continue training from a successful run, launch a new run with the <Framework>Trainer(resume_from_checkpoint) API instead, passing in a checkpoint from the previous run to start from.
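For example, a minimal sketch of starting a new run from a completed run's checkpoint. This assumes a TorchTrainer, a train_func training loop, and a result object returned by the previous run's trainer.fit(); substitute your own framework trainer and loop:

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

# `result` is the ray.train.Result returned by the previous, successful run.
# `train_func` is a placeholder for your per-worker training loop.
new_trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=2),
    # Start a fresh run, initialized from the previous run's checkpoint.
    resume_from_checkpoint=result.checkpoint,
)
new_result = new_trainer.fit()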

Note

Restoring an experiment from a path that’s pointing to a different location than the original experiment path is supported. However, Ray Train assumes that the full experiment directory is available (including checkpoints) so that it’s possible to resume trials from their latest state.

For example, if the original experiment was run locally and its results were then uploaded to cloud storage, Ray Train expects the full contents to be available in cloud storage when resuming via <Framework>Trainer.restore("s3://..."). The restored run will continue writing results to that same cloud storage location.
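A hedged sketch of resuming from such an uploaded location is shown below. The bucket path, region, and use of TorchTrainer are placeholders, and `datasets` stands in for the same Ray Datasets used by the original run; the full experiment directory, including checkpoints, must already exist at that path:

import pyarrow.fs
from ray.train.torch import TorchTrainer

restored_trainer = TorchTrainer.restore(
    "s3://my-bucket/unique_experiment_name",  # hypothetical cloud location
    # Re-specify any datasets that were passed to the original trainer.
    datasets=datasets,
    # Optional: only needed if the original run used a custom filesystem.
    storage_filesystem=pyarrow.fs.S3FileSystem(region="us-west-2"),
)
restored_result = restored_trainer.fit()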

The following example can be paired with job retries via Ray Jobs to produce a Train experiment that attempts to resume on both experiment-level and trial-level failures:

import os
import ray
from ray import train
from ray.train.trainer import BaseTrainer

experiment_name = "unique_experiment_name"
storage_path = os.path.expanduser("~/ray_results")
experiment_dir = os.path.join(storage_path, experiment_name)

# Define some dummy inputs for demonstration purposes
datasets = {"train": ray.data.from_items([{"a": i} for i in range(10)])}

class CustomTrainer(BaseTrainer):
    def training_loop(self):
        pass

if CustomTrainer.can_restore(experiment_dir):
    trainer = CustomTrainer.restore(
        experiment_dir, datasets=datasets
    )
else:
    trainer = CustomTrainer(
        datasets=datasets,
        run_config=train.RunConfig(
            name=experiment_name,
            storage_path=storage_path,
            # Tip: You can also enable retries on failure for
            # worker-level fault tolerance
            failure_config=train.FailureConfig(max_failures=3),
        ),
    )

result = trainer.fit()

Parameters:
  • path – The path to the experiment directory of the training run to restore. This can be a local path or a remote URI if the experiment was uploaded to the cloud.

  • storage_filesystem – Custom pyarrow.fs.FileSystem corresponding to the path. This may be necessary if the original experiment passed in a custom filesystem.

  • datasets – Re-specified datasets used in the original training run. This must include all the datasets that were passed to the original trainer constructor.

  • scaling_config – Optionally re-specified scaling config. This can be modified to be different from the original spec (see the sketch after this parameter list).

  • **kwargs – Other optionally re-specified arguments, passed in by subclasses.
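
For instance, a minimal sketch of re-specifying the scaling config at restore time, reusing the CustomTrainer, experiment_dir, and datasets from the example above (the worker count here is arbitrary):

if CustomTrainer.can_restore(experiment_dir):
    trainer = CustomTrainer.restore(
        experiment_dir,
        datasets=datasets,
        # Scale the restored run differently from the original spec.
        scaling_config=train.ScalingConfig(num_workers=4),
    )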

Raises:

ValueError – If not all of the datasets from the original run are re-supplied on restore.

Returns:

A restored instance of the class that is calling this method.

Return type:

BaseTrainer