Key Concepts

There are four main concepts in the Ray Train library.

  1. Trainers execute distributed training.

  2. Configuration objects are used to configure training.

  3. Checkpoints are returned as the result of training.

  4. Predictors can be used for inference and batch prediction.

[Figure: Ray Train key concepts — Trainers, Configuration, Checkpoints, and Predictors]

Trainers

Trainers are responsible for executing (distributed) training runs. The output of a Trainer run is a Result that contains metrics from the training run and the latest saved Checkpoint. Trainers can also be configured with Datasets and Preprocessors for scalable data ingest and preprocessing.
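The flow above can be sketched as follows. This is a minimal, illustrative example using the Ray AIR-era API (`TorchTrainer`, `ScalingConfig`, `session.report`); the training function body is a placeholder, not a real model:

```python
# Sketch: a Trainer executes a user-defined training function on
# multiple workers and returns a Result with metrics and a Checkpoint.
from ray.train.torch import TorchTrainer
from ray.air.config import ScalingConfig
from ray.air import session

def train_loop_per_worker(config):
    # Single-worker training logic goes here; `config` carries
    # hyperparameters passed in via train_loop_config.
    for epoch in range(config["num_epochs"]):
        session.report({"epoch": epoch})  # report metrics back to Ray Train

trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config={"num_epochs": 2},
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()  # Result: metrics + latest saved Checkpoint
```

Running this requires a Ray cluster (or a local `ray.init()`); it is shown here to illustrate the shape of the API, not as a complete training script.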

There are three categories of built-in Trainers:

Ray Train supports the following deep learning trainers:

For these trainers, you usually define your own training function that loads the model and executes single-worker training steps. Refer to the following guides for more details:

Tree-based trainers utilize gradient-boosted decision trees for training. The most popular libraries for this are XGBoost and LightGBM.

For these trainers, you simply pass in a dataset and a set of parameters; the training loop is configured automatically.
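For example, a tree-based run can be sketched like this (Ray AIR-era API; the toy dataset and XGBoost parameters are illustrative assumptions):

```python
# Sketch: for tree-based trainers, no training function is needed --
# pass a dataset and library parameters and the loop is set up for you.
import ray
from ray.train.xgboost import XGBoostTrainer
from ray.air.config import ScalingConfig

# Toy dataset with a feature column "x" and a binary label column "y".
dataset = ray.data.from_items([{"x": i, "y": i % 2} for i in range(100)])

trainer = XGBoostTrainer(
    label_column="y",
    params={"objective": "binary:logistic"},  # passed through to XGBoost
    datasets={"train": dataset},
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()
```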

Some trainers don’t fit into the other two categories, such as:

Configuration

Trainers are configured with configuration objects. There are two main configuration classes, the ScalingConfig and the RunConfig. The latter contains subconfigurations, such as the FailureConfig, SyncConfig and CheckpointConfig.

Check out the Configurations User Guide for an in-depth guide on using these configurations.
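As a small configuration fragment, the two main classes and their subconfigurations fit together like this (values are illustrative, not recommendations):

```python
# Config sketch: ScalingConfig controls compute resources;
# RunConfig bundles run-level subconfigurations.
from ray.air.config import (
    ScalingConfig,
    RunConfig,
    FailureConfig,
    CheckpointConfig,
)

scaling_config = ScalingConfig(num_workers=4, use_gpu=True)

run_config = RunConfig(
    name="my_experiment",
    failure_config=FailureConfig(max_failures=2),      # retry failed runs
    checkpoint_config=CheckpointConfig(num_to_keep=3),  # keep last 3 checkpoints
)
```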

Checkpoints

Calling Trainer.fit() returns a Result object, which includes information about the run such as the reported metrics and the saved checkpoints.

Checkpoints have the following purposes:

  • They can be passed to a Trainer to resume training from the given model state.

  • They can be used to create a Predictor / BatchPredictor for scalable batch prediction.

  • They can be deployed with Ray Serve.
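The first use case above, resuming training from a previous run's model state, can be sketched as follows (Ray AIR-era API; `train_loop_per_worker` stands for whatever training function the original run used):

```python
# Sketch: reuse the Checkpoint from a finished run to resume training.
from ray.train.torch import TorchTrainer

result = trainer.fit()          # `trainer` is any configured Trainer
checkpoint = result.checkpoint  # latest saved Checkpoint from the run

# Pass the checkpoint back into a new Trainer to continue from that state.
new_trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,  # same function as before
    resume_from_checkpoint=checkpoint,
)
```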

Predictors

Predictors are the counterpart to Trainers. A Trainer trains a model on a dataset, and a Predictor uses the resulting model to perform inference on new data.

Each Trainer has a respective Predictor implementation that is compatible with its generated checkpoints.

A Predictor can be passed into a BatchPredictor, which is used to scale up prediction across a Ray cluster. It takes a Ray Dataset as input.
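A rough sketch of that pattern, assuming a checkpoint produced by an XGBoost trainer and a Ray Dataset named `test_dataset` (both placeholders here):

```python
# Sketch: wrap a framework-specific Predictor in a BatchPredictor
# to run inference over a Ray Dataset across the cluster.
from ray.train.batch_predictor import BatchPredictor
from ray.train.xgboost import XGBoostPredictor

batch_predictor = BatchPredictor.from_checkpoint(checkpoint, XGBoostPredictor)
predictions = batch_predictor.predict(test_dataset)  # returns a Ray Dataset
```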

See the Predictors user guide for more information and examples.