Ray Train Architecture

Training a model with Ray Train involves several components. First, you provide a Trainer that matches your training framework and manages the training process. For instance, to train a PyTorch model, you use a TorchTrainer. The training load is distributed among workers on a cluster, and these workers belong to a WorkerGroup. Because each framework has its own communication protocols and exchange formats, Ray Train provides Backend implementations (e.g. TorchBackend) that a BackendExecutor uses to run the training process.

Here’s a visual overview of the architecture components of Ray Train:

../_images/train-arch.svg

Below we discuss each component in a bit more detail.

Trainer

Trainers are your main entry point to the Ray Train API. Train provides a BaseTrainer, and many framework-specific Trainers inherit from one of two classes derived from it: DataParallelTrainer (for example, the TensorFlow and Torch Trainers) or GBDTTrainer (for example, the XGBoost and LightGBM Trainers). Defining an actual Trainer, such as TorchTrainer, works as follows:

  • You pass in a function to the Trainer which defines the training logic.

  • The Trainer will create an Executor to run the distributed training.

  • The Trainer will handle callbacks based on the results from the executor.
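The three steps above can be sketched in plain Python. This is a toy illustration only: ToyTrainer, ToyExecutor, train_loop, and the callback signature are hypothetical names invented for this sketch, not Ray Train's API, and the "workers" are simulated with a local loop rather than a cluster.

```python
from typing import Any, Callable, Dict, List, Sequence

Result = Dict[str, Any]


class ToyExecutor:
    """Hypothetical executor: runs the training function once per worker."""

    def __init__(self, num_workers: int) -> None:
        self.num_workers = num_workers

    def run(self, train_fn: Callable[[int], Result]) -> List[Result]:
        # A real executor would dispatch to remote workers; this loops locally.
        return [train_fn(rank) for rank in range(self.num_workers)]


class ToyTrainer:
    """Hypothetical trainer that mirrors the three steps listed above."""

    def __init__(
        self,
        train_fn: Callable[[int], Result],
        num_workers: int,
        callbacks: Sequence[Callable[[Result], None]] = (),
    ) -> None:
        self.train_fn = train_fn  # step 1: user-supplied training logic
        self.num_workers = num_workers
        self.callbacks = list(callbacks)

    def fit(self) -> List[Result]:
        executor = ToyExecutor(self.num_workers)  # step 2: create an executor
        results = executor.run(self.train_fn)
        for result in results:  # step 3: handle callbacks on the results
            for callback in self.callbacks:
                callback(result)
        return results


def train_loop(rank: int) -> Result:
    # Placeholder "training": each worker reports a made-up loss value.
    return {"rank": rank, "loss": 1.0 / (rank + 1)}


reported: List[Result] = []
trainer = ToyTrainer(train_loop, num_workers=2, callbacks=[reported.append])
results = trainer.fit()
```

In this sketch the user only writes train_loop and calls fit(); creating the executor and invoking callbacks stay inside the trainer, which is the division of labor the list describes.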

Backend

Backends initialize and manage framework-specific communication protocols. Each training library (Torch, Horovod, TensorFlow, etc.) has a separate backend, which takes library-specific configuration values defined in a BackendConfig. Each backend comes with a BackendExecutor that is used to run the training process.
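To make the pairing of a backend with its configuration concrete, here is a minimal sketch in plain Python. ToyBackend, ToyBackendConfig, the on_start/on_shutdown hook names, and the config fields (protocol, timeout_s) are all assumptions chosen for illustration; they are not taken from Ray Train's source, and workers are modeled as plain dictionaries.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ToyBackendConfig:
    """Hypothetical configuration values for one training library."""

    protocol: str = "gloo"
    timeout_s: int = 1800


class ToyBackend:
    """Hypothetical backend that sets up and tears down communication."""

    def on_start(
        self, workers: List[Dict[str, Any]], config: ToyBackendConfig
    ) -> None:
        # Give every worker its rank plus the shared settings it would
        # need to join the communication group.
        for rank, worker in enumerate(workers):
            worker.update(
                rank=rank,
                world_size=len(workers),
                protocol=config.protocol,
                timeout_s=config.timeout_s,
            )

    def on_shutdown(self, workers: List[Dict[str, Any]]) -> None:
        # Drop the communication settings once training is over.
        for worker in workers:
            worker.clear()


workers: List[Dict[str, Any]] = [{}, {}, {}]
backend = ToyBackend()
backend.on_start(workers, ToyBackendConfig(protocol="nccl"))
started = [dict(worker) for worker in workers]  # snapshot before shutdown
backend.on_shutdown(workers)
```

Keeping the settings in a separate config object means the same backend logic can be reused with different values, which is roughly the role the text assigns to BackendConfig.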

Executor

The executor is an interface (BackendExecutor) that executes distributed training. It handles the creation of a group of workers (using Ray Actors) and is initialized with a backend. The executor passes all required resources, the number of workers, and information about worker placement to the WorkerGroup.
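The relationship described here (an executor that is initialized with a backend and hands resources, worker count, and placement information to a worker group) might be sketched as follows. All class names, the start()/run() methods, and the "PACK" default are hypothetical choices for this toy; it runs locally and does not create Ray Actors.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class ToyWorkerGroup:
    """Hypothetical record of what the executor hands to the worker group."""

    num_workers: int
    resources_per_worker: Dict[str, float]
    placement_strategy: str


class ToyBackend:
    """Hypothetical backend that only records whether setup has run."""

    def __init__(self) -> None:
        self.started = False

    def on_start(self, group: ToyWorkerGroup) -> None:
        self.started = True


class ToyExecutor:
    """Hypothetical executor: initialized with a backend, owns the workers."""

    def __init__(
        self,
        backend: ToyBackend,
        num_workers: int,
        resources_per_worker: Dict[str, float],
        placement_strategy: str = "PACK",  # assumed default for this sketch
    ) -> None:
        self.backend = backend
        self.num_workers = num_workers
        self.resources_per_worker = resources_per_worker
        self.placement_strategy = placement_strategy
        self.worker_group: Optional[ToyWorkerGroup] = None

    def start(self) -> None:
        # Pass resources, worker count, and placement info to the group,
        # then let the backend set up communication.
        self.worker_group = ToyWorkerGroup(
            self.num_workers, self.resources_per_worker, self.placement_strategy
        )
        self.backend.on_start(self.worker_group)

    def run(self, train_fn: Callable[[int], Any]) -> List[Any]:
        if self.worker_group is None:
            raise RuntimeError("call start() before run()")
        # Stand-in for remote execution: call the function once per rank.
        return [train_fn(rank) for rank in range(self.worker_group.num_workers)]


executor = ToyExecutor(ToyBackend(), num_workers=2, resources_per_worker={"CPU": 1.0})
executor.start()
outputs = executor.run(lambda rank: f"worker {rank} done")
```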

WorkerGroup

The WorkerGroup is a generic utility class for managing a group of Ray Actors. This is similar in concept to Fiber’s Ring.
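As a rough illustration of what "managing a group of workers" can mean, the sketch below runs one function on every member of a group and gathers the results in rank order. It substitutes local threads for Ray Actors, and ToyWorker, ToyWorkerGroup, execute, and execute_single are names assumed for this example rather than a description of Ray's class.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List


class ToyWorker:
    """Hypothetical worker; a thread-backed stand-in for a Ray Actor."""

    def __init__(self, rank: int) -> None:
        self.rank = rank

    def run(self, func: Callable[..., Any], *args: Any) -> Any:
        # Every call receives this worker's rank as the first argument.
        return func(self.rank, *args)


class ToyWorkerGroup:
    """Hypothetical utility for running functions on a group of workers."""

    def __init__(self, num_workers: int) -> None:
        self.workers = [ToyWorker(rank) for rank in range(num_workers)]

    def __len__(self) -> int:
        return len(self.workers)

    def execute(self, func: Callable[..., Any], *args: Any) -> List[Any]:
        # Submit the function to all workers at once, then collect the
        # results in rank order.
        with ThreadPoolExecutor(max_workers=len(self.workers)) as pool:
            futures = [pool.submit(w.run, func, *args) for w in self.workers]
            return [future.result() for future in futures]

    def execute_single(self, index: int, func: Callable[..., Any], *args: Any) -> Any:
        # Run the function on one chosen worker only.
        return self.workers[index].run(func, *args)


group = ToyWorkerGroup(num_workers=3)
squares = group.execute(lambda rank: rank * rank)
single = group.execute_single(2, lambda rank, offset: rank + offset, 10)
```

Because the group knows nothing about training, the same utility could run setup code, training functions, or cleanup, which is what makes it generic.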