Using Trainers

Ray AIR Trainers provide a way to scale out training with popular machine learning frameworks. As part of Ray Train, Trainers enable users to run distributed multi-node training with fault tolerance. Fully integrated with the Ray ecosystem, Trainers leverage Ray Data for scalable preprocessing and performant distributed data ingestion. Trainers can also be composed with Tuners for distributed hyperparameter tuning.
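As a sketch of that composition, a Trainer instance can be passed directly to a Tuner. The trainer construction below mirrors the XGBoost examples later in this page; the search space over `max_depth` is illustrative, not prescriptive:

```python
import ray
from ray import tune
from ray.air.config import ScalingConfig
from ray.train.xgboost import XGBoostTrainer

# Build a small trainer, as in the examples that follow.
train_dataset = ray.data.from_items([{"x": x, "y": x + 1} for x in range(32)])
trainer = XGBoostTrainer(
    label_column="y",
    params={"objective": "reg:squarederror"},
    scaling_config=ScalingConfig(num_workers=2),
    datasets={"train": train_dataset},
)

# Wrap the trainer in a Tuner to search over XGBoost parameters.
tuner = tune.Tuner(
    trainer,
    param_space={"params": {"max_depth": tune.randint(2, 8)}},
    tune_config=tune.TuneConfig(num_samples=4),
)
results = tuner.fit()
```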

After training completes, Trainers output the trained model in the form of a Checkpoint, which can be used for batch or online inference.
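For instance, a checkpoint returned by a fitted trainer can be handed to a BatchPredictor for batch inference. This is a sketch: `result` is assumed to come from a `trainer.fit()` call like the ones below, and the input dataset is a placeholder:

```python
import ray
from ray.train.batch_predictor import BatchPredictor
from ray.train.xgboost import XGBoostPredictor

# `result` is assumed to be the output of an earlier `trainer.fit()` call.
batch_predictor = BatchPredictor.from_checkpoint(result.checkpoint, XGBoostPredictor)

# Run distributed inference over a Ray Dataset of features.
predictions = batch_predictor.predict(
    ray.data.from_items([{"x": x} for x in range(32)])
)
```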

AIR offers three broad categories of Trainers, covered in the sections below: deep learning trainers, tree-based trainers, and other trainers.

Trainer Basics

All trainers inherit from the BaseTrainer interface. To construct a Trainer, you can provide:

  • A scaling_config, which specifies how many parallel training workers and what type of resources (CPUs/GPUs) to use per worker during training.

  • A run_config, which configures a variety of runtime parameters such as fault tolerance, logging, and callbacks.

  • A collection of datasets to ingest from, and a preprocessor to apply to them.

  • resume_from_checkpoint, which is a checkpoint path to resume from, should your training run be interrupted.

After instantiating a Trainer, you can invoke it by calling Trainer.fit().

import ray

from ray.train.xgboost import XGBoostTrainer
from ray.air.config import ScalingConfig

train_dataset = ray.data.from_items([{"x": x, "y": x + 1} for x in range(32)])
trainer = XGBoostTrainer(
    label_column="y",
    params={"objective": "reg:squarederror"},
    scaling_config=ScalingConfig(num_workers=3),
    datasets={"train": train_dataset},
)
result = trainer.fit()

Deep Learning Trainers

Ray Train offers three main deep learning trainers: TorchTrainer, TensorflowTrainer, and HorovodTrainer.

These three trainers all take a train_loop_per_worker parameter, which is a function that defines the main training logic that runs on each training worker.

Under the hood, Ray AIR will use the provided scaling_config to instantiate the correct number of workers.

Upon instantiation, each worker will be able to reference a global Session object, which provides functionality for reporting metrics, saving checkpoints, and more.

You can provide multiple datasets to a trainer via the datasets parameter. If datasets includes a training dataset (denoted by the “train” key), it will be split into multiple shards, with each worker training on a single shard; all other datasets are left unsplit. Within a worker, you can access its shard via get_dataset_shard(), and use to_tf() or iter_torch_batches() to generate batches of TensorFlow or PyTorch tensors. You can read more about data ingest here.
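Putting these pieces together, a minimal TorchTrainer might look like the following sketch (the actual model, optimizer, and loss computation are elided):

```python
import ray
from ray.air import session
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # Each worker reads only its own shard of the "train" dataset.
    shard = session.get_dataset_shard("train")
    for epoch in range(config["num_epochs"]):
        for batch in shard.iter_torch_batches(batch_size=8):
            pass  # forward/backward pass on the PyTorch tensors in `batch`
        session.report({"epoch": epoch})

train_dataset = ray.data.from_items([{"x": x, "y": x + 1} for x in range(32)])
trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config={"num_epochs": 2},
    scaling_config=ScalingConfig(num_workers=2),
    datasets={"train": train_dataset},
)
result = trainer.fit()
```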

Read more about Ray Train’s Deep Learning Trainers.

How to report metrics and checkpoints?

During model training, you may want to save training metrics and checkpoints for downstream processing (e.g., serving the model).

Use the Session API to gather metrics and save checkpoints. Checkpoints are synced to the driver node or to cloud storage, depending on the configuration specified in Trainer(run_config=...).
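For example, inside a train_loop_per_worker you might report a metric together with a checkpoint after each epoch. This is a sketch with placeholder loss values and model state:

```python
from ray.air import session
from ray.air.checkpoint import Checkpoint

def train_loop_per_worker():
    for epoch in range(3):
        loss = 1.0 / (epoch + 1)          # placeholder for a real training step
        model_state = {"weights": [0.0]}  # placeholder for real model state
        # Metrics are logged, and the checkpoint is persisted according
        # to the RunConfig passed to the Trainer.
        session.report(
            {"loss": loss, "epoch": epoch},
            checkpoint=Checkpoint.from_dict({"model": model_state}),
        )
```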

Tree-based Trainers

Ray Train offers two main tree-based trainers: XGBoostTrainer and LightGBMTrainer.

See here for a more detailed user guide.

XGBoost Trainer

Ray AIR also provides an easy-to-use XGBoostTrainer for training XGBoost models at scale.

To use this trainer, you will need to first run: pip install -U xgboost-ray.

import ray

from ray.train.xgboost import XGBoostTrainer
from ray.air.config import ScalingConfig

train_dataset = ray.data.from_items([{"x": x, "y": x + 1} for x in range(32)])
trainer = XGBoostTrainer(
    label_column="y",
    params={"objective": "reg:squarederror"},
    scaling_config=ScalingConfig(num_workers=3),
    datasets={"train": train_dataset},
)
result = trainer.fit()

LightGBMTrainer

Similarly, Ray AIR comes with a LightGBMTrainer for training LightGBM models at scale.

To use this trainer, you will need to first run pip install -U lightgbm-ray.

import ray

from ray.train.lightgbm import LightGBMTrainer
from ray.air.config import ScalingConfig

train_dataset = ray.data.from_items([{"x": x, "y": x + 1} for x in range(32)])
trainer = LightGBMTrainer(
    label_column="y",
    params={"objective": "regression"},
    scaling_config=ScalingConfig(num_workers=3),
    datasets={"train": train_dataset},
)
result = trainer.fit()

Other Trainers

HuggingFace Trainer

HuggingFaceTrainer extends TorchTrainer and is built for interoperability with the HuggingFace Transformers library.

Users are required to provide a trainer_init_per_worker function which returns a transformers.Trainer object. The trainer_init_per_worker function will have access to preprocessed train and evaluation datasets.

Upon calling HuggingFaceTrainer.fit(), multiple workers (Ray actors) are spawned, and each worker creates its own copy of a transformers.Trainer.

Each worker then invokes transformers.Trainer.train(), which performs distributed training via PyTorch DDP.
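A sketch of this setup follows; the model name, the training arguments, and the Ray Datasets (`hf_train_ds`, `hf_eval_ds`) are placeholders, not part of any fixed API:

```python
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
from ray.air.config import ScalingConfig
from ray.train.huggingface import HuggingFaceTrainer

def trainer_init_per_worker(train_dataset, eval_dataset, **config):
    # Build an ordinary transformers.Trainer; Ray AIR handles distribution.
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")
    args = TrainingArguments(output_dir="out", num_train_epochs=1)
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )

trainer = HuggingFaceTrainer(
    trainer_init_per_worker=trainer_init_per_worker,
    scaling_config=ScalingConfig(num_workers=2),
    # `hf_train_ds` and `hf_eval_ds` are placeholder Ray Datasets.
    datasets={"train": hf_train_ds, "evaluation": hf_eval_ds},
)
result = trainer.fit()
```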

Scikit-Learn Trainer

Note

This trainer is not distributed.

The Scikit-Learn Trainer is a thin wrapper to launch scikit-learn training within Ray AIR. Even though this trainer is not distributed, you can still benefit from its integration with Ray Tune for distributed hyperparameter tuning and scalable batch/online prediction.

import ray

from ray.train.sklearn import SklearnTrainer
from sklearn.ensemble import RandomForestRegressor

train_dataset = ray.data.from_items([{"x": x, "y": x + 1} for x in range(32)])
trainer = SklearnTrainer(
    estimator=RandomForestRegressor(),
    label_column="y",
    scaling_config=ray.air.config.ScalingConfig(trainer_resources={"CPU": 4}),
    datasets={"train": train_dataset},
)
result = trainer.fit()

RLlib Trainer

RLTrainer provides an interface to RL Trainables. This enables you to use the same abstractions as in the other trainers to define the scaling behavior, and to use Ray Data for offline training.

Note that some scaling behavior still has to be defined separately: the scaling_config sets the number of training workers (“rollout workers”), while the number of, for example, evaluation workers must be specified in the config parameter of the RLTrainer:

from ray.air.config import RunConfig, ScalingConfig
from ray.train.rl import RLTrainer

trainer = RLTrainer(
    run_config=RunConfig(stop={"training_iteration": 5}),
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
    algorithm="PPO",
    config={
        "env": "CartPole-v0",
        "framework": "tf",
        "evaluation_num_workers": 1,
        "evaluation_interval": 1,
        "evaluation_config": {"input": "sampler"},
    },
)
result = trainer.fit()

How to interpret training results?

Calling Trainer.fit() returns a Result, providing you access to metrics, checkpoints, and errors. You can interact with a Result object as follows:

result = trainer.fit()

# Returns the last saved checkpoint.
result.checkpoint

# Returns the N best saved checkpoints, as configured in ``RunConfig.checkpoint_config``.
result.best_checkpoints

# Returns the final reported metrics.
result.metrics

# Returns the Exception if training failed.
result.error

# Returns a pandas DataFrame of all reported results.
result.metrics_dataframe

See the Result docstring for more details.