Ray Train API#
PyTorch Ecosystem#
| API | Description |
| --- | --- |
| `TorchTrainer` | A Trainer for data parallel PyTorch training. |
| `TorchConfig` | Configuration for torch process group setup. |
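For orientation, a minimal sketch of how these two classes typically fit together; the training loop body is a placeholder:

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchConfig, TorchTrainer

def train_loop_per_worker():
    # Runs on every distributed worker; the actual training code goes here.
    ...

trainer = TorchTrainer(
    train_loop_per_worker,
    torch_config=TorchConfig(backend="gloo"),  # process group backend; use "nccl" for GPUs
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
)
result = trainer.fit()
```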
PyTorch#
| API | Description |
| --- | --- |
| `get_device` | Gets the correct torch device configured for this process. |
| `prepare_model` | Prepares the model for distributed execution. |
| `prepare_data_loader` | Prepares DataLoader for distributed execution. |
| `enable_reproducibility` | Limits sources of nondeterministic behavior. |
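A sketch of how these utilities are typically combined inside a training loop; the toy model and random dataset below are illustrative:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

import ray.train.torch

def train_loop_per_worker():
    ray.train.torch.enable_reproducibility(seed=42)  # limit nondeterminism

    # Wraps the model in DDP and moves it to this worker's device.
    model = ray.train.torch.prepare_model(torch.nn.Linear(4, 1))

    # Adds a DistributedSampler and automatic device transfer to a DataLoader.
    dataset = TensorDataset(torch.randn(64, 4), torch.randn(64, 1))
    loader = ray.train.torch.prepare_data_loader(DataLoader(dataset, batch_size=8))

    device = ray.train.torch.get_device()  # torch.device for this process
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for X, y in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(X), y)
        loss.backward()
        optimizer.step()
```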
PyTorch Lightning#
| API | Description |
| --- | --- |
| `prepare_trainer` | Prepare the PyTorch Lightning Trainer for distributed execution. |
| `RayLightningEnvironment` | Sets up the Lightning DDP training environment for a Ray cluster. |
| `RayDDPStrategy` | Subclass of DDPStrategy to ensure compatibility with Ray orchestration. |
| `RayFSDPStrategy` | Subclass of FSDPStrategy to ensure compatibility with Ray orchestration. |
| `RayDeepSpeedStrategy` | Subclass of DeepSpeedStrategy to ensure compatibility with Ray orchestration. |
| `RayTrainReportCallback` | A simple callback that reports checkpoints to Ray on train epoch end. |
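A condensed sketch of how these pieces are usually combined; `MyLightningModule` and `train_loader` are hypothetical placeholders:

```python
import pytorch_lightning as pl

from ray.train import ScalingConfig
from ray.train.lightning import (
    RayDDPStrategy,
    RayLightningEnvironment,
    RayTrainReportCallback,
    prepare_trainer,
)
from ray.train.torch import TorchTrainer

def train_loop_per_worker():
    model = MyLightningModule()  # placeholder LightningModule
    trainer = pl.Trainer(
        devices="auto",
        accelerator="auto",
        strategy=RayDDPStrategy(),             # or RayFSDPStrategy / RayDeepSpeedStrategy
        plugins=[RayLightningEnvironment()],
        callbacks=[RayTrainReportCallback()],  # report checkpoints each epoch
        enable_checkpointing=False,            # Ray Train manages checkpoints
    )
    trainer = prepare_trainer(trainer)         # validate the Trainer config for Ray
    trainer.fit(model, train_dataloaders=train_loader)  # train_loader: placeholder

ray_trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2),
)
```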
Hugging Face Transformers#
| API | Description |
| --- | --- |
| `prepare_trainer` | Prepare your HuggingFace Transformer Trainer for Ray Train. |
| `RayTrainReportCallback` | A simple callback to report checkpoints and metrics to Ray Train. |
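A minimal sketch of the usual integration pattern; `model` and `train_ds` are placeholders for a real Transformers model and dataset:

```python
import ray.train.huggingface.transformers
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker():
    from transformers import Trainer, TrainingArguments

    args = TrainingArguments(output_dir="out", num_train_epochs=1)  # illustrative args
    trainer = Trainer(model=model, args=args, train_dataset=train_ds)  # placeholders

    # Report checkpoints and metrics back to Ray Train, then validate the setup.
    trainer.add_callback(ray.train.huggingface.transformers.RayTrainReportCallback())
    trainer = ray.train.huggingface.transformers.prepare_trainer(trainer)
    trainer.train()

ray_trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2),
)
```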
More Frameworks#
Tensorflow/Keras#
| API | Description |
| --- | --- |
| `TensorflowTrainer` | A Trainer for data parallel Tensorflow training. |
| `TensorflowConfig` | PublicAPI (beta): This API is in beta and may change before becoming stable. |
| `prepare_dataset_shard` | A utility function that overrides the default config for a Tensorflow Dataset. |
| `ReportCheckpointCallback` | Keras callback for Ray Train reporting and checkpointing. |
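A minimal sketch of a Keras training loop under TensorflowTrainer; the tiny model and random arrays are illustrative:

```python
from ray.train import ScalingConfig
from ray.train.tensorflow import TensorflowTrainer
from ray.train.tensorflow.keras import ReportCheckpointCallback

def train_loop_per_worker(config):
    import numpy as np
    import tensorflow as tf

    # Build and compile the model inside the MultiWorkerMirroredStrategy scope.
    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
        model.compile(optimizer="sgd", loss="mse")

    # The callback reports metrics and checkpoints to Ray Train each epoch.
    x, y = np.random.rand(64, 4), np.random.rand(64, 1)
    model.fit(x, y, epochs=2, callbacks=[ReportCheckpointCallback()])

trainer = TensorflowTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()
```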
Horovod#
| API | Description |
| --- | --- |
| `HorovodTrainer` | A Trainer for data parallel Horovod training. |
| `HorovodConfig` | Configurations for Horovod setup. |
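A skeletal sketch of the typical shape of a Horovod run, assuming Horovod is installed; the loop body is a placeholder:

```python
from ray.train import ScalingConfig
from ray.train.horovod import HorovodConfig, HorovodTrainer

def train_loop_per_worker():
    import horovod.torch as hvd
    hvd.init()
    # ... build the model, wrap the optimizer with hvd.DistributedOptimizer, train ...

trainer = HorovodTrainer(
    train_loop_per_worker,
    horovod_config=HorovodConfig(timeout_s=300),  # Horovod setup options
    scaling_config=ScalingConfig(num_workers=2),
)
```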
XGBoost#
| API | Description |
| --- | --- |
| `XGBoostTrainer` | A Trainer for data parallel XGBoost training. |
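A minimal sketch with a toy Ray Dataset; any dataset with a label column works:

```python
import ray
from ray.train import ScalingConfig
from ray.train.xgboost import XGBoostTrainer

# Toy binary-classification dataset.
train_ds = ray.data.from_items([{"x": float(i), "y": i % 2} for i in range(100)])

trainer = XGBoostTrainer(
    label_column="y",
    params={"objective": "binary:logistic"},
    datasets={"train": train_ds},
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()
```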
LightGBM#
| API | Description |
| --- | --- |
| `LightGBMTrainer` | A Trainer for data parallel LightGBM training. |
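LightGBMTrainer mirrors the XGBoost interface; a minimal sketch with the same toy dataset:

```python
import ray
from ray.train import ScalingConfig
from ray.train.lightgbm import LightGBMTrainer

train_ds = ray.data.from_items([{"x": float(i), "y": i % 2} for i in range(100)])

trainer = LightGBMTrainer(
    label_column="y",
    params={"objective": "binary"},
    datasets={"train": train_ds},
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()
```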
Ray Train Configuration#
| API | Description |
| --- | --- |
| `CheckpointConfig` | Configurable parameters for defining the checkpointing strategy. |
| `DataConfig` | Class responsible for configuring Train dataset preprocessing. |
| `FailureConfig` | Configuration related to failure handling of each training/tuning run. |
| `RunConfig` | Runtime configuration for training and tuning runs. |
| `ScalingConfig` | Configuration for scaling training. |
| `SyncConfig` | Configuration object for Train/Tune file syncing to `RunConfig(storage_path)`. |
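A sketch of how these configs are usually composed; the run name, path, and thresholds below are illustrative values:

```python
from ray.train import CheckpointConfig, FailureConfig, RunConfig, ScalingConfig

run_config = RunConfig(
    name="my-train-run",              # illustrative run name
    storage_path="/tmp/ray_results",  # local path or cloud URI
    checkpoint_config=CheckpointConfig(
        num_to_keep=2,                # keep only the best 2 checkpoints
        checkpoint_score_attribute="loss",
        checkpoint_score_order="min",
    ),
    failure_config=FailureConfig(max_failures=3),  # retry on worker failure
)
scaling_config = ScalingConfig(num_workers=4, use_gpu=True)

# Both objects are then passed to any Trainer, e.g.:
# TorchTrainer(..., scaling_config=scaling_config, run_config=run_config)
```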
Ray Train Utilities#
Classes

| API | Description |
| --- | --- |
| `Checkpoint` | A reference to data persisted as a directory in local or remote storage. |
| `TrainContext` | Context for Ray training executions. |

Functions

| API | Description |
| --- | --- |
| `get_checkpoint` | Access the session's last checkpoint to resume from if applicable. |
| `get_context` | Get or create a singleton training context. |
| `get_dataset_shard` | Returns the `ray.data.DataIterator` shard for this worker. |
| `report` | Report metrics and optionally save a checkpoint. |
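A sketch of how these utilities fit together inside a training loop; the elided bodies and the reported loss value are placeholders:

```python
import tempfile

import ray.train
from ray.train import Checkpoint

def train_loop_per_worker():
    ctx = ray.train.get_context()  # TrainContext singleton
    rank = ctx.get_world_rank()

    # Resume from the latest checkpoint if the run was restored.
    checkpoint = ray.train.get_checkpoint()
    if checkpoint:
        with checkpoint.as_directory() as ckpt_dir:
            ...  # load model/optimizer state from ckpt_dir

    # DataIterator over this worker's shard of the "train" dataset,
    # assuming one was passed to the Trainer via datasets={"train": ...}.
    shard = ray.train.get_dataset_shard("train")

    with tempfile.TemporaryDirectory() as tmpdir:
        ...  # save training state into tmpdir
        ray.train.report(
            {"loss": 0.1, "rank": rank},                   # metrics
            checkpoint=Checkpoint.from_directory(tmpdir),  # optional checkpoint
        )
```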
Ray Train Output#
| API | Description |
| --- | --- |
| `Result` | The final result of an ML training run or a Tune trial. |
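A short sketch of inspecting a Result, assuming `trainer` is any Ray Train Trainer:

```python
result = trainer.fit()

print(result.metrics)     # last metrics dict passed to ray.train.report()
print(result.checkpoint)  # latest Checkpoint, or None if none was saved
print(result.error)       # the raised exception if the run failed, else None
```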
Ray Train Developer APIs#
Trainer Base Classes#
| API | Description |
| --- | --- |
| `BaseTrainer` | Defines the interface for distributed training on Ray. |
| `DataParallelTrainer` | A Trainer for data parallel training. |
| `GBDTTrainer` | Abstract class for scaling gradient-boosting decision tree (GBDT) frameworks. |
Train Backend Base Classes#
| API | Description |
| --- | --- |
| `Backend` | Singleton for distributed communication backend. |
| `BackendConfig` | Parent class for configurations of training backend. |
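A hypothetical sketch of what subclassing these developer APIs can look like; `MyBackend` and `MyBackendConfig` are invented names, and the hook bodies are placeholders:

```python
from ray.train.backend import Backend, BackendConfig

class MyBackend(Backend):
    """Hypothetical backend wiring up a custom communication library."""

    def on_start(self, worker_group, backend_config):
        # Called after workers start: set up process groups, env vars, etc.
        pass

    def on_shutdown(self, worker_group, backend_config):
        # Called before workers shut down: release communication resources.
        pass

class MyBackendConfig(BackendConfig):
    @property
    def backend_cls(self):
        return MyBackend
```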