Ray Train API#
PyTorch Ecosystem#
A Trainer for data parallel PyTorch training.
Configuration for torch process group setup.
Configuration for torch XLA setup.
PyTorch#
Gets the correct torch device configured for this process.
Gets the correct torch device list configured for this process.
Prepares the model for distributed execution.
Prepares
Limits sources of nondeterministic behavior.
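As a rough illustration of how the utilities in this section are usually combined, the sketch below wires them into a single training function. The names `TorchTrainer`, `prepare_model`, `prepare_data_loader`, `enable_reproducibility`, and `ScalingConfig` are taken from the public `ray.train` namespace as commonly documented, not from the listing above; the model, the synthetic data, and `TRAIN_LOOP_CONFIG` are placeholders.

```python
# Placeholder hyperparameters for this sketch (not part of the Ray API).
TRAIN_LOOP_CONFIG = {"lr": 0.01, "num_epochs": 2, "seed": 0}


def train_loop_per_worker(config):
    # Imports live inside the function so this module loads without Ray or torch.
    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset

    import ray.train
    import ray.train.torch

    # Limit sources of nondeterminism before building the model.
    ray.train.torch.enable_reproducibility(seed=config["seed"])

    # Move the model to this worker's device and wrap it for data parallelism.
    model = ray.train.torch.prepare_model(nn.Linear(4, 1))

    # Synthetic data; prepare_data_loader adds a distributed sampler when
    # there are several workers and moves batches to the worker's device.
    dataset = TensorDataset(torch.randn(64, 4), torch.randn(64, 1))
    loader = ray.train.torch.prepare_data_loader(
        DataLoader(dataset, batch_size=16, shuffle=True)
    )

    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    loss_fn = nn.MSELoss()

    for epoch in range(config["num_epochs"]):
        # Reshuffle per epoch when a distributed sampler is in use.
        if ray.train.get_context().get_world_size() > 1:
            loader.sampler.set_epoch(epoch)
        for features, targets in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(features), targets)
            loss.backward()
            optimizer.step()
        # Report the last batch loss of the epoch as a metric.
        ray.train.report({"epoch": epoch, "loss": loss.item()})


def build_trainer(num_workers=2, use_gpu=False):
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    return TorchTrainer(
        train_loop_per_worker,
        train_loop_config=TRAIN_LOOP_CONFIG,
        scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
    )
```

With Ray and PyTorch installed, `build_trainer().fit()` would launch the run. Nothing is started at import time, so the definitions can be inspected without a cluster.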
PyTorch Lightning#
Prepare the PyTorch Lightning Trainer for distributed execution.
Setup Lightning DDP training environment for Ray cluster.
Subclass of DDPStrategy to ensure compatibility with Ray orchestration.
Subclass of FSDPStrategy to ensure compatibility with Ray orchestration.
Subclass of DeepSpeedStrategy to ensure compatibility with Ray orchestration.
A simple callback that reports checkpoints to Ray on train epoch end.
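The pieces described in this section are typically passed to a Lightning `Trainer` together. The following is a minimal sketch, assuming the classes are importable from `ray.train.lightning` under the names `RayDDPStrategy`, `RayLightningEnvironment`, `RayTrainReportCallback`, and `prepare_trainer`, and that Lightning is installed as `lightning.pytorch`. The config keys `model_factory` and `dataloader_factory` and the helper `lightning_trainer_kwargs` are hypothetical and exist only for this example.

```python
def lightning_trainer_kwargs(max_epochs=1):
    # Plain Lightning settings that do not depend on Ray (hypothetical helper).
    return {
        "max_epochs": max_epochs,
        "accelerator": "auto",
        "devices": "auto",
        # Checkpoints are reported through the Ray callback instead.
        "enable_checkpointing": False,
    }


def train_loop_per_worker(config):
    # Imports live inside the function so this module loads without Ray or Lightning.
    import lightning.pytorch as pl
    from ray.train.lightning import (
        RayDDPStrategy,
        RayLightningEnvironment,
        RayTrainReportCallback,
        prepare_trainer,
    )

    # Hypothetical zero-argument factories supplied by the caller.
    model = config["model_factory"]()
    train_loader = config["dataloader_factory"]()

    trainer = pl.Trainer(
        strategy=RayDDPStrategy(),             # DDP strategy aware of Ray workers
        plugins=[RayLightningEnvironment()],   # cluster environment for Ray
        callbacks=[RayTrainReportCallback()],  # report checkpoints each epoch
        **lightning_trainer_kwargs(config.get("max_epochs", 1)),
    )
    # Validate and adjust the Trainer for distributed execution.
    trainer = prepare_trainer(trainer)
    trainer.fit(model, train_dataloaders=train_loader)
```

A function like this would be handed to a `TorchTrainer` as its per-worker training loop, with the two factories provided through `train_loop_config`.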
Hugging Face Transformers#
Prepare your Hugging Face Transformers Trainer for Ray Train.
A simple callback to report checkpoints and metrics to Ray Train.
More Frameworks#
TensorFlow/Keras#
A Trainer for data parallel TensorFlow training.
PublicAPI (beta): This API is in beta and may change before becoming stable.
A utility function that overrides default config for TensorFlow Dataset.
Keras callback for Ray Train reporting and checkpointing.
Horovod#
A Trainer for data parallel Horovod training.
Configurations for Horovod setup.
XGBoost#
A Trainer for data parallel XGBoost training.
XGBoost callback to save checkpoints and report metrics.
LightGBM#
A Trainer for data parallel LightGBM training.
Creates a callback that reports metrics and checkpoints model.
Ray Train Configuration#
Configurable parameters for defining the checkpointing strategy.
Class responsible for configuring Train dataset preprocessing.
Configuration related to failure handling of each training/tuning run.
Runtime configuration for training and tuning runs.
Configuration for scaling training.
Configuration object for Train/Tune file syncing to
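To make the relationship between these configuration objects concrete, here is a small sketch that builds a scaling configuration and a run configuration that nests checkpointing and failure handling. It assumes the classes are exposed as `ray.train.ScalingConfig`, `ray.train.RunConfig`, `ray.train.CheckpointConfig`, and `ray.train.FailureConfig`; the function name `make_configs`, the run name, and the metric name `loss` are placeholders chosen for the example.

```python
def make_configs(num_workers=2, use_gpu=False, storage_path=None):
    """Build (ScalingConfig, RunConfig) for a hypothetical training run."""
    # Imported here so this module loads even when Ray is not installed.
    from ray.train import (
        CheckpointConfig,
        FailureConfig,
        RunConfig,
        ScalingConfig,
    )

    # How many workers to launch and whether each one gets a GPU.
    scaling_config = ScalingConfig(num_workers=num_workers, use_gpu=use_gpu)

    run_config = RunConfig(
        name="example-run",          # placeholder run name
        storage_path=storage_path,   # None falls back to Ray's default location
        # Keep the two checkpoints with the lowest reported "loss".
        checkpoint_config=CheckpointConfig(
            num_to_keep=2,
            checkpoint_score_attribute="loss",
            checkpoint_score_order="min",
        ),
        # Retry the run once if a worker fails.
        failure_config=FailureConfig(max_failures=1),
    )
    return scaling_config, run_config
```

Both objects would then be passed to a trainer through its `scaling_config` and `run_config` arguments.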
Ray Train Utilities#
Classes
A reference to data persisted as a directory in local or remote storage.
Context containing metadata that can be accessed within Ray Train workers.
Functions
Access the latest reported checkpoint to resume from if one exists.
Get or create a singleton training context.
Returns the
Report metrics and optionally save a checkpoint.
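A common way to combine these functions is to resume from the latest checkpoint, read the worker's rank from the context, and report metrics with a fresh checkpoint each epoch. The sketch below assumes the functions are available as `ray.train.get_checkpoint`, `ray.train.get_context`, and `ray.train.report`, and that checkpoints are created with `Checkpoint.from_directory`. The helpers `save_state` and `load_next_epoch`, the file name `state.json`, and the fake loss value are hypothetical.

```python
import json
import os
import tempfile


def save_state(directory, epoch):
    # Persist the minimal state needed to resume (hypothetical file layout).
    with open(os.path.join(directory, "state.json"), "w") as f:
        json.dump({"epoch": epoch}, f)


def load_next_epoch(directory):
    # Return the epoch to start from after restoring a saved state.
    with open(os.path.join(directory, "state.json")) as f:
        return json.load(f)["epoch"] + 1


def train_loop_per_worker(config):
    # Imported here so this module loads even when Ray is not installed.
    import ray.train
    from ray.train import Checkpoint

    # Resume from the latest reported checkpoint, if one exists.
    start_epoch = 0
    checkpoint = ray.train.get_checkpoint()
    if checkpoint is not None:
        with checkpoint.as_directory() as checkpoint_dir:
            start_epoch = load_next_epoch(checkpoint_dir)

    # Worker metadata comes from the training context.
    rank = ray.train.get_context().get_world_rank()

    for epoch in range(start_epoch, config["num_epochs"]):
        metrics = {"epoch": epoch, "loss": 1.0 / (epoch + 1), "rank": rank}
        with tempfile.TemporaryDirectory() as tmp_dir:
            save_state(tmp_dir, epoch)
            # Report metrics and attach the directory as a checkpoint.
            ray.train.report(
                metrics, checkpoint=Checkpoint.from_directory(tmp_dir)
            )
```

The two file helpers are kept free of Ray imports so the resume logic can be exercised on its own.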
Ray Train Output#
The final result of an ML training run or a Tune trial.
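For orientation, the sketch below shows one way to read the fields of such a result object. It assumes the object returned by a trainer's `fit()` exposes `metrics`, `checkpoint`, `error`, and `path` attributes; `summarize_result` is a hypothetical helper, and the `SimpleNamespace` is only a stand-in so the example runs without Ray.

```python
from types import SimpleNamespace


def summarize_result(result):
    """Collect commonly used fields of a result-like object into a dict."""
    return {
        "metrics": dict(result.metrics or {}),  # last reported metrics
        "checkpoint": result.checkpoint,        # latest checkpoint, or None
        "error": repr(result.error) if result.error is not None else None,
        "path": result.path,                    # where run outputs are stored
    }


# Stand-in object for illustration; a real run would use `trainer.fit()`.
fake_result = SimpleNamespace(
    metrics={"loss": 0.5}, checkpoint=None, error=None, path="/tmp/example-run"
)
summary = summarize_result(fake_result)
print(summary["metrics"])
```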
Ray Train Errors#
Indicates a method or function was used outside of a session.
An error indicating that training has failed.
Ray Train Developer APIs#
Trainer Base Classes#
Defines interface for distributed training on Ray.
A Trainer for data parallel training.
Train Backend Base Classes#
Singleton for distributed communication backend.
Parent class for configurations of training backend.