Ray Train API#

Important

These API references are for the revamped Ray Train V2 implementation that is available starting from Ray 2.43 by enabling the environment variable RAY_TRAIN_V2_ENABLED=1. These APIs assume that the environment variable has been enabled.

See Ray Train V1 API for the old API references and the Ray Train V2 Migration Guide.

PyTorch Ecosystem#

`TorchTrainer`	A Trainer for data parallel PyTorch training.
`TorchConfig`	Configuration for torch process group setup.
`TorchXLAConfig`	Configuration for torch XLA setup.

PyTorch#

`get_device`	Gets the correct torch device configured for the current worker.
`get_devices`	Gets the list of torch devices configured for the current worker.
`prepare_model`	Prepares the model for distributed execution.
`prepare_data_loader`	Prepares `DataLoader` for distributed execution.
`enable_reproducibility`	Limits sources of nondeterministic behavior.

PyTorch Lightning#

`prepare_trainer`	Prepare the PyTorch Lightning Trainer for distributed execution.
`RayLightningEnvironment`	Setup Lightning DDP training environment for Ray cluster.
`RayDDPStrategy`	Subclass of DDPStrategy to ensure compatibility with Ray orchestration.
`RayFSDPStrategy`	Subclass of FSDPStrategy to ensure compatibility with Ray orchestration.
`RayDeepSpeedStrategy`	Subclass of DeepSpeedStrategy to ensure compatibility with Ray orchestration.
`RayTrainReportCallback`	A simple callback that reports checkpoints to Ray on train epoch end.

Hugging Face Transformers#

`prepare_trainer`	Prepare your HuggingFace Transformer Trainer for Ray Train.
`RayTrainReportCallback`	A simple callback to report checkpoints and metrics to Ray Train.

More Frameworks#

TensorFlow/Keras#

`TensorflowTrainer`	A Trainer for data parallel Tensorflow training.
`TensorflowConfig`	PublicAPI (beta): This API is in beta and may change before becoming stable.
`prepare_dataset_shard`	A utility function that overrides default config for Tensorflow Dataset.
`ReportCheckpointCallback`	Keras callback for Ray Train reporting and checkpointing.

XGBoost#

`XGBoostTrainer`	A Trainer for distributed data-parallel XGBoost training.
`RayTrainReportCallback`	XGBoost callback to save checkpoints and report metrics.

LightGBM#

`LightGBMTrainer`	A Trainer for distributed data-parallel LightGBM training.
`get_network_params`	Returns the network parameters to enable LightGBM distributed training.
`RayTrainReportCallback`	Creates a callback that reports metrics and checkpoints model.
`normalize_pandas_for_lightgbm`	Map Arrow-backed pandas dtypes to NumPy-nullable equivalents.

JAX#

JaxTrainer

A Trainer for Single-Program Multi-Data (SPMD) JAX training.

Ray Train Configuration#

`CheckpointConfig`	Configuration for checkpointing.
`DataConfig`	Class responsible for configuring Train dataset preprocessing.
`FailureConfig`	Configuration related to failure handling of each training run.
`LoggingConfig`	Configuration for Ray Train's logging behavior.
`RunConfig`	Runtime configuration for training runs.
`ScalingConfig`	Configuration for scaling training.
`ValidationConfig`	Configuration for validation, passed to the trainer.

Ray Train Utilities#

Classes

`Checkpoint`	A reference to data persisted as a directory in local or remote storage.
`CheckpointUploadMode`	The manner in which we want to upload the checkpoint.
`CheckpointConsistencyMode`	Read semantics for checkpoint retrieval during an ongoing run.
`TrainContext`	Abstract interface for training context.
`ValidationFn`	Protocol for a function that validates a checkpoint.
`ValidationTaskConfig`	Configuration for a specific validation task, passed to report().

Functions

`get_all_reported_checkpoints`	Get all the reported checkpoints so far.
`get_checkpoint`	Access the latest reported checkpoint to resume from if one exists.
`get_context`	Get or create a singleton training context.
`get_dataset_shard`	Returns the `ray.data.DataIterator` shard for this worker.
`report`	Report metrics and optionally save a checkpoint.

Collective

`barrier`	Create a barrier across all workers.
`broadcast_from_rank_zero`	Broadcast small (<1kb) data from the rank 0 worker to all other workers.

Ray Train Output#

`ReportedCheckpoint`	A user-reported checkpoint and its associated metrics.
`ReportedCheckpointStatus`	ReportedCheckpoint status enum.
`Result`	The output of `trainer.fit()`.

Ray Train Errors#

`ControllerError`	Exception raised when training fails due to a controller error.
`WorkerGroupError`	Exception raised from the worker group during training.
`TrainingFailedError`	Exception raised when training fails from a `trainer.fit()` call.

Ray Tune Integration Utilities#

tune.integration.ray_train.TuneReportCallback

Propagate metrics and checkpoint paths from Ray Train workers to Ray Tune.

Ray Train Developer APIs#

Trainer Base Class#

DataParallelTrainer

Base class for distributed data parallel training on Ray.

Train Backend Base Classes#

`Backend`	Singleton for distributed communication backend.
`BackendConfig`	Parent class for configurations of training backend.

Trainer Callbacks#

UserCallback

Callback interface for custom user-defined callbacks to handling events during training.