Ray Train workloads#

   

This tutorial series provides hands-on learning for Ray Train and its ecosystem (Ray Data, Anyscale Workspaces).
The tutorials walk through common ML workload patterns—vision, tabular, time series, generative, policy learning, and recommendation—showing how to scale them from single-node to fully distributed training and inference with minimal code changes.


Tutorial index#

1. Getting started#

  • Introduction to Ray Train
    Your starting point. Learn the basics of distributed training with PyTorch and Ray Train (a minimal end-to-end sketch follows this list):

    • Why and when to use Ray Train vs. raw Distributed Data Parallel (DDP).

    • Wrapping models/data loaders with prepare_model / prepare_data_loader.

    • Using ScalingConfig to set the number of workers and GPUs, and RunConfig for run storage and checkpointing.

    • Reporting metrics and saving checkpoints with train.report, then inspecting the results of a run.

    • Running fully distributed end-to-end training on Anyscale.
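
    The sketch below shows how these pieces fit together in one minimal script. The linear model, synthetic tensors, hyperparameters, and run name are placeholders rather than anything from the tutorial; the parts to focus on are prepare_model, prepare_data_loader, ScalingConfig, RunConfig, and train.report.

    ```python
    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    from ray import train
    from ray.train import RunConfig, ScalingConfig
    from ray.train.torch import TorchTrainer, prepare_data_loader, prepare_model


    def train_loop_per_worker(config):
        # Placeholder model and synthetic data; swap in your own.
        model = prepare_model(nn.Linear(10, 1))  # wraps in DDP and moves to this worker's device
        loader = DataLoader(
            TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1)),
            batch_size=config["batch_size"],
            shuffle=True,
        )
        loader = prepare_data_loader(loader)  # adds a DistributedSampler and device placement

        optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
        loss_fn = nn.MSELoss()

        for epoch in range(config["epochs"]):
            for features, labels in loader:
                loss = loss_fn(model(features), labels)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            # Send per-epoch metrics back to Ray Train; a checkpoint can be attached here too.
            train.report({"epoch": epoch, "loss": loss.item()})


    trainer = TorchTrainer(
        train_loop_per_worker,
        train_loop_config={"lr": 1e-3, "batch_size": 64, "epochs": 2},
        scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
        run_config=RunConfig(name="intro-ray-train"),
    )
    result = trainer.fit()
    print(result.metrics)
    ```

    Scaling out is then a matter of changing ScalingConfig (for example, num_workers=8, use_gpu=True) while the training loop itself stays the same.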


2. Workload patterns (independent, work in any order)#

  • Vision workloads
    Real-world computer vision with Food-101, preprocessing with Ray Data, fault-tolerant ResNet training, and scalable inference tasks (a Ray Data sketch follows this list).

  • Tabular workloads
    Tabular ML with the CoverType dataset, XGBoost + Ray Train, checkpoint-aware training, feature importance, and distributed inference (an XGBoostTrainer sketch follows this list).

  • Time series workloads
    New York City taxi demand forecasting with a Transformer model, scaling across GPUs, epoch-level fault tolerance, and remote inference from checkpoints (a checkpointing sketch follows this list).

  • Generative computer vision workloads
    A mini diffusion pipeline (Food-101-Lite) showcasing Ray Data preprocessing, PyTorch Lightning integration, checkpointing, and image generation (a Lightning integration sketch follows this list).

  • Policy learning workloads
    A diffusion-policy pipeline built on Gymnasium’s Pendulum-v1, with scaling across GPUs, per-epoch checkpointing, and direct policy rollouts in the notebook (a rollout sketch follows this list).

  • Recommendation system workloads
    Matrix-factorization recommendation system with MovieLens 100K, streaming batches with iter_torch_batches, a custom training loop with checkpointing, and modular separation of training, evaluation, and inference (a streaming-batches sketch follows this list).
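
    For the vision pattern, the sketch below shows one way the Ray Data pieces can look: reading images, preprocessing with map_batches, and actor-based batch inference with a callable class. The bucket path, image size, ResNet-18 weights, and concurrency value are illustrative assumptions; the tutorial's own pipeline (including its use of inference tasks) may be organized differently.

    ```python
    import torch
    import torchvision
    import ray

    # Hypothetical image folder; the tutorial streams Food-101 instead.
    ds = ray.data.read_images("s3://my-bucket/food101-sample/", size=(224, 224), mode="RGB")


    def preprocess(batch):
        # Real preprocessing (augmentation, normalization) lives in the tutorial; this only rescales.
        batch["image"] = batch["image"].astype("float32") / 255.0
        return batch


    ds = ds.map_batches(preprocess, batch_format="numpy")


    class ResNetPredictor:
        def __init__(self):
            self.model = torchvision.models.resnet18(weights="DEFAULT").eval()

        def __call__(self, batch):
            images = torch.as_tensor(batch["image"]).permute(0, 3, 1, 2)  # NHWC -> NCHW
            with torch.no_grad():
                batch["pred"] = self.model(images).argmax(dim=1).numpy()
            return batch


    # A callable class runs as a pool of actors; each actor holds one model copy.
    predictions = ds.map_batches(ResNetPredictor, batch_size=32, concurrency=2)
    predictions.show(3)
    ```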
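
    For the tabular pattern, a sketch of XGBoost + Ray Train, assuming the XGBoostTrainer form that takes label_column, params, and Ray datasets directly. The synthetic DataFrame, column names, and boosting parameters are stand-ins for the tutorial's CoverType setup.

    ```python
    import numpy as np
    import pandas as pd
    import ray
    from ray.train import ScalingConfig
    from ray.train.xgboost import XGBoostTrainer

    # Synthetic stand-in for the CoverType dataset.
    df = pd.DataFrame(np.random.rand(1000, 5), columns=[f"f{i}" for i in range(5)])
    df["target"] = np.random.randint(0, 2, size=1000)
    train_ds, valid_ds = ray.data.from_pandas(df).train_test_split(test_size=0.2)

    trainer = XGBoostTrainer(
        label_column="target",
        params={"objective": "binary:logistic", "eval_metric": "logloss"},
        datasets={"train": train_ds, "valid": valid_ds},
        num_boost_round=20,
        scaling_config=ScalingConfig(num_workers=2),
    )
    result = trainer.fit()
    print(result.metrics)
    ```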
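
    For the time series pattern, the two mechanics worth sketching are epoch-level checkpointing inside the training loop and remote inference that restores a model from a reported checkpoint. The linear model and file name below are placeholders for the tutorial's Transformer forecaster.

    ```python
    import os
    import tempfile

    import torch
    import torch.nn as nn
    import ray
    from ray import train
    from ray.train import Checkpoint, ScalingConfig
    from ray.train.torch import TorchTrainer


    def train_loop_per_worker(config):
        model = nn.Linear(8, 1)  # placeholder for the tutorial's Transformer forecaster
        for epoch in range(config["epochs"]):
            # ... one epoch of training elided ...
            with tempfile.TemporaryDirectory() as tmpdir:
                checkpoint = None
                if train.get_context().get_world_rank() == 0:
                    # Save weights every epoch so a failed run can resume from the last epoch.
                    torch.save(model.state_dict(), os.path.join(tmpdir, "model.pt"))
                    checkpoint = Checkpoint.from_directory(tmpdir)
                train.report({"epoch": epoch}, checkpoint=checkpoint)


    trainer = TorchTrainer(
        train_loop_per_worker,
        train_loop_config={"epochs": 2},
        scaling_config=ScalingConfig(num_workers=2),
    )
    result = trainer.fit()


    @ray.remote
    def predict_from_checkpoint(checkpoint):
        # Remote inference: restore the weights and run a forward pass inside a Ray task.
        model = nn.Linear(8, 1)
        with checkpoint.as_directory() as ckpt_dir:
            model.load_state_dict(torch.load(os.path.join(ckpt_dir, "model.pt")))
        model.eval()
        with torch.no_grad():
            return model(torch.randn(4, 8))


    print(ray.get(predict_from_checkpoint.remote(result.checkpoint)))
    ```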
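
    For the generative pattern, a sketch of the PyTorch Lightning integration: a Lightning Trainer configured with Ray's strategy, environment, and report callback, launched through TorchTrainer. It assumes the lightning package and uses a tiny autoencoder on random data as a placeholder for the tutorial's diffusion model.

    ```python
    import torch
    import torch.nn as nn
    import lightning.pytorch as pl
    from torch.utils.data import DataLoader, TensorDataset

    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer
    from ray.train.lightning import (
        RayDDPStrategy,
        RayLightningEnvironment,
        RayTrainReportCallback,
        prepare_trainer,
    )


    class TinyAutoencoder(pl.LightningModule):
        # Placeholder module; the tutorial trains a small diffusion model instead.
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(32, 8), nn.ReLU(), nn.Linear(8, 32))

        def training_step(self, batch, batch_idx):
            (x,) = batch
            loss = nn.functional.mse_loss(self.net(x), x)
            self.log("train_loss", loss)
            return loss

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=1e-3)


    def train_loop_per_worker(config):
        loader = DataLoader(TensorDataset(torch.randn(512, 32)), batch_size=64)
        trainer = pl.Trainer(
            max_epochs=config["epochs"],
            accelerator="auto",
            devices="auto",
            strategy=RayDDPStrategy(),             # DDP strategy that cooperates with Ray Train workers
            plugins=[RayLightningEnvironment()],   # cluster environment supplied by Ray
            callbacks=[RayTrainReportCallback()],  # forwards metrics and checkpoints to Ray Train
            enable_checkpointing=False,
            enable_progress_bar=False,
        )
        trainer = prepare_trainer(trainer)
        trainer.fit(TinyAutoencoder(), train_dataloaders=loader)


    trainer = TorchTrainer(
        train_loop_per_worker,
        train_loop_config={"epochs": 1},
        scaling_config=ScalingConfig(num_workers=2),
    )
    trainer.fit()
    ```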
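
    For the policy learning pattern, a direct rollout is just a Gymnasium environment loop; the zero action below is a placeholder for whatever the trained diffusion policy would produce from the observation.

    ```python
    import gymnasium as gym
    import numpy as np

    env = gym.make("Pendulum-v1")
    obs, info = env.reset(seed=0)
    total_reward = 0.0
    for _ in range(200):
        # Placeholder action; the trained policy would compute this from `obs`.
        action = np.zeros(env.action_space.shape, dtype=np.float32)
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += float(reward)
        if terminated or truncated:
            break
    print(f"episode return: {total_reward:.1f}")
    ```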
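
    For the recommendation pattern, a sketch of streaming Ray Data batches into a custom training loop with iter_torch_batches. The synthetic ratings, embedding sizes, and matrix-factorization model are simplified stand-ins for the tutorial's MovieLens 100K setup.

    ```python
    import numpy as np
    import torch
    import torch.nn as nn
    import ray
    from ray import train
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer, prepare_model

    # Synthetic ratings; the tutorial loads MovieLens 100K instead.
    ratings = ray.data.from_items([
        {
            "user": int(np.random.randint(0, 100)),
            "item": int(np.random.randint(0, 200)),
            "rating": float(np.random.randint(1, 6)),
        }
        for _ in range(5000)
    ])


    class MatrixFactorization(nn.Module):
        def __init__(self, n_users=100, n_items=200, dim=16):
            super().__init__()
            self.user_emb = nn.Embedding(n_users, dim)
            self.item_emb = nn.Embedding(n_items, dim)

        def forward(self, users, items):
            # Predicted rating is the dot product of user and item embeddings.
            return (self.user_emb(users) * self.item_emb(items)).sum(dim=1)


    def train_loop_per_worker(config):
        model = prepare_model(MatrixFactorization())
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
        shard = train.get_dataset_shard("train")  # this worker's slice of the dataset

        for epoch in range(config["epochs"]):
            # Stream batches as dicts of torch tensors instead of materializing the shard.
            for batch in shard.iter_torch_batches(batch_size=256):
                preds = model(batch["user"].long(), batch["item"].long())
                loss = nn.functional.mse_loss(preds, batch["rating"].float())
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            train.report({"epoch": epoch, "loss": loss.item()})


    trainer = TorchTrainer(
        train_loop_per_worker,
        train_loop_config={"epochs": 2},
        scaling_config=ScalingConfig(num_workers=2),
        datasets={"train": ratings},
    )
    result = trainer.fit()
    ```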


Key benefits#

  • Unified abstraction: One training loop works across CPU, GPU, and multi-node clusters.

  • Fault tolerance: Resume from checkpoints on failures or pre-emptions.

  • Scalability: Move from laptop prototyping to cluster-scale training with minimal code changes.

  • Observability: Access metrics, logs, and checkpoints through Ray and Anyscale tooling.

  • Flexibility: Mix and match workload patterns for real-world ML pipelines.