Datasets: Flexible Distributed Data Loading

Tip

Datasets is available as alpha in Ray 1.6+. Please file feature requests and bug reports on GitHub Issues or join the discussion on the Ray Slack.

Ray Datasets are the standard way to load and exchange data in Ray libraries and applications. Datasets provide basic distributed data transformations such as map, filter, and repartition, and are compatible with a variety of file formats, datasources, and distributed frameworks.

[Figure: dataset.svg — Ray Datasets overview]

Concepts

Ray Datasets implement Distributed Arrow. A Dataset consists of a list of Ray object references to blocks. Each block holds a set of items in either an Arrow table, an Arrow tensor, or a Python list (for Arrow-incompatible objects). Having multiple blocks in a dataset allows for parallel transformation and ingest of the data.
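
For example, the Dataset repr reports the block count, row count, and schema. A minimal sketch, assuming num_blocks() is available in your Ray version:

import ray

# Each Dataset is backed by a list of blocks stored as Ray objects. The
# block count is also the degree of parallelism for transformations.
ds = ray.data.range(10000)
print(ds)
# -> Dataset(num_blocks=200, num_rows=10000, schema=<class 'int'>)
print(ds.num_blocks())  # assumed available; returns the number of blocks
# -> 200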

The following figure visualizes a Dataset with three Arrow table blocks, each block holding 1000 rows:

[Figure: dataset-arch.svg — a Dataset backed by three Arrow table blocks]

Since a Ray Dataset is just a list of Ray object references, it can be freely passed between Ray tasks, actors, and libraries like any other object reference. This flexibility is a unique characteristic of Ray Datasets.

Compared to Spark RDDs and Dask Bags, Datasets offer a more basic set of features and execute operations eagerly for simplicity. Users are expected to cast Datasets into more featureful dataframe types (e.g., ds.to_dask()) for advanced operations.
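
For example, a groupby (not yet provided by Datasets) can be run by handing the data off to Dask-on-Ray. A minimal sketch, assuming Dask and the Dask-on-Ray scheduler are installed and configured:

import ray
import dask
from ray.util.dask import ray_dask_get

# Route Dask's execution through Ray, then cast a Dataset to a Dask
# DataFrame for an operation Datasets does not provide (groupby).
dask.config.set(scheduler=ray_dask_get)

ds = ray.data.from_items([{"key": i % 3, "value": i} for i in range(100)])
df = ds.to_dask()
print(df.groupby("key")["value"].sum().compute())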

Datasource Compatibility Matrices

Input compatibility matrix

Input Type                 Read API                         Status
CSV File Format            ray.data.read_csv()
JSON File Format           ray.data.read_json()
Parquet File Format        ray.data.read_parquet()
NumPy File Format          ray.data.read_numpy()
Text Files                 ray.data.read_text()
Binary Files               ray.data.read_binary_files()
Python Objects             ray.data.from_items()
Spark Dataframe            ray.data.from_spark()
Dask Dataframe             ray.data.from_dask()
Modin Dataframe            ray.data.from_modin()
MARS Dataframe             ray.data.from_mars()             (todo)
Pandas Dataframe Objects   ray.data.from_pandas()
NumPy ndarray Objects      ray.data.from_numpy()
Arrow Table Objects        ray.data.from_arrow()
Custom Datasource          ray.data.read_datasource()

Output compatibility matrix

Output Type                  Dataset API                                Status
CSV File Format              ds.write_csv()
JSON File Format             ds.write_json()
Parquet File Format          ds.write_parquet()
NumPy File Format            ds.write_numpy()
Spark Dataframe              ds.to_spark()
Dask Dataframe               ds.to_dask()
Modin Dataframe              ds.to_modin()
MARS Dataframe               ds.to_mars()                               (todo)
Arrow Table Objects          ds.to_arrow()
Arrow Table Iterator         ds.iter_batches(batch_format="pyarrow")
Pandas Dataframe Objects     ds.to_pandas()
NumPy ndarray Objects        ds.to_numpy()
Pandas Dataframe Iterator    ds.iter_batches(batch_format="pandas")
PyTorch Iterable Dataset     ds.to_torch()
TensorFlow Iterable Dataset  ds.to_tf()
Custom Datasource            ds.write_datasource()

Creating Datasets

Get started by creating Datasets from synthetic data using ray.data.range() and ray.data.from_items(). Datasets can hold either plain Python objects (the schema is a Python type) or Arrow records (the schema is Arrow).

import ray

# Create a Dataset of Python objects.
ds = ray.data.range(10000)
# -> Dataset(num_blocks=200, num_rows=10000, schema=<class 'int'>)

ds.take(5)
# -> [0, 1, 2, 3, 4]

ds.count()
# -> 10000

# Create a Dataset of Arrow records.
ds = ray.data.from_items([{"col1": i, "col2": str(i)} for i in range(10000)])
# -> Dataset(num_blocks=200, num_rows=10000, schema={col1: int64, col2: string})

ds.show(5)
# -> ArrowRow({'col1': 0, 'col2': '0'})
# -> ArrowRow({'col1': 1, 'col2': '1'})
# -> ArrowRow({'col1': 2, 'col2': '2'})
# -> ArrowRow({'col1': 3, 'col2': '3'})
# -> ArrowRow({'col1': 4, 'col2': '4'})

ds.schema()
# -> col1: int64
# -> col2: string

Datasets can be created from files on local disk or remote datasources such as S3. Any filesystem supported by pyarrow can be used to specify file locations:

# Read a directory of files in remote storage.
ds = ray.data.read_csv("s3://bucket/path")

# Read multiple local files.
ds = ray.data.read_csv(["/path/to/file1", "/path/to/file2"])

# Read multiple directories.
ds = ray.data.read_csv(["s3://bucket/path1", "s3://bucket/path2"])

Finally, you can create a Dataset from existing data in the Ray object store or Ray-compatible distributed DataFrames:

import pandas as pd
import dask.dataframe as dd

# Create a Dataset from a list of Pandas DataFrame objects.
pdf = pd.DataFrame({"one": [1, 2, 3], "two": ["a", "b", "c"]})
ds = ray.data.from_pandas([ray.put(pdf)])

# Create a Dataset from a Dask-on-Ray DataFrame.
dask_df = dd.from_pandas(pdf, npartitions=10)
ds = ray.data.from_dask(dask_df)

Saving Datasets

Datasets can be written to local or remote storage using .write_csv(), .write_json(), and .write_parquet().

# Write to csv files in /tmp/output.
ray.data.range(10000).write_csv("/tmp/output")
# -> /tmp/output/data0.csv, /tmp/output/data1.csv, ...

# Use repartition to control the number of output files:
ray.data.range(10000).repartition(1).write_csv("/tmp/output2")
# -> /tmp/output2/data0.csv

You can also convert a Dataset to Ray-compatible distributed DataFrames:

# Convert a Ray Dataset into a Dask-on-Ray DataFrame.
dask_df = ds.to_dask()

Transforming Datasets

Datasets can be transformed in parallel using .map(). Transformations are executed eagerly and block until the operation is finished. Datasets also support .filter() and .flat_map().

ds = ray.data.range(10000)
ds = ds.map(lambda x: x * 2)
# -> Map Progress: 100%|████████████████████| 200/200 [00:00<00:00, 1123.54it/s]
# -> Dataset(num_blocks=200, num_rows=10000, schema=<class 'int'>)
ds.take(5)
# -> [0, 2, 4, 6, 8]

ds.filter(lambda x: x > 5).take(5)
# -> Map Progress: 100%|████████████████████| 200/200 [00:00<00:00, 1859.63it/s]
# -> [6, 8, 10, 12, 14]

ds.flat_map(lambda x: [x, -x]).take(5)
# -> Map Progress: 100%|████████████████████| 200/200 [00:00<00:00, 1568.10it/s]
# -> [0, 0, 2, -2, 4]

To take advantage of vectorized functions, use .map_batches(). Note that you can also implement filter and flat_map using .map_batches(), since your map function can return an output batch of any size.

ds = ray.data.range_arrow(10000)
ds = ds.map_batches(
    lambda df: df.applymap(lambda x: x * 2), batch_format="pandas")
# -> Map Progress: 100%|████████████████████| 200/200 [00:00<00:00, 1927.62it/s]
ds.take(5)
# -> [ArrowRow({'value': 0}), ArrowRow({'value': 2}), ...]
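
As noted above, a filter can also be expressed with .map_batches(), since the returned batch may contain any number of rows. A minimal sketch:

ds = ray.data.range_arrow(10000)
# Keep only even values; returning a smaller pandas batch than the input is allowed.
evens = ds.map_batches(
    lambda df: df[df["value"] % 2 == 0], batch_format="pandas")
evens.take(3)
# -> [ArrowRow({'value': 0}), ArrowRow({'value': 2}), ArrowRow({'value': 4})]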

By default, transformations are executed using Ray tasks. For transformations that require setup, specify compute="actors" and Ray will use an autoscaling actor pool to execute your transforms instead. The following is an end-to-end example of reading, transforming, and saving batch inference results using Datasets:

# Example of GPU batch inference on an ImageNet model.
def preprocess(image: bytes) -> bytes:
    # Placeholder preprocessing step: decode / resize the raw image bytes here.
    return image

class BatchInferModel:
    def __init__(self):
        self.model = ImageNetModel()
    def __call__(self, batch: pd.DataFrame) -> pd.DataFrame:
        return self.model(batch)

ds = ray.data.read_binary_files("s3://bucket/image-dir")

# Preprocess the data.
ds = ds.map(preprocess)
# -> Map Progress: 100%|████████████████████| 200/200 [00:00<00:00, 1123.54it/s]

# Apply GPU batch inference with actors, and assign each actor a GPU using
# ``num_gpus=1`` (any Ray remote decorator argument can be used here).
ds = ds.map_batches(BatchInferModel, compute="actors", batch_size=256, num_gpus=1)
# -> Map Progress (16 actors 4 pending): 100%|██████| 200/200 [00:07, 27.60it/s]

# Save the results.
ds.repartition(1).write_json("s3://bucket/inference-results")

Exchanging Datasets

Datasets can be passed to Ray tasks or actors and read with .iter_batches() or .iter_rows(). This does not incur a copy, since the blocks of the Dataset are passed by reference as Ray objects:

@ray.remote
def consume(data: ray.data.Dataset[int]) -> int:
    num_batches = 0
    for batch in data.iter_batches():
        num_batches += 1
    return num_batches

ds = ray.data.range(10000)
ray.get(consume.remote(ds))
# -> 200
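
Rows can also be consumed one at a time with .iter_rows():

@ray.remote
def consume_rows(data: ray.data.Dataset[int]) -> int:
    # Iterate over individual rows instead of batches.
    count = 0
    for _ in data.iter_rows():
        count += 1
    return count

ray.get(consume_rows.remote(ray.data.range(10000)))
# -> 10000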

Datasets can be split into disjoint sub-datasets. Locality-aware splitting is supported if you pass in a list of actor handles to the split() function along with the number of desired splits. This is a common pattern for loading and splitting data across distributed training actors:

@ray.remote(num_gpus=1)
class Worker:
    def __init__(self, rank: int):
        pass

    def train(self, shard: ray.data.Dataset[int]) -> int:
        for batch in shard.iter_batches(batch_size=256):
            pass
        return shard.count()

workers = [Worker.remote(i) for i in range(16)]
# -> [Actor(Worker, ...), Actor(Worker, ...), ...]

ds = ray.data.range(10000)
# -> Dataset(num_blocks=200, num_rows=10000, schema=<class 'int'>)

shards = ds.split(n=16, locality_hints=workers)
# -> [Dataset(num_blocks=13, num_rows=650, schema=<class 'int'>),
#     Dataset(num_blocks=13, num_rows=650, schema=<class 'int'>), ...]

ray.get([w.train.remote(s) for s in shards])
# -> [650, 650, ...]

Custom Datasources

Datasets can read and write in parallel to custom datasources defined in Python.

# Read from a custom datasource.
ds = ray.data.read_datasource(YourCustomDatasource(), **read_args)

# Write to a custom datasource.
ds.write_datasource(YourCustomDatasource(), **write_args)

Contributing

Contributions to Datasets are welcome! There are many potential improvements, including:

  • Supporting more datasources and transforms.

  • Integration with more ecosystem libraries.

  • Adding features that require partitioning, such as groupby() and join().

  • Performance optimizations.