Ray Data: Scalable Datasets for ML#

Ray Data is a scalable data processing library for ML and AI workloads built on Ray. Ray Data provides flexible and performant APIs for expressing AI workloads such as batch inference, data preprocessing, and ingest for ML training. Unlike other distributed data systems, Ray Data features a streaming execution to efficiently process large datasets and maintain high utilization across both CPU and GPU workloads.

Why choose Ray Data?#

Modern AI workloads revolve around the usage of deep learning models, which are computationally intensive and often require specialized hardware such as GPUs. Unlike CPUs, GPUs often come with less memory, have different semantics for scheduling, and are much more expensive to run. Systems built to support traditional data processing pipelines often don’t utilize such resources well.

Ray Data supports AI workloads as a first-class citizen and offers several key advantages:

  • Faster and cheaper for deep learning: Ray Data streams data between CPU preprocessing and GPU inference/training tasks, maximizing resource utilization and reducing costs by keeping GPUs active.

  • Framework friendly: Ray Data provides performant, first-class integration with common AI frameworks (vLLM, PyTorch, HuggingFace, TensorFlow) and common cloud providers (AWS, GCP, Azure)

  • Support for multi-modal data: Ray Data leverages Apache Arrow and Pandas and provides support for many data formats used in ML workloads such as Parquet, Lance, images, JSON, CSV, audio, video, and more.

  • Scalable by default: Built on Ray for automatic scaling across heterogeneous clusters with different CPU and GPU machines. Code runs unchanged from one machine to hundreds of nodes processing hundreds of TB of data.

Install Ray Data#

To install Ray Data, run:

$ pip install -U 'ray[data]'

To learn more about installing Ray and its libraries, see Installing Ray.

Learn more#

Quickstart

Get started with Ray Data with a simple example.

Key Concepts

Learn the key concepts behind Ray Data. Learn what Datasets are and how they’re used.

User Guides

Learn how to use Ray Data, from basic usage to end-to-end guides.

Examples

Find both simple and scaling-out examples of using Ray Data.

API

Get more in-depth information about the Ray Data API.

Case studies for Ray Data#

Training ingest using Ray Data

Batch inference using Ray Data