Ray Data Overview#

Ray Data is a scalable data processing library for ML workloads, particularly suited for the following workloads:

  • Offline batch inference

  • Data preprocessing and ingest for ML training

It provides flexible and performant APIs for distributed data processing. For more details, see Transforming Data.

Ray Data is built on top of Ray, so it scales effectively to large clusters and offers scheduling support for both CPU and GPU resources. Ray Data uses streaming execution to efficiently process large datasets.
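
As a minimal sketch of what this looks like in code (using a small in-memory dataset rather than real files, so the snippet is self-contained):

```python
import ray

# Small in-memory dataset; for real workloads you'd typically use a reader
# such as ray.data.read_parquet("s3://...").
ds = ray.data.from_items([{"value": i} for i in range(1000)])

# Transformations are lazy; Ray Data streams blocks through them rather than
# materializing the whole dataset at once.
doubled = ds.map_batches(lambda batch: {"value": batch["value"] * 2})

# Consuming the dataset triggers streaming execution.
print(doubled.take(3))
```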

Why choose Ray Data?#

Faster and cheaper for modern deep learning applications

Ray Data is designed for deep learning applications that involve both CPU preprocessing and GPU inference. Ray Data streams working data from CPU preprocessing tasks to GPU inferencing or training tasks, allowing you to utilize both sets of resources concurrently.

By using Ray Data, your GPUs are no longer idle during CPU computation, reducing the overall cost of the batch inference job.
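
The sketch below shows this pattern, assuming a cluster with GPUs available; the bucket path, the normalization step, and the Predictor class are illustrative placeholders rather than a real model:

```python
import numpy as np
import ray

# Placeholder input path; read_images decodes images on CPU workers.
ds = ray.data.read_images("s3://my-bucket/images/")

def preprocess(batch):
    # CPU preprocessing: normalize pixel values (assumes same-sized images).
    batch["image"] = batch["image"].astype(np.float32) / 255.0
    return batch

class Predictor:
    def __init__(self):
        # Stand-in for loading a real model onto the GPU.
        self.model = lambda images: images.mean(axis=(1, 2, 3))

    def __call__(self, batch):
        # GPU inference: score each preprocessed image.
        batch["score"] = self.model(batch["image"])
        return batch

predictions = (
    ds.map_batches(preprocess)  # runs on CPU workers
      .map_batches(Predictor, concurrency=2, num_gpus=1, batch_size=64)  # runs on GPU workers
)
predictions.write_parquet("s3://my-bucket/predictions/")
```

Because the CPU stage and the GPU stage run as separate operators, batches stream from one to the other and both sets of resources stay busy at the same time.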

Cloud, framework, and data format agnostic

Ray Data has no restrictions on cloud provider, ML framework, or data format.

You can start a Ray cluster on AWS, GCP, or Azure clouds. You can use any ML framework of your choice, including PyTorch, HuggingFace, or TensorFlow. Ray Data also does not require a particular file format, and supports a wide variety of formats including Parquet, images, JSON, text, and CSV.
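
As an illustration (with placeholder paths), each loader returns a Dataset with the same downstream API regardless of the source format:

```python
import ray

# Placeholder paths; every reader produces a Ray Dataset with the same API.
parquet_ds = ray.data.read_parquet("s3://my-bucket/data/")
csv_ds = ray.data.read_csv("s3://my-bucket/data.csv")
json_ds = ray.data.read_json("s3://my-bucket/data.jsonl")
text_ds = ray.data.read_text("s3://my-bucket/data.txt")
image_ds = ray.data.read_images("s3://my-bucket/images/")
```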

Out-of-the-box scaling on heterogeneous clusters

Ray Data is built on Ray, so it easily scales on a heterogeneous cluster, which has different types of CPU and GPU machines. Code that works on one machine also runs on a large cluster without any changes.

Ray Data can easily scale to hundreds of nodes to process hundreds of TB of data.

Unified API and backend for batch inference and ML training

With Ray Data, you can express both batch inference and ML training jobs directly with the same Ray Dataset API.
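
For example, the same Dataset can feed both a batch inference step and a training loop; in this rough sketch the path and the pass-through "model" are placeholders:

```python
import ray

ds = ray.data.read_parquet("s3://my-bucket/features/")  # placeholder path

# Batch inference: apply a (here trivial, pass-through) model over every batch.
predictions = ds.map_batches(lambda batch: batch)

# ML training ingest: iterate the same dataset as framework-native batches.
for batch in ds.iter_torch_batches(batch_size=256):
    ...  # feed the batch to a training step
```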

Offline Batch Inference#

Offline batch inference is a process for generating model predictions on a fixed set of input data. Ray Data offers an efficient and scalable solution for batch inference, providing faster execution and cost-effectiveness for deep learning applications. For more details on how to use Ray Data for offline batch inference, see the batch inference user guide.

[Figure: streaming execution example (stream-example.png)]

How does Ray Data compare to other solutions for offline inference?#

Batch Services: AWS Batch, GCP Batch

Cloud providers such as AWS, GCP, and Azure provide batch services to manage compute infrastructure for you. Each service follows the same model: you provide the code, and the service runs it on each node in a cluster. However, while infrastructure management is necessary, it is often not sufficient: these services don't provide software libraries for optimized parallelization, efficient data transfer, or easy debugging. As a result, they are suitable only for experienced users who can write their own optimized batch inference code.

Ray Data abstracts away not only the infrastructure management, but also the sharding of your dataset, the parallelization of inference over those shards, and the transfer of data from storage to CPU to GPU.

Online inference solutions: BentoML, SageMaker Batch Transform

Solutions like BentoML, SageMaker Batch Transform, or Ray Serve provide APIs to make it easy to write performant inference code and can abstract away infrastructure complexities. But they are designed for online inference rather than offline batch inference, which are two different problems with different sets of requirements. These solutions introduce additional complexity like HTTP and can't effectively handle large datasets, which has led inference service providers like BentoML to integrate with Apache Spark for offline inference.

Ray Data is built for offline batch jobs, without all the extra complexities of starting servers or sending HTTP requests.

For a more detailed performance comparison between Ray Data and SageMaker Batch Transform, see Offline Batch Inference: Comparing Ray, Apache Spark, and SageMaker.

Distributed Data Processing Frameworks: Apache Spark

Ray Data handles many of the same batch processing workloads as Apache Spark, but with a streaming paradigm that is better suited for GPU workloads for deep learning inference.

Ray Data doesn’t have a SQL interface and isn’t meant as a replacement for generic ETL pipelines like Spark.

For a more detailed performance comparison between Ray Data and Apache Spark, see Offline Batch Inference: Comparing Ray, Apache Spark, and SageMaker.

Batch inference case studies#

Preprocessing and ingest for ML training#

Use Ray Data to load and preprocess data for distributed ML training pipelines in a streaming fashion. Key supported features for distributed training include:

  • Fast recovery from out-of-memory failures

  • Support for heterogeneous clusters

  • No dropped rows during distributed dataset iteration

Ray Data serves as a last-mile bridge from storage or ETL pipeline outputs to distributed applications and libraries in Ray. Use it for unstructured data processing. For more details on how to use Ray Data for preprocessing and ingest for ML training, see Data loading for ML training.
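
The sketch below shows this pattern with Ray Train on a recent Ray version; the dataset path, worker count, and training-loop body are placeholders:

```python
import ray
from ray import train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker():
    # Each training worker receives a streaming shard of the "train" dataset.
    shard = train.get_dataset_shard("train")
    for _ in range(2):  # epochs
        for batch in shard.iter_torch_batches(batch_size=128):
            ...  # run a real training step here

ds = ray.data.read_parquet("s3://my-bucket/train/")  # placeholder path

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    datasets={"train": ds},  # Ray Data handles sharding and streaming ingest
)
trainer.fit()
```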

[Figure: data loading for ML training (dataset-loading-1.svg)]

How does Ray Data compare to other solutions for ML training ingest?#

PyTorch Dataset and DataLoader
  • Framework-agnostic: Ray Data is framework-agnostic and portable between different distributed training frameworks, while Torch datasets are specific to Torch.

  • No built-in IO layer: Torch datasets do not have an I/O layer for common file formats or in-memory exchange with other frameworks; users need to bring in other libraries and roll this integration themselves.

  • Generic distributed data processing: Ray Data is more general: it can handle generic distributed operations, including global per-epoch shuffling, which would otherwise have to be implemented by stitching together two separate systems. Torch datasets would require such stitching for anything more involved than batch-based preprocessing, and do not natively support shuffling across worker shards (see the sketch after this list). See our blog post on why this shared infrastructure is important for 3rd generation ML architectures.

  • Lower overhead: Ray Data has lower overhead: it supports zero-copy exchange between processes, in contrast to the multiprocessing-based pipelines of Torch datasets.
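
To illustrate the global per-epoch shuffle mentioned above (with a placeholder path and batch size), Ray Data can reshuffle the full dataset across the cluster on every epoch:

```python
import ray

ds = ray.data.read_parquet("s3://my-bucket/train/")  # placeholder path

for epoch in range(3):
    # random_shuffle() performs a global, cluster-wide shuffle of all rows,
    # not just a per-worker or per-file shuffle.
    shuffled = ds.random_shuffle()
    for batch in shuffled.iter_batches(batch_size=256):
        ...  # feed the batch to a training step
```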

TensorFlow Dataset
  • Framework-agnostic: Ray Data is framework-agnostic and portable between different distributed training frameworks, while TensorFlow datasets are specific to TensorFlow.

  • Unified single-node and distributed: Ray Data unifies single-node and multi-node training under the same abstraction. TensorFlow datasets present separate concepts for distributed data loading, which prevents code from seamlessly scaling to larger clusters.

  • Generic distributed data processing: Ray Data is more general: it can handle generic distributed operations, including global per-epoch shuffling, which would otherwise have to be implemented by stitching together two separate systems. TensorFlow datasets would require such stitching for anything more involved than basic preprocessing, and do not natively support full shuffling across worker shards; only file interleaving is supported. See our blog post on why this shared infrastructure is important for 3rd generation ML architectures.

  • Lower overhead: Ray Data has lower overhead: it supports zero-copy exchange between processes, in contrast to the multiprocessing-based pipelines of TensorFlow datasets.

Petastorm
  • Supported data types: Petastorm only supports Parquet data, while Ray Data supports many file formats.

  • Lower overhead: Ray Data has lower overhead: it supports zero-copy exchange between processes, in contrast to the multiprocessing-based pipelines used by Petastorm.

  • No data processing: Petastorm does not expose any data processing APIs.

NVTabular
  • Supported data types: NVTabular only supports tabular (Parquet, CSV, Avro) data, while Ray Data supports many other file formats.

  • Lower overhead: Ray Data has lower overhead: it supports zero-copy exchange between processes, in contrast to the multiprocessing-based pipelines used by NVTabular.

  • Heterogeneous compute: NVTabular doesn’t support mixing heterogeneous resources in dataset transforms (e.g. both CPU and GPU transformations), while Ray Data supports this.

ML training ingest case studies#