Ray Data: Scalable Datasets for ML

Ray Data is a scalable data processing library for ML workloads. It provides flexible and performant APIs for scaling two kinds of workloads: offline batch inference, and data preprocessing and ingest for ML training. Ray Data uses streaming execution to efficiently process large datasets.

Install Ray Data

To install Ray Data, run:

$ pip install -U 'ray[data]'

To learn more about installing Ray and its libraries, see Installing Ray.
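Once Ray Data is installed, a small pipeline can help confirm that the installation works. The following is a minimal sketch, not an official quickstart: it assumes a recent Ray 2.x release in which `ray.data.range` produces an integer `id` column, and the names `double_ids` and `run_pipeline` are hypothetical helpers introduced only for illustration.

```python
import numpy as np


def double_ids(batch: dict) -> dict:
    """Add a `doubled` column to a batch given as a dict of NumPy arrays."""
    ids = np.asarray(batch["id"])
    return {"id": ids, "doubled": ids * 2}


def run_pipeline(num_rows: int = 1000) -> list:
    """Build a tiny dataset, transform it in batches, and return a few rows."""
    # Imported lazily so that `double_ids` can be reused without Ray installed.
    import ray

    # Synthetic dataset; assumed to expose a single integer "id" column.
    ds = ray.data.range(num_rows)
    # Transformations are lazy: nothing runs until rows are requested.
    ds = ds.map_batches(double_ids, batch_format="numpy")
    # Requesting rows triggers execution and returns them as dicts.
    return ds.take(3)
```

Calling `run_pipeline()` starts a local Ray instance if none is running and returns the first three transformed rows. The same `map_batches` pattern is what a batch inference or preprocessing job would typically build on, with the toy function replaced by a model call or a real transform.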

Learn more

Ray Data Overview

Get an overview of Ray Data, the workloads that it supports, and how it compares to alternatives.


Key Concepts

Understand the key concepts behind Ray Data. Learn what Datasets are and how they’re used.

User Guides

Learn how to use Ray Data, from basic usage to end-to-end guides.


Examples

Find both simple and scaling-out examples of using Ray Data.


API Reference

Get more in-depth information about the Ray Data API.

Ray Blogs

Get the latest on engineering updates from the Ray team and how companies are using Ray Data.