Ray Data: Scalable Datasets for ML#

Ray Data scales common ML data processing patterns in batch inference and distributed training applications. Ray Data does this by providing streaming distributed transformations such as maps (map_batches), global and grouped aggregations (GroupedData), and shuffling operations (random_shuffle, sort, repartition).
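As a rough illustration of these operations, the sketch below chains a batch map, a grouped aggregation, and the shuffling operations on a small synthetic dataset. It assumes a recent Ray release in which `ray.data.range` produces rows with an `id` column; the helper names `add_features` and `run_pipeline` are made up for this example.

```python
import numpy as np


def add_features(batch: dict) -> dict:
    """Batch transform: derive two columns from the "id" column."""
    ids = batch["id"]
    return {**batch, "squared": ids ** 2, "parity": ids % 2}


def run_pipeline(num_rows: int = 1000):
    # Ray is imported lazily so the transform above can be tested without a cluster.
    import ray

    # Assumption: range() yields rows shaped like {"id": 0}, {"id": 1}, ...
    ds = ray.data.range(num_rows)

    # Map: apply the transform to batches of rows in parallel.
    ds = ds.map_batches(add_features, batch_format="numpy")

    # Grouped aggregation: count rows per parity value.
    counts = ds.groupby("parity").count()

    # Shuffling operations: global shuffle, sort, and repartition.
    reordered = ds.random_shuffle().sort("id").repartition(8)

    return counts.take_all(), reordered.take(3)
```

Calling `run_pipeline()` on a machine with Ray installed starts a local Ray instance if none is running. The exact output column names of the aggregation may differ between Ray versions.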

Read on for an overview of the main use cases and operations supported by Ray Data.


Streaming Batch Inference#

Ray Data simplifies general-purpose parallel GPU and CPU compute in Ray through its streaming Dataset primitive. Datasets enable workloads such as GPU batch inference to run efficiently on large datasets, maximizing resource utilization by streaming the working data through Ray object store memory.


As part of the Ray ecosystem, Ray Data can leverage the full functionality of Ray’s distributed scheduler, for example by using actors to amortize model setup time and to schedule GPUs. It supports data throughputs of 100 GiB/s or more for common inference workloads.
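The actor-based pattern mentioned above might look roughly like the following sketch. `ToyModel` is a hypothetical stand-in for a real model, and the pool size and batch size are arbitrary. The `compute=ray.data.ActorPoolStrategy(...)` argument reflects the API as it existed around the time this page was written; newer Ray releases also accept a `concurrency` argument, so check the API reference for your version.

```python
import numpy as np


class ToyModel:
    """Hypothetical stand-in for a real model; set up once per actor."""

    def __init__(self):
        # In a real workload, an expensive model load would happen here,
        # so it runs once per actor instead of once per batch.
        self.weight = 2.0

    def __call__(self, batch: dict) -> dict:
        # "Inference": scale each id by the model weight.
        batch["prediction"] = batch["id"] * self.weight
        return batch


def run_inference(num_rows: int = 1000):
    # Lazy import so ToyModel can be exercised without a running cluster.
    import ray

    ds = ray.data.range(num_rows)  # assumes rows with an "id" column
    predictions = ds.map_batches(
        ToyModel,
        batch_size=256,
        batch_format="numpy",
        compute=ray.data.ActorPoolStrategy(size=2),
        # num_gpus=1,  # uncomment to reserve one GPU per actor
    )
    return predictions.take(5)
```

Passing a class rather than a function is what lets Ray Data keep the model resident in long-lived actors while batches stream through them.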

To learn more about the features Ray Data supports, read the Data User Guide.

Streaming Preprocessing for ML Training#

Use Ray Data to load and preprocess data for distributed ML training pipelines in a streaming fashion. Ray Data serves as a last-mile bridge from storage or ETL pipeline outputs to distributed applications and libraries in Ray. Don’t use it as a replacement for more general data processing systems.
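A minimal sketch of this last-mile pattern, assuming a synthetic dataset in place of real storage: a preprocessing function is applied with `map_batches`, and the result is consumed batch by batch with `iter_batches`. The names `normalize` and `stream_batches` are illustrative, and the per-batch standardization is a simplification (real pipelines usually use dataset-wide statistics).

```python
import numpy as np


def normalize(batch: dict) -> dict:
    """Standardize the "id" column within a batch (illustration only)."""
    x = batch["id"].astype(np.float32)
    std = x.std()
    # Guard against division by zero for constant batches.
    scale = std if std > 0 else 1.0
    return {"x": (x - x.mean()) / scale}


def stream_batches(num_rows: int = 10_000, batch_size: int = 512):
    # Lazy import so normalize() can be tested without a running cluster.
    import ray

    # Synthetic stand-in for data loaded from storage or an ETL output.
    ds = ray.data.range(num_rows).map_batches(normalize, batch_format="numpy")

    # Batches are produced as they become ready rather than after the
    # whole dataset has been materialized.
    for batch in ds.iter_batches(batch_size=batch_size, batch_format="numpy"):
        yield batch["x"]  # hand each array to the training step
```

A training loop would iterate over `stream_batches()` and feed each array to its optimizer step.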


Where to Go from Here?#

As a new user of Ray Data, you may want to start with our Getting Started Guide. If you’ve already run your first examples, you might want to dive into Ray Data’s key concepts or our User Guide instead. Advanced users can refer directly to the Ray Data API reference for their projects.

Getting Started

Start with our quick start tutorials for working with Ray Data. These concrete examples will give you an idea of how to use Ray Data.

Key Concepts

Understand the key concepts behind Ray Data. Learn what Datasets are and how they are executed in Ray Data.


Examples

Find both simple and scaling-out examples of using Ray Data for data processing and ML ingest.

Ray Data FAQ

Find answers to commonly asked questions in our detailed FAQ.


API Reference

Get more in-depth information about the Ray Data API.

Other Data Processing Solutions

For running ETL pipelines, check out Spark-on-Ray. For scaling up your data science workloads, check out Dask-on-Ray, Modin, and Mars-on-Ray.

Datasource Compatibility#

Ray Data supports reading and writing many file formats. To view supported formats, read the Input/Output reference.
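For instance, converting CSV files to Parquet could be sketched as below. `ray.data.read_csv` and `Dataset.write_parquet` are part of the Ray Data API; the function name and the `src` and `dst` paths are placeholders for this example.

```python
def convert_csv_to_parquet(src: str, dst: str) -> None:
    """Read CSV files from `src` and write them to `dst` as Parquet."""
    # Lazy import so the module can be imported without Ray installed.
    import ray

    # `src` may be a single file or a directory of CSV files.
    ds = ray.data.read_csv(src)

    # Writes one or more Parquet files under the `dst` directory.
    ds.write_parquet(dst)
```

Other formats follow the same `read_*` and `write_*` naming pattern; the Input/Output reference lists which ones are available in your Ray version.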

If your use case isn’t supported, reach out on Discourse or open a feature request on the Ray GitHub repo. If you’re interested in building your own integration, check out our guide for implementing a custom datasource.

Learn More#


Contributions to Ray Data are welcome! There are many potential improvements, including:

  • Supporting more data sources and transforms.

  • Integration with more ecosystem libraries.

  • Performance optimizations.