Datasets: Distributed Data Loading and Compute


Datasets is available as beta in Ray 1.8+. Please file feature requests and bug reports on GitHub Issues or join the discussion on the Ray Slack.

Ray Datasets are the standard way to load and exchange data in Ray libraries and applications. Datasets provide basic distributed data transformations such as map, filter, and repartition, and are compatible with a variety of file formats, datasources, and distributed frameworks.


Data Loading for ML Training

Ray Datasets are designed to load and preprocess data for distributed ML training pipelines. Compared to other loading solutions, Datasets is more flexible (e.g., can express higher-quality per-epoch global shuffles) and provides higher overall performance.

Datasets is not intended as a replacement for more general data processing systems. Its utility is as the last-mile bridge from ETL pipeline outputs to distributed applications and libraries in Ray:


Ray-integrated DataFrame libraries can also be used seamlessly with Datasets, enabling a full data-to-ML pipeline to run entirely within Ray without materializing data to external storage:


See the Talks section for more Dataset ML use cases and benchmarks.

General Parallel Compute

Beyond data loading, Datasets simplifies general-purpose parallel GPU/CPU compute in Ray (e.g., for GPU batch inference). Datasets provides a higher-level API over Ray tasks and actors for such embarrassingly parallel compute, internally handling operations like batching, pipelining, and memory management.
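To illustrate the pattern that Datasets automates here, the following is a minimal pure-Python sketch (not the Ray API) of embarrassingly parallel batch compute: partition the input into batches, apply a function to each batch independently, and concatenate the results. In Ray, each per-batch call could run as a separate task or actor method.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")

def map_in_batches(items: List[T],
                   fn: Callable[[List[T]], List[U]],
                   batch_size: int) -> List[U]:
    """Apply fn to fixed-size batches; each call is independent of the
    others, so the batches could be dispatched to parallel workers."""
    out: List[U] = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        out.extend(fn(batch))
    return out

# Double every element, 3 items per batch.
result = map_in_batches(list(range(10)), lambda b: [x * 2 for x in b], 3)
print(result)  # -> [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```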


Since it is built on Ray, Datasets can leverage the full functionality of Ray’s distributed scheduler, e.g., using actors for optimizing setup time and GPU scheduling via the num_gpus argument.


Ray Datasets implement Distributed Arrow. A Dataset consists of a list of Ray object references to blocks. Each block holds a set of items in either an Arrow table or a Python list (for Arrow incompatible objects). Having multiple blocks in a dataset allows for parallel transformation and ingest of the data (e.g., into Ray Train for ML training).

The following figure visualizes a Dataset that has three Arrow table blocks, with each block holding 1000 rows:
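Conceptually, the structure in the figure can be modeled as follows. This is a simplified pure-Python sketch of the idea, not Ray's actual internals: in a real Dataset the blocks are Ray ObjectRefs to Arrow tables, so per-block work like counting rows can run in parallel across the cluster.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Block:
    rows: List[int]  # stand-in for an Arrow table or Python list

@dataclass
class SketchDataset:
    blocks: List[Block]  # in Ray, these would be ObjectRefs, not inline data

    def count(self) -> int:
        # Each block's row count is independent, so this sum
        # could be computed by parallel tasks in Ray.
        return sum(len(b.rows) for b in self.blocks)

# Three blocks of 1000 rows each, as in the figure.
ds = SketchDataset([Block(list(range(1000))) for _ in range(3)])
print(ds.count())  # -> 3000
```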


Since a Ray Dataset is just a list of Ray object references, it can be freely passed between Ray tasks, actors, and libraries like any other object reference. This flexibility is a unique characteristic of Ray Datasets.

Compared to Spark RDDs and Dask Bags, Datasets offers a more basic set of features, and executes operations eagerly for simplicity. It is intended that users cast Datasets into more featureful dataframe types (e.g., ds.to_dask()) for advanced operations.

Datasource Compatibility Matrices

Input compatibility matrix

Input Type                  Read API
CSV File Format             ray.data.read_csv()
JSON File Format            ray.data.read_json()
Parquet File Format         ray.data.read_parquet()
Numpy File Format           ray.data.read_numpy()
Text Files                  ray.data.read_text()
Binary Files                ray.data.read_binary_files()
Python Objects              ray.data.from_items()
Spark Dataframe             ray.data.from_spark()
Dask Dataframe              ray.data.from_dask()
Modin Dataframe             ray.data.from_modin()
MARS Dataframe              ray.data.from_mars()
Pandas Dataframe Objects    ray.data.from_pandas()
NumPy ndarray Objects       ray.data.from_numpy()
Arrow Table Objects         ray.data.from_arrow()
Custom Datasource           ray.data.read_datasource()

Output compatibility matrix

Output Type                  Dataset API
CSV File Format              ds.write_csv()
JSON File Format             ds.write_json()
Parquet File Format          ds.write_parquet()
Numpy File Format            ds.write_numpy()
Spark Dataframe              ds.to_spark()
Dask Dataframe               ds.to_dask()
Modin Dataframe              ds.to_modin()
MARS Dataframe               ds.to_mars()
Arrow Table Objects          ds.to_arrow_refs()
Arrow Table Iterator         ds.iter_batches(batch_format="pyarrow")
Single Pandas Dataframe      ds.to_pandas()
Pandas Dataframe Objects     ds.to_pandas_refs()
NumPy ndarray Objects        ds.to_numpy_refs()
Pandas Dataframe Iterator    ds.iter_batches(batch_format="pandas")
PyTorch Iterable Dataset     ds.to_torch()
TensorFlow Iterable Dataset  ds.to_tf()
Custom Datasource            ds.write_datasource()


Creating Datasets


Run pip install "ray[data]" to get started!

Get started by creating Datasets from synthetic data using ray.data.range() and ray.data.from_items(). Datasets can hold either plain Python objects (schema is a Python type) or Arrow records (schema is Arrow).

import ray

# Create a Dataset of Python objects.
ds = ray.data.range(10000)
# -> Dataset(num_blocks=200, num_rows=10000, schema=<class 'int'>)

ds.take(5)
# -> [0, 1, 2, 3, 4]

ds.count()
# -> 10000

# Create a Dataset of Arrow records.
ds = ray.data.from_items([{"col1": i, "col2": str(i)} for i in range(10000)])
# -> Dataset(num_blocks=200, num_rows=10000, schema={col1: int64, col2: string})

ds.show(5)
# -> {'col1': 0, 'col2': '0'}
# -> {'col1': 1, 'col2': '1'}
# -> {'col1': 2, 'col2': '2'}
# -> {'col1': 3, 'col2': '3'}
# -> {'col1': 4, 'col2': '4'}

ds.schema()
# -> col1: int64
# -> col2: string

Datasets can be created from files on local disk or remote datasources such as S3. Any filesystem supported by pyarrow can be used to specify file locations:

# Read a directory of files in remote storage.
ds = ray.data.read_csv("s3://bucket/path")

# Read multiple local files.
ds = ray.data.read_csv(["/path/to/file1", "/path/to/file2"])

# Read multiple directories.
ds = ray.data.read_csv(["s3://bucket/path1", "s3://bucket/path2"])

Finally, you can create a Dataset from existing data in the Ray object store or Ray-compatible distributed DataFrames:

import pandas as pd
import dask.dataframe as dd

# Create a Dataset from a list of Pandas DataFrame objects.
pdf = pd.DataFrame({"one": [1, 2, 3], "two": ["a", "b", "c"]})
ds = ray.data.from_pandas([pdf])

# Create a Dataset from a Dask-on-Ray DataFrame.
dask_df = dd.from_pandas(pdf, npartitions=10)
ds = ray.data.from_dask(dask_df)

Saving Datasets

Datasets can be written to local or remote storage using .write_csv(), .write_json(), and .write_parquet().

# Write to csv files in /tmp/output.
ds.write_csv("/tmp/output")
# -> /tmp/output/data0.csv, /tmp/output/data1.csv, ...

# Use repartition to control the number of output files:
ds.repartition(1).write_csv("/tmp/output2")
# -> /tmp/output2/data0.csv

You can also convert a Dataset to Ray-compatible distributed DataFrames:

# Convert a Ray Dataset into a Dask-on-Ray DataFrame.
dask_df = ds.to_dask()

Transforming Datasets

Datasets can be transformed in parallel using .map(). Transformations are executed eagerly and block until the operation is finished. Datasets also supports .filter() and .flat_map().

ds = ray.data.range(10000)
ds = x: x * 2)
# -> Map Progress: 100%|████████████████████| 200/200 [00:00<00:00, 1123.54it/s]
# -> Dataset(num_blocks=200, num_rows=10000, schema=<class 'int'>)

ds.take(5)
# -> [0, 2, 4, 6, 8]

ds.filter(lambda x: x > 5).take(5)
# -> Map Progress: 100%|████████████████████| 200/200 [00:00<00:00, 1859.63it/s]
# -> [6, 8, 10, 12, 14]

ds.flat_map(lambda x: [x, -x]).take(5)
# -> Map Progress: 100%|████████████████████| 200/200 [00:00<00:00, 1568.10it/s]
# -> [0, 0, 2, -2, 4]

To take advantage of vectorized functions, use .map_batches(). Note that you can also implement filter and flat_map using .map_batches(), since your map function can return an output batch of any size.

ds = ray.data.range_arrow(10000)
ds = ds.map_batches(
    lambda df: df.applymap(lambda x: x * 2), batch_format="pandas")
# -> Map Progress: 100%|████████████████████| 200/200 [00:00<00:00, 1927.62it/s]

ds.take(5)
# -> [{'value': 0}, {'value': 2}, ...]
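Because a batch function may return an output batch of any size, filter semantics fall out of the batched map naturally. Here is a minimal pure-Python sketch of that idea using plain lists rather than the Ray API:

```python
def filter_batch(batch):
    # Returning fewer items than we received implements filter.
    return [x for x in batch if x > 5]

# Three input batches; mapping filter_batch over each batch and
# flattening yields the same result as a row-wise filter.
batches = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
result = [x for b in batches for x in filter_batch(b)]
print(result)  # -> [6, 7, 8, 9, 10, 11]
```

Flat-map works the same way in reverse: the batch function returns more items than it received.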

By default, transformations are executed using Ray tasks. For transformations that require setup, specify compute="actors" and Ray will use an autoscaling actor pool to execute your transforms instead. The following is an end-to-end example of reading, transforming, and saving batch inference results using Datasets:

# Example of GPU batch inference on an ImageNet model.
def preprocess(image: bytes) -> bytes:
    return image

class BatchInferModel:
    def __init__(self):
        self.model = ImageNetModel()
    def __call__(self, batch: pd.DataFrame) -> pd.DataFrame:
        return self.model(batch)

ds = ray.data.read_binary_files("s3://bucket/image-dir")

# Preprocess the data.
ds =
# -> Map Progress: 100%|████████████████████| 200/200 [00:00<00:00, 1123.54it/s]

# Apply GPU batch inference with actors, and assign each actor a GPU using
# ``num_gpus=1`` (any Ray remote decorator argument can be used here).
ds = ds.map_batches(BatchInferModel, compute="actors", batch_size=256, num_gpus=1)
# -> Map Progress (16 actors 4 pending): 100%|██████| 200/200 [00:07, 27.60it/s]

# Save the results.
ds.write_parquet("s3://bucket/output")

Exchanging datasets

Datasets can be passed to Ray tasks or actors and read with .iter_batches() or .iter_rows(). This does not incur a copy, since the blocks of the Dataset are passed by reference as Ray objects:

from import Dataset

@ray.remote
def consume(data: Dataset[int]) -> int:
    num_batches = 0
    for batch in data.iter_batches():
        num_batches += 1
    return num_batches

ds = ray.data.range(10000)
ray.get(consume.remote(ds))
# -> 200

Datasets can be split up into disjoint sub-datasets. Locality-aware splitting is supported if you pass in a list of actor handles to the split() function along with the number of desired splits. This is a common pattern useful for loading and splitting data between distributed training actors:

@ray.remote
class Worker:
    def __init__(self, rank: int):
        self.rank = rank

    def train(self, shard:[int]) -> int:
        for batch in shard.iter_batches(batch_size=256):
            pass  # process the batch here
        return shard.count()

workers = [Worker.remote(i) for i in range(16)]
# -> [Actor(Worker, ...), Actor(Worker, ...), ...]

ds = ray.data.range(10000)
# -> Dataset(num_blocks=200, num_rows=10000, schema=<class 'int'>)

shards = ds.split(n=16, locality_hints=workers)
# -> [Dataset(num_blocks=13, num_rows=650, schema=<class 'int'>),
#     Dataset(num_blocks=13, num_rows=650, schema=<class 'int'>), ...]

ray.get([w.train.remote(s) for s in shards])
# -> [650, 650, ...]
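Ignoring locality hints and block balancing, the core of split(n) can be sketched in pure Python (this is an illustration of the idea, not Ray's implementation): distribute the Dataset's blocks round-robin into n disjoint shards.

```python
from typing import List

def split_blocks(blocks: List, n: int) -> List[List]:
    """Round-robin a list of blocks into n disjoint shards."""
    shards: List[List] = [[] for _ in range(n)]
    for i, block in enumerate(blocks):
        shards[i % n].append(block)
    return shards

# 200 blocks split across 16 workers -> shards of 13 or 12 blocks,
# matching the block counts shown in the example above.
shards = split_blocks(list(range(200)), 16)
print([len(s) for s in shards][:3])  # -> [13, 13, 13]
```

Ray's actual split() additionally uses the locality_hints actor handles to co-locate each shard's blocks with the worker that will consume them.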

Custom datasources

Datasets can read and write in parallel to custom datasources defined in Python.

# Read from a custom datasource.
ds = ray.data.read_datasource(YourCustomDatasource(), **read_args)

# Write to a custom datasource.
ds.write_datasource(YourCustomDatasource(), **write_args)

Talks and Materials


Contributing

Contributions to Datasets are welcome! There are many potential improvements, including:

  • Supporting more datasources and transforms.

  • Integration with more ecosystem libraries.

  • Adding features that require partitioning, such as groupby() and join().

  • Performance optimizations.