Ray Datasets: Distributed Data Loading and Compute

Ray Datasets are the standard way to load and exchange data in Ray libraries and applications. They provide basic distributed data transformations such as map, filter, and repartition, and are compatible with a variety of file formats, data sources, and distributed frameworks.
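To make this concrete, here is a minimal sketch of those core transformations on a toy dataset of integers (a sketch only; any of the read APIs listed below could supply real data instead of ray.data.range):

```python
import ray

# Build a toy dataset and chain basic distributed transformations.
ds = ray.data.range(10000)            # Dataset of integers 0..9999
ds = ds.map(lambda x: x * 2)          # transform each record in parallel
ds = ds.filter(lambda x: x % 4 == 0)  # keep only matching records
ds = ds.repartition(8)                # redistribute the data into 8 blocks
print(ds.take(5))                     # fetch a few records to the driver
```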

Here’s an overview of the integrations with other processing frameworks, file formats, and supported operations, as well as a glimpse at the Ray Datasets API. Check our compatibility matrices to see whether your favorite format is already supported.

Ray Datasets simplifies general-purpose parallel GPU and CPU compute in Ray, for instance for GPU batch inference. It provides a higher-level API over Ray tasks and actors for such embarrassingly parallel workloads, internally handling operations like batching, pipelining, and memory management.

As part of the Ray ecosystem, Ray Datasets can leverage the full functionality of Ray’s distributed scheduler, e.g., using actors to amortize expensive setup time and to schedule work onto GPUs.
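Here is a sketch of how that looks for GPU batch inference; the model loader and input path below are hypothetical stand-ins for your own code and data:

```python
import ray

class BatchInferModel:
    def __init__(self):
        # Expensive setup runs once per actor, not once per batch.
        self.model = load_model()  # hypothetical model-loading helper

    def __call__(self, batch):
        return self.model(batch)

ds = ray.data.read_parquet("s3://bucket/inference-inputs")  # hypothetical path
predictions = ds.map_batches(
    BatchInferModel,
    compute="actors",  # run on a pool of long-lived actors, not one-off tasks
    batch_size=256,
    num_gpus=1,        # reserve one GPU per actor via Ray's scheduler
)
```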

Data Loading and Preprocessing for ML Training

Ray Datasets are designed to load and preprocess data for distributed ML training pipelines. Compared to other loading solutions, Datasets are more flexible (e.g., they can express higher-quality per-epoch global shuffles) and provide higher overall performance.
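For instance, a per-epoch global shuffle can be expressed as a short pipeline (a sketch; the input path is hypothetical):

```python
import ray

# Repeat the dataset for 10 epochs and reshuffle it globally on each epoch.
pipe = (
    ray.data.read_parquet("s3://bucket/training-data")  # hypothetical path
    .repeat(10)                    # one pipeline window per epoch
    .random_shuffle_each_window()  # full global shuffle, every epoch
)
for batch in pipe.iter_batches(batch_size=1024):
    ...  # feed each batch to the trainer
```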

Ray Datasets is not intended as a replacement for more general data processing systems. Learn more about how Ray Datasets works with other ETL systems.

Where to Go from Here?

As a new user of Ray Datasets, you may want to start with our Getting Started Guide. If you’ve already run your first examples, you might want to dive into Ray Datasets’ key concepts or our User Guide instead. Advanced users can consult the Ray Datasets API reference for their projects.

Getting Started

Start with our quick start tutorials for working with Datasets and Dataset Pipelines. These concrete examples will give you an idea of how to use Ray Datasets.
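For a taste of what those tutorials cover (the file name is hypothetical):

```python
import ray

ds = ray.data.read_csv("example.csv")  # hypothetical local file
print(ds.schema())  # inspect the inferred schema
print(ds.count())   # count records across all blocks
ds.show(3)          # print a few records
```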

Key Concepts

Understand the key concepts behind Ray Datasets. Learn what Datasets and Dataset Pipelines are and how they are executed in Ray.
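In brief, and as a sketch of the distinction between the two abstractions:

```python
import ray

ds = ray.data.range(100000)             # Dataset: executed eagerly as a whole
pipe = ds.window(blocks_per_window=10)  # DatasetPipeline: streamed in windows
pipe = pipe.map(lambda x: x + 1)        # applied incrementally as windows flow
```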

User Guide

Learn how to load and process data for ML, work with tensor data, or use pipelines. Run your first Dask, Spark, Mars, and Modin examples on Ray Datasets.
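As a small sketch of the framework interop covered there, assuming Dask is installed:

```python
import dask.dataframe as dd
import pandas as pd
import ray
from ray.util.dask import enable_dask_on_ray

enable_dask_on_ray()  # run Dask's task graph on the Ray scheduler

ddf = dd.from_pandas(pd.DataFrame({"a": range(100)}), npartitions=4)
ds = ray.data.from_dask(ddf)  # Dask DataFrame -> Ray Dataset
ddf2 = ds.to_dask()           # and back again
```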

API

Get more in-depth information about the Ray Datasets API.

Datasource Compatibility

Ray Datasets supports reading and writing many formats. The following two compatibility matrices will help you understand which formats are currently available.

Supported Input Formats

Input compatibility matrix

| Input Type | Read API | Status |
|---|---|---|
| CSV File Format | ray.data.read_csv() | ✅ |
| JSON File Format | ray.data.read_json() | ✅ |
| Parquet File Format | ray.data.read_parquet() | ✅ |
| Numpy File Format | ray.data.read_numpy() | ✅ |
| Text Files | ray.data.read_text() | ✅ |
| Binary Files | ray.data.read_binary_files() | ✅ |
| Python Objects | ray.data.from_items() | ✅ |
| Spark Dataframe | ray.data.from_spark() | ✅ |
| Dask Dataframe | ray.data.from_dask() | ✅ |
| Modin Dataframe | ray.data.from_modin() | ✅ |
| MARS Dataframe | ray.data.from_mars() | (todo) |
| Pandas Dataframe Objects | ray.data.from_pandas() | ✅ |
| NumPy ndarray Objects | ray.data.from_numpy() | ✅ |
| Arrow Table Objects | ray.data.from_arrow() | ✅ |
| Custom Datasource | ray.data.read_datasource() | ✅ |
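A couple of these read APIs in action (a sketch; all paths are hypothetical):

```python
import ray

csv_ds = ray.data.read_csv("s3://bucket/inputs/data.csv")        # hypothetical path
pq_ds = ray.data.read_parquet("s3://bucket/inputs/parquet-dir")  # hypothetical path
items_ds = ray.data.from_items([{"value": i} for i in range(1000)])
```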

Supported Output Formats

Output compatibility matrix

| Output Type | Dataset API | Status |
|---|---|---|
| CSV File Format | ds.write_csv() | ✅ |
| JSON File Format | ds.write_json() | ✅ |
| Parquet File Format | ds.write_parquet() | ✅ |
| Numpy File Format | ds.write_numpy() | ✅ |
| Spark Dataframe | ds.to_spark() | ✅ |
| Dask Dataframe | ds.to_dask() | ✅ |
| Modin Dataframe | ds.to_modin() | ✅ |
| MARS Dataframe | ds.to_mars() | (todo) |
| Arrow Table Objects | ds.to_arrow_refs() | ✅ |
| Arrow Table Iterator | ds.iter_batches(batch_format="pyarrow") | ✅ |
| Single Pandas Dataframe | ds.to_pandas() | ✅ |
| Pandas Dataframe Objects | ds.to_pandas_refs() | ✅ |
| NumPy ndarray Objects | ds.to_numpy_refs() | ✅ |
| Pandas Dataframe Iterator | ds.iter_batches(batch_format="pandas") | ✅ |
| PyTorch Iterable Dataset | ds.to_torch() | ✅ |
| TensorFlow Iterable Dataset | ds.to_tf() | ✅ |
| Random Access Dataset | ds.to_random_access_dataset() | ✅ |
| Custom Datasource | ds.write_datasource() | ✅ |
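A few of these output APIs together (a sketch; the toy data, output path, and "label" column are hypothetical):

```python
import pandas as pd
import ray

df = pd.DataFrame({"feature": range(1000), "label": [i % 2 for i in range(1000)]})
ds = ray.data.from_pandas(df)

ds.write_parquet("/tmp/ray-output")  # hypothetical output directory

for batch in ds.iter_batches(batch_size=128, batch_format="pandas"):
    ...  # each batch arrives as a pandas DataFrame

torch_ds = ds.to_torch(label_column="label", batch_size=128)
```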

Contribute

Contributions to Ray Datasets are welcome! There are many potential improvements, including:

  • Supporting more data sources and transforms.

  • Integration with more ecosystem libraries.

  • Adding features that require partitioning, such as groupby() and join().

  • Performance optimizations.