Ray Data Glossary#

Batch format#

The way batches of data are represented.

Set batch_format in methods like Dataset.iter_batches() and Dataset.map_batches() to specify the batch type.

>>> import ray
>>> # The streaming executor runs datasets by default and doesn't preserve
>>> # row order, so enable order preservation explicitly here.
>>> ray.data.context.DataContext.get_current().execution_options.preserve_order = True
>>> dataset = ray.data.range(10)
>>> next(iter(dataset.iter_batches(batch_format="numpy", batch_size=5)))
{'id': array([0, 1, 2, 3, 4])}
>>> next(iter(dataset.iter_batches(batch_format="pandas", batch_size=5)))
   id
0   0
1   1
2   2
3   3
4   4

To learn more about batch formats, read Configuring batch formats.

Block#

A processing unit of data. A Dataset consists of a collection of blocks.

Under the hood, Ray Data partitions records into a set of distributed data blocks. This allows it to perform operations in parallel.

Unlike a batch, which is a user-facing object, a block is an internal abstraction.
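
The partition-then-process idea can be sketched in plain Python. This is a toy illustration only, not how Ray Data implements blocks: the `partition` and `double_ids` helpers are hypothetical, and local threads stand in for distributed workers.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict, List

Record = Dict[str, Any]


def partition(records: List[Record], num_blocks: int) -> List[List[Record]]:
    """Split records into at most num_blocks contiguous, roughly equal blocks."""
    size = max(1, -(-len(records) // num_blocks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def double_ids(block: List[Record]) -> List[Record]:
    """A per-block transform: each block is processed independently."""
    return [{"id": record["id"] * 2} for record in block]


records = [{"id": i} for i in range(10)]
blocks = partition(records, num_blocks=3)

# Threads stand in for distributed workers in this toy example.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(double_ids, blocks))

block_sizes = [len(block) for block in blocks]
doubled = [record["id"] for block in results for record in block]
print(block_sizes)
print(doubled)
```

Because each block is transformed independently of the others, the work can be spread across as many workers as there are blocks.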

Block format#

The way blocks are represented.

Blocks are internally represented as Arrow tables or pandas DataFrames.

Ray Data (library)#

A library for distributed data processing.

Ray Data isn’t intended as a replacement for more general data processing systems. Its utility is as the last-mile bridge from ETL pipeline outputs to distributed ML applications and libraries in Ray.

To learn more about Ray Data, read Key Concepts.

Dataset (object)#

A class that produces a sequence of distributed data blocks.

Dataset exposes methods to read, transform, and consume data at scale.

To learn more about Datasets and the operations they support, read the Datasets API Reference.

Datasource#

A Datasource specifies how to read from and write to a variety of external storage systems and data formats.

Examples of Datasources include ParquetDatasource, ImageDatasource, TFRecordDatasource, CSVDatasource, and MongoDatasource.

To learn more about Datasources, read Creating a Custom Datasource.

Record#

A single data item, which is always a Dict[str, Any].
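
To make the shape concrete, here is a plain-Python sketch (no Ray required) of a few records and of the same data regrouped one column at a time, similar in spirit to the dictionary-of-columns layout of a "numpy" batch. The `records_to_columns` helper is hypothetical and not part of Ray Data, and it assumes every record has the same fields.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def records_to_columns(records: List[Record]) -> Dict[str, List[Any]]:
    """Regroup row-oriented records into one list of values per field."""
    columns: Dict[str, List[Any]] = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


# Each record is a single data item keyed by field name.
records: List[Record] = [
    {"id": 0, "label": "cat"},
    {"id": 1, "label": "dog"},
]
columns = records_to_columns(records)
print(columns)
```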

Schema#

The names and types of a dataset’s fields.

To determine a dataset’s schema, call Dataset.schema().
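
As a rough illustration of what a schema captures, the sketch below derives field names and Python type names from a list of records. The `infer_schema` helper is hypothetical. It is not how Ray Data infers schemas, and it reports Python types rather than the types Dataset.schema() returns.

```python
from typing import Any, Dict, List


def infer_schema(records: List[Dict[str, Any]]) -> Dict[str, str]:
    """Map each field name to the Python type name seen in the first record."""
    if not records:
        return {}
    return {name: type(value).__name__ for name, value in records[0].items()}


records = [
    {"id": 0, "label": "cat", "score": 0.9},
    {"id": 1, "label": "dog", "score": 0.4},
]
schema = infer_schema(records)
print(schema)
```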