Ray Datasets Glossary#

Batch format#

The way batches of data are represented.

Set batch_format in methods like Dataset.iter_batches() and Dataset.map_batches() to specify the batch type.

>>> import ray
>>> # Dataset is executed by streaming executor by default, which doesn't
>>> # preserve the order, so we explicitly set it here.
>>> ray.data.context.DatasetContext.get_current().execution_options.preserve_order = True
>>> dataset = ray.data.range_table(10)
>>> next(iter(dataset.iter_batches(batch_format="numpy", batch_size=5)))
{'value': array([0, 1, 2, 3, 4])}
>>> next(iter(dataset.iter_batches(batch_format="pandas", batch_size=5)))
   value
0      0
1      1
2      2
3      3
4      4

To learn more about batch formats, read UDF Input Batch Formats.

Block#

A processing unit of data. A Dataset consists of a collection of blocks.

Under the hood, Datasets partition records into a set of distributed data blocks. This allows Datasets to perform operations in parallel.

Unlike a batch, which is a user-facing object, a block is an internal abstraction.
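For example, the number of blocks backing a dataset appears in its repr (a sketch; the exact repr varies across Ray versions, so the output is marked as skipped):

>>> import ray
>>> # Eight records partitioned into four distributed blocks.
>>> ray.data.range(8, parallelism=4)  # doctest: +SKIP
Dataset(num_blocks=4, num_rows=8, schema=<class 'int'>)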

Block format#

The way blocks are represented.

Blocks are represented as Arrow tables, pandas DataFrames, or Python lists. To determine the block format, call Dataset.dataset_format().
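For example (a sketch; tabular datasets are typically backed by Arrow blocks, and the exact return value varies across Ray versions, so the output is marked as skipped):

>>> import ray
>>> # range_table() creates a tabular dataset, typically backed by Arrow blocks.
>>> ray.data.range_table(10).dataset_format()  # doctest: +SKIP
'arrow'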

Datasets (library)#

A library for distributed data processing.

Datasets isn’t intended as a replacement for more general data processing systems. Its utility is as the last-mile bridge from ETL pipeline outputs to distributed ML applications and libraries in Ray.

To learn more about Ray Datasets, read Key Concepts.

Dataset (object)#

A class that represents a distributed collection of data.

Dataset exposes methods to read, transform, and consume data at scale.
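For example, a minimal read-transform-consume pipeline (the doubling transform is an arbitrary illustration):

>>> import ray
>>> ds = ray.data.from_items([1, 2, 3])  # read
>>> doubled = ds.map(lambda x: x * 2)  # transform
>>> doubled.take(3)  # consume
[2, 4, 6]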

To learn more about Datasets and the operations they support, read the Datasets API Reference.

Datasource#

A Datasource specifies how to read from and write to a variety of external storage systems and data formats.

Examples of Datasources include ParquetDatasource, ImageDatasource, TFRecordDatasource, CSVDatasource, and MongoDatasource.
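Most Datasources have a corresponding convenience function. For example, ray.data.read_csv() uses CSVDatasource under the hood (a sketch; "example.csv" is a hypothetical file path, so the call is marked as skipped):

>>> import ray
>>> # read_csv() delegates to CSVDatasource to parse each file.
>>> ds = ray.data.read_csv("example.csv")  # doctest: +SKIP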

To learn more about Datasources, read Creating a Custom Datasource.

Record#

A single data item.

If your dataset is tabular, records are TableRows. If it's a simple dataset, records are arbitrary Python objects. If it's a tensor dataset, records are NumPy ndarrays.
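For example, taking records from a simple dataset returns the underlying Python objects directly:

>>> import ray
>>> # Each record of a simple dataset is the Python object itself.
>>> ray.data.from_items(["spam", "ham", "eggs"]).take(2)
['spam', 'ham']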

Schema#

The data type of a dataset.

If your dataset is tabular, the schema describes the column names and data types. If it's a simple dataset, the schema describes the Python object type. If it's a tensor dataset, the schema describes the per-element tensor shape and data type.

To determine a dataset’s schema, call Dataset.schema().
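For example (a sketch; the Arrow-style schema repr shown is typical for tabular datasets):

>>> import ray
>>> # For tabular data, the schema lists column names and data types.
>>> ray.data.range_table(10).schema()
value: int64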

Simple Dataset#

A Dataset that represents a collection of arbitrary Python objects.

>>> import ray
>>> ray.data.from_items(["spam", "ham", "eggs"])
Dataset(num_blocks=3, num_rows=3, schema=<class 'str'>)

Tensor Dataset#

A Dataset that represents a collection of ndarrays.

Tabular datasets that contain tensor columns aren’t tensor datasets.

>>> import numpy as np
>>> import ray
>>> ray.data.from_numpy(np.zeros((100, 32, 32, 3)))
Dataset(
   num_blocks=1,
   num_rows=100,
   schema={__value__: ArrowTensorType(shape=(32, 32, 3), dtype=double)}
)

Tabular Dataset#

A Dataset that represents columnar data.

>>> import ray
>>> ray.data.read_csv("s3://[email protected]/iris.csv")
Dataset(
   num_blocks=1,
   num_rows=150,
   schema={
      sepal length (cm): double,
      sepal width (cm): double,
      petal length (cm): double,
      petal width (cm): double,
      target: int64
   }
)

User-defined function (UDF)#

A callable that transforms batches or records of data. UDFs let you arbitrarily transform datasets.

Call Dataset.map_batches(), Dataset.map(), or Dataset.flat_map() to apply UDFs.
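As a minimal sketch, here's a batch UDF that doubles every value; batch_format="pandas" makes each batch a pandas DataFrame (output reprs vary across Ray versions, so the call is marked as skipped):

>>> import ray
>>> def double_batch(batch):
...     # batch is a pandas DataFrame because batch_format="pandas".
...     return batch * 2
>>> ds = ray.data.range_table(5)
>>> ds.map_batches(double_batch, batch_format="pandas").take(2)  # doctest: +SKIP
[{'value': 0}, {'value': 2}]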

To learn more about UDFs, read Writing User-Defined Functions.