Ray Data Glossary
- Batch format
The way batches of data are represented.
Set batch_format in methods like Dataset.iter_batches() and Dataset.map_batches() to specify the batch type.
    >>> import ray
    >>> # Dataset is executed by streaming executor by default, which doesn't
    >>> # preserve the order, so we explicitly set it here.
    >>> ray.data.context.DataContext.get_current().execution_options.preserve_order = True
    >>> dataset = ray.data.range(10)
    >>> next(iter(dataset.iter_batches(batch_format="numpy", batch_size=5)))
    {'id': array([0, 1, 2, 3, 4])}
    >>> next(iter(dataset.iter_batches(batch_format="pandas", batch_size=5)))
       id
    0   0
    1   1
    2   2
    3   3
    4   4
To learn more about batch formats, read Configuring batch formats.
- Block
A processing unit of data. A Dataset consists of a collection of blocks.
Under the hood, Ray Data partitions records into a set of distributed data blocks. This allows it to perform operations in parallel.
Unlike a batch, which is a user-facing object, a block is an internal abstraction.
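The idea of partitioning records into blocks and processing the blocks in parallel can be illustrated with a small conceptual sketch in plain Python. This is not Ray Data's implementation: the partition and double_ids helpers are hypothetical, and a thread pool stands in for Ray's distributed execution.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict, List

Record = Dict[str, Any]


def partition(records: List[Record], num_blocks: int) -> List[List[Record]]:
    """Split records into num_blocks contiguous chunks of near-equal size."""
    size, extra = divmod(len(records), num_blocks)
    blocks, start = [], 0
    for i in range(num_blocks):
        # The first `extra` blocks each take one additional record.
        end = start + size + (1 if i < extra else 0)
        blocks.append(records[start:end])
        start = end
    return blocks


def double_ids(block: List[Record]) -> List[Record]:
    """Transform one block independently of all the others."""
    return [{"id": row["id"] * 2} for row in block]


records = [{"id": i} for i in range(10)]
blocks = partition(records, num_blocks=3)

# Each block is processed on its own worker; pool.map preserves block order.
with ThreadPoolExecutor() as pool:
    transformed = list(pool.map(double_ids, blocks))

print([len(b) for b in blocks])  # [4, 3, 3]
```

Because each block is transformed without reference to the others, the work can be spread across as many workers as there are blocks.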
- Block format
The way blocks are represented.
Blocks are internally represented as Arrow tables or pandas DataFrames.
- Ray Data (library)
A library for distributed data processing.
Ray Data isn’t intended as a replacement for more general data processing systems. Instead, it serves as the last-mile bridge from ETL pipeline outputs to distributed ML applications and libraries in Ray.
To learn more about Ray Data, read Key Concepts.
- Dataset (object)
A class that produces a sequence of distributed data blocks.
Dataset exposes methods to read, transform, and consume data at scale.
To learn more about Datasets and the operations they support, read the Datasets API Reference.
- Datasource
A Datasource specifies how to read from and write to a variety of external storage systems and data formats.
Examples of Datasources include ParquetDatasource, ImageDatasource, TFRecordDatasource, CSVDatasource, and MongoDatasource.
To learn more about Datasources, read Creating a Custom Datasource.
- Record
A single data item, which is always a Dict[str, Any].
- Schema
The names and types of a dataset’s fields.
To determine a dataset’s schema, call Dataset.schema().
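The relationship between records and a schema can be sketched in plain Python. This is a conceptual illustration only, not how Ray Data computes a schema: the infer_schema helper is hypothetical, and it reports Python type names for simplicity, which is not necessarily what Dataset.schema() reports.

```python
from typing import Any, Dict, List

# A record is a single data item: a mapping from field name to value.
Record = Dict[str, Any]


def infer_schema(records: List[Record]) -> Dict[str, str]:
    """Collect each field name and the type name of its values."""
    schema: Dict[str, str] = {}
    for record in records:
        for name, value in record.items():
            type_name = type(value).__name__
            # setdefault stores the first type seen for a field and
            # returns the stored type on later records.
            if schema.setdefault(name, type_name) != type_name:
                raise TypeError(f"field {name!r} has inconsistent types")
    return schema


records = [
    {"id": 0, "label": "cat"},
    {"id": 1, "label": "dog"},
]
schema = infer_schema(records)
print(schema)  # {'id': 'int', 'label': 'str'}
```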