ray.data.Dataset.default_batch_format

Dataset.default_batch_format() -> Type

Return this dataset’s default batch format.

The default batch format describes what batches of data look like. To learn more about batch formats, read the guide on writing user-defined functions.

Note

If this dataset consists of more than a read operation, or if its schema can't be determined from the metadata provided by the datasource, then calling this method triggers execution of the lazy transformations performed on this dataset and blocks until execution completes.
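
For example (a minimal sketch; whether the map() below is actually deferred, and therefore executed here, depends on the execution mode of your Ray version), asking for the batch format of a transformed dataset can force the pending work to run:

>>> import ray
>>> ds = ray.data.range(100).map(lambda x: x * 2)
>>> ds.default_batch_format()  # may run the pending map() before returning
<class 'list'>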

Examples

If your dataset represents a list of Python objects, then the default batch format is list.

>>> import ray
>>> ds = ray.data.range(100)
>>> ds  
Dataset(num_blocks=20, num_rows=100, schema=<class 'int'>)
>>> ds.default_batch_format()
<class 'list'>
>>> next(ds.iter_batches(batch_size=4))
[0, 1, 2, 3]

If your dataset contains a single TensorDtype or ArrowTensorType column named __value__ (as created by ray.data.from_numpy()), then the default batch format is np.ndarray. For more information on tensor datasets, read the tensor support guide.

>>> ds = ray.data.range_tensor(100)
>>> ds  
Dataset(num_blocks=20, num_rows=100, schema={__value__: ArrowTensorType(shape=(1,), dtype=int64)})
>>> ds.default_batch_format()
<class 'numpy.ndarray'>
>>> next(ds.iter_batches(batch_size=4))
array([[0],
       [1],
       [2],
       [3]])

If your dataset represents tabular data and doesn’t consist solely of a __value__ tensor column (like the one created by ray.data.from_numpy()), then the default batch format is pd.DataFrame.

>>> import pandas as pd
>>> df = pd.DataFrame({"foo": ["a", "b"], "bar": [0, 1]})
>>> ds = ray.data.from_pandas(df)
>>> ds  
Dataset(num_blocks=1, num_rows=2, schema={foo: object, bar: int64})
>>> ds.default_batch_format()
<class 'pandas.core.frame.DataFrame'>
>>> next(ds.iter_batches(batch_size=4))
  foo  bar
0   a    0
1   b    1

See also

map_batches()

Call this function to transform batches of data.

iter_batches()

Call this function to iterate over batches of data.
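
As a rough sketch of how these fit together (add_one below is an illustrative helper, not part of this API), a map_batches() UDF receives each batch in the dataset's default batch format, so you can write it against that type directly:

>>> import pandas as pd
>>> import ray
>>> ds = ray.data.from_pandas(pd.DataFrame({"foo": ["a", "b"], "bar": [0, 1]}))
>>> ds.default_batch_format()
<class 'pandas.core.frame.DataFrame'>
>>> def add_one(batch):
...     # batch is a pandas DataFrame because that's this dataset's default batch format.
...     return batch.assign(bar=batch["bar"] + 1)
>>> ds = ds.map_batches(add_one)
>>> next(ds.iter_batches(batch_size=2))
  foo  bar
0   a    1
1   b    2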