ray.data.Dataset.default_batch_format

Dataset.default_batch_format() → Type

Return this dataset’s default batch format.

The default batch format is the type used to represent a batch of data when you don’t request a specific format. To learn more about batch formats, read the guide on writing user-defined functions.

Example

If your dataset represents a list of Python objects, then the default batch format is list.

>>> import ray
>>> ds = ray.data.range(100)
>>> ds
Dataset(num_blocks=20, num_rows=100, schema=<class 'int'>)
>>> ds.default_batch_format()
<class 'list'>
>>> next(ds.iter_batches(batch_size=4))
[0, 1, 2, 3]

If your dataset contains a single TensorDtype or ArrowTensorType column named __value__ (as created by ray.data.from_numpy()), then the default batch format is np.ndarray. For more information on tensor datasets, read the tensor support guide.

>>> ds = ray.data.range_tensor(100)
>>> ds
Dataset(num_blocks=20, num_rows=100, schema={__value__: ArrowTensorType(shape=(1,), dtype=int64)})
>>> ds.default_batch_format()
<class 'numpy.ndarray'>
>>> next(ds.iter_batches(batch_size=4))
array([[0],
       [1],
       [2],
       [3]])

If your dataset represents tabular data and isn’t just a single __value__ tensor column (such as the one ray.data.from_numpy() creates), then the default batch format is pd.DataFrame.

>>> import pandas as pd
>>> df = pd.DataFrame({"foo": ["a", "b"], "bar": [0, 1]})
>>> ds = ray.data.from_pandas(df)
>>> ds
Dataset(num_blocks=1, num_rows=2, schema={foo: object, bar: int64})
>>> ds.default_batch_format()
<class 'pandas.core.frame.DataFrame'>
>>> next(ds.iter_batches(batch_size=4))
  foo  bar
0   a    0
1   b    1
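
The three cases above amount to a small decision rule. The sketch below restates that rule in plain Python so the order of the checks is easy to see. It is not Ray's implementation: guess_default_batch_format, VALUE_COLUMN, and TENSOR_TYPES are hypothetical names, the schema is assumed to be a plain dict of column name to type name (or None for a dataset of Python objects), and the formats are returned as strings so the snippet runs without Ray, NumPy, or pandas.

```python
from typing import Dict, Optional

# Column name and tensor type names are taken from the text above.
VALUE_COLUMN = "__value__"
TENSOR_TYPES = ("TensorDtype", "ArrowTensorType")


def guess_default_batch_format(schema: Optional[Dict[str, str]]) -> str:
    """Hypothetical helper: name the default batch format for a schema.

    `schema` maps column names to type names, or is None for a dataset
    of plain Python objects. This mirrors the three documented rules;
    it is an illustration, not Ray's own logic.
    """
    # Rule 1: a dataset of Python objects has no tabular schema.
    if schema is None:
        return "list"

    # Rule 2: exactly one column, named __value__, holding a tensor type.
    columns = list(schema)
    if columns == [VALUE_COLUMN] and schema[VALUE_COLUMN].startswith(TENSOR_TYPES):
        return "numpy.ndarray"

    # Rule 3: any other tabular dataset.
    return "pandas.DataFrame"


print(guess_default_batch_format(None))
print(guess_default_batch_format({"__value__": "ArrowTensorType(shape=(1,), dtype=int64)"}))
print(guess_default_batch_format({"foo": "object", "bar": "int64"}))
```

The three calls correspond to the ray.data.range(100), ray.data.range_tensor(100), and ray.data.from_pandas(df) examples above. For a real dataset, call ds.default_batch_format() rather than relying on this sketch.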

See also

map_batches()

Call this function to transform batches of data.

iter_batches()

Call this function to iterate over batches of data.