ray.data.Dataset.default_batch_format
Dataset.default_batch_format() -> Type
Return this dataset’s default batch format.
The default batch format describes what batches of data look like. To learn more about batch formats, read the guide on writing user-defined functions.
Example
If your dataset represents a list of Python objects, then the default batch format is list.

>>> import ray
>>> ds = ray.data.range(100)
>>> ds
Dataset(num_blocks=20, num_rows=100, schema=<class 'int'>)
>>> ds.default_batch_format()
<class 'list'>
>>> next(ds.iter_batches(batch_size=4))
[0, 1, 2, 3]
If your dataset contains a single TensorDtype or ArrowTensorType column named __value__ (as created by ray.data.from_numpy()), then the default batch format is np.ndarray. For more information on tensor datasets, read the tensor support guide.

>>> ds = ray.data.range_tensor(100)
>>> ds
Dataset(num_blocks=20, num_rows=100, schema={__value__: ArrowTensorType(shape=(1,), dtype=int64)})
>>> ds.default_batch_format()
<class 'numpy.ndarray'>
>>> next(ds.iter_batches(batch_size=4))
array([[0],
       [1],
       [2],
       [3]])
If your dataset represents tabular data and doesn’t consist solely of a __value__ tensor column (such as the one created by ray.data.from_numpy()), then the default batch format is pd.DataFrame.

>>> import pandas as pd
>>> df = pd.DataFrame({"foo": ["a", "b"], "bar": [0, 1]})
>>> ds = ray.data.from_pandas(df)
>>> ds
Dataset(num_blocks=1, num_rows=2, schema={foo: object, bar: int64})
>>> ds.default_batch_format()
<class 'pandas.core.frame.DataFrame'>
>>> next(ds.iter_batches(batch_size=4))
  foo  bar
0   a    0
1   b    1
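The three cases above amount to a small decision rule. The following is a minimal sketch of that rule in plain Python, not Ray's implementation: the function name sketch_default_batch_format is hypothetical, the schema is assumed to be either a Python type (for a dataset of plain objects) or a mapping from column name to a dtype string, and the result is returned as a label rather than an actual class so that the example needs no third-party imports.

```python
from typing import Mapping, Union

# Column name and tensor type names are taken from the text above.
TENSOR_COLUMN = "__value__"
TENSOR_TYPE_PREFIXES = ("TensorDtype", "ArrowTensorType")


def sketch_default_batch_format(schema: Union[type, Mapping[str, str]]) -> str:
    """Hypothetical helper mirroring the three documented cases.

    Assumption: `schema` is a Python type for a dataset of plain objects,
    or a mapping of column name -> dtype string for a columnar dataset.
    """
    # Case 1: a dataset of plain Python objects.
    if isinstance(schema, type):
        return "list"

    # Case 2: exactly one column, named __value__, holding a tensor type.
    columns = list(schema)
    if columns == [TENSOR_COLUMN] and schema[TENSOR_COLUMN].startswith(
        TENSOR_TYPE_PREFIXES
    ):
        return "np.ndarray"

    # Case 3: any other tabular data.
    return "pd.DataFrame"


print(sketch_default_batch_format(int))
print(sketch_default_batch_format(
    {"__value__": "ArrowTensorType(shape=(1,), dtype=int64)"}
))
print(sketch_default_batch_format({"foo": "object", "bar": "int64"}))
```

The three calls correspond to the schemas shown in the examples above, in the same order.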
See also
map_batches()
Call this function to transform batches of data.
iter_batches()
Call this function to iterate over batches of data.
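To illustrate what a batch-level function sees when the batch format is list, here is a small plain-Python stand-in for batch iteration and batch mapping. It is only a sketch: iter_list_batches and map_list_batches are hypothetical helpers written for this example, they operate on an in-memory list rather than a Dataset, and they don't reflect how Ray schedules or executes work.

```python
from typing import Callable, Iterator, List


def iter_list_batches(rows: List[int], batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of `rows`, each with at most `batch_size` items."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def map_list_batches(
    rows: List[int],
    fn: Callable[[List[int]], List[int]],
    batch_size: int,
) -> List[int]:
    """Apply `fn` to each list batch and concatenate the results."""
    out: List[int] = []
    for batch in iter_list_batches(rows, batch_size):
        out.extend(fn(batch))
    return out


rows = list(range(100))

# The first batch matches the list-format example above.
first_batch = next(iter_list_batches(rows, batch_size=4))

# A batch-level function receives a whole list, not a single row.
doubled = map_list_batches(rows, lambda batch: [x * 2 for x in batch], batch_size=4)

print(first_batch)
print(doubled[:4])
```

A function written for list batches, like the lambda here, would need to be rewritten to accept an np.ndarray or pd.DataFrame if the dataset's batch format were different, which is why it helps to know the default.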