ray.data.Dataset.default_batch_format
Dataset.default_batch_format() → Type
Return this dataset’s default batch format.
The default batch format describes what batches of data look like when you don't request a specific format. To learn more about batch formats, read the guide on writing user-defined functions.
Note
If this dataset consists of more than a read, or if the schema can’t be determined from the metadata provided by the datasource, then this operation will trigger execution of the lazy transformations performed on this dataset, and will block until execution completes.
Examples
If your dataset represents a list of Python objects, then the default batch format is list.

>>> import ray
>>> ds = ray.data.range(100)
>>> ds
Dataset(num_blocks=20, num_rows=100, schema=<class 'int'>)
>>> ds.default_batch_format()
<class 'list'>
>>> next(ds.iter_batches(batch_size=4))
[0, 1, 2, 3]
If your dataset contains a single TensorDtype or ArrowTensorType column named __value__ (as created by ray.data.from_numpy()), then the default batch format is np.ndarray. For more information on tensor datasets, read the tensor support guide.

>>> ds = ray.data.range_tensor(100)
>>> ds
Dataset(num_blocks=20, num_rows=100, schema={__value__: ArrowTensorType(shape=(1,), dtype=int64)})
>>> ds.default_batch_format()
<class 'numpy.ndarray'>
>>> next(ds.iter_batches(batch_size=4))
array([[0],
       [1],
       [2],
       [3]])
If your dataset represents tabular data and doesn't consist solely of a __value__ tensor column (such as is created by ray.data.from_numpy()), then the default batch format is pd.DataFrame.

>>> import pandas as pd
>>> df = pd.DataFrame({"foo": ["a", "b"], "bar": [0, 1]})
>>> ds = ray.data.from_pandas(df)
>>> ds
Dataset(num_blocks=1, num_rows=2, schema={foo: object, bar: int64})
>>> ds.default_batch_format()
<class 'pandas.core.frame.DataFrame'>
>>> next(ds.iter_batches(batch_size=4))
  foo  bar
0   a    0
1   b    1
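The three cases above can be read as a small decision rule. The following is a minimal sketch of that rule in plain Python, not Ray's actual implementation: the function name sketch_default_batch_format and its parameters are hypothetical, and the dataset is described only by its column names and by which of those columns hold tensors.

```python
from typing import Optional, Sequence


def sketch_default_batch_format(
    column_names: Optional[Sequence[str]],
    tensor_columns: Sequence[str] = (),
) -> str:
    """Hypothetical helper mirroring the three documented cases.

    column_names is None for a dataset of plain Python objects
    (no tabular schema); otherwise it lists the column names.
    tensor_columns lists the columns whose type is a tensor type.
    """
    # Case 1: a list of Python objects has no columns at all.
    if column_names is None:
        return "list"
    # Case 2: exactly one column, named __value__, holding tensors.
    if list(column_names) == ["__value__"] and "__value__" in tensor_columns:
        return "np.ndarray"
    # Case 3: any other tabular data.
    return "pd.DataFrame"


print(sketch_default_batch_format(None))
print(sketch_default_batch_format(["__value__"], tensor_columns=["__value__"]))
print(sketch_default_batch_format(["foo", "bar"]))
```

The sketch returns format names as strings for readability; the real method returns the type object itself, as the doctests above show.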
See also
map_batches()
Call this function to transform batches of data.
iter_batches()
Call this function to iterate over batches of data.
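Because the default batch format depends on the dataset, a user-defined function that doesn't pin a batch format may receive a list, an np.ndarray, or a pd.DataFrame. The sketch below shows one way such a function could handle all three; add_one is a hypothetical name, and the example runs without Ray by calling the function directly on sample batches.

```python
import numpy as np
import pandas as pd


def add_one(batch):
    """Add 1 to every numeric value, whichever batch type arrives."""
    if isinstance(batch, list):
        # Default format for datasets of plain Python objects.
        return [x + 1 for x in batch]
    if isinstance(batch, np.ndarray):
        # Default format for single __value__ tensor-column datasets.
        return batch + 1
    if isinstance(batch, pd.DataFrame):
        # Default format for other tabular datasets; only numeric
        # columns are changed, other columns pass through untouched.
        out = batch.copy()
        numeric = out.select_dtypes(include="number").columns
        out[numeric] = out[numeric] + 1
        return out
    raise TypeError(f"Unsupported batch type: {type(batch).__name__}")


list_result = add_one([0, 1, 2, 3])
array_result = add_one(np.array([[0], [1], [2], [3]]))
frame_result = add_one(pd.DataFrame({"foo": ["a", "b"], "bar": [0, 1]}))
```

With a running Ray installation, a function like this would typically be passed to map_batches(), for example ds.map_batches(add_one).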