ray.data.DatasetIterator.iter_batches
- abstract DatasetIterator.iter_batches(*, prefetch_blocks: int = 0, batch_size: int = 256, batch_format: typing_extensions.Literal['default', 'numpy', 'pandas'] = 'default', drop_last: bool = False, local_shuffle_buffer_size: Optional[int] = None, local_shuffle_seed: Optional[int] = None) → Iterator[Union[List[ray.data.block.T], pyarrow.Table, pandas.DataFrame, bytes, numpy.ndarray, Dict[str, numpy.ndarray]]] [source]
Return a local batched iterator over the dataset.
Examples
>>> import ray
>>> for batch in ray.data.range(
...     1000000
... ).iterator().iter_batches():
...     print(batch)
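The batch size and format can also be set explicitly. A minimal sketch, assuming the same numeric range dataset as above (the exact batch container depends on the dataset's block format, e.g. numpy.ndarray or Dict[str, numpy.ndarray] when batch_format="numpy"):

>>> import ray
>>> it = ray.data.range(1000).iterator()
>>> for batch in it.iter_batches(batch_size=128, batch_format="numpy"):
...     print(type(batch))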
Time complexity: O(1)
- Parameters
prefetch_blocks – The number of blocks to prefetch ahead of the current block during the scan.
batch_size – The number of rows in each batch, or None to use entire blocks as batches (blocks may contain different numbers of rows). The final batch may include fewer than batch_size rows if drop_last is False. Defaults to 256.
batch_format – The format in which to return each batch. Specify “default” to use the default block format (promoting tables to Pandas and tensors to NumPy), “pandas” to select pandas.DataFrame, “pyarrow” to select pyarrow.Table, or “numpy” to select numpy.ndarray for tensor datasets and Dict[str, numpy.ndarray] for tabular datasets. Default is “default”.
drop_last – Whether to drop the last batch if it’s incomplete.
local_shuffle_buffer_size – If non-None, the data will be randomly shuffled using a local in-memory shuffle buffer, and this value will serve as the minimum number of rows that must be in the local in-memory shuffle buffer in order to yield a batch. When there are no more rows to add to the buffer, the remaining rows in the buffer will be drained.
local_shuffle_seed – The seed to use for the local random shuffle.
- Returns
An iterator over record batches.
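The batching and shuffling parameters documented above can be combined in a single call; a sketch under illustrative settings (the buffer size and seed values here are examples only, not recommendations):

>>> import ray
>>> it = ray.data.range(10000).iterator()
>>> for batch in it.iter_batches(
...     batch_size=256,
...     batch_format="numpy",
...     drop_last=True,
...     local_shuffle_buffer_size=1000,
...     local_shuffle_seed=42,
... ):
...     pass  # each yielded batch holds 256 locally shuffled rows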