ray.data.DataIterator.iter_batches
- DataIterator.iter_batches(*, prefetch_batches: int = 1, batch_size: int = 256, batch_format: str | None = 'default', drop_last: bool = False, local_shuffle_buffer_size: int | None = None, local_shuffle_seed: int | None = None, _collate_fn: Callable[[pyarrow.Table | pandas.DataFrame | Dict[str, numpy.ndarray]], CollatedData] | None = None, _finalize_fn: Callable[[Any], Any] | None = None) → Iterable[pyarrow.Table | pandas.DataFrame | Dict[str, numpy.ndarray]]
Return a batched iterable over the dataset.
Examples
>>> import ray
>>> for batch in ray.data.range(
...     1000000
... ).iterator().iter_batches():
...     print(batch)
Time complexity: O(1)
- Parameters:
prefetch_batches – The number of batches to fetch ahead of the current batch to fetch. If set to greater than 0, a separate threadpool will be used to fetch the objects to the local node, format the batches, and apply the collate_fn. Defaults to 1.
batch_size – The number of rows in each batch, or None to use entire blocks as batches (blocks may contain different number of rows). The final batch may include fewer than batch_size rows if drop_last is False. Defaults to 256.
batch_format – Specify "default" to use the default block format (NumPy), "pandas" to select pandas.DataFrame, "pyarrow" to select pyarrow.Table, or "numpy" to select Dict[str, numpy.ndarray], or None to return the underlying block exactly as is with no additional formatting.
drop_last – Whether to drop the last batch if it’s incomplete.
local_shuffle_buffer_size – If non-None, the data will be randomly shuffled using a local in-memory shuffle buffer, and this value will serve as the minimum number of rows that must be in the local in-memory shuffle buffer in order to yield a batch. When there are no more rows to add to the buffer, the remaining rows in the buffer will be drained.
local_shuffle_seed – The seed to use for the local random shuffle.
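The local shuffle buffer described above can be pictured with a small pure-Python sketch. This is an illustration of the documented behavior only, not Ray's implementation: the function name `shuffled_batches` and its parameter `buffer_min_rows` are hypothetical, and the exact order in which Ray draws rows from its buffer is not specified here.

```python
import random
from typing import Iterable, Iterator, List, Optional


def shuffled_batches(
    rows: Iterable[int],
    batch_size: int,
    buffer_min_rows: int,
    seed: Optional[int] = None,
) -> Iterator[List[int]]:
    """Yield shuffled batches drawn from a bounded in-memory buffer.

    Hypothetical helper: mimics the documented semantics of
    local_shuffle_buffer_size (buffer_min_rows) and local_shuffle_seed (seed).
    """
    rng = random.Random(seed)
    buffer: List[int] = []
    for row in rows:
        buffer.append(row)
        # Emit a batch only once the buffer holds the minimum number of rows
        # (and at least one full batch).
        if len(buffer) >= max(buffer_min_rows, batch_size):
            rng.shuffle(buffer)
            yield [buffer.pop() for _ in range(batch_size)]
    # No more rows to add: drain whatever is left in the buffer.
    rng.shuffle(buffer)
    while buffer:
        yield [buffer.pop() for _ in range(min(batch_size, len(buffer)))]


batches = list(shuffled_batches(range(10), batch_size=3, buffer_min_rows=5, seed=0))
print([len(b) for b in batches])
```

Because rows are only mixed with whatever else is in the buffer at the time, a larger buffer gives a more thorough shuffle at the cost of more memory, and a fixed seed makes the order reproducible.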
- Returns:
An iterable over record batches.
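To make the interaction between batch_size and drop_last concrete, here is a minimal pure-Python sketch of the batching rule described in the parameter list. The helper `iter_fixed_batches` is hypothetical and stands in for the real iterator, which also handles blocks, formats, and prefetching.

```python
from typing import Iterable, Iterator, List


def iter_fixed_batches(
    rows: Iterable[int], batch_size: int, drop_last: bool = False
) -> Iterator[List[int]]:
    """Group rows into lists of batch_size (hypothetical illustration)."""
    batch: List[int] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # The trailing partial batch is kept only when drop_last is False.
    if batch and not drop_last:
        yield batch


sizes_keep = [len(b) for b in iter_fixed_batches(range(1000), 256)]
sizes_drop = [len(b) for b in iter_fixed_batches(range(1000), 256, drop_last=True)]
print(sizes_keep)  # [256, 256, 256, 232]
print(sizes_drop)  # [256, 256, 256]
```

With Ray itself, the corresponding call would be `ray.data.range(1000).iterator().iter_batches(batch_size=256, drop_last=True)`, which, per the parameter descriptions above, should likewise omit the incomplete final batch.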