- Dataset.iter_batches(*, prefetch_batches: int = 1, batch_size: Optional[int] = 256, batch_format: Optional[str] = 'default', drop_last: bool = False, local_shuffle_buffer_size: Optional[int] = None, local_shuffle_seed: Optional[int] = None, _collate_fn: Optional[Callable[[Union[pyarrow.Table, pandas.DataFrame, Dict[str, numpy.ndarray]]], Any]] = None, prefetch_blocks: int = 0) → Iterator[Union[pyarrow.Table, pandas.DataFrame, Dict[str, numpy.ndarray]]]
Return a local batched iterator over the dataset.
This operation will trigger execution of the lazy transformations performed on this dataset.
>>> import ray
>>> for batch in ray.data.range(1000000).iter_batches():
...     print(batch)
Time complexity: O(1)
prefetch_batches – The number of batches to fetch ahead of the current batch. If set to a value greater than 0, a separate thread pool is used to fetch the objects to the local node, format the batches, and apply the collate_fn. Defaults to 1. You can revert to the old prefetching behavior, which uses prefetch_blocks, by setting use_legacy_iter_batches to True in the DatasetContext.
batch_size – The number of rows in each batch, or None to use entire blocks as batches (blocks may contain different numbers of rows). The final batch may include fewer than batch_size rows if drop_last is False. Defaults to 256.
batch_format – Specify "default" to use the default block format (NumPy), "pandas" to select pandas.DataFrame, "pyarrow" to select pyarrow.Table, "numpy" to select Dict[str, numpy.ndarray], or None to return the underlying block exactly as is with no additional formatting.
drop_last – Whether to drop the last batch if it’s incomplete.
local_shuffle_buffer_size – If non-None, the data will be randomly shuffled using a local in-memory shuffle buffer, and this value will serve as the minimum number of rows that must be in the local in-memory shuffle buffer in order to yield a batch. When there are no more rows to add to the buffer, the remaining rows in the buffer will be drained.
local_shuffle_seed – The seed to use for the local random shuffle.
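The local shuffle described for local_shuffle_buffer_size and local_shuffle_seed can be illustrated with a small pure-Python sketch. This is not Ray's implementation: the function name `local_shuffle_batches` and the parameter `buffer_min_rows` are hypothetical, and the exact point at which the real iterator yields a batch may differ. The sketch only shows the general idea of filling a buffer to a minimum size, emitting randomly chosen rows, and draining what is left once the input ends.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def local_shuffle_batches(
    rows: Iterable[T],
    batch_size: int,
    buffer_min_rows: int,  # hypothetical stand-in for local_shuffle_buffer_size
    seed: Optional[int] = None,  # hypothetical stand-in for local_shuffle_seed
) -> Iterator[List[T]]:
    """Yield shuffled batches from a local in-memory buffer (illustrative only)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    rng = random.Random(seed)
    buffer: List[T] = []
    # Assumption: a batch is emitted only once the buffer has reached this size.
    threshold = max(buffer_min_rows, batch_size)

    for row in rows:
        buffer.append(row)
        if len(buffer) >= threshold:
            # Shuffle the buffered rows, then hand out batch_size of them.
            rng.shuffle(buffer)
            batch = buffer[-batch_size:]
            del buffer[-batch_size:]
            yield batch

    # No more input: drain the remaining rows, still in random order.
    rng.shuffle(buffer)
    while buffer:
        batch = buffer[-batch_size:]
        del buffer[-batch_size:]
        yield batch
```

Because the randomness comes from a seeded `random.Random`, two runs with the same seed and input produce the same batches. A larger minimum buffer size mixes rows from further apart in the input, at the cost of holding more rows in memory.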
An iterator over record batches.
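The interaction between batch_size and drop_last can be sketched in plain Python. This is a simplified model over a flat sequence of rows, not Ray's implementation, and the helper name `fixed_size_batches` is hypothetical. It only illustrates why the final batch can be shorter than batch_size and what dropping it means.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def fixed_size_batches(
    rows: Iterable[T], batch_size: int, drop_last: bool = False
) -> Iterator[List[T]]:
    """Group rows into lists of batch_size rows (illustrative only)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # The leftover rows form an incomplete final batch, kept unless drop_last is set.
    if batch and not drop_last:
        yield batch


print(list(fixed_size_batches(range(10), 4)))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(list(fixed_size_batches(range(10), 4, drop_last=True)))
# [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Setting drop_last to True is useful when downstream code assumes every batch has exactly batch_size rows.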