Dataset.iter_batches(*, prefetch_batches: int = 1, batch_size: int | None = 256, batch_format: str | None = 'default', drop_last: bool = False, local_shuffle_buffer_size: int | None = None, local_shuffle_seed: int | None = None, _collate_fn: Callable[[pyarrow.Table | pandas.DataFrame | Dict[str, numpy.ndarray]], CollatedData] | None = None) → Iterable[pyarrow.Table | pandas.DataFrame | Dict[str, numpy.ndarray]]

Return an iterable over batches of data.

This method is useful for model training.


This operation will trigger execution of the lazy transformations performed on this dataset.


import ray

ds = ray.data.read_images("example://image-datasets/simple")

for batch in ds.iter_batches(batch_size=2, batch_format="numpy"):
    print(batch)
{'image': array([[[[...]]]], dtype=uint8)}
{'image': array([[[[...]]]], dtype=uint8)}

Time complexity: O(1)

Parameters:

  • prefetch_batches – The number of batches to fetch ahead of the current batch. If set to greater than 0, a separate threadpool is used to fetch the objects to the local node and format the batches. Defaults to 1.

  • batch_size – The number of rows in each batch, or None to use entire blocks as batches (blocks may contain different numbers of rows). The final batch may include fewer than batch_size rows if drop_last is False. Defaults to 256.

  • batch_format – If "default" or "numpy", batches are Dict[str, numpy.ndarray]. If "pandas", batches are pandas.DataFrame. Defaults to "default".

  • drop_last – Whether to drop the last batch if it’s incomplete. Defaults to False.

  • local_shuffle_buffer_size – If not None, the data is randomly shuffled using a local in-memory shuffle buffer, and this value serves as the minimum number of rows that must be in the local in-memory shuffle buffer in order to yield a batch. When there are no more rows to add to the buffer, the remaining rows in the buffer are drained.

  • local_shuffle_seed – The seed to use for the local random shuffle.
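
The interaction between batch_size, drop_last, and the local shuffle parameters can be pictured with a small pure-Python model. This is only a sketch of the behavior described above, not Ray's implementation: the function name sketch_iter_batches is hypothetical, rows are plain Python objects rather than blocks, batch_size=None is not modeled, and the way rows are sampled from the buffer is an assumption.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def sketch_iter_batches(
    rows: Iterable[T],
    batch_size: int = 256,
    drop_last: bool = False,
    local_shuffle_buffer_size: Optional[int] = None,
    local_shuffle_seed: Optional[int] = None,
) -> Iterator[List[T]]:
    """Toy model of the batching parameters (hypothetical, not Ray code)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    rng = random.Random(local_shuffle_seed)
    buffer: List[T] = []

    # A batch is yielded only once the buffer holds at least this many rows.
    # Without shuffling, that is simply one full batch.
    threshold = max(batch_size, local_shuffle_buffer_size or 0)

    def take_batch() -> List[T]:
        # Assumption: with shuffling enabled, the buffer is shuffled in place
        # before a batch is cut from it; otherwise rows keep arrival order.
        if local_shuffle_buffer_size is not None:
            rng.shuffle(buffer)
        n = min(batch_size, len(buffer))
        batch = buffer[:n]
        del buffer[:n]
        return batch

    for row in rows:
        buffer.append(row)
        while len(buffer) >= threshold:
            yield take_batch()

    # No more rows to add: drain whatever is left in the buffer.
    while len(buffer) >= batch_size:
        yield take_batch()
    if buffer and not drop_last:
        yield take_batch()  # final, incomplete batch


rows = range(10)
print(list(sketch_iter_batches(rows, batch_size=4)))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(list(sketch_iter_batches(rows, batch_size=4, drop_last=True)))
# [[0, 1, 2, 3], [4, 5, 6, 7]]
```

In this model, passing the same local_shuffle_seed twice yields the same batch order, and a larger local_shuffle_buffer_size delays the first batch in exchange for mixing rows from a wider window.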


Returns:

An iterable over batches of data.
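
The effect of prefetch_batches on the returned iterable can be illustrated with a bounded queue filled by a background thread. This is a simplified sketch under stated assumptions: prefetching_iter is a hypothetical helper, it uses a single worker thread where the description above mentions a threadpool, and it does not fetch or format real Ray objects.

```python
import queue
import threading
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")
_DONE = object()  # sentinel marking the end of the source


def prefetching_iter(source: Iterable[T], prefetch_batches: int = 1) -> Iterator[T]:
    """Toy model of prefetching (hypothetical, not Ray code)."""
    if prefetch_batches <= 0:
        # No prefetching: each item is produced only when the consumer asks.
        yield from source
        return

    # The bounded queue limits how far the worker can run ahead.
    q: queue.Queue = queue.Queue(maxsize=prefetch_batches)
    errors: List[BaseException] = []

    def worker() -> None:
        try:
            for item in source:
                q.put(item)  # blocks while the queue is full
        except Exception as exc:
            errors.append(exc)  # surfaced to the consumer below
        finally:
            q.put(_DONE)

    # Daemon thread: if the consumer stops early, the worker may stay blocked
    # on q.put, but it will not keep the process alive.
    threading.Thread(target=worker, daemon=True).start()

    while True:
        item = q.get()
        if item is _DONE:
            break
        yield item
    if errors:
        raise errors[0]


for batch in prefetching_iter(iter([[0, 1], [2, 3], [4]]), prefetch_batches=2):
    print(batch)
```

While the consumer works on one batch, the worker thread can already be preparing the next ones, which is the overlap that a non-zero prefetch_batches is meant to provide.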