ray.data.Dataset.iter_batches

Dataset.iter_batches(*, prefetch_batches: int = 1, batch_size: int | None = 256, batch_format: str | None = 'default', drop_last: bool = False, local_shuffle_buffer_size: int | None = None, local_shuffle_seed: int | None = None, _collate_fn: Callable[[pyarrow.Table | pandas.DataFrame | Dict[str, numpy.ndarray]], CollatedData] | None = None, prefetch_blocks: int = 0) -> Iterable[pyarrow.Table | pandas.DataFrame | Dict[str, numpy.ndarray]]

Return an iterable over batches of data.

This method is useful for model training.

Note

This operation will trigger execution of the lazy transformations performed on this dataset.

Examples

import ray

ds = ray.data.read_images("example://image-datasets/simple")

for batch in ds.iter_batches(batch_size=2, batch_format="numpy"):
    print(batch)
{'image': array([[[[...]]]], dtype=uint8)}
...
{'image': array([[[[...]]]], dtype=uint8)}
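
The same loop can yield pandas DataFrames instead; a minimal sketch reusing the dataset above (illustrative, not part of the original example):

for batch in ds.iter_batches(batch_size=2, batch_format="pandas"):
    # Each batch is a pandas.DataFrame with a single "image" column.
    print(batch.shape)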

Time complexity: O(1)

Parameters:
  • prefetch_batches – The number of batches to fetch ahead of the batch currently being processed. If set to greater than 0, a separate threadpool is used to fetch the objects to the local node and format the batches. Defaults to 1.

  • batch_size – The number of rows in each batch, or None to use entire blocks as batches (blocks may contain different numbers of rows). The final batch may include fewer than batch_size rows if drop_last is False. Defaults to 256.

  • batch_format – If "default" or "numpy", batches are Dict[str, numpy.ndarray]. If "pandas", batches are pandas.DataFrame.

  • drop_last – Whether to drop the last batch if it’s incomplete.

  • local_shuffle_buffer_size – If not None, rows are randomly shuffled using a local in-memory shuffle buffer, and this value sets the minimum number of rows that must be in that buffer before a batch is yielded. When there are no more rows to add to the buffer, the remaining rows in the buffer are drained. See the sketch after the Returns section.

  • local_shuffle_seed – The seed to use for the local random shuffle.

Returns:

An iterable over batches of data.
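
As an illustrative sketch (not part of the original reference), the following assumes a small ray.data.range dataset and shows how batch_size, local_shuffle_buffer_size, local_shuffle_seed, and drop_last interact: batches are yielded from a 512-row in-memory shuffle buffer, and the trailing incomplete batch is dropped.

import ray

ds = ray.data.range(1000)  # rows with a single "id" column

for batch in ds.iter_batches(
    batch_size=128,
    batch_format="numpy",
    local_shuffle_buffer_size=512,  # shuffle within a local 512-row buffer
    local_shuffle_seed=42,          # make the local shuffle reproducible
    drop_last=True,                 # skip the final incomplete batch (104 rows here)
):
    print(batch["id"].shape)  # (128,) for every yielded batch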