ray.data.DataIterator.iter_batches

DataIterator.iter_batches(*, prefetch_batches: int = 1, batch_size: int = 256, batch_format: str | None = 'default', drop_last: bool = False, local_shuffle_buffer_size: int | None = None, local_shuffle_seed: int | None = None) → Iterable[pyarrow.Table | pandas.DataFrame | Dict[str, numpy.ndarray] | cudf.DataFrame]

Return a batched iterable over the dataset.

Examples

>>> import ray
>>> for batch in ray.data.range(
...     1000000
... ).iterator().iter_batches(): 
...     print(batch) 

Note

When you break out of the for-loop above, Ray Data shuts the streaming executor down so it stops producing blocks into the object store. This relies on Python firing GeneratorExit into the implicit iterator created by the for-loop.

If you instead hold a reference to the iterator yourself, the cleanup is deferred until that reference is dropped:

import ray

ds = ray.data.range(1000000)
it = iter(ds.iter_batches())
for i, batch in enumerate(it):
    if i == 0:
        break
# The executor keeps producing blocks until ``it`` goes
# out of scope. Call ``it.close()`` to release resources
# eagerly, or stick with ``for batch in ds.iter_batches()``.

Some libraries (for example PyTorch Lightning’s limit_train_batches) hold an iter() reference internally to cap how many batches are consumed. In those cases prefer ds.limit(n) on the dataset so iteration ends naturally after n rows. Note that n counts rows, not batches: to cap iteration at k batches, use n = k * batch_size.
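The cleanup behavior described in this note comes from ordinary Python generator semantics, so it can be reproduced without Ray. The following is a minimal sketch under that assumption: produce_batches is a hypothetical stand-in for a streaming producer, not Ray's executor, and its finally block plays the role of the shutdown step.

```python
def produce_batches(log):
    """Hypothetical stand-in for a streaming producer; not Ray's executor."""
    try:
        i = 0
        while True:
            yield i  # GeneratorExit is raised here when the generator is closed
            i += 1
    finally:
        # Cleanup hook: runs on close() or when the generator is finalized.
        log.append("shutdown")


# Case 1: the for-loop owns the only reference to the generator.
implicit_log = []
for batch in produce_batches(implicit_log):
    break
# On CPython the generator is finalized as soon as the loop releases it,
# so the cleanup hook has already run by this point.

# Case 2: we hold our own reference, so cleanup waits for us.
held_log = []
it = produce_batches(held_log)
for i, batch in enumerate(it):
    if i == 0:
        break
log_before_close = list(held_log)  # cleanup has not run yet
it.close()                         # release eagerly
log_after_close = list(held_log)   # cleanup has now run
```

In the second case the cleanup hook runs only once it.close() is called (or the reference is dropped), which mirrors why holding an iterator over iter_batches() defers the shutdown. The immediate cleanup in the first case depends on CPython's reference counting; other interpreters may delay it.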

Time complexity: O(1)

Parameters:
  • prefetch_batches – The number of batches to fetch ahead of the current batch. If set to a value greater than 0, a separate thread pool is used to fetch the objects to the local node, format the batches, and apply the collate_fn. Defaults to 1.

  • batch_size – The number of rows in each batch, or None to use entire blocks as batches (blocks may contain different numbers of rows). The final batch may contain fewer than batch_size rows if drop_last is False. Defaults to 256.

  • batch_format – Specify "default" to use the default block format (NumPy), "pandas" to select pandas.DataFrame, "pyarrow" to select pyarrow.Table, "cudf" (experimental) to select cudf.DataFrame, "numpy" to select Dict[str, numpy.ndarray], or None to return the underlying block exactly as is, with no additional formatting.

  • drop_last – Whether to drop the last batch if it’s incomplete.

  • local_shuffle_buffer_size – If not None, rows are randomly shuffled using a local in-memory shuffle buffer, and this value is the minimum number of rows that must be in the buffer before a batch is yielded. When there are no more rows to add to the buffer, the remaining rows in the buffer are drained.

  • local_shuffle_seed – The seed to use for the local random shuffle.
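To make the interaction between batch_size, drop_last, and the local shuffle parameters concrete, here is a pure-Python sketch of the semantics described above. It is an illustration only and not Ray's implementation: sketch_iter_batches is a hypothetical helper, it works row by row on an in-memory iterable, it assumes batch_size is an integer (the None case is not modeled), and the way it releases rows from the buffer is a simplifying assumption.

```python
import random


def sketch_iter_batches(rows, *, batch_size, drop_last=False,
                        local_shuffle_buffer_size=None,
                        local_shuffle_seed=None):
    """Hypothetical model of the batching parameters; not Ray's code."""
    rng = random.Random(local_shuffle_seed)

    def shuffled(source):
        if local_shuffle_buffer_size is None:
            yield from source  # no local shuffle requested
            return
        buffer = []
        for row in source:
            buffer.append(row)
            # Release a random row only while the buffer stays at or
            # above the configured minimum size.
            if len(buffer) > local_shuffle_buffer_size:
                yield buffer.pop(rng.randrange(len(buffer)))
        # No more rows to add: drain what is left in random order.
        rng.shuffle(buffer)
        yield from buffer

    batch = []
    for row in shuffled(rows):
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # The final batch may be short; drop_last decides whether to emit it.
    if batch and not drop_last:
        yield batch


sizes_keep = [len(b) for b in sketch_iter_batches(range(10), batch_size=4)]
sizes_drop = [len(b) for b in sketch_iter_batches(range(10), batch_size=4,
                                                  drop_last=True)]
print(sizes_keep)  # [4, 4, 2]
print(sizes_drop)  # [4, 4]
```

With ten rows and batch_size=4, the last batch holds two rows unless drop_last=True removes it. In this sketch, passing the same local_shuffle_seed twice yields the same row order, and a larger buffer mixes rows from further apart at the cost of more memory.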

Returns:

An iterable over record batches.