ray.data.DatasetPipeline.iter_batches
- DatasetPipeline.iter_batches(*, prefetch_blocks: int = 0, batch_size: Optional[int] = 256, batch_format: str = 'default', drop_last: bool = False, local_shuffle_buffer_size: Optional[int] = None, local_shuffle_seed: Optional[int] = None) → Iterator[Union[List[ray.data.block.T], pyarrow.Table, pandas.DataFrame, bytes, numpy.ndarray, Dict[str, numpy.ndarray]]]
Return a local batched iterator over the data in the pipeline.
Examples
>>> import ray
>>> ds = ray.data.range(1000000).repeat(5)
>>> for pandas_df in ds.iter_batches():
...     print(pandas_df)
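Beyond the basic loop above, the batch size and format can be controlled explicitly. The following is a minimal sketch (not part of the original docstring) that assumes the default block format, where each yielded batch supports len():

>>> ds = ray.data.range(1000000).repeat(5)
>>> # Request batches of exactly 1024 rows; drop_last=True discards
>>> # any final batch smaller than batch_size.
>>> for batch in ds.iter_batches(batch_size=1024, drop_last=True):
...     assert len(batch) == 1024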
Time complexity: O(1)
- Parameters
prefetch_blocks – The number of blocks to prefetch ahead of the current block during the scan.
batch_size – The number of rows in each batch, or None to use entire blocks as batches (blocks may contain different numbers of rows). The final batch may include fewer than batch_size rows if drop_last is False. Defaults to 256.
batch_format – The format in which to return each batch. Specify “default” to use the current block format (promoting Arrow to pandas automatically), “pandas” to select pandas.DataFrame, or “pyarrow” to select pyarrow.Table. Default is “default”.
drop_last – Whether to drop the last batch if it’s incomplete.
local_shuffle_buffer_size – If non-None, the data will be randomly shuffled using a local in-memory shuffle buffer, and this value will serve as the minimum number of rows that must be in the buffer in order to yield a batch. When there are no more rows to add, the remaining rows in the buffer will be drained. This buffer size must be greater than or equal to batch_size, so batch_size must also be specified when using local shuffling (see the sketch after the Returns section).
local_shuffle_seed – The seed to use for the local random shuffle.
- Returns
An iterator over record batches.
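As a further illustration of the local shuffling parameters, here is a minimal sketch (not from the original docstring; the buffer size and seed values are illustrative). Because local_shuffle_buffer_size must be at least batch_size, batch_size is passed explicitly:

>>> import ray
>>> ds = ray.data.range(100000).repeat(2)
>>> # Shuffle rows within a 10000-row in-memory buffer before batching.
>>> for batch in ds.iter_batches(
...     batch_size=256,
...     local_shuffle_buffer_size=10000,
...     local_shuffle_seed=42,
... ):
...     pass  # each batch is drawn from the locally shuffled buffer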