Dataset.repartition(num_blocks: int, *, shuffle: bool = False) ray.data.dataset.Dataset[ray.data.block.T][source]#

Repartition the dataset into exactly this number of blocks.

This is a blocking operation. After repartitioning, all blocks in the returned dataset will have approximately the same number of rows.


>>> import ray
>>> ds = ray.data.range(100)
>>> # Set the number of output partitions to write to disk.
>>> ds.repartition(10).write_parquet("/tmp/test")

Time complexity: O(dataset size / parallelism)

  • num_blocks – The number of blocks.

  • shuffle – Whether to perform a distributed shuffle during the repartition. When shuffle is enabled, each output block contains a subset of data rows from each input block, which requires all-to-all data movement. When shuffle is disabled, output blocks are created from adjacent input blocks, minimizing data movement.


The repartitioned dataset.