- Dataset.repartition(num_blocks: int, *, shuffle: bool = False) ray.data.dataset.Dataset #
This method can be useful to tune the performance of your pipeline. To learn more, see Advanced: Performance Tips and Tuning.
If you’re writing data to files, you can also use this method to change the number of output files. To learn more, see Changing the number of output files.
Repartition has two modes. If
shuffle=False, Ray Data performs the minimal data movement needed to equalize block sizes. Otherwise, Ray Data performs a full distributed shuffle.
>>> import ray >>> ds = ray.data.range(100) >>> ds.repartition(10).num_blocks() 10
Time complexity: O(dataset size / parallelism)
num_blocks – The number of blocks.
shuffle – Whether to perform a distributed shuffle during the repartition. When shuffle is enabled, each output block contains a subset of data rows from each input block, which requires all-to-all data movement. When shuffle is disabled, output blocks are created from adjacent input blocks, minimizing data movement.