ray.data.Dataset.repartition
- Dataset.repartition(num_blocks: int | None = None, target_num_rows_per_block: int | None = None, *, shuffle: bool = False) -> Dataset
Repartition the Dataset into exactly this number of blocks.
This method can be useful to tune the performance of your pipeline. To learn more, see Advanced: Performance Tips and Tuning.
If you’re writing data to files, you can also use this method to change the number of output files. To learn more, see Changing the number of output files.
Note
Repartition has two modes. If shuffle=False, Ray Data performs the minimal data movement needed to equalize block sizes. Otherwise, Ray Data performs a full distributed shuffle.
Note
This operation requires all inputs to be materialized in the object store before it can execute.
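The difference between the two modes can be pictured with a small pure-Python model. This is only a conceptual sketch, not Ray Data's implementation: blocks are modeled as plain lists, and the helper names repartition_adjacent and repartition_all_to_all are hypothetical. It assumes that shuffle=False behaves like cutting the rows into contiguous chunks, and that shuffle=True sends one slice of every input block to every output block.

```python
from typing import List

Block = List[int]


def _split_evenly(rows: Block, parts: int) -> List[Block]:
    """Cut rows into `parts` contiguous slices whose sizes differ by at most one."""
    base, extra = divmod(len(rows), parts)
    slices, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        slices.append(rows[start:start + size])
        start += size
    return slices


def repartition_adjacent(blocks: List[Block], num_blocks: int) -> List[Block]:
    """Model of shuffle=False: output blocks are built from adjacent input rows."""
    rows = [row for block in blocks for row in block]
    return _split_evenly(rows, num_blocks)


def repartition_all_to_all(blocks: List[Block], num_blocks: int) -> List[Block]:
    """Model of shuffle=True: each output block receives a slice of every input block."""
    out: List[Block] = [[] for _ in range(num_blocks)]
    for block in blocks:
        for j, piece in enumerate(_split_evenly(block, num_blocks)):
            out[j].extend(piece)
    return out


blocks = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
adjacent = repartition_adjacent(blocks, 2)
shuffled = repartition_all_to_all(blocks, 2)
print(adjacent)  # [[0, 1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11]]
print(shuffled)  # [[0, 1, 4, 5, 8, 9], [2, 3, 6, 7, 10, 11]]
```

In the adjacent model each output block touches only neighboring input blocks, while in the all-to-all model every output block depends on every input block, which is why the shuffle mode needs more data movement.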
Examples
>>> import ray
>>> ds = ray.data.range(100).repartition(10).materialize()
>>> ds.num_blocks()
10
Time complexity: O(dataset size / parallelism)
- Parameters:
num_blocks – Number of blocks after repartitioning.
target_num_rows_per_block – [Experimental] The target number of rows per block to repartition. Note that either num_blocks or target_num_rows_per_block must be set, but not both. When target_num_rows_per_block is set, it only repartitions Dataset blocks that are larger than target_num_rows_per_block. Note that the system will internally figure out the number of rows per block for optimal execution, based on target_num_rows_per_block. This is the current behavior because of the implementation and may change in the future.
shuffle – Whether to perform a distributed shuffle during the repartition. When shuffle is enabled, each output block contains a subset of data rows from each input block, which requires all-to-all data movement. When shuffle is disabled, output blocks are created from adjacent input blocks, minimizing data movement.
Note that either num_blocks or target_num_rows_per_block must be set here, but not both. Additionally, note that this operation materializes the whole dataset in memory when shuffle is set to True.
- Returns:
The repartitioned Dataset.