ray.data.Dataset.repartition#
- Dataset.repartition(num_blocks: int | None = None, target_num_rows_per_block: int | None = None, *, shuffle: bool = False, keys: List[str] | None = None, sort: bool = False) → Dataset [source]#
Repartition the Dataset into exactly this number of blocks.
This method can be useful to tune the performance of your pipeline. To learn more, see Advanced: Performance Tips and Tuning.
If you’re writing data to files, you can also use this method to change the number of output files. To learn more, see Changing the number of output files.
Note
Repartition has three modes:
- When num_blocks and shuffle=True are specified, Ray Data performs a full distributed shuffle, producing exactly num_blocks blocks.
- When num_blocks and shuffle=False are specified, Ray Data does NOT perform a full shuffle, instead splitting and combining blocks to minimize the necessary data movement (relative to a full shuffle). Exactly num_blocks blocks are produced.
- If target_num_rows_per_block is set (mutually exclusive with num_blocks and shuffle), streaming repartitioning is executed: blocks are made to carry no more than target_num_rows_per_block rows, and smaller blocks are combined into bigger ones up to target_num_rows_per_block rows.
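As an illustration, the no-shuffle mode’s split-and-combine behavior can be sketched in plain Python. This is a hypothetical model (block boundaries move over the concatenated rows of adjacent blocks), not Ray Data’s actual implementation:

```python
def repartition_no_shuffle(blocks, num_blocks):
    """Illustrative sketch: redistribute rows into exactly num_blocks
    output blocks without an all-to-all shuffle. Rows keep their
    original order; only the block boundaries move."""
    # Concatenate adjacent blocks; no cross-block row mixing occurs.
    rows = [row for block in blocks for row in block]
    n = len(rows)
    out, start = [], 0
    for i in range(num_blocks):
        # Spread rows as evenly as possible across the outputs.
        end = start + n // num_blocks + (1 if i < n % num_blocks else 0)
        out.append(rows[start:end])
        start = end
    return out
```

For example, `repartition_no_shuffle([[1, 2, 3], [4], [5, 6]], 2)` yields `[[1, 2, 3], [4, 5, 6]]`: the three input blocks become two output blocks of equal size, with each row staying adjacent to its original neighbors.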
Examples
>>> import ray
>>> ds = ray.data.range(100).repartition(10).materialize()
>>> ds.num_blocks()
10
Time complexity: O(dataset size / parallelism)
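The streaming mode driven by target_num_rows_per_block can likewise be sketched in plain Python. The function below is an illustrative model that caps every output block and merges small inputs as they stream through; it is not Ray Data’s actual implementation, which determines the per-block row count internally:

```python
def streaming_repartition(blocks, target_num_rows_per_block):
    """Illustrative sketch of streaming repartitioning: emit blocks
    carrying at most target_num_rows_per_block rows, combining smaller
    incoming blocks and splitting larger ones as they arrive."""
    out, buffer = [], []
    for block in blocks:
        buffer.extend(block)
        # Flush a full block as soon as enough rows have accumulated.
        while len(buffer) >= target_num_rows_per_block:
            out.append(buffer[:target_num_rows_per_block])
            buffer = buffer[target_num_rows_per_block:]
    if buffer:  # trailing partial block
        out.append(buffer)
    return out
```

For example, `streaming_repartition([[1], [2, 3], [4, 5, 6, 7]], 3)` yields `[[1, 2, 3], [4, 5, 6], [7]]`: small blocks are merged up to the target, the large block is split, and a partial block may remain at the end.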
- Parameters:
num_blocks – Number of blocks after repartitioning.
target_num_rows_per_block – [Experimental] The target number of rows per block. Performs streaming repartitioning of the dataset (no shuffling). Either num_blocks or target_num_rows_per_block must be set, but not both. When target_num_rows_per_block is set, only Dataset blocks larger than target_num_rows_per_block are repartitioned. Note that the system internally determines the number of rows per block for optimal execution, based on target_num_rows_per_block. This is the current behavior because of the implementation and may change in the future.
shuffle – Whether to perform a distributed shuffle during the repartition. When shuffle is enabled, each output block contains a subset of data rows from each input block, which requires all-to-all data movement. When shuffle is disabled, output blocks are created from adjacent input blocks, minimizing data movement.
keys – List of key columns used to determine which partition each row belongs to after repartitioning (by applying a hash-partitioning algorithm to the whole dataset). Note that this config is only relevant when DataContext.use_hash_based_shuffle is set to True.
sort – Whether the blocks should be sorted after repartitioning. By default, blocks are sorted in ascending order.
Note that you must set either num_blocks or target_num_rows_per_block, but not both. Additionally, note that this operation materializes the entire dataset in memory when you set shuffle to True.
- Returns:
The repartitioned Dataset.
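For the keys parameter, hash partitioning can be sketched as follows. This is an illustrative model assuming rows are plain dicts; it is not Ray Data’s implementation, which applies hash-based shuffling across the cluster:

```python
def hash_partition(rows, keys, num_blocks):
    """Illustrative sketch of hash-based repartitioning by key columns:
    each row is routed to the partition given by the hash of its key
    values, so rows sharing a key always land in the same partition."""
    partitions = [[] for _ in range(num_blocks)]
    for row in rows:
        # Build the partitioning key from the requested columns.
        key = tuple(row[k] for k in keys)
        partitions[hash(key) % num_blocks].append(row)
    return partitions
```

Because routing depends only on the key columns, all rows with the same key values end up in the same output partition, which is what makes downstream key-wise operations possible after the repartition.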