ray.data.Dataset.randomize_block_order
- Dataset.randomize_block_order(*, seed: int | RandomSeedConfig | None = None) → Dataset
Randomly shuffle the blocks of this Dataset.
This method is useful if you split() your dataset into shards and want to randomize the data in each shard without performing a full random_shuffle().
Note
This operation requires all inputs to be materialized in the object store before it can execute.
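To make the difference between reordering blocks and a full row-level shuffle concrete, here is a minimal pure-Python sketch. It is a toy model, not Ray's implementation: a dataset is represented as a list of blocks (each a list of row ids), and the helper names `make_blocks`, `shuffle_block_order`, and `full_shuffle` are hypothetical.

```python
import random


def make_blocks(num_rows, rows_per_block):
    """Split row ids 0..num_rows-1 into contiguous blocks (toy model)."""
    return [
        list(range(start, min(start + rows_per_block, num_rows)))
        for start in range(0, num_rows, rows_per_block)
    ]


def shuffle_block_order(blocks, seed=None):
    """Permute the blocks; rows inside each block keep their order."""
    rng = random.Random(seed)
    shuffled = list(blocks)  # shallow copy: only block positions change
    rng.shuffle(shuffled)
    return shuffled


def full_shuffle(blocks, seed=None):
    """Permute individual rows across all blocks (touches every row)."""
    rng = random.Random(seed)
    rows = [row for block in blocks for row in block]
    rng.shuffle(rows)
    return rows


blocks = make_blocks(num_rows=20, rows_per_block=5)
reordered = shuffle_block_order(blocks, seed=0)

# Rows read back in block order: each run of 5 ids is still contiguous.
flat = [row for block in reordered for row in block]
print(flat)
```

In this model, reading the first few rows after a block-order shuffle yields consecutive ids from whichever block landed first, which is consistent with the contiguous ids shown in the examples for this method.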
Examples
>>> import ray
>>> ds = ray.data.range(100)
>>> ds.take(5)
[{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}]
>>> ds.randomize_block_order().take(5)
[{'id': 15}, {'id': 16}, {'id': 17}, {'id': 18}, {'id': 19}]
>>> ds.randomize_block_order(seed=RandomSeedConfig(seed=42, reseed_after_execution=False)).take(5)
[{'id': 44}, {'id': 45}, {'id': 46}, {'id': 47}, {'id': 80}]
>>> ds.randomize_block_order(seed=RandomSeedConfig(seed=42, reseed_after_execution=False)).take(5)
[{'id': 44}, {'id': 45}, {'id': 46}, {'id': 47}, {'id': 80}]
Reproducible, but with a different block order on each execution (e.g., training epochs):
>>> ds = ray.data.range(100)
>>> ds.randomize_block_order(seed=RandomSeedConfig(seed=42, reseed_after_execution=True)).take(5)  # doctest: +SKIP
[{'id': 40}, {'id': 41}, {'id': 42}, {'id': 43}, {'id': 28}]
>>> ds.randomize_block_order(seed=RandomSeedConfig(seed=42, reseed_after_execution=True)).take(5)  # doctest: +SKIP
[{'id': 92}, {'id': 93}, {'id': 94}, {'id': 95}, {'id': 88}]
- Parameters:
seed – An optional random seed. Can be an integer or a RandomSeedConfig object. If an integer is provided, the behavior defaults to fully deterministic (same block order across executions). If None, the block order is non-deterministic. See RandomSeedConfig for more details on seed behavior.
- Returns:
The block-shuffled Dataset.
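The two seeded modes described under Parameters can be illustrated with a small pure-Python model. This is a sketch under stated assumptions, not Ray's code: `ToySeedConfig` is a hypothetical stand-in for RandomSeedConfig, and deriving the per-execution seed by adding an execution index is an assumption made here only for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class ToySeedConfig:
    """Hypothetical stand-in for RandomSeedConfig (not Ray's class)."""
    seed: int
    reseed_after_execution: bool


def effective_seed(config, execution_index):
    # Assumption: reseeding offsets the base seed by the execution index.
    if config.reseed_after_execution:
        return config.seed + execution_index
    return config.seed


def block_order(num_blocks, config, execution_index):
    """Return a permutation of block indices for one execution."""
    rng = random.Random(effective_seed(config, execution_index))
    order = list(range(num_blocks))
    rng.shuffle(order)
    return order


fixed = ToySeedConfig(seed=42, reseed_after_execution=False)
varying = ToySeedConfig(seed=42, reseed_after_execution=True)

# Three "epochs" per configuration.
fixed_orders = [block_order(20, fixed, epoch) for epoch in range(3)]
varying_orders = [block_order(20, varying, epoch) for epoch in range(3)]

# Re-running the whole job with the same config replays the same sequence.
rerun_orders = [block_order(20, varying, epoch) for epoch in range(3)]
```

In this model, `fixed_orders` contains the same permutation three times, while `varying_orders` will typically differ from epoch to epoch yet match `rerun_orders` exactly. That is the sense in which a reseeded configuration is reproducible without repeating the same order on every execution.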