ray.data.Dataset.random_shuffle

Dataset.random_shuffle(*, seed: int | RandomSeedConfig | None = None, num_blocks: int | None = None, **ray_remote_args) → Dataset

Randomly shuffle the rows of this Dataset.

Tip

This method can be slow. For better performance, try the approach described in "Iterating over batches with shuffling". See also "Optimizing shuffles".

Note

This operation requires all inputs to be materialized in the object store before it can execute.

Examples

>>> import ray
>>> from ray.data import RandomSeedConfig
>>> ds = ray.data.range(100)
>>> ds.random_shuffle().take(3)  
[{'id': 41}, {'id': 21}, {'id': 92}]
>>> ds.random_shuffle(seed=42).take(3)  
[{'id': 24}, {'id': 97}, {'id': 17}]

Fully deterministic across executions:

>>> ds = ray.data.range(100)
>>> ds.random_shuffle(seed=RandomSeedConfig(seed=42, reseed_after_execution=False)).take(3)  # doctest: +SKIP
[{'id': 24}, {'id': 97}, {'id': 17}]
>>> ds.random_shuffle(seed=RandomSeedConfig(seed=42, reseed_after_execution=False)).take(3)  # doctest: +SKIP
[{'id': 24}, {'id': 97}, {'id': 17}]

Reproducible but non-deterministic across executions (e.g., training epochs):

>>> ds = ray.data.range(100)
>>> ds.random_shuffle(seed=RandomSeedConfig(seed=42, reseed_after_execution=True)).take(3)  # doctest: +SKIP
[{'id': 29}, {'id': 79}, {'id': 39}]
>>> ds.random_shuffle(seed=RandomSeedConfig(seed=42, reseed_after_execution=True)).take(3)  # doctest: +SKIP
[{'id': 40}, {'id': 7}, {'id': 90}]
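The difference between the two seed modes can be pictured with a small standard-library model. This is a conceptual sketch only, not Ray's implementation: the helper `shuffle_once` and the way it mixes an execution index into the seed are assumptions made for illustration.

```python
import random


def shuffle_once(rows, seed, execution_index, reseed_after_execution):
    """Toy model of one shuffle execution (hypothetical helper, not Ray code)."""
    if reseed_after_execution:
        # Assumption: each execution derives its own seed from the base seed.
        rng = random.Random(f"{seed}:{execution_index}")
    else:
        # The same seed is reused for every execution.
        rng = random.Random(seed)
    out = list(rows)
    rng.shuffle(out)
    return out


rows = range(100)
fixed = [shuffle_once(rows, 42, i, False) for i in range(2)]
reseeded = [shuffle_once(rows, 42, i, True) for i in range(2)]
rerun = [shuffle_once(rows, 42, i, True) for i in range(2)]

print(fixed[0] == fixed[1])        # compares two executions with a fixed seed
print(reseeded[0] == reseeded[1])  # compares two executions with reseeding
print(reseeded == rerun)           # compares two full runs with reseeding
```

In this model, a fixed seed gives the same order on every execution, while reseeding gives each execution (for example, each training epoch) its own order that still repeats when the whole run is repeated with the same base seed.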

Time complexity: O(dataset size / parallelism)

Parameters:
  • seed – An optional random seed. Can be an integer or a RandomSeedConfig object. If an integer is provided, it defaults to fully deterministic behavior (same shuffle order across executions). If None, the shuffle is non-deterministic. See RandomSeedConfig for more details on seed behavior.

  • num_blocks – This parameter is deprecated. It was previously intended to specify the number of output blocks in the shuffled dataset, but is no longer supported. To control the number of output blocks, use Dataset.repartition() after shuffling instead.

  • **ray_remote_args – Additional resource requirements to request from Ray (e.g., num_gpus=1 to request GPUs for the map tasks). See ray.remote() for details.

Returns:

The shuffled Dataset.