ray.data.Dataset.random_shuffle

Dataset.random_shuffle(*, seed: int | None = None, num_blocks: int | None = None, **ray_remote_args) -> Dataset

Randomly shuffle the rows of this Dataset.

Tip

This method can be slow. For better performance, consider shuffling locally while iterating over batches (see "Iterating over batches with shuffling"). See also "Optimizing shuffles".

Note

This operation requires all inputs to be materialized in the object store before it can execute.

Examples

>>> import ray
>>> ds = ray.data.range(100)
>>> ds.random_shuffle().take(3)
[{'id': 41}, {'id': 21}, {'id': 92}]
>>> ds.random_shuffle(seed=42).take(3)
[{'id': 77}, {'id': 21}, {'id': 63}]

Time complexity: O(dataset size / parallelism)

Parameters:

seed – Fix the random seed to use; otherwise, one is chosen based on system randomness.

Returns:

The shuffled Dataset.