ray.data.Dataset.random_shuffle
- Dataset.random_shuffle(*, seed: int | None = None, num_blocks: int | None = None, **ray_remote_args) → Dataset
Randomly shuffle the rows of this Dataset.
Tip
This method can be slow. For better performance, try the approach described in "Iterating over batches with shuffling". See also "Optimizing shuffles".
Note
This operation requires all inputs to be materialized in the object store before it can execute.
Examples
>>> import ray
>>> ds = ray.data.range(100)
>>> ds.random_shuffle().take(3)
[{'id': 41}, {'id': 21}, {'id': 92}]
>>> ds.random_shuffle(seed=42).take(3)
[{'id': 77}, {'id': 21}, {'id': 63}]
Time complexity: O(dataset size / parallelism)
- Parameters:
seed – Fix the random seed to use; otherwise, one is chosen based on system randomness.
num_blocks – This parameter is deprecated. It was previously intended to specify the number of output blocks in the shuffled dataset, but is no longer supported. To control the number of output blocks, use Dataset.repartition() after shuffling instead.
**ray_remote_args – Additional resource requirements to request from Ray (e.g., num_gpus=1 to request GPUs for the map tasks). See ray.remote() for details.
- Returns:
The shuffled Dataset.