ray.data.FileShuffleConfig
class ray.data.FileShuffleConfig(seed: int | None = None, reseed_after_execution: bool = True)
Configuration for file shuffling.
This configuration object controls how files are shuffled when reading file-based datasets. The random seed behavior is determined by the combination of seed and reseed_after_execution:
- If seed is None, the random seed is always None (non-deterministic shuffling).
- If seed is not None and reseed_after_execution is False, the random seed stays fixed at seed across executions.
- If seed is not None and reseed_after_execution is True, the random seed is different for each execution.
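The three cases above can be sketched in plain Python. This is only an illustration, not Ray's implementation: effective_seed and shuffle_files are hypothetical helpers, and combining seed with execution_idx by addition is an assumption made to keep the example simple.

```python
import random
from typing import List, Optional


def effective_seed(
    seed: Optional[int], reseed_after_execution: bool, execution_idx: int
) -> Optional[int]:
    """Hypothetical helper that mirrors the three documented cases."""
    if seed is None:
        # Case 1: no seed, so shuffling is non-deterministic.
        return None
    if reseed_after_execution:
        # Case 3: the seed varies per execution. Adding execution_idx is an
        # assumption for illustration; Ray may combine the values differently.
        return seed + execution_idx
    # Case 2: the same seed is used for every execution.
    return seed


def shuffle_files(
    files: List[str],
    seed: Optional[int],
    reseed_after_execution: bool,
    execution_idx: int,
) -> List[str]:
    """Return a shuffled copy of files using the effective seed."""
    rng = random.Random(effective_seed(seed, reseed_after_execution, execution_idx))
    shuffled = list(files)
    rng.shuffle(shuffled)
    return shuffled


files = [f"part-{i:02d}.parquet" for i in range(8)]

# Fixed seed: two executions produce the same file order.
first = shuffle_files(files, seed=42, reseed_after_execution=False, execution_idx=0)
second = shuffle_files(files, seed=42, reseed_after_execution=False, execution_idx=1)
print(first == second)  # True
```

With reseed_after_execution=True, the same helper would be seeded differently for each execution_idx, so the file order would typically change from one execution to the next.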
Note
Even if you provide a seed, you might still observe a non-deterministic row order, because tasks are executed in parallel and their completion order can vary. If you need to preserve the order of rows, set DataContext.get_current().execution_options.preserve_order to True.
Parameters:
- seed – An optional integer seed for the file shuffler. If None, shuffling is non-deterministic. If provided, shuffling is deterministic based on this seed and the reseed_after_execution setting.
- reseed_after_execution – If True, the random seed is derived from both seed and execution_idx, resulting in a different shuffling order for each execution. If False, the random seed is always seed, resulting in the same shuffling order across executions. Only takes effect when seed is not None. Defaults to True.
Example
>>> import ray
>>> from ray.data import FileShuffleConfig
>>> # Fixed seed - same shuffle across executions
>>> shuffle = FileShuffleConfig(seed=42, reseed_after_execution=False)
>>> ds = ray.data.read_images("s3://anonymous@ray-example-data/batoidea", shuffle=shuffle)
>>>
>>> # Seed with reseed_after_execution - different shuffle per execution
>>> shuffle = FileShuffleConfig(seed=42, reseed_after_execution=True)
>>> ds = ray.data.read_images("s3://anonymous@ray-example-data/batoidea", shuffle=shuffle)
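A fixed seed controls only the file order, so a pipeline that also needs a repeatable row order would likely combine it with the preserve_order execution option. The following is a minimal sketch under that assumption; it reuses the example bucket from above and needs a Ray installation plus network access to run.

```python
import ray
from ray.data import DataContext, FileShuffleConfig

# Ask Ray Data to keep rows in order instead of following task completion order.
ctx = DataContext.get_current()
ctx.execution_options.preserve_order = True

# Fixed seed with no reseeding: the same file order on every execution.
shuffle = FileShuffleConfig(seed=42, reseed_after_execution=False)
ds = ray.data.read_images(
    "s3://anonymous@ray-example-data/batoidea", shuffle=shuffle
)
```

Preserving order can reduce throughput because faster tasks may have to wait for slower ones, so it is usually worth enabling only when a repeatable row order matters.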
DeveloperAPI: This API may change across minor Ray releases.