ray.data.Dataset.random_sample
- Dataset.random_sample(fraction: float, *, seed: int | RandomSeedConfig | None = None) -> Dataset
Returns a new Dataset containing a random fraction of the rows. In other words, this method “randomly filters” the rows of the dataset without shuffling them (that is, the relative order of the remaining rows is unchanged).
Note
This method returns roughly fraction * total_rows rows. An exact number of rows isn’t guaranteed.
Examples
>>> import ray
>>> from ray.data import RandomSeedConfig
>>> ds1 = ray.data.range(100)
>>> ds1.random_sample(0.1).count()
10
>>> # Deterministic across executions
>>> ds2 = ray.data.range(1000)
>>> ds2.random_sample(0.123, seed=42).take(2)
[{'id': 2}, {'id': 9}]
>>> ds2.random_sample(0.123, seed=42).take(2)
[{'id': 2}, {'id': 9}]
>>> # Different sample each execution
>>> ds2.random_sample(0.123, seed=RandomSeedConfig(seed=42, reseed_after_execution=True)).take(2)
[{'id': 2}, {'id': 9}]
>>> ds2.random_sample(0.123, seed=RandomSeedConfig(seed=42, reseed_after_execution=True)).take(2)
[{'id': 15}, {'id': 23}]
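To see why the row count is only approximate, it can help to model the sampling in plain Python. The sketch below assumes each row is kept independently with probability `fraction`, which this page does not state explicitly. `bernoulli_sample` is a hypothetical helper written for illustration and is not part of the Ray API.

```python
import random


def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`, preserving order.

    Hypothetical helper for illustration only; not Ray's implementation.
    """
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1 (inclusive)")
    rng = random.Random(seed)
    # Filtering (rather than shuffling) keeps the surviving rows in order.
    return [row for row in rows if rng.random() < fraction]


rows = list(range(1000))
sample_a = bernoulli_sample(rows, 0.123, seed=42)
sample_b = bernoulli_sample(rows, 0.123, seed=42)

# The same integer seed reproduces the same sample.
print(sample_a == sample_b)
# Rows come out in their original order.
print(sample_a == sorted(sample_a))
# The size is close to, but not necessarily equal to, fraction * total_rows.
print(len(sample_a), "rows sampled; expected about", round(0.123 * len(rows)))
```

Under this assumption the sample size follows a binomial distribution, so it clusters around `fraction * total_rows` without being pinned to it.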
- Parameters:
fraction – The fraction of elements to sample. It must be between 0 and 1 (inclusive).
seed – An optional random seed. Can be an integer or a RandomSeedConfig object. If an integer is provided, it defaults to fully deterministic behavior (same sample across executions). If None, the sample is non-deterministic. See RandomSeedConfig for more details on seed behavior.
- Returns:
Returns a Dataset containing the sampled rows.