ray.data.Dataset.random_sample

Dataset.random_sample(fraction: float, *, seed: int | RandomSeedConfig | None = None) → Dataset

Returns a new Dataset containing a random fraction of the rows. In other words, this method “randomly filters” the rows of the dataset without shuffling them, so the sampled rows keep their original relative order.

Note

This method returns roughly fraction * total_rows rows. An exact number of rows isn’t guaranteed.
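One way to see why the row count is only approximate is to model the sample as an independent keep-or-drop decision per row. The sketch below is plain Python, not Ray's implementation: `bernoulli_sample` is a hypothetical helper, and per-row independence is an assumption made for illustration.

```python
import random


def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`.

    Hypothetical helper for illustration only; not part of Ray.
    """
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1 (inclusive)")
    rng = random.Random(seed)
    # Single pass in the original order: rows are filtered, never reordered.
    return [row for row in rows if rng.random() < fraction]


rows = list(range(100))

# The number of kept rows varies from seed to seed around fraction * len(rows).
counts = [len(bernoulli_sample(rows, 0.1, seed=s)) for s in range(5)]
print(counts)

# The kept rows stay in their original relative order.
sample = bernoulli_sample(rows, 0.1, seed=0)
print(sample == sorted(sample))
```

Under this model the count follows a binomial distribution with mean fraction * total_rows, which is consistent with the "roughly" in the note above.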

Examples

>>> import ray
>>> from ray.data import RandomSeedConfig
>>> ds1 = ray.data.range(100)
>>> ds1.random_sample(0.1).count()  # approximately 10; the exact count varies
10
>>> # Deterministic across executions
>>> ds2 = ray.data.range(1000)
>>> ds2.random_sample(0.123, seed=42).take(2)  
[{'id': 2}, {'id': 9}]
>>> ds2.random_sample(0.123, seed=42).take(2)  
[{'id': 2}, {'id': 9}]
>>> # Different sample each execution
>>> ds2.random_sample(0.123, seed=RandomSeedConfig(seed=42, reseed_after_execution=True)).take(2)  
[{'id': 2}, {'id': 9}]
>>> ds2.random_sample(0.123, seed=RandomSeedConfig(seed=42, reseed_after_execution=True)).take(2)  
[{'id': 15}, {'id': 23}]
Parameters:
  • fraction – The fraction of elements to sample. It must be between 0 and 1 (inclusive).

  • seed – An optional random seed. Can be an integer or a RandomSeedConfig object. If an integer is provided, it defaults to fully deterministic behavior (same sample across executions). If None, the sample is non-deterministic. See RandomSeedConfig for more details on seed behavior.
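The seed modes described above can be mimicked in plain Python. This is a rough sketch under stated assumptions, not Ray code: `sample_ids` and `ReseedingSampler` are hypothetical names, and deriving a per-execution seed by adding an execution counter to the base seed is an assumption made only to illustrate the idea of reseeding after each execution.

```python
import random


def sample_ids(n, fraction, seed=None):
    """Hypothetical stand-in for a seeded sample over range(n); not Ray code."""
    rng = random.Random(seed)  # seed=None falls back to OS entropy or the clock
    return [i for i in range(n) if rng.random() < fraction]


# Integer seed: the same sample on every call.
first = sample_ids(1000, 0.123, seed=42)
second = sample_ids(1000, 0.123, seed=42)
print(first == second)  # True


class ReseedingSampler:
    """Illustrative model of reseeding between executions (an assumption)."""

    def __init__(self, seed):
        self.seed = seed
        self.executions = 0  # number of executions completed so far

    def sample(self, n, fraction):
        # Assumed scheme: offset the base seed by the execution index, so the
        # first execution matches the plain integer seed and later ones differ.
        result = sample_ids(n, fraction, seed=self.seed + self.executions)
        self.executions += 1
        return result


sampler = ReseedingSampler(seed=42)
run1 = sampler.sample(1000, 0.123)
run2 = sampler.sample(1000, 0.123)
print(run1 == first)  # True: the first execution uses the base seed
print(run1 == run2)
```

In this model the whole sequence of executions is still reproducible from the base seed, even though consecutive executions draw different samples. Whether RandomSeedConfig derives its per-execution seeds this way is not stated here; see its documentation for the actual behavior.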

Returns:

A Dataset containing the sampled rows.