random

ray.data.expressions.random(*, seed: int | None = None, reseed_after_execution: bool = True) → RandomExpr

Create an expression that generates random numbers.

The expression generates one random floating-point number between 0 (inclusive) and 1 (exclusive) for each row. The generator can optionally be seeded for reproducibility.

Parameters:

seed – An optional integer seed for the random number generator. If None, uses system randomness (non-deterministic).
reseed_after_execution – Controls whether the effective seed changes between dataset executions. If False, the random number generator (RNG) is initialized with the provided seed, so each dataset execution produces the same set of random values (apart from the usual variation due to task parallelism and the ordering of the data). If True, the provided seed is treated as an “initial” seed and each dataset execution generates new random values. This is useful for reproducibility across multiple epochs in model training. Under the hood, the seed sequence used to initialize the RNG consists of three components: the index of the Ray task, the index of the dataset execution, and the provided seed. Defaults to True.

Returns:

A RandomExpr that generates random numbers.

Example

>>> from ray.data.expressions import random
>>> random()
RANDOM()

>>> from ray.data.expressions import random
>>> import ray
>>> ds = ray.data.range(10)
>>> # Add random column without seed
>>> ds.with_column("rand", random()).take(3)
[{'id': 0, 'rand': 0.013528930983987442},
 {'id': 1, 'rand': 0.7534846535881974},
 {'id': 2, 'rand': 0.13351018846379803}]

For reproducibility, we can provide an integer seed.

>>> ds.with_column("rand", random(seed=42)).take_batch(batch_size=3)
{'id': array([0, 1, 2]), 'rand': array([0.67791253, 0.48577076, 0.48211206])}

By default, reseed_after_execution is True, so each dataset execution will generate new random values. This is useful for reproducibility across multiple epochs in model training.

>>> # Same dataset but executed for the second time
>>> ds.with_column("rand", random(seed=42)).take_batch(batch_size=3)
{'id': array([0, 1, 2]), 'rand': array([0.49661147, 0.36291881, 0.8829356 ])}

When reseed_after_execution is False, the random numbers are fully reproducible across executions.

>>> # 1st execution
>>> ds.with_column("rand", random(seed=42, reseed_after_execution=False)).take_batch(batch_size=3)
{'id': array([0, 1, 2]), 'rand': array([0.23680187, 0.09952025, 0.09413677])}
>>> # 2nd execution
>>> ds.with_column("rand", random(seed=42, reseed_after_execution=False)).take_batch(batch_size=3)
{'id': array([0, 1, 2]), 'rand': array([0.23680187, 0.09952025, 0.09413677])}
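
The seeding behavior described under reseed_after_execution can be illustrated with a minimal standalone sketch. It does not use Ray and is not the library's implementation: the helpers make_rng and draw are hypothetical, the standard-library Random class stands in for whatever RNG Ray Data uses, and treating the execution index as fixed when reseed_after_execution is False is an assumption made for illustration only.

```python
from random import Random  # stdlib RNG, a stand-in for the RNG Ray Data uses


def make_rng(task_idx: int, execution_idx: int, seed: int) -> Random:
    # Hypothetical helper: fold the three seed components into one seed value.
    return Random(f"{task_idx}:{execution_idx}:{seed}")


def draw(task_idx: int, execution_idx: int, seed: int,
         reseed_after_execution: bool = True, n: int = 3) -> list[float]:
    # Assumption: when reseed_after_execution is False, the execution index
    # is ignored, so every execution derives the same seed.
    effective_execution = execution_idx if reseed_after_execution else 0
    rng = make_rng(task_idx, effective_execution, seed)
    return [rng.random() for _ in range(n)]  # floats in [0, 1)


# Same task and seed, two consecutive executions.
reseeded_first = draw(0, 0, 42)
reseeded_second = draw(0, 1, 42)
fixed_first = draw(0, 0, 42, reseed_after_execution=False)
fixed_second = draw(0, 1, 42, reseed_after_execution=False)

print(reseeded_first != reseeded_second)  # values change between executions
print(fixed_first == fixed_second)        # values repeat between executions
```

In this sketch, both modes are deterministic given the same three components. The difference is only whether the execution index contributes to the seed, which is why the reseeded values differ from one execution to the next yet can still be reproduced in a fresh run.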

PublicAPI (alpha): This API is in alpha and may change before becoming stable.