mix#

Dataset.mix(*other: Dataset, weights: List[float] | None = None, stopping_condition: MixStoppingCondition = MixStoppingCondition.STOP_ON_LONGEST_DROP) → Dataset[source]#

Mix this dataset with others using weighted interleaving.

This is a streaming operator that interleaves blocks from multiple input datasets into a single output stream, respecting the target row ratio specified by weights. Each output block is drawn from exactly one input dataset; the operator tracks cumulative row counts and always pulls from whichever dataset has fallen furthest behind its target ratio.

Caution

Mixed datasets aren’t lineage-serializable. As a result, they can’t be used as a tunable hyperparameter in Ray Tune.

Examples

>>> import ray
>>> ds1 = ray.data.from_items([{"x": 1}, {"x": 2}, {"x": 3}, {"x": 4}]).repartition(2)
>>> ds2 = ray.data.from_items([{"x": 5}, {"x": 6}, {"x": 7}, {"x": 8}]).repartition(2)
>>> ds = ds1.mix(ds2, weights=[0.5, 0.5])
>>> list(ds.iter_batches(batch_size=4))
[{'x': [1, 2, 5, 6]}, {'x': [3, 4, 7, 8]}]

Parameters:

*other – The other datasets to mix with this one. All datasets must produce the same schema.
weights – Target row ratios for each dataset, where the first weight corresponds to self and subsequent weights correspond to *other in order. If None, defaults to equal weight per dataset. Weights are normalized internally so they don’t need to sum to 1.
stopping_condition – Controls when the pipeline terminates. See MixStoppingCondition for options. Defaults to STOP_ON_LONGEST_DROP.

Returns:

A new dataset whose rows are interleaved from the input datasets according to the specified weights.

Raises:

ValueError – If the length of weights doesn’t match the number of datasets.

PublicAPI (alpha): This API is in alpha and may change before becoming stable.