ray.data.Dataset.mix
- Dataset.mix(*other: Dataset, weights: List[float] | None = None, stopping_condition: MixStoppingCondition = MixStoppingCondition.STOP_ON_LONGEST_DROP) → Dataset
Mix this dataset with others using weighted interleaving.
This is a streaming operator that interleaves blocks from multiple input datasets into a single output stream, respecting the target row ratio specified by weights. Each output block is drawn from exactly one input dataset; the operator tracks cumulative row counts and always pulls from whichever dataset has fallen furthest behind its target ratio.
Caution
Mixed datasets aren’t lineage-serializable. As a result, they can’t be used as a tunable hyperparameter in Ray Tune.
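The selection rule in the summary above (track cumulative row counts, pull from the dataset furthest behind its target ratio) can be illustrated with a small pure-Python sketch. This is not Ray's implementation: the function name interleave_order, the tie-breaking rule (lowest index wins), and the stopping rule (stop when the dataset that is due next has no blocks left) are all assumptions made for illustration. Actual termination is governed by MixStoppingCondition.

```python
from typing import List, Sequence


def interleave_order(
    block_rows: Sequence[Sequence[int]], weights: Sequence[float]
) -> List[int]:
    """Return the sequence of dataset indices that blocks are pulled from.

    block_rows[i] holds the row count of each block in dataset i.
    Hypothetical helper; tie-breaking and stopping are simplifying assumptions.
    """
    total_weight = float(sum(weights))
    targets = [w / total_weight for w in weights]  # normalized target ratios
    emitted = [0] * len(block_rows)  # cumulative rows emitted per dataset
    cursor = [0] * len(block_rows)   # index of the next unread block per dataset
    order: List[int] = []

    while True:
        total = sum(emitted)
        # Deficit: rows the dataset should have contributed minus rows it has.
        deficits = [targets[i] * total - emitted[i] for i in range(len(block_rows))]
        # max() returns the first maximum, so ties go to the lowest index.
        pick = max(range(len(block_rows)), key=lambda i: deficits[i])
        if cursor[pick] >= len(block_rows[pick]):
            break  # assumed stopping rule: the dataset that is due is exhausted
        emitted[pick] += block_rows[pick][cursor[pick]]
        cursor[pick] += 1
        order.append(pick)

    return order
```

With two datasets of two 2-row blocks each and equal weights, interleave_order([[2, 2], [2, 2]], [0.5, 0.5]) alternates between the inputs, which is the same block order seen in the example output.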
Examples
>>> import ray
>>> ds1 = ray.data.from_items([{"x": 1}, {"x": 2}, {"x": 3}, {"x": 4}]).repartition(2)
>>> ds2 = ray.data.from_items([{"x": 5}, {"x": 6}, {"x": 7}, {"x": 8}]).repartition(2)
>>> ds = ds1.mix(ds2, weights=[0.5, 0.5])
>>> list(ds.iter_batches(batch_size=4))
[{'x': [1, 2, 5, 6]}, {'x': [3, 4, 7, 8]}]
- Parameters:
*other – The other datasets to mix with this one. All datasets must produce the same schema.
weights – Target row ratios for each dataset, where the first weight corresponds to self and subsequent weights correspond to *other in order. If None, defaults to equal weight per dataset. Weights are normalized internally so they don’t need to sum to 1.
stopping_condition – Controls when the pipeline terminates. See MixStoppingCondition for options. Defaults to STOP_ON_LONGEST_DROP.
- Returns:
A new dataset whose rows are interleaved from the input datasets according to the specified weights.
- Raises:
ValueError – If the length of weights doesn’t match the number of datasets.
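The weights handling described above (equal weights by default, internal normalization, ValueError on a length mismatch) could look roughly like the following. This is a sketch under stated assumptions, not Ray's code: normalize_weights is a hypothetical helper, and the error message wording is invented for illustration.

```python
from typing import List, Optional, Sequence


def normalize_weights(
    num_datasets: int, weights: Optional[Sequence[float]] = None
) -> List[float]:
    """Hypothetical helper mirroring the documented weights semantics."""
    if weights is None:
        # Documented default: equal weight per dataset.
        return [1.0 / num_datasets] * num_datasets
    if len(weights) != num_datasets:
        # Documented failure mode; the message text here is an assumption.
        raise ValueError(
            f"Expected {num_datasets} weights, got {len(weights)}."
        )
    total = float(sum(weights))
    # Scale so the ratios sum to 1 without changing their proportions.
    return [w / total for w in weights]
```

Under this sketch, weights=[3, 1] and weights=[0.75, 0.25] describe the same 3:1 target ratio, which is why the inputs do not need to sum to 1.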
PublicAPI (alpha): This API is in alpha and may change before becoming stable.