ray.data.Dataset.split_proportionately

Dataset.split_proportionately(proportions: List[float]) → List[MaterializedDataset][source]

Materialize and split the dataset using proportions.

A common use case for this is splitting a dataset into train and test sets (equivalent to, e.g., scikit-learn’s train_test_split). For a higher-level abstraction, see Dataset.train_test_split().
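For instance, a minimal sketch of an 80/20 train/test split on a 10-row dataset (the exact split sizes follow from the index rounding described below):

>>> import ray
>>> ds = ray.data.range(10)
>>> train, test = ds.split_proportionately([0.8])
>>> train.count()
8
>>> test.count()
2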

This method splits datasets so that every split contains at least one row. If that isn’t possible, an exception is raised.

This is equivalent to calculating the split indices manually and calling Dataset.split_at_indices().
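For illustration, a minimal sketch of that manual calculation (the rounding Ray applies internally is an implementation detail and may differ at split boundaries):

>>> import ray
>>> ds = ray.data.range(10)
>>> n = ds.count()  # Materializes the dataset to obtain the row count.
>>> indices, cumulative = [], 0.0
>>> for p in [0.2, 0.5]:
...     cumulative += p  # Cumulative proportion of rows so far.
...     indices.append(int(n * cumulative))
>>> indices
[2, 7]
>>> d1, d2, d3 = ds.split_at_indices(indices)
>>> d1.take_batch()
{'id': array([0, 1])}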

Note

This operation will trigger execution of the lazy transformations performed on this dataset.

Examples

>>> import ray
>>> ds = ray.data.range(10)
>>> d1, d2, d3 = ds.split_proportionately([0.2, 0.5])
>>> d1.take_batch()
{'id': array([0, 1])}
>>> d2.take_batch()
{'id': array([2, 3, 4, 5, 6])}
>>> d3.take_batch()
{'id': array([7, 8, 9])}

Time complexity: O(num splits)

Parameters:

proportions – List of proportions to split the dataset according to. The proportions must sum to less than 1, and each proportion must be greater than 0. A final split containing the remaining rows is appended, so N proportions yield N + 1 splits.

Returns:

The dataset splits.

See also

Dataset.split()

Unlike split_proportionately(), which produces splits of different sizes, Dataset.split() splits a dataset into approximately equal pieces.

Dataset.split_at_indices()

Dataset.split_proportionately() uses this method under the hood.

Dataset.streaming_split()

Unlike split(), streaming_split() doesn’t materialize the dataset in memory.