ray.data.Dataset.train_test_split

Dataset.train_test_split(test_size: int | float, *, shuffle: bool = False, seed: int | None = None, stratify: str | None = None) → Tuple[MaterializedDataset, MaterializedDataset]

Materialize and split the dataset into train and test subsets.

Note

This operation will trigger execution of the lazy transformations performed on this dataset.

Examples

>>> import ray
>>> ds = ray.data.range(8)
>>> train, test = ds.train_test_split(test_size=0.25)
>>> train.take_batch()
{'id': array([0, 1, 2, 3, 4, 5])}
>>> test.take_batch()
{'id': array([6, 7])}

Parameters:

test_size – If a float, it should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the test split. If an int, it represents the absolute number of test samples. The train split always complements the test split.
shuffle – Whether to globally shuffle the dataset before splitting. Defaults to False. Shuffling can be very expensive on a large dataset.
seed – The random seed to use for the shuffle. If None, a seed is chosen based on system randomness. Ignored if shuffle=False.
stratify – Optional column name to use for stratified sampling. If provided, the splits will maintain the same proportions of each class in the stratify column across both train and test sets.
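
The two interpretations of test_size can be sketched in plain Python. The helper split_sizes below is hypothetical and not part of Ray. It assumes that a float proportion is turned into a train row count by truncation, which agrees with the 6/2 result in the example above but may differ from Ray's exact rounding and validation rules.

```python
from typing import Tuple, Union


def split_sizes(num_rows: int, test_size: Union[int, float]) -> Tuple[int, int]:
    """Return (train_rows, test_rows) for a dataset with num_rows rows.

    Hypothetical helper for illustration only; not Ray's implementation.
    """
    if isinstance(test_size, bool) or not isinstance(test_size, (int, float)):
        raise TypeError("test_size must be an int or a float")
    if isinstance(test_size, float):
        if not 0.0 < test_size < 1.0:
            raise ValueError("a float test_size must be between 0.0 and 1.0")
        # Assumption: the train share (1 - test_size) is truncated to whole rows.
        train_rows = int(num_rows * (1 - test_size))
    else:
        if not 0 < test_size < num_rows:
            raise ValueError("an int test_size must be between 0 and num_rows")
        train_rows = num_rows - test_size
    # The train split always complements the test split.
    return train_rows, num_rows - train_rows


print(split_sizes(8, 0.25))  # (6, 2)
print(split_sizes(8, 2))     # (6, 2)
```

Both calls describe the same split of eight rows: once as a proportion and once as an absolute count.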

Returns:

Train and test subsets as two MaterializedDatasets.