ray.data.Dataset.split
- Dataset.split(n: int, *, equal: bool = False, locality_hints: List[Any] | None = None) → List[MaterializedDataset]
Materialize and split the dataset into n disjoint pieces.

This method returns a list of MaterializedDataset that can be passed to Ray Tasks and Actors and used to read the dataset rows in parallel.

Note

This operation will trigger execution of the lazy transformations performed on this dataset.
Examples
import ray

@ray.remote
class Worker:
    def train(self, data_iterator):
        # Each worker reads its shard in batches of 8 rows.
        for batch in data_iterator.iter_batches(batch_size=8):
            pass

workers = [Worker.remote() for _ in range(4)]
# Split 100 rows into 4 equal shards, one per worker.
shards = ray.data.range(100).split(n=4, equal=True)
ray.get([w.train.remote(s) for w, s in zip(workers, shards)])
Time complexity: O(1)
- Parameters:
n – Number of child datasets to return.
equal – Whether to guarantee each split has an equal number of records. This might drop records if the rows can’t be divided equally among the splits, as illustrated in the sketch below.
locality_hints – [Experimental] A list of Ray actor handles of size n. The system tries to co-locate the blocks of the i-th dataset with the i-th actor to maximize data locality.
- Returns:
A list of n disjoint dataset splits.
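To make the equal and locality_hints semantics concrete, here is a minimal sketch, assuming a running local Ray instance; the Worker class and the row counts are illustrative, not part of this API:

import ray

@ray.remote
class Worker:
    def ready(self):
        return True

# One actor handle per split; the system tries to place each shard's
# blocks near the corresponding actor.
workers = [Worker.remote() for _ in range(3)]

# 10 rows can't be divided evenly among 3 splits, so equal=True gives
# each shard the same number of rows and drops the remainder
# (3 + 3 + 3 = 9, one row dropped).
shards = ray.data.range(10).split(n=3, equal=True, locality_hints=workers)
print([s.count() for s in shards])  # expected: [3, 3, 3]

# Without equal=True, no rows are dropped, but shard sizes may differ.
shards = ray.data.range(10).split(n=3)
print(sum(s.count() for s in shards))  # expected: 10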
See also
Dataset.split_at_indices()
Unlike split(), which splits a dataset into approximately equal splits, Dataset.split_at_indices() lets you split a dataset into different sizes.

Dataset.split_proportionately()
This method is equivalent to Dataset.split_at_indices() if you compute indices manually.

Dataset.streaming_split()
Unlike split(), streaming_split() doesn’t materialize the dataset in memory.
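To make that contrast concrete, here is a minimal sketch of the streaming alternative, assuming a running local Ray instance; the Consumer class and batch size are illustrative, not part of this API:

import ray

@ray.remote
class Consumer:
    def consume(self, it):
        # `it` is a DataIterator; batches stream through on demand
        # rather than being materialized up front as split() would do.
        for batch in it.iter_batches(batch_size=8):
            pass

consumers = [Consumer.remote() for _ in range(4)]
iterators = ray.data.range(100).streaming_split(n=4, equal=True)
ray.get([c.consume.remote(it) for c, it in zip(consumers, iterators)])

The returned iterators are meant to be consumed concurrently, which the single ray.get() over all consumers accomplishes here.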