- Dataset.split(n: int, *, equal: bool = False, locality_hints: Optional[List[Any]] = None) → List[ray.data.dataset.MaterializedDataset]
Materialize and split the dataset into n disjoint pieces.
This method returns a list of MaterializedDataset that can be passed to Ray Tasks and Actors and used to read the dataset rows in parallel.
This operation triggers execution of the lazy transformations performed on this dataset.
```python
import ray

@ray.remote
class Worker:
    def train(self, data_iterator):
        for batch in data_iterator.iter_batches(batch_size=8):
            pass

workers = [Worker.remote() for _ in range(4)]
shards = ray.data.range(100).split(n=4, equal=True)
ray.get([w.train.remote(s) for w, s in zip(workers, shards)])
```
Time complexity: O(1)
Parameters:
- n – Number of child datasets to return.
- equal – Whether to guarantee each split has an equal number of records. This may drop records if the rows can't be divided equally among the splits.
- locality_hints – [Experimental] A list of Ray actor handles of size n. The system tries to co-locate the blocks of the i-th dataset with the i-th actor to maximize data locality.
Returns:
A list of n disjoint dataset splits.
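To make the effect of the equal flag concrete, here is a minimal pure-Python sketch (not Ray code) of how row counts might be assigned to splits; the helper name split_sizes is hypothetical and only illustrates the dropping-vs-spreading behavior described above:

```python
def split_sizes(total_rows: int, n: int, equal: bool) -> list:
    """Illustrative sketch of per-split row counts for Dataset.split.

    With equal=True, every split gets total_rows // n rows and any
    remainder is dropped; with equal=False, the remainder is spread
    over the first few splits so no rows are lost.
    """
    base, rem = divmod(total_rows, n)
    if equal:
        return [base] * n  # rem rows are dropped
    # distribute the remainder one extra row at a time
    return [base + (1 if i < rem else 0) for i in range(n)]

# 10 rows into 3 splits
print(split_sizes(10, 3, equal=True))   # -> [3, 3, 3] (1 row dropped)
print(split_sizes(10, 3, equal=False))  # -> [4, 3, 3]
```

The sketch shows why equal=True can lose rows: only a multiple of n rows can be divided evenly.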
This method is equivalent to Dataset.split_at_indices() if you compute indices manually.
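A hedged sketch of that equivalence: the cut points that reproduce an equal split can be computed by hand and passed to Dataset.split_at_indices(). The helper below is illustrative only, not part of the Ray API:

```python
def equal_split_indices(total_rows: int, n: int) -> list:
    """Cut points that divide total_rows into n equal-sized pieces.

    Returns the n - 1 interior indices; any trailing remainder
    falls into the last piece.
    """
    size = total_rows // n
    return [size * k for k in range(1, n)]

# For 100 rows and 4 splits:
print(equal_split_indices(100, 4))  # -> [25, 50, 75]
# ds.split_at_indices([25, 50, 75]) would then yield the same
# shards as ds.split(n=4, equal=True) when rows divide evenly.
```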