ray.data.Dataset.split
- Dataset.split(n: int, *, equal: bool = False, locality_hints: List[Any] | None = None) → List[MaterializedDataset]
Materialize and split the dataset into n disjoint pieces.

This method returns a list of MaterializedDataset that can be passed to Ray Tasks and Actors and used to read the dataset rows in parallel.

Note

This operation will trigger execution of the lazy transformations performed on this dataset.
Examples
import ray

@ray.remote
class Worker:
    def train(self, data_iterator):
        # Each worker reads its shard in batches of 8 rows.
        for batch in data_iterator.iter_batches(batch_size=8):
            pass

workers = [Worker.remote() for _ in range(4)]
# Split 100 rows into 4 equal shards, one per worker.
shards = ray.data.range(100).split(n=4, equal=True)
ray.get([w.train.remote(s) for w, s in zip(workers, shards)])
Time complexity: O(1)
- Parameters:
n – Number of child datasets to return.
equal – Whether to guarantee each split has an equal number of records. This might drop records if the rows can’t be divided equally among the splits, as illustrated in the sketch below.
locality_hints – [Experimental] A list of Ray actor handles of size n. The system tries to co-locate the blocks of the i-th dataset with the i-th actor to maximize data locality.
- Returns:
A list of n disjoint dataset splits.
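To make the equal and locality_hints semantics concrete, here is a minimal sketch, assuming a running local Ray instance; the Worker class and the row counts are illustrative, not part of this API:

import ray

@ray.remote
class Worker:
    def ready(self):
        return True

# One actor handle per split; the system tries to place each shard's
# blocks near the corresponding actor.
workers = [Worker.remote() for _ in range(3)]

# 10 rows can't be divided evenly among 3 splits, so equal=True gives
# each shard the same number of rows and drops the remainder
# (3 + 3 + 3 = 9, one row dropped).
shards = ray.data.range(10).split(n=3, equal=True, locality_hints=workers)
print([s.count() for s in shards])  # expected: [3, 3, 3]

# Without equal=True, no rows are dropped, but shard sizes may differ.
shards = ray.data.range(10).split(n=3)
print(sum(s.count() for s in shards))  # expected: 10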
See also
Dataset.split_at_indices()
Unlike split(), which splits a dataset into approximately equal splits, Dataset.split_at_indices() lets you split a dataset into different sizes.

Dataset.split_proportionately()
This method is equivalent to Dataset.split_at_indices() if you compute indices manually.

Dataset.streaming_split()
Unlike split(), streaming_split() doesn’t materialize the dataset in memory.
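To make that contrast concrete, here is a minimal sketch of the streaming alternative, assuming a running local Ray instance; the Consumer class and batch size are illustrative, not part of this API:

import ray

@ray.remote
class Consumer:
    def consume(self, it):
        # `it` is a DataIterator; batches stream through on demand
        # rather than being materialized up front as split() would do.
        for batch in it.iter_batches(batch_size=8):
            pass

consumers = [Consumer.remote() for _ in range(4)]
iterators = ray.data.range(100).streaming_split(n=4, equal=True)
ray.get([c.consume.remote(it) for c, it in zip(consumers, iterators)])

The returned iterators are meant to be consumed concurrently, which the single ray.get() over all consumers accomplishes here.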