ray.data.Dataset.split

Dataset.split(n: int, *, equal: bool = False, locality_hints: List[Any] | None = None) → List[MaterializedDataset]

Materialize and split the dataset into n disjoint pieces.

This method returns a list of MaterializedDataset that can be passed to Ray Tasks and Actors and used to read the dataset rows in parallel.

Note

This operation will trigger execution of the lazy transformations performed on this dataset.

Examples

import ray

@ray.remote
class Worker:

    def train(self, data_iterator):
        # Each worker iterates over its own shard in parallel.
        for batch in data_iterator.iter_batches(batch_size=8):
            pass

workers = [Worker.remote() for _ in range(4)]
shards = ray.data.range(100).split(n=4, equal=True)
ray.get([w.train.remote(s) for w, s in zip(workers, shards)])
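
The experimental locality_hints parameter accepts the consuming actor handles so the scheduler can try to co-locate each shard's blocks with the actor that reads it. A minimal variant of the example above (co-location is best-effort, not guaranteed):

workers = [Worker.remote() for _ in range(4)]
shards = ray.data.range(100).split(n=4, equal=True, locality_hints=workers)
ray.get([w.train.remote(s) for w, s in zip(workers, shards)])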

Time complexity: O(1)

Parameters:
  • n – Number of child datasets to return.

  • equal – Whether to guarantee each split has an equal number of records. This might drop records if the rows can’t be divided equally among the splits (see the sketch after this list).

  • locality_hints – [Experimental] A list of Ray actor handles of size n. The system tries to co-locate the blocks of the i-th dataset with the i-th actor to maximize data locality.
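
To make the equal flag concrete, here is a minimal sketch: 10 rows can't be divided evenly across 3 splits, so equal=True drops the one-row remainder. The printed counts are an expectation based on the drop-to-equalize behavior described above.

import ray

splits = ray.data.range(10).split(n=3, equal=True)
# One row is dropped so the three shards have equal size.
print([s.count() for s in splits])  # expected: [3, 3, 3]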

Returns:

A list of n disjoint dataset splits.

See also

Dataset.split_at_indices()

Unlike split(), which splits a dataset into approximately equal splits, Dataset.split_at_indices() lets you split a dataset into different sizes.

Dataset.split_proportionately()

This method is equivalent to Dataset.split_at_indices() if you compute indices manually.

Dataset.streaming_split()

Unlike split(), streaming_split() doesn’t materialize the dataset in memory.
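
For large datasets, Dataset.streaming_split() is often the better fit because it avoids materializing all rows in memory. A minimal sketch, reusing the Worker actor from the example above; streaming_split() returns DataIterators, which the same iter_batches() loop consumes:

workers = [Worker.remote() for _ in range(4)]
iter_shards = ray.data.range(100).streaming_split(n=4, equal=True)
ray.get([w.train.remote(s) for w, s in zip(workers, iter_shards)])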