Dataset.split(n: int, *, equal: bool = False, locality_hints: List[Any] | None = None) -> List[MaterializedDataset]

Materialize and split the dataset into n disjoint pieces.

This method returns a list of MaterializedDataset that can be passed to Ray Tasks and Actors and used to read the dataset rows in parallel.


This operation will trigger execution of the lazy transformations performed on this dataset.


import ray

@ray.remote
class Worker:

    def train(self, data_iterator):
        for batch in data_iterator.iter_batches(batch_size=8):
            pass

workers = [Worker.remote() for _ in range(4)]
shards = ray.data.range(100).split(n=4, equal=True)
ray.get([w.train.remote(s) for w, s in zip(workers, shards)])

Time complexity: O(1)

Parameters:

  • n – Number of child datasets to return.

  • equal – Whether to guarantee that each split has an equal number of rows. This may drop rows if the dataset's row count can't be divided evenly among the splits.

  • locality_hints – [Experimental] A list of Ray actor handles of size n. The system tries to co-locate the blocks of the i-th dataset with the i-th actor to maximize data locality.
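The equal=True guarantee can be illustrated with the underlying size arithmetic. This is a minimal sketch in plain Python, not Ray's block-level implementation, and the helper name equal_split_sizes is illustrative rather than part of the Ray API:

```python
def equal_split_sizes(num_rows: int, n: int) -> list[int]:
    """Row count of each split when equal=True: every split gets
    num_rows // n rows, and the remainder rows are dropped."""
    return [num_rows // n] * n

# 100 rows into 4 splits divides evenly: 25 rows per split.
print(equal_split_sizes(100, 4))  # → [25, 25, 25, 25]
# 100 rows into 3 splits drops 1 row: 33 rows per split.
print(equal_split_sizes(100, 3))  # → [33, 33, 33]
```

Without equal=True, the splits are only approximately equal and no rows are dropped.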


Returns:

A list of n disjoint dataset splits.

See also


Unlike split(), which splits a dataset into approximately equal splits, Dataset.split_proportionately() lets you split a dataset into different sizes.


This method is equivalent to Dataset.split_at_indices() if you compute indices manually.
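For instance, the cut points you would compute by hand for Dataset.split_at_indices() to mimic split(n) might look like this. This is a sketch of the index arithmetic only; the helper name split_cut_points is hypothetical:

```python
def split_cut_points(num_rows: int, n: int) -> list[int]:
    # n - 1 cut points that divide num_rows rows into n approximately
    # equal contiguous pieces, suitable for Dataset.split_at_indices().
    return [num_rows * (i + 1) // n for i in range(n - 1)]

# A 10-row dataset cut into 3 pieces of 3, 3, and 4 rows.
print(split_cut_points(10, 3))  # → [3, 6]
```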


Unlike split(), streaming_split() doesn’t materialize the dataset in memory.