Key Concepts

Datasets

A Dataset contains a list of Ray object references to blocks. Each block holds a set of items in an Arrow table, pandas DataFrame, or Python list. Having multiple blocks in a dataset allows for parallel transformation and ingest.
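As a minimal sketch of this structure (exact block counts vary by Ray version, data size, and cluster resources):

```python
import ray

# Create a dataset of 10,000 rows, split across multiple blocks.
ds = ray.data.range(10000)

print(ds.num_blocks())  # number of blocks backing the dataset
print(ds.take(3))       # peek at a few rows
```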

For ML use cases, Datasets also natively supports mixing tensors and tabular data.

There are three types of datasets, one per block format: simple datasets (Python list blocks), Arrow datasets (Arrow table blocks), and pandas datasets (pandas DataFrame blocks).

The following figure visualizes a tabular dataset with three blocks, each holding 1000 rows:

[Figure: dataset-arch.svg]

Since a Dataset is just a list of Ray object references, it can be freely passed between Ray tasks, actors, and libraries like any other object reference. This flexibility is a unique characteristic of Ray Datasets.
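For example, a dataset handle can be passed into a Ray task like any other argument; only the lightweight handle is serialized, while the blocks stay in the object store. A short sketch:

```python
import ray

@ray.remote
def count_rows(ds):
    # The task receives the dataset handle; blocks remain in the object store.
    return ds.count()

ds = ray.data.range(1000)
print(ray.get(count_rows.remote(ds)))  # -> 1000
```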

Reading Data

Datasets uses Ray tasks to read data from remote storage in parallel. Each read task reads one or more files and produces an output block:

[Figure: dataset-read.svg]

You can manually specify the number of read tasks, but the final parallelism is always capped by the number of files in the underlying dataset.
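For example (a sketch: the S3 path is hypothetical, and this assumes a Ray version whose read APIs accept a parallelism argument):

```python
import ray

# Request up to 4 read tasks; the achieved parallelism is still capped
# by the number of files under the given path.
ds = ray.data.read_parquet("s3://my-bucket/data/", parallelism=4)
```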

For an in-depth guide on creating datasets, read Creating Datasets.

Transforming Data

Datasets uses either Ray tasks or Ray actors to transform data blocks. By default, Datasets uses tasks.

To use actors, pass an ActorPoolStrategy to the compute argument of methods like map_batches(). ActorPoolStrategy creates an autoscaling pool of Ray actors, which lets you cache expensive state initialization (e.g., loading a model for GPU-based inference).
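A minimal sketch of this pattern (the Model class, its identity-function body, and the pool sizes are illustrative, not part of the Ray API; some newer Ray versions express the pool via a concurrency argument instead):

```python
import ray
from ray.data import ActorPoolStrategy

class Model:
    def __init__(self):
        # Expensive setup runs once per actor and is reused across batches.
        self.model = lambda batch: batch  # stand-in for a real model

    def __call__(self, batch):
        return self.model(batch)

ds = ray.data.range(10000)
# An autoscaling pool of 2 to 8 actors applies Model to each batch.
ds = ds.map_batches(Model, compute=ActorPoolStrategy(min_size=2, max_size=8))
```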

[Figure: dataset-map.svg]

For an in-depth guide on transforming datasets, read Transforming Datasets.

Shuffling Data

Operations like sort() and groupby() require blocks to be partitioned by value or shuffled. Datasets uses tasks to shuffle blocks in a map-reduce style: map tasks partition blocks by value and then reduce tasks merge co-partitioned blocks.
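For instance, both of the following operations trigger a shuffle (a sketch assuming a Ray version where ray.data.range produces an "id" column):

```python
import ray

ds = ray.data.range(10000)
ds = ds.sort("id")                 # range-partitions blocks, then merges them
counts = ds.groupby("id").count()  # partitions rows by key, then reduces per group
```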

Call repartition() to change the number of blocks in a Dataset. Repartition has two modes (see the sketch after this list):

  • shuffle=False - performs the minimal data movement needed to equalize block sizes

  • shuffle=True - performs a full distributed shuffle
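A minimal sketch of both modes:

```python
import ray

ds = ray.data.range(10000)
ds = ds.repartition(10)                # shuffle=False: minimal data movement
ds = ds.repartition(10, shuffle=True)  # shuffle=True: full distributed shuffle
```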

[Figure: dataset-shuffle.svg]

Datasets can shuffle hundreds of terabytes of data. For an in-depth guide on shuffle performance, read Performance Tips and Tuning.

Execution mode

Most transformations are lazy. They don’t execute until you consume a dataset or call Dataset.materialize().

Transformations execute in a streaming fashion: data is processed incrementally, with operators running in parallel. See Streaming Execution for details.
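A small sketch of lazy execution (the identity lambda is a placeholder transformation):

```python
import ray

ds = ray.data.range(10000)
ds = ds.map_batches(lambda batch: batch)  # lazy: no work happens yet
ds = ds.materialize()                     # executes the pipeline and pins the results
# Consuming the dataset, e.g. ds.take(5) or iterating batches, also triggers execution.
```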

For an in-depth guide on Datasets execution, read Execution.

Fault tolerance

Datasets performs lineage reconstruction to recover data. If an application error or system failure occurs, Datasets recreates lost blocks by re-executing tasks.

Fault tolerance isn’t supported in two cases:

  • If the original worker process that created the Dataset dies. This is because the creator stores the metadata for the objects that comprise the Dataset.

  • If you specify compute=ActorPoolStrategy() for transformations. This is because Datasets relies on task-based fault tolerance.