Key Concepts#
Datasets#
A Dataset contains a list of Ray object references to blocks. Each block holds a set of items in an Arrow table, pandas DataFrame, or Python list. Having multiple blocks in a dataset allows for parallel transformation and ingest.
For ML use cases, Datasets also natively supports mixing tensor and tabular data.
There are three types of datasets:
Simple datasets – Datasets that represent a collection of Python objects
Tabular datasets – Datasets that represent columnar data
Tensor datasets – Datasets that represent a collection of ndarrays
For example, a tabular dataset might consist of three blocks, each holding 1000 rows.
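As a minimal sketch, assuming a Ray version that still provides the range_table and range_tensor helpers, each kind of dataset can be created as follows (the sizes and shapes are illustrative):

```python
import ray

# Simple dataset: a collection of Python objects (here, integers).
simple_ds = ray.data.range(1000)

# Tabular dataset: columnar data stored in Arrow blocks (a single "value" column).
tabular_ds = ray.data.range_table(1000)

# Tensor dataset: a collection of ndarrays.
tensor_ds = ray.data.range_tensor(1000, shape=(2, 2))
```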
Since a Dataset is just a list of Ray object references, it can be freely passed between Ray tasks, actors, and libraries like any other object reference. This flexibility is a unique characteristic of Ray Datasets.
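Because only object references are passed, handing a Dataset to a task is cheap. A minimal sketch:

```python
import ray

@ray.remote
def consume(ds) -> int:
    # Only block references are transferred; the blocks themselves
    # stay in the Ray object store until accessed.
    return ds.count()

ds = ray.data.range(10000)
print(ray.get(consume.remote(ds)))  # 10000
```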
Reading Data#
Datasets uses Ray tasks to read data from remote storage in parallel. Each read task reads one or more files and produces an output block.
You can manually specify the number of read tasks, but the final parallelism is always capped by the number of files in the underlying dataset.
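For instance, a read might look like the following sketch; the bucket path is hypothetical, and the parallelism argument is a request rather than a guarantee:

```python
import ray

# Requests 100 read tasks; if the directory holds fewer than 100 files,
# the effective parallelism is capped at the file count.
ds = ray.data.read_parquet("s3://example-bucket/data/", parallelism=100)
```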
For an in-depth guide on creating datasets, read Creating Datasets.
Transforming Data#
Datasets uses either Ray tasks or Ray actors to transform data blocks. By default, it uses tasks. To use actors, pass an ActorPoolStrategy to the compute argument in methods like map_batches(). ActorPoolStrategy creates an autoscaling pool of Ray actors. This allows you to cache expensive state initialization (e.g., model loading for GPU-based tasks).
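A sketch of an actor-based transform, assuming a hypothetical Model class that stands in for expensive state such as loaded model weights:

```python
import ray
from ray.data import ActorPoolStrategy

class Model:
    def __init__(self):
        # Expensive setup (e.g., loading model weights) runs once per actor
        # and is reused across all batches that actor processes.
        self.factor = 2

    def __call__(self, batch):
        batch["value"] = batch["value"] * self.factor
        return batch

ds = ray.data.range_table(1000)
ds = ds.map_batches(
    Model,
    compute=ActorPoolStrategy(min_size=1, max_size=4),  # autoscaling actor pool
    batch_format="pandas",
)
```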
For an in-depth guide on transforming datasets, read Transforming Datasets.
Shuffling Data#
Operations like sort() and groupby() require blocks to be partitioned by value or shuffled. Datasets uses tasks to shuffle blocks in a map-reduce style: map tasks partition blocks by value, and then reduce tasks merge co-partitioned blocks.
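As a brief illustration, assuming the "value" column produced by range_table:

```python
import ray

ds = ray.data.range_table(1000)

# sort() triggers a shuffle: map tasks partition blocks by key,
# then reduce tasks merge the co-partitioned blocks.
sorted_ds = ds.sort("value", descending=True)

# groupby() similarly partitions blocks by key before aggregating.
counts = ds.groupby("value").count()
```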
Call repartition() to change the number of blocks in a Dataset. Repartition has two modes:
shuffle=False – performs the minimal data movement needed to equalize block sizes
shuffle=True – performs a full distributed shuffle
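For example, a minimal sketch of both modes:

```python
import ray

ds = ray.data.range(1000)

# Minimal data movement: just equalizes block sizes across 10 blocks.
ds_even = ds.repartition(10, shuffle=False)

# Full distributed shuffle: all rows are redistributed across 10 blocks.
ds_shuffled = ds.repartition(10, shuffle=True)
```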
Datasets can shuffle hundreds of terabytes of data. For an in-depth guide on shuffle performance, read Performance Tips and Tuning.
Execution mode#
Most transformations are lazy. They don’t execute until you consume a dataset or call Dataset.materialize().
Transformations are executed in a streaming fashion: operators run in parallel and process data incrementally as it becomes available. See Streaming Execution.
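A minimal sketch of lazy, streaming execution:

```python
import ray

ds = ray.data.range(1000)
ds = ds.map_batches(lambda batch: batch)  # lazy: only records the operation

# Consuming the dataset triggers streaming execution of the whole plan:
for batch in ds.iter_batches():
    pass

# Alternatively, materialize() executes the plan and pins all blocks in memory.
ds = ds.materialize()
```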
For an in-depth guide on Datasets execution, read Execution.
Fault tolerance#
Datasets performs lineage reconstruction to recover data. If an application error or system failure occurs, Datasets recreates lost blocks by re-executing tasks.
Fault tolerance isn’t supported in two cases:
If the original worker process that created the Dataset dies. This is because the creator stores the metadata for the objects that comprise the Dataset.
If you specify compute=ActorPoolStrategy() for transformations. This is because Datasets relies on task-based fault tolerance.