Ray Data Internals#

This guide describes the implementation of Ray Data. The intended audience is advanced users and Ray Data developers.

For a gentler introduction to Ray Data, see Quickstart.

Key concepts#

Datasets and blocks#

Datasets#

Dataset is the main user-facing Python API. It represents a distributed data collection, and defines data loading and processing operations. You typically use the API in this way:

Create a Ray Dataset from external storage or in-memory data.
Apply transformations to the data.
Write the outputs to external storage or feed the outputs to training workers.

Blocks#

A block is the basic unit of data that Ray Data stores in the object store and transfers over the network. Each block contains a disjoint subset of rows, and Ray Data loads and transforms these blocks in parallel.

The following figure visualizes a dataset with three blocks, each holding 1000 rows. Ray Data holds the Dataset on the process that triggers execution (which is usually the driver) and stores the blocks as objects in Ray’s shared-memory object store.

Block formats#

Blocks are Arrow tables or pandas DataFrames. Generally, blocks are Arrow tables unless Arrow can’t represent your data.

The block format doesn’t affect the type of data returned by APIs like iter_batches().

Block size limiting#

Ray Data bounds block sizes to avoid excessive communication overhead and prevent out-of-memory errors. Small blocks reduce latency and allow finer-grained streaming execution, while large blocks reduce scheduler and communication overhead. The default range aims for a good tradeoff for most jobs.

Ray Data attempts to bound block sizes between 1 MiB and 128 MiB. To change the block size range, configure the target_min_block_size and target_max_block_size attributes of DataContext.

import ray

ctx = ray.data.DataContext.get_current()
ctx.target_min_block_size = 1 * 1024 * 1024
ctx.target_max_block_size = 128 * 1024 * 1024

Dynamic block splitting#

If a block is larger than 192 MiB (50% more than the target max size), Ray Data dynamically splits the block into smaller blocks.

To change the size at which Ray Data splits blocks, configure MAX_SAFE_BLOCK_SIZE_FACTOR. The default value is 1.5.

import ray

ray.data.context.MAX_SAFE_BLOCK_SIZE_FACTOR = 1.5

Ray Data can’t split rows. So, if your dataset contains large rows (for example, large images), then Ray Data can’t bound the block size.
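The arithmetic behind the split threshold, and the reason large rows defeat the bound, can be sketched in plain Python. This is a toy greedy packer, not Ray Data's actual splitting logic; `split_threshold` and `pack_rows` are hypothetical helpers introduced only for illustration.

```python
from typing import List

MiB = 1024 * 1024
TARGET_MAX_BLOCK_SIZE = 128 * MiB   # default target max from the text
MAX_SAFE_BLOCK_SIZE_FACTOR = 1.5    # default factor from the text


def split_threshold(target_max: int = TARGET_MAX_BLOCK_SIZE,
                    factor: float = MAX_SAFE_BLOCK_SIZE_FACTOR) -> int:
    """Size in bytes above which a block counts as too large."""
    return int(target_max * factor)


def pack_rows(row_sizes: List[int],
              target_max: int = TARGET_MAX_BLOCK_SIZE) -> List[List[int]]:
    """Greedily pack rows (given by size in bytes) into blocks.

    Assumption: a simple greedy policy stands in for the real algorithm.
    A row is never split, so a single oversized row becomes an
    oversized block of its own.
    """
    blocks: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in row_sizes:
        # Start a new block if adding this row would exceed the target.
        if current and current_size + size > target_max:
            blocks.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        blocks.append(current)
    return blocks
```

With the defaults, `split_threshold()` evaluates to 192 MiB. Packing four 64 MiB rows yields two 128 MiB blocks, while a single 200 MiB row stays in one block that exceeds the threshold because there is no smaller unit to split on.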

Shuffle Algorithms#

In data processing, shuffling is the process of redistributing rows across a dataset’s partitions, which Ray Data calls blocks.

Ray Data implements two main shuffle algorithms:

Hash-shuffling#

Note

Hash-shuffling is available starting in Ray 2.46.

Hash-shuffling is a classical hash-partitioning shuffle that works in three phases:

Partition phase: Ray Data hash-partitions the rows in every block into a specified number of partitions N, based on the values in the key columns. It uses the modulo formula hash(key-values) % N, the same scheme that hash tables commonly use.
Push phase: Ray Data pushes each block’s partition shards to the aggregating actors (called HashShuffleAggregator) that handle the respective partitions.
Reduce phase: the aggregators combine the received partition shards back into blocks, optionally applying additional transformations before producing the resulting blocks.

Hash-shuffling is particularly useful for operations that require deterministic partitioning based on keys, such as joins, group-by operations, and key-based repartitioning, because it ensures that rows with the same key values land in the same partition.
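The three phases can be illustrated with a single-process sketch. It is a simplification under stated assumptions: plain lists stand in for blocks and for the HashShuffleAggregator actors, and `zlib.crc32` over the key values stands in for whatever hash function Ray Data uses internally. The function names are hypothetical.

```python
import zlib
from collections import defaultdict
from typing import Dict, List

Row = Dict[str, object]
Block = List[Row]


def partition_id(row: Row, key_columns: List[str], num_partitions: int) -> int:
    """hash(key-values) % N, using a hash that is stable across processes."""
    key_values = tuple(row[c] for c in key_columns)
    return zlib.crc32(repr(key_values).encode("utf-8")) % num_partitions


def hash_shuffle(blocks: List[Block], key_columns: List[str],
                 num_partitions: int) -> List[Block]:
    # One "aggregator" per partition; each collects shards from all blocks.
    aggregators: Dict[int, List[Block]] = defaultdict(list)
    for block in blocks:
        # Partition phase: split this block's rows into per-partition shards.
        shards: Dict[int, Block] = defaultdict(list)
        for row in block:
            shards[partition_id(row, key_columns, num_partitions)].append(row)
        # Push phase: send each shard to the aggregator for its partition.
        for pid, shard in shards.items():
            aggregators[pid].append(shard)
    # Reduce phase: each aggregator concatenates its shards into one block.
    return [
        [row for shard in aggregators[pid] for row in shard]
        for pid in range(num_partitions)
    ]
```

Because the partition ID depends only on the key values, every row with the same key ends up in the same output block, which is the property joins and group-bys rely on.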

Note

To use hash-shuffling in your aggregations and repartitioning operations, you currently need to set ray.data.DataContext.get_current().shuffle_strategy = ShuffleStrategy.HASH_SHUFFLE before creating a Dataset.

Range-partitioning shuffle#

Range-partitioning shuffle is also a classical algorithm. It splits the dataset into a target number of ranges, determined by boundaries that approximate the real ranges of the totally ordered (sorted) dataset.

Sampling phase: Ray Data randomly samples every input block for 10 rows. It combines the samples into a single dataset, then sorts that dataset and splits it into the target number of partitions to define approximate range boundaries.
Partition phase: Ray Data sorts every block and splits it into partitions based on the range boundaries derived in the previous step.
Reduce phase: Ray Data recombines the partitions that fall within the same range to produce the resulting block.
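The sampling, partition, and reduce phases above can be sketched over plain lists of integers. This is an illustrative single-process version, assuming evenly spaced cut points over the sorted samples; the function names and the boundary selection rule are assumptions, not Ray Data's implementation.

```python
import bisect
import random
from typing import List


def sample_boundaries(blocks: List[List[int]], num_partitions: int,
                      samples_per_block: int = 10, seed: int = 0) -> List[int]:
    """Sampling phase: sample each block, sort the samples, pick cut points."""
    rng = random.Random(seed)
    samples: List[int] = []
    for block in blocks:
        samples.extend(rng.sample(block, min(samples_per_block, len(block))))
    samples.sort()
    if not samples:
        return []
    # num_partitions - 1 evenly spaced boundaries over the sorted samples.
    return [samples[(i * len(samples)) // num_partitions]
            for i in range(1, num_partitions)]


def range_shuffle(blocks: List[List[int]], num_partitions: int) -> List[List[int]]:
    boundaries = sample_boundaries(blocks, num_partitions)
    partitions: List[List[int]] = [[] for _ in range(num_partitions)]
    # Partition phase: sort each block and route values by range boundary.
    for block in blocks:
        for value in sorted(block):
            partitions[bisect.bisect_right(boundaries, value)].append(value)
    # Reduce phase: recombine the shards of each range into one sorted block.
    return [sorted(p) for p in partitions]
```

Since every value in partition i is smaller than every value in partition i + 1, concatenating the output blocks in order gives a globally sorted dataset. The boundaries are only approximate, so the output blocks can differ in size.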

Note

Range-partitioning shuffle is the default shuffle strategy. To set it explicitly, specify ray.data.DataContext.get_current().shuffle_strategy = ShuffleStrategy.SORT_SHUFFLE_PULL_BASED before creating a Dataset.

Operators, plans, and planning#

Operators#

There are two types of operators: logical operators and physical operators. Logical operators are stateless objects that describe “what” to do. Physical operators are stateful objects that describe “how” to do it. An example of a logical operator is ReadOp, and an example of a physical operator is TaskPoolMapOperator.

Plans#

A logical plan is a series of logical operators, and a physical plan is a series of physical operators. When you call APIs like ray.data.read_images() and ray.data.Dataset.map_batches(), Ray Data produces a logical plan. When execution starts, the planner generates a corresponding physical plan.

The planner#

The Ray Data planner translates logical operators to one or more physical operators. For example, the planner translates the ReadOp logical operator into two physical operators: an InputDataBuffer and TaskPoolMapOperator. Whereas the ReadOp logical operator only describes the input data, the TaskPoolMapOperator physical operator actually launches tasks to read the data.
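The one-to-many translation can be pictured with a toy planner. The classes below are hypothetical stand-ins that only borrow the operator names from the text; the rule that every non-read operator maps to a single TaskPoolMapOperator is an assumption made to keep the sketch short.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LogicalOp:
    """Stateless description of "what" to do (toy stand-in)."""
    name: str


@dataclass
class PhysicalOp:
    """Description of "how" to do it (toy stand-in)."""
    name: str
    source: str  # name of the logical operator this was planned from


def plan(logical_plan: List[LogicalOp]) -> List[PhysicalOp]:
    physical: List[PhysicalOp] = []
    for op in logical_plan:
        if op.name == "ReadOp":
            # A read becomes two physical operators, as described above.
            physical.append(PhysicalOp("InputDataBuffer", op.name))
            physical.append(PhysicalOp("TaskPoolMapOperator", op.name))
        else:
            # Assumption: treat everything else as a task-based map.
            physical.append(PhysicalOp("TaskPoolMapOperator", op.name))
    return physical
```

Planning a logical plan made of a ReadOp followed by one map-style operator therefore produces three physical operators.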

Plan optimization#

Ray Data applies optimizations to both logical and physical plans. For example, the OperatorFusionRule combines a chain of physical map operators into a single map operator. This prevents unnecessary serialization between map operators.
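The effect of fusing map operators can be sketched as function composition: instead of each map function producing an intermediate block that must be serialized and handed to the next operator, one fused function applies the whole chain in a single step. `fuse_map_fns` is a hypothetical helper, not part of Ray Data.

```python
from typing import Callable, List

Block = List[int]
MapFn = Callable[[Block], Block]


def fuse_map_fns(fns: List[MapFn]) -> MapFn:
    """Combine a chain of per-block map functions into one function."""
    def fused(block: Block) -> Block:
        # Apply each function in order without leaving the process.
        for fn in fns:
            block = fn(block)
        return block
    return fused


def add_one(block: Block) -> Block:
    return [x + 1 for x in block]


def double(block: Block) -> Block:
    return [x * 2 for x in block]


fused_fn = fuse_map_fns([add_one, double])
```

Calling `fused_fn` on a block gives the same result as running `add_one` and then `double` as separate operators, but the intermediate block never leaves the worker.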

To add custom optimization rules, implement a class that extends Rule and configure DEFAULT_LOGICAL_RULES or DEFAULT_PHYSICAL_RULES.

import ray
from ray.data._internal.logical.interfaces import Rule
from ray.data._internal.logical.optimizers import get_logical_ruleset

class CustomRule(Rule):
    def apply(self, plan):
        ...

logical_ruleset = get_logical_ruleset()
logical_ruleset.add(CustomRule)

Types of physical operators#

Physical operators take in a stream of block references and output another stream of block references. Some physical operators launch Ray Tasks and Actors to transform the blocks, and others only manipulate the references.

MapOperator is the most common operator. All read, transform, and write operations are implemented with it. To process data, MapOperator implementations use either Ray Tasks or Ray Actors.

Non-map operators include OutputSplitter and LimitOperator. These two operators manipulate references to data, but don’t launch tasks or modify the underlying data.

Execution#

The executor#

The executor schedules tasks and moves data between physical operators.

The executor and operators are located on the process where dataset execution starts. For batch inference jobs, this process is usually the driver. For training jobs, the executor runs on a special actor called SplitCoordinator which handles streaming_split().

Tasks and actors launched by operators are scheduled across the cluster, and outputs are stored in Ray’s distributed object store. The executor only manipulates references to objects and doesn’t fetch the underlying data.

Out queues#

Each physical operator has an associated out queue. When a physical operator produces outputs, the executor moves the outputs to the operator’s out queue.

Streaming execution#

In contrast to bulk synchronous execution, Ray Data’s streaming execution doesn’t wait for one operator to complete to start the next. Each operator takes in and outputs a stream of blocks. This approach allows you to process datasets that are too large to fit in your cluster’s memory.
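Python generators give a rough single-process analogy for this behavior: each stage pulls one block at a time from the previous stage, so downstream work starts before upstream work finishes. The sketch below is only an analogy. It has no parallelism or object store, and the stage names are hypothetical.

```python
from typing import Iterator, List

Block = List[int]


def read(num_blocks: int, log: List[str]) -> Iterator[Block]:
    """Upstream operator: produces blocks one at a time."""
    for i in range(num_blocks):
        log.append(f"read {i}")
        yield [i, i]


def transform(blocks: Iterator[Block], log: List[str]) -> Iterator[Block]:
    """Downstream operator: consumes each block as soon as it arrives."""
    for i, block in enumerate(blocks):
        log.append(f"transform {i}")
        yield [x * 2 for x in block]


log: List[str] = []
results = list(transform(read(2, log), log))
```

After this runs, `log` interleaves the two stages (read 0, transform 0, read 1, transform 1) rather than listing all reads first, and only one block is in flight at any time.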

The scheduling loop#

The executor runs a loop. Each step works like this:

Wait until running tasks and actors have new outputs.
Move new outputs into the appropriate operator out queues.
Choose some operators and assign new inputs to them. These operators process the new inputs either by launching new tasks or manipulating metadata.

Choosing the best operator to assign inputs is one of the most important decisions in Ray Data. This decision is critical to the performance, stability, and scalability of a Ray Data job. The executor can schedule an operator if the operator satisfies the following conditions:

The operator has inputs.
There are adequate resources available.
The operator isn’t backpressured.

If there are multiple viable operators, the executor chooses the operator with the smallest out queue.
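The selection rule described above can be written as a small filter-then-minimize function. `OpState` and `select_operator` are hypothetical names, and modeling "adequate resources" as a single CPU count is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OpState:
    """Toy snapshot of one operator's scheduling state."""
    name: str
    num_inputs: int          # inputs waiting to be processed
    out_queue_size: int      # outputs buffered in the out queue
    cpus_needed: float       # resources one new task would need
    backpressured: bool = False


def select_operator(ops: List[OpState],
                    available_cpus: float) -> Optional[OpState]:
    eligible = [
        op for op in ops
        if op.num_inputs > 0                     # the operator has inputs
        and op.cpus_needed <= available_cpus     # adequate resources
        and not op.backpressured                 # not backpressured
    ]
    if not eligible:
        return None
    # Among viable operators, prefer the one with the smallest out queue.
    return min(eligible, key=lambda op: op.out_queue_size)
```

Preferring the smallest out queue tends to favor operators whose outputs are being consumed quickly, which keeps buffered data from piling up behind slow consumers.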

Scheduling#

Ray Data uses Ray Core for execution. Below is a summary of the scheduling strategy for Ray Data:

The SPREAD scheduling strategy ensures that data blocks and map tasks are evenly balanced across the cluster.
Map operations use the SPREAD scheduling strategy if the total argument size is less than 50 MB; otherwise, they use the DEFAULT scheduling strategy.
Read operations use the SPREAD scheduling strategy.
All other operations, such as split, sort, and shuffle, use the DEFAULT scheduling strategy.
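The rules in this summary reduce to a short decision function. The sketch below is illustrative only: `pick_scheduling_strategy` and the operation labels are hypothetical, and the threshold reads "50 MB" as 50 * 1024 * 1024 bytes, which is an assumption.

```python
SPREAD = "SPREAD"
DEFAULT = "DEFAULT"

# Assumption: "50 MB" interpreted as 50 * 1024 * 1024 bytes.
LARGE_ARGS_THRESHOLD_BYTES = 50 * 1024 * 1024


def pick_scheduling_strategy(op_kind: str, total_arg_size_bytes: int = 0) -> str:
    """Return the strategy name for an operation, per the summary above."""
    if op_kind == "read":
        return SPREAD
    if op_kind == "map":
        # Small arguments: spread tasks across the cluster.
        # Large arguments: fall back to the default strategy.
        if total_arg_size_bytes < LARGE_ARGS_THRESHOLD_BYTES:
            return SPREAD
        return DEFAULT
    # split, sort, shuffle, and everything else.
    return DEFAULT
```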

Ray Data and Tune#

When using Ray Data in conjunction with Ray Tune, it’s important to ensure there are enough free CPUs for Ray Data to run on. By default, Tune tries to fully utilize cluster CPUs. This can prevent Ray Data from scheduling tasks, reducing performance or causing workloads to hang.

To ensure CPU resources are always available for Ray Data execution, limit the number of concurrent Tune trials with the max_concurrent_trials Tune option.

import ray
from ray import tune

# This workload will use spare cluster resources for execution.
def objective(*args):
    ray.data.range(10).show()

# Create a cluster with 4 CPU slots available.
ray.init(num_cpus=4)

# Setting `max_concurrent_trials=3` ensures the cluster always has a
# spare CPU for Dataset. Try setting `max_concurrent_trials=4` here,
# and notice that the experiment will appear to hang.
tuner = tune.Tuner(
    tune.with_resources(objective, {"cpu": 1}),
    tune_config=tune.TuneConfig(
        num_samples=1,
        max_concurrent_trials=3
    )
)
tuner.fit()

Memory Model#

This section describes how Ray Data manages execution and object store memory.

Ray divides each node’s memory into three pools. By default, it reserves 30% for the object store and 10% for system overhead, and treats the remainder as logical memory.

Each pool serves a different purpose:

Logical memory is what’s available for the heap of UDFs and built-in transformations like reads.
Object store holds buffered blocks.
System memory is what’s left for Ray Core (the raylet) and other processes outside your tasks.
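As a worked example of the default split, the following helper divides a node's memory using the percentages stated above. `split_node_memory` is a hypothetical function for illustration, and it uses integer percentages to avoid floating-point rounding.

```python
from typing import Dict


def split_node_memory(total_bytes: int,
                      object_store_pct: int = 30,
                      system_pct: int = 10) -> Dict[str, int]:
    """Divide node memory into the three pools using the default shares."""
    object_store = total_bytes * object_store_pct // 100
    system = total_bytes * system_pct // 100
    # Whatever remains is treated as logical memory.
    logical = total_bytes - object_store - system
    return {"object_store": object_store, "system": system, "logical": logical}


GiB = 1024 ** 3
pools = split_node_memory(64 * GiB)
```

For a 64 GiB node, this leaves roughly 19.2 GiB for the object store, 6.4 GiB for system overhead, and 38.4 GiB of logical memory for UDF heaps.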

Note

Zero-copy deserializable objects are an exception. They’re used in the UDF but accounted for only in the object store, so they serve as both the buffer and the working memory.

When a UDF processes data, it uses heap memory to do the work. For example, a UDF that calls a Torch preprocessor holds the tensors on the heap. As the UDF produces output rows or batches, Ray Data serializes them into PyArrow tables and stores them in the shared object store.

To limit object store use, Ray Data applies backpressure and stops launching tasks once enough data is buffered. If Ray Data produces more data than fits, Ray Core spills those objects to disk.

Note

A common misconception is that heavy queuing causes OOMs. While it’s true that heavy object store use contributes to worker OOMs by leaving less memory for the heaps of tasks and actors, heavy queuing doesn’t cause OOMs directly because Ray spills objects to disk. If Ray Data queues too much data, you see out-of-disk errors instead.

To limit heap memory use, Ray Data relies on memory hints from you to estimate how much heap memory each UDF needs. It passes those hints to Ray Core so the scheduler doesn’t oversubscribe the cluster. These hints don’t enforce any OS-level limit. They only guide scheduling.
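One way to reason about the effect of a memory hint is to treat it as a logical resource request, alongside CPUs, that caps how many tasks fit on a node at once. The helper below is a back-of-the-envelope sketch under that assumption; `max_concurrent_tasks` is a hypothetical name, and real scheduling involves more factors.

```python
def max_concurrent_tasks(logical_memory_bytes: int,
                         memory_hint_bytes: int,
                         num_cpus: int,
                         cpus_per_task: int = 1) -> int:
    """Upper bound on tasks a node can run at once under logical limits."""
    by_memory = logical_memory_bytes // memory_hint_bytes
    by_cpu = num_cpus // cpus_per_task
    # The tighter of the two logical limits wins.
    return min(by_memory, by_cpu)


GiB = 1024 ** 3
```

On a node with 16 GiB of logical memory and 8 CPUs, a 4 GiB hint caps concurrency at 4 tasks, while a 1 GiB hint leaves the CPU count as the limit. Because the hint is only logical, a task that actually uses more heap than it declared can still exhaust the node's memory.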