Pattern: Using nested tasks to achieve nested parallelism

In this pattern, a remote task can dynamically call other remote tasks (including itself) for nested parallelism. This is useful when sub-tasks can be parallelized.

Keep in mind, though, that nested tasks come with their own cost: extra worker processes, scheduling overhead, bookkeeping overhead, and so on. To achieve a speedup with nested parallelism, make sure each nested task does a significant amount of work. See "Anti-pattern: Over-parallelizing with too fine-grained tasks harms speedup" for more details.
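One rough way to judge whether a nested task does "significant work" is to time the work a single leaf task would perform before distributing it. The sketch below uses only the standard library; `time_leaf_sort` is a hypothetical helper, not part of Ray, and the sizes are illustrative assumptions.

```python
import random
import time


def time_leaf_sort(n, repeats=3):
    """Return the best-of-`repeats` wall-clock seconds to sort n random ints.

    Hypothetical helper for illustration: it measures only the local
    sorted() call that a leaf task would run, not any Ray overhead.
    """
    best = float("inf")
    for _ in range(repeats):
        data = [random.randrange(1_000_000) for _ in range(n)]
        start = time.perf_counter()
        sorted(data)
        best = min(best, time.perf_counter() - start)
    return best


# Compare candidate leaf sizes; timings depend on the machine.
for n in (50_000, 200_000):
    print(f"{n} elements: {time_leaf_sort(n):.4f} s per leaf sort")
```

If a leaf finishes in a small fraction of a second, the per-task overhead is likely to dominate, which suggests raising the size threshold at which work is handed to a new task.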

Example use case

You want to quick-sort a large list of numbers. By using nested tasks, you can sort the list in a distributed and parallel fashion: each task partitions its input and hands the two halves to further tasks.


Tree of tasks

Code example

import ray
import time
from numpy import random

def partition(collection):
    # Use the last element as the pivot
    pivot = collection.pop()
    greater, lesser = [], []
    for element in collection:
        if element > pivot:
            greater.append(element)
        else:
            lesser.append(element)
    return lesser, pivot, greater

def quick_sort(collection):
    if len(collection) <= 200000:  # magic number
        return sorted(collection)
    else:
        lesser, pivot, greater = partition(collection)
        lesser = quick_sort(lesser)
        greater = quick_sort(greater)
    return lesser + [pivot] + greater

@ray.remote
def quick_sort_distributed(collection):
    # Tiny tasks are an antipattern.
    # Thus, in our example we have a "magic number" to
    # toggle when distributed recursion should be used vs
    # when the sorting should be done locally in the task.
    # The rule of thumb is that the duration of an
    # individual task should be at least 1 second.
    if len(collection) <= 200000:  # magic number
        return sorted(collection)
    else:
        lesser, pivot, greater = partition(collection)
        # Submit both nested tasks before blocking on either result.
        lesser = quick_sort_distributed.remote(lesser)
        greater = quick_sort_distributed.remote(greater)
        return ray.get(lesser) + [pivot] + ray.get(greater)

for size in [200000, 4000000, 8000000]:
    print(f"Array size: {size}")
    unsorted = random.randint(1000000, size=(size)).tolist()
    s = time.time()
    quick_sort(unsorted)
    print(f"Sequential execution: {(time.time() - s):.3f}")
    s = time.time()
    ray.get(quick_sort_distributed.remote(unsorted))
    print(f"Distributed execution: {(time.time() - s):.3f}")
    print("--" * 10)

# Outputs:

# Array size: 200000
# Sequential execution: 0.040
# Distributed execution: 0.152
# --------------------
# Array size: 4000000
# Sequential execution: 6.161
# Distributed execution: 5.779
# --------------------
# Array size: 8000000
# Sequential execution: 15.459
# Distributed execution: 11.282
# --------------------

We call ray.get() only after both quick_sort_distributed invocations have been submitted, so the two sub-sorts can run concurrently instead of one after the other. This allows you to maximize parallelism in the workload. See "Anti-pattern: Calling ray.get in a loop harms parallelism" for more details.

Notice in the execution times above that for the smallest list the non-distributed version is faster. However, as the lists to sort grow and each task runs longer, the distributed version becomes faster.
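To get a feel for why the smallest input gains nothing, it can help to count how many tasks the recursion creates. The sketch below is an idealized model, not a measurement: `count_tasks` is a hypothetical helper that assumes every pivot splits its list into two equal halves, which real data will not do exactly.

```python
def count_tasks(n, threshold=200_000):
    """Count tasks in the recursion tree for a list of n elements.

    Assumption for illustration: every pivot splits the remaining
    elements evenly. Real pivots produce unbalanced trees.
    """
    if n <= threshold:
        return 1  # leaf task: sorts locally with sorted()
    rest = n - 1  # the pivot is removed before splitting
    left = rest // 2
    right = rest - left
    return 1 + count_tasks(left, threshold) + count_tasks(right, threshold)


for size in (200_000, 4_000_000, 8_000_000):
    print(f"{size} elements -> {count_tasks(size)} tasks")
```

Under this assumption the three sizes yield 1, 63, and 127 tasks. With 200000 elements the whole sort is a single remote task, so the distributed run pays the task overhead without any parallelism to offset it.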