Anti-pattern: Over-parallelizing with too fine-grained tasks harms speedup#

TLDR: Avoid over-parallelizing. Parallelizing tasks has higher overhead than using normal functions.

Parallelizing or distributing tasks usually comes with higher overhead than an ordinary function call. Therefore, if you parallelize a function that executes very quickly, the overhead could take longer than the actual function call!

To handle this problem, we should be careful about parallelizing too much. If you have a function or task that’s too small, you can use a technique called batching to make your tasks do more meaningful work in a single call.

Code example#

Anti-pattern:

import ray
import time
import itertools

ray.init()

numbers = list(range(10000))


def double(number):
    time.sleep(0.00001)
    return number * 2


start_time = time.time()
serial_doubled_numbers = [double(number) for number in numbers]
end_time = time.time()
print(f"Ordinary funciton call takes {end_time - start_time} seconds")
# Ordinary funciton call takes 0.16506004333496094 seconds


@ray.remote
def remote_double(number):
    return double(number)


start_time = time.time()
doubled_number_refs = [remote_double.remote(number) for number in numbers]
parallel_doubled_numbers = ray.get(doubled_number_refs)
end_time = time.time()
print(f"Parallelizing tasks takes {end_time - start_time} seconds")
# Parallelizing tasks takes 1.6061789989471436 seconds

Better approach: Use batching.

@ray.remote
def remote_double_batch(numbers):
    return [double(number) for number in numbers]


BATCH_SIZE = 1000
start_time = time.time()
doubled_batch_refs = []
for i in range(0, len(numbers), BATCH_SIZE):
    batch = numbers[i : i + BATCH_SIZE]
    doubled_batch_refs.append(remote_double_batch.remote(batch))
parallel_doubled_numbers_with_batching = list(
    itertools.chain(*ray.get(doubled_batch_refs))
)
end_time = time.time()
print(f"Parallelizing tasks with batching takes {end_time - start_time} seconds")
# Parallelizing tasks with batching takes 0.030150890350341797 seconds

As we can see from the example above, over-parallelizing has higher overhead and the program runs slower than the serial version. Through batching with a proper batch size, we are able to amortize the overhead and achieve the expected speedup.