Tips for testing Ray programs#

Ray programs can be a little tricky to test because of their parallel nature. We’ve put together a list of tips and tricks for common testing practices for Ray programs.

Tip 1: Fixing the resource quantity with ray.init(num_cpus=...)#

By default, ray.init() detects the number of CPUs and GPUs on your local machine/cluster.

However, your testing environment may have significantly fewer resources. For example, the TravisCI build environment has only 2 cores.

If tests depend on ray.init() detecting resources automatically, they may implicitly assume a larger multi-core machine.

This may easily result in tests exhibiting unexpected, flaky, or faulty behavior that is hard to reproduce.

To overcome this, override the detected resources by setting them explicitly in ray.init, for example: ray.init(num_cpus=2).
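For instance, the following pins the test environment to two CPUs and no GPUs (a minimal sketch; choose the quantities your tests actually need):

import ray

# Fix the resource quantities so the test behaves the same on a
# 2-core CI runner as on a many-core developer machine.
ray.init(num_cpus=2, num_gpus=0)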

Tip 2: Sharing the Ray cluster across tests if possible#

It is safest to start a new Ray cluster for each test:

import ray
import unittest

class RayTest(unittest.TestCase):
    def setUp(self):
        ray.init(num_cpus=4, num_gpus=0)

    def tearDown(self):
        ray.shutdown()

However, starting and stopping a Ray cluster can incur a non-trivial amount of latency. For example, on a typical MacBook Pro, starting and stopping can take nearly 5 seconds:

python -c 'import ray; ray.init(); ray.shutdown()'  3.93s user 1.23s system 116% cpu 4.420 total

Across 20 tests, this adds nearly 90 seconds of overhead.

Reusing a Ray cluster across tests can provide significant speedups to your test suite. This reduces the overhead to a constant, amortized quantity:

import ray
import unittest

class RayClassTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Start it once for the entire test suite/module
        ray.init(num_cpus=4, num_gpus=0)

    @classmethod
    def tearDownClass(cls):
        ray.shutdown()

Depending on your application, there are certain cases where it may be unsafe to reuse a Ray cluster across tests. For example:

  1. If your application depends on setting environment variables per process.

  2. If your remote actor/task sets any process-level global state.
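If your test suite uses pytest instead of unittest, a module-scoped fixture achieves the same amortization. Below is a minimal sketch; the fixture name ray_cluster is illustrative:

import pytest
import ray

@pytest.fixture(scope="module")
def ray_cluster():
    # Start one Ray cluster per test module instead of one per test.
    ray.init(num_cpus=4, num_gpus=0)
    yield
    ray.shutdown()

Any test that takes ray_cluster as an argument then runs against the shared cluster.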

Tip 3: Create a mini-cluster with ray.cluster_utils.Cluster#

If you are writing an application for a cluster setting, you may want to mock a multi-node Ray cluster. This can be done with the ray.cluster_utils.Cluster utility.

Note

On Windows, support for multi-node Ray clusters is currently experimental and untested. If you run into issues, please file a report at https://github.com/ray-project/ray/issues.

from ray.cluster_utils import Cluster

# Start a head node for the cluster.
cluster = Cluster(
    initialize_head=True,
    head_node_args={
        "num_cpus": 10,
    })

After starting a cluster, you can execute a typical Ray script in the same process:

import ray

ray.init(address=cluster.address)

@ray.remote
def f(x):
    return x

# Submit the same total amount of work (1000 tasks) in batches of
# decreasing size: 1 batch of 1000, 10 batches of 100, 100 batches
# of 10, and 1000 batches of 1.
for _ in range(1):
    ray.get([f.remote(1) for _ in range(1000)])

for _ in range(10):
    ray.get([f.remote(1) for _ in range(100)])

for _ in range(100):
    ray.get([f.remote(1) for _ in range(10)])

for _ in range(1000):
    ray.get([f.remote(1) for _ in range(1)])

You can also add multiple nodes, each with different resource quantities:

mock_node = cluster.add_node(num_cpus=10)

# 10 CPUs from the head node plus 10 from the newly added node.
assert ray.cluster_resources()["CPU"] == 20

You can also remove nodes, which is useful when testing failure-handling logic:

cluster.remove_node(mock_node)

assert ray.cluster_resources()["CPU"] == 10

See the Cluster Util for more details.
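When a test finishes, shut the driver down before the mock cluster. A minimal sketch, assuming the Cluster object's shutdown() method tears down the nodes it started:

ray.shutdown()
cluster.shutdown()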

Tip 4: Be careful when running tests in parallel#

Since Ray starts a variety of services, it is easy to trigger timeouts if too many services start at once. When using tools such as pytest-xdist that run multiple tests in parallel, keep in mind that doing so may introduce flakiness into your test environment.
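If you do parallelize, consider capping the number of workers well below the machine's core count so that several Ray clusters are not starting simultaneously. For example, with pytest-xdist (the worker count of 2 and the file name are illustrative):

pytest -n 2 test_my_application.py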