Debugging and Profiling¶
Observing Ray Work¶
You can run
ray stack to dump the stack traces of all Ray workers on
the current node. This requires
py-spy to be installed. See the Troubleshooting page for more details.
Visualizing Tasks in the Ray Timeline¶
The most important tool is the timeline visualization tool. To visualize tasks
in the Ray timeline, you can dump the timeline as a JSON file by running
timeline from the command line or by using the following command.
Then open chrome://tracing in the Chrome web browser, and load
Profiling Using Python’s CProfile¶
A second way to profile the performance of your Ray application is to use Python’s native cProfile profiling module. Rather than tracking line-by-line of your application code, cProfile can give the total runtime of each loop function, as well as list the number of calls made and execution time of all function calls made within the profiled code.
line_profiler above, this detailed list of profiled function calls
includes internal function calls and function calls made within Ray!
However, similar to
line_profiler, cProfile can be enabled with minimal
changes to your application code (given that each section of the code you want
to profile is defined as its own function). To use cProfile, add an import
statement, then replace calls to the loop functions as follows:
import cProfile # Added import statement def ex1(): list1 =  for i in range(5): list1.append(ray.get(func.remote())) def main(): ray.init() cProfile.run('ex1()') # Modified call to ex1 cProfile.run('ex2()') cProfile.run('ex3()') if __name__ == "__main__": main()
Now, when executing your Python script, a cProfile list of profiled function
calls will be outputted to terminal for each call made to
At the very top of cProfile’s output gives the total execution time for
601 function calls (595 primitive calls) in 2.509 seconds
Following is a snippet of profiled function calls for
'ex1()'. Most of
these calls are quick and take around 0.000 seconds, so the functions of
interest are the ones with non-zero execution times:
ncalls tottime percall cumtime percall filename:lineno(function) ... 1 0.000 0.000 2.509 2.509 your_script_here.py:31(ex1) 5 0.000 0.000 0.001 0.000 remote_function.py:103(remote) 5 0.000 0.000 0.001 0.000 remote_function.py:107(_remote) ... 10 0.000 0.000 0.000 0.000 worker.py:2459(__init__) 5 0.000 0.000 2.508 0.502 worker.py:2535(get) 5 0.000 0.000 0.000 0.000 worker.py:2695(get_global_worker) 10 0.000 0.000 2.507 0.251 worker.py:374(retrieve_and_deserialize) 5 0.000 0.000 2.508 0.502 worker.py:424(get_object) 5 0.000 0.000 0.000 0.000 worker.py:514(submit_task) ...
The 5 separate calls to Ray’s
get, taking the full 0.502 seconds each call,
can be noticed at
worker.py:2535(get). Meanwhile, the act of calling the
remote function itself at
remote_function.py:103(remote) only takes 0.001
seconds over 5 calls, and thus is not the source of the slow performance of
Profiling Ray Actors with cProfile¶
Considering that the detailed output of cProfile can be quite different depending on what Ray functionalities we use, let us see what cProfile’s output might look like if our example involved Actors (for an introduction to Ray actors, see our Actor documentation here).
Now, instead of looping over five calls to a remote function like in
let’s create a new example and loop over five calls to a remote function
inside an actor. Our actor’s remote function again just sleeps for 0.5
# Our actor @ray.remote class Sleeper(object): def __init__(self): self.sleepValue = 0.5 # Equivalent to func(), but defined within an actor def actor_func(self): time.sleep(self.sleepValue)
Recalling the suboptimality of
ex1, let’s first see what happens if we
attempt to perform all five
actor_func() calls within a single actor:
def ex4(): # This is suboptimal in Ray, and should only be used for the sake of this example actor_example = Sleeper.remote() five_results =  for i in range(5): five_results.append(actor_example.actor_func.remote()) # Wait until the end to call ray.get() ray.get(five_results)
We enable cProfile on this example as follows:
def main(): ray.init() cProfile.run('ex4()') if __name__ == "__main__": main()
Running our new Actor example, cProfile’s abbreviated output is as follows:
12519 function calls (11956 primitive calls) in 2.525 seconds ncalls tottime percall cumtime percall filename:lineno(function) ... 1 0.000 0.000 0.015 0.015 actor.py:546(remote) 1 0.000 0.000 0.015 0.015 actor.py:560(_remote) 1 0.000 0.000 0.000 0.000 actor.py:697(__init__) ... 1 0.000 0.000 2.525 2.525 your_script_here.py:63(ex4) ... 9 0.000 0.000 0.000 0.000 worker.py:2459(__init__) 1 0.000 0.000 2.509 2.509 worker.py:2535(get) 9 0.000 0.000 0.000 0.000 worker.py:2695(get_global_worker) 4 0.000 0.000 2.508 0.627 worker.py:374(retrieve_and_deserialize) 1 0.000 0.000 2.509 2.509 worker.py:424(get_object) 8 0.000 0.000 0.001 0.000 worker.py:514(submit_task) ...
It turns out that the entire example still took 2.5 seconds to execute, or the
time for five calls to
actor_func() to run in serial. We remember in
that this behavior was because we did not wait until after submitting all five
remote function tasks to call
ray.get(), but we can verify on cProfile’s
ray.get() was only called once at
the end, for 2.509 seconds. What happened?
It turns out Ray cannot parallelize this example, because we have only
initialized a single
Sleeper actor. Because each actor is a single,
stateful worker, our entire code is submitted and ran on a single worker the
To better parallelize the actors in
ex4, we can take advantage
that each call to
actor_func() is independent, and instead
Sleeper actors. That way, we are creating five workers
that can run in parallel, instead of creating a single worker that
can only handle one call to
actor_func() at a time.
def ex4(): # Modified to create five separate Sleepers five_actors = [Sleeper.remote() for i in range(5)] # Each call to actor_func now goes to a different Sleeper five_results =  for actor_example in five_actors: five_results.append(actor_example.actor_func.remote()) ray.get(five_results)
Our example in total now takes only 1.5 seconds to run:
1378 function calls (1363 primitive calls) in 1.567 seconds ncalls tottime percall cumtime percall filename:lineno(function) ... 5 0.000 0.000 0.002 0.000 actor.py:546(remote) 5 0.000 0.000 0.002 0.000 actor.py:560(_remote) 5 0.000 0.000 0.000 0.000 actor.py:697(__init__) ... 1 0.000 0.000 1.566 1.566 your_script_here.py:71(ex4) ... 21 0.000 0.000 0.000 0.000 worker.py:2459(__init__) 1 0.000 0.000 1.564 1.564 worker.py:2535(get) 25 0.000 0.000 0.000 0.000 worker.py:2695(get_global_worker) 3 0.000 0.000 1.564 0.521 worker.py:374(retrieve_and_deserialize) 1 0.000 0.000 1.564 1.564 worker.py:424(get_object) 20 0.001 0.000 0.001 0.000 worker.py:514(submit_task) ...
This document discusses some common problems that people run into when using Ray as well as some known problems. If you encounter other problems, please let us know.
If Ray crashed, you may wonder what happened. Currently, this can occur for some of the following reasons.
Stressful workloads: Workloads that create many many tasks in a short amount of time can sometimes interfere with the heartbeat mechanism that we use to check that processes are still alive. On the head node in the cluster, you can check the files
/tmp/ray/session_*/logs/monitor*. They will indicate which processes Ray has marked as dead (due to a lack of heartbeats). However, it is currently possible for a process to get marked as dead without actually having died.
Starting many actors: Workloads that start a large number of actors all at once may exhibit problems when the processes (or libraries that they use) contend for resources. Similarly, a script that starts many actors over the lifetime of the application will eventually cause the system to run out of file descriptors. This is addressable, but currently we do not garbage collect actor processes until the script finishes.
Running out of file descriptors: As a workaround, you may be able to increase the maximum number of file descriptors with a command like
ulimit -n 65536. If that fails, double check that the hard limit is sufficiently large by running
ulimit -Hn. If it is too small, you can increase the hard limit as follows (these instructions work on EC2).
Increase the hard ulimit for open file descriptors system-wide by running the following.
sudo bash -c "echo $USER hard nofile 65536 >> /etc/security/limits.conf"
Logout and log back in.
You just ran an application using Ray, but it wasn’t as fast as you expected it to be. Or worse, perhaps it was slower than the serial version of the application! The most common reasons are the following.
Number of cores: How many cores is Ray using? When you start Ray, it will determine the number of CPUs on each machine with
psutil.cpu_count(). Ray usually will not schedule more tasks in parallel than the number of CPUs. So if the number of CPUs is 4, the most you should expect is a 4x speedup.
Physical versus logical CPUs: Do the machines you’re running on have fewer physical cores than logical cores? You can check the number of logical cores with
psutil.cpu_count()and the number of physical cores with
psutil.cpu_count(logical=False). This is common on a lot of machines and especially on EC2. For many workloads (especially numerical workloads), you often cannot expect a greater speedup than the number of physical CPUs.
Small tasks: Are your tasks very small? Ray introduces some overhead for each task (the amount of overhead depends on the arguments that are passed in). You will be unlikely to see speedups if your tasks take less than ten milliseconds. For many workloads, you can easily increase the sizes of your tasks by batching them together.
Variable durations: Do your tasks have variable duration? If you run 10 tasks with variable duration in parallel, you shouldn’t expect an N-fold speedup (because you’ll end up waiting for the slowest task). In this case, consider using
ray.waitto begin processing tasks that finish first.
Multi-threaded libraries: Are all of your tasks attempting to use all of the cores on the machine? If so, they are likely to experience contention and prevent your application from achieving a speedup. This is very common with some versions of
numpy, and in that case can usually be setting an environment variable like
MKL_NUM_THREADS(or the equivalent depending on your installation) to
For many - but not all - libraries, you can diagnose this by opening
topwhile your application is running. If one process is using most of the CPUs, and the others are using a small amount, this may be the problem. The most common exception is PyTorch, which will appear to be using all the cores despite needing
torch.set_num_threads(1)to be called to avoid contention.
If you are still experiencing a slowdown, but none of the above problems apply, we’d really like to know! Please create a GitHub issue and consider submitting a minimal code example that demonstrates the problem.
Outdated Function Definitions¶
Due to subtleties of Python, if you redefine a remote function, you may not always get the expected behavior. In this case, it may be that Ray is not running the newest version of the function.
Suppose you define a remote function
f and then redefine it. Ray should use
the newest version.
@ray.remote def f(): return 1 @ray.remote def f(): return 2 ray.get(f.remote()) # This should be 2.
However, the following are cases where modifying the remote function will not update Ray to the new version (at least without stopping and restarting Ray).
The function is imported from an external file: In this case,
fis defined in some external file
file.py. If you
import file, change the definition of
file.py, then re-
import file, the function
fwill not be updated.
This is because the second import gets ignored as a no-op, so
fis still defined by the first import.
A solution to this problem is to use
reload(file)instead of a second
import file. Reloading causes the new definition of
fto be re-executed, and exports it to the other machines. Note that in Python 3, you need to do
from importlib import reload.
The function relies on a helper function from an external file: In this case,
fcan be defined within your Ray application, but relies on a helper function
hdefined in some external file
file.py. If the definition of
hgets changed in
fwill not update Ray to use the new version of
This is because when
ffirst gets defined, its definition is shipped to all of the workers, and is unpickled. During unpickling,
file.pygets imported in the workers. Then when
fgets redefined, its definition is again shipped and unpickled in all of the workers. But since
file.pyhas been imported in the workers already, it is treated as a second import and is ignored as a no-op.
Unfortunately, reloading on the driver does not update
h, as the reload needs to happen on the worker.
A solution to this problem is to redefine
file.pybefore it calls
h. For example, if inside
def h(): return 1
And you define remote function
@ray.remote def f(): return file.h()
You can redefine
@ray.remote def f(): reload(file) return file.h()
This forces the reload to happen on the workers as needed. Note that in Python 3, you need to do
from importlib import reload.