Overview#

This section gives an overview of the monitoring and debugging tools and features available in Ray.

This page only gives a high-level description of the available tools and features. For more details, see Ray Observability.

Dashboard (Web UI)#

Ray provides a web-based dashboard to help users monitor the cluster. When a new cluster is started, the dashboard is available at the default address localhost:8265 (the port is incremented automatically if 8265 is already occupied).
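
For example, here is a minimal sketch of starting Ray locally and printing the dashboard address; the dashboard_port argument and the dashboard_url attribute are assumptions about recent Ray versions:

import ray

# Start Ray locally; the dashboard is served at localhost:8265 by default.
# dashboard_port overrides the default port (assumed available in recent Ray versions).
context = ray.init(dashboard_port=8265)
print(context.dashboard_url)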

See Ray Dashboard for more details.

Application Logging#

By default, all stdout and stderr of tasks and actors are streamed to the Ray driver (the entrypoint script that calls ray.init).

import ray

# Initiate a driver.
ray.init()


@ray.remote
def task():
    print("task")


ray.get(task.remote())


@ray.remote
class Actor:
    def ready(self):
        print("actor")


actor = Actor.remote()
ray.get(actor.ready.remote())

All stdout emitted by the print calls is printed to the driver with a prefix that contains the task or actor repr, the process ID, and the IP address.

(pid=45601) task
(Actor pid=480956) actor

See Logging for more details.

Exceptions#

Creating a new task or submitting an actor task generates an object reference. When ray.get is called on the object reference, the API raises an exception if anything goes wrong with the related task, actor, or object. For example,

  • RayTaskError is raised when user code inside a task throws an exception.

  • RayActorError is raised when an actor is dead, due to either a system failure (such as node failure) or a user-level failure (such as an exception raised from the __init__ method).

  • RuntimeEnvSetupError is raised when the actor or task couldn’t be started because a runtime environment failed to be created.
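
For instance, here is a minimal sketch of catching a RayTaskError; the faulty_task function is purely illustrative:

import ray

ray.init()


@ray.remote
def faulty_task():
    # Any exception raised in user code is wrapped in a RayTaskError.
    raise ValueError("something went wrong")


try:
    ray.get(faulty_task.remote())
except ray.exceptions.RayTaskError as e:
    print("Caught:", e)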

See Exceptions Reference for more details.

Accessing Ray States#

Starting from Ray 2.0, Ray supports CLI and Python APIs for querying the state of resources (e.g., actors, tasks, objects).

For example, the following command summarizes the task states of the cluster.

ray summary tasks
======== Tasks Summary: 2022-07-22 08:54:38.332537 ========
Stats:
------------------------------------
total_actor_scheduled: 2
total_actor_tasks: 0
total_tasks: 2


Table (group by func_name):
------------------------------------
    FUNC_OR_CLASS_NAME        STATE_COUNTS    TYPE
0   task_running_300_seconds  RUNNING: 2      NORMAL_TASK
1   Actor.__init__            FINISHED: 2     ACTOR_CREATION_TASK

The following command lists all the actors in the cluster.

ray list actors
======== List: 2022-07-23 21:29:39.323925 ========
Stats:
------------------------------
Total: 2

Table:
------------------------------
    ACTOR_ID                          CLASS_NAME    NAME      PID  STATE
0  31405554844820381c2f0f8501000000  Actor                 96956  ALIVE
1  f36758a9f8871a9ca993b1d201000000  Actor                 96955  ALIVE
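
The same information can be queried from Python. Here is a minimal sketch, assuming a Ray 2.x release where the state API is exposed under ray.util.state (the module path may differ in older versions):

import ray
from ray.util.state import list_actors, summarize_tasks

ray.init()

# Python equivalent of `ray summary tasks`.
print(summarize_tasks())

# Python equivalent of `ray list actors`.
for actor in list_actors():
    print(actor)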

See Ray State API for more details.

Debugger#

Ray has a built-in debugger that allows you to debug your distributed applications. It lets you set breakpoints in your Ray tasks and actors; when a breakpoint is hit, you drop into a PDB session that you can then use to:

  • Inspect variables in that context

  • Step within that task or actor

  • Move up or down the stack
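
For instance, here is a minimal sketch of setting a breakpoint inside a task, assuming the breakpoint()-based workflow described in the Ray Debugger guide:

import ray

ray.init()


@ray.remote
def debug_me(x):
    result = x * 2
    # Execution pauses here; attach to the PDB session with `ray debug`
    # from a terminal on the head node.
    breakpoint()
    return result


ray.get(debug_me.remote(21))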

See Ray Debugger for more details.

Monitoring Cluster State and Resource Demands#

You can monitor cluster usage and autoscaling status by running the CLI command ray status on the head node. It displays:

  • Cluster State: Nodes that are up and running, addresses of running nodes, and information about pending and failed nodes.

  • Autoscaling Status: The number of nodes that are autoscaling up and down.

  • Cluster Usage: The resource usage of the cluster, e.g., CPUs requested by all Ray tasks and actors, and the number of GPUs in use.

Here’s an example output.

$ ray status
======== Autoscaler status: 2021-10-12 13:10:21.035674 ========
Node status
---------------------------------------------------------------
Healthy:
 1 ray.head.default
 2 ray.worker.cpu
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/10.0 CPU
 0.00/70.437 GiB memory
 0.00/10.306 GiB object_store_memory

Demands:
 (no resource demands)

Metrics#

Ray collects and exposes physical stats (e.g., CPU, memory, GRAM, disk, and network usage of each node), internal stats (e.g., the number of actors in the cluster, the number of worker failures in the cluster), and custom metrics (e.g., metrics defined by users). All stats can be exported as time-series data (to Prometheus by default) and used to monitor the cluster over time.
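
For example, here is a minimal sketch of defining a custom metric from an actor, assuming the Counter API from ray.util.metrics (the class and metric names used here are illustrative):

import ray
from ray.util.metrics import Counter

ray.init()


@ray.remote
class RequestHandler:
    def __init__(self):
        # A user-defined metric, exported alongside Ray's built-in stats.
        self.num_requests = Counter(
            "num_requests_total",
            description="Total number of handled requests.",
        )

    def handle(self):
        self.num_requests.inc()


handler = RequestHandler.remote()
ray.get(handler.handle.remote())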

See Ray Metrics for more details.

Profiling#

Ray is compatible with Python profiling tools such as cProfile. It also provides built-in profiling tools such as ray timeline.
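
For example, here is a minimal sketch of profiling driver-side code with cProfile; the work function is illustrative, and Ray's own tools such as ray timeline cover what happens inside the cluster:

import cProfile

import ray

ray.init()


@ray.remote
def work():
    return sum(range(10**6))


# Profile the driver code that submits tasks and waits for their results.
cProfile.run("ray.get([work.remote() for _ in range(10)])")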

See Profiling for more details.

Tracing#

To help debug and monitor Ray applications, Ray supports distributed tracing (integration with OpenTelemetry) across tasks and actors.

See Ray Tracing for more details.