Monitoring with the CLI or SDK#

Monitoring and debugging capabilities in Ray are available through a CLI or SDK.

CLI command ray status#

You can monitor node status and resource usage by running the CLI command ray status on the head node. It displays:

  • Node Status: Nodes that are running and autoscaling up or down. Addresses of running nodes. Information about pending nodes and failed nodes.

  • Resource Usage: The Ray resource usage of the cluster. For example, requested CPUs from all Ray Tasks and Actors. Number of GPUs that are used.

The following is example output:

$ ray status
======== Autoscaler status: 2021-10-12 13:10:21.035674 ========
Node status
---------------------------------------------------------------
Healthy:
 1 ray.head.default
 2 ray.worker.cpu
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/10.0 CPU
 0.00/70.437 GiB memory
 0.00/10.306 GiB object_store_memory

Demands:
 (no resource demands)

For more verbose information about each node, run ray status -v. This is helpful for investigating why particular nodes don't autoscale down.

Ray State CLI and SDK#

Tip

Provide feedback on using the Ray state APIs through the feedback form.

Use Ray State APIs to access the current state (snapshot) of Ray through the CLI or Python SDK (developer APIs).

Note

This feature requires a full installation of Ray using pip install "ray[default]". This feature also requires that the dashboard component is available. The dashboard component needs to be included when starting the Ray Cluster, which is the default behavior for ray start and ray.init().

Note

State API CLI commands are stable, while Python SDKs are DeveloperAPI. CLI usage is recommended over Python SDKs.

Get started#

This example uses the following script that runs two Tasks and creates two Actors.

import ray
import time

ray.init(num_cpus=4)

@ray.remote
def task_running_300_seconds():
    time.sleep(300)

@ray.remote
class Actor:
    def __init__(self):
        pass

# Create 2 tasks
tasks = [task_running_300_seconds.remote() for _ in range(2)]

# Create 2 actors
actors = [Actor.remote() for _ in range(2)]

See the summarized states of the Tasks. If the command doesn't return output immediately, retry it.

ray summary tasks
======== Tasks Summary: 2022-07-22 08:54:38.332537 ========
Stats:
------------------------------------
total_actor_scheduled: 2
total_actor_tasks: 0
total_tasks: 2


Table (group by func_name):
------------------------------------
    FUNC_OR_CLASS_NAME        STATE_COUNTS    TYPE
0   task_running_300_seconds  RUNNING: 2      NORMAL_TASK
1   Actor.__init__            FINISHED: 2     ACTOR_CREATION_TASK
from ray.util.state import summarize_tasks
print(summarize_tasks())
{'cluster': {'summary': {'task_running_300_seconds': {'func_or_class_name': 'task_running_300_seconds', 'type': 'NORMAL_TASK', 'state_counts': {'RUNNING': 2}}, 'Actor.__init__': {'func_or_class_name': 'Actor.__init__', 'type': 'ACTOR_CREATION_TASK', 'state_counts': {'FINISHED': 2}}}, 'total_tasks': 2, 'total_actor_tasks': 0, 'total_actor_scheduled': 2, 'summary_by': 'func_name'}}

List all Actors.

ray list actors
======== List: 2022-07-23 21:29:39.323925 ========
Stats:
------------------------------
Total: 2

Table:
------------------------------
    ACTOR_ID                          CLASS_NAME    NAME      PID  STATE
0  31405554844820381c2f0f8501000000  Actor                 96956  ALIVE
1  f36758a9f8871a9ca993b1d201000000  Actor                 96955  ALIVE
from ray.util.state import list_actors
print(list_actors())
[ActorState(actor_id='...', class_name='Actor', state='ALIVE', job_id='01000000', name='', node_id='...', pid=..., ray_namespace='...', serialized_runtime_env=None, required_resources=None, death_cause=None, is_detached=None, placement_group_id=None, repr_name=None), ActorState(actor_id='...', class_name='Actor', state='ALIVE', job_id='01000000', name='', node_id='...', pid=..., ray_namespace='...', serialized_runtime_env=None, required_resources=None, death_cause=None, is_detached=None, placement_group_id=None, repr_name=None)]

Get the state of a single Actor using the get API.

# In this case, 31405554844820381c2f0f8501000000
ray get actors <ACTOR_ID>
---
actor_id: 31405554844820381c2f0f8501000000
class_name: Actor
death_cause: null
is_detached: false
name: ''
pid: 96956
resource_mapping: []
serialized_runtime_env: '{}'
state: ALIVE
from ray.util.state import get_actor
# In this case, 31405554844820381c2f0f8501000000
print(get_actor(id=<ACTOR_ID>))

Access logs through the ray logs API.

ray list actors
# In this case, ACTOR_ID is 31405554844820381c2f0f8501000000
ray logs actor --id <ACTOR_ID>
--- Log has been truncated to last 1000 lines. Use `--tail` flag to toggle. ---

:actor_name:Actor
Actor created
from ray.util.state import get_log

# In this case, ACTOR_ID is 31405554844820381c2f0f8501000000
for line in get_log(actor_id=<ACTOR_ID>):
    print(line)

Key Concepts#

Ray State APIs let you access the states of resources through the summary, list, and get APIs. They also support a logs API for accessing logs. A minimal sketch mapping these concepts to SDK calls appears after the list below.

  • states: The state of the cluster for the corresponding resources. States consist of immutable metadata (e.g., an Actor's name) and mutable states (e.g., an Actor's scheduling state or PID).

  • resources: Resources created by Ray, e.g., Actors, Tasks, Objects, and placement groups.

  • summary: API to return the summarized view of resources.

  • list: API to return every individual entity of resources.

  • get: API to return a single entity of resources in detail.

  • logs: API to access the log of Actors, Tasks, Workers, or system log files.
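
The following sketch maps each of these concepts to a call from the Python SDK (ray.util.state). It is a minimal illustration against a running Ray cluster, not an exhaustive tour of the API.

from ray.util.state import (
    summarize_actors,  # summary: aggregated view, grouped by class name
    list_actors,       # list: every individual Actor
    get_actor,         # get: a single Actor in detail
    list_logs,         # logs: available log files on a node
)

print(summarize_actors())                                # summary API
alive = list_actors(filters=[("state", "=", "ALIVE")])   # list API
if alive:
    print(get_actor(id=alive[0].actor_id))               # get API
# list_logs(node_id=...) returns log file names; see the logs guide below.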

User guides#

Getting a summary of states of entities by type#

Return the summarized information for a given Ray entity type (Objects, Actors, Tasks). Start monitoring states through the summary APIs first. When you find anomalies (e.g., Actors running for a long time, Tasks that are not scheduled for a long time), use the list or get APIs to get more details about an individual abnormal entity.
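
As a concrete illustration of that drill-down workflow, the following sketch (using the Python SDK and assuming the example script above is still running) goes from a Task summary, to listing only the running Tasks, to inspecting a single Task:

from ray.util.state import summarize_tasks, list_tasks, get_task

# Step 1: summary view, useful for spotting anomalies such as many RUNNING tasks.
print(summarize_tasks())

# Step 2: narrow down to the running tasks only.
running = list_tasks(filters=[("state", "=", "RUNNING")])

# Step 3: drill into a single task with the get API.
if running:
    print(get_task(id=running[0].task_id))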

Summarize all actors

ray summary actors
from ray.util.state import summarize_actors
print(summarize_actors())
{'cluster': {'summary': {'Actor': {'class_name': 'Actor', 'state_counts': {'ALIVE': 2}}}, 'total_actors': 2, 'summary_by': 'class'}}

Summarize all tasks

ray summary tasks
from ray.util.state import summarize_tasks
print(summarize_tasks())
{'cluster': {'summary': {'task_running_300_seconds': {'func_or_class_name': 'task_running_300_seconds', 'type': 'NORMAL_TASK', 'state_counts': {'RUNNING': 2}}, 'Actor.__init__': {'func_or_class_name': 'Actor.__init__', 'type': 'ACTOR_CREATION_TASK', 'state_counts': {'FINISHED': 2}}}, 'total_tasks': 2, 'total_actor_tasks': 0, 'total_actor_scheduled': 2, 'summary_by': 'func_name'}}

Summarize all objects

Note

By default, objects are summarized by callsite. However, Ray doesn't record callsites by default. To get callsite info, set the environment variable RAY_record_ref_creation_sites=1 when starting the Ray Cluster: RAY_record_ref_creation_sites=1 ray start --head

ray summary objects
from ray.util.state import summarize_objects
print(summarize_objects())
{'cluster': {'summary': {'disabled': {'total_objects': 6, 'total_size_mb': 0.0, 'total_num_workers': 3, 'total_num_nodes': 1, 'task_state_counts': {'SUBMITTED_TO_WORKER': 2, 'FINISHED': 2, 'NIL': 2}, 'ref_type_counts': {'LOCAL_REFERENCE': 2, 'ACTOR_HANDLE': 4}}}, 'total_objects': 6, 'total_size_mb': 0.0, 'callsite_enabled': False, 'summary_by': 'callsite'}}

See the state CLI reference for more details about the ray summary command.

List the states of all entities of certain type#

Get a list of resources of a given type, such as the nodes, placement groups, Objects, Actors, and Tasks shown in the examples below.

List all nodes

ray list nodes
from ray.util.state import list_nodes
list_nodes()

List all placement groups

ray list placement-groups
from ray.util.state import list_placement_groups
list_placement_groups()

List local referenced objects created by a process

Tip

You can list resources with one or more filters, using --filter or -f.

ray list objects -f pid=<PID> -f reference_type=LOCAL_REFERENCE
from ray.util.state import list_objects
list_objects(filters=[("pid", "=", 1234), ("reference_type", "=", "LOCAL_REFERENCE")])

List alive actors

ray list actors -f state=ALIVE
from ray.util.state import list_actors
list_actors(filters=[("state", "=", "ALIVE")])

List running tasks

ray list tasks -f state=RUNNING
from ray.util.state import list_tasks
list_tasks(filters=[("state", "=", "RUNNING")])

List non-running tasks

ray list tasks -f state!=RUNNING
from ray.util.state import list_tasks
list_tasks(filters=[("state", "!=", "RUNNING")])

List running tasks that have a name func

ray list tasks -f state=RUNNING -f name="task_running_300_seconds()"
from ray.util.state import list_tasks
list_tasks(filters=[("state", "=", "RUNNING"), ("name", "=", "task_running_300_seconds()")])

List tasks with more details

Tip

When --detail is specified, the API queries more data sources to obtain state information in detail.

ray list tasks --detail
from ray.util.state import list_tasks
list_tasks(detail=True)

See the state CLI reference for more details about the ray list command.

Get the states of a particular entity (task, actor, etc.)#

Get a task’s states

ray get tasks <TASK_ID>
from ray.util.state import get_task
get_task(id=<TASK_ID>)

Get a node’s states

ray get nodes <NODE_ID>
from ray.util.state import get_node
get_node(id=<NODE_ID>)

See the state CLI reference for more details about the ray get command.

Fetch the logs of a particular entity (task, actor, etc.)#

The State API also allows you to access Ray logs. Note that you cannot access the logs of a dead node. By default, the API prints logs from the head node.

Get all retrievable log file names from a head node in a cluster

ray logs cluster
# You could get the node ID / node IP from `ray list nodes`
from ray.util.state import list_logs
# `ray logs` by default prints logs from the head node.
# To list the same logs, you should provide the head node ID.
# Get the node ID / node IP from `ray list nodes`
list_logs(node_id=<HEAD_NODE_ID>)
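
If you prefer to resolve the head node ID programmatically instead of copying it from ray list nodes, a sketch like the following can work. It assumes the NodeState objects returned by list_nodes expose an is_head_node field, which is available in recent Ray versions.

from ray.util.state import list_logs, list_nodes

# Find the head node (assumption: NodeState has an `is_head_node` field),
# then list the log files available on it.
head_node = next(node for node in list_nodes() if node.is_head_node)
print(list_logs(node_id=head_node.node_id))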

Get a particular log file from a node

# Get the node ID / node IP from `ray list nodes`
ray logs cluster gcs_server.out --node-id <NODE_ID>
# `ray logs cluster` is an alias for `ray logs` when querying with globs.
ray logs gcs_server.out --node-id <NODE_ID>
from ray.util.state import get_log

# The node ID can be retrieved from list_nodes() or ray.nodes()
for line in get_log(filename="gcs_server.out", node_id=<NODE_ID>):
    print(line)

Stream a log file from a node

# Get the node ID / node IP from `ray list nodes`
ray logs raylet.out --node-ip <NODE_IP> --follow
# Or,
ray logs cluster raylet.out --node-ip <NODE_IP> --follow
from ray.util.state import get_log

# Retrieve the Node IP from list_nodes() or ray.nodes()
# The loop blocks with `follow=True`
for line in get_log(filename="raylet.out", node_ip=<NODE_IP>, follow=True):
    print(line)

Stream log from an actor with actor id

ray logs actor --id=<ACTOR_ID> --follow
from ray.util.state import get_log

# Get the Actor's ID from the output of `ray list actors`.
# The loop blocks with `follow=True`
for line in get_log(actor_id=<ACTOR_ID>, follow=True):
    print(line)

Stream log from a pid

ray logs worker --pid=<PID> --follow
from ray.util.state import get_log

# Retrieve the node IP from list_nodes() or ray.nodes()
# You can find the PID of the worker running the Actor in the driver output,
# because worker output is directed to the driver by default.
# The loop blocks with `follow=True`
for line in get_log(pid=<PID>, node_ip=<NODE_IP>, follow=True):
    print(line)

See the state CLI reference for more details about the ray logs command.

Failure Semantics#

The State APIs don’t guarantee to return a consistent or complete snapshot of the cluster all the time. By default, all Python SDKs raise an exception when output is missing from the API. The CLI returns a partial result and provides warning messages. Here are cases where there can be missing output from the API.
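
With the Python SDK, you can either catch the exception or opt into partial results. The following sketch assumes the RayStateApiException class and the raise_on_missing_output keyword, which are available in recent Ray versions:

from ray.util.state import list_tasks
from ray.util.state.exception import RayStateApiException

try:
    tasks = list_tasks()
except RayStateApiException as e:
    # A data source was unavailable or the output was incomplete.
    print(f"State API output incomplete: {e}")
    # Accept a partial result instead of raising
    # (assumes `raise_on_missing_output` is supported in your Ray version).
    tasks = list_tasks(raise_on_missing_output=False)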

Query Failures

State APIs query “data sources” (e.g., GCS, raylets, etc.) to obtain and build the snapshot of the Cluster. However, data sources are sometimes unavailable (e.g., the source is down or overloaded). In this case, APIs return a partial (incomplete) snapshot of the Cluster, and users are informed that the output is incomplete through a warning message. All warnings are printed through Python’s warnings library, and they can be suppressed.
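
To silence these partial-output warnings when using the Python SDK, you can use the standard warnings module; this is a generic Python sketch, not a Ray-specific switch:

import warnings

from ray.util.state import summarize_tasks

# Suppress the warnings emitted through Python's `warnings` module.
# This silences all warnings raised inside the block, so scope it narrowly.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    print(summarize_tasks())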

Data Truncation

When the number of returned entities (rows) is too large (> 100K), the state APIs truncate the output data to ensure system stability (when this happens, there's no way to choose which data gets truncated). When truncation happens, the API informs you through Python's warnings module.

Garbage Collected Resources

Depending on the lifecycle of the resources, some “finished” resources are not accessible through the APIs because they are already garbage collected.

Note

Do not rely on this API to obtain correct information on finished resources. For example, Ray periodically garbage collects DEAD Actor data to reduce memory usage, and it cleans up the FINISHED state of Tasks when their lineage goes out of scope.

API Reference#

Using Ray CLI tools from outside the cluster#

These CLI commands must run on a node in the Ray Cluster. The examples below show how to execute them from a machine outside the Ray Cluster.

Execute a command on the cluster using ray exec:

$ ray exec <cluster config file> "ray status"

Execute a command on the cluster using kubectl exec with the configured RayCluster name. The following example targets the Ray head pod to execute a CLI command on the cluster.

# First, find the name of the Ray head pod.
$ kubectl get pod | grep <RayCluster name>-head
# NAME                                             READY   STATUS    RESTARTS   AGE
# <RayCluster name>-head-xxxxx                     2/2     Running   0          XXs

# Then, use the name of the Ray head pod to run `ray status`.
$ kubectl exec <RayCluster name>-head-xxxxx -- ray status