System Metrics
Ray exports a number of system metrics, which provide introspection into the state of Ray workloads, as well as hardware utilization statistics. The following table describes the officially supported metrics:
Note
Certain labels are common across all metrics, such as SessionName (which uniquely identifies a Ray cluster instance), instance (a per-node label applied by Prometheus), and JobId (the Ray job ID, where applicable).
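For illustration, a time series for one of these metrics, as stored by Prometheus after a scrape, might look like the following (the metric value and all label values here are hypothetical):

```
ray_tasks{Name="train_step", State="RUNNING", IsRetry="0", JobId="02000000",
          SessionName="session_2025-01-01_12-00-00_000000_42",
          instance="10.0.0.3:8085"} 4
```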
Prometheus Metric | Labels | Description
---|---|---
ray_tasks | Name, State, IsRetry | Current number of tasks (both remote functions and actor calls) by state. The State label (e.g., RUNNING, FINISHED, FAILED) describes the state of the task; see rpc::TaskState for more information. The function or method name is available as the Name label. If the task was retried due to failure or reconstruction, the IsRetry label is set to "1", otherwise "0".
ray_actors | Name, State | Current number of actors in a particular state. The State label is described by the rpc::ActorTableData proto in gcs.proto. The actor class name is available in the Name label.
ray_resources | Name, State | Logical resource usage for each node of the cluster. Each resource is reported as a quantity in either the USED or the AVAILABLE state. The Name label gives the resource name (e.g., CPU, GPU).
ray_object_store_memory | Location | Object store memory usage in bytes, broken down by logical Location (SPILLED, MMAP_DISK, MMAP_SHM, and WORKER_HEAP). SPILLED: objects that have been spilled to disk or to a remote storage solution (for example, AWS S3); disk is the default. MMAP_DISK: objects stored in memory-mapped pages on disk; this mode is very slow and only occurs under severe memory pressure. MMAP_SHM: objects stored in memory-mapped pages in shared memory; this is the default in the absence of memory pressure. WORKER_HEAP: objects, usually small ones, stored in the heap of the Ray worker process itself.
ray_placement_groups | State | Current number of placement groups by state. The State label (e.g., PENDING, CREATED, REMOVED) describes the state of the placement group; see rpc::PlacementGroupTable for more information.
ray_memory_manager_worker_eviction_total | Type, Name | The number of tasks and actors killed by the Ray Out of Memory killer (https://docs.ray.io/en/master/ray-core/scheduling/ray-oom-prevention.html), broken down by Type (task or actor) and Name (the name of the task or actor).
ray_node_cpu_utilization | | The CPU utilization per node as a percentage quantity (0..100). Scale it by the number of CPU cores per node to convert the units into cores (see the example query after this table).
ray_node_cpu_count | | The number of CPU cores per node.
ray_node_gpus_utilization | | The GPU utilization per GPU as a percentage quantity (0..NGPU*100).
ray_node_disk_usage | | The amount of disk space used per node, in bytes.
ray_node_disk_free | | The amount of disk space available per node, in bytes.
ray_node_disk_io_write_speed | | The disk write throughput per node, in bytes per second.
ray_node_disk_io_read_speed | | The disk read throughput per node, in bytes per second.
ray_node_mem_used | | The amount of physical memory used per node, in bytes.
ray_node_mem_total | | The amount of physical memory available per node, in bytes.
ray_component_uss_mb | Component | The measured unique set size in megabytes, broken down by logical Ray component. Ray components consist of system components (e.g., raylet, gcs, dashboard, or agent) and the method names of running tasks/actors.
ray_component_cpu_percentage | Component | The measured CPU percentage, broken down by logical Ray component. Ray components consist of system components (e.g., raylet, gcs, dashboard, or agent) and the method names of running tasks/actors.
ray_node_gram_used | | The amount of GPU memory used per GPU, in bytes.
ray_node_network_receive_speed | | The network receive throughput per node, in bytes per second.
ray_node_network_send_speed | | The network send throughput per node, in bytes per second.
ray_cluster_active_nodes | | The number of healthy nodes in the cluster, broken down by autoscaler node type.
ray_cluster_failed_nodes | | The number of failed nodes reported by the autoscaler, broken down by node type.
ray_cluster_pending_nodes | | The number of pending nodes reported by the autoscaler, broken down by node type.
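As an example of combining these metrics, the following PromQL sketches show how you might convert per-node CPU utilization into an approximate count of busy cores, and break object store usage down by location. The query shapes are illustrative only; adjust label filters and aggregation to your Prometheus setup.

```promql
# Approximate number of busy CPU cores per node:
# utilization is a percentage (0..100), so divide by 100 and
# multiply by the node's core count, joining on the per-node
# "instance" label added by Prometheus.
sum by (instance) (ray_node_cpu_utilization) / 100
  * on (instance) sum by (instance) (ray_node_cpu_count)

# Object store memory usage per logical location (SPILLED,
# MMAP_DISK, MMAP_SHM, WORKER_HEAP), across the whole cluster.
sum by (Location) (ray_object_store_memory)
```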
Metrics Semantics and Consistency
Ray guarantees that all of its internal state metrics are eventually consistent, even in the presence of failures: if a worker fails, the correct state is eventually reflected in the Prometheus time series. However, any particular metrics query is not guaranteed to reflect an exact snapshot of the cluster state.
For the ray_tasks and ray_actors metrics, use sum queries to plot their outputs (e.g., sum(ray_tasks) by (Name, State)). The reason is that Ray's task metrics are emitted from multiple distributed components, so there are multiple metric points, including negative ones, emitted from different processes; these must be summed to produce the correct logical view of the distributed system. For example, for a single task that is submitted and executed, Ray may emit (submitter) SUBMITTED_TO_WORKER: 1, (executor) SUBMITTED_TO_WORKER: -1, (executor) RUNNING: 1, which reduces to SUBMITTED_TO_WORKER: 0, RUNNING: 1 after summation.
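For instance, to chart only tasks that have not yet reached a terminal state, you might exclude terminal states from the sum. Treating FINISHED and FAILED as the terminal states here is an assumption for illustration; consult rpc::TaskState for the authoritative list.

```promql
# Per-function, per-state task counts, excluding tasks that have
# already finished or failed. Summing across all emitting processes
# collapses the positive and negative points into the logical count.
sum(ray_tasks{State!~"FINISHED|FAILED"}) by (Name, State)
```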