KubeRay metrics references

controller-runtime metrics

KubeRay exposes the metrics provided by kubernetes-sigs/controller-runtime, such as reconciliation and work queue metrics, to help users operate the KubeRay operator in production environments.

For more details about the default metrics provided by kubernetes-sigs/controller-runtime, see Default Exported Metrics References.

KubeRay custom metrics

Starting with KubeRay 1.4.0, KubeRay provides metrics for its custom resources to help users better understand Ray clusters and Ray applications.

To view these metrics, run the following commands:

# Forward local port 8080 to the KubeRay operator service.
# This command keeps running in the foreground.
kubectl port-forward service/kuberay-operator 8080

# In a second terminal, view the metrics.
curl localhost:8080/metrics

# If a RayCluster already exists, the output includes metrics like the following:
# kuberay_cluster_info{name="raycluster-kuberay",namespace="default",owner_kind="None"} 1
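
Because the same endpoint also serves the controller-runtime metrics, it can help to filter the output down to the KubeRay custom metrics. The following is a minimal sketch, assuming that the custom metric names all start with the `kuberay_` prefix, as the metrics listed on this page do. The helper name `kuberay_custom_metrics` is hypothetical and not part of KubeRay, and the `workqueue_depth` sample line is made up for the demo.

```shell
# Hypothetical helper: keep only KubeRay custom metric samples from
# Prometheus text-format input on stdin. Assumes the custom metrics share
# the kuberay_ prefix; "# HELP" and "# TYPE" comment lines are dropped.
kuberay_custom_metrics() {
  grep '^kuberay_' || true
}

# Demo on a canned sample, so the snippet runs without a cluster.
sample='# TYPE kuberay_cluster_info gauge
kuberay_cluster_info{name="raycluster-kuberay",namespace="default",owner_kind="None"} 1
workqueue_depth{name="raycluster"} 0'

printf '%s\n' "$sample" | kuberay_custom_metrics
# Prints:
# kuberay_cluster_info{name="raycluster-kuberay",namespace="default",owner_kind="None"} 1
```

Against a live operator, with the port-forward still running, the same helper can be used as `curl -s localhost:8080/metrics | kuberay_custom_metrics`.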

RayCluster metrics

| Metric name | Type | Description | Labels |
|---|---|---|---|
| `kuberay_cluster_info` | Gauge | Metadata information about RayCluster custom resources. | `namespace: <RayCluster-namespace>`, `name: <RayCluster-name>`, `owner_kind: <RayJob\|RayService\|None>` |
| `kuberay_cluster_condition_provisioned` | Gauge | Indicates whether the RayCluster is provisioned. See RayClusterProvisioned for more information. | `namespace: <RayCluster-namespace>`, `name: <RayCluster-name>`, `condition: <true\|false>` |
| `kuberay_cluster_provisioned_duration_seconds` | Gauge | The time, in seconds, that a RayCluster takes for its RayClusterProvisioned status to transition from false (or unset) to true. | `namespace: <RayCluster-namespace>`, `name: <RayCluster-name>` |
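
The `owner_kind` label on `kuberay_cluster_info` makes it possible to tell standalone RayClusters apart from those created by a RayJob or RayService. The following is a minimal sketch that prints each cluster with its owner kind from the metrics text. The helper names `label_value` and `list_rayclusters` are hypothetical, the second sample line is made up for the demo, and the parsing assumes that label values contain no escaped quotes.

```shell
# Hypothetical helper: print the value of label $1 from one sample line $2.
# Assumes label values contain no escaped double quotes.
label_value() {
  printf '%s\n' "$2" | sed -n "s/.*[{,]$1=\"\([^\"]*\)\".*/\1/p"
}

# Print "namespace/name owner_kind" for every kuberay_cluster_info
# sample read from stdin.
list_rayclusters() {
  grep '^kuberay_cluster_info{' | while IFS= read -r line; do
    printf '%s/%s %s\n' \
      "$(label_value namespace "$line")" \
      "$(label_value name "$line")" \
      "$(label_value owner_kind "$line")"
  done
}

# Demo on a canned sample, so the snippet runs without a cluster.
sample='kuberay_cluster_info{name="raycluster-kuberay",namespace="default",owner_kind="None"} 1
kuberay_cluster_info{name="rayjob-sample-raycluster",namespace="default",owner_kind="RayJob"} 1'

printf '%s\n' "$sample" | list_rayclusters
# Prints:
# default/raycluster-kuberay None
# default/rayjob-sample-raycluster RayJob
```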

RayService metrics

| Metric name | Type | Description | Labels |
|---|---|---|---|
| `kuberay_service_info` | Gauge | Metadata information about RayService custom resources. | `namespace: <RayService-namespace>`, `name: <RayService-name>` |
| `kuberay_service_condition_ready` | Gauge | Describes whether the RayService is ready. Ready means users can send requests to the underlying cluster and the number of serve endpoints is greater than 0. See RayServiceReady for more information. | `namespace: <RayService-namespace>`, `name: <RayService-name>` |
| `kuberay_service_condition_upgrade_in_progress` | Gauge | Describes whether the RayService is performing a zero-downtime upgrade. See UpgradeInProgress for more information. | `namespace: <RayService-namespace>`, `name: <RayService-name>` |

RayJob metrics

| Metric name | Type | Description | Labels |
|---|---|---|---|
| `kuberay_job_info` | Gauge | Metadata information about RayJob custom resources. | `namespace: <RayJob-namespace>`, `name: <RayJob-name>` |
| `kuberay_job_deployment_status` | Gauge | The RayJob’s current deployment status. | `namespace: <RayJob-namespace>`, `name: <RayJob-name>`, `deployment_status: <New\|Initializing\|Running\|Complete\|Failed\|Suspending\|Suspended\|Retrying\|Waiting>` |
| `kuberay_job_execution_duration_seconds` | Gauge | The time, in seconds, that the RayJob CR’s JobDeploymentStatus takes to transition from Initializing to either the Retrying state or a terminal state, such as Complete or Failed. The Retrying state indicates that the CR previously failed and that spec.backoffLimit is enabled. | `namespace: <RayJob-namespace>`, `name: <RayJob-name>`, `job_deployment_status: <Complete\|Failed>`, `retry_count: <count>` |
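
A quick way to summarize RayJobs is to count `kuberay_job_deployment_status` series per `deployment_status` label value. The following is a minimal sketch under two assumptions that this page does not confirm: that the operator reports a RayJob's current status as a series with value 1, and that each sample line ends with the value (no trailing timestamp). The helper name `count_by_deployment_status` and the sample RayJob names are hypothetical.

```shell
# Hypothetical helper: count RayJobs per deployment status from
# Prometheus text-format input on stdin.
# Assumption: a series with value 1 marks the RayJob's current status.
count_by_deployment_status() {
  grep '^kuberay_job_deployment_status{' |
    awk '$NF == 1' |
    sed -n 's/.*[{,]deployment_status="\([^"]*\)".*/\1/p' |
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Demo on a canned sample, so the snippet runs without a cluster.
sample='kuberay_job_deployment_status{deployment_status="Running",name="rayjob-a",namespace="default"} 1
kuberay_job_deployment_status{deployment_status="Running",name="rayjob-b",namespace="default"} 1
kuberay_job_deployment_status{deployment_status="Complete",name="rayjob-c",namespace="default"} 1'

printf '%s\n' "$sample" | count_by_deployment_status
# Prints:
# Complete 1
# Running 2
```

For continuous monitoring, a Prometheus query that aggregates on the same label is usually a better fit than parsing the text output; the shell version is only meant for quick checks through the port-forward.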