KubeRay Observability

Methods 1 and 2 address control plane observability, while methods 3, 4, and 5 focus on data plane observability.

Method 1: Check the KubeRay operator’s logs for errors

# Typically, the operator's Pod name is kuberay-operator-xxxxxxxxxx-yyyyy.
kubectl logs $KUBERAY_OPERATOR_POD -n $YOUR_NAMESPACE | tee operator-log

This command prints the operator’s logs to the terminal and also saves them to a file called operator-log. Then search the file for errors.
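One way to search the saved file is a case-insensitive grep. The sketch below is only an illustration: the helper name find_log_errors and the search pattern are assumptions, not part of KubeRay, and the operator's exact log format may vary between versions, so adjust the pattern to what your logs actually contain.

```shell
# Hypothetical helper (not part of KubeRay): print every line of a saved log
# that mentions "error" or "panic" (case-insensitive), prefixed with its line number.
# Prints nothing if there are no matches.
find_log_errors() {
  grep -inE 'error|panic' "${1:-operator-log}" || true
}

# Search the file written by the command above, if it exists.
if [ -f operator-log ]; then
  find_log_errors operator-log
fi
```

The line numbers in the output make it easier to jump to the surrounding context in operator-log.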

Method 2: Check custom resource status

kubectl describe [raycluster|rayjob|rayservice] $CUSTOM_RESOURCE_NAME -n $YOUR_NAMESPACE

After running this command, check the status and events of the custom resource for any errors.
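If the events at the bottom of the describe output are hard to follow, they can also be listed on their own in chronological order. This is a minimal sketch; the helper name show_cr_events is hypothetical, and it assumes the events are recorded against an object with the same name as the custom resource.

```shell
# Hypothetical helper (not part of KubeRay): list the events recorded for one
# object, oldest first.
# Arguments: custom resource name, namespace.
show_cr_events() {
  kubectl get events -n "$2" \
    --field-selector "involvedObject.name=$1" \
    --sort-by=.lastTimestamp
}

# Example (assumes the variables used in the command above are set):
# show_cr_events "$CUSTOM_RESOURCE_NAME" "$YOUR_NAMESPACE"
```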

Method 3: Check logs of Ray Pods

Check the Ray logs directly by accessing the log files on the Pods. See Ray Logging for more details.

kubectl exec -it $RAY_POD -n $YOUR_NAMESPACE -- bash
# Check the logs under /tmp/ray/session_latest/logs/
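An interactive shell isn't strictly necessary for a quick look. The sketch below reads a single log file in one command; the helper name tail_ray_log is hypothetical, and raylet.out is only an example file name, so list the directory first to see which files exist on your Pod.

```shell
# Hypothetical helper (not part of Ray or KubeRay): print the last lines of one
# log file from a Ray Pod without opening an interactive shell.
# Arguments: Pod name, namespace, log file name, optional line count (default 50).
tail_ray_log() {
  kubectl exec "$1" -n "$2" -- tail -n "${4:-50}" "/tmp/ray/session_latest/logs/$3"
}

# Example (assumes $RAY_POD and $YOUR_NAMESPACE are set as above):
# kubectl exec "$RAY_POD" -n "$YOUR_NAMESPACE" -- ls /tmp/ray/session_latest/logs/
# tail_ray_log "$RAY_POD" "$YOUR_NAMESPACE" raylet.out 100
```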

Method 4: Check the Ray Dashboard

export HEAD_POD=$(kubectl get pods -n $YOUR_NAMESPACE --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
kubectl port-forward $HEAD_POD -n $YOUR_NAMESPACE 8265:8265
# Check $YOUR_IP:8265 in your browser to access the dashboard.
# In most cases, 127.0.0.1:8265 or localhost:8265 should work.

Method 5: Ray State CLI

You can use the Ray State CLI on the head Pod to check the status of Ray Serve applications, which run as Ray actors.

# Run the Ray State CLI on the head Pod
export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
kubectl exec -it $HEAD_POD -- ray summary actors

# [Example output]:
# ======== Actors Summary: 2023-07-11 17:58:24.625032 ========
# Stats:
# ------------------------------------
# total_actors: 14

# Table (group by class):
# ------------------------------------
#     CLASS_NAME                          STATE_COUNTS
# 0   ...                                 ALIVE: 1
# 1   ...                                 ALIVE: 1
# 2   ...                                 ALIVE: 3
# 3   ...                                 ALIVE: 1
# 4   ...                                 ALIVE: 1
# 5   ...                                 ALIVE: 1
# 6   ...                                 ALIVE: 1
# 7   ...                                 ALIVE: 1
# 8   ...                                 ALIVE: 1
# 9   ...                                 ALIVE: 1
# 10  ...                                 ALIVE: 1
# 11  ...                                 ALIVE: 1
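
Other State CLI subcommands can be run on the head Pod in the same way. The wrapper below is a small sketch; the name ray_state is hypothetical, and it assumes $HEAD_POD is set as above and that the Pod is in the current namespace.

```shell
# Hypothetical wrapper (not part of Ray or KubeRay): run any `ray` subcommand
# on the head Pod. Assumes $HEAD_POD is already set.
ray_state() {
  kubectl exec "$HEAD_POD" -- ray "$@"
}

# Examples:
# ray_state list actors --filter "state=DEAD"   # actors that are no longer alive
# ray_state summary tasks                       # task counts grouped by function name
# ray_state list nodes                          # nodes in the Ray cluster
```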