Monitoring and observability

Ray comes with the following observability features:

  1. The dashboard

  2. ray status

  3. Prometheus metrics

Please refer to the observability documentation for more on Ray’s observability features.

Monitoring the cluster via the dashboard

The dashboard provides detailed information about the state of the cluster, including the running jobs, actors, workers, nodes, etc.

By default, the cluster launcher and the Kubernetes Operator launch the dashboard but do not expose it publicly.

If you launch your application via the cluster launcher, you can securely port-forward local traffic to the dashboard with the ray dashboard command, which establishes an SSH tunnel. Once the tunnel is up, the dashboard is available at http://localhost:8265.
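To confirm that the dashboard is actually being served through the tunnel, you can request the URL from a script. The following is a minimal sketch using only the Python standard library; the helper name dashboard_is_reachable is hypothetical, and the default URL is the one given above.

```python
import urllib.error
import urllib.request


def dashboard_is_reachable(url: str = "http://localhost:8265", timeout: float = 2.0) -> bool:
    """Return True if an HTTP GET to `url` succeeds with a 2xx status."""
    # Ignore any HTTP proxy environment variables: the tunnel endpoint is local.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


print("dashboard reachable:", dashboard_is_reachable())
```

A False result usually means the tunnel is not running or is forwarding to a different local port.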

The Kubernetes Operator makes the dashboard available via a Service targeting the Ray head pod. You can access the dashboard using kubectl port-forward.
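The port-forward can also be scripted. The sketch below only assembles the kubectl invocation; the Service name example-ray-head and the default namespace are placeholders (the real name depends on how the cluster was deployed), the helper name is hypothetical, and port 8265 is assumed on both sides to match the dashboard URL used with the cluster launcher.

```python
import shlex


def port_forward_command(service: str, namespace: str = "default",
                         local_port: int = 8265, remote_port: int = 8265) -> list[str]:
    """Build a `kubectl port-forward` invocation that targets a Service."""
    return [
        "kubectl", "port-forward",
        "--namespace", namespace,
        f"service/{service}",
        f"{local_port}:{remote_port}",
    ]


# "example-ray-head" is a placeholder; look up the real name with `kubectl get services`.
command = port_forward_command("example-ray-head")
print(shlex.join(command))
```

Passing the resulting list to subprocess.run would start the forward in the foreground, where it keeps running until interrupted, just like the command typed by hand.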

Observing the autoscaler

The autoscaler makes its decisions based on scheduling information and programmatic information from the cluster. This information, along with the status of starting nodes, can be accessed via the ray status command.

To dump the current state of a cluster launched via the cluster launcher, you can run ray exec cluster.yaml "ray status".

For a more “live” monitoring experience, run ray status in a watch loop: ray exec cluster.yaml "watch -n 1 ray status".

With the Kubernetes Operator, replace ray exec cluster.yaml with kubectl exec <head node pod>.
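If you want to collect autoscaler snapshots from a script instead of watching a terminal, the commands above can be wrapped in a small polling loop. This is a sketch under stated assumptions: the helper names are hypothetical, the pod name is whatever your head pod is called, and the kubectl form uses the standard "--" separator between the pod name and the command to run.

```python
import subprocess
import time


def launcher_status_command(cluster_config: str) -> list[str]:
    """`ray exec <config> "ray status"` for clusters started by the cluster launcher."""
    return ["ray", "exec", cluster_config, "ray status"]


def operator_status_command(head_pod: str) -> list[str]:
    """`kubectl exec <pod> -- ray status` for clusters managed by the Kubernetes Operator."""
    return ["kubectl", "exec", head_pod, "--", "ray", "status"]


def poll_status(command, iterations=3, interval_s=5.0, runner=subprocess.run):
    """Run `command` `iterations` times and return the captured stdout of each run.

    `runner` defaults to subprocess.run and can be swapped out for testing.
    """
    outputs = []
    for i in range(iterations):
        result = runner(command, capture_output=True, text=True, check=False)
        outputs.append(result.stdout)
        if i < iterations - 1:
            # Wait between snapshots, but not after the last one.
            time.sleep(interval_s)
    return outputs
```

For example, poll_status(launcher_status_command("cluster.yaml")) takes three snapshots five seconds apart. Unlike watch, it returns the text of each snapshot, so successive outputs can be logged or compared.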

Prometheus metrics

Ray can export Prometheus metrics. When metrics export is enabled, Ray produces a default set of metrics about Ray core, along with some internal metrics. It also supports custom, user-defined metrics.

These metrics can be consumed by any metrics infrastructure that can ingest metrics from the Prometheus server on the head node of the cluster.
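Prometheus metrics are exchanged in a line-oriented text exposition format, so a scraped payload is easy to inspect by hand. The following is a minimal sketch of a parser for simple samples. The metric names in SAMPLE are made up for illustration and are not actual Ray metrics, and the parser deliberately ignores timestamps and escaped characters in label values.

```python
import re

# metric_name{label="value",...} value
_SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
)
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text: str) -> list[tuple[str, dict[str, str], float]]:
    """Parse exposition-format text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue  # skip anything this simplified parser does not understand
        labels = dict(_LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# Hypothetical payload, for illustration only.
SAMPLE = """\
# HELP example_tasks_total Hypothetical counter.
# TYPE example_tasks_total counter
example_tasks_total{node="head",state="finished"} 42
example_queue_depth 3.5
"""

for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

In practice a Prometheus server or an existing client library handles this parsing; the sketch is only meant to show what the exported data looks like.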

Learn more about setting up Prometheus here.