Exporting Metrics

To help you monitor Ray applications, Ray:

  • Collects a set of default system-level metrics.

  • Exposes metrics in Prometheus format. This page calls the endpoint that serves these metrics a Prometheus endpoint.

  • Supports custom metrics APIs that resemble Prometheus metric types.

This page describes how to access these metrics using Prometheus.

Note

This feature is currently experimental and under active development. APIs are subject to change.

Getting Started (Single Node)

First, install Ray with the proper dependencies:

pip install "ray[default]"

Ray exposes its metrics in Prometheus format. This allows us to easily scrape them using Prometheus.

Expose the metrics by passing a metrics export port to ray start:

ray start --head --metrics-export-port=8080 # Assign metrics export port on a head node.
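Before setting up Prometheus, the export endpoint can be read directly as a quick sanity check. The sketch below is a minimal illustration using only the Python standard library, assuming the port 8080 passed to ray start above. The helper names fetch_metrics_text and metric_families are hypothetical, and the sample text is hand-written so that the block runs without a Ray node.

```python
import urllib.request


def fetch_metrics_text(url="http://localhost:8080", timeout=5.0):
    """Download the raw Prometheus-format text from a metrics endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def metric_families(text):
    """Return {metric_name: metric_type} collected from '# TYPE' lines."""
    families = {}
    for line in text.splitlines():
        parts = line.split()
        # A type line looks like: "# TYPE <name> <type>"
        if len(parts) == 4 and parts[0] == "#" and parts[1] == "TYPE":
            families[parts[2]] = parts[3]
    return families


# Offline demonstration on a small hand-written sample (not real Ray output).
SAMPLE = """\
# HELP example_requests_total Illustrative counter.
# TYPE example_requests_total counter
example_requests_total 3.0
# TYPE example_queue_depth gauge
example_queue_depth 7.0
"""

families = metric_families(SAMPLE)
for name, kind in sorted(families.items()):
    print(f"{name}: {kind}")
```

Against a running head node, metric_families(fetch_metrics_text()) would list whichever metric families Ray is exporting at that moment.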

Now, you can scrape Ray’s metrics using Prometheus.

First, download Prometheus from the official Prometheus download page, then unpack the archive:

tar xvfz prometheus-*.tar.gz
cd prometheus-*

With the ray[default] installation, Ray provides a Prometheus config that works out of the box. After Ray starts, the config can be found at /tmp/ray/session_latest/metrics/prometheus/prometheus.yml.

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
# Scrape from each ray node as defined in the service_discovery.json provided by ray.
- job_name: 'ray'
  file_sd_configs:
  - files:
    - '/tmp/ray/prom_metrics_service_discovery.json'
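The file_sd_configs entry above points Prometheus at a JSON file listing the targets to scrape. In the Prometheus file-based service discovery format, that file is a list of target groups, each with a "targets" list and optional "labels". The sketch below shows how such a file could be inspected from Python; the helper name load_targets and the example addresses are illustrative assumptions, not values Ray is guaranteed to write.

```python
import json
import os
import tempfile


def load_targets(path):
    """Return every 'host:port' target listed in a file_sd JSON file."""
    with open(path, encoding="utf-8") as f:
        groups = json.load(f)
    targets = []
    for group in groups:
        targets.extend(group.get("targets", []))
    return targets


# Hand-written example in the Prometheus file_sd layout; the addresses are
# placeholders, not values Ray is guaranteed to write.
example_groups = [
    {"labels": {"job": "ray"}, "targets": ["127.0.0.1:8080", "127.0.0.1:8081"]},
]

with tempfile.TemporaryDirectory() as tmp:
    sd_path = os.path.join(tmp, "service_discovery.json")
    with open(sd_path, "w", encoding="utf-8") as f:
        json.dump(example_groups, f)
    targets = load_targets(sd_path)

print(targets)
```

On a machine where Ray is running, calling load_targets on the path from the config above would show which endpoints Prometheus is being told to scrape.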

Next, let’s start Prometheus.

./prometheus --config.file=/tmp/ray/session_latest/metrics/prometheus/prometheus.yml

Now, you can access Ray metrics from the default Prometheus URL, http://localhost:9090.
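Besides the web UI, Prometheus serves an HTTP API on the same port. The sketch below builds an instant-query URL for /api/v1/query and parses a response of the documented shape. It assumes the default address above and the job name 'ray' from the config; the helper names and the sample response are hand-written for illustration, so the block runs without a Prometheus server.

```python
import json
import urllib.parse


def build_query_url(expr, base_url="http://localhost:9090"):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def parse_instant_vector(payload):
    """Turn an instant-vector response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload}")
    results = payload["data"]["result"]
    # Each sample's "value" is [unix_timestamp, "value as a string"].
    return [(item["metric"], float(item["value"][1])) for item in results]


# 'up' is the synthetic metric Prometheus records for every scrape target.
url = build_query_url('up{job="ray"}')
print(url)

# Hand-written response in the documented shape, so the sketch runs offline.
sample_response = json.loads("""
{"status": "success",
 "data": {"resultType": "vector",
          "result": [{"metric": {"__name__": "up", "job": "ray",
                                 "instance": "127.0.0.1:8080"},
                      "value": [1700000000.0, "1"]}]}}
""")
samples = parse_instant_vector(sample_response)
print(samples)
```

Fetching the printed URL with any HTTP client and passing the decoded JSON to parse_instant_vector would report whether each Ray target is currently being scraped successfully (1.0) or not (0.0).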

For more information on setting up Prometheus on a Ray Cluster, see the Ray cluster documentation.

Grafana

Grafana is a tool that supports more advanced visualizations of Prometheus metrics and lets you create custom dashboards with the metrics you care about. Ray exports some default configurations, which include a default dashboard showing some of the most valuable metrics for debugging Ray applications.

First, download Grafana from the official Grafana download page.

Then run Grafana using the built-in configuration found in the /tmp/ray/session_latest/metrics/grafana folder.

./bin/grafana-server --config /tmp/ray/session_latest/metrics/grafana/grafana.ini web

Now, you can access Grafana at its default URL, http://localhost:3000. If this is your first time, you can log in with the username admin and the password admin.

You can then see the default dashboard by going to dashboards -> manage -> Ray -> Default Dashboard.

Figure: the default Grafana dashboard (https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard/default_grafana_dashboard.png)

Application-level Metrics

Ray provides a convenient API in ray.util.metrics for defining and exporting custom metrics that give visibility into your applications. Three metric types are currently supported: Counter, Gauge, and Histogram. They correspond to the Prometheus metric types of the same names. Below is a simple example of an actor that exports metrics using these APIs:

import time

import ray
from ray.util.metrics import Counter, Gauge, Histogram

ray.init(_metrics_export_port=8080)


@ray.remote
class MyActor:
    def __init__(self, name):
        self._curr_count = 0

        self.counter = Counter(
            "num_requests",
            description="Number of requests processed by the actor.",
            tag_keys=("actor_name",),
        )
        self.counter.set_default_tags({"actor_name": name})

        self.gauge = Gauge(
            "curr_count",
            description="Current count held by the actor. Goes up and down.",
            tag_keys=("actor_name",),
        )
        self.gauge.set_default_tags({"actor_name": name})

        self.histogram = Histogram(
            "request_latency",
            description="Latencies of requests in ms.",
            boundaries=[0.1, 1],
            tag_keys=("actor_name",),
        )
        self.histogram.set_default_tags({"actor_name": name})

    def process_request(self, num):
        start = time.time()
        self._curr_count += num

        # Increment the total request count.
        self.counter.inc()
        # Update the gauge to the new value.
        self.gauge.set(self._curr_count)
        # Record the latency for this request in ms.
        self.histogram.observe(1000 * (time.time() - start))

        return self._curr_count


print("Starting actor.")
my_actor = MyActor.remote("my_actor")
print("Calling actor.")
my_actor.process_request.remote(-10)
print("Calling actor.")
my_actor.process_request.remote(5)
print("Metrics should be exported.")
print("See http://localhost:8080 (this may take a few seconds to load).")

# Sleep so we can look at the metrics before exiting.
time.sleep(30)
print("Exiting!")

While the script is running, the metrics are exported to localhost:8080 (the endpoint that Prometheus would be configured to scrape). If you open this address in a browser, you should see output similar to the following:

# HELP ray_request_latency Latencies of requests in ms.
# TYPE ray_request_latency histogram
ray_request_latency_bucket{Component="core_worker",Version="3.0.0.dev0",actor_name="my_actor",le="0.1"} 2.0
ray_request_latency_bucket{Component="core_worker",Version="3.0.0.dev0",actor_name="my_actor",le="1.0"} 2.0
ray_request_latency_bucket{Component="core_worker",Version="3.0.0.dev0",actor_name="my_actor",le="+Inf"} 2.0
ray_request_latency_count{Component="core_worker",Version="3.0.0.dev0",actor_name="my_actor"} 2.0
ray_request_latency_sum{Component="core_worker",Version="3.0.0.dev0",actor_name="my_actor"} 0.11992454528808594
# HELP ray_curr_count Current count held by the actor. Goes up and down.
# TYPE ray_curr_count gauge
ray_curr_count{Component="core_worker",Version="3.0.0.dev0",actor_name="my_actor"} -5.0
# HELP ray_num_requests_total Number of requests processed by the actor.
# TYPE ray_num_requests_total counter
ray_num_requests_total{Component="core_worker",Version="3.0.0.dev0",actor_name="my_actor"} 2.0
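The three ray_request_latency_bucket lines all show 2.0 because Prometheus histogram buckets are cumulative: each le bucket counts every observation less than or equal to its boundary. The sketch below reproduces that bookkeeping in plain Python for the boundaries [0.1, 1] used in the actor. The function name cumulative_buckets and the latency values are hypothetical, chosen only for illustration; this is not Ray's implementation.

```python
import bisect


def cumulative_buckets(observations, boundaries):
    """Count observations per cumulative 'le' bucket, Prometheus-style."""
    boundaries = sorted(boundaries)
    per_bucket = [0] * (len(boundaries) + 1)  # last slot is the +Inf bucket
    for value in observations:
        # bisect_left returns the index of the first boundary >= value.
        per_bucket[bisect.bisect_left(boundaries, value)] += 1
    labels = [str(float(b)) for b in boundaries] + ["+Inf"]
    result, running = {}, 0
    for label, count in zip(labels, per_bucket):
        running += count  # each bucket also includes all smaller buckets
        result[label] = running
    return result


# Two fast requests (hypothetical latencies in ms), both under 0.1 ms.
fast = cumulative_buckets([0.05, 0.07], boundaries=[0.1, 1])
print(fast)

# A slower mix spreads across the buckets.
mixed = cumulative_buckets([0.05, 0.5, 3.0], boundaries=[0.1, 1])
print(mixed)
```

The first call mirrors the output above, where both requests finished in under 0.1 ms, so every bucket reports 2. The +Inf bucket always equals the total observation count.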

See the ray.util.metrics API reference for more details.
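To check exported values from a script rather than a browser, the text format can be parsed with a few lines of Python. The sketch below is a simplified parser, not a complete implementation of the Prometheus exposition format: it skips comment lines and any line carrying a timestamp, and it does not unescape label values. The function name parse_samples is hypothetical, and the sample lines are copied from the output shown earlier.

```python
import re

# name{label="value",...} number   (the label section is optional)
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Parse sample lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # unsupported line shape, for example one with a timestamp
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


TEXT = """\
# TYPE ray_num_requests_total counter
ray_num_requests_total{Component="core_worker",Version="3.0.0.dev0",actor_name="my_actor"} 2.0
ray_request_latency_bucket{Component="core_worker",Version="3.0.0.dev0",actor_name="my_actor",le="+Inf"} 2.0
"""

samples = parse_samples(TEXT)
for name, labels, value in samples:
    print(name, labels.get("actor_name"), value)
```

For anything beyond a quick check, a maintained parser or a query against Prometheus itself is likely the more robust choice.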