Ray Train Metrics
Ray Train exports Prometheus metrics, including the Ray Train controller state, worker group start times, checkpointing times, and more. You can use these metrics to monitor Ray Train runs. The Ray dashboard displays these metrics in the Ray Train Grafana Dashboard. See the Ray Dashboard documentation for more information.
The Ray Train dashboard also displays a subset of Ray Core metrics that are useful for monitoring training but are not listed in the table below. For more information about these metrics, see the System Metrics documentation.
The following table lists the Prometheus metrics emitted by Ray Train:
| Prometheus Metric | Labels | Description |
|---|---|---|
| | | Current state of the Ray Train controller. |
| | | Total time taken to start the worker group. |
| | | Total time taken to shut down the worker group. |
| | | Cumulative time in seconds to report a checkpoint to storage. |
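
Because these metrics are exported in the Prometheus format, they can also be inspected outside of Grafana. The following is a minimal sketch, not part of the Ray Train API, that parses Prometheus text exposition output and keeps only samples whose names start with a given prefix. The metric names, label names, and the `example_train_` prefix in `SAMPLE` are hypothetical placeholders for illustration. Substitute the names from the table above and text scraped from your own metrics endpoint.

```python
import re

# Hypothetical sample of Prometheus text exposition output. The metric and
# label names are placeholders for illustration, NOT the real Ray Train names.
SAMPLE = """\
# HELP example_train_controller_state Current state of the controller.
# TYPE example_train_controller_state gauge
example_train_controller_state{run_name="demo",state="RUNNING"} 1.0
example_train_worker_group_start_seconds{run_name="demo"} 12.5
example_other_metric 3
"""

# One sample line: metric name, optional {labels}, then the value.
_LINE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One label pair inside the braces: key="value" (escaped characters allowed).
_LABEL = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text, prefix=""):
    """Return (name, labels, value) tuples for samples whose name starts with prefix."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and HELP/TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        match = _LINE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        if not name.startswith(prefix):
            continue
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


if __name__ == "__main__":
    for name, labels, value in parse_samples(SAMPLE, prefix="example_train_"):
        print(name, labels, value)
```

Filtering by prefix keeps the sketch independent of any particular metric name, so the same function works for the Ray Core system metrics that the Ray Train dashboard also displays.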