Usage Stats Collection

Starting in Ray 1.13, Ray collects usage stats data by default (guarded by an opt-out prompt). This data will be used by the open-source Ray engineering team to better understand how to improve our libraries and core APIs, and how to prioritize bug fixes and enhancements.

Here are the guiding principles of our collection policy:

  • No surprises: You will be notified before we begin collecting data, and of any changes to the data being collected or how it is used.

  • Easy opt-out: You will be able to easily opt out of data collection.

  • Transparency: You will be able to review all data that is sent to us.

  • Control: You will have control over your data, and we will honor requests to delete your data.

  • We will not collect any personally identifiable data or proprietary code/data

  • We will not sell data or buy data about you.

You will always be able to disable the usage stats collection.

For more context, please refer to this RFC.

What data is collected?

We collect non-sensitive data that helps us understand how Ray is used (e.g., which Ray libraries are used). Personally identifiable data will never be collected. Please check the UsageStatsToReport class to see the data we collect.

How to disable it

There are multiple ways to disable usage stats collection before starting a cluster:

  1. Add the --disable-usage-stats option to the command that starts the Ray cluster (e.g., ray start --head --disable-usage-stats).

  2. Run ray disable-usage-stats to disable collection for all future clusters. This won’t affect currently running clusters. Under the hood, this command writes {"usage_stats": false} to the global config file ~/.ray/config.json.

  3. Set the environment variable RAY_USAGE_STATS_ENABLED to 0 (e.g., RAY_USAGE_STATS_ENABLED=0 ray start --head).

  4. If you’re using KubeRay, you can add disable-usage-stats: 'true' to .spec.[headGroupSpec|workerGroupSpecs].rayStartParams.

Currently there is no way to enable or disable collection for a running cluster; you have to stop and restart the cluster.
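
As an illustration of the first and third options above, here is a minimal Python sketch that assembles the command and environment for starting a head node with collection disabled. The helper name build_ray_start and its parameters are hypothetical and not part of Ray; only the --disable-usage-stats flag and the RAY_USAGE_STATS_ENABLED variable come from the list above. Either one alone is enough; the sketch sets both for illustration.

```python
import os
from typing import Dict, List, Mapping, Optional, Tuple


def build_ray_start(
    disable_usage_stats: bool = True,
    base_env: Optional[Mapping[str, str]] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Return (command, env) for starting a Ray head node.

    Hypothetical helper for illustration; it does not start anything itself.
    """
    # Copy the environment so the caller's mapping is never modified.
    env = dict(os.environ if base_env is None else base_env)
    command = ["ray", "start", "--head"]
    if disable_usage_stats:
        # Option 1: command-line flag.
        command.append("--disable-usage-stats")
        # Option 3: environment variable set to 0.
        env["RAY_USAGE_STATS_ENABLED"] = "0"
    return command, env
```

Passing the result to subprocess.run(command, env=env, check=True) would then start the head node with usage stats collection disabled.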

How does it work?

When a Ray cluster is started via ray start --head, ray up, ray submit --start or ray exec --start, Ray will decide whether usage stats collection should be enabled or not by considering the following factors in order:

  1. It checks whether the environment variable RAY_USAGE_STATS_ENABLED is set: 1 means enabled and 0 means disabled.

  2. If the environment variable is not set, it reads the value of key usage_stats in the global config file ~/.ray/config.json: true means enabled and false means disabled.

  3. If neither is set and the console is interactive, then the user will be prompted to enable or disable the collection. If the console is non-interactive, usage stats collection will be enabled by default. The decision will be saved to ~/.ray/config.json, so the prompt is only shown once.
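
The precedence described in the three steps above can be sketched as a small Python function. This is an illustrative model of the documented order, not Ray's actual implementation: the function name resolve_usage_stats, its arguments, and its string return values are hypothetical. How Ray treats environment values other than 0 and 1 is not stated here, so the sketch assumes they are treated as unset.

```python
from typing import Mapping


def resolve_usage_stats(
    env: Mapping[str, str],
    config: Mapping[str, object],
    interactive: bool,
) -> str:
    """Return "enabled", "disabled", or "prompt" following the documented order."""
    # Step 1: the environment variable takes priority.
    value = env.get("RAY_USAGE_STATS_ENABLED")
    if value == "1":
        return "enabled"
    if value == "0":
        return "disabled"
    # Assumption: any other value falls through as if the variable were unset.

    # Step 2: the usage_stats key of the global config file
    # (the parsed contents of ~/.ray/config.json are passed in as `config`).
    if config.get("usage_stats") is True:
        return "enabled"
    if config.get("usage_stats") is False:
        return "disabled"

    # Step 3: prompt on an interactive console, otherwise enable by default.
    return "prompt" if interactive else "enabled"
```

For example, resolve_usage_stats({"RAY_USAGE_STATS_ENABLED": "0"}, {"usage_stats": True}, True) returns "disabled", because the environment variable is checked before the config file.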

Note: usage stats collection is not enabled when using local dev clusters started via ray.init() unless it’s a nightly wheel. This means that Ray will never collect data from third-party library users not using Ray directly.

If usage stats collection is enabled, a background process on the head node will collect the usage stats and report to https://usage-stats.ray.io/ every hour. The reported usage stats will also be saved to /tmp/ray/session_xxx/usage_stats.json on the head node for inspection. You can check the existence of this file to see if collection is enabled.
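
The file check mentioned above could be scripted roughly as follows. This is a sketch under assumptions: session_xxx stands for a session directory whose exact name is not given here, so the code matches any directory starting with session_ under /tmp/ray, and the function name find_usage_stats_files is hypothetical.

```python
import glob
import os
from typing import List


def find_usage_stats_files(ray_tmp_dir: str = "/tmp/ray") -> List[str]:
    """Return the usage_stats.json files found under session directories."""
    # Assumption: session directories are named session_<something>.
    pattern = os.path.join(ray_tmp_dir, "session_*", "usage_stats.json")
    # Resolve symlinks so the same file is not listed twice.
    return sorted({os.path.realpath(path) for path in glob.glob(pattern)})
```

An empty list on the head node suggests that no report has been saved, either because collection is disabled or because the first report has not been written yet.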

Usage stats collection is lightweight and should not affect your workload.

Requesting removal of collected data

To request removal of collected data, please email us at usage_stats@ray.io with the session_id that you can find in /tmp/ray/session_xxx/usage_stats.json.
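
To look up the session_id programmatically, a sketch along the following lines may help. The exact layout of usage_stats.json is not described here, so the code makes no assumption about where the session_id key sits and searches the parsed JSON recursively. The function names find_session_id and read_session_id are hypothetical.

```python
import json
from typing import Any, Optional


def find_session_id(node: Any) -> Optional[str]:
    """Return the first string value stored under a "session_id" key, if any."""
    if isinstance(node, dict):
        value = node.get("session_id")
        if isinstance(value, str):
            return value
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    # Search nested objects and arrays depth-first.
    for child in children:
        found = find_session_id(child)
        if found is not None:
            return found
    return None


def read_session_id(path: str) -> Optional[str]:
    """Load a usage_stats.json file and return its session_id, or None."""
    with open(path, encoding="utf-8") as f:
        return find_session_id(json.load(f))
```

The returned value is what you would include in the email to usage_stats@ray.io.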

Frequently Asked Questions (FAQ)

Does the session_id map to personal data?

No. The session_id is a random UUID specific to a Ray session/job that cannot be used to identify a specific person or machine. It does not live beyond the lifetime of your Ray session, and it is logged primarily so that we can honor deletion requests.

Could an enterprise easily configure an additional endpoint or substitute a different endpoint?

We see this use case and would love to work with you to make it possible. Please email usage_stats@ray.io.

Contact us

If you have any feedback regarding usage stats collection, please email us at usage_stats@ray.io.