Log Persistence#

This page provides tips on how to collect logs from Ray clusters running on Kubernetes.

Tip

Skip to the deployment instructions for a sample configuration showing how to extract logs from a Ray pod.

The Ray log directory#

Each Ray pod runs several component processes, such as the Raylet, object manager, dashboard agent, etc. These components log to files in the directory /tmp/ray/session_latest/logs in the pod’s file system. Extracting and persisting these logs requires some setup.
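
To see which files these components write, you can list the log directory inside a running Ray pod. Here is a minimal check, assuming a hypothetical pod name of the kind used later in this guide:

# Substitute the name of your Ray pod.
kubectl exec raycluster-complete-logs-head-xxxxx -- ls /tmp/ray/session_latest/logs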

Log processing tools#

There are a number of log processing tools available within the Kubernetes ecosystem. This page shows how to extract Ray logs using Fluent Bit. Other popular tools include Fluentd, Filebeat, and Promtail.

Log collection strategies#

We describe two strategies for collecting logs written to a pod’s filesystem: sidecar containers and daemonsets. You can read more about these logging patterns in the Kubernetes documentation.

Sidecar containers#

We will provide an example of the sidecar strategy in this guide. You can process logs by configuring a log-processing sidecar for each Ray pod. Ray containers should be configured to share the /tmp/ray directory with the logging sidecar via a volume mount.

You can configure the sidecar to do either of the following:

  • Stream Ray logs to the sidecar’s stdout.

  • Export logs to an external service.

Daemonset#

Alternatively, it is possible to collect logs at the Kubernetes node level. To do this, one deploys a log-processing daemonset onto the Kubernetes cluster’s nodes. With this strategy, it is key to mount the Ray container’s /tmp/ray directory to the relevant hostPath.
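
The following sketch shows what this could look like in a Ray pod template; the host directory /var/log/ray is a hypothetical choice, and your daemonset’s log processor must be configured to tail the same path:

        volumes:
        - name: ray-logs
          hostPath:
            # Hypothetical host directory; the node-level log processor must read from it.
            path: /var/log/ray
            type: DirectoryOrCreate

The Ray container then mounts this volume at /tmp/ray, exactly as in the sidecar examples below, so that Ray’s logs land on the node’s filesystem.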

Setting up logging sidecars with Fluent Bit#

In this section, we give an example of how to set up log-emitting Fluent Bit sidecars for Ray pods.

See the full config for a single-pod RayCluster with a logging sidecar here. We now discuss this configuration and show how to deploy it.

Configuring log processing#

The first step is to create a ConfigMap with configuration for Fluent Bit.

Here is a minimal ConfigMap which tells a Fluent Bit sidecar to

  • Tail Ray logs.

  • Output the logs to the container’s stdout.

apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentbit-config
data:
  fluent-bit.conf: |
    [INPUT]
        Name tail
        Path /tmp/ray/session_latest/logs/*
        Tag ray
        Path_Key true
        Refresh_Interval 5
    [OUTPUT]
        Name stdout
        Match *

A few notes on the above config:

  • In addition to streaming logs to stdout, you can use an [OUTPUT] clause to export logs to any storage backend supported by Fluent Bit. See the sketch after this list for one example.

  • The Path_Key true line above ensures that file names are included in the log records emitted by Fluent Bit.

  • The Refresh_Interval 5 line asks Fluent Bit to refresh the list of files in the log directory every 5 seconds, rather than the default 60. The reason is that the directory /tmp/ray/session_latest/logs/ does not exist initially (Ray must create it first). Setting Refresh_Interval low allows us to see logs in the Fluent Bit container’s stdout sooner.
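
As an example of exporting to an external backend, here is a hedged sketch of an Elasticsearch [OUTPUT] clause; the host, port, and index name are assumptions, not part of the configuration above:

[OUTPUT]
    Name es
    Match ray
    # Hypothetical Elasticsearch endpoint; replace with your own.
    Host elasticsearch.example.com
    Port 9200
    Index ray-logs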

Adding logging sidecars to your RayCluster CR#

Adding log and config volumes#

For each pod template in our RayCluster CR, we need to add two volumes: one volume for Ray’s logs and another to store Fluent Bit configuration from the ConfigMap applied above.

        volumes:
        - name: ray-logs
          emptyDir: {}
        - name: fluentbit-config
          configMap:
            name: fluentbit-config

Mounting the Ray log directory#

Add the following volume mount to the Ray container’s configuration.

          volumeMounts:
          - mountPath: /tmp/ray
            name: ray-logs

Adding the Fluent Bit sidecar#

Finally, add the Fluent Bit sidecar container to each Ray pod config in your RayCluster CR.

        - name: fluentbit
          image: fluent/fluent-bit:1.9.6
          # These resource requests for Fluent Bit should be sufficient in production.
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 100m
              memory: 128Mi
          volumeMounts:
          - mountPath: /tmp/ray
            name: ray-logs
          - mountPath: /fluent-bit/etc/fluent-bit.conf
            subPath: fluent-bit.conf
            name: fluentbit-config

Mounting the ray-logs volume gives the sidecar container access to Ray’s logs. The fluentbit-config volume gives the sidecar access to logging configuration.

Putting everything together#

Putting all of the above elements together, we have the following YAML configuration for a single-pod RayCluster with a log-processing sidecar.

# Fluent Bit ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentbit-config
data:
  fluent-bit.conf: |
    [INPUT]
        Name tail
        Path /tmp/ray/session_latest/logs/*
        Tag ray
        Path_Key true
        Refresh_Interval 5
    [OUTPUT]
        Name stdout
        Match *
---
# RayCluster CR with a FluentBit sidecar
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  labels:
    controller-tools.k8s.io: "1.0"
  name: raycluster-complete-logs
spec:
  rayVersion: '2.3.0'
  headGroupSpec:
    rayStartParams:
      dashboard-host: '0.0.0.0'
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.3.0
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          # This config is meant for demonstration purposes only.
          # Use larger Ray containers in production!
          resources:
            limits:
              cpu: "1"
              memory: "1G"
            requests:
              cpu: "1"
              memory: "1G"
          # Share logs with Fluent Bit
          volumeMounts:
          - mountPath: /tmp/ray
            name: ray-logs
        # Fluent Bit sidecar
        - name: fluentbit
          image: fluent/fluent-bit:1.9.6
          # These resource requests for Fluent Bit should be sufficient in production.
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 100m
              memory: 128Mi
          volumeMounts:
          - mountPath: /tmp/ray
            name: ray-logs
          - mountPath: /fluent-bit/etc/fluent-bit.conf
            subPath: fluent-bit.conf
            name: fluentbit-config
        # Log and config volumes
        volumes:
        - name: ray-logs
          emptyDir: {}
        - name: fluentbit-config
          configMap:
            name: fluentbit-config

Deploying a RayCluster with a logging sidecar#

Now, we will see how to deploy the configuration described above.

Deploy the KubeRay Operator if you haven’t yet. Refer to the Getting Started guide for instructions on this step.
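
If you install the operator with Helm, the commands look roughly like the following; the chart repository URL and release name below follow the KubeRay Helm instructions, but confirm them against the Getting Started guide for your version:

helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm install kuberay-operator kuberay/kuberay-operator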

Now, run the following commands to deploy the Fluent Bit ConfigMap and a single-pod RayCluster with a Fluent Bit sidecar.

kubectl apply -f https://raw.githubusercontent.com/ray-project/ray/releases/2.4.0/doc/source/cluster/kubernetes/configs/ray-cluster.log.yaml

Determine the Ray pod’s name with

kubectl get pod | grep raycluster-complete-logs

Examine the Fluent Bit sidecar’s stdout to see logs for Ray’s component processes.

# Substitute the name of your Ray pod.
kubectl logs raycluster-complete-logs-head-xxxxx -c fluentbit

Using structured logging#

The metadata of tasks or actors may be obtained by Ray’s runtime context APIs. Runtime context APIs help you add metadata to your logging messages, making your logs more structured.

import ray

# Initiate a driver.
ray.init()

@ray.remote
def task():
    print(f"task_id: {ray.get_runtime_context().task_id}")

ray.get(task.remote())

(pid=47411) task_id: TaskID(a67dc375e60ddd1affffffffffffffffffffffff01000000)
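
The same runtime context APIs work inside actors; for example, here is a sketch that tags an actor’s log messages with its actor ID (the class and method names are illustrative):

import ray
ray.init()

@ray.remote
class Worker:
    def work(self):
        # Tag the log message with this actor's ID.
        print(f"actor_id: {ray.get_runtime_context().actor_id}")

worker = Worker.remote()
ray.get(worker.work.remote())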

Redirecting Ray logs to stderr#

By default, Ray logs are written to files under the /tmp/ray/session_*/logs directory. If you wish to redirect all internal Ray logging and your own logging within tasks or actors to the stderr of the host nodes, set the RAY_LOG_TO_STDERR=1 environment variable on the driver and on all Ray nodes. This practice is not recommended, but it may be useful if your log aggregator requires log records to be written to stderr in order to capture them.

Redirecting logging to stderr will also cause a ({component}) prefix, e.g. (raylet), to be added to each of the log record messages.

[2022-01-24 19:42:02,978 I 1829336 1829336] (gcs_server) grpc_server.cc:103: GcsServer server started, listening on port 50009.
[2022-01-24 19:42:06,696 I 1829415 1829415] (raylet) grpc_server.cc:103: ObjectManager server started, listening on port 40545.
2022-01-24 19:42:05,087 INFO (dashboard) dashboard.py:95 -- Setup static dir for dashboard: /mnt/data/workspace/ray/python/ray/dashboard/client/build
2022-01-24 19:42:07,500 INFO (dashboard_agent) agent.py:105 -- Dashboard agent grpc address: 0.0.0.0:49228

This should make it easier to filter the stderr stream of logs down to the component of interest. Note that multi-line log records will not have this component marker at the beginning of each line.
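
For example, on the Kubernetes setup described above, you could filter for raylet records like this (the pod name is hypothetical):

# Substitute the name of your Ray pod.
kubectl logs raycluster-complete-logs-head-xxxxx | grep '(raylet)'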

When running a local Ray cluster, this environment variable should be set before starting the local cluster:

import os
import ray

os.environ["RAY_LOG_TO_STDERR"] = "1"
ray.init()

When starting a local cluster via the CLI or when starting nodes in a multi-node Ray cluster, this environment variable should be set before starting up each node:

env RAY_LOG_TO_STDERR=1 ray start

If using the Ray cluster launcher, you would specify this environment variable in the Ray start commands:

head_start_ray_commands:
    - ray stop
    - env RAY_LOG_TO_STDERR=1 ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - env RAY_LOG_TO_STDERR=1 ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

When connecting to the cluster, be sure to set the environment variable before connecting:

import os
import ray

os.environ["RAY_LOG_TO_STDERR"] = "1"
ray.init(address="auto")

Rotating logs#

Ray supports log rotation of log files. Note that not all components currently support log rotation (raylet and Python/Java worker logs are not rotated).

By default, logs rotate when they reach 512 MB (maxBytes), and there can be up to 5 backup files (backupCount). Indexes are appended to all backup files (e.g., raylet.out.1). If you’d like to change the log rotation configuration, you can do so by specifying environment variables. For example,

RAY_ROTATION_MAX_BYTES=1024 ray start --head # Start a ray instance with maxBytes 1KB.
RAY_ROTATION_BACKUP_COUNT=1 ray start --head # Start a ray instance with backupCount 1.
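
You can also export both variables so they apply to every subsequent command in the shell session; a sketch with illustrative values:

# Rotate at 100 MB and keep 3 backup files per log.
export RAY_ROTATION_MAX_BYTES=104857600
export RAY_ROTATION_BACKUP_COUNT=3
ray start --head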