RayCluster Configuration¶

This guide covers the key aspects of Ray cluster configuration on Kubernetes.

Introduction¶

Deployments of Ray on Kubernetes follow the operator pattern. The key players are

  • A custom resource called a RayCluster describing the desired state of a Ray cluster.

  • A custom controller, the KubeRay operator, which manages Ray pods in order to match the RayCluster’s spec.

To deploy a Ray cluster, one creates a RayCluster custom resource (CR):

kubectl apply -f raycluster.yaml

This guide covers the salient features of RayCluster CR configuration.

For reference, here is a condensed example of a RayCluster CR in yaml format.

apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: raycluster-complete
spec:
  rayVersion: "2.0.0"
  enableInTreeAutoscaling: true
  autoscalerOptions:
     ...
  headGroupSpec:
    serviceType: ClusterIP # Options are ClusterIP, NodePort, and LoadBalancer
    enableIngress: false # Optional
    rayStartParams:
      block: true
      dashboard-host: "0.0.0.0"
      ...
    template: # Pod template
        metadata: # Pod metadata
        spec: # Pod spec
            containers:
            - name: ray-head
              image: rayproject/ray-ml:2.0.0
              resources:
                limits:
                  cpu: 14
                  memory: 54Gi
                requests:
                  cpu: 14
                  memory: 54Gi
              # Keep this preStop hook in each Ray container config.
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh","-c","ray stop"]
              ports: # Optional service port overrides
              - containerPort: 6379
                name: gcs
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
              - containerPort: 8000
                name: serve
                ...
  workerGroupSpecs:
  - groupName: small-group
    replicas: 1
    minReplicas: 1
    maxReplicas: 5
    rayStartParams:
        ...
    template: # Pod template
      spec:
        # Keep this initContainer in each workerGroup template.
        initContainers:
        - name: init-myservice
          image: busybox:1.28
          command: ['sh', '-c', "until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done"]
        ...
  # Another workerGroup
  - groupName: medium-group
    ...
  # Yet another workerGroup, with access to special hardware perhaps.
  - groupName: gpu-group
    ...

The rest of this guide will discuss the RayCluster CR’s config fields.

The Ray Version¶

The field rayVersion specifies the version of Ray used in the Ray cluster. The rayVersion is used to fill default values for certain config fields. The Ray container images specified in the RayCluster CR should carry the same Ray version as the CR’s rayVersion. If you are using a nightly or development Ray image, it is fine to set rayVersion to the latest release version of Ray.

Pod configuration: headGroupSpec and workerGroupSpecs¶

At a high level, a RayCluster is a collection of Kubernetes pods, similar to a Kubernetes Deployment or StatefulSet. Just as with the Kubernetes built-ins, the key pieces of configuration are

  • Pod specification

  • Scale information (how many pods are desired)

The key difference between a Deployment and a RayCluster is that a RayCluster is specialized for running Ray applications. A Ray cluster consists of

  • One head pod which hosts global control processes for the Ray cluster. The head pod can also run Ray tasks and actors.

  • Any number of worker pods, which run Ray tasks and actors. Workers come in worker groups of identically configured pods. For each worker group, we must specify replicas, the number of pods we want of that group.

The head pod’s configuration is specified under headGroupSpec, while configuration for worker pods is specified under workerGroupSpecs. There may be multiple worker groups, each group with its own configuration. The replicas field of a workerGroupSpec specifies the number of worker pods of that group to keep in the cluster.

Pod templates¶

The bulk of the configuration for a headGroupSpec or workerGroupSpec goes in the template field. The template is a Kubernetes Pod template which determines the configuration for the pods in the group. Here are some of the subfields of the pod template to pay attention to:

resources¶

It’s important to specify container CPU and memory requests and limits for each group spec. For GPU workloads, you may also wish to specify GPU limits. For example, set nvidia.com/gpu:2 if using an Nvidia GPU device plugin and you wish to specify a pod with access to 2 GPUs. See this guide for more details on GPU support.

It’s ideal to size each Ray pod to take up the entire Kubernetes node on which it is scheduled. In other words, it’s best to run one large Ray pod per Kubernetes node. In general, it is more efficient to use a few large Ray pods than many small ones. The pattern of fewer large Ray pods has the following advantages:

  • more efficient use of each Ray pod’s shared memory object store

  • reduced communication overhead between Ray pods

  • reduced redundancy of per-pod Ray control structures such as Raylets

The CPU, GPU, and memory limits specified in the Ray container config will be automatically advertised to Ray. These values will be used as the logical resource capacities of Ray pods in the head or worker group. Note that CPU quantities will be rounded up to the nearest integer before being relayed to Ray. The resource capacities advertised to Ray may be overridden in the Ray Start Parameters.

On the other hand CPU, GPU, and memory requests will be ignored by Ray. For this reason, it is best when possible to set resource requests equal to resource limits.

nodeSelector and tolerations¶

You can control the scheduling of worker groups’ Ray pods by setting the nodeSelector and tolerations fields of the pod spec. Specifically, these fields determine on which Kubernetes nodes the pods may be scheduled. See the Kubernetes docs for more about Pod-to-Node assignment.

image¶

The Ray container images specified in the RayCluster CR should carry the same Ray version as the CR’s spec.rayVersion. If you are using a nightly or development Ray image, it is fine to specify Ray’s latest release version under spec.rayVersion.

Code dependencies for a given Ray task or actor must be installed on each Ray node that might run the task or actor. To achieve this, it is simplest to use the same Ray image for the Ray head and all worker groups. In any case, do make sure that all Ray images in your CR carry the same Ray version and Python version. To distribute custom code dependencies across your cluster, you can build a custom container image, using one of the official Ray images as the base. See this guide to learn more about the official Ray images. For dynamic dependency management geared towards iteration and developement, you can also use Runtime Environments.

Ray Start Parameters¶

The rayStartParams field of each group spec is a string-string map of arguments to the Ray container’s ray start entrypoint. For the full list of arguments, refer to the documentation for ray start. We make special note of the following arguments:

block¶

For most use-cases, this field should be set to “true” for all Ray pod. The container’s Ray entrypoint will then block forever until a Ray process exits, at which point the container will exit. If this field is omitted, ray start will start Ray processes in the background and the container will subsequently sleep forever until terminated. (Future versions of KubeRay may set block to true by default. See KubeRay issue #368.)

dashboard-host¶

For most use-cases, this field should be set to “0.0.0.0” for the Ray head pod. This is required to expose the Ray dashboard outside the Ray cluster. (Future versions might set this parameter by default.)

num-cpus¶

This optional field tells the Ray scheduler and autoscaler how many CPUs are available to the Ray pod. The CPU count can be autodetected from the Kubernetes resource limits specified in the group spec’s pod template. However, it is sometimes useful to override this autodetected value. For example, setting num-cpus:"0" for the Ray head pod will prevent Ray workloads with non-zero CPU requirements from being scheduled on the head. Note that the values of all Ray start parameters, including num-cpus, must be supplied as strings.

num-gpus¶

This optional field specifies the number of GPUs available to the Ray container. In KubeRay versions since 0.3.0, the number of GPUs can be auto-detected from Ray container resource limits. For certain advanced use-cases, you may wish to use num-gpus to set an override. Note that the values of all Ray start parameters, including num-gpus, must be supplied as strings.

memory¶

The memory available to the Ray is detected automatically from the Kubernetes resource limits. If you wish, you may override this autodetected value by setting the desired memory value, in bytes, under rayStartParams.memory. Note that the values of all Ray start parameters, including memory, must be supplied as strings.

resources¶

This field can be used to specify custom resource capacities for the Ray pod. These resource capacities will be advertised to the Ray scheduler and Ray autoscaler. For example, the following annotation will mark a Ray pod as having 1 unit of Custom1 capacity and 5 units of Custom2 capacity.

rayStartParams:
    resources: '"{\"Custom1\": 1, \"Custom2\": 5}"'

You can then annotate tasks and actors with annotations like @ray.remote(resources={"Custom2": 1}). The Ray scheduler and autoscaler will take appropriate action to schedule such tasks.

Note the format used to express the resources string. In particular, note that the backslashes are present as actual characters in the string. If you are specifying a RayCluster programmatically, you may have to escape the backslashes to make sure they are processed as part of the string.

The field rayStartParams.resources should only be used for custom resources. The keys CPU, GPU, and memory are forbidden. If you need to specify overrides for those resource fields, use the Ray start parameters num-cpus, num-gpus, or memory.

Services and Networking¶

The Ray head service.¶

The KubeRay operator automatically configures a Kubernetes Service exposing the default ports for several services of the Ray head pod, including

  • Ray Client (default port 10001)

  • Ray Dashboard (default port 8265)

  • Ray GCS server (default port 6379)

  • Ray Serve (default port 8000)

The name of the configured Kubernetes Service is the name, metadata.name, of the RayCluster followed by the suffix head-svc. For the example CR given on this page, the name of the head service will be raycluster-example-head-svc. Kubernetes networking (kube-dns) then allows us to address the Ray head’s services using the name raycluster-example-head-svc. For example, the Ray Client server can be accessed from a pod in the same Kubernetes namespace using

ray.init("ray://raycluster-example-head-svc:10001")

The Ray Client server can be accessed from a pod in another namespace using

ray.init("ray://raycluster-example-head-svc.default.svc.cluster.local:10001")

(This assumes the Ray cluster was deployed into the default Kuberentes namespace. If the Ray cluster is deployed in a non-default namespace, use that namespace in place of default.)

ServiceType, Ingresses¶

Ray Client and other services can be exposed outside the Kubernetes cluster using port-forwarding or an ingress. The simplest way to access the Ray head’s services is to use port-forwarding.

Other means of exposing the head’s services outside the cluster may require using a service of type LoadBalancer or NodePort. Set headGroupSpec.serviceType to the appropriate type for your application.

You may wish to set up an ingress to expose the Ray head’s services outside the cluster. If you set the optional boolean field headGroupSpec.enableIngress to true, the KubeRay operator will create an ingress for your Ray cluster. See the KubeRay documentation for details. However, it is up to you to set up an ingress controller. Moreover, the ingress created by the KubeRay operator might not be compatible with your network setup. It is valid to omit the headGroupSpec.enableIngress field and configure an ingress object yourself.

Specifying non-default ports.¶

If you wish to override the ports exposed by the Ray head service, you may do so by specifying the Ray head container’s ports list, under headGroupSpec. Here is an example of a list of non-default ports for the Ray head service.

ports:
- containerPort: 6380
  name: gcs
- containerPort: 8266
  name: dashboard
- containerPort: 10002
  name: client

If the head container’s ports list is specified, the Ray head service will expose precisely the ports in the list. In the above example, the head service will expose just three ports; in particular there will be no port exposed for Ray Serve.

For the Ray head to actually use the non-default ports specified in the ports list, you must also specify the relevant rayStartParams. For the above example,

rayStartParams:
  port: "6380"
  dashboard-port: "8266"
  ray-client-server-port: "10002"
  ...

Autoscaler Configuration¶

Note

If you are deciding whether to use autoscaling for a particular Ray application, check out this discussion. Note that autoscaling is supported only with Ray versions at least as new as Ray 1.11.0.

To enable the optional Ray Autoscaler support, set enableInTreeAutoscaling:true. The KubeRay operator will then automatically configure an autoscaling sidecar container for the Ray head pod. The autoscaler container collects resource metrics from the Ray cluster and automatically adjusts the replicas field of each workerGroupSpec as needed to fulfill the requirements of your Ray application.

Use the fields minReplicas and maxReplicas to constrain the number of replicas of an autoscaling workerGroup. When deploying an autoscaling cluster, one typically sets replicas and minReplicas to the same value. The Ray autoscaler will then take over and modify the replicas field as needed by the Ray application.

Autoscaler operation¶

We describe how the autoscaler interacts with the RayCluster CR.

Scale up¶

The autoscaler scales worker pods up to accomodate the load of logical resources from your Ray application. For example, suppose you submit a task requesting 2 GPUs:

@ray.remote(num_gpus=2)
...

If your Ray cluster does not currently have any GPU worker pods, and if your configuration specifies a worker type with at least 2 units of GPU capacity, a GPU pod will be upscaled.

The autoscaler scales Ray worker pods up by editing the replicas field of the relevant workerGroupSpec.

Scale down¶

The autoscaler scales a worker pod down when the pod has not been using any logical resources for a set period of time. In this context, “resources” are the logical Ray resources (such as CPU, GPU, memory, and custom resources) specified in Ray task and actor annotations. Usage of the Ray Object Store also marks a Ray worker pod as active and prevents downscaling.

To scale down Ray pods of a given workerGroup, the autoscaler adds the Ray pods’ names to the relevant workerGroupSpec’s scaleStrategy.workersToDelete list and decrements the replicas field.

Manually scaling¶

You may manually adjust a RayCluster’s scale by editing the replicas or workersToDelete fields. (It is also possible to implement custom scaling logic that adjusts scale on your behalf.) It is however, not recommended to manually edit replicas or workersToDelete for a RayCluster with autoscaling enabled.

autoscalerOptions¶

To enable Ray autoscaler support, it is enough to set enableInTreeAutoscaling:true. Should you need to adjust autoscaling behavior or change the autoscaler container’s configuration, you can use the RayCluster CR’s autoscalerOptions field. The autoscalerOptions field carries the following subfields:

upscalingMode¶

The upscalingMode field can be used to control the rate of Ray pod upscaling.

UpscalingMode is Conservative, Default, or Aggressive.

  • Conservative: Upscaling is rate-limited; the number of pending worker pods is at most the number of worker pods connected to the Ray cluster.

  • Default: Upscaling is not rate-limited.

  • Aggressive: An alias for Default; upscaling is not rate-limited.

You may wish to use Conservative upscaling if you plan to submit many short-lived tasks to your Ray cluster. In this situation, Default upscaling may trigger the thrashing behavior:

  • The autoscaler sees resource demands from the submitted short-lived tasks.

  • The autoscaler immediately creates Ray pods to accomodate the demand.

  • By the time the additional Ray pods are provisioned, the tasks have already run to completion.

  • The additional Ray pods are unused and scale down after a period of idleness.

Note, however, that it is generally not recommended to over-parallelize with Ray. Since running a Ray task incurs scheduling overhead, it is usually preferable to use a few long-running tasks over many short-running tasks. Ensuring that each task has a non-trivial amount of work to do will also help prevent the autoscaler from over-provisioning Ray pods.

idleTimeoutSeconds¶

idleTimeoutSeconds is the number of seconds to wait before scaling down a worker pod which is not using resources. In this context, “resources” are the logical Ray resources (such as CPU, GPU, memory, and custom resources) specified in Ray task and actor annotations. Usage of the Ray Object Store also marks a Ray worker pod as active and prevents downscaling.

idleTimeoutSeconds defaults to 60 seconds.

resources¶

The resources subfield of autoscalerOptions sets optional resource overrides for the autoscaler sidecar container. These overrides should be specified in the standard container resource spec format. The default values are as indicated below:

resources:
  limits:
    cpu: "500m"
    memory: "512Mi"
  requests:
    cpu: "500m"
    memory: "512Mi"

These defaults should be suitable for most use-cases. However, we do recommend monitoring autoscaler container resource usage and adjusting as needed.

image and imagePullPolicy¶

The image subfield of autoscalerOptions optionally overrides the autoscaler container image. If your RayCluster’s spec.RayVersion is at least 2.0.0, the autoscaler will default to using the same image as the Ray container. (Ray autoscaler code is bundled with the rest of Ray.) For older Ray versions, the autoscaler will default to the image rayproject/ray:2.0.0.

The imagePullPolicy subfield of autoscalerOptions optionally overrides the autoscaler container’s image pull policy. The default is Always.

The image and imagePullPolicy overrides are provided primarily for the purposes of autoscaler testing and development.

env and envFrom¶

The env and envFrom fields specify autoscaler container environment variables, for debugging and development purposes. These fields should be formatted following the Kuberentes API for container environment variables.

Pod and container lifecyle: preStop hooks and initContainers¶

There are two pieces of pod configuration that should always be included in the RayCluster CR. Future versions of KubeRay may configure these elements automatically.

initContainer¶

It is required for the configuration of each workerGroupSpec’s pod template to include the following block:

initContainers:
- name: init-myservice
  image: busybox:1.28
  command: ['sh', '-c', "until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done"]

This instructs the worker pod to wait for creation of the Ray head service. The worker’s ray start command will use this service to connect to the Ray head.

(It is not required to include this init container in the Ray head pod’s configuration.)

preStopHook¶

It is recommended for every Ray container’s configuration to include the following blocking block:

lifecycle:
  preStop:
    exec:
      command: ["/bin/sh","-c","ray stop"]

To ensure graceful termination, ray stop is executed prior to the Ray pod’s termination.