The Ray Kubernetes Operator¶
Ray provides a Kubernetes Operator for managing autoscaling Ray clusters. Using the operator provides similar functionality to deploying a Ray cluster using the Ray Cluster Launcher. However, working with the operator does not require running Ray locally – all interactions with your Ray cluster are mediated by Kubernetes.
The operator makes use of a Kubernetes Custom Resource called a RayCluster.
A RayCluster is specified by a configuration similar to the yaml files used by the Ray Cluster Launcher.
Internally, the operator uses Ray’s autoscaler to manage your Ray cluster. However, the autoscaler runs in a
separate operator pod, rather than on the Ray head node. Applying multiple RayCluster custom resources in the operator’s
namespace allows the operator to manage several Ray clusters.
The rest of this document explains step-by-step how to use the Ray Kubernetes Operator to launch a Ray cluster on your existing Kubernetes cluster.
Warning
The Ray Kubernetes Operator requires Kubernetes v1.17.0 or later. Check your Kubernetes version with the command kubectl version.
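For example, assuming the default kubectl version output format, the following prints just the server (cluster) version, which should report v1.17.0 or higher; the grep pattern is only illustrative.
$ kubectl version | grep -i "server version"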
Note
The example commands in this document launch six Kubernetes pods, using a total of 6 CPU and 3.5Gi memory.
If you are experimenting in a test Kubernetes environment such as minikube, make sure to provision sufficient resources, e.g. minikube start --cpus=6 --memory="4G". Alternatively, reduce resource usage by editing the yaml files referenced in this document; for example, reduce minWorkers in example_cluster.yaml and example_cluster2.yaml.
Applying the RayCluster Custom Resource Definition¶
First, we need to apply the Kubernetes Custom Resource Definition (CRD) defining a RayCluster.
Note
Creating a Custom Resource Definition requires the appropriate Kubernetes cluster-level privileges.
$ kubectl apply -f ray/python/ray/autoscaler/kubernetes/operator_configs/cluster_crd.yaml
customresourcedefinition.apiextensions.k8s.io/rayclusters.cluster.ray.io created
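If you want to double-check that the new resource type was registered, you can query the CRD by the name shown in the output above:
$ kubectl get crd rayclusters.cluster.ray.io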
Picking a Kubernetes Namespace¶
The rest of the Kubernetes resources we will use are namespaced.
You can use an existing namespace for your Ray clusters or create a new one if you have permissions.
For this example, we will create a namespace called ray.
$ kubectl create namespace ray
namespace/ray created
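If you would rather not pass -n ray to every command, you can optionally make ray the default namespace for your current kubectl context; the rest of this document keeps the explicit flag for clarity.
$ kubectl config set-context --current --namespace=ray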
Starting the Operator¶
To launch the operator in our namespace, we execute the following command.
$ kubectl -n ray apply -f ray/python/ray/autoscaler/kubernetes/operator_configs/operator.yaml
serviceaccount/ray-operator-serviceaccount created
role.rbac.authorization.k8s.io/ray-operator-role created
rolebinding.rbac.authorization.k8s.io/ray-operator-rolebinding created
pod/ray-operator-pod created
The output shows that we’ve launched a Pod named ray-operator-pod. This is the pod that runs the operator process.
The ServiceAccount, Role, and RoleBinding we have created grant the operator pod the permissions it needs to manage Ray clusters.
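If you are curious exactly which permissions those are, you can inspect the Role created by the command above:
$ kubectl -n ray describe role ray-operator-role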
Launching Ray Clusters¶
Finally, to launch a Ray cluster, we create a RayCluster custom resource.
$ kubectl -n ray apply -f ray/python/ray/autoscaler/kubernetes/operator_configs/example_cluster.yaml
raycluster.cluster.ray.io/example-cluster created
The operator detects the RayCluster resource we’ve created and launches an autoscaling Ray cluster.
Our RayCluster configuration specifies minWorkers:2 in the second entry of spec.podTypes, so we get a head node and two workers upon launch.
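For orientation, the overall shape of the resource is sketched below. Only apiVersion, kind, spec.podTypes, and minWorkers are taken from this document (the podType names head-node and worker-nodes appear in the autoscaler logs later on); the remaining structure and the comments are an illustrative assumption rather than the full schema.
apiVersion: cluster.ray.io/v1
kind: RayCluster
metadata:
  name: example-cluster
spec:
  podTypes:
    - name: head-node       # first entry: the Ray head pod
      # ... pod configuration ...
    - name: worker-nodes    # second entry: autoscaled Ray worker pods
      minWorkers: 2         # the autoscaler keeps at least two workers running
      # ... pod configuration ...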
Note
For more details about RayCluster resources, we recommend taking a look at the annotated example example_cluster.yaml applied in the last command.
$ kubectl -n ray get pods
NAME                               READY   STATUS    RESTARTS   AGE
example-cluster-ray-head-hbxvv     1/1     Running   0          72s
example-cluster-ray-worker-4hvv6   1/1     Running   0          64s
example-cluster-ray-worker-78kp5   1/1     Running   0          64s
ray-operator-pod                   1/1     Running   0          2m33s
We see four pods: the operator, the Ray head node, and two Ray worker nodes.
Let’s launch another cluster in the same namespace, this one specifying minWorkers:1.
$ kubectl -n ray apply -f ray/python/ray/autoscaler/kubernetes/operator_configs/example_cluster2.yaml
We confirm that both clusters are running in our namespace.
$ kubectl -n ray get rayclusters
NAME               AGE
example-cluster    12m
example-cluster2   114s
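Because RayCluster is an ordinary (custom) Kubernetes resource, an individual cluster can also be inspected with describe; the output should roughly mirror the yaml we applied.
$ kubectl -n ray describe raycluster example-cluster2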
$ kubectl -n ray get pods
NAME                                READY   STATUS    RESTARTS   AGE
example-cluster-ray-head-th4wv      1/1     Running   0          10m
example-cluster-ray-worker-q9pjn    1/1     Running   0          10m
example-cluster-ray-worker-qltnp    1/1     Running   0          10m
example-cluster2-ray-head-kj5mg     1/1     Running   0          10s
example-cluster2-ray-worker-qsgnd   1/1     Running   0          1s
ray-operator-pod                    1/1     Running   0          10m
Now we can run Ray programs on our Ray clusters.
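As a quick sanity check, you can, for example, exec into a head pod and run a short Ray program against the running cluster. The pod name below is taken from the listing above; substitute the head pod name from your own kubectl get pods output, and note that the exact Python invocation depends on the image your cluster uses.
$ kubectl -n ray exec -it example-cluster-ray-head-th4wv -- python -c \
    "import ray; ray.init(address='auto'); print(ray.cluster_resources())"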
Monitoring¶
Autoscaling logs are written to the operator pod’s stdout and can be accessed with kubectl logs.
Each line of output is prefixed by the name of the cluster followed by a colon.
The following command gets the last hundred lines of autoscaling logs for our second cluster.
$ kubectl -n ray logs ray-operator-pod | grep ^example-cluster2: | tail -n 100
The output should include monitoring updates that look like this:
example-cluster2:2020-12-12 13:55:36,814 DEBUG autoscaler.py:693 -- Cluster status: 1 nodes
example-cluster2: - MostDelayedHeartbeats: {'172.17.0.4': 0.04093289375305176, '172.17.0.5': 0.04084634780883789}
example-cluster2: - NodeIdleSeconds: Min=36 Mean=38 Max=41
example-cluster2: - ResourceUsage: 0.0/2.0 CPU, 0.0/1.0 Custom1, 0.0/1.0 is_spot, 0.0 GiB/0.58 GiB memory, 0.0 GiB/0.1 GiB object_store_memory
example-cluster2: - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0
example-cluster2:Worker node types:
example-cluster2: - worker-nodes: 1
example-cluster2:2020-12-12 13:55:36,870 INFO resource_demand_scheduler.py:148 -- Cluster resources: [{'object_store_memory': 1.0, 'node:172.17.0.4': 1.0, 'memory': 5.0, 'CPU': 1.0}, {'object_store_memory': 1.0, 'is_spot': 1.0, 'memory': 6.0, 'node:172.17.0.5': 1.0, 'Custom1': 1.0, 'CPU': 1.0}]
example-cluster2:2020-12-12 13:55:36,870 INFO resource_demand_scheduler.py:149 -- Node counts: defaultdict(<class 'int'>, {'head-node': 1, 'worker-nodes': 1})
example-cluster2:2020-12-12 13:55:36,870 INFO resource_demand_scheduler.py:159 -- Placement group demands: []
example-cluster2:2020-12-12 13:55:36,870 INFO resource_demand_scheduler.py:186 -- Resource demands: []
example-cluster2:2020-12-12 13:55:36,870 INFO resource_demand_scheduler.py:187 -- Unfulfilled demands: []
example-cluster2:2020-12-12 13:55:36,891 INFO resource_demand_scheduler.py:209 -- Node requests: {}
example-cluster2:2020-12-12 13:55:36,903 DEBUG autoscaler.py:654 -- example-cluster2-ray-worker-tdxdr is not being updated and passes config check (can_update=True).
example-cluster2:2020-12-12 13:55:36,923 DEBUG autoscaler.py:654 -- example-cluster2-ray-worker-tdxdr is not being updated and passes config check (can_update=True).
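To follow a cluster’s autoscaling output live rather than re-running the command above, you can stream the operator logs and filter them in the same way:
$ kubectl -n ray logs -f ray-operator-pod | grep --line-buffered ^example-cluster2: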
Updating and Retrying¶
To update a Ray cluster’s configuration, edit the yaml file of the corresponding RayCluster resource and apply it again:
$ kubectl -n ray apply -f ray/python/ray/autoscaler/kubernetes/operator_configs/example_cluster.yaml
To force a restart with the same configuration, you can add an annotation to the RayCluster resource’s metadata.annotations field, e.g.
apiVersion: cluster.ray.io/v1
kind: RayCluster
metadata:
  name: example-cluster
  annotations:
    try: again
spec:
  ...
Then reapply the RayCluster, as above.
Currently, editing and reapplying a RayCluster resource will stop and restart Ray processes running on the corresponding Ray cluster. Similarly, deleting and relaunching the operator pod will stop and restart Ray processes on all Ray clusters in the operator’s namespace. This behavior may be modified in future releases.
Cleaning Up¶
We shut down a Ray cluster by deleting the associated RayCluster resource.
Either of the next two commands will delete our second cluster example-cluster2.
$ kubectl -n ray delete raycluster example-cluster2
# OR
$ kubectl -n ray delete -f ray/python/ray/autoscaler/kubernetes/operator_configs/example_cluster2.yaml
The pods associated with example-cluster2 go into TERMINATING status. In a few moments, we check that these pods are gone:
$ kubectl -n ray get pods
NAME                               READY   STATUS    RESTARTS   AGE
example-cluster-ray-head-th4wv     1/1     Running   0          57m
example-cluster-ray-worker-q9pjn   1/1     Running   0          56m
example-cluster-ray-worker-qltnp   1/1     Running   0          56m
ray-operator-pod                   1/1     Running   0          57m
Only the operator pod and the pods of the first cluster, example-cluster, remain.
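If you would rather watch the pods terminate in real time than re-run get pods, kubectl can stream status updates (interrupt with Ctrl-C when done):
$ kubectl -n ray get pods --watch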
To finish clean-up, we delete the cluster example-cluster and then the operator’s resources.
$ kubectl -n ray delete raycluster example-cluster
$ kubectl -n ray delete -f ray/python/ray/autoscaler/kubernetes/operator_configs/operator.yaml
If you like, you can delete the RayCluster custom resource definition. (Using the operator again will then require reapplying the CRD.)
$ kubectl delete crd rayclusters.cluster.ray.io
# OR
$ kubectl delete -f ray/python/ray/autoscaler/kubernetes/operator_configs/cluster_crd.yaml
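Finally, if you created the ray namespace only for this walkthrough, you can remove it as well; this deletes the namespace and anything still left inside it.
$ kubectl delete namespace ray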