Getting Started#

In this guide, we show you how to manage and interact with Ray clusters on Kubernetes.

You can download this guide as an executable Jupyter notebook by clicking the download button on the top right of the page.

Preparation#

Install the latest Ray release#

This step is needed to interact with remote Ray clusters using Ray Job Submission.

! pip install -U "ray[default]"

See Installing Ray for more details.
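To confirm that the installation worked, you can optionally print the version of the installed Ray CLI:

# Optional sanity check: print the installed Ray version.
! ray --version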

Install kubectl#

We will use kubectl to interact with Kubernetes. Find installation instructions at the Kubernetes documentation.

Install Helm#

We will use Helm to deploy a KubeRay operator and a RayCluster custom resource. Find instructions at the Helm documentation.

Access a Kubernetes cluster#

We will need access to a Kubernetes cluster. There are two options:

  1. Configure access to a remote Kubernetes cluster OR

  2. Run the examples locally by installing kind. Start your kind cluster by running the following command:

! kind create cluster --image=kindest/node:v1.23.0

To run the example in this guide, make sure your Kubernetes cluster (or local kind cluster) can accommodate additional resource requests of 3 CPU and 3Gi memory (for example, by setting your Docker resource limits high enough). Also, make sure your Kubernetes cluster and kubectl are both at least version 1.19.
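To confirm that kubectl can reach your cluster and to check the versions in use, you can run, for example:

# Check client and server versions and confirm the cluster is reachable.
! kubectl version
! kubectl get nodes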

Deploying the KubeRay operator#

Deploy the KubeRay operator with the Helm chart repository.

! helm repo add kuberay https://ray-project.github.io/kuberay-helm/

# Install both CRDs and KubeRay operator v0.5.0.
! helm install kuberay-operator kuberay/kuberay-operator --version 0.5.0

# Confirm that the operator is running in the namespace `default`.
! kubectl get pods
# NAME                                READY   STATUS    RESTARTS   AGE
# kuberay-operator-7fbdbf8c89-pt8bk   1/1     Running   0          27s

KubeRay offers multiple options for operator installation, such as Helm, Kustomize, and a single-namespaced operator. For further information, refer to the installation instructions in the KubeRay documentation.
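For example, if you prefer to keep the operator out of the default namespace, the same Helm chart can be installed into a dedicated namespace. The namespace name below is only an illustration; the rest of this guide assumes the operator runs in the default namespace.

# Illustrative alternative: install the operator into its own namespace.
! helm install kuberay-operator kuberay/kuberay-operator --version 0.5.0 --namespace kuberay-system --create-namespace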

Deploying a Ray Cluster#

Once the KubeRay operator is running, we are ready to deploy a Ray cluster. To do so, we create a RayCluster Custom Resource (CR) in the default namespace.

# Deploy a sample RayCluster CR from the KubeRay Helm chart repo:
! helm install raycluster kuberay/ray-cluster --version 0.5.0

# Once the RayCluster CR has been created, you can view it by running:
! kubectl get rayclusters

# NAME                 DESIRED WORKERS   AVAILABLE WORKERS   STATUS   AGE
# raycluster-kuberay   1                 1                   ready    72s

The KubeRay operator detects the RayCluster object and starts your Ray cluster by creating head and worker pods. To view the Ray cluster's pods, run the following command:

# View the pods in the Ray cluster named "raycluster-kuberay"
! kubectl get pods --selector=ray.io/cluster=raycluster-kuberay

# NAME                                          READY   STATUS    RESTARTS   AGE
# raycluster-kuberay-head-vkj4n                 1/1     Running   0          XXs
# raycluster-kuberay-worker-workergroup-xvfkr   1/1     Running   0          XXs

Wait for the pods to reach the Running state. This may take a few minutes; most of this time is spent downloading the Ray images. In a separate shell, you may wish to observe the pods' status in real time with the following command:

# If you're on macOS, first `brew install watch`.
# Run in a separate shell:
! watch -n 1 kubectl get pod

If your pods are stuck in the Pending state, you can check for errors via kubectl describe pod raycluster-kuberay-xxxx-xxxxx and ensure that your Docker resource limits are set high enough.
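For example, the following commands (a quick diagnostic sketch) list the Ray pods and print their events, which usually explain why scheduling failed:

# Inspect the Ray pods and their events to diagnose Pending pods.
! kubectl get pods --selector=ray.io/cluster=raycluster-kuberay
! kubectl describe pod --selector=ray.io/cluster=raycluster-kuberay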

Note that in production scenarios, you will want to use larger Ray pods. In fact, it is advantageous to size each Ray pod to take up an entire Kubernetes node. See the configuration guide for more details.
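As a starting point for sizing, you can inspect the resource requests and limits that the sample chart configured for the Ray pods, for example:

# Show the resource requests/limits configured for the sample Ray pods.
! kubectl get pod --selector=ray.io/cluster=raycluster-kuberay -o yaml | grep -A 6 "resources:"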

Running Applications on a Ray Cluster#

Now, let’s interact with the Ray cluster we’ve deployed.

Accessing the cluster with kubectl exec#

The most straightforward way to experiment with your Ray cluster is to exec directly into the head pod. First, identify your Ray cluster’s head pod:

! export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
! echo $HEAD_POD
# raycluster-kuberay-head-vkj4n

# Print the cluster resources.
! kubectl exec -it $HEAD_POD -- python -c "import ray; ray.init(); print(ray.cluster_resources())"

# 2023-04-07 10:57:46,472 INFO worker.py:1243 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
# 2023-04-07 10:57:46,472 INFO worker.py:1364 -- Connecting to existing Ray cluster at address: 10.244.0.6:6379...
# 2023-04-07 10:57:46,482 INFO worker.py:1550 -- Connected to Ray cluster. View the dashboard at http://10.244.0.6:8265 
# {'object_store_memory': 802572287.0, 'memory': 3000000000.0, 'node:10.244.0.6': 1.0, 'CPU': 2.0, 'node:10.244.0.7': 1.0}
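Another quick, ad-hoc check is to run the ray status CLI from the head pod, which summarizes the cluster's nodes and resource usage:

# Summarize cluster nodes and resource usage from the head pod.
! kubectl exec -it $HEAD_POD -- ray status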

While this can be useful for ad-hoc experimentation, the recommended way to execute an application on a Ray cluster is to use Ray Jobs.

Ray Job submission#

To set up your Ray cluster for Ray Job submission, we just need to make sure that the Ray Jobs port is accessible to the client. Ray listens for Job requests through the head pod's Dashboard server.

First, we need to find the location of the Ray head node. The KubeRay operator configures a Kubernetes service targeting the Ray head pod. This service allows us to interact with Ray clusters without directly executing commands in the Ray container. To identify the Ray head service for our example cluster, run:

! kubectl get service raycluster-kuberay-head-svc

# NAME                          TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)                                         AGE
# raycluster-kuberay-head-svc   ClusterIP   10.96.93.74   <none>        8265/TCP,8080/TCP,8000/TCP,10001/TCP,6379/TCP   15m

Now that we have the name of the service, we can use port-forwarding to access the Ray Dashboard port (8265 by default).

Note: The following port-forwarding command is blocking. If you are following along from a Jupyter notebook, the command must be executed in a separate shell outside of the notebook.

# Execute this in a separate shell.
! kubectl port-forward --address 0.0.0.0 service/raycluster-kuberay-head-svc 8265:8265

# Visit ${YOUR_IP}:8265 for the Dashboard (e.g. 127.0.0.1:8265)

Note: We use port-forwarding in this guide as a simple way to experiment with a Ray cluster's services. For production use cases, you would typically either

  • Access the service from within the Kubernetes cluster or

  • Use an ingress controller to expose the service outside the cluster.

See the networking notes for details.
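With the port-forward running in a separate shell, you can verify that the Ray Jobs endpoint is reachable, for example by listing jobs (the list will be empty on a fresh cluster):

# Verify the Jobs API is reachable through the port-forward.
! ray job list --address http://localhost:8265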

Now that we have access to the Dashboard port, we can submit jobs to the Ray Cluster:

# The following job's logs will show the Ray cluster's total resource capacity, including 2 CPUs.

! ray job submit --address http://localhost:8265 -- python -c "import ray; ray.init(); print(ray.cluster_resources())"
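As a slightly richer sketch, Ray Jobs can also upload a local working directory alongside the submission. The directory and script names below are hypothetical placeholders, not files provided by this guide:

# Hypothetical example: submit a local script from the placeholder directory ./my_job.
! ray job submit --address http://localhost:8265 --working-dir ./my_job -- python my_script.py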

For a more detailed guide on using Ray Jobs to run applications on a Ray cluster, check out the quickstart guide.

Cleanup#

Deleting a RayCluster#

To delete the RayCluster we deployed in this example, run the following command.

# Uninstall the RayCluster Helm chart
! helm uninstall raycluster
# release "raycluster" uninstalled

# Note that it may take several seconds for the Ray pods to be fully terminated.
# Confirm that the Ray Cluster's pods are gone by running
! kubectl get pods

# NAME                                READY   STATUS    RESTARTS   AGE
# kuberay-operator-7fbdbf8c89-pt8bk   1/1     Running   0          XXm
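You can also confirm that the RayCluster custom resource itself has been removed:

# The RayCluster CR should no longer be listed.
! kubectl get rayclusters
# No resources found in default namespace.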

Deleting the KubeRay operator#

In typical operation, the KubeRay operator should be left as a long-running process that manages many Ray clusters. If you would like to delete the operator and associated resources, run

# Uninstall the KubeRay operator Helm chart
! helm uninstall kuberay-operator
# release "kuberay-operator" uninstalled

# Confirm that the KubeRay operator pod is gone by running
! kubectl get pods
# No resources found in default namespace.

Deleting a local kind cluster#

Finally, if you’d like to delete your local kind cluster, run

! kind delete cluster