Start Google Cloud GKE Cluster with TPUs for KubeRay#
See the GKE documentation for full details, or continue reading for a quick start.
Step 1: Create a Kubernetes cluster on GKE#
First, set the following environment variables to be used for GKE cluster creation:
export CLUSTER_NAME=CLUSTER_NAME
export ZONE=ZONE
export CLUSTER_VERSION=CLUSTER_VERSION
Replace the following:
CLUSTER_NAME: The name of the GKE cluster to be created.
ZONE: The zone with available TPU quota. For a list of TPU availability by zone, see the GKE documentation.
CLUSTER_VERSION: The GKE version to use. TPU v6e is supported in GKE versions 1.31.2-gke.1115000 or later. See the GKE documentation for TPU generations and their minimum supported version.
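For example, for a v4 TPU cluster, the values might look like the following. These values are illustrative: us-central2-b is the v4 TPU zone noted below, and the version shown is the TPU v6e minimum mentioned above, so choose whichever version your zone offers:
export CLUSTER_NAME=my-tpu-cluster
export ZONE=us-central2-b
export CLUSTER_VERSION=1.31.2-gke.1115000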
Run the following commands on your local machine or in Google Cloud Shell. If running from your local machine, install the Google Cloud SDK.
Create a Standard GKE cluster and enable the Ray Operator:
gcloud container clusters create $CLUSTER_NAME \
--addons=RayOperator \
--machine-type=n1-standard-16 \
--cluster-version=$CLUSTER_VERSION \
--location=$ZONE
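Cluster creation can take several minutes. As a quick sanity check, you can query the cluster status with gcloud:
# Prints RUNNING once the cluster is ready.
gcloud container clusters describe $CLUSTER_NAME \
  --zone $ZONE \
  --format="value(status)"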
Run one of the following commands to add a TPU node pool to the cluster. You can also create it from the Google Cloud console.
Create a node pool with a single-host v4 TPU topology as follows:
gcloud container node-pools create v4-4 \
--zone $ZONE \
--cluster $CLUSTER_NAME \
--num-nodes 1 \
--min-nodes 0 \
--max-nodes 10 \
--enable-autoscaling \
--machine-type ct4p-hightpu-4t \
--tpu-topology 2x2x1
For v4 TPUs, ZONE must be us-central2-b.
Alternatively, create a multi-host node pool as follows:
gcloud container node-pools create v4-8 \
--zone $ZONE \
--cluster $CLUSTER_NAME \
--num-nodes 2 \
--min-nodes 0 \
--max-nodes 10 \
--enable-autoscaling \
--machine-type ct4p-hightpu-4t \
--tpu-topology 2x2x2
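Whichever topology you choose, you can confirm that the node pool was created with the expected machine type and autoscaling settings:
# Lists node pools with their machine types and autoscaling configuration.
gcloud container node-pools list \
  --cluster $CLUSTER_NAME \
  --zone $ZONE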
The --tpu-topology flag specifies the physical topology of the TPU Pod slice. This example uses a v4 TPU slice with either a 2x2x1 or 2x2x2 topology. v4 TPUs have 4 chips per VM host, so a 2x2x2 v4 slice has 8 chips total and 2 TPU hosts, each scheduled on its own node. GKE treats multi-host TPU slices as atomic units, and scales them using node pools rather than individual nodes. Therefore, the number of TPU hosts always equals the number of nodes in the TPU node pool. For more information about selecting a TPU topology and accelerator, see the GKE documentation.
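To make the host arithmetic concrete, here is a small shell sketch. The 2x2x2 topology and the 4 chips per host come from the paragraph above:
# Derive the host (node) count of a v4 slice from its topology string.
TOPOLOGY=2x2x2
CHIPS=$(( ${TOPOLOGY//x/*} ))  # 2*2*2 = 8 chips in the slice
HOSTS=$(( CHIPS / 4 ))         # v4 TPUs have 4 chips per VM host -> 2 hosts
echo "A v4 ${TOPOLOGY} slice has ${CHIPS} chips across ${HOSTS} hosts"
The resulting host count matches the --num-nodes 2 used for the multi-host node pool above.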
GKE uses Kubernetes node selectors to ensure TPU workloads run on the desired machine type and topology. For more details, see the GKE documentation.
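After you connect to the cluster in the next step, you can inspect the TPU scheduling labels GKE applies to nodes, such as cloud.google.com/gke-tpu-accelerator and cloud.google.com/gke-tpu-topology:
# Lists TPU nodes with their accelerator and topology labels as extra columns.
kubectl get nodes \
  -L cloud.google.com/gke-tpu-accelerator \
  -L cloud.google.com/gke-tpu-topology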
Step 2: Connect to the GKE cluster#
Run the following command to download Google Cloud credentials and configure the Kubernetes CLI to use them.
gcloud container clusters get-credentials $CLUSTER_NAME --zone $ZONE
The remote GKE cluster is now reachable through kubectl. For more details, see the GKE documentation.
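A quick way to verify the connection:
# Both commands should succeed against the remote GKE cluster.
kubectl cluster-info
kubectl get nodes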
[Optional] Manually install KubeRay and the TPU webhook in a GKE cluster without the Ray Operator Addon#
In a cluster without the Ray Operator Addon enabled, you can install KubeRay manually with Helm using the following commands:
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
# Install both CRDs and KubeRay operator v1.3.0.
helm install kuberay-operator kuberay/kuberay-operator --version 1.3.0
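To verify the installation, check that the operator Pod is running and that the Ray CRDs exist. This sketch assumes the chart's default labels and CRD names:
# The operator Pod should reach Running status.
kubectl get pods -l app.kubernetes.io/name=kuberay-operator
# The CRDs installed alongside the operator.
kubectl get crd rayclusters.ray.io rayjobs.ray.io rayservices.ray.io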
GKE provides a validating and mutating webhook to handle TPU Pod scheduling and bootstrap certain environment variables used for JAX initialization. The Ray TPU webhook requires a KubeRay operator version of at least v1.1.0. GKE automatically installs the Ray TPU webhook through the Ray Operator Addon with GKE versions 1.30.0-gke.1747000 or later.
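If you are unsure whether your cluster qualifies, you can check its current version and compare it against 1.30.0-gke.1747000:
# Prints the cluster's current GKE version.
gcloud container clusters describe $CLUSTER_NAME \
  --zone $ZONE \
  --format="value(currentMasterVersion)"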
When manually installing the webhook, cert-manager is required to handle TLS certificate injection. You can install cert-manager in both GKE Standard and Autopilot clusters using the following Helm commands:
Install cert-manager:
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install --create-namespace --namespace cert-manager --set installCRDs=true --set global.leaderElection.namespace=cert-manager cert-manager jetstack/cert-manager
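Confirm that cert-manager is ready before deploying the webhook:
# All cert-manager Pods should reach Running status.
kubectl get pods --namespace cert-manager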
Next, deploy the Ray TPU initialization webhook:
git clone https://github.com/GoogleCloudPlatform/ai-on-gke
cd ai-on-gke/ray-on-gke/tpu/kuberay-tpu-webhook
make deploy deploy-cert
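You can then confirm that the webhook registered with the cluster. The exact resource names depend on the manifests in the repository, so this is a rough check:
# Look for the TPU webhook among the cluster's admission webhooks.
kubectl get mutatingwebhookconfigurations | grep -i tpu
kubectl get validatingwebhookconfigurations | grep -i tpu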