Priority Scheduling with RayJob and Kueue#
This guide shows how to run the Fine-tune a PyTorch Lightning Text Classifier with Ray Data example as a RayJob, and how to use Kueue to orchestrate priority scheduling and quota management.
What’s Kueue?#
Kueue is a Kubernetes-native job queueing system that manages quotas and how jobs consume them. Kueue decides when:
To make a job wait.
To admit a job to start, meaning that Kubernetes creates Pods.
To preempt a job, meaning that Kubernetes deletes active Pods.
Kueue has native support for some KubeRay APIs. Specifically, you can use Kueue to manage resources consumed by RayJob and RayCluster. See the Kueue documentation to learn more.
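Concretely, Kueue implements these decisions for a RayJob by toggling the RayJob's spec.suspend field: a suspended RayJob creates no Pods, and admission flips the field to false. As a rough check (rayjob-sample is a hypothetical name for illustration), you can watch the field directly:

kubectl get rayjob rayjob-sample -o jsonpath='{.spec.suspend}'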
Step 0: Create a Kubernetes cluster on GKE (Optional)#
If you already have a Kubernetes cluster with GPUs, you can skip this step. Otherwise, follow Start Google Cloud GKE Cluster with GPUs for KubeRay to set up a Kubernetes cluster on GKE.
Step 1: Install the KubeRay operator#
Follow Deploy a KubeRay operator to install the latest stable KubeRay operator from the Helm repository. The KubeRay operator Pod must be on the CPU node if you set up the taint for the GPU node pool correctly.
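To verify the placement, you can list the operator Pod along with its node assignment; the label selector below assumes the default labels set by the KubeRay Helm chart:

kubectl get pods -o wide -l app.kubernetes.io/name=kuberay-operator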
Step 2: Install Kueue#
VERSION=v0.8.2
kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/$VERSION/manifests.yaml
See Kueue Installation for more details on installing Kueue.
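Before continuing, you can confirm that the Kueue controller manager is running; by default, the manifests install it into the kueue-system namespace:

kubectl get pods -n kueue-system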
Step 3: Configure Kueue with priority scheduling#
To understand this tutorial, it's important to understand the following Kueue concepts: ResourceFlavor, ClusterQueue, LocalQueue, and WorkloadPriorityClass. The kueue-resources.yaml manifest below defines all of them:
# kueue-resources.yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "default-flavor"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cluster-queue"
spec:
  preemption:
    withinClusterQueue: LowerPriority
  namespaceSelector: {} # Match all namespaces.
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 2
      - name: "memory"
        nominalQuota: 8G
      - name: "nvidia.com/gpu" # ClusterQueue only has quota for a single GPU.
        nominalQuota: 1
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: "default"
  name: "user-queue"
spec:
  clusterQueue: "cluster-queue"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: prod-priority
value: 1000
description: "Priority class for prod jobs"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: dev-priority
value: 100
description: "Priority class for development jobs"
The YAML manifest configures:
ResourceFlavor: The ResourceFlavor default-flavor is an empty ResourceFlavor because the compute resources in the Kubernetes cluster are homogeneous. In other words, users can request 1 GPU without considering whether it's an NVIDIA A100 or a T4 GPU.
ClusterQueue: The ClusterQueue cluster-queue has only one ResourceFlavor, default-flavor, with quotas for 2 CPUs, 8G of memory, and 1 GPU. This exactly matches the resources requested by one RayJob custom resource, so only one RayJob can run at a time. The ClusterQueue also has the preemption policy withinClusterQueue: LowerPriority, which allows a pending RayJob that doesn't fit within the nominal quota of its ClusterQueue to preempt active RayJob custom resources in the ClusterQueue that have lower priority.
LocalQueue: The LocalQueue user-queue is a namespaced object in the default namespace that belongs to a ClusterQueue. A typical practice is to assign a namespace to a tenant, team, or user of an organization. Users submit jobs to a LocalQueue instead of to a ClusterQueue directly.
WorkloadPriorityClass: The WorkloadPriorityClass prod-priority has a higher value than the WorkloadPriorityClass dev-priority. This means that RayJob custom resources with the prod-priority priority class take precedence over RayJob custom resources with the dev-priority priority class.
Create the Kueue resources:
kubectl apply -f kueue-resources.yaml
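As a sanity check, you can list the objects you just created. Note that LocalQueue is namespaced, while the other resources are cluster-scoped:

kubectl get resourceflavors,clusterqueues,workloadpriorityclasses
kubectl get localqueues -n default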
Step 4: Deploy a RayJob#
Download the RayJob that executes all the steps documented in Fine-tune a PyTorch Lightning Text Classifier. The source code is also in the KubeRay repository.
curl -LO https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/pytorch-text-classifier/ray-job.pytorch-distributed-training.yaml
Before creating the RayJob, modify the RayJob metadata with:
metadata:
  generateName: dev-pytorch-text-classifier-
  labels:
    kueue.x-k8s.io/queue-name: user-queue
    kueue.x-k8s.io/priority-class: dev-priority
kueue.x-k8s.io/queue-name: user-queue: As the previous step mentioned, users submit jobs to a LocalQueue instead of directly to a ClusterQueue.
kueue.x-k8s.io/priority-class: dev-priority: Assigns the dev-priority WorkloadPriorityClass to the RayJob.
A modified name to indicate that this job is for development.
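If you'd rather script the edit than modify the file by hand, here's a minimal sketch using yq v4 (an assumption; any YAML editor works). It removes the fixed metadata.name so that generateName takes effect with kubectl create:

# Sketch assuming yq v4 and a single-document manifest.
yq -i '
  del(.metadata.name) |
  .metadata.generateName = "dev-pytorch-text-classifier-" |
  .metadata.labels."kueue.x-k8s.io/queue-name" = "user-queue" |
  .metadata.labels."kueue.x-k8s.io/priority-class" = "dev-priority"
' ray-job.pytorch-distributed-training.yaml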
Also note the resources required for this RayJob by looking at the resources that the Ray head Pod requests:
resources:
  limits:
    memory: "8G"
    nvidia.com/gpu: "1"
  requests:
    cpu: "2"
    memory: "8G"
    nvidia.com/gpu: "1"
Now deploy the RayJob:
$ kubectl create -f ray-job.pytorch-distributed-training.yaml
rayjob.ray.io/dev-pytorch-text-classifier-r6d4p created
Verify that the RayCluster and the submitter Kubernetes Job are running:
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
dev-pytorch-text-classifier-r6d4p-4nczg 1/1 Running 0 4s # Submitter Kubernetes Job
torch-text-classifier-r6d4p-raycluster-br45j-head-8bbwt 1/1 Running 0 34s # Ray head Pod
Delete the RayJob after verifying that the job has completed successfully.
$ kubectl get rayjobs.ray.io dev-pytorch-text-classifier-r6d4p -o jsonpath='{.status.jobStatus}'
SUCCEEDED
$ kubectl get rayjobs.ray.io dev-pytorch-text-classifier-r6d4p -o jsonpath='{.status.jobDeploymentStatus}'
Complete
$ kubectl delete rayjob dev-pytorch-text-classifier-r6d4p
rayjob.ray.io "dev-pytorch-text-classifier-r6d4p" deleted
Step 5: Queuing multiple RayJob resources#
Create 3 RayJob custom resources to see how Kueue interacts with KubeRay to implement job queueing.
$ kubectl create -f ray-job.pytorch-distributed-training.yaml
rayjob.ray.io/dev-pytorch-text-classifier-8vg2c created
$ kubectl create -f ray-job.pytorch-distributed-training.yaml
rayjob.ray.io/dev-pytorch-text-classifier-n5k89 created
$ kubectl create -f ray-job.pytorch-distributed-training.yaml
rayjob.ray.io/dev-pytorch-text-classifier-ftcs9 created
Because each RayJob requests 1 GPU and the ClusterQueue has quotas for only 1 GPU, Kueue automatically suspends new RayJob resources until GPU quotas become available.
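One way to observe this (a quick check, not part of the original workflow) is to print the suspend flag that Kueue manages on each RayJob; only the admitted job shows false:

kubectl get rayjobs.ray.io -o custom-columns=NAME:.metadata.name,SUSPENDED:.spec.suspend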
You can also inspect the ClusterQueue to see available and used quotas:
$ kubectl get clusterqueue
NAME            COHORT   PENDING WORKLOADS
cluster-queue            2
$ kubectl get clusterqueue cluster-queue -o yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
...
status:
  admittedWorkloads: 1 # Workloads admitted by queue.
  flavorsReservation:
  - name: default-flavor
    resources:
    - borrowed: "0"
      name: cpu
      total: "8"
    - borrowed: "0"
      name: memory
      total: 19531250Ki
    - borrowed: "0"
      name: nvidia.com/gpu
      total: "2"
  flavorsUsage:
  - name: default-flavor
    resources:
    - borrowed: "0"
      name: cpu
      total: "8"
    - borrowed: "0"
      name: memory
      total: 19531250Ki
    - borrowed: "0"
      name: nvidia.com/gpu
      total: "2"
  pendingWorkloads: 2 # Queued workloads waiting for quotas.
  reservingWorkloads: 1 # Running workloads that are using quotas.
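Kueue also creates a Workload object for each RayJob in the job's namespace; listing them is another way to see which jobs are admitted and which are still queued:

kubectl get workloads -n default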
Step 6: Deploy a RayJob with higher priority#
At this point there are multiple RayJob custom resources queued up but only enough quota to run a single RayJob. Now you can create a new RayJob with higher priority to preempt the already queued RayJob resources. Modify the RayJob with:
metadata:
  generateName: prod-pytorch-text-classifier-
  labels:
    kueue.x-k8s.io/queue-name: user-queue
    kueue.x-k8s.io/priority-class: prod-priority
kueue.x-k8s.io/queue-name: user-queue: As the previous step mentioned, users submit jobs to a LocalQueue instead of directly to a ClusterQueue.
kueue.x-k8s.io/priority-class: prod-priority: Assigns the prod-priority WorkloadPriorityClass to the RayJob.
A modified name to indicate that this job is for production.
Create the new RayJob:
$ kubectl create -f ray-job.pytorch-distributed-training.yaml
rayjob.ray.io/prod-pytorch-text-classifier-gkp9b created
Note that higher priority jobs preempt lower priority jobs when there aren’t enough quotas for both:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
prod-pytorch-text-classifier-gkp9b-r9k5r 1/1 Running 0 5s
torch-text-classifier-gkp9b-raycluster-s2f65-head-hfvht 1/1 Running 0 35s
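Re-running the suspend check from Step 5 should show the preemption explicitly: the prod RayJob reports suspended false, while the preempted dev RayJobs flip back to true and wait in the queue:

kubectl get rayjobs.ray.io -o custom-columns=NAME:.metadata.name,SUSPENDED:.spec.suspend,DEPLOYMENT_STATUS:.status.jobDeploymentStatus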