Train a PyTorch model on Fashion MNIST with CPUs on Kubernetes#
This example runs distributed training of a PyTorch model on Fashion MNIST with Ray Train. See Train a PyTorch model on Fashion MNIST for more details.
Step 1: Create a Kubernetes cluster#
This step creates a local Kubernetes cluster using Kind. If you already have a Kubernetes cluster, you can skip this step.
kind create cluster --image=kindest/node:v1.26.0
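You can verify that the cluster is up and that kubectl points at it. The `kind-kind` context below is the default for a Kind cluster created without `--name`:

# Check that the Kind cluster is reachable.
kubectl cluster-info --context kind-kind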
Step 2: Install KubeRay operator#
Follow this document to install the latest stable KubeRay operator from the Helm repository.
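If you take the Helm route, the installation looks roughly like the following. The chart version is only an example; check the KubeRay releases for the latest stable version:

# Add the KubeRay Helm repository and install the operator.
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
# The version below is an example; use the latest stable release instead.
helm install kuberay-operator kuberay/kuberay-operator --version 1.1.1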
Step 3: Create a RayJob#
A RayJob consists of a RayCluster custom resource and a job that you can submit to the RayCluster. With a RayJob, KubeRay creates a RayCluster and submits a job when the cluster is ready. The following is a CPU-only RayJob description YAML file for training a PyTorch model on Fashion MNIST.
# Download `ray-job.pytorch-mnist.yaml`
curl -LO https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/pytorch-mnist/ray-job.pytorch-mnist.yaml
You might need to adjust some fields in the RayJob description YAML file so that it can run in your environment. An abridged sketch of the file follows this list.

- `replicas` under `workerGroupSpecs` in `rayClusterSpec`: This field specifies the number of worker Pods that KubeRay schedules to the Kubernetes cluster. Each worker Pod requires 3 CPUs, and the head Pod requires 1 CPU, as described in the `template` field. A RayJob submitter Pod requires 1 CPU. For example, if your machine has 8 CPUs, the maximum `replicas` value is 2, because 1 (head) + 1 (submitter) + 2 × 3 (workers) = 8 CPUs, which lets all Pods reach the `Running` status.
- `NUM_WORKERS` under `runtimeEnvYAML` in `spec`: This field indicates the number of Ray actors to launch (see `ScalingConfig` in this document for more information). Each Ray actor must be served by a worker Pod in the Kubernetes cluster, so `NUM_WORKERS` must be less than or equal to `replicas`.
- `CPUS_PER_WORKER`: This value must be less than or equal to `(CPU resource request per worker Pod) - 1`. For example, in the sample YAML file, the CPU resource request per worker Pod is 3 CPUs, so `CPUS_PER_WORKER` must be set to 2 or less.
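For reference, the following is an abridged sketch of how these fields sit in the file. It isn't the complete sample; the field names follow the KubeRay RayJob CRD, and elided sections are marked with `...`.

# Abridged sketch of `ray-job.pytorch-mnist.yaml`, not the complete file.
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-pytorch-mnist
spec:
  entrypoint: ...            # command that launches the training script
  runtimeEnvYAML: |
    env_vars:
      NUM_WORKERS: "2"       # must be <= replicas
      CPUS_PER_WORKER: "2"   # must be <= (worker CPU request) - 1
  rayClusterSpec:
    headGroupSpec:
      template: ...          # head Pod requests 1 CPU
    workerGroupSpecs:
      - replicas: 2          # number of worker Pods, 3 CPUs each
        template: ...        # worker Pod template with the CPU requests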
# `replicas` and `NUM_WORKERS` set to 2.
# Create a RayJob.
kubectl apply -f ray-job.pytorch-mnist.yaml
# Check existing Pods: According to `replicas`, there should be 2 worker Pods.
# Make sure all the Pods are in the `Running` status.
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# kuberay-operator-6dddd689fb-ksmcs 1/1 Running 0 6m8s
# rayjob-pytorch-mnist-raycluster-rkdmq-small-group-worker-c8bwx 1/1 Running 0 5m32s
# rayjob-pytorch-mnist-raycluster-rkdmq-small-group-worker-s7wvm 1/1 Running 0 5m32s
# rayjob-pytorch-mnist-nxmj2 1/1 Running 0 4m17s
# rayjob-pytorch-mnist-raycluster-rkdmq-head-m4dsl 1/1 Running 0 5m32s
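Instead of polling `kubectl get pods` by hand, you can block until the Pods are ready. This sketch waits on all Pods in the namespace, which works here because the submitter Pod is still running at this stage:

# Optionally, block until every Pod in the namespace is ready.
kubectl wait --for=condition=Ready pods --all --timeout=300s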
Check that the RayJob is in the `RUNNING` status:
kubectl get rayjob
# NAME JOB STATUS DEPLOYMENT STATUS START TIME END TIME AGE
# rayjob-pytorch-mnist RUNNING Running 2024-06-17T04:08:25Z 11m
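You can also read the status field directly. The `.status.jobStatus` path below assumes the current RayJob CRD layout:

# Print only the job status of the RayJob.
kubectl get rayjob rayjob-pytorch-mnist -o jsonpath='{.status.jobStatus}'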
Step 4: Wait until the RayJob completes and check the training results#
Wait until the RayJob completes. It might take several minutes.
kubectl get rayjob
# NAME JOB STATUS DEPLOYMENT STATUS START TIME END TIME AGE
# rayjob-pytorch-mnist SUCCEEDED Complete 2024-06-17T04:08:25Z 2024-06-17T04:22:21Z 16m
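Rather than re-running `kubectl get rayjob`, you can block until the job finishes. This sketch uses `kubectl wait` with a JSONPath condition (kubectl v1.23+) and again assumes the `.status.jobStatus` field:

# Block until the RayJob reports SUCCEEDED; adjust the timeout to your hardware.
kubectl wait --for=jsonpath='{.status.jobStatus}'=SUCCEEDED rayjob/rayjob-pytorch-mnist --timeout=20m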
After seeing `JOB_STATUS` marked as `SUCCEEDED`, you can check the training logs:
# Check Pods name.
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# kuberay-operator-6dddd689fb-ksmcs 1/1 Running 0 113m
# rayjob-pytorch-mnist-raycluster-rkdmq-small-group-worker-c8bwx 1/1 Running 0 38m
# rayjob-pytorch-mnist-raycluster-rkdmq-small-group-worker-s7wvm 1/1 Running 0 38m
# rayjob-pytorch-mnist-nxmj2 0/1 Completed 0 38m
# rayjob-pytorch-mnist-raycluster-rkdmq-head-m4dsl 1/1 Running 0 38m
# Check training logs.
kubectl logs -f rayjob-pytorch-mnist-nxmj2
# 2024-06-16 22:23:01,047 INFO cli.py:36 -- Job submission server address: http://rayjob-pytorch-mnist-raycluster-rkdmq-head-svc.default.svc.cluster.local:8265
# 2024-06-16 22:23:01,844 SUCC cli.py:60 -- -------------------------------------------------------
# 2024-06-16 22:23:01,844 SUCC cli.py:61 -- Job 'rayjob-pytorch-mnist-l6ccc' submitted successfully
# 2024-06-16 22:23:01,844 SUCC cli.py:62 -- -------------------------------------------------------
# ...
# (RayTrainWorker pid=1138, ip=10.244.0.18)
# 0%| | 0/26421880 [00:00<?, ?it/s]
# (RayTrainWorker pid=1138, ip=10.244.0.18)
# 0%| | 32768/26421880 [00:00<01:27, 301113.97it/s]
# ...
# Training finished iteration 10 at 2024-06-16 22:33:05. Total running time: 7min 9s
# ╭───────────────────────────────╮
# │ Training result │
# ├───────────────────────────────┤
# │ checkpoint_dir_name │
# │ time_this_iter_s 28.2635 │
# │ time_total_s 423.388 │
# │ training_iteration 10 │
# │ accuracy 0.8748 │
# │ loss 0.35477 │
# ╰───────────────────────────────╯
# Training completed after 10 iterations at 2024-06-16 22:33:06. Total running time: 7min 10s
# Training result: Result(
# metrics={'loss': 0.35476621258825347, 'accuracy': 0.8748},
# path='/home/ray/ray_results/TorchTrainer_2024-06-16_22-25-55/TorchTrainer_122aa_00000_0_2024-06-16_22-25-55',
# filesystem='local',
# checkpoint=None
# )
# ...
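If you want to inspect the training outputs locally, the results directory from the `Result` output above lives on the head Pod. A sketch, assuming the head Pod name from the earlier listing and that `tar` is available in the container, which `kubectl cp` requires:

# Copy the results directory from the head Pod to the local machine.
# Replace the Pod name with your head Pod name from `kubectl get pods`.
kubectl cp rayjob-pytorch-mnist-raycluster-rkdmq-head-m4dsl:/home/ray/ray_results ./ray_results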
Clean up#
Delete your RayJob with the following command:
kubectl delete -f ray-job.pytorch-mnist.yaml
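If you created the Kind cluster in Step 1 only for this example, you can also remove the operator and the cluster:

# Uninstall the KubeRay operator installed with Helm.
helm uninstall kuberay-operator

# Delete the local Kind cluster.
kind delete cluster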