RayJob Quickstart#
Prerequisites#
KubeRay v0.6.0 or higher
KubeRay v0.6.0 or v1.0.0: Ray 1.10 or higher.
KubeRay v1.1.1 or newer is highly recommended: Ray 2.8.0 or higher.
What’s a RayJob?#
A RayJob manages two aspects:
RayCluster: A RayCluster custom resource manages all Pods in a Ray cluster, including a head Pod and multiple worker Pods.
Job: A Kubernetes Job runs ray job submit to submit a Ray job to the RayCluster.
What does the RayJob provide?#
With RayJob, KubeRay automatically creates a RayCluster and submits a job when the cluster is ready. You can also configure RayJob to automatically delete the RayCluster once the Ray job finishes.
To understand the following content better, you should understand the difference between:
RayJob: A Kubernetes custom resource definition provided by KubeRay.
Ray job: A Ray job is a packaged Ray application that can run on a remote Ray cluster. See this document for more details.
Submitter: The submitter is a Kubernetes Job that runs ray job submit to submit a Ray job to the RayCluster.
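To make the distinction between the RayJob custom resource and the Ray job it wraps more concrete, here is a minimal sketch of a RayJob manifest built as a Python dictionary and printed as JSON, which kubectl accepts as well as YAML. The helper build_rayjob, the name rayjob-demo, the container name, and the entrypoint are hypothetical, and the apiVersion assumes a KubeRay v1.x installation. A real manifest usually also defines worker groups and resource requests.

```python
import json


def build_rayjob(name: str, entrypoint: str, image: str) -> dict:
    """Return a minimal RayJob manifest as a plain dict (illustrative only)."""
    head_container = {"name": "ray-head", "image": image}  # hypothetical container name
    return {
        # Assumption: KubeRay v1.x. Older releases served ray.io/v1alpha1.
        "apiVersion": "ray.io/v1",
        "kind": "RayJob",
        "metadata": {"name": name},
        "spec": {
            # The Ray job: the command that the submitter hands to `ray job submit`.
            "entrypoint": entrypoint,
            # The RayCluster that KubeRay creates to run the Ray job on.
            "rayClusterSpec": {
                "headGroupSpec": {
                    "rayStartParams": {"dashboard-host": "0.0.0.0"},
                    "template": {"spec": {"containers": [head_container]}},
                },
            },
        },
    }


manifest = build_rayjob(
    name="rayjob-demo",
    entrypoint='python -c "import ray; ray.init(); print(ray.cluster_resources())"',
    image="rayproject/ray:2.9.0",
)
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and running kubectl apply -f on it would create the RayJob, provided the KubeRay operator is already installed.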
RayJob Configuration#
RayCluster configuration
rayClusterSpec - Defines the RayCluster custom resource to run the Ray job on.
clusterSelector - Use existing RayCluster custom resources to run the Ray job instead of creating a new one. See ray-job.use-existing-raycluster.yaml for example configurations.
Ray job configuration
entrypoint - The submitter runs ray job submit --address ... --submission-id ... -- $entrypoint to submit a Ray job to the RayCluster.
runtimeEnvYAML (Optional): A runtime environment that describes the dependencies the Ray job needs to run, including files, packages, environment variables, and more. Provide the configuration as a multi-line YAML string. See Runtime Environments for more details. (New in KubeRay version 1.0.0) Example:
    spec:
      runtimeEnvYAML: |
        pip:
          - requests==2.26.0
          - pendulum==2.1.2
        env_vars:
          KEY: "VALUE"
jobId (Optional): Defines the submission ID for the Ray job. If not provided, KubeRay generates one automatically. See Ray Jobs CLI API Reference for more details about the submission ID.
metadata (Optional): See Ray Jobs CLI API Reference for more details about the --metadata-json option.
entrypointNumCpus / entrypointNumGpus / entrypointResources (Optional): See Ray Jobs CLI API Reference for more details.
backoffLimit (Optional, added in version 1.2.0): Specifies the number of retries before marking this RayJob failed. Each retry creates a new RayCluster. The default value is 0.
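The way the entrypoint and jobId fields end up in the submitter's command line can be sketched as follows. This is an illustration of the command shape described above, not KubeRay's actual implementation: the function submit_command and its parameters are hypothetical, and mapping entrypointNumCpus to the Ray CLI's --entrypoint-num-cpus flag is an assumption based on the flag name.

```python
import shlex
from typing import List, Optional


def submit_command(
    address: str,
    submission_id: str,
    entrypoint: str,
    entrypoint_num_cpus: Optional[float] = None,
) -> List[str]:
    """Assemble a `ray job submit` argv in the shape the text describes."""
    cmd = [
        "ray", "job", "submit",
        "--address", address,
        "--submission-id", submission_id,
    ]
    if entrypoint_num_cpus is not None:
        # Assumption: entrypointNumCpus corresponds to this Ray CLI flag.
        cmd += ["--entrypoint-num-cpus", str(entrypoint_num_cpus)]
    # Everything after `--` is the entrypoint that Ray runs on the cluster.
    cmd += ["--", *shlex.split(entrypoint)]
    return cmd


example = submit_command(
    address="http://rayjob-sample-raycluster-head-svc:8265",  # hypothetical address
    submission_id="rayjob-sample-5ntcr",
    entrypoint="python /home/ray/samples/sample_code.py",
)
print(shlex.join(example))
```

Keeping the entrypoint after the -- separator means its own flags are never mistaken for ray job submit options.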
Submission configuration
submissionMode (Optional): Specifies how RayJob submits the Ray job to the RayCluster. In “K8sJobMode”, the KubeRay operator creates a submitter Kubernetes Job to submit the Ray job. In “HTTPMode”, the KubeRay operator sends a request to the RayCluster to create a Ray job. The default value is “K8sJobMode”.
submitterPodTemplate (Optional): Defines the Pod template for the submitter Kubernetes Job. This field is only effective when submissionMode is “K8sJobMode”.
  RAY_DASHBOARD_ADDRESS - The KubeRay operator injects this environment variable into the submitter Pod. The value is $HEAD_SERVICE:$DASHBOARD_PORT.
  RAY_JOB_SUBMISSION_ID - The KubeRay operator injects this environment variable into the submitter Pod. The value is the RayJob.Status.JobId of the RayJob.
  Example:
    ray job submit --address=http://$RAY_DASHBOARD_ADDRESS --submission-id=$RAY_JOB_SUBMISSION_ID ...
  See ray-job.sample.yaml for more details.
submitterConfig (Optional): Additional configurations for the submitter Kubernetes Job.
  backoffLimit (Optional, added in version 1.2.0): The number of retries before marking the submitter Job as failed. The default value is 2.
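As a rough sketch of how the submission fields fit together, the snippet below builds the relevant part of a RayJob spec with a custom submitter Pod whose command relies on the two injected environment variables. The helper submission_spec, the container name, and the echo entrypoint are hypothetical. Overriding the command is shown only to illustrate the variables, since the operator supplies a command from entrypoint when none is given.

```python
import json


def submission_spec(image: str, entrypoint: str) -> dict:
    """Return the submission-related fields of a RayJob spec (illustrative only)."""
    # The shell expands both variables inside the submitter Pod at run time.
    command = (
        "ray job submit --address=http://$RAY_DASHBOARD_ADDRESS "
        "--submission-id=$RAY_JOB_SUBMISSION_ID -- " + entrypoint
    )
    return {
        "submissionMode": "K8sJobMode",          # default, stated explicitly
        "submitterConfig": {"backoffLimit": 2},  # default, stated explicitly
        "submitterPodTemplate": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [
                    {
                        "name": "rayjob-submitter",  # hypothetical container name
                        "image": image,
                        "command": ["sh", "-c", command],
                    }
                ],
            }
        },
    }


fragment = submission_spec("rayproject/ray:2.9.0", "echo hello world")
print(json.dumps(fragment, indent=2))
```

These keys would sit next to entrypoint and rayClusterSpec under spec in the RayJob manifest.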
Automatic resource cleanup
shutdownAfterJobFinishes (Optional): Determines whether to delete the RayCluster after the Ray job finishes. The default value is false.
ttlSecondsAfterFinished (Optional): Only works if shutdownAfterJobFinishes is true. The KubeRay operator deletes the RayCluster and the submitter ttlSecondsAfterFinished seconds after the Ray job finishes. The default value is 0.
activeDeadlineSeconds (Optional): If the RayJob doesn’t transition the JobDeploymentStatus to Complete or Failed within activeDeadlineSeconds, the KubeRay operator transitions the JobDeploymentStatus to Failed, citing DeadlineExceeded as the reason.
DELETE_RAYJOB_CR_AFTER_JOB_FINISHES (Optional, added in version 1.2.0): Set this environment variable for the KubeRay operator, not the RayJob resource. If you set this environment variable to true and also set shutdownAfterJobFinishes to true, KubeRay deletes the RayJob custom resource itself. Note that KubeRay deletes all resources created by the RayJob, including the Kubernetes Job.
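The interaction between these cleanup settings can be summarized with a small decision function. This is a mental model derived only from the descriptions above, not operator code: CleanupPlan and plan_cleanup are hypothetical names, and the model makes no claim about exactly when the RayJob custom resource is removed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CleanupPlan:
    delete_raycluster: bool
    delay_seconds: Optional[int]  # None when nothing is scheduled for deletion
    delete_rayjob_cr: bool


def plan_cleanup(
    shutdown_after_job_finishes: bool = False,
    ttl_seconds_after_finished: int = 0,
    delete_rayjob_cr_env: bool = False,
) -> CleanupPlan:
    """Model which resources get cleaned up once the Ray job finishes."""
    if not shutdown_after_job_finishes:
        # Neither the TTL nor the operator environment variable has any effect.
        return CleanupPlan(False, None, False)
    return CleanupPlan(
        delete_raycluster=True,
        delay_seconds=ttl_seconds_after_finished,
        # The RayJob itself goes away only if the operator variable is also true.
        delete_rayjob_cr=delete_rayjob_cr_env,
    )


print(plan_cleanup())
print(plan_cleanup(shutdown_after_job_finishes=True, ttl_seconds_after_finished=10))
```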
Example: Run a simple Ray job with RayJob#
Step 1: Create a Kubernetes cluster with Kind#
kind create cluster --image=kindest/node:v1.26.0
Step 2: Install the KubeRay operator#
Follow the RayCluster Quickstart to install the latest stable KubeRay operator from the Helm repository.
Step 3: Install a RayJob#
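# Run only one of the following commands, depending on your machine's architecture.
# ARM64 hosts: rewrite the image tag to the aarch64 build before applying.
curl -s https://raw.githubusercontent.com/ray-project/kuberay/v1.2.2/ray-operator/config/samples/ray-job.sample.yaml | sed 's/2.9.0/2.9.0-aarch64/g' | kubectl apply -f -
# x86-64 hosts: apply the sample as-is.
kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/v1.2.2/ray-operator/config/samples/ray-job.sample.yaml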
Step 4: Verify the Kubernetes cluster status#
# Step 4.1: List all RayJob custom resources in the `default` namespace.
kubectl get rayjob
# [Example output]
# NAME            JOB STATUS   DEPLOYMENT STATUS   START TIME             END TIME   AGE
# rayjob-sample                Running             2024-03-02T19:09:15Z              96s
# Step 4.2: List all RayCluster custom resources in the `default` namespace.
kubectl get raycluster
# [Example output]
# NAME                             DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
# rayjob-sample-raycluster-tlsxc   1                 1                   400m   0        0      ready    91m
# Step 4.3: List all Pods in the `default` namespace.
# The Pod created by the Kubernetes Job will be terminated after the Kubernetes Job finishes.
kubectl get pods
# [Example output]
# kuberay-operator-7456c6b69b-rzv25 1/1 Running 0 3m57s
# rayjob-sample-lk9jx 0/1 Completed 0 2m49s => Pod created by a Kubernetes Job
# rayjob-sample-raycluster-9c546-head-gdxkg 1/1 Running 0 3m46s
# rayjob-sample-raycluster-9c546-worker-small-group-nfbxm 1/1 Running 0 3m46s
# Step 4.4: Check the status of the RayJob.
# The field `jobStatus` in the RayJob custom resource will be updated to `SUCCEEDED` and `jobDeploymentStatus`
# should be `Complete` once the job finishes.
kubectl get rayjobs.ray.io rayjob-sample -o jsonpath='{.status.jobStatus}'
# [Expected output]: "SUCCEEDED"
kubectl get rayjobs.ray.io rayjob-sample -o jsonpath='{.status.jobDeploymentStatus}'
# [Expected output]: "Complete"
The KubeRay operator creates a RayCluster custom resource based on the rayClusterSpec and a submitter Kubernetes Job to submit a Ray job to the RayCluster.
In this example, the entrypoint is python /home/ray/samples/sample_code.py, and sample_code.py is a Python script stored in a Kubernetes ConfigMap mounted to the head Pod of the RayCluster.
Because the default value of shutdownAfterJobFinishes is false, the KubeRay operator doesn’t delete the RayCluster or the submitter when the Ray job finishes.
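Rather than rerunning the Step 4.4 query by hand, the check can be wrapped in a small polling loop. The sketch below shells out to the same kubectl command and returns once jobDeploymentStatus reaches Complete or Failed. The function names, timeout, and polling interval are hypothetical choices, and the fetch parameter exists only so the loop can be exercised without a cluster.

```python
import subprocess
import time
from typing import Callable


def kubectl_jsonpath(name: str, path: str) -> str:
    """Run the same kubectl query as Step 4.4 and return its raw output."""
    result = subprocess.run(
        ["kubectl", "get", "rayjobs.ray.io", name, "-o", f"jsonpath={path}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def wait_for_rayjob(
    name: str,
    fetch: Callable[[str, str], str] = kubectl_jsonpath,
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
) -> str:
    """Poll jobDeploymentStatus until it is Complete or Failed, then return it."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch(name, "{.status.jobDeploymentStatus}")
        if status in ("Complete", "Failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"RayJob {name!r} still in state {status!r}")
        time.sleep(interval_s)


# Example call against a live cluster:
#   wait_for_rayjob("rayjob-sample")
```

Polling jobDeploymentStatus instead of jobStatus lets the loop also stop when the deployment fails before the Ray job reports a status of its own.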
Step 5: Check the output of the Ray job#
kubectl logs -l=job-name=rayjob-sample
# [Example output]
# 2023-08-21 17:08:22,530 INFO cli.py:27 -- Job submission server address: http://rayjob-sample-raycluster-9c546-head-svc.default.svc.cluster.local:8265
# 2023-08-21 17:08:23,726 SUCC cli.py:33 -- ------------------------------------------------
# 2023-08-21 17:08:23,727 SUCC cli.py:34 -- Job 'rayjob-sample-5ntcr' submitted successfully
# 2023-08-21 17:08:23,727 SUCC cli.py:35 -- ------------------------------------------------
# 2023-08-21 17:08:23,727 INFO cli.py:226 -- Next steps
# 2023-08-21 17:08:23,727 INFO cli.py:227 -- Query the logs of the job:
# 2023-08-21 17:08:23,727 INFO cli.py:229 -- ray job logs rayjob-sample-5ntcr
# 2023-08-21 17:08:23,727 INFO cli.py:231 -- Query the status of the job:
# 2023-08-21 17:08:23,727 INFO cli.py:233 -- ray job status rayjob-sample-5ntcr
# 2023-08-21 17:08:23,727 INFO cli.py:235 -- Request the job to be stopped:
# 2023-08-21 17:08:23,728 INFO cli.py:237 -- ray job stop rayjob-sample-5ntcr
# 2023-08-21 17:08:23,739 INFO cli.py:245 -- Tailing logs until the job exits (disable with --no-wait):
# 2023-08-21 17:08:34,288 INFO worker.py:1335 -- Using address 10.244.0.6:6379 set in the environment variable RAY_ADDRESS
# 2023-08-21 17:08:34,288 INFO worker.py:1452 -- Connecting to existing Ray cluster at address: 10.244.0.6:6379...
# 2023-08-21 17:08:34,302 INFO worker.py:1633 -- Connected to Ray cluster. View the dashboard at http://10.244.0.6:8265
# test_counter got 1
# test_counter got 2
# test_counter got 3
# test_counter got 4
# test_counter got 5
# 2023-08-21 17:08:46,040 SUCC cli.py:33 -- -----------------------------------
# 2023-08-21 17:08:46,040 SUCC cli.py:34 -- Job 'rayjob-sample-5ntcr' succeeded
# 2023-08-21 17:08:46,040 SUCC cli.py:35 -- -----------------------------------
The Python script sample_code.py used by entrypoint is a simple Ray script that executes a counter’s increment function 5 times.
Step 6: Delete the RayJob#
kubectl delete -f https://raw.githubusercontent.com/ray-project/kuberay/v1.2.2/ray-operator/config/samples/ray-job.sample.yaml
Step 7: Create a RayJob with shutdownAfterJobFinishes set to true#
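# Run only one of the following commands, depending on your machine's architecture.
# ARM64 hosts: rewrite the image tag to the aarch64 build before applying.
curl -s https://raw.githubusercontent.com/ray-project/kuberay/v1.2.2/ray-operator/config/samples/ray-job.shutdown.yaml | sed 's/2.9.0/2.9.0-aarch64/g' | kubectl apply -f -
# x86-64 hosts: apply the sample as-is.
kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/v1.2.2/ray-operator/config/samples/ray-job.shutdown.yaml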
The ray-job.shutdown.yaml defines a RayJob custom resource with shutdownAfterJobFinishes: true and ttlSecondsAfterFinished: 10.
Hence, the KubeRay operator deletes the RayCluster 10 seconds after the Ray job finishes. Note that the submitter Job is not deleted because it contains the Ray job logs and does not use any cluster resources once completed. In addition, the submitter Job is always cleaned up when the RayJob is eventually deleted, because of its owner reference back to the RayJob.
Step 8: Check the RayJob status#
# Wait until `jobStatus` is `SUCCEEDED` and `jobDeploymentStatus` is `Complete`.
kubectl get rayjobs.ray.io rayjob-sample-shutdown -o jsonpath='{.status.jobDeploymentStatus}'
kubectl get rayjobs.ray.io rayjob-sample-shutdown -o jsonpath='{.status.jobStatus}'
Step 9: Check if the KubeRay operator deletes the RayCluster#
# List the RayCluster custom resources in the `default` namespace. The RayCluster
# associated with the RayJob `rayjob-sample-shutdown` should be deleted.
kubectl get raycluster
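The same check can be scripted by parsing the output of kubectl get raycluster -o json. The sketch below lists any RayCluster that still appears to belong to a given RayJob. It assumes the naming pattern visible in the earlier example output (the RayJob name followed by -raycluster- and a random suffix), which is an inference from that output rather than a documented guarantee. The function name and sample data are hypothetical.

```python
import json
from typing import List


def remaining_clusters(kubectl_json: str, rayjob_name: str) -> List[str]:
    """Return names of RayClusters that look like they belong to the RayJob."""
    items = json.loads(kubectl_json).get("items", [])
    # Assumption: clusters are named "<rayjob name>-raycluster-<suffix>".
    prefix = f"{rayjob_name}-raycluster-"
    return [
        item["metadata"]["name"]
        for item in items
        if item["metadata"]["name"].startswith(prefix)
    ]


# Hand-written stand-in for `kubectl get raycluster -o json` output.
sample_output = json.dumps(
    {"items": [{"metadata": {"name": "rayjob-sample-raycluster-tlsxc"}}]}
)
print(remaining_clusters(sample_output, "rayjob-sample-shutdown"))
```

An empty list for rayjob-sample-shutdown indicates that the operator has already removed its RayCluster.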
Step 10: Clean up#
# Step 10.1: Delete the RayJob
kubectl delete -f https://raw.githubusercontent.com/ray-project/kuberay/v1.2.2/ray-operator/config/samples/ray-job.shutdown.yaml
# Step 10.2: Delete the KubeRay operator
helm uninstall kuberay-operator
# Step 10.3: Delete the Kubernetes cluster
kind delete cluster