RayJob Quickstart#
Prerequisites: KubeRay v0.6.0 or higher, paired with a compatible Ray version:
- KubeRay v0.6.0 or v1.0.0: Ray 1.10 or higher.
- KubeRay v1.1.1 or newer (highly recommended): Ray 2.8.0 or higher.
What’s a RayJob?#
A RayJob manages two aspects:
RayCluster: A RayCluster custom resource manages all Pods in a Ray cluster, including a head Pod and multiple worker Pods.
Job: A Kubernetes Job runs ray job submit to submit a Ray job to the RayCluster.
What does the RayJob provide?#
With RayJob, KubeRay automatically creates a RayCluster and submits a job when the cluster is ready. You can also configure RayJob to automatically delete the RayCluster once the Ray job finishes.
To understand the following content better, you should understand the difference between:
RayJob: A Kubernetes custom resource definition provided by KubeRay.
Ray job: A Ray job is a packaged Ray application that can run on a remote Ray cluster. See this document for more details.
Submitter: The submitter is a Kubernetes Job that runs ray job submit to submit a Ray job to the RayCluster.
RayJob Configuration#
RayCluster configuration
- rayClusterSpec: Defines the RayCluster custom resource to run the Ray job on.
- clusterSelector: Use existing RayCluster custom resources to run the Ray job instead of creating a new one. See ray-job.use-existing-raycluster.yaml for example configurations.
Ray job configuration
- entrypoint: The submitter runs ray job submit --address ... --submission-id ... -- $entrypoint to submit a Ray job to the RayCluster.
- runtimeEnvYAML (Optional): A runtime environment that describes the dependencies the Ray job needs to run, including files, packages, environment variables, and more. Provide the configuration as a multi-line YAML string. See Runtime Environments for more details. (New in KubeRay version 1.0.0) Example:
    spec:
      runtimeEnvYAML: |
        pip:
          - requests==2.26.0
          - pendulum==2.1.2
        env_vars:
          KEY: "VALUE"
- jobId (Optional): Defines the submission ID for the Ray job. If not provided, KubeRay generates one automatically. See Ray Jobs CLI API Reference for more details about the submission ID.
- metadata (Optional): See Ray Jobs CLI API Reference for more details about the --metadata-json option.
- entrypointNumCpus / entrypointNumGpus / entrypointResources (Optional): See Ray Jobs CLI API Reference for more details.
- backoffLimit (Optional, added in version 1.2.0): Specifies the number of retries before marking this RayJob failed. Each retry creates a new RayCluster. The default value is 0.
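As a rough illustration of how the Ray job fields fit together, the following Python sketch assembles a RayJob manifest as a plain dictionary and prints it as JSON, a format that kubectl apply -f also accepts. The helper name build_rayjob, the apiVersion ray.io/v1, the resource name rayjob-example, and the empty rayClusterSpec placeholder are assumptions made for illustration. A manifest you can actually apply needs a real cluster spec, such as the one in ray-job.sample.yaml.

```python
import json


def build_rayjob(name, entrypoint, ray_cluster_spec,
                 pip_packages=(), env_vars=None, backoff_limit=0):
    """Assemble a RayJob manifest as a dict (hypothetical helper, not part of KubeRay)."""
    # runtimeEnvYAML is a multi-line YAML *string*, so render it line by line.
    lines = []
    if pip_packages:
        lines.append("pip:")
        lines += [f"  - {pkg}" for pkg in pip_packages]
    if env_vars:
        lines.append("env_vars:")
        lines += [f'  {key}: "{value}"' for key, value in env_vars.items()]

    spec = {
        "entrypoint": entrypoint,
        # backoffLimit exists only in KubeRay 1.2.0 or newer.
        "backoffLimit": backoff_limit,
        "rayClusterSpec": ray_cluster_spec,
    }
    if lines:
        spec["runtimeEnvYAML"] = "\n".join(lines) + "\n"

    return {
        "apiVersion": "ray.io/v1",  # assumed API version
        "kind": "RayJob",
        "metadata": {"name": name},
        "spec": spec,
    }


manifest = build_rayjob(
    name="rayjob-example",  # hypothetical resource name
    entrypoint="python /home/ray/samples/sample_code.py",
    ray_cluster_spec={},  # placeholder: fill in head and worker group specs
    pip_packages=["requests==2.26.0", "pendulum==2.1.2"],
    env_vars={"KEY": "VALUE"},
)
print(json.dumps(manifest, indent=2))
```

Keeping the runtime environment as a string inside the manifest mirrors the multi-line YAML string that runtimeEnvYAML expects.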
Submission configuration
- submissionMode: Specifies how RayJob submits the Ray job to the RayCluster. In “K8sJobMode”, the KubeRay operator creates a submitter Kubernetes Job to submit the Ray job. In “HTTPMode”, the KubeRay operator sends a request to the RayCluster to create a Ray job. The default value is “K8sJobMode”.
- submitterPodTemplate (Optional): Defines the Pod template for the submitter Kubernetes Job. This field is only effective when submissionMode is “K8sJobMode”.
  - RAY_DASHBOARD_ADDRESS: The KubeRay operator injects this environment variable into the submitter Pod. The value is $HEAD_SERVICE:$DASHBOARD_PORT.
  - RAY_JOB_SUBMISSION_ID: The KubeRay operator injects this environment variable into the submitter Pod. The value is the RayJob.Status.JobId of the RayJob.
  - Example: ray job submit --address=http://$RAY_DASHBOARD_ADDRESS --submission-id=$RAY_JOB_SUBMISSION_ID ...
  - See ray-job.sample.yaml for more details.
- submitterConfig (Optional): Additional configurations for the submitter Kubernetes Job.
  - backoffLimit (Optional, added in version 1.2.0): The number of retries before marking the submitter Job as failed. The default value is 2.
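To make the role of the two injected environment variables concrete, here is a minimal Python sketch of how a custom submitter container could assemble the ray job submit command from RAY_DASHBOARD_ADDRESS and RAY_JOB_SUBMISSION_ID. The function name build_submit_command is hypothetical, and the code isn't taken from KubeRay. It only mirrors the command shape shown in the example above.

```python
import os
import shlex


def build_submit_command(entrypoint, env=None):
    """Build the `ray job submit` argument list (hypothetical helper).

    Reads the two variables that the KubeRay operator injects into the
    submitter Pod. Pass `env` explicitly to try it outside a Pod.
    """
    env = os.environ if env is None else env
    address = env["RAY_DASHBOARD_ADDRESS"]        # $HEAD_SERVICE:$DASHBOARD_PORT
    submission_id = env["RAY_JOB_SUBMISSION_ID"]  # RayJob.Status.JobId
    return [
        "ray", "job", "submit",
        f"--address=http://{address}",
        f"--submission-id={submission_id}",
        "--",
        *shlex.split(entrypoint),
    ]


# Example values copied from the sample output later in this guide.
example_env = {
    "RAY_DASHBOARD_ADDRESS": "rayjob-sample-raycluster-9c546-head-svc.default.svc.cluster.local:8265",
    "RAY_JOB_SUBMISSION_ID": "rayjob-sample-5ntcr",
}
command = build_submit_command("python /home/ray/samples/sample_code.py", example_env)
print(shlex.join(command))
```

Returning an argument list rather than a single string avoids shell-quoting problems if the entrypoint contains spaces or quotes.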
Automatic resource cleanup
- shutdownAfterJobFinishes (Optional): Determines whether to recycle the RayCluster after the Ray job finishes. The default value is false.
- ttlSecondsAfterFinished (Optional): Only works if shutdownAfterJobFinishes is true. The KubeRay operator deletes the RayCluster and the submitter ttlSecondsAfterFinished seconds after the Ray job finishes. The default value is 0.
- activeDeadlineSeconds (Optional): If the RayJob doesn’t transition the JobDeploymentStatus to Complete or Failed within activeDeadlineSeconds, the KubeRay operator transitions the JobDeploymentStatus to Failed, citing DeadlineExceeded as the reason.
- DELETE_RAYJOB_CR_AFTER_JOB_FINISHES (Optional, added in version 1.2.0): Set this environment variable for the KubeRay operator, not the RayJob resource. If you set this environment variable to true, the RayJob custom resource itself is deleted if you also set shutdownAfterJobFinishes to true. Note that KubeRay deletes all resources created by the RayJob, including the Kubernetes Job.
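The timing rules above can be sketched as two small Python functions. This is a simplified model for reasoning about the settings, not the operator's implementation. The function names are hypothetical, and measuring the deadline from the RayJob's start time is an assumption.

```python
from datetime import datetime, timedelta, timezone


def cluster_deletion_time(job_finished_at,
                          shutdown_after_job_finishes=False,
                          ttl_seconds_after_finished=0):
    """Return when the RayCluster would be deleted, or None if it is kept."""
    if not shutdown_after_job_finishes:
        # ttlSecondsAfterFinished only works if shutdownAfterJobFinishes is true.
        return None
    return job_finished_at + timedelta(seconds=ttl_seconds_after_finished)


def deadline_exceeded(started_at, now, active_deadline_seconds=None):
    """Return True once the RayJob has run longer than activeDeadlineSeconds.

    Assumption: the deadline is counted from the RayJob's start time.
    """
    if active_deadline_seconds is None:
        return False
    return now - started_at > timedelta(seconds=active_deadline_seconds)


finished = datetime(2024, 3, 2, 19, 10, 51, tzinfo=timezone.utc)
print(cluster_deletion_time(finished))            # None: defaults keep the cluster
print(cluster_deletion_time(finished, True, 10))  # 2024-03-02 19:11:01+00:00
```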
Others
- suspend (Optional): If suspend is true, KubeRay deletes both the RayCluster and the submitter. Note that Kueue also implements scheduling strategies by mutating this field. Avoid manually updating this field if you use Kueue to schedule RayJob.
- deletionPolicy (Optional, alpha in v1.3.0): Indicates what resources of the RayJob are deleted upon job completion. Valid values are DeleteCluster, DeleteWorkers, DeleteSelf, or DeleteNone. If unset, deletion policy is based on spec.shutdownAfterJobFinishes. This field requires the RayJobDeletionPolicy feature gate to be enabled.
  - DeleteCluster: Deletion policy to delete the RayCluster custom resource, and its Pods, on job completion.
  - DeleteWorkers: Deletion policy to delete only the worker Pods on job completion.
  - DeleteSelf: Deletion policy to delete the RayJob custom resource (and all associated resources) on job completion.
  - DeleteNone: Deletion policy to delete no resources on job completion.
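The four deletion policies can be summarized as a lookup table. The Python sketch below is only a reading aid. The resource labels and the function name resources_deleted are illustrative, and the fallback for an unset policy (treating shutdownAfterJobFinishes: true like DeleteCluster) is a simplifying assumption rather than documented operator behavior.

```python
# Resources removed on job completion for each deletionPolicy value.
DELETED_RESOURCES = {
    "DeleteCluster": ("RayCluster", "head Pod", "worker Pods"),
    "DeleteWorkers": ("worker Pods",),
    "DeleteSelf": ("RayJob", "RayCluster", "head Pod", "worker Pods", "submitter Job"),
    "DeleteNone": (),
}


def resources_deleted(deletion_policy=None, shutdown_after_job_finishes=False):
    """Return the resources deleted on job completion (illustrative model)."""
    if deletion_policy is None:
        # Assumption: with no policy set, fall back to shutdownAfterJobFinishes.
        deletion_policy = "DeleteCluster" if shutdown_after_job_finishes else "DeleteNone"
    if deletion_policy not in DELETED_RESOURCES:
        raise ValueError(f"unknown deletionPolicy: {deletion_policy!r}")
    return DELETED_RESOURCES[deletion_policy]


for policy in DELETED_RESOURCES:
    print(policy, "->", ", ".join(resources_deleted(policy)) or "nothing")
```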
Example: Run a simple Ray job with RayJob#
Step 1: Create a Kubernetes cluster with Kind#
kind create cluster --image=kindest/node:v1.26.0
Step 2: Install the KubeRay operator#
Follow the RayCluster Quickstart to install the latest stable KubeRay operator from the Helm repository.
Step 3: Install a RayJob#
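# Run only ONE of the following commands, depending on your CPU architecture.
# ARM64 (for example, Apple Silicon): switch the Ray image tag to the aarch64 build before applying.
curl -s https://raw.githubusercontent.com/ray-project/kuberay/v1.3.0/ray-operator/config/samples/ray-job.sample.yaml | sed 's/2.41.0/2.41.0-aarch64/g' | kubectl apply -f -
# x86-64: apply the sample manifest as-is.
kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/v1.3.0/ray-operator/config/samples/ray-job.sample.yaml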
Step 4: Verify the Kubernetes cluster status#
# Step 4.1: List all RayJob custom resources in the `default` namespace.
kubectl get rayjob
# [Example output]
# rayjob-sample Running 2024-03-02T19:09:15Z 96s
# Step 4.2: List all RayCluster custom resources in the `default` namespace.
kubectl get raycluster
# [Example output]
# rayjob-sample-raycluster-tlsxc 1 1 400m 0 0 ready 91m
# Step 4.3: List all Pods in the `default` namespace.
# The Pod created by the Kubernetes Job will be terminated after the Kubernetes Job finishes.
kubectl get pods
# [Example output]
# kuberay-operator-7456c6b69b-rzv25 1/1 Running 0 3m57s
# rayjob-sample-lk9jx 0/1 Completed 0 2m49s => Pod created by a Kubernetes Job
# rayjob-sample-raycluster-9c546-head-gdxkg 1/1 Running 0 3m46s
# rayjob-sample-raycluster-9c546-worker-small-group-nfbxm 1/1 Running 0 3m46s
# Step 4.4: Check the status of the RayJob.
# The field `jobStatus` in the RayJob custom resource will be updated to `SUCCEEDED` and `jobDeploymentStatus`
# should be `Complete` once the job finishes.
kubectl get rayjobs.ray.io rayjob-sample -o jsonpath='{.status.jobStatus}'
# [Expected output]: "SUCCEEDED"
kubectl get rayjobs.ray.io rayjob-sample -o jsonpath='{.status.jobDeploymentStatus}'
# [Expected output]: "Complete"
The KubeRay operator creates a RayCluster custom resource based on the rayClusterSpec and a submitter Kubernetes Job to submit a Ray job to the RayCluster.
In this example, the entrypoint is python /home/ray/samples/sample_code.py, and sample_code.py is a Python script stored in a Kubernetes ConfigMap mounted to the head Pod of the RayCluster.
Because the default value of shutdownAfterJobFinishes is false, the KubeRay operator doesn’t delete the RayCluster or the submitter when the Ray job finishes.
Step 5: Check the output of the Ray job#
kubectl logs -l=job-name=rayjob-sample
# [Example output]
# 2023-08-21 17:08:22,530 INFO cli.py:27 -- Job submission server address: http://rayjob-sample-raycluster-9c546-head-svc.default.svc.cluster.local:8265
# 2023-08-21 17:08:23,726 SUCC cli.py:33 -- ------------------------------------------------
# 2023-08-21 17:08:23,727 SUCC cli.py:34 -- Job 'rayjob-sample-5ntcr' submitted successfully
# 2023-08-21 17:08:23,727 SUCC cli.py:35 -- ------------------------------------------------
# 2023-08-21 17:08:23,727 INFO cli.py:226 -- Next steps
# 2023-08-21 17:08:23,727 INFO cli.py:227 -- Query the logs of the job:
# 2023-08-21 17:08:23,727 INFO cli.py:229 -- ray job logs rayjob-sample-5ntcr
# 2023-08-21 17:08:23,727 INFO cli.py:231 -- Query the status of the job:
# 2023-08-21 17:08:23,727 INFO cli.py:233 -- ray job status rayjob-sample-5ntcr
# 2023-08-21 17:08:23,727 INFO cli.py:235 -- Request the job to be stopped:
# 2023-08-21 17:08:23,728 INFO cli.py:237 -- ray job stop rayjob-sample-5ntcr
# 2023-08-21 17:08:23,739 INFO cli.py:245 -- Tailing logs until the job exits (disable with --no-wait):
# 2023-08-21 17:08:34,288 INFO worker.py:1335 -- Using address set in the environment variable RAY_ADDRESS
# 2023-08-21 17:08:34,288 INFO worker.py:1452 -- Connecting to existing Ray cluster at address:
# 2023-08-21 17:08:34,302 INFO worker.py:1633 -- Connected to Ray cluster. View the dashboard at
# test_counter got 1
# test_counter got 2
# test_counter got 3
# test_counter got 4
# test_counter got 5
# 2023-08-21 17:08:46,040 SUCC cli.py:33 -- -----------------------------------
# 2023-08-21 17:08:46,040 SUCC cli.py:34 -- Job 'rayjob-sample-5ntcr' succeeded
# 2023-08-21 17:08:46,040 SUCC cli.py:35 -- -----------------------------------
The Python script sample_code.py used by entrypoint is a simple Ray script that executes a counter’s increment function 5 times.
Step 6: Delete the RayJob#
kubectl delete -f https://raw.githubusercontent.com/ray-project/kuberay/v1.3.0/ray-operator/config/samples/ray-job.sample.yaml
Step 7: Create a RayJob with shutdownAfterJobFinishes set to true#
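# Run only ONE of the following commands, depending on your CPU architecture.
# ARM64 (for example, Apple Silicon): switch the Ray image tag to the aarch64 build before applying.
curl -s https://raw.githubusercontent.com/ray-project/kuberay/v1.3.0/ray-operator/config/samples/ray-job.shutdown.yaml | sed 's/2.41.0/2.41.0-aarch64/g' | kubectl apply -f -
# x86-64: apply the sample manifest as-is.
kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/v1.3.0/ray-operator/config/samples/ray-job.shutdown.yaml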
The ray-job.shutdown.yaml defines a RayJob custom resource with shutdownAfterJobFinishes: true and ttlSecondsAfterFinished: 10.
Hence, the KubeRay operator deletes the RayCluster 10 seconds after the Ray job finishes. Note that the submitter Job isn’t deleted because it contains the Ray job logs and doesn’t use any cluster resources once completed. In addition, the submitter Job has an owner reference back to the RayJob, so it is cleaned up when the RayJob itself is eventually deleted.
Step 8: Check the RayJob status#
# Wait until `jobStatus` is `SUCCEEDED` and `jobDeploymentStatus` is `Complete`.
kubectl get rayjobs.ray.io rayjob-sample-shutdown -o jsonpath='{.status.jobDeploymentStatus}'
kubectl get rayjobs.ray.io rayjob-sample-shutdown -o jsonpath='{.status.jobStatus}'
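Instead of re-running these two commands by hand, the wait can be scripted. The following Python sketch polls the same kubectl queries until jobDeploymentStatus reaches Complete or Failed. The function names, the 600-second timeout, and the 5-second poll interval are arbitrary choices for illustration, and the script assumes kubectl is on the PATH and pointed at the right cluster.

```python
import subprocess
import time


def get_status_field(name, field, namespace="default"):
    """Read one field of a RayJob's status by shelling out to kubectl."""
    result = subprocess.run(
        ["kubectl", "get", "rayjobs.ray.io", name, "-n", namespace,
         "-o", f"jsonpath={{.status.{field}}}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def wait_for_completion(name, fetch=get_status_field,
                        timeout_s=600, poll_s=5, sleep=time.sleep):
    """Poll until jobDeploymentStatus is Complete or Failed.

    Returns (jobDeploymentStatus, jobStatus). `fetch` and `sleep` are
    injectable so the loop can be exercised without a cluster.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        deployment_status = fetch(name, "jobDeploymentStatus")
        if deployment_status in ("Complete", "Failed"):
            return deployment_status, fetch(name, "jobStatus")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"RayJob {name} did not finish within {timeout_s}s")
        sleep(poll_s)


if __name__ == "__main__":
    print(wait_for_completion("rayjob-sample-shutdown"))
```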
Step 9: Check if the KubeRay operator deletes the RayCluster#
# List the RayCluster custom resources in the `default` namespace. The RayCluster
# associated with the RayJob `rayjob-sample-shutdown` should be deleted.
kubectl get raycluster
Step 10: Clean up#
# Step 10.1: Delete the RayJob
kubectl delete -f https://raw.githubusercontent.com/ray-project/kuberay/v1.3.0/ray-operator/config/samples/ray-job.shutdown.yaml
# Step 10.2: Delete the KubeRay operator
helm uninstall kuberay-operator
# Step 10.3: Delete the Kubernetes cluster
kind delete cluster