Deploy on Kubernetes#

This section should help you:

  • understand how to install and use the KubeRay operator.

  • understand how to deploy a Ray Serve application using a RayService.

  • understand how to monitor and update your application.

The recommended way to deploy Ray Serve is on Kubernetes, providing the best of both worlds: the user experience and scalable compute of Ray Serve and the operational benefits of Kubernetes. This also lets you integrate with existing applications that may be running on Kubernetes. The recommended practice when running on Kubernetes is to use the RayService controller that's provided as part of KubeRay. The RayService controller automatically handles important production requirements such as health checking, status reporting, failure recovery, and upgrades.

A RayService CR encapsulates a multi-node Ray Cluster and a Serve application that runs on top of it into a single Kubernetes manifest. Deploying, upgrading, and getting the status of the application can be done using standard kubectl commands. This section walks through how to deploy, monitor, and upgrade the Text ML example on Kubernetes.

Installing the KubeRay operator#

Follow the KubeRay quickstart guide to:

  • Install kubectl and Helm

  • Prepare a Kubernetes cluster

  • Deploy a KubeRay operator

Setting up a RayService custom resource (CR)#

Once the KubeRay controller is running, manage your Ray Serve application by creating and updating a RayService CR (example).

Under the spec section in the RayService CR, set the following fields:

serviceUnhealthySecondThreshold: The threshold, in seconds, after which a service is considered unhealthy (that is, the application status is not RUNNING). The default is 60 seconds. When the service is unhealthy, the KubeRay Service controller creates a new cluster and deploys the application to it.

deploymentUnhealthySecondThreshold: The number of seconds that the Serve application status can be unavailable before the service is considered unhealthy. The Serve application status is unavailable whenever the Ray dashboard is unavailable. The default is 60 seconds. When the service is unhealthy, the KubeRay Service controller creates a new cluster and deploys the application to it.

serveConfigV2: The configuration that Ray Serve uses to deploy the application. Use serve build to print the Serve configuration, then copy-paste it directly into your Kubernetes config and RayService CR.

rayClusterConfig: Populate this field with the contents of the spec field from the RayCluster CR YAML file. Refer to KubeRay configuration for more details.
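
Putting these fields together, here is a minimal, illustrative sketch of a RayService CR; the name, threshold values, and the Serve and cluster config contents are placeholders rather than a production-ready manifest:

apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: rayservice-sample
spec:
  serviceUnhealthySecondThreshold: 300     # seconds before an unhealthy service triggers cluster recreation
  deploymentUnhealthySecondThreshold: 300  # raise this for large dependency downloads (see the tip below)
  serveConfigV2: |
    # Paste the output of `serve build` here.
    applications:
      - name: text_ml_app
        import_path: text_ml.app
        route_prefix: /
  rayClusterConfig:
    # Paste the contents of the `spec` field from a RayCluster CR here.
    headGroupSpec:
      ...
    workerGroupSpecs:
      ...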

Tip

To enhance the reliability of your application, particularly when dealing with large dependencies that may take a significant amount of time to download, consider increasing the value of deploymentUnhealthySecondThreshold to avoid an unnecessary cluster restart.

Alternatively, include the dependencies in your image’s Dockerfile, so the dependencies are available as soon as the pods start.

Deploying a Serve application#

When the RayService is created, the KubeRay controller first creates a Ray cluster using the provided configuration. Then, once the cluster is running, it deploys the Serve application to the cluster using the REST API. The controller also creates a Kubernetes Service that can be used to route traffic to the Serve application.

To see an example, deploy the Text ML example. The Serve config for the example is embedded into this sample RayService CR. Save this CR locally to a file named ray-service.text-ml.yaml:

Note

  • The example RayService uses very low num_cpus values for demonstration purposes. In production, provide more resources to the Serve application. Learn more about how to configure KubeRay clusters here.

  • If you have dependencies that must be installed during deployment, you can add them to the runtime_env in the deployment code or directly in the Serve config, as sketched after this note. Learn more here.
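
For example, a hedged sketch of declaring pip dependencies per application in the serveConfigV2 section; the package list is illustrative:

serveConfigV2: |
  applications:
    - name: text_ml_app
      import_path: text_ml.app
      runtime_env:
        pip:
          - torch
          - transformers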

$ curl -o ray-service.text-ml.yaml https://raw.githubusercontent.com/ray-project/kuberay/5b1a5a11f5df76db2d66ed332ff0802dc3bbff76/ray-operator/config/samples/ray-service.text-ml.yaml

To deploy the example, apply the CR with kubectl. This creates the underlying Ray cluster, consisting of a head node pod and a worker node pod (see Ray Clusters Key Concepts for more details on Ray clusters), as well as the service that routes queries to the application:

$ kubectl apply -f ray-service.text-ml.yaml

$ kubectl get rayservices
NAME                AGE
rayservice-sample   7s

$ kubectl get pods
NAME                                                      READY   STATUS    RESTARTS   AGE
ervice-sample-raycluster-454c4-worker-small-group-b6mmg   1/1     Running   0          XXs
kuberay-operator-7fbdbf8c89-4lrnr                         1/1     Running   0          XXs
rayservice-sample-raycluster-454c4-head-krk9d             1/1     Running   0          XXs

$ kubectl get services
NAME                                               TYPE        CLUSTER-IP  PORT(S)                                                   AGE
rayservice-sample-head-svc                         ClusterIP   ...         8080/TCP,6379/TCP,8265/TCP,10001/TCP,8000/TCP,52365/TCP   XXs
rayservice-sample-raycluster-454c4-dashboard-svc   ClusterIP   ...         52365/TCP                                                 XXs
rayservice-sample-raycluster-454c4-head-svc        ClusterIP   ...         8000/TCP,52365/TCP,8080/TCP,6379/TCP,8265/TCP,10001/TCP   XXs
rayservice-sample-serve-svc                        ClusterIP   ...         8000/TCP                                                  XXs

Note that rayservice-sample-serve-svc above is the service used to send queries to the Serve application; the next section uses it.

Querying the application#

Once the RayService is running, we can query it over HTTP using the service created by the KubeRay controller. This service can be queried directly from inside the cluster, but to access it from your laptop you'll need to configure a Kubernetes ingress or use port forwarding as shown below:

$ kubectl port-forward service/rayservice-sample-serve-svc 8000
$ curl -X POST -H "Content-Type: application/json" localhost:8000 -d '"It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief"'
c'était le meilleur des temps, c'était le pire des temps .
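
The service can also be queried from inside the cluster without port forwarding, for example from a temporary pod; the pod name and image below are illustrative:

$ kubectl run curl-test --image=curlimages/curl -it --rm --restart=Never -- \
    curl -X POST -H "Content-Type: application/json" \
    http://rayservice-sample-serve-svc:8000/ \
    -d '"It was the best of times, it was the worst of times"'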

Getting the status of the application#

While the RayService is running, the KubeRay controller continually monitors it and writes relevant status updates to the CR. You can view the status of the application using kubectl describe. The output includes the status of the cluster, events such as health check failures or restarts, and the application-level statuses reported by serve status.


$ kubectl describe rayservice rayservice-sample
...
Status:
  Active Service Status:
    Application Statuses:
      text_ml_app:
        Health Last Update Time:  2023-09-07T01:21:30Z
        Last Update Time:         2023-09-07T01:21:30Z
        Serve Deployment Statuses:
          text_ml_app_Summarizer:
            Health Last Update Time:  2023-09-07T01:21:30Z
            Last Update Time:         2023-09-07T01:21:30Z
            Status:                   HEALTHY
          text_ml_app_Translator:
            Health Last Update Time:  2023-09-07T01:21:30Z
            Last Update Time:         2023-09-07T01:21:30Z
            Status:                   HEALTHY
        Status:                       RUNNING
    Dashboard Status:
      Health Last Update Time:  2023-09-07T01:21:30Z
      Is Healthy:               true
      Last Update Time:         2023-09-07T01:21:30Z
    Ray Cluster Name:           rayservice-sample-raycluster-kkd2p
    Ray Cluster Status:
      Head:
  Observed Generation:  1
  Pending Service Status:
    Dashboard Status:
    Ray Cluster Status:
      Head:
  Service Status:  Running
Events:
  Type    Reason   Age                      From                   Message
  ----    ------   ----                     ----                   -------
  Normal  Running  2m15s (x29791 over 16h)  rayservice-controller  The Serve application is now running and healthy.
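
For scripting, you can also extract individual fields of the status shown above with kubectl's JSONPath support; for example, the top-level service status (assuming the field name follows the camelCase form of the describe output):

$ kubectl get rayservice rayservice-sample -o jsonpath='{.status.serviceStatus}'
Running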

Updating the application#

To update the RayService, modify the manifest and apply it using kubectl apply. There are two types of updates that can occur:

  • Application-level updates: when only the Serve config options are changed, the update is applied in-place on the same Ray cluster. This enables lightweight updates such as scaling a deployment up or down or modifying autoscaling parameters.

  • Cluster-level updates: when the RayCluster config options are changed, such as updating the container image for the cluster, it may result in a cluster-level update. In this case, a new cluster is started, and the application is deployed to it. Once the new cluster is ready, the Kubernetes service is updated to point to the new cluster and the previous cluster is terminated. There should not be any downtime for the application, but note that this requires the Kubernetes cluster to be large enough to schedule both Ray clusters.

Example: Serve config update#

In the Text ML example above, change the language of the Translator in the Serve config to German:

  - name: Translator
    num_replicas: 1
    user_config:
      language: german

Now, to update the application, apply the modified manifest:

$ kubectl apply -f ray-service.text-ml.yaml

$ kubectl describe rayservice rayservice-sample
...
  Serve Deployment Statuses:
    text_ml_app_Translator:
      Health Last Update Time:  2023-09-07T18:21:36Z
      Last Update Time:         2023-09-07T18:21:36Z
      Status:                   UPDATING
...

Query the application to see a different translation in German:

$ curl -X POST -H "Content-Type: application/json" localhost:8000 -d '"It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief"'
Es war die beste Zeit, es war die schlimmste Zeit .

Updating the RayCluster config#

The process of updating the RayCluster config is the same as updating the Serve config. For example, we can update the number of worker nodes to 2 in the manifest:

workerGroupSpecs:
  # the number of pods in the worker group.
  - replicas: 2

$ kubectl apply -f ray-service.text-ml.yaml

$ kubectl describe rayservice rayservice-sample
...
  pendingServiceStatus:
    appStatus: {}
    dashboardStatus:
      healthLastUpdateTime: "2022-07-18T21:54:53Z"
      lastUpdateTime: "2022-07-18T21:54:54Z"
    rayClusterName: rayservice-sample-raycluster-bshfr
    rayClusterStatus: {}
...

In the status, you can see that the RayService is preparing a pending cluster. After the pending cluster is healthy, it becomes the active cluster and the previous cluster is terminated.

Autoscaling#

You can configure autoscaling for your Serve application by setting the autoscaling_config field for each deployment in the Serve config, as sketched below. Learn more about the configuration options in the Serve Autoscaling Guide.
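
For example, a hedged sketch of per-deployment autoscaling settings in the Serve config; the numbers are illustrative, and num_replicas must be left unset when autoscaling is enabled:

  - name: Translator
    autoscaling_config:
      min_replicas: 1
      max_replicas: 5
      target_num_ongoing_requests_per_replica: 10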

To enable autoscaling in a KubeRay cluster, set enableInTreeAutoscaling to true, as sketched below. Additional options are available to configure the autoscaling behavior; for further details, refer to the documentation here.
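
A hedged sketch of where these fields live in the RayService manifest; the autoscalerOptions values are illustrative:

rayClusterConfig:
  enableInTreeAutoscaling: true
  autoscalerOptions:
    idleTimeoutSeconds: 60    # how long a worker node can sit idle before removal
    upscalingMode: Default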

Note

In most use cases, it is recommended to enable Kubernetes autoscaling to fully utilize the resources in your cluster. If you are using GKE, you can use an Autopilot Kubernetes cluster. For instructions, see Create an Autopilot Cluster. For EKS, you can enable Kubernetes cluster autoscaling with the Cluster Autoscaler. For detailed information, see Cluster Autoscaler on AWS. To understand the relationship between Kubernetes autoscaling and Ray autoscaling, see Ray Autoscaler with Kubernetes Cluster Autoscaler.

Load balancer#

Set up ingress to expose your Serve application with a load balancer. See this configuration.

Note

  • Ray Serve runs an HTTP proxy on every node, allowing you to use /-/routes as the endpoint for node health checks.

  • Ray Serve uses port 8000 as the default HTTP proxy traffic port. You can change the port by setting http_options in the Serve config, as sketched after this note. Learn more here.
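
For example, a minimal sketch of overriding the proxy port at the top level of the Serve config; the port value is illustrative:

http_options:
  host: 0.0.0.0
  port: 8500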

Monitoring#

Monitor your Serve application using the Ray Dashboard.

Note

  • To troubleshoot application deployment failures in Serve, you can check the KubeRay operator logs by running kubectl logs -f <kuberay-operator-pod-name> (e.g., kubectl logs -f kuberay-operator-7447d85d58-lv7pf). The KubeRay operator logs contain information about the Serve application deployment event and Serve application health checks.

  • You can also check the controller log and deployment log, which are located under /tmp/ray/session_latest/logs/serve/ in both the head node pod and worker node pods; an example of browsing these logs follows this note. These logs contain information about specific deployment failure reasons and autoscaling events.
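
For example, to list these logs on the head node pod (the pod name comes from the earlier kubectl get pods output and is illustrative):

$ kubectl exec -it rayservice-sample-raycluster-454c4-head-krk9d -- ls /tmp/ray/session_latest/logs/serve/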

Next Steps#

See Add End-to-End Fault Tolerance to learn more about Serve’s failure conditions and how to guard against them.