Deploy on Kubernetes#

This section should help you:

  • understand how to install and use the KubeRay operator.

  • understand how to deploy a Ray Serve application using a RayService.

  • understand how to monitor and update your application.

The recommended way to deploy Ray Serve is on Kubernetes, providing the best of both worlds: the user experience and scalable compute of Ray Serve and the operational benefits of Kubernetes. This also lets you integrate with existing applications that may be running on Kubernetes. The recommended practice when running on Kubernetes is to use the RayService controller that's provided as part of KubeRay. The RayService controller automatically handles important production requirements such as health checking, status reporting, failure recovery, and upgrades.

A RayService CR encapsulates a multi-node Ray Cluster and a Serve application that runs on top of it into a single Kubernetes manifest. Deploying, upgrading, and getting the status of the application can be done using standard kubectl commands. This section walks through how to deploy, monitor, and upgrade the Text ML example on Kubernetes.

Installing the KubeRay operator#

Follow the KubeRay quickstart guide to:

  • Install kubectl and Helm

  • Prepare a Kubernetes cluster

  • Deploy a KubeRay operator

Setting up a RayService custom resource (CR)#

Once the KubeRay controller is running, manage your Ray Serve application by creating and updating a RayService CR (example).

Under the spec section in the RayService CR, set the following fields:

serviceUnhealthySecondThreshold: The threshold, in seconds, after which a service is considered unhealthy (that is, the application status is not RUNNING). The default is 60 seconds. When the service is unhealthy, the KubeRay Service controller creates a new cluster and deploys the application to it.

deploymentUnhealthySecondThreshold: The number of seconds that the Serve application status can be unavailable before the service is considered unhealthy. The Serve application status is unavailable whenever the Ray dashboard is unavailable. The default is 60 seconds. When the service is unhealthy, the KubeRay Service controller creates a new cluster and deploys the application to it.

serveConfigV2: The configuration that Ray Serve uses to deploy the application. Use serve build to print the Serve configuration, then copy-paste it directly into your Kubernetes config and RayService CR.

rayClusterConfig: Populate this field with the contents of the spec field from the RayCluster CR YAML file. Refer to KubeRay configuration for more details.
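
Putting these fields together, here is a minimal, illustrative sketch of a RayService CR; the name, threshold values, and the Serve and cluster config contents are placeholders rather than a production-ready manifest:

apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: rayservice-sample
spec:
  serviceUnhealthySecondThreshold: 300     # seconds before an unhealthy service triggers cluster recreation
  deploymentUnhealthySecondThreshold: 300  # raise this for large dependency downloads (see the tip below)
  serveConfigV2: |
    # Paste the output of `serve build` here.
    applications:
      - name: text_ml_app
        import_path: text_ml.app
        route_prefix: /
  rayClusterConfig:
    # Paste the contents of the `spec` field from a RayCluster CR here.
    headGroupSpec:
      ...
    workerGroupSpecs:
      ...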

Tip

To enhance the reliability of your application, particularly when dealing with large dependencies that may take a significant amount of time to download, consider increasing the value of deploymentUnhealthySecondThreshold to avoid an unnecessary cluster restart.

Alternatively, include the dependencies in your image’s Dockerfile, so the dependencies are available as soon as the pods start.

Deploying a Serve application#

When the RayService is created, the KubeRay controller first creates a Ray cluster using the provided configuration. Then, once the cluster is running, it deploys the Serve application to the cluster using the REST API. The controller also creates a Kubernetes Service that can be used to route traffic to the Serve application.

To see an example, deploy the Text ML example. The Serve config for the example is embedded into this sample RayService CR. Save this CR locally to a file named ray-service.text-ml.yaml:

Note

  • The example RayService uses very low num_cpus values for demonstration purposes. In production, provide more resources to the Serve application. Learn more about how to configure KubeRay clusters here.

  • If you have dependencies that must be installed during deployment, you can add them to the runtime_env in the deployment code or directly in the Serve config, as sketched after this note. Learn more here.
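
For example, a hedged sketch of declaring pip dependencies per application in the serveConfigV2 section; the package list is illustrative:

serveConfigV2: |
  applications:
    - name: text_ml_app
      import_path: text_ml.app
      runtime_env:
        pip:
          - torch
          - transformers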

$ curl -o ray-service.text-ml.yaml https://raw.githubusercontent.com/ray-project/kuberay/5b1a5a11f5df76db2d66ed332ff0802dc3bbff76/ray-operator/config/samples/ray-service.text-ml.yaml

To deploy the example, apply the CR with kubectl. This creates the underlying Ray cluster, consisting of a head node pod and a worker node pod (see Ray Clusters Key Concepts for more details on Ray clusters), as well as the service that routes queries to the application:

$ kubectl apply -f ray-service.text-ml.yaml

$ kubectl get rayservices
NAME                AGE
rayservice-sample   7s

$ kubectl get pods
NAME                                                      READY   STATUS    RESTARTS   AGE
ervice-sample-raycluster-454c4-worker-small-group-b6mmg   1/1     Running   0          XXs
kuberay-operator-7fbdbf8c89-4lrnr                         1/1     Running   0          XXs
rayservice-sample-raycluster-454c4-head-krk9d             1/1     Running   0          XXs

$ kubectl get services
NAME                                               TYPE        CLUSTER-IP  PORT(S)                                                   AGE
rayservice-sample-head-svc                         ClusterIP   ...         8080/TCP,6379/TCP,8265/TCP,10001/TCP,8000/TCP,52365/TCP   XXs
rayservice-sample-raycluster-454c4-dashboard-svc   ClusterIP   ...         52365/TCP                                                 XXs
rayservice-sample-raycluster-454c4-head-svc        ClusterIP   ...         8000/TCP,52365/TCP,8080/TCP,6379/TCP,8265/TCP,10001/TCP   XXs
rayservice-sample-serve-svc                        ClusterIP   ...         8000/TCP                                                  XXs

Note that rayservice-sample-serve-svc above is the service used to send queries to the Serve application; the next section uses it.

Querying the application#

Once the RayService is running, we can query it over HTTP using the service created by the KubeRay controller. This service can be queried directly from inside the cluster, but to access it from your laptop you'll need to configure a Kubernetes ingress or use port forwarding as shown below:

$ kubectl port-forward service/rayservice-sample-serve-svc 8000
$ curl -X POST -H "Content-Type: application/json" localhost:8000 -d '"It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief"'
c'était le meilleur des temps, c'était le pire des temps .
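
The service can also be queried from inside the cluster without port forwarding, for example from a temporary pod; the pod name and image below are illustrative:

$ kubectl run curl-test --image=curlimages/curl -it --rm --restart=Never -- \
    curl -X POST -H "Content-Type: application/json" \
    http://rayservice-sample-serve-svc:8000/ \
    -d '"It was the best of times, it was the worst of times"'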

Getting the status of the application#

While the RayService is running, the KubeRay controller continually monitors it and writes relevant status updates to the CR. You can view the status of the application using kubectl describe. The output includes the status of the cluster, events such as health check failures or restarts, and the application-level statuses reported by serve status.


$ kubectl describe rayservice rayservice-sample
...
Status:
  Active Service Status:
    Application Statuses:
      text_ml_app:
        Health Last Update Time:  2023-09-07T01:21:30Z
        Last Update Time:         2023-09-07T01:21:30Z
        Serve Deployment Statuses:
          text_ml_app_Summarizer:
            Health Last Update Time:  2023-09-07T01:21:30Z
            Last Update Time:         2023-09-07T01:21:30Z
            Status:                   HEALTHY
          text_ml_app_Translator:
            Health Last Update Time:  2023-09-07T01:21:30Z
            Last Update Time:         2023-09-07T01:21:30Z
            Status:                   HEALTHY
        Status:                       RUNNING
    Dashboard Status:
      Health Last Update Time:  2023-09-07T01:21:30Z
      Is Healthy:               true
      Last Update Time:         2023-09-07T01:21:30Z
    Ray Cluster Name:           rayservice-sample-raycluster-kkd2p
    Ray Cluster Status:
      Head:
  Observed Generation:  1
  Pending Service Status:
    Dashboard Status:
    Ray Cluster Status:
      Head:
  Service Status:  Running
Events:
  Type    Reason   Age                      From                   Message
  ----    ------   ----                     ----                   -------
  Normal  Running  2m15s (x29791 over 16h)  rayservice-controller  The Serve application is now running and healthy.
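
For scripting, you can also extract individual fields of the status shown above with kubectl's JSONPath support; for example, the top-level service status (assuming the field name follows the camelCase form of the describe output):

$ kubectl get rayservice rayservice-sample -o jsonpath='{.status.serviceStatus}'
Running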

Updating the application#

To update the RayService, modify the manifest and apply it using kubectl apply. There are two types of updates that can occur:

  • Application-level updates: when only the Serve config options are changed, the update is applied in-place on the same Ray cluster. This enables lightweight updates such as scaling a deployment up or down or modifying autoscaling parameters.

  • Cluster-level updates: when the RayCluster config options are changed, such as updating the container image for the cluster, it may result in a cluster-level update. In this case, a new cluster is started, and the application is deployed to it. Once the new cluster is ready, the Kubernetes service is updated to point to the new cluster and the previous cluster is terminated. There should not be any downtime for the application, but note that this requires the Kubernetes cluster to be large enough to schedule both Ray clusters.

Example: Serve config update#

In the Text ML example above, change the language of the Translator in the Serve config to German:

  - name: Translator
    num_replicas: 1
    user_config:
      language: german

Now, to update the application, apply the modified manifest:

$ kubectl apply -f ray-service.text-ml.yaml

$ kubectl describe rayservice rayservice-sample
...
  Serve Deployment Statuses:
    text_ml_app_Translator:
      Health Last Update Time:  2023-09-07T18:21:36Z
      Last Update Time:         2023-09-07T18:21:36Z
      Status:                   UPDATING
...

Query the application to see a different translation in German:

$ curl -X POST -H "Content-Type: application/json" localhost:8000 -d '"It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief"'
Es war die beste Zeit, es war die schlimmste Zeit .

Updating the RayCluster config#

The process of updating the RayCluster config is the same as updating the Serve config. For example, we can update the number of worker nodes to 2 in the manifest:

workerGroupSpecs:
  # the number of pods in the worker group.
  - replicas: 2

$ kubectl apply -f ray-service.text-ml.yaml

$ kubectl describe rayservice rayservice-sample
...
  pendingServiceStatus:
    appStatus: {}
    dashboardStatus:
      healthLastUpdateTime: "2022-07-18T21:54:53Z"
      lastUpdateTime: "2022-07-18T21:54:54Z"
    rayClusterName: rayservice-sample-raycluster-bshfr
    rayClusterStatus: {}
...

In the status, you can see that the RayService is preparing a pending cluster. After the pending cluster is healthy, it becomes the active cluster and the previous cluster is terminated.

Autoscaling#

You can configure autoscaling for your Serve application by setting the autoscaling_config field for each deployment in the Serve config, as sketched below. Learn more about the configuration options in the Serve Autoscaling Guide.
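
For example, a hedged sketch of per-deployment autoscaling settings in the Serve config; the numbers are illustrative, and num_replicas must be left unset when autoscaling is enabled:

  - name: Translator
    autoscaling_config:
      min_replicas: 1
      max_replicas: 5
      target_num_ongoing_requests_per_replica: 10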

To enable autoscaling in a KubeRay cluster, set enableInTreeAutoscaling to true, as sketched below. Additional options are available to configure the autoscaling behavior; for further details, refer to the documentation here.
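
A hedged sketch of where these fields live in the RayService manifest; the autoscalerOptions values are illustrative:

rayClusterConfig:
  enableInTreeAutoscaling: true
  autoscalerOptions:
    idleTimeoutSeconds: 60    # how long a worker node can sit idle before removal
    upscalingMode: Default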

Note

In most use cases, it is recommended to enable Kubernetes autoscaling to fully utilize the resources in your cluster. If you are using GKE, you can use an Autopilot Kubernetes cluster. For instructions, see Create an Autopilot Cluster. For EKS, you can enable Kubernetes cluster autoscaling with the Cluster Autoscaler. For detailed information, see Cluster Autoscaler on AWS. To understand the relationship between Kubernetes autoscaling and Ray autoscaling, see Ray Autoscaler with Kubernetes Cluster Autoscaler.

Load balancer#

Set up ingress to expose your Serve application with a load balancer. See this configuration.

Note

  • Ray Serve runs an HTTP proxy on every node, allowing you to use /-/routes as the endpoint for node health checks.

  • Ray Serve uses port 8000 as the default HTTP proxy traffic port. You can change the port by setting http_options in the Serve config, as sketched after this note. Learn more here.
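
For example, a minimal sketch of overriding the proxy port at the top level of the Serve config; the port value is illustrative:

http_options:
  host: 0.0.0.0
  port: 8500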

Monitoring#

Monitor your Serve application using the Ray Dashboard.

Note

  • To troubleshoot application deployment failures in Serve, you can check the KubeRay operator logs by running kubectl logs -f <kuberay-operator-pod-name> (e.g., kubectl logs -f kuberay-operator-7447d85d58-lv7pf). The KubeRay operator logs contain information about the Serve application deployment event and Serve application health checks.

  • You can also check the controller log and deployment log, which are located under /tmp/ray/session_latest/logs/serve/ in both the head node pod and worker node pods; an example of browsing these logs follows this note. These logs contain information about specific deployment failure reasons and autoscaling events.
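
For example, to list these logs on the head node pod (the pod name comes from the earlier kubectl get pods output and is illustrative):

$ kubectl exec -it rayservice-sample-raycluster-454c4-head-krk9d -- ls /tmp/ray/session_latest/logs/serve/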

Next Steps#

See Add End-to-End Fault Tolerance to learn more about Serve’s failure conditions and how to guard against them.