Serve a Large Language Model using Ray Serve LLM on Kubernetes#
This guide provides step-by-step instructions for deploying a Large Language Model (LLM) with Ray Serve LLM on Kubernetes. Using KubeRay, Ray Serve, and vLLM, it deploys the Qwen/Qwen2.5-7B-Instruct model from Hugging Face, enabling scalable, efficient, and OpenAI-compatible LLM serving in a Kubernetes environment. See Serving LLMs for more information about Ray Serve LLM.
Prerequisites#
This example downloads model weights from the Qwen/Qwen2.5-7B-Instruct Hugging Face repository. To complete this guide, you need the following:
- A Hugging Face account and a Hugging Face access token with read access to gated repositories. In your RayService custom resource, set the HUGGING_FACE_HUB_TOKEN environment variable to this token so the cluster can download the model. You can optionally verify the token with the check after this list.
- A Kubernetes cluster with GPUs.
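If you want a quick sanity check that the token works before wiring it into Kubernetes, you can query the Hugging Face whoami endpoint from your workstation. This is an optional sketch; it assumes your machine has network access to huggingface.co and that you export the token as HF_TOKEN locally.
# Optional: a valid token returns your Hugging Face account information.
export HF_TOKEN=<your-hf-access-token-value>
curl -s -H "Authorization: Bearer ${HF_TOKEN}" https://huggingface.co/api/whoami-v2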
Step 1: Create a Kubernetes cluster with GPUs#
Refer to the Kubernetes cluster setup instructions for guides on creating a Kubernetes cluster.
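Before moving on, you can confirm that the cluster exposes schedulable GPUs. The following check is a sketch that assumes NVIDIA GPUs advertised by the NVIDIA device plugin as the nvidia.com/gpu resource.
# List each node and the number of allocatable NVIDIA GPUs it reports.
kubectl get nodes "-o=custom-columns=NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"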
Step 2: Install the KubeRay operator#
Install the most recent stable KubeRay operator from the Helm repository by following Deploy a KubeRay operator. The Kubernetes NoSchedule
taint in the example config prevents the KubeRay operator pod from running on a GPU node.
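For reference, the Helm commands typically look like the following. Treat this as a sketch: check the KubeRay releases for the current stable version rather than the example version pinned here, and note that the pod label selector assumes the chart's default labels.
# Add the KubeRay Helm repository and install the operator.
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
helm install kuberay-operator kuberay/kuberay-operator --version 1.3.0   # example version
# Confirm the operator pod is running.
kubectl get pods -l app.kubernetes.io/name=kuberay-operator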
Step 3: Create a Kubernetes Secret containing your Hugging Face access token#
For additional security, instead of passing the Hugging Face access token directly as a plain environment variable value, create a Kubernetes Secret that contains it. Download the Ray Serve LLM service config YAML file using the following command:
curl -o ray-service.llm-serve.yaml https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-service.llm-serve.yaml
After downloading the file, update the value of hf_token in the Secret to your private access token.
apiVersion: v1
kind: Secret
metadata:
  name: hf-token
type: Opaque
stringData:
  hf_token: <your-hf-access-token-value>
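Alternatively, if you prefer not to keep the token in a file, you can create an equivalent Secret imperatively. The Secret name (hf-token) and key (hf_token) must match what the RayService config references, and if you create it this way, don't later apply the downloaded file while its Secret manifest still contains the placeholder value, because kubectl apply overwrites the token.
# Create the same Secret without editing the YAML file.
kubectl create secret generic hf-token --from-literal=hf_token=<your-hf-access-token-value>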
Step 4: Deploy a RayService#
After adding the Hugging Face access token, create a RayService custom resource using the config file:
kubectl apply -f ray-service.llm-serve.yaml
This step sets up a custom Ray Serve app to serve the Qwen/Qwen2.5-7B-Instruct model, creating an OpenAI-compatible server. You can inspect and modify the serveConfigV2 section in the YAML file to learn more about the Serve app:
serveConfigV2: |
  applications:
  - name: llms
    import_path: ray.serve.llm:build_openai_app
    route_prefix: "/"
    args:
      llm_configs:
      - model_loading_config:
          model_id: qwen2.5-7b-instruct
          model_source: Qwen/Qwen2.5-7B-Instruct
        engine_kwargs:
          dtype: bfloat16
          max_model_len: 1024
          device: auto
          gpu_memory_utilization: 0.75
        deployment_config:
          autoscaling_config:
            min_replicas: 1
            max_replicas: 4
            target_ongoing_requests: 64
          max_ongoing_requests: 128
In particular, this configuration loads the model from Qwen/Qwen2.5-7B-Instruct and sets its model_id to qwen2.5-7b-instruct. The LLMDeployment initializes the underlying LLM engine using the engine_kwargs field. The deployment_config section controls autoscaling: Ray Serve scales the number of engine replicas between min_replicas and max_replicas based on the number of ongoing requests per replica. By default, each replica requires one GPU. See Serving LLMs and the Ray Serve config documentation for more information.
Wait for the RayService resource to become healthy. You can confirm its status by running the following command:
kubectl get rayservice ray-serve-llm -o yaml
After a few minutes, the result should be similar to the following:
status:
  activeServiceStatus:
    applicationStatuses:
      llms:
        serveDeploymentStatuses:
          LLMDeployment:qwen2_5-7b-instruct:
            status: HEALTHY
          LLMRouter:
            status: HEALTHY
        status: RUNNING
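While you wait, you can also poll just the application status field or watch the Ray pods start. The jsonpath below follows the status layout shown above; the pod label selector assumes the standard ray.io/cluster label that KubeRay applies to Ray pods.
# Print only the overall application status (for example, RUNNING).
kubectl get rayservice ray-serve-llm -o jsonpath='{.status.activeServiceStatus.applicationStatuses.llms.status}{"\n"}'
# Watch the Ray head and worker pods come up.
kubectl get pods -l ray.io/cluster --watch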
Step 5: Send a request#
To send requests to the Ray Serve deployment, port-forward port 8000 from the Serve app service:
kubectl port-forward svc/ray-serve-llm-serve-svc 8000
Note that this Kubernetes service comes up only after Ray Serve apps are running and ready.
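You can confirm that the service exists before port-forwarding:
# The serve service appears once the Serve apps are ready.
kubectl get svc ray-serve-llm-serve-svc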
Test the service with the following command:
curl --location 'http://localhost:8000/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "qwen2.5-7b-instruct",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Provide steps to serve an LLM using Ray Serve."
      }
    ]
  }'
The output should be in the following format:
{
  "id": "qwen2.5-7b-instruct-550d3fd491890a7e7bca74e544d3479e",
  "object": "chat.completion",
  "created": 1746595284,
  "model": "qwen2.5-7b-instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": null,
        "content": "Sure! Ray Serve is a library built on top of Ray...",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 30,
    "total_tokens": 818,
    "completion_tokens": 788,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}
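The server also supports streaming responses through the same OpenAI-compatible endpoint. The following variant is a sketch of the same request with "stream": true; the response then arrives as a sequence of server-sent event chunks rather than a single JSON body.
curl --location 'http://localhost:8000/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "qwen2.5-7b-instruct",
    "stream": true,
    "messages": [
      {"role": "user", "content": "Summarize Ray Serve in one sentence."}
    ]
  }'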
Step 6: View the Ray dashboard#
To access the Ray dashboard, port-forward port 8265 from the Ray head service:
kubectl port-forward svc/ray-serve-llm-head-svc 8265
Once forwarded, open http://localhost:8265 in your browser and navigate to the Serve tab to review application status, deployments, routers, logs, and other relevant details.
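With the dashboard port forwarded, you can also query the Ray Serve REST API on the same port to retrieve the application and deployment status programmatically:
# Returns the Serve applications, deployments, and their statuses as JSON.
curl http://localhost:8265/api/serve/applications/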