Deploy LLM with Ray Serve LLM#

This guide walks you through deploying a large language model (LLM) using Ray Serve LLM. It covers configuration, deployment, and interaction with the model.

The example maintains compatibility with the OpenAI API while leveraging Ray Serve LLM’s powerful features for production-grade deployments.

For more details, see the Ray Serve LLM API documentation: https://docs.ray.io/en/latest/serve/llm/index.html

Anyscale-Specific Configuration

Note: This tutorial is optimized for the Anyscale platform. When running on open source Ray, additional configuration is required. For example, you’ll need to manually:

Configure your Ray Cluster: Set up your multi-node environment (including head and worker nodes) and manage resource allocation (e.g., autoscaling, GPU/CPU assignments) without the Anyscale automation. See the Ray Cluster Setup documentation for details: https://docs.ray.io/en/latest/cluster/getting-started.html.
Manage Dependencies: Install and manage dependencies on each node since you won’t have Anyscale’s Docker-based dependency management. Refer to the Ray Installation Guide for instructions on installing and updating Ray in your environment: https://docs.ray.io/en/latest/ray-core/handling-dependencies.html.
Set Up Storage: Configure your own distributed or shared storage system (instead of relying on Anyscale’s integrated cluster storage). Check out the Ray Cluster Configuration guide for suggestions on setting up shared storage solutions: https://docs.ray.io/en/latest/train/user-guides/persistent-storage.html.

Dependencies#

Ensure you have the correct dependencies and install the required python packages using:

pip install "ray[serve,llm]>=2.45.0" 

Note: If you are on Anyscale platform, you can use the docker image: anyscale/ray-llm:2.45.0-py311-cu124.

Otherwise, feel free to build you own docker image on Anyscale as well, which could potentially speed up the workspace spin up time and worker node load time.

We have included the Dockerfile in the workspace.

Setting Up Your LLM Deployment#

Step 1: Configure the Deployment and Workder Node#

Create an LLMConfig object that sets up your model’s runtime environment, loading parameters, and engine options.

In this deployment, we set accelerator_type='L4' to use the L4 GPU node.

Because the model is large, we use tensor parallelism by setting 'tensor_parallel_size': 4 to distribute the load across 4 GPUs.

Load Huggingface gated models:

Qwen models do not require the Hugging Face token, but some models (such as Llama 3.1 models) may require registration and access. To use the gated Huggingface models, follow the link: https://docs.ray.io/en/latest/serve/llm/troubleshooting.html#how-do-i-use-gated-huggingface-models

To use other engine arguments from vLLM

If you would like to use more arguments from vLLM, checkout: https://docs.vllm.ai/en/latest/serving/engine_args.html

from ray import serve
from ray.serve.llm import LLMConfig


llm_config = LLMConfig(
    model_loading_config={
        'model_id': 'Qwen/Qwen2.5-32B-Instruct'
    },
    engine_kwargs={
        'max_num_batched_tokens': 8192,
        'max_model_len': 8192,
        'max_num_seqs': 64,
        'tensor_parallel_size': 4,
        'trust_remote_code': True,
    },
    accelerator_type='L4',
    deployment_config={
        'autoscaling_config': {
            'target_ongoing_requests': 32
        },
        'max_ongoing_requests': 64,
    },
)

Tips for Faster Model Downloading / Loading:#

For faster model downloading, you can enable fast download by setting HF_HUB_ENABLE_HF_TRANSFER and installing with pip install hf_transfer. Check out: https://docs.ray.io/en/latest/serve/llm/troubleshooting.html#why-is-downloading-the-model-so-slow
For faster model loading, you can download model weights to Anyscale’s cluster storage improves cluster startup times and scaling efficiency. For example, we can specify the Qwen model files (around 60GB in size) stored at: /mnt/cluster_storage/Qwen/Qwen2.5-32B-Instruct

Step 2: Start Ray Serve and Deploy Your Model#

You can directly use build_openai_app to build a OpenAI API compatible app:

from ray.serve.llm import build_openai_app

# Build and deploy the model with OpenAI api compatibility:
llm_app = build_openai_app({"llm_configs": [llm_config]})
serve.run(llm_app)

Streaming Chat Completions with OpenAI Client#

If successful, you should see the following information printed out:

"INFO 2025-03-02 17:17:14,162 serve 61769 -- Application 'default' is ready at http://127.0.0.1:8000/. 

INFO 2025-03-02 17:17:14,162 serve 61769 -- Deployed app 'default' successfully."

Note: we have appended “v1” to the base URL because the OpenAI client requires it.

Next, we can initialize an OpenAI client using this URL and an API key (though we don’t need the key for now) and then stream chat completions from your deployed model.

from openai import OpenAI

# Initialize client
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")
model_id='Qwen/Qwen2.5-32B-Instruct' ## model id need to be same as your deployment

# Basic chat completion with streaming
response = client.chat.completions.create(
    model=model_id,
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)

Shut down the service#

When you need to stop your service, simply run the following command:

!serve shutdown --yes

Production Deployment#

For a production-ready deployment, use Anyscale Services. This allows you to deploy the Ray Serve application to a dedicated cluster with built-in scalability, fault tolerance, and load balancing.

Let’s put our deployment code in serve_llm.py, then we can deploy the service with this command:

anyscale service deploy serve_llm:llm_app --name=llm-service-qwen2p5-32B

(anyscale +1.6s) Restarting existing service 'llm-service-qwen2p5-32B'.
(anyscale +2.3s) Using workspace runtime dependencies env vars: {'HF_TOKEN': 'HF_TOKEN'}.
(anyscale +2.3s) Uploading local dir '.' to cloud storage.
(anyscale +8.1s) Service 'llm-service-qwen2p5-32B' deployed (version ID: 75vs71q8).
(anyscale +8.1s) View the service in the UI: 'https://console.anyscale.com/services/service2_ybvl7arasth81zdll29mfm1jts'
(anyscale +8.1s) Query the service once it's running using the following curl command (add the path you want to query):
(anyscale +8.1s) curl -H "Authorization: Bearer v-ysnEivLuvxo3ZITC8b7SkI0jZ1taqXk_eBprAr0TY" https://llm-service-qwen2p5-32b-xxx.xxx.x.anyscaleuserdata.com/
(autoscaler +11m55s) [autoscaler] Downscaling node i-018a78b690239bb67 (node IP: 10.0.2.213) due to node idle termination.

Building an LLM Client for Chat Interactions#

This section introduces a custom LLMClient class that wraps around the OpenAI API. It supports both streaming responses (token-by-token) and full message retrieval.

Note: Before proceeding, please ensure that the service is operational, as it may take a few moments for it to become fully available.

from openai import OpenAI
from typing import Optional, Generator

from typing import Dict, List, Union
import torch
import numpy as np
from sentence_transformers import SentenceTransformer
from pprint import pprint
import chromadb


from openai import OpenAI
from typing import Optional, Generator

class LLMClient:
    def __init__(self, base_url: str, api_key: Optional[str] = None, model_id: str = None):
        # Ensure the base_url ends with a slash and does not include '/routes'
        if not base_url.endswith("/"):
            base_url += "/"
        if "/routes" in base_url:
            raise ValueError("base_url must end with '.com'")

        self.model_id = model_id
        self.client = OpenAI(
            base_url=base_url + "v1",
            api_key=api_key or "NOT A REAL KEY",
        )

    def get_response_streaming(
        self,
        prompt: str,
        temperature: float = 0.01,
    ) -> Generator[str, None, None]:
        """
        Get a response from the model based on the provided prompt.
        Yields the response tokens as they are streamed.
        """
        chat_completions = self.client.chat.completions.create(
            model=self.model_id,
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
            stream=True
        )

        for chat in chat_completions:
            delta = chat.choices[0].delta
            if delta.content:
                yield delta.content

    def get_response(
        self,
        prompt: str,
        temperature: float = 0.01,
    ) -> str:
        """
        Get a complete response from the model based on the provided prompt.
        """
        chat_response = self.client.chat.completions.create(
            model=self.model_id,
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
            stream=False
        )
        return chat_response.choices[0].message.content

Query the Service#

When you deploy, you expose the service to a publicly accessible IP address which you can send requests to. In the previous cell’s output, copy your API_KEY and BASE_URL.

Replace and fill in the placeholder values for the BASE_URL and API_KEY in the following code:

Example: Streaming Response#

# Initialize client
model_id='Qwen/Qwen2.5-32B-Instruct' ## model id need to be same as your deployment 
base_url = "https://llm-service-qwen2p5-32b-xxx.xxx.x.anyscaleuserdata.com" ## replace with your own service base url
api_key = "" ## replace with your own api key


llm_client = LLMClient(
    base_url=base_url,
    api_key=api_key,
    model_id=model_id,
)


# --- Get the response with streaming ---
prompt = "what is ray?"
print("Model response (streaming):")
for token in llm_client.get_response_streaming(prompt, temperature=0.5):
    print(token, end="")

Model response (streaming):
{"asctime": "2025-05-19 10:01:04,287", "levelname": "INFO", "message": "HTTP Request: POST https://llm-service-qwen2p5-32b-jgz99.cld-kvedzwag2qa8i5bj.s.anyscaleuserdata.com/v1/chat/completions \"HTTP/1.1 200 OK\"", "filename": "_client.py", "lineno": 1025, "job_id": "02000000", "worker_id": "02000000ffffffffffffffffffffffffffffffffffffffffffffffff", "node_id": "7a87cdeb8936fafd92d0d4cab8456af74f2aae665f59cec80664527f", "timestamp_ns": 1747674064287065217}
Ray is a high-performance distributed computing framework that was originally developed by researchers at the RISELab (formerly known as AMPLab) at the University of California, Berkeley. It is designed to make it easier to write and scale parallel and distributed applications in Python. Ray is particularly well-suited for machine learning, reinforcement learning, and other data-intensive computing tasks.

Ray provides several key features:

1. **Task Parallelism**: Ray allows you to define tasks that can be executed in parallel across multiple CPUs or GPUs.

2. **Actor Model**: Ray supports the actor model of concurrency, which means you can create and manage stateful objects (actors) that can be distributed across multiple nodes.

3. **Scalability**: Ray is designed to scale from a single machine to a large cluster, making it easy to distribute computation and data across multiple machines.

4. **Integration**: Ray integrates with many popular machine learning frameworks and libraries, such as TensorFlow, PyTorch, and Scikit-learn, making it a versatile tool for data scientists and machine learning engineers.

5. **Ease of Use**: Ray aims to be easy to use, with a simple API that allows you to write parallel and distributed applications without needing deep knowledge of distributed systems.

Ray is used in various industries for tasks ranging from training large machine learning models to real-time data processing and simulation. It's particularly popular in the reinforcement learning community due to its support for complex, multi-agent scenarios.

Example: Non-Streaming (Full Response)#

prompt = "what is ray?"

# --- Get the response without streaming ---
response = llm_client.get_response(prompt, temperature=0.5)
print("\n\nModel response (non-streaming):")
print(response)

{"asctime": "2025-05-19 10:01:31,654", "levelname": "INFO", "message": "HTTP Request: POST https://llm-service-qwen2p5-32b-jgz99.cld-kvedzwag2qa8i5bj.s.anyscaleuserdata.com/v1/chat/completions \"HTTP/1.1 200 OK\"", "filename": "_client.py", "lineno": 1025, "job_id": "02000000", "worker_id": "02000000ffffffffffffffffffffffffffffffffffffffffffffffff", "node_id": "7a87cdeb8936fafd92d0d4cab8456af74f2aae665f59cec80664527f", "timestamp_ns": 1747674091654282480}

Model response (non-streaming):
"Ray" can refer to different things depending on the context. Here are a few possibilities:

1. **Physics**: In physics, a ray is a line or beam of light, heat, or other form of electromagnetic radiation or particles traveling in a straight line. For example, in optics, rays are used to model the path that light takes.

2. **Computer Science**: Ray could refer to "Ray Tracing," a rendering technique used in computer graphics to generate an image by tracing the path of light as pixels in an image plane and simulating the effects of its encounters with virtual objects. It's widely used in video games, movies, and other applications requiring high-quality 3D graphics.

3. **Software**: "Ray" can also refer to an open-source distributed computing framework developed by the RISELab at UC Berkeley. This framework is designed to enable the development of scalable and high-performance applications, particularly in the fields of machine learning, reinforcement learning, and other data-intensive applications.

4. **Brand or Product Name**: "Ray" could also be part of a brand name or product name, such as Ray-Ban sunglasses or other consumer products.

If you're asking about a specific context or application of "Ray," please provide more details so I can give you a more accurate answer.