Overview#
Ray Serve LLM APIs let you deploy multiple LLMs together with the familiar Ray Serve API, while providing compatibility with the OpenAI API.
Features#
⚡️ Automatic scaling and load balancing
🌐 Unified multi-node multi-model deployment
🔌 OpenAI compatible
🔄 Multi-LoRA support with shared base models
🚀 Engine-agnostic architecture (e.g., vLLM, SGLang)
Requirements#
pip install "ray[serve,llm]>=2.43.0" "vllm>=0.7.2"
# Suggested dependencies when using vLLM 0.7.2:
pip install xgrammar==0.1.11 pynvml==12.0.0
Key Components#
The ray.serve.llm module provides two key deployment types for serving LLMs:
VLLMService#
The VLLMService sets up and manages the vLLM engine for model serving. It can be used standalone or combined with your own custom Ray Serve deployments.
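As a rough sketch of the "combined with your own deployments" pattern, you can bind a VLLMService deployment and hand its handle to a custom Serve deployment. Everything here other than LLMConfig and VLLMService.as_deployment().bind(...) is illustrative: the ingress class is hypothetical, and which request methods the service exposes depends on your Ray version, so check the ray.serve.llm API reference before forwarding requests.
from ray import serve
from ray.serve.llm.configs import LLMConfig
from ray.serve.llm.deployments import VLLMService

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    accelerator_type="A10G",
)

# Bind the engine deployment as usual.
vllm_deployment = VLLMService.as_deployment().bind(llm_config)


@serve.deployment
class MyCustomIngress:
    """Hypothetical wrapper deployment that owns a handle to the LLM service."""

    def __init__(self, llm_handle):
        # The handle lets this deployment forward requests to the engine
        # replicas. The exact request methods available on the service are
        # version-dependent; see the ray.serve.llm API reference.
        self._llm_handle = llm_handle

    async def __call__(self, request):
        # Add your own routing, auth, or pre/post-processing here before
        # forwarding work to self._llm_handle.
        ...


app = MyCustomIngress.bind(vllm_deployment)
serve.run(app)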
LLMRouter#
This deployment provides an OpenAI-compatible FastAPI ingress and routes traffic to the appropriate model for multi-model services. The following endpoints are supported:
/v1/chat/completions: Chat interface (ChatGPT-style)
/v1/completions: Text completion
/v1/models: List available models
/v1/models/{model}: Model information
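For example, once an application such as the quickstart below is running, you can hit the discovery endpoints with the OpenAI client. The base URL, fake-key API key, and qwen-0.5b model ID all come from the quickstart example further down this page.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

# List every model_id served by the router.
for model in client.models.list():
    print(model.id)

# Fetch metadata for a single model.
print(client.models.retrieve("qwen-0.5b"))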
Configuration#
LLMConfig#
The LLMConfig class specifies model details such as:
Model loading sources (Hugging Face or cloud storage)
Hardware requirements (accelerator type)
Engine arguments (e.g. vLLM engine kwargs)
LoRA multiplexing configuration
Serve auto-scaling parameters
Quickstart Examples#
Deployment through LLMRouter#
from ray import serve
from ray.serve.llm.configs import LLMConfig
from ray.serve.llm.deployments import VLLMService, LLMRouter
llm_config = LLMConfig(
model_loading_config=dict(
model_id="qwen-0.5b",
model_source="Qwen/Qwen2.5-0.5B-Instruct",
),
deployment_config=dict(
autoscaling_config=dict(
min_replicas=1, max_replicas=2,
)
),
# Pass the desired accelerator type (e.g. A10G, L4, etc.)
accelerator_type="A10G",
# You can customize the engine arguments (e.g. vLLM engine kwargs)
engine_kwargs=dict(
tensor_parallel_size=2,
),
)
# Deploy the application
deployment = VLLMService.as_deployment().bind(llm_config)
llm_app = LLMRouter.as_deployment().bind([deployment])
serve.run(llm_app)
You can query the deployed models using either cURL or the OpenAI Python client:
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer fake-key" \
-d '{
"model": "qwen-0.5b",
"messages": [{"role": "user", "content": "Hello!"}]
}'
from openai import OpenAI
# Initialize client
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")
# Basic chat completion with streaming
response = client.chat.completions.create(
model="qwen-0.5b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in response:
if chunk.choices[0].delta.content is not None:
print(chunk.choices[0].delta.content, end="", flush=True)
For deploying multiple models, you can pass a list of LLMConfig objects to the LLMRouter deployment:
from ray import serve
from ray.serve.llm.configs import LLMConfig
from ray.serve.llm.deployments import VLLMService, LLMRouter
llm_config1 = LLMConfig(
model_loading_config=dict(
model_id="qwen-0.5b",
model_source="Qwen/Qwen2.5-0.5B-Instruct",
),
deployment_config=dict(
autoscaling_config=dict(
min_replicas=1, max_replicas=2,
)
),
accelerator_type="A10G",
)
llm_config2 = LLMConfig(
model_loading_config=dict(
model_id="qwen-1.5b",
model_source="Qwen/Qwen2.5-1.5B-Instruct",
),
deployment_config=dict(
autoscaling_config=dict(
min_replicas=1, max_replicas=2,
)
),
accelerator_type="A10G",
)
# Deploy the application
deployment1 = VLLMService.as_deployment().bind(llm_config1)
deployment2 = VLLMService.as_deployment().bind(llm_config2)
llm_app = LLMRouter.as_deployment().bind([deployment1, deployment2])
serve.run(llm_app)
Production Deployment#
For production use, Ray Serve LLM provides utilities for config-driven deployments; you can specify your configuration in YAML files:
# config.yaml
applications:
- args:
llm_configs:
- model_loading_config:
model_id: qwen-0.5b
model_source: Qwen/Qwen2.5-0.5B-Instruct
accelerator_type: A10G
deployment_config:
autoscaling_config:
min_replicas: 1
max_replicas: 2
- model_loading_config:
model_id: qwen-1.5b
model_source: Qwen/Qwen2.5-1.5B-Instruct
accelerator_type: A10G
deployment_config:
autoscaling_config:
min_replicas: 1
max_replicas: 2
import_path: ray.serve.llm.builders:build_openai_app
name: llm_app
route_prefix: "/"
You can also keep each model's configuration in its own file and reference those files from the Serve config:
# config.yaml
applications:
- args:
llm_configs:
- models/qwen-0.5b.yaml
- models/qwen-1.5b.yaml
import_path: ray.serve.llm.builders:build_openai_app
name: llm_app
route_prefix: "/"
# models/qwen-0.5b.yaml
model_loading_config:
model_id: qwen-0.5b
model_source: Qwen/Qwen2.5-0.5B-Instruct
accelerator_type: A10G
deployment_config:
autoscaling_config:
min_replicas: 1
max_replicas: 2
# models/qwen-1.5b.yaml
model_loading_config:
model_id: qwen-1.5b
model_source: Qwen/Qwen2.5-1.5B-Instruct
accelerator_type: A10G
deployment_config:
autoscaling_config:
min_replicas: 1
max_replicas: 2
To deploy using either configuration file:
serve run config.yaml
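If you are deploying the same config to an already-running Ray cluster rather than starting Serve locally, the generic Ray Serve CLI workflow also applies; a minimal sketch:
# Deploy (or update) the applications declared in config.yaml on a
# running Ray cluster, then check rollout status.
serve deploy config.yaml
serve status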
Advanced Usage Patterns#
For each usage pattern, we provide a server and client code snippet.
Multi-LoRA Deployment#
You can serve multiple LoRA (Low-Rank Adaptation) fine-tuned adapters on top of a shared base model by configuring the LoraConfig.
We use Ray Serve's multiplexing feature to serve multiple LoRA checkpoints from the same model: the adapter weights are loaded on each replica on-the-fly and cached via an LRU mechanism.
from ray import serve
from ray.serve.llm.configs import LLMConfig
from ray.serve.llm.builders import build_openai_app
# Configure the model with LoRA
llm_config = LLMConfig(
model_loading_config=dict(
model_id="qwen-0.5b",
model_source="Qwen/Qwen2.5-0.5B-Instruct",
),
lora_config=dict(
# Let's pretend this is where LoRA weights are stored on S3.
# For example
# s3://my_dynamic_lora_path/lora_model_1_ckpt
# s3://my_dynamic_lora_path/lora_model_2_ckpt
# are two of the LoRA checkpoints
dynamic_lora_loading_path="s3://my_dynamic_lora_path",
max_num_adapters_per_replica=16,
),
deployment_config=dict(
autoscaling_config=dict(
min_replicas=1,
max_replicas=2,
)
),
accelerator_type="A10G",
)
# Build and deploy the model
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)
from openai import OpenAI
# Initialize client
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")
# Make a request to the desired lora checkpoint
response = client.chat.completions.create(
model="qwen-0.5b:lora_model_1_ckpt",
messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
Structured Output#
For structured output, you can use JSON mode similar to OpenAI’s API:
from ray import serve
from ray.serve.llm.configs import LLMConfig
from ray.serve.llm.builders import build_openai_app
# Configure the model
llm_config = LLMConfig(
model_loading_config=dict(
model_id="qwen-0.5b",
model_source="Qwen/Qwen2.5-0.5B-Instruct",
),
deployment_config=dict(
autoscaling_config=dict(
min_replicas=1,
max_replicas=2,
)
),
accelerator_type="A10G",
)
# Build and deploy the model
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)
from openai import OpenAI
# Initialize client
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")
# Request structured JSON output
response = client.chat.completions.create(
model="qwen-0.5b",
response_format={"type": "json_object"},
messages=[
{
"role": "system",
"content": "You are a helpful assistant that outputs JSON."
},
{
"role": "user",
"content": "List three colors in JSON format"
}
],
stream=True,
)
for chunk in response:
if chunk.choices[0].delta.content is not None:
print(chunk.choices[0].delta.content, end="", flush=True)
# Example response:
# {
# "colors": [
# "red",
# "blue",
# "green"
# ]
# }
Vision Language Models#
For multimodal models that can process both text and images:
from ray import serve
from ray.serve.llm.configs import LLMConfig
from ray.serve.llm.builders import build_openai_app
# Configure a vision model
llm_config = LLMConfig(
model_loading_config=dict(
model_id="pixtral-12b",
model_source="mistral-community/pixtral-12b",
),
deployment_config=dict(
autoscaling_config=dict(
min_replicas=1,
max_replicas=2,
)
),
accelerator_type="A10G",
)
# Build and deploy the model
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)
from openai import OpenAI
# Initialize client
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")
# Create and send a request with an image
response = client.chat.completions.create(
model="pixtral-12b",
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "What's in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "https://example.com/image.jpg"
}
}
]
}
]
)
print(response.choices[0].message.content)
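If your images are local files rather than public URLs, the OpenAI-style image_url content also accepts base64-encoded data URLs (support depends on the underlying model and engine). A minimal sketch, where the file path is a placeholder:
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

# Encode a local image (placeholder path) as a data URL.
with open("my_image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="pixtral-12b",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)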
Frequently Asked Questions#
How do I use gated Hugging Face models?#
You can use runtime_env to specify the environment variables required to access the model.
To set the deployment options, use the get_serve_options method on the LLMConfig object.
from ray import serve
from ray.serve.llm.configs import LLMConfig
from ray.serve.llm.deployments import VLLMService, LLMRouter
import os
llm_config = LLMConfig(
model_loading_config=dict(
model_id="llama-3-8b-instruct",
model_source="meta-llama/Meta-Llama-3-8B-Instruct",
),
deployment_config=dict(
autoscaling_config=dict(
min_replicas=1, max_replicas=2,
)
),
# Pass the desired accelerator type (e.g. A10G, L4, etc.)
accelerator_type="A10G",
runtime_env=dict(
env_vars=dict(
HF_TOKEN=os.environ["HF_TOKEN"]
)
),
)
# Deploy the application
deployment = VLLMService.as_deployment(llm_config.get_serve_options(name_prefix="VLLM:")).bind(llm_config)
llm_app = LLMRouter.as_deployment().bind([deployment])
serve.run(llm_app)
Why is downloading the model so slow?#
If you are using Hugging Face models, you can enable faster downloads by setting HF_HUB_ENABLE_HF_TRANSFER=1 and installing hf_transfer (pip install hf_transfer).
from ray import serve
from ray.serve.llm.configs import LLMConfig
from ray.serve.llm.deployments import VLLMService, LLMRouter
import os
llm_config = LLMConfig(
model_loading_config=dict(
model_id="llama-3-8b-instruct",
model_source="meta-llama/Meta-Llama-3-8B-Instruct",
),
deployment_config=dict(
autoscaling_config=dict(
min_replicas=1, max_replicas=2,
)
),
# Pass the desired accelerator type (e.g. A10G, L4, etc.)
accelerator_type="A10G",
runtime_env=dict(
env_vars=dict(
HF_TOKEN=os.environ["HF_TOKEN"],
HF_HUB_ENABLE_HF_TRANSFER="1"
)
),
)
# Deploy the application
deployment = VLLMService.as_deployment(llm_config.get_serve_options(name_prefix="VLLM:")).bind(llm_config)
llm_app = LLMRouter.as_deployment().bind([deployment])
serve.run(llm_app)