Serving LLMs#

The Ray Serve LLM APIs let you deploy multiple LLMs together through the familiar Ray Serve API while staying compatible with the OpenAI API.

Features#

⚡️ Automatic scaling and load balancing
🌐 Unified multi-node multi-model deployment
🔌 OpenAI compatible
🔄 Multi-LoRA support with shared base models
🚀 Engine-agnostic architecture (e.g. vLLM, SGLang)

Requirements#

pip install "ray[serve,llm]>=2.43.0" "vllm>=0.7.2"

# Suggested dependencies when using vllm 0.7.2:
pip install xgrammar==0.1.11 pynvml==12.0.0

Key Components#

The ray.serve.llm module provides two key deployment types for serving LLMs:

LLMServer#

The LLMServer sets up and manages the vLLM engine for model serving. It can be used standalone or combined with your own custom Ray Serve deployments.

LLMRouter#

This deployment provides an OpenAI-compatible FastAPI ingress and routes traffic to the appropriate model for multi-model services. The following endpoints are supported:

/v1/chat/completions: Chat interface (ChatGPT-style)
/v1/completions: Text completion
/v1/embeddings: Text embeddings
/v1/models: List available models
/v1/models/{model}: Model information

Configuration#

LLMConfig#

The LLMConfig class specifies model details such as:

Model loading sources (Hugging Face or cloud storage)
Hardware requirements (accelerator type)
Engine arguments (e.g. vLLM engine kwargs)
LoRA multiplexing configuration
Serve auto-scaling parameters

Quickstart Examples#

Deployment through `LLMRouter`#

Builder Pattern

from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1, max_replicas=2,
        )
    ),
    # Pass the desired accelerator type (e.g. A10G, L4, etc.)
    accelerator_type="A10G",
    # You can customize the engine arguments (e.g. vLLM engine kwargs)
    engine_kwargs=dict(
        tensor_parallel_size=2,
    ),
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)

Bind Pattern

from ray import serve
from ray.serve.llm import LLMConfig, LLMServer, LLMRouter

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1, max_replicas=2,
        )
    ),
    # Pass the desired accelerator type (e.g. A10G, L4, etc.)
    accelerator_type="A10G",
    # You can customize the engine arguments (e.g. vLLM engine kwargs)
    engine_kwargs=dict(
        tensor_parallel_size=2,
    ),
)

# Deploy the application
deployment = LLMServer.as_deployment(llm_config.get_serve_options(name_prefix="vLLM:")).bind(llm_config)
llm_app = LLMRouter.as_deployment().bind([deployment])
serve.run(llm_app, blocking=True)

You can query the deployed models using either cURL or the OpenAI Python client:

cURL

curl -X POST http://localhost:8000/v1/chat/completions \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer fake-key" \
     -d '{
           "model": "qwen-0.5b",
           "messages": [{"role": "user", "content": "Hello!"}]
         }'

Python

from openai import OpenAI

# Initialize client
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

# Basic chat completion with streaming
response = client.chat.completions.create(
    model="qwen-0.5b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
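
If the openai package isn't installed, a request equivalent to the cURL example can be built with the standard library. This is a minimal sketch: `build_chat_request` and `send_chat` are hypothetical helper names, not part of Ray Serve, and the response parsing assumes the usual OpenAI chat completion shape (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Placeholder key, mirroring the cURL example.
            "Authorization": "Bearer fake-key",
        },
        method="POST",
    )


def send_chat(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumes an OpenAI-style chat completion response body.
    return body["choices"][0]["message"]["content"]


# Usage, with the quickstart app running locally:
#   req = build_chat_request("http://localhost:8000", "qwen-0.5b", "Hello!")
#   print(send_chat(req))
```

Keeping request construction separate from sending makes it easy to inspect the URL and payload before any traffic reaches the deployment.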

For deploying multiple models, you can pass a list of LLMConfig objects to the LLMRouter deployment:

Builder Pattern

from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app


llm_config1 = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1, max_replicas=2,
        )
    ),
    accelerator_type="A10G",
)

llm_config2 = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-1.5b",
        model_source="Qwen/Qwen2.5-1.5B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1, max_replicas=2,
        )
    ),
    accelerator_type="A10G",
)

app = build_openai_app({"llm_configs": [llm_config1, llm_config2]})
serve.run(app, blocking=True)

Bind Pattern

from ray import serve
from ray.serve.llm import LLMConfig, LLMServer, LLMRouter

llm_config1 = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1, max_replicas=2,
        )
    ),
    accelerator_type="A10G",
)

llm_config2 = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-1.5b",
        model_source="Qwen/Qwen2.5-1.5B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1, max_replicas=2,
        )
    ),
    accelerator_type="A10G",
)

# Deploy the application
deployment1 = LLMServer.as_deployment(llm_config1.get_serve_options(name_prefix="vLLM:")).bind(llm_config1)
deployment2 = LLMServer.as_deployment(llm_config2.get_serve_options(name_prefix="vLLM:")).bind(llm_config2)
llm_app = LLMRouter.as_deployment().bind([deployment1, deployment2])
serve.run(llm_app, blocking=True)

See also Serve DeepSeek for an example of deploying DeepSeek models.

Production Deployment#

For production deployments, Ray Serve LLM provides utilities for config-driven deployments. You can specify your deployment configuration using YAML files:

Inline Config

# config.yaml
applications:
- args:
    llm_configs:
        - model_loading_config:
            model_id: qwen-0.5b
            model_source: Qwen/Qwen2.5-0.5B-Instruct
          accelerator_type: A10G
          deployment_config:
            autoscaling_config:
                min_replicas: 1
                max_replicas: 2
        - model_loading_config:
            model_id: qwen-1.5b
            model_source: Qwen/Qwen2.5-1.5B-Instruct
          accelerator_type: A10G
          deployment_config:
            autoscaling_config:
                min_replicas: 1
                max_replicas: 2
  import_path: ray.serve.llm:build_openai_app
  name: llm_app
  route_prefix: "/"

Standalone Config

# config.yaml
applications:
- args:
    llm_configs:
        - models/qwen-0.5b.yaml
        - models/qwen-1.5b.yaml
  import_path: ray.serve.llm:build_openai_app
  name: llm_app
  route_prefix: "/"

# models/qwen-0.5b.yaml
model_loading_config:
  model_id: qwen-0.5b
  model_source: Qwen/Qwen2.5-0.5B-Instruct
accelerator_type: A10G
deployment_config:
  autoscaling_config:
    min_replicas: 1
    max_replicas: 2

# models/qwen-1.5b.yaml
model_loading_config:
  model_id: qwen-1.5b
  model_source: Qwen/Qwen2.5-1.5B-Instruct
accelerator_type: A10G
deployment_config:
  autoscaling_config:
    min_replicas: 1
    max_replicas: 2

To deploy using either configuration file:

serve run config.yaml
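
When many models share the same settings, the standalone layout can be generated rather than written by hand. The sketch below renders the same files shown in the Standalone Config example from a mapping of model IDs to model sources. The helper names (`render_standalone_configs`, `write_files`) and the shared defaults are assumptions made for illustration; they aren't part of Ray Serve.

```python
from pathlib import Path
from textwrap import dedent
from typing import Dict

# Per-model config, following the models/*.yaml files shown above.
MODEL_TEMPLATE = dedent(
    """\
    model_loading_config:
      model_id: {model_id}
      model_source: {model_source}
    accelerator_type: {accelerator_type}
    deployment_config:
      autoscaling_config:
        min_replicas: {min_replicas}
        max_replicas: {max_replicas}
    """
)


def render_standalone_configs(
    models: Dict[str, str],
    accelerator_type: str = "A10G",
    min_replicas: int = 1,
    max_replicas: int = 2,
) -> Dict[str, str]:
    """Map relative file paths to YAML text for the standalone layout."""
    files = {}
    entries = []
    for model_id, model_source in models.items():
        rel_path = f"models/{model_id}.yaml"
        files[rel_path] = MODEL_TEMPLATE.format(
            model_id=model_id,
            model_source=model_source,
            accelerator_type=accelerator_type,
            min_replicas=min_replicas,
            max_replicas=max_replicas,
        )
        entries.append(f"        - {rel_path}")
    serve_lines = [
        "applications:",
        "- args:",
        "    llm_configs:",
        *entries,
        "  import_path: ray.serve.llm:build_openai_app",
        "  name: llm_app",
        '  route_prefix: "/"',
    ]
    files["config.yaml"] = "\n".join(serve_lines) + "\n"
    return files


def write_files(root: Path, files: Dict[str, str]) -> None:
    """Write the rendered files under root, creating directories as needed."""
    for rel_path, text in files.items():
        target = root / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text)


# Usage:
#   files = render_standalone_configs({"qwen-0.5b": "Qwen/Qwen2.5-0.5B-Instruct"})
#   write_files(Path("."), files)
```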

Generate config files#

Ray Serve LLM provides a CLI to generate config files for your deployment:

python -m ray.serve.llm.gen_config

Note: This command requires interactive inputs. You should execute it directly in the terminal.

This command lets you pick from a common set of OSS LLMs and helps you configure them. You can tune settings like GPU type, tensor parallelism, and autoscaling parameters.

Note that if you’re configuring a model whose architecture is different from the provided list of models, you should closely review the generated model config file to provide the correct values.

This command generates two files: an LLM config file, saved in model_config/, and a Ray Serve config file, serve_TIMESTAMP.yaml, that you can reference and re-run in the future.

After reading and reviewing the generated model config, see the vLLM engine configuration docs for further customization.

Observability#

Ray enables LLM service-level logging by default, and makes these statistics available using Grafana and Prometheus. For more details on configuring Grafana and Prometheus, see Collecting and monitoring metrics.

These higher-level metrics track request and token behavior across deployed models. For example: average total tokens per request, ratio of input tokens to generated tokens, and peak tokens per second.
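
To make two of these metrics concrete, the sketch below computes them from per-request token counts. It only illustrates the arithmetic under the most direct reading of the metric names; it isn't how Ray computes them internally, and `RequestStats` and `summarize` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass
class RequestStats:
    input_tokens: int
    generated_tokens: int


def summarize(requests: Sequence[RequestStats]) -> Dict[str, Optional[float]]:
    """Compute request-level token statistics for a batch of requests."""
    if not requests:
        raise ValueError("at least one request is required")
    total_in = sum(r.input_tokens for r in requests)
    total_out = sum(r.generated_tokens for r in requests)
    return {
        # Mean of (input + generated) tokens over all requests.
        "avg_total_tokens_per_request": (total_in + total_out) / len(requests),
        # Undefined when nothing was generated.
        "input_to_generated_ratio": total_in / total_out if total_out else None,
    }
```

For example, two requests with 100 and 200 input tokens and 50 and 100 generated tokens (made-up sample values) average 225 total tokens per request, with an input-to-generated ratio of 2.0.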

For visualization, Ray ships with a Serve LLM-specific dashboard, which is automatically available in Grafana.

Engine Metrics#

All engine metrics, including vLLM, are available through the Ray metrics export endpoint and are queryable using Prometheus. See vLLM metrics for a complete list. These are also visualized by the Serve LLM Grafana dashboard. Dashboard panels include: time per output token (TPOT), time to first token (TTFT), and GPU cache utilization.
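
As a rough client-side cross-check of TTFT and TPOT, both can be estimated from the arrival times of streamed chunks. The sketch below assumes each chunk carries one token, which may not hold for every engine or configuration, and uses the common definition of TPOT as the mean gap between tokens after the first. The function name `ttft_and_tpot` is hypothetical, and the values it returns won't match the engine's own measurements exactly because they include network latency.

```python
from typing import Optional, Sequence, Tuple


def ttft_and_tpot(
    request_start: float, chunk_times: Sequence[float]
) -> Tuple[float, Optional[float]]:
    """Return (TTFT, TPOT) in seconds from client-side timestamps.

    TPOT is None when fewer than two chunks arrived.
    """
    if not chunk_times:
        raise ValueError("no chunks were received")
    # Time from sending the request to receiving the first chunk.
    ttft = chunk_times[0] - request_start
    if len(chunk_times) < 2:
        return ttft, None
    # Mean inter-chunk gap, excluding the wait for the first chunk.
    tpot = (chunk_times[-1] - chunk_times[0]) / (len(chunk_times) - 1)
    return ttft, tpot
```

To collect the inputs, record `time.perf_counter()` just before sending a streaming request and again each time a chunk with content arrives.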

Engine metric logging is off by default, and must be manually enabled. In addition, you must enable the vLLM V1 engine to use engine metrics. To enable engine-level metric logging, set log_engine_metrics: True when configuring the LLM deployment. For example:

Python

from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1, max_replicas=2,
        )
    ),
    log_engine_metrics=True
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)

YAML

# config.yaml
applications:
- args:
    llm_configs:
        - model_loading_config:
            model_id: qwen-0.5b
            model_source: Qwen/Qwen2.5-0.5B-Instruct
          accelerator_type: A10G
          deployment_config:
            autoscaling_config:
                min_replicas: 1
                max_replicas: 2
          log_engine_metrics: true
  import_path: ray.serve.llm:build_openai_app
  name: llm_app
  route_prefix: "/"

Frequently Asked Questions#

How do I use gated Hugging Face models?#

You can use runtime_env to specify the environment variables required to access the model. To set the deployment options, use the get_serve_options method on the LLMConfig object.

from ray import serve
from ray.serve.llm import LLMConfig, LLMServer, LLMRouter
import os

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="llama-3-8b-instruct",
        model_source="meta-llama/Meta-Llama-3-8B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1, max_replicas=2,
        )
    ),
    # Pass the desired accelerator type (e.g. A10G, L4, etc.)
    accelerator_type="A10G",
    runtime_env=dict(
        env_vars=dict(
            HF_TOKEN=os.environ["HF_TOKEN"]
        )
    ),
)

# Deploy the application
deployment = LLMServer.as_deployment(llm_config.get_serve_options(name_prefix="vLLM:")).bind(llm_config)
llm_app = LLMRouter.as_deployment().bind([deployment])
serve.run(llm_app, blocking=True)

Why is downloading the model so slow?#

If you are using Hugging Face models, you can enable fast downloads by installing hf_transfer (pip install hf_transfer) and setting HF_HUB_ENABLE_HF_TRANSFER to 1.

from ray import serve
from ray.serve.llm import LLMConfig, LLMServer, LLMRouter
import os

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="llama-3-8b-instruct",
        model_source="meta-llama/Meta-Llama-3-8B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1, max_replicas=2,
        )
    ),
    # Pass the desired accelerator type (e.g. A10G, L4, etc.)
    accelerator_type="A10G",
    runtime_env=dict(
        env_vars=dict(
            HF_TOKEN=os.environ["HF_TOKEN"],
            HF_HUB_ENABLE_HF_TRANSFER="1"
        )
    ),
)

# Deploy the application
deployment = LLMServer.as_deployment(llm_config.get_serve_options(name_prefix="vLLM:")).bind(llm_config)
llm_app = LLMRouter.as_deployment().bind([deployment])
serve.run(llm_app, blocking=True)
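
Both examples above read `os.environ["HF_TOKEN"]` directly, which raises a bare `KeyError` if the variable isn't set. One possible refinement is a small helper that builds the `runtime_env` dictionary and fails with a clearer message. This is a sketch; `hf_runtime_env` is a hypothetical name and isn't part of Ray Serve.

```python
import os
from typing import Dict, Mapping, Optional


def hf_runtime_env(
    environ: Optional[Mapping[str, str]] = None, fast_download: bool = False
) -> Dict[str, Dict[str, str]]:
    """Build a runtime_env dict that forwards Hugging Face settings."""
    env = os.environ if environ is None else environ
    token = env.get("HF_TOKEN")
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; export it before deploying a gated model."
        )
    env_vars = {"HF_TOKEN": token}
    if fast_download:
        # Requires the hf_transfer package to be installed on the cluster.
        env_vars["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
    return {"env_vars": env_vars}
```

The result can then be passed as `runtime_env=hf_runtime_env(fast_download=True)` when constructing the LLMConfig.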

How do I configure the tokenizer pool size so it doesn’t hang?#

When you set tokenizer_pool_size in vLLM’s engine_kwargs, you must also set tokenizer_pool_extra_config so that the tokenizer group is scheduled correctly.

An example config is shown below:

# config.yaml
applications:
- args:
    llm_configs:
        - engine_kwargs:
            max_model_len: 1000
            tokenizer_pool_size: 2
            tokenizer_pool_extra_config: "{\"runtime_env\": {}}"
          model_loading_config:
            model_id: Qwen/Qwen2.5-7B-Instruct
  import_path: ray.serve.llm:build_openai_app
  name: llm_app
  route_prefix: "/"
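
In the YAML above, `tokenizer_pool_extra_config` is a JSON document encoded as a string, which is easy to get wrong when escaping by hand. When building `engine_kwargs` in Python, `json.dumps` can produce the same string. The helper name `tokenizer_pool_kwargs` below is hypothetical, and the sketch assumes the string form shown in the YAML is what the engine expects.

```python
import json
from typing import Any, Dict, Optional


def tokenizer_pool_kwargs(
    pool_size: int, runtime_env: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Return the two tokenizer pool settings that need to be set together."""
    extra_config = {"runtime_env": runtime_env or {}}
    return {
        "tokenizer_pool_size": pool_size,
        # JSON-encoded string, matching the YAML example.
        "tokenizer_pool_extra_config": json.dumps(extra_config),
    }
```

The returned dictionary can be merged into the engine arguments, for example `engine_kwargs=dict(max_model_len=1000, **tokenizer_pool_kwargs(2))`.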

Usage Data Collection#

We collect usage data to improve Ray Serve LLM. We collect data about the following features and attributes:

model architecture used for serving
whether JSON mode is used
whether LoRA is used and how many LoRA weights are loaded initially at deployment time
whether autoscaling is used and the min and max replicas setup
tensor parallel size used
initial replicas count
GPU type used and number of GPUs used

To opt out of usage data collection, follow the instructions in Ray usage stats to disable it.
