vLLM compatibility#

Ray Serve LLM provides an OpenAI-compatible API that aligns with vLLM’s OpenAI-compatible server. Most engine_kwargs that work with vllm serve also work with Ray Serve LLM, giving you access to vLLM’s feature set through Ray Serve’s distributed deployment capabilities.

This compatibility means you can:

  • Use the same model configurations and engine arguments as vLLM

  • Leverage vLLM’s latest features (multimodal, structured output, reasoning models)

  • Switch between vllm serve and Ray Serve LLM with minimal code changes, and scale out when you need to

  • Take advantage of Ray Serve’s production features (autoscaling, multi-model serving, advanced routing)

This guide shows how to use vLLM features such as embeddings, structured output, vision language models, and reasoning models with Ray Serve.
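
For example, engine arguments that you pass to vllm serve as command-line flags map directly to engine_kwargs entries in LLMConfig. The following is a minimal sketch; the small Qwen model and the specific flag values are illustrative placeholders, not requirements:

# vLLM CLI:
#   vllm serve Qwen/Qwen2.5-0.5B-Instruct --max-model-len 8192 --gpu-memory-utilization 0.9

# Equivalent Ray Serve LLM configuration:
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    engine_kwargs=dict(
        # Same argument names as the vLLM CLI flags, in snake_case
        max_model_len=8192,
        gpu_memory_utilization=0.9,
    ),
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)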

Embeddings#

You can generate embeddings by setting the task parameter to "embed" in the engine arguments. Models supporting this use case are listed in the vLLM text embedding models documentation.

Deploy an embedding model#

from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    engine_kwargs=dict(
        task="embed",
    ),
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)

from openai import OpenAI

# Initialize client
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

# Generate embeddings
response = client.embeddings.create(
    model="qwen-0.5b",
    input=["A text to embed", "Another text to embed"],
)

for data in response.data:
    print(data.embedding)  # List of floats (one embedding vector per input)

curl -X POST http://localhost:8000/v1/embeddings \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer fake-key" \
     -d '{
           "model": "qwen-0.5b",
           "input": ["A text to embed", "Another text to embed"],
           "encoding_format": "float"
         }'

Structured output#

You can request structured JSON output similar to OpenAI’s API using JSON mode or JSON schema validation with Pydantic models.

JSON mode#

from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1,
            max_replicas=2,
        )
    ),
    accelerator_type="A10G",
)

# Build and deploy the model
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)

from openai import OpenAI

# Initialize client
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

# Request structured JSON output
response = client.chat.completions.create(
    model="qwen-0.5b",
    response_format={"type": "json_object"},
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that outputs JSON."
        },
        {
            "role": "user",
            "content": "List three colors in JSON format"
        }
    ],
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
# Example response:
# {
#   "colors": [
#     "red",
#     "blue",
#     "green"
#   ]
# }

JSON schema with Pydantic#

You can specify the exact schema you want for the response using Pydantic models:

from openai import OpenAI
from typing import List, Literal
from pydantic import BaseModel

# Initialize client
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

# Define a pydantic model of a preset of allowed colors
class Color(BaseModel):
    colors: List[Literal["cyan", "magenta", "yellow"]]

# Request structured JSON output
response = client.chat.completions.create(
    model="qwen-0.5b",
    response_format={
        "type": "json_schema",
        "json_schema": Color.model_json_schema()
    },
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that outputs JSON."
        },
        {
            "role": "user",
            "content": "List three colors in JSON format"
        }
    ],
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
# Example response:
# {
#   "colors": [
#     "cyan",
#     "magenta",
#     "yellow"
#   ]
# }

Vision language models#

You can deploy multimodal models that process both text and images. Ray Serve LLM supports vision models through vLLM’s multimodal capabilities.

Deploy a vision model#

from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app


# Configure a vision model
llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="pixtral-12b",
        model_source="mistral-community/pixtral-12b",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1,
            max_replicas=2,
        )
    ),
    accelerator_type="L40S",
    engine_kwargs=dict(
        tensor_parallel_size=1,
        max_model_len=8192,
    ),
)

# Build and deploy the model
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)

from openai import OpenAI

# Initialize client
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

# Create and send a request with an image
response = client.chat.completions.create(
    model="pixtral-12b",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What's in this image?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image.jpg"
                    }
                }
            ]
        }
    ],
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)

Supported models#

For a complete list of supported vision models, see the vLLM multimodal models documentation.

Reasoning models#

Ray Serve LLM supports reasoning models such as DeepSeek-R1 and QwQ through vLLM. These models use extended thinking processes before generating final responses.

For reasoning model support and configuration, see the vLLM reasoning models documentation.
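
As a rough sketch, you can configure a reasoning model like any other model and pass the reasoning parser through engine_kwargs. The model, accelerator type, and reasoning_parser value below are illustrative assumptions, and how vLLM exposes reasoning options (including the reasoning_content field on responses) depends on your vLLM version:

from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="deepseek-r1-distill",
        model_source="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    ),
    accelerator_type="A10G",
    engine_kwargs=dict(
        # Assumption: recent vLLM releases accept a reasoning parser name here;
        # check the vLLM reasoning documentation for your installed version.
        reasoning_parser="deepseek_r1",
    ),
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)

from openai import OpenAI

# Initialize client
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

response = client.chat.completions.create(
    model="deepseek-r1-distill",
    messages=[{"role": "user", "content": "Which is larger, 9.11 or 9.8?"}],
)

message = response.choices[0].message
# With a reasoning parser enabled, vLLM returns the thinking trace separately
# from the final answer; reasoning_content may be absent otherwise.
print(getattr(message, "reasoning_content", None))
print(message.content)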

See also#