Serving LLMs#

Ray Serve LLM deploys large language models in production. It builds on Ray Serve primitives for distributed, multi-node LLM serving and exposes an OpenAI-compatible API.

Key features#

  • OpenAI-compatible API for chat, completions, and embeddings.

  • Multi-node, multi-model deployment with autoscaling and load balancing.

  • Parallelism strategies: tensor, pipeline, expert, and data parallel attention.

  • Prefill-decode disaggregation to scale the prefill and decode phases independently.

  • Custom request routing, including prefix-aware routing for higher cache hit rates.

  • Multi-LoRA serving on a shared base model.

  • Engine-agnostic backends such as vLLM and SGLang.

  • Built-in metrics and Grafana dashboards.

Install#

Ray Serve LLM ships with Ray. Install it with the llm extra:

pip install "ray[llm]"

This pulls in vLLM and the OpenAI-compatible server stack. You need a GPU to run most models. The Quickstart covers prerequisites, supported hardware, and gated-model setup.

Deploy your first model#

Define an LLMConfig, build an OpenAI-compatible app, and run it:

from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config={
        "model_id": "qwen-0.5b",
        "model_source": "Qwen/Qwen2.5-0.5B-Instruct",
    },
    deployment_config={
        "autoscaling_config": {
            "min_replicas": 1,
            "max_replicas": 2,
        }
    },
    # Pass the desired accelerator type (e.g. A10G, L4, etc.)
    accelerator_type="A10G",
    # You can customize the engine arguments (e.g. vLLM engine kwargs)
    engine_kwargs={
        "tensor_parallel_size": 2,
    },
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)

Once it is running, query it with any OpenAI client at http://localhost:8000/v1. See the Quickstart for client snippets, multi-model apps, and config-driven (YAML) deployments.

Find your path#

  • New here? Start with the Quickstart to deploy and query a model.

  • Configuring a deployment? The Configuration reference explains every LLMConfig field.

  • Scaling up? The User guides cover parallelism, routing, caching, LoRA, and observability.

  • Want the internals? The Architecture docs explain components, request flow, and serving patterns.

  • Deploying a specific model? The Examples walk through small, medium, large, vision, and reasoning models end to end.

  • Hitting an issue? Check Troubleshooting and Benchmarks.