Serving LLMs#

Ray Serve LLM APIs let users deploy multiple LLM models together using a familiar Ray Serve API, while providing compatibility with the OpenAI API.

Features#

  • ⚡️ Automatic scaling and load balancing

  • 🌐 Unified multi-node multi-model deployment

  • 🔌 OpenAI compatible

  • 🔄 Multi-LoRA support with shared base models

  • 🚀 Engine-agnostic architecture (e.g., vLLM, SGLang, etc.)

Requirements#

pip install "ray[serve,llm]>=2.43.0" "vllm>=0.7.2"

# Suggested dependencies when using vllm 0.7.2:
pip install xgrammar==0.1.11 pynvml==12.0.0

Key Components#

The ray.serve.llm module provides two key deployment types for serving LLMs:

LLMServer#

The LLMServer sets up and manages the vLLM engine for model serving. It can be used standalone or combined with your own custom Ray Serve deployments.
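
For example, the following sketch deploys a single LLMServer as a standalone application. It assumes the build_llm_deployment helper from ray.serve.llm; the model ID, model source, and accelerator type are placeholders to adapt to your environment.

from ray import serve
from ray.serve.llm import LLMConfig, build_llm_deployment

# Placeholder model and hardware; swap in your own values.
llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    accelerator_type="A10G",
)

# Wrap LLMServer in a Ray Serve application. Instead of running it on its
# own, you can also compose the deployment with your own deployments and
# call it through a deployment handle.
app = build_llm_deployment(llm_config)
serve.run(app, blocking=True)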

OpenAiIngress#

This deployment provides an OpenAI-compatible FastAPI ingress and routes traffic to the appropriate model for multi-model services. The following endpoints are supported (see the example after this list):

  • /v1/chat/completions: Chat interface (ChatGPT-style)

  • /v1/completions: Text completion

  • /v1/embeddings: Text embeddings

  • /v1/models: List available models

  • /v1/models/{model}: Model information
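
The sketch below wires this ingress in front of a single model using the build_openai_app builder and then queries it with the standard OpenAI client. The model ID, model source, accelerator type, and placeholder API key are assumptions to adjust for your setup; additional LLMConfig objects can be appended to llm_configs to serve several models behind the same endpoints.

from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app
from openai import OpenAI

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",                       # name clients pass as "model"
        model_source="Qwen/Qwen2.5-0.5B-Instruct",  # HuggingFace repo or cloud path
    ),
    accelerator_type="A10G",  # placeholder; match your GPUs
)

# Build the OpenAI-compatible app and start it on the Serve HTTP proxy
# (http://localhost:8000 by default).
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)

# Query the service with the OpenAI client; the API key is not validated
# by default, but the client requires one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="qwen-0.5b",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)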

Configuration#

LLMConfig#

The LLMConfig class specifies model details such as the following (see the sketch after this list):

  • Model loading sources (HuggingFace or cloud storage)

  • Hardware requirements (accelerator type)

  • Engine arguments (e.g. vLLM engine kwargs)

  • LoRA multiplexing configuration

  • Serve auto-scaling parameters
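
As a rough illustration, the sketch below touches each of these areas for a single model. The field values (model IDs, the S3 adapter path, accelerator type, and autoscaling bounds) are placeholders, and the lora_config block is an assumption you can drop if you don't use LoRA adapters.

from ray.serve.llm import LLMConfig

llm_config = LLMConfig(
    # Model loading: model_id is the served name, model_source a
    # HuggingFace repo or a cloud storage path.
    model_loading_config=dict(
        model_id="llama-3.1-8b",
        model_source="meta-llama/Llama-3.1-8B-Instruct",
    ),
    # Hardware requirements for each replica.
    accelerator_type="A100",
    # Engine arguments forwarded to vLLM.
    engine_kwargs=dict(
        tensor_parallel_size=2,
        max_model_len=8192,
    ),
    # LoRA multiplexing on top of the shared base model (placeholder path).
    lora_config=dict(
        dynamic_lora_loading_path="s3://my-bucket/lora-adapters",
        max_num_adapters_per_replica=16,
    ),
    # Ray Serve autoscaling for the underlying deployment.
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1,
            max_replicas=4,
        ),
    ),
)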