User guides

How-to guides for deploying, scaling, and operating Ray Serve LLM. If you are new, start with the Quickstart, then come back here to go deeper.

Configure and deploy

  • Configuration reference: every LLMConfig field, from model loading and engine kwargs to accelerators, placement, and deployment options.

  • Deployment initialization: speed up model loading and replica startup with caching, streaming load formats, and initialization callbacks.

  • Multi-LoRA deployment: serve many LoRA adapters on a shared base model with runtime switching and an LRU cache.

Scale across GPUs and nodes

Optimize latency and throughput

Choose an engine

  • vLLM compatibility: use vLLM features such as embeddings, structured outputs, vision, and reasoning through Ray Serve LLM.

  • SGLang integration: run SGLang as the inference engine instead of vLLM.

Operate in production