SGLang integration#
Ray Serve LLM provides an OpenAI-compatible API that integrates with SGLang via the server_cls parameter on LLMConfig. Most engine_kwargs that work with SGLang’s standalone server also work here, giving you SGLang’s feature set through Ray Serve’s distributed deployment capabilities.
The integration uses SGLangServer, a custom server class that wraps SGLang’s in-process engine and exposes chat, completions, embeddings, tokenize, and detokenize endpoints through the standard Ray Serve LLM protocol.
This compatibility means you can:
Use SGLang’s RadixAttention and other optimizations with Ray Serve’s production features
Deploy SGLang models with autoscaling, multi-model serving, and advanced routing
Serve models across multiple nodes with tensor and pipeline parallelism
Note
Community SGLang support is in early development. Track progress and provide feedback at ray-project/ray#61114.
Prerequisites#
pip install ray[serve,llm] "sglang[all,ray]"
Set the following environment variable before running any example:
CUDA:
RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=0
ROCm:
RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES=0
Online serving (single node)#
Deploy a single-node SGLang model with autoscaling. The server_cls parameter tells Ray Serve LLM to use the SGLangServer instead of the default vLLM engine.
from ray.llm._internal.serve.engines.sglang import SGLangServer
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config={
        "model_id": "Llama-3.1-8B-Instruct",
        "model_source": "unsloth/Llama-3.1-8B-Instruct",
    },
    deployment_config={
        "autoscaling_config": {
            "min_replicas": 1,
            "max_replicas": 2,
        }
    },
    server_cls=SGLangServer,
    engine_kwargs={
        "trust_remote_code": True,
        "model_path": "unsloth/Llama-3.1-8B-Instruct",
        "tp_size": 1,
        "mem_fraction_static": 0.8,
    },
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)
Query the deployment with the OpenAI Python client:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

# Chat completions
print("=== Chat Completions ===")
chat_response = client.chat.completions.create(
    model="Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print(chat_response.choices[0].message.content)

# Text completions
print("\n=== Text Completions ===")
completion_response = client.completions.create(
    model="Llama-3.1-8B-Instruct",
    prompt="San Francisco is a",
    temperature=0,
    max_tokens=30,
)
print(completion_response.choices[0].text)
Or query with curl:

# Chat completions
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "List 3 countries and their capitals."}],
    "temperature": 0,
    "max_tokens": 64
  }'

# Text completions
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-3.1-8B-Instruct",
    "prompt": "San Francisco is a",
    "max_tokens": 30,
    "temperature": 0
  }'
Save the deployment code as serve_sglang_example.py and run:
RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=0 serve run serve_sglang_example:app
Online serving (multi-node with TP+PP)#
Deploy a large model across multiple nodes using tensor parallelism (TP=4) and pipeline parallelism (PP=2). This requires 2 nodes with 4 GPUs each (8 GPUs total).
Setting placement_group_strategy to "PACK" fills GPUs on each node before moving to the next, so with 2 nodes of 4 GPUs each, every node hosts one complete pipeline stage. The SGLangServer.get_deployment_options() method constructs the placement group from the placement_group_config.
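To see why PACK yields one pipeline stage per node, here is a toy pure-Python sketch of the greedy fill-first-node behavior (this is an illustration of the strategy, not Ray's actual scheduler):

```python
# Toy illustration of PACK semantics: place each bundle on the first
# node that still has free capacity before spilling to the next node.
def pack(num_bundles, node_capacity):
    placement = {node: [] for node in node_capacity}
    for i in range(num_bundles):
        for node, cap in node_capacity.items():
            if len(placement[node]) < cap:
                placement[node].append(i)
                break
    return placement

# 8 single-GPU bundles across 2 nodes with 4 GPUs each.
result = pack(8, {"node-A": 4, "node-B": 4})
print(result)  # {'node-A': [0, 1, 2, 3], 'node-B': [4, 5, 6, 7]}
```

Bundles 0-3 land on the first node and 4-7 on the second, which is why each node ends up holding exactly one of the two pipeline stages (with TP=4 within each stage).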
from ray.llm._internal.serve.engines.sglang import SGLangServer
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config={
        "model_id": "Llama-3.1-70B-Instruct",
        "model_source": "meta-llama/Llama-3.1-70B-Instruct",
    },
    deployment_config={
        "autoscaling_config": {
            "min_replicas": 1,
            "max_replicas": 2,
            "target_ongoing_requests": 4,
        }
    },
    # PACK fills GPUs on each node before moving to the next.
    # With 8 bundles across 2 nodes (4 GPUs each), each node gets 4 bundles.
    placement_group_config={
        "placement_group_bundles": [{"CPU": 1, "GPU": 1}] + [{"GPU": 1}] * 7,
        "placement_group_strategy": "PACK",
    },
    server_cls=SGLangServer,
    engine_kwargs={
        "model_path": "meta-llama/Llama-3.1-70B-Instruct",
        "tp_size": 4,
        "pp_size": 2,
        "mem_fraction_static": 0.8,
    },
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)
Save the deployment code as serve_sglang_multinode_example.py and run:
RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=0 serve run serve_sglang_multinode_example:app
Limitations#
The following SGLang features are available upstream but not yet integrated into Ray Serve LLM. Community contributions are welcome:
Engine replicas: Multiple engine replicas within a single deployment. See ray-project/ray#62480.
Observability: Engine-level metrics (e.g. KV cache utilization, request queue depth).
Prefill disaggregation: Separating prefill and decode phases across different workers.
Wide EP: Wide expert parallelism for Mixture-of-Experts models.
Elastic EP: Fault-tolerant expert parallelism with dynamic rank health tracking.
Transcriptions and score: The /v1/audio/transcriptions and /v1/score endpoints.
Signal handling#
SGLang’s in-process engine overrides Python signal handlers on startup. The SGLangServer.__init__ includes a workaround that saves and restores signal handlers around engine initialization. If you encounter issues with graceful shutdown, this is a known area of friction.
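The workaround follows a standard Python pattern: snapshot the handlers for the signals the engine may override before initialization, then restore them afterwards. A minimal sketch of that pattern (with_preserved_signal_handlers and noisy_init are illustrative names, not the actual SGLangServer code):

```python
import signal

def with_preserved_signal_handlers(init_fn):
    # Snapshot current handlers for signals an embedded engine commonly
    # overrides, run the initialization, then restore the originals.
    preserved = [signal.SIGINT, signal.SIGTERM]
    saved = {sig: signal.getsignal(sig) for sig in preserved}
    try:
        return init_fn()
    finally:
        for sig, handler in saved.items():
            signal.signal(sig, handler)

# Usage: a stand-in init function that clobbers SIGINT, as the
# in-process engine does on startup.
def noisy_init():
    signal.signal(signal.SIGINT, signal.SIG_IGN)
    return "engine"

before = signal.getsignal(signal.SIGINT)
engine = with_preserved_signal_handlers(noisy_init)
assert signal.getsignal(signal.SIGINT) is before  # handler restored
```

Because signal.signal only works in the main thread, this kind of save/restore has to happen where the engine is constructed, which is why it lives in the server's initializer.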
See also#
Quickstart examples - Basic LLM deployment examples
Cross-node parallelism - Cross-node parallelism with placement groups