ray.serve.llm.deployments.LLMRouter
class ray.serve.llm.deployments.LLMRouter(llm_deployments: List[DeploymentHandle], *, _get_lora_model_metadata_func: Callable[[str, LLMConfig], Awaitable[Dict[str, Any]]] | None = None)
Bases: LLMRouter
The implementation of the OpenAI-compatible model router.
This deployment creates the following endpoints:
- /v1/chat/completions: Chat interface (OpenAI-style)
- /v1/completions: Text completion
- /v1/models: List available models
- /v1/models/{model}: Model information
Examples
from ray import serve
from ray.serve.config import AutoscalingConfig
from ray.serve.llm.configs import LLMConfig, ModelLoadingConfig, DeploymentConfig
from ray.serve.llm.deployments import VLLMDeployment, LLMRouter

llm_config1 = LLMConfig(
    model_loading_config=ModelLoadingConfig(
        served_model_name="llama-3.1-8b",  # Name shown in /v1/models
        model_source="meta-llama/Llama-3.1-8b-instruct",
    ),
    deployment_config=DeploymentConfig(
        autoscaling_config=AutoscalingConfig(
            min_replicas=1,
            max_replicas=8,
        )
    ),
)

llm_config2 = LLMConfig(
    model_loading_config=ModelLoadingConfig(
        served_model_name="llama-3.2-3b",  # Name shown in /v1/models
        model_source="meta-llama/Llama-3.2-3b-instruct",
    ),
    deployment_config=DeploymentConfig(
        autoscaling_config=AutoscalingConfig(
            min_replicas=1,
            max_replicas=8,
        )
    ),
)

# Build one model deployment per config, then route them through LLMRouter.
vllm_deployment1 = VLLMDeployment.as_deployment(llm_config1.get_serve_options()).bind(llm_config1)
vllm_deployment2 = VLLMDeployment.as_deployment(llm_config2.get_serve_options()).bind(llm_config2)

# Deploy the application
llm_app = LLMRouter.as_deployment().bind([vllm_deployment1, vllm_deployment2])
serve.run(llm_app)
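Once the application is running, the endpoints listed above can be exercised with any OpenAI-compatible client. The following is a minimal sketch, assuming Serve's default HTTP address of http://localhost:8000 and the official openai Python client; the API key is a placeholder on the assumption that no authentication is configured in front of the router.

from openai import OpenAI

# Point the OpenAI client at the running Serve application.
# Assumption: Serve listens on its default address and no auth is configured,
# so the API key value is arbitrary.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="placeholder")

# List the models registered with the router (the served_model_name values).
for model in client.models.list():
    print(model.id)

# Send a chat completion request to one of the served models.
response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)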
PublicAPI (alpha): This API is in alpha and may change before becoming stable.
Methods
as_deployment(): Converts this class to a Ray Serve deployment with ingress.