ray.serve.llm.deployments.LLMRouter

class ray.serve.llm.deployments.LLMRouter(llm_deployments: List[DeploymentHandle], *, _get_lora_model_metadata_func: Callable[[str, LLMConfig], Awaitable[Dict[str, Any]]] | None = None)

Bases: LLMRouter

The implementation of the OpenAI-compatible model router.

This deployment creates the following endpoints:
  • /v1/chat/completions: Chat interface (OpenAI-style)

  • /v1/completions: Text completion

  • /v1/models: List available models

  • /v1/models/{model}: Model information
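
Once the application is running, these endpoints can be reached over plain HTTP. The following is a minimal sketch, not part of the API itself: it assumes Serve's default HTTP address (http://localhost:8000), the model names used in the example below, and that the responses follow the standard OpenAI schema.

import requests

BASE_URL = "http://localhost:8000"  # Serve's default HTTP address (assumption)

# GET /v1/models lists every model registered with the router.
models = requests.get(f"{BASE_URL}/v1/models").json()
print([m["id"] for m in models["data"]])

# GET /v1/models/{model} returns metadata for a single model.
info = requests.get(f"{BASE_URL}/v1/models/llama-3.1-8b")
print(info.json())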

Examples

from ray import serve
from ray.serve.config import AutoscalingConfig
from ray.serve.llm.configs import LLMConfig, ModelLoadingConfig, DeploymentConfig
from ray.serve.llm.deployments import LLMRouter, VLLMDeployment
from ray.serve.llm.openai_api_models import ChatCompletionRequest


llm_config1 = LLMConfig(
    model_loading_config=ModelLoadingConfig(
        served_model_name="llama-3.1-8b",  # Name shown in /v1/models
        model_source="meta-llama/Llama-3.1-8b-instruct",
    ),
    deployment_config=DeploymentConfig(
        autoscaling_config=AutoscalingConfig(
            min_replicas=1, max_replicas=8,
        )
    ),
)
llm_config2 = LLMConfig(
    model_loading_config=ModelLoadingConfig(
        served_model_name="llama-3.2-3b",  # Name shown in /v1/models
        model_source="meta-llama/Llama-3.2-3b-instruct",
    ),
    deployment_config=DeploymentConfig(
        autoscaling_config=AutoscalingConfig(
            min_replicas=1, max_replicas=8,
        )
    ),
)

# Deploy the application
vllm_deployment1 = VLLMDeployment.as_deployment(llm_config1.get_serve_options()).bind(llm_config1)
vllm_deployment2 = VLLMDeployment.as_deployment(llm_config2.get_serve_options()).bind(llm_config2)
llm_app = LLMRouter.as_deployment().bind([vllm_deployment1, vllm_deployment2])
serve.run(llm_app)
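
After serve.run has started the application, the router can be queried with any OpenAI-compatible client. The snippet below is a sketch rather than part of the API: it assumes the openai Python package is installed, Serve's default HTTP address (http://localhost:8000), and that no authentication is configured, so the API key is only a placeholder.

from openai import OpenAI

# Point an OpenAI client at the Serve HTTP endpoint (default address assumed).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="placeholder")

# Chat completion via /v1/chat/completions.
chat = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(chat.choices[0].message.content)

# Text completion via /v1/completions.
completion = client.completions.create(
    model="llama-3.2-3b",
    prompt="The capital of France is",
)
print(completion.choices[0].text)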

PublicAPI (alpha): This API is in alpha and may change before becoming stable.

Methods

as_deployment

Converts this class to a Ray Serve deployment with ingress.