ray.serve.llm.deployments.VLLMService#

class ray.serve.llm.deployments.VLLMService(llm_config: LLMConfig, *, engine_cls: Type[VLLMEngine] | None = None, image_retriever_cls: Type[ImageRetriever] | None = None, model_downloader: LoraModelLoader | None = None)[source]#

Bases: VLLMService

The implementation of the VLLM engine deployment.

To build a VLLMDeployment object, use the build_vllm_deployment function. For more control over the deployment class, a lower-level API is also exposed through the as_deployment method.
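
For reference, a minimal sketch of the build_vllm_deployment path (this assumes the function is importable from ray.serve.llm.builders and accepts the LLMConfig as its argument; verify the module path and signature against your installed Ray version):

from ray import serve
from ray.serve.llm.configs import LLMConfig, ModelLoadingConfig
from ray.serve.llm.builders import build_vllm_deployment

# Same model configuration as in the example below
llm_config = LLMConfig(
    model_loading_config=ModelLoadingConfig(
        served_model_name="llama-3.1-8b",
        model_source="meta-llama/Llama-3.1-8b-instruct",
    ),
)

# build_vllm_deployment returns a Ray Serve application wrapping this service
vllm_app = build_vllm_deployment(llm_config)
model_handle = serve.run(vllm_app)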

Examples

import ray
from ray import serve
from ray.serve.config import AutoscalingConfig
from ray.serve.llm.configs import LLMConfig, ModelLoadingConfig, DeploymentConfig
from ray.serve.llm.deployments import VLLMService
from ray.serve.llm.openai_api_models import ChatCompletionRequest

# Configure the model
llm_config = LLMConfig(
    model_loading_config=ModelLoadingConfig(
        served_model_name="llama-3.1-8b",
        model_source="meta-llama/Llama-3.1-8b-instruct",
    ),
    deployment_config=DeploymentConfig(
        autoscaling_config=AutoscalingConfig(
            min_replicas=1,
            max_replicas=8,
        )
    ),
)

# Build the deployment directly
VLLMDeployment = VLLMService.as_deployment(llm_config.get_serve_options())
vllm_app = VLLMDeployment.bind(llm_config)

model_handle = serve.run(vllm_app)

# Query the model via the `chat` API
request = ChatCompletionRequest(
    model="llama-3.1-8b",
    messages=[
        {
            "role": "user",
            "content": "Hello, world!"
        }
    ]
)
response = ray.get(model_handle.chat.remote(request))
print(response)

PublicAPI (alpha): This API is in alpha and may change before becoming stable.

Methods

__init__

Constructor of VLLMService.

as_deployment

Convert the VLLMService to a Ray Serve deployment.

chat

Runs a chat request through the vLLM engine and returns the response.

check_health

Check the health of the vLLM engine.

completions

Runs a completion request through the vLLM engine and returns the response.
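
A completions call mirrors the chat example above. A minimal sketch, assuming a CompletionRequest class is exposed alongside ChatCompletionRequest in ray.serve.llm.openai_api_models and that model_handle is the handle returned by serve.run in the example above:

import ray
from ray.serve.llm.openai_api_models import CompletionRequest

# Plain-text completion request against the same served model
request = CompletionRequest(
    model="llama-3.1-8b",
    prompt="Hello, world!",
)
response = ray.get(model_handle.completions.remote(request))
print(response)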