ray.serve.llm.deployments.VLLMService
class ray.serve.llm.deployments.VLLMService(llm_config: LLMConfig, *, engine_cls: Type[VLLMEngine] | None = None, image_retriever_cls: Type[ImageRetriever] | None = None, model_downloader: LoraModelLoader | None = None)
Bases: VLLMService
The implementation of the vLLM engine deployment.

To build a VLLMDeployment object, use the build_vllm_deployment function. For more control over the deployment class, a lower-level API is also exposed through the as_deployment method.

Examples
import ray
from ray import serve
from ray.serve.config import AutoscalingConfig
from ray.serve.llm.configs import LLMConfig, ModelLoadingConfig, DeploymentConfig
from ray.serve.llm.deployments import VLLMService
from ray.serve.llm.openai_api_models import ChatCompletionRequest

# Configure the model
llm_config = LLMConfig(
    model_loading_config=ModelLoadingConfig(
        served_model_name="llama-3.1-8b",
        model_source="meta-llama/Llama-3.1-8b-instruct",
    ),
    deployment_config=DeploymentConfig(
        autoscaling_config=AutoscalingConfig(
            min_replicas=1,
            max_replicas=8,
        )
    ),
)

# Build the deployment directly
VLLMDeployment = VLLMService.as_deployment(llm_config.get_serve_options())
vllm_app = VLLMDeployment.bind(llm_config)
model_handle = serve.run(vllm_app)

# Query the model via the `chat` API
request = ChatCompletionRequest(
    model="llama-3.1-8b",
    messages=[
        {
            "role": "user",
            "content": "Hello, world!"
        }
    ],
)
response = ray.get(model_handle.chat(request))
print(response)
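As an alternative to calling as_deployment yourself, the build_vllm_deployment function mentioned above can construct the equivalent Serve application. A minimal sketch, assuming the helper is importable from ray.serve.llm.builders and accepts the same LLMConfig (both assumptions, not confirmed by this page):

from ray import serve
from ray.serve.llm.builders import build_vllm_deployment  # assumed import path

# Reuse the llm_config defined above; the builder wraps the
# as_deployment(...) / bind(...) steps shown in the example.
vllm_app = build_vllm_deployment(llm_config)
model_handle = serve.run(vllm_app)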
PublicAPI (alpha): This API is in alpha and may change before becoming stable.
Methods

__init__
    Constructor of VLLMDeployment.
as_deployment
    Convert the VLLMService to a Ray Serve deployment.
chat
    Run a chat request to the vLLM engine and return the response.
check_health
    Check the health of the vLLM engine.
completions
    Run a completion request to the vLLM engine and return the response.
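For text completions rather than chat, the completions method is queried the same way as chat in the example above. A minimal sketch, assuming a CompletionRequest class is exposed alongside ChatCompletionRequest in ray.serve.llm.openai_api_models (the class name and its fields are assumptions, not confirmed by this page):

import ray
from ray.serve.llm.openai_api_models import CompletionRequest  # assumed counterpart to ChatCompletionRequest

# model_handle comes from serve.run(vllm_app) in the example above.
request = CompletionRequest(
    model="llama-3.1-8b",
    prompt="The capital of France is",
    max_tokens=16,
)
response = ray.get(model_handle.completions(request))
print(response)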