ray.serve.llm.LLMServer
class ray.serve.llm.LLMServer(llm_config: LLMConfig, *, engine_cls: Type[VLLMEngine] | None = None, image_retriever_cls: Type[ImageRetriever] | None = None, model_downloader: LoraModelLoader | None = None)
Bases: LLMServer
The implementation of the vLLM engine deployment.
To build a Deployment object, use the build_llm_deployment function. We also expose a lower-level API for more control over the deployment class through the as_deployment method.

Examples
```python
import ray
from ray import serve
from ray.serve.llm import LLMConfig, LLMServer

# Configure the model
llm_config = LLMConfig(
    model_loading_config=dict(
        served_model_name="llama-3.1-8b",
        model_source="meta-llama/Llama-3.1-8b-instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1,
            max_replicas=8,
        )
    ),
)

# Build the deployment directly
LLMDeployment = LLMServer.as_deployment(llm_config.get_serve_options())
llm_app = LLMDeployment.bind(llm_config)
model_handle = serve.run(llm_app)

# Query the model via the `chat` API
from ray.serve.llm.openai_api_models import ChatCompletionRequest

request = ChatCompletionRequest(
    model="llama-3.1-8b",
    messages=[
        {
            "role": "user",
            "content": "Hello, world!",
        }
    ],
)
response = ray.get(model_handle.chat(request))
print(response)
```
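For the recommended higher-level path, a minimal sketch of building the same application with build_llm_deployment might look like the following. This is a sketch under the assumption that build_llm_deployment accepts an LLMConfig and returns a bound Serve application; consult the build_llm_deployment reference for the exact signature.

```python
from ray import serve
from ray.serve.llm import LLMConfig, build_llm_deployment

llm_config = LLMConfig(
    model_loading_config=dict(
        served_model_name="llama-3.1-8b",
        model_source="meta-llama/Llama-3.1-8b-instruct",
    ),
)

# Assumption: build_llm_deployment wraps LLMServer.as_deployment() and .bind(),
# returning a ready-to-run Serve application for this config.
llm_app = build_llm_deployment(llm_config)
model_handle = serve.run(llm_app)
```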
PublicAPI (alpha): This API is in alpha and may change before becoming stable.
Methods
__init__: Constructor of LLMServer.
as_deployment: Convert the LLMServer to a Ray Serve deployment.
chat: Run a chat request against the vLLM engine and return the response.
check_health: Check the health of the vLLM engine.
completions: Run a completion request against the vLLM engine and return the response (see the sketch after this list).
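To illustrate the completions method listed above, the sketch below mirrors the chat example. It assumes a CompletionRequest model exists alongside ChatCompletionRequest in ray.serve.llm.openai_api_models; the exact request class and its fields may differ between Ray versions.

```python
import ray
from ray.serve.llm.openai_api_models import CompletionRequest  # assumed location

# model_handle is the deployment handle returned by serve.run() in the example above.
request = CompletionRequest(
    model="llama-3.1-8b",
    prompt="Write a haiku about distributed systems.",
    max_tokens=64,
)

# completions() mirrors chat(): it forwards the request to the vLLM engine
# and returns an OpenAI-style completion response.
response = ray.get(model_handle.completions(request))
print(response)
```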