ray.serve.llm.LLMServer.completions
- async LLMServer.completions(request: CompletionRequest, raw_request_info: RawRequestInfo | None = None) → AsyncGenerator[List[str | ErrorResponse] | CompletionResponse, None]
Runs a completion request to the LLM engine and returns the response.
- Parameters:
request – A CompletionRequest object.
raw_request_info – Optional RawRequestInfo containing data from the original HTTP request.
- Returns:
An AsyncGenerator over the response. If stream is True and batching is enabled, the generator yields lists of completion streaming responses (SSE-formatted strings of the form data: {response_json}\n\n). Otherwise, it yields the CompletionResponse object directly.
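For context, a minimal sketch of driving this method through a Ray Serve deployment handle. The model id, the model_loading_config contents, and the import path for CompletionRequest are illustrative assumptions; verify them against your Ray release.

```python
from ray import serve
from ray.serve.llm import LLMConfig, build_llm_deployment

# Assumed import path for the OpenAI-style request model; check your Ray version.
from ray.llm._internal.serve.configs.openai_api_models import CompletionRequest

# Illustrative config: the model id and source are placeholders.
llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="my-model",
        model_source="my-org/my-model",
    ),
)

handle = serve.run(build_llm_deployment(llm_config))

request = CompletionRequest(
    model="my-model",
    prompt="The capital of France is",
    stream=True,
)

# completions() is an async generator on the server, so the handle call is
# made in streaming mode. With stream=True each yielded item is a list of
# SSE-formatted strings ("data: {response_json}\n\n"); with stream=False a
# single CompletionResponse is yielded instead.
for chunk in handle.options(stream=True).completions.remote(request):
    print(chunk)
```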