ray.serve.llm.LLMServer.chat
- async LLMServer.chat(request: ChatCompletionRequest) → AsyncGenerator[List[str | ErrorResponse] | ChatCompletionResponse, None]
Sends a chat request to the LLM engine and returns the response.
- Args:
request: A ChatCompletionRequest object.
- Returns:
An AsyncGenerator of the response. If stream is True and batching is enabled, the generator yields lists of chat streaming responses (strings of the format `data: {response_json}\n\n`). Otherwise, it yields the ChatCompletionResponse object directly.
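For illustration, here is a minimal sketch of consuming this method. It assumes an already-initialized LLMServer instance; the import path for ChatCompletionRequest, the model id "my-model", and the helper `run_chat` are assumptions and may differ in your Ray version.

```python
from ray.serve.llm import LLMServer

# NOTE: this import path for ChatCompletionRequest is an assumption; in your
# Ray version the request model may be exposed from a different module.
from ray.serve.llm.openai_api_models import ChatCompletionRequest


async def run_chat(server: LLMServer) -> None:
    # `server` is assumed to be an already-initialized LLMServer; building
    # one (engine config, model source, etc.) is out of scope for this sketch.
    request = ChatCompletionRequest(
        model="my-model",  # assumed model id served by this deployment
        messages=[{"role": "user", "content": "Hello!"}],
        stream=True,
    )

    # Per the signature above, chat() is awaited to obtain the
    # AsyncGenerator, which is then iterated. (In some Ray versions chat()
    # may itself be an async generator you iterate directly.)
    gen = await server.chat(request)
    async for chunk in gen:
        if isinstance(chunk, list):
            # stream=True with batching enabled: a list of SSE-formatted
            # strings ("data: {response_json}\n\n") and/or ErrorResponse
            # objects.
            for item in chunk:
                print(item, end="")
        else:
            # Non-streaming: a single ChatCompletionResponse object.
            print(chunk)
```

With stream=False, the generator yields a single ChatCompletionResponse, so the `else` branch above handles the non-streaming case.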