ray.serve.llm.LLMServer.chat#

async LLMServer.chat(request: ChatCompletionRequest, raw_request_info: RawRequestInfo | None = None) → AsyncGenerator[List[str | ErrorResponse] | ChatCompletionResponse, None][source]#

Runs a chat request to the LLM engine and returns the response.

Parameters:
  • request – A ChatCompletionRequest object.

  • raw_request_info – Optional RawRequestInfo containing data from the original HTTP request.

Returns:

An AsyncGenerator over the response. If stream is True and batching is enabled, the generator yields lists of chat streaming responses (strings of the form data: {response_json}\n\n). Otherwise, it yields the ChatCompletionResponse object directly.
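
The sketch below shows one way to call chat() through a Ray Serve deployment handle and consume the returned async generator. The deployment name, app name, model id, and the ChatCompletionRequest import path are assumptions for illustration and may differ in your setup and Ray version; only the chat() call and streaming handle semantics follow this reference.

```python
from ray import serve

# Assumed import path for the OpenAI-style request model; adjust to your Ray version.
from ray.serve.llm.openai_api_models import ChatCompletionRequest


async def run_chat():
    # Assumes an LLMServer deployment named "llm_server" is already running
    # in an app named "llm_app" (both names are hypothetical).
    handle = serve.get_deployment_handle("llm_server", app_name="llm_app")

    request = ChatCompletionRequest(
        model="my-model",  # hypothetical model id
        messages=[{"role": "user", "content": "Hello!"}],
        stream=True,
    )

    # chat() returns an async generator. With stream=True, each item is either
    # a list of SSE-formatted chunks ("data: {response_json}\n\n") or an
    # ErrorResponse; with stream=False it is a single ChatCompletionResponse.
    async for chunk in handle.options(stream=True).chat.remote(request):
        print(chunk)
```

When calling through a handle, stream=True must be set on the handle via .options(stream=True) so the generator items are streamed back as they are produced rather than collected into a single result.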