ray.serve.llm.LLMServer
class ray.serve.llm.LLMServer(**kwargs)
Bases: LLMServerMethods
Runs a chat request to the LLM engine and returns the response.
Check the health of the replica.
Runs a completion request to the LLM engine and returns the response.
Runs an embeddings request to the engine and returns the response.
Resets the prefix cache of the underlying engine.
Runs a score request to the engine and returns the response.
Start the underlying engine.
Start profiling.
Stop profiling.
Synchronous constructor that returns an unstarted instance.
Runs a transcriptions request to the engine and returns the response.
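
The summaries above describe a two-phase lifecycle: a synchronous constructor returns an unstarted instance, the engine is then started, and only after that do health checks and requests make sense. The sketch below illustrates that pattern with a hypothetical stand-in class, `ToyServer`. It is not Ray's implementation. The method names `start`, `check_health`, and `chat` are assumptions inferred from the descriptions, and the echo response is a placeholder for a real engine call.

```python
import asyncio


class ToyServer:
    """Hypothetical stand-in for illustration; NOT ray.serve.llm.LLMServer."""

    def __init__(self):
        # Synchronous construction only records state; nothing is started yet.
        self.started = False

    async def start(self):
        # Placeholder for asynchronous engine startup.
        await asyncio.sleep(0)
        self.started = True

    async def check_health(self):
        # Raise if the engine was never started, so callers can detect it.
        if not self.started:
            raise RuntimeError("engine not started")

    async def chat(self, prompt):
        # Refuse to serve requests on an unstarted instance.
        await self.check_health()
        return f"echo: {prompt}"  # placeholder for a real engine response


async def main():
    server = ToyServer()          # unstarted instance
    await server.start()          # start the (toy) engine
    await server.check_health()   # passes only after start()
    return await server.chat("hello")


result = asyncio.run(main())
print(result)
```

Keeping construction separate from startup lets a caller build the object in synchronous code and defer the expensive, asynchronous engine initialization to an event loop.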