ray.serve.llm.LLMServer
class ray.serve.llm.LLMServer(**kwargs)
Bases: LLMServer
Methods
Convert the LLMServer to a Ray Serve deployment.
Runs a chat request to the LLM engine and returns the response.
Check the health of the replica.
Runs a completion request to the LLM engine and returns the response.
Runs an embeddings request to the engine and returns the response.
Reset the prefix cache of the underlying engine.
Runs a score request to the engine and returns the response.
Start the underlying engine.
Start profiling.
Stop profiling.
Synchronous constructor that returns an unstarted instance.
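
The entries above describe a two-step lifecycle: a synchronous constructor returns an unstarted instance, and a separate call starts the underlying engine before health checks can pass. The following is a minimal sketch of that pattern only. `ToyServer` and its method names (`create_unstarted`, `start`, `check_health`) are hypothetical stand-ins chosen for illustration; they are not the Ray Serve API, and the real method names and signatures should be taken from the Ray documentation.

```python
import asyncio


class ToyServer:
    """Hypothetical stand-in; NOT the Ray Serve LLMServer implementation."""

    def __init__(self):
        # No engine work happens at construction time.
        self.started = False

    @classmethod
    def create_unstarted(cls):
        # Synchronous constructor: builds the object without starting an engine.
        return cls()

    async def start(self):
        # Stand-in for starting the underlying engine (assumed to be async).
        await asyncio.sleep(0)
        self.started = True

    async def check_health(self):
        # Raise if the engine was never started, as a failing health check would.
        if not self.started:
            raise RuntimeError("engine not started")


async def lifecycle():
    server = ToyServer.create_unstarted()
    before = server.started      # state right after synchronous construction
    await server.start()         # explicit start step
    await server.check_health()  # passes only once started
    return before, server.started


result = asyncio.run(lifecycle())
```

Separating construction from startup in this way lets calling code create the object in a synchronous context and defer the expensive, asynchronous engine startup to a point where an event loop is available.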