ray.serve.llm.LLMServer#
class ray.serve.llm.LLMServer(**kwargs)[source]#
Bases: LLMServerMethods
Methods (summary):
- Runs a chat request to the LLM engine and returns the response.
- Check the health of the replica.
- Execute a collective RPC call on all workers.
- Runs a completion request to the LLM engine and returns the response.
- Detokenize the input token IDs.
- Runs an embeddings request to the engine and returns the response.
- Check whether the engine is currently paused.
- Check whether the engine is currently sleeping.
- Pause generation on the engine.
- Reset the KV prefix cache on the engine.
- Resume generation on the engine after pause.
- Runs a score request to the engine and returns the response.
- Put the engine to sleep.
- Start the underlying engine.
- Start profiling.
- Stop profiling.
- Synchronous constructor that returns an unstarted instance.
- Tokenize the input text.
- Runs a transcription request to the engine and returns the response.
- Wake up the engine from sleep mode.
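LLMServer is typically not instantiated directly; it backs the deployments produced by the higher-level Ray Serve LLM builder API. A minimal deployment sketch, assuming the documented `LLMConfig` / `build_openai_app` pattern from `ray.serve.llm` (the model ID, model source, and autoscaling values below are illustrative placeholders, and running this requires a Ray cluster with appropriate accelerators):

```python
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

# Placeholder config: swap in your own model_id, model_source,
# and autoscaling bounds for a real deployment.
llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=2),
    ),
)

# build_openai_app wires LLMServer replicas behind an
# OpenAI-compatible HTTP ingress (chat, completions, etc.).
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)
```

Once deployed, the chat, completions, embeddings, and other methods listed above are exercised through the OpenAI-compatible endpoints that the app exposes.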