ray.serve.llm.LLMServer

class ray.serve.llm.LLMServer(**kwargs)

Bases: LLMServer
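LLMServer is usually not instantiated by hand; it runs inside the replicas of a Ray Serve deployment that is built from an LLMConfig. The snippet below is a minimal sketch of that pattern using the ray.serve.llm helpers LLMConfig and build_openai_app; the model ID, model source, and autoscaling values are placeholders you would replace for your own setup.

    from ray import serve
    from ray.serve.llm import LLMConfig, build_openai_app

    # Describe the model to serve; the IDs and autoscaling bounds below
    # are illustrative placeholders.
    llm_config = LLMConfig(
        model_loading_config=dict(
            model_id="qwen-0.5b",
            model_source="Qwen/Qwen2.5-0.5B-Instruct",
        ),
        deployment_config=dict(
            autoscaling_config=dict(min_replicas=1, max_replicas=2),
        ),
    )

    # Build an OpenAI-compatible Serve app whose replicas run LLMServer,
    # then deploy it.
    app = build_openai_app({"llm_configs": [llm_config]})
    serve.run(app)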

Methods

chat

Runs a chat request to the LLM engine and returns the response.

check_health

Check the health of the replica.

collective_rpc

Execute a collective RPC call on all workers.

completions

Runs a completion request to the LLM engine and returns the response.

detokenize

Detokenize the input token IDs.

embeddings

Runs an embeddings request to the engine and returns the response.

is_paused

Check whether the engine is currently paused.

is_sleeping

Check whether the engine is currently sleeping.

pause

Pause generation on the engine.

reset_prefix_cache

Reset the KV prefix cache on the engine.

resume

Resume generation on the engine after pause.

score

Runs a score request to the engine and returns the response.

sleep

Put the engine to sleep.

start

Start the underlying engine.

start_profile

Start profiling on the engine.

stop_profile

Stop profiling on the engine.

sync_init

Synchronous constructor that returns an unstarted instance.

tokenize

Tokenize the input text.

transcriptions

Runs a transcriptions request to the engine and returns the response.

wakeup

Wake up the engine from sleep mode.
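When the deployment is exposed through the OpenAI-compatible router, the chat and completions methods listed above are typically reached over HTTP rather than called directly. The following sketch queries a locally running Serve app with the openai Python client; the base URL, API key, and model name are assumptions that depend on how the app was configured.

    from openai import OpenAI

    # The Serve proxy listens on port 8000 by default; the API key is a
    # dummy value unless authentication has been configured.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

    response = client.chat.completions.create(
        model="qwen-0.5b",  # must match the model_id in the LLMConfig
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(response.choices[0].message.content)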