ray.serve.llm.configs.LLMConfig#

pydantic model ray.serve.llm.configs.LLMConfig[source]#

The configuration for starting an LLM deployment.

PublicAPI (alpha): This API is in alpha and may change before becoming stable.

field accelerator_type: str [Required]#

The type of accelerator to run the model on. Only the following values are supported: [‘V100’, ‘P100’, ‘T4’, ‘P4’, ‘K80’, ‘A10G’, ‘L4’, ‘L40S’, ‘A100’, ‘H100’, ‘H200’, ‘Intel-GPU-Max-1550’, ‘Intel-GPU-Max-1100’, ‘Intel-GAUDI’, ‘AMD-Instinct-MI100’, ‘AMD-Instinct-MI250X’, ‘AMD-Instinct-MI250X-MI250’, ‘AMD-Instinct-MI210’, ‘AMD-Instinct-MI300X-OAM’, ‘AMD-Radeon-R9-200-HD-7900’, ‘AMD-Radeon-HD-7900’, ‘aws-neuron-core’, ‘TPU-V2’, ‘TPU-V3’, ‘TPU-V4’, ‘TPU-V5P’, ‘TPU-V5LITEPOD’, ‘TPU-V6E’, ‘A100-40G’, ‘A100-80G’]

field deployment_config: Dict[str, Any] [Optional]#

The Ray @serve.deployment options. See @serve.deployment for more details.
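
For example, a minimal sketch that overrides a few common deployment options; autoscaling_config and max_ongoing_requests are standard @serve.deployment options, though which keys apply to a given deployment should be checked against your Ray version.

from ray.serve.llm.configs import LLMConfig, ModelLoadingConfig

# A sketch: pass ordinary @serve.deployment options through deployment_config.
# autoscaling_config and max_ongoing_requests are standard Serve options;
# verify against your Ray version which keys are honored here.
llm_config = LLMConfig(
    model_loading_config=ModelLoadingConfig(model_id="test_model"),
    accelerator_type="L4",
    deployment_config={
        "autoscaling_config": {"min_replicas": 1, "max_replicas": 4},
        "max_ongoing_requests": 16,
    },
)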

field engine_kwargs: Dict[str, Any] = {}#

Additional keyword arguments for the engine. For vLLM, this includes all of the configuration knobs it provides out of the box, except for tensor parallelism, which is set automatically from the Ray Serve configs.
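
For example, a sketch that forwards a couple of engine arguments to vLLM; max_model_len and gpu_memory_utilization are standard vLLM engine arguments, but the accepted set of keys depends on the vLLM version in use.

from ray.serve.llm.configs import LLMConfig, ModelLoadingConfig

# Forwarded to the vLLM engine as-is. Tensor parallelism is managed by
# Ray Serve and should not be set here.
llm_config = LLMConfig(
    model_loading_config=ModelLoadingConfig(model_id="test_model"),
    accelerator_type="L4",
    engine_kwargs={
        "max_model_len": 8192,
        "gpu_memory_utilization": 0.9,
    },
)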

field llm_engine: str = 'VLLM'#

The LLMEngine that should be used to run the model. Only the following values are supported: [‘VLLM’]

field lora_config: LoraConfig | None = None#

Settings for the LoRA adapter.

field model_loading_config: ModelLoadingConfig [Required]#

The settings for how to download and expose the model.
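
For example, a sketch that serves a Hugging Face model under a short name; model_id is the name the model is exposed as, while model_source is assumed here to be the field that points at the weights to download (verify the field name against your Ray version).

from ray.serve.llm.configs import LLMConfig, ModelLoadingConfig

llm_config = LLMConfig(
    model_loading_config=ModelLoadingConfig(
        model_id="qwen-0.5b",
        # model_source is an assumed field name for the Hugging Face repo
        # (or local path) to load the weights from.
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    accelerator_type="L4",
)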

field runtime_env: Dict[str, Any] | None = None#

The runtime_env to use for the model deployment replica and the engine workers.

apply_checkpoint_info(model_id_or_path: str, trust_remote_code: bool = False) → None[source]#

Apply the checkpoint info to the model config.

get_engine_config()[source]#

Returns the engine config for the given LLM config.

An LLMConfig contains not only the engine config but also the deployment config and other settings; this method returns only the engine portion.

get_serve_options(*, name_prefix: str) → Dict[str, Any][source]#

Get the Serve options for the given LLM config.

This method generates the dictionary to pass to .options() when creating the deployment.

Examples

from ray import serve
from ray.serve.llm.configs import LLMConfig, ModelLoadingConfig
from ray.serve.llm.deployments import VLLMDeployment


llm_config = LLMConfig(
    model_loading_config=ModelLoadingConfig(model_id="test_model"),
    accelerator_type="L4",
    runtime_env={"env_vars": {"FOO": "bar"}},
)
serve_options = llm_config.get_serve_options(name_prefix="Test:")
vllm_app = VLLMDeployment.options(**serve_options).bind(llm_config)
serve.run(vllm_app)

Keyword Arguments:

name_prefix – The prefix to use for the deployment name.

Returns:

The dictionary to use in .options() when creating the deployment.

multiplex_config() → ServeMultiplexConfig[source]#
classmethod parse_yaml(file, **kwargs) → ModelT#
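
A minimal sketch of constructing an LLMConfig from a YAML file, assuming file accepts an open file object; the YAML keys mirror the field names documented above.

from ray.serve.llm.configs import LLMConfig

# config.yaml mirrors the field names above, for example:
#   model_loading_config:
#     model_id: test_model
#   accelerator_type: L4
with open("config.yaml") as f:
    llm_config = LLMConfig.parse_yaml(f)
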
ray_accelerator_type() → str[source]#

Converts the accelerator type to the Ray Core format.

validator validate_accelerator_type  »  accelerator_type[source]#
validator validate_deployment_config  »  deployment_config[source]#

Validates the deployment config dictionary.

validator validate_llm_engine  »  llm_engine[source]#

Validates the llm_engine string value.

property input_modality: str#

Returns the input modality of the model. More modalities may be supported in the future. Currently, if the model does not support vision, the input modality is assumed to be text.

property max_request_context_length: int | None#
property model_id: str#
property prompt_format: HuggingFacePromptFormat#
property supports_vision: bool#