ray.serve.llm.LLMConfig#

pydantic model ray.serve.llm.LLMConfig[source]#

The configuration for starting an LLM deployment.

PublicAPI (alpha): This API is in alpha and may change before becoming stable.
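Example (a minimal sketch following the public Ray Serve LLM examples; the model ID, model source, and autoscaling values are illustrative):

    from ray import serve
    from ray.serve.llm import LLMConfig, build_openai_app

    llm_config = LLMConfig(
        model_loading_config=dict(
            model_id="qwen-0.5b",  # ID exposed to clients
            model_source="Qwen/Qwen2.5-0.5B-Instruct",  # Hugging Face model to load
        ),
        deployment_config=dict(
            autoscaling_config=dict(min_replicas=1, max_replicas=2),
        ),
        accelerator_type="A10G",
    )

    # Build an OpenAI-compatible ingress app from the config and deploy it.
    app = build_openai_app({"llm_configs": [llm_config]})
    serve.run(app)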

field accelerator_type: str | None = None#

The type of accelerator to run the model on. Only the following values are supported: ['V100', 'P100', 'T4', 'P4', 'K80', 'A10G', 'L4', 'L40S', 'A100', 'H100', 'H200', 'H20', 'B200', 'Intel-GPU-Max-1550', 'Intel-GPU-Max-1100', 'Intel-GAUDI', 'AMD-Instinct-MI100', 'AMD-Instinct-MI250X', 'AMD-Instinct-MI250X-MI250', 'AMD-Instinct-MI210', 'AMD-Instinct-MI300A', 'AMD-Instinct-MI300X-OAM', 'AMD-Instinct-MI300X-HF', 'AMD-Instinct-MI308X', 'AMD-Instinct-MI325X-OAM', 'AMD-Instinct-MI350X-OAM', 'AMD-Instinct-MI355X-OAM', 'AMD-Radeon-R9-200-HD-7900', 'AMD-Radeon-HD-7900', 'aws-neuron-core', 'TPU-V2', 'TPU-V3', 'TPU-V4', 'TPU-V5P', 'TPU-V5LITEPOD', 'TPU-V6E', 'Ascend910B', 'Ascend910B4', 'A100-40G', 'A100-80G']

field deployment_config: Dict[str, Any] [Optional]#

The Ray @serve.deployment options. Supported fields are: name, num_replicas, ray_actor_options, max_ongoing_requests, autoscaling_config, max_queued_requests, user_config, health_check_period_s, health_check_timeout_s, graceful_shutdown_wait_loop_s, graceful_shutdown_timeout_s, logging_config, request_router_config. For more details, see the Ray Serve documentation.

field engine_kwargs: Dict[str, Any] = {}#

Additional keyword arguments for the engine. In the case of vLLM, this includes all of the configuration knobs vLLM provides out of the box, except for tensor parallelism, which is set automatically from the Ray Serve configuration.
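For example (a sketch; these are standard vLLM engine arguments, and their availability depends on the installed vLLM version):

    engine_kwargs = dict(
        max_model_len=8192,          # cap the context window
        gpu_memory_utilization=0.9,  # fraction of GPU memory vLLM may claim
        enable_chunked_prefill=True,
    )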

field experimental_configs: Dict[str, Any] [Optional]#

Experimental configurations for Ray Serve LLM, passed as a dictionary of key-value pairs. Currently supported keys:

- stream_batching_interval_ms: Ray Serve LLM batches streaming requests together; this config decides how long to wait for a batch before processing the requests. Defaults to 50.0.
- num_ingress_replicas: The number of replicas for the router. When multiple models are deployed, Ray Serve uses the maximum value across all model configs. By default, 2 router replicas are created per model replica.
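For example (a sketch using the keys documented above; the values are illustrative):

    experimental_configs = dict(
        stream_batching_interval_ms=50.0,  # wait up to 50 ms before flushing a streaming batch
        num_ingress_replicas=4,            # pin the router to 4 replicas
    )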

field llm_engine: str = 'vLLM'#

The LLMEngine that should be used to run the model. Only the following values are supported: ['vLLM']

field log_engine_metrics: bool | None = True#

Enable additional engine metrics via the Ray Prometheus port. Defaults to True.

field lora_config: Dict[str, Any] | LoraConfig | None = None#

Settings for the LoRA adapter. Validated against LoraConfig.
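For example (a sketch; the field names follow LoraConfig, and the S3 path is hypothetical):

    lora_config = dict(
        # Prefix that holds one subdirectory per LoRA adapter.
        dynamic_lora_loading_path="s3://my-bucket/lora_adapters",
        max_num_adapters_per_replica=16,
    )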

field model_loading_config: Dict[str, Any] | ModelLoadingConfig [Required]#

The settings for how to download and expose the model. Validated against ModelLoadingConfig.

field placement_group_config: Dict[str, Any] | None = None#

Ray placement group configuration for scheduling vLLM engine workers. Defines resource bundles and placement strategy for multi-node deployments. Should contain 'bundles' (list of resource dicts) and optionally 'strategy' (defaults to 'PACK'). Example: {'bundles': [{'GPU': 1, 'CPU': 2}], 'strategy': 'PACK'}

field runtime_env: Dict[str, Any] | None = None#

The runtime_env to use for the model deployment replica and the engine workers.
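For example (a sketch using standard Ray runtime_env keys; the environment variable and package are illustrative):

    runtime_env = dict(
        env_vars={"HF_HUB_OFFLINE": "0"},
        pip=["datasets"],  # extra dependency installed on each replica and worker
    )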

apply_checkpoint_info(model_id_or_path: str, trust_remote_code: bool = False) → None[source]#

Apply the checkpoint info to the model config.

get_engine_config() → None | VLLMEngineConfig[source]#

Returns the engine config for the given LLM config.

An LLMConfig holds not only the engine config but also the deployment config, etc.; this method returns only the engine portion.

multiplex_config() → ServeMultiplexConfig[source]#
classmethod parse_yaml(file, **kwargs) → ModelT#
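Example (a sketch; assumes file accepts an open file object whose YAML contents use the same fields as the constructor, and the path is hypothetical):

    from ray.serve.llm import LLMConfig

    with open("llm_config.yaml") as f:
        llm_config = LLMConfig.parse_yaml(f)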
setup_engine_backend()[source]#
update_engine_kwargs(**kwargs: Any) → None[source]#

Update the engine_kwargs on this config and on the underlying engine config.

This is typically called during engine startup, when certain engine_kwargs (e.g., data_parallel_rank) become available.
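For example (a sketch; data_parallel_rank is the kwarg named above, and the value is illustrative):

    llm_config.update_engine_kwargs(data_parallel_rank=0)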

validator validate_accelerator_type  »  accelerator_type[source]#
validator validate_deployment_config  »  deployment_config[source]#

Validates the deployment config dictionary.

validator validate_experimental_configs  »  experimental_configs[source]#

Validates the experimental configs dictionary.

validator validate_llm_engine  »  llm_engine[source]#

Validates the llm_engine string value.

validator validate_lora_config  »  lora_config[source]#

Validates the lora config dictionary.

validator validate_model_loading_config  »  model_loading_config[source]#

Validates the model loading config dictionary.

property input_modality: str#

Returns the input modality of the model. More modalities may be added in the future. For now, if the model doesn't support vision, the input modality is assumed to be text.

property max_request_context_length: int | None#
property model_architecture: str#
property model_id: str#
property supports_vision: bool#