ray.serve.llm.LLMConfig#
- pydantic model ray.serve.llm.LLMConfig[source]#
The configuration for starting an LLM deployment.
PublicAPI (alpha): This API is in alpha and may change before becoming stable.
- field accelerator_type: str | None = None#
The type of accelerator to run the model on. Only the following values are supported: ['V100', 'P100', 'T4', 'P4', 'K80', 'A10G', 'L4', 'L40S', 'A100', 'H100', 'H200', 'H20', 'B200', 'Intel-GPU-Max-1550', 'Intel-GPU-Max-1100', 'Intel-GAUDI', 'AMD-Instinct-MI100', 'AMD-Instinct-MI250X', 'AMD-Instinct-MI250X-MI250', 'AMD-Instinct-MI210', 'AMD-Instinct-MI300A', 'AMD-Instinct-MI300X-OAM', 'AMD-Instinct-MI300X-HF', 'AMD-Instinct-MI308X', 'AMD-Instinct-MI325X-OAM', 'AMD-Instinct-MI350X-OAM', 'AMD-Instinct-MI355X-OAM', 'AMD-Radeon-R9-200-HD-7900', 'AMD-Radeon-HD-7900', 'aws-neuron-core', 'TPU-V2', 'TPU-V3', 'TPU-V4', 'TPU-V5P', 'TPU-V5LITEPOD', 'TPU-V6E', 'Ascend910B', 'Ascend910B4', 'A100-40G', 'A100-80G']
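For illustration, a minimal sketch pinning replicas to one of the supported accelerator types; the model names below are placeholders, not recommendations:

```python
from ray.serve.llm import LLMConfig

config = LLMConfig(
    # model_loading_config is required; values here are placeholders.
    model_loading_config=dict(model_id="my-model", model_source="facebook/opt-125m"),
    # Must be one of the supported values listed above.
    accelerator_type="A10G",
)
```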
- field deployment_config: Dict[str, Any] [Optional]#
The Ray @serve.deployment options. Supported fields are: name, num_replicas, ray_actor_options, max_ongoing_requests, autoscaling_config, max_queued_requests, user_config, health_check_period_s, health_check_timeout_s, graceful_shutdown_wait_loop_s, graceful_shutdown_timeout_s, logging_config, request_router_config. For more details, see the Ray Serve documentation.
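For illustration, a minimal sketch that sets a few of the supported deployment options (model names are placeholders):

```python
from ray.serve.llm import LLMConfig

config = LLMConfig(
    model_loading_config=dict(model_id="my-model", model_source="facebook/opt-125m"),
    deployment_config=dict(
        # Scale between 1 and 4 replicas based on load.
        autoscaling_config=dict(min_replicas=1, max_replicas=4),
        # Cap in-flight requests per replica.
        max_ongoing_requests=64,
    ),
)
```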
- field engine_kwargs: Dict[str, Any] = {}#
Additional keyword arguments for the engine. In the case of vLLM, this includes all the configuration knobs vLLM provides out of the box, except for tensor parallelism, which is set automatically from the Ray Serve configs.
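For illustration, a sketch passing common vLLM knobs through engine_kwargs; the exact set of supported arguments depends on your vLLM version:

```python
from ray.serve.llm import LLMConfig

config = LLMConfig(
    model_loading_config=dict(model_id="my-model", model_source="facebook/opt-125m"),
    engine_kwargs=dict(
        max_model_len=8192,          # cap the context window
        gpu_memory_utilization=0.9,  # fraction of GPU memory vLLM may use
    ),
)
```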
- field experimental_configs: Dict[str, Any] [Optional]#
Experimental configurations for Ray Serve LLM, given as a dictionary of key-value pairs. Currently supported keys are:
- stream_batching_interval_ms: Ray Serve LLM batches streaming requests together. This config decides how long to wait for a batch before processing the requests. Defaults to 50.0.
- num_ingress_replicas: The number of replicas for the router. Ray Serve takes the maximum value across all models; the default is 2 router replicas per model replica.
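For illustration, a sketch setting both supported keys (the values are arbitrary):

```python
from ray.serve.llm import LLMConfig

config = LLMConfig(
    model_loading_config=dict(model_id="my-model", model_source="facebook/opt-125m"),
    experimental_configs=dict(
        stream_batching_interval_ms=100.0,  # wait up to 100 ms to form a streaming batch
        num_ingress_replicas=4,             # fixed number of router replicas
    ),
)
```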
- field llm_engine: str = 'vLLM'#
The LLMEngine that should be used to run the model. Only the following values are supported: ['vLLM']
- field log_engine_metrics: bool | None = True#
Enable additional engine metrics via the Ray Prometheus port. Defaults to True.
- field lora_config: Dict[str, Any] | LoraConfig | None = None#
Settings for the LoRA adapter. Validated against LoraConfig.
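For illustration, a sketch enabling dynamic LoRA loading. The field names dynamic_lora_loading_path and max_num_adapters_per_replica follow LoraConfig, and the bucket path is a placeholder:

```python
from ray.serve.llm import LLMConfig

config = LLMConfig(
    model_loading_config=dict(model_id="my-model", model_source="facebook/opt-125m"),
    lora_config=dict(
        # Remote folder holding one subfolder of adapter weights per LoRA ID (placeholder path).
        dynamic_lora_loading_path="s3://my-bucket/loras",
        max_num_adapters_per_replica=16,
    ),
)
```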
- field model_loading_config: Dict[str, Any] | ModelLoadingConfig [Required]#
The settings for how to download and expose the model. Validated against ModelLoadingConfig.
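For illustration, a minimal sketch: model_id is the name exposed to clients, while model_source points at the weights (a Hugging Face repo ID here; local or remote paths also work):

```python
from ray.serve.llm import LLMConfig

config = LLMConfig(
    model_loading_config=dict(
        model_id="my-model",               # name clients use in requests
        model_source="facebook/opt-125m",  # where to download the weights from
    ),
)
```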
- field placement_group_config: Dict[str, Any] | None = None#
Ray placement group configuration for scheduling vLLM engine workers. Defines resource bundles and a placement strategy for multi-node deployments. Should contain 'bundles' (a list of resource dicts) and optionally 'strategy' (defaults to 'PACK'). Example: {'bundles': [{'GPU': 1, 'CPU': 2}], 'strategy': 'PACK'}
- field runtime_env: Dict[str, Any] | None = None#
The runtime_env to use for the model deployment replica and the engine workers.
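For illustration, a sketch forwarding an environment variable to the deployment replica and engine workers via a standard Ray runtime_env (the token value is a placeholder):

```python
from ray.serve.llm import LLMConfig

config = LLMConfig(
    model_loading_config=dict(model_id="my-model", model_source="facebook/opt-125m"),
    runtime_env=dict(
        env_vars={"HF_TOKEN": "<your-token>"},  # e.g., for gated Hugging Face models
    ),
)
```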
- apply_checkpoint_info(model_id_or_path: str, trust_remote_code: bool = False) None [source]#
Apply the checkpoint info to the model config.
- get_engine_config() None | VLLMEngineConfig [source]#
Returns the engine config derived from this LLM config.
An LLMConfig holds not only the engine config but also the deployment config and other settings.
- classmethod parse_yaml(file, **kwargs) ModelT #
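For illustration, a sketch assuming file accepts an open file handle, per the signature above; llm_config.yaml is a hypothetical file whose keys mirror the fields of this model:

```python
from ray.serve.llm import LLMConfig

# Hypothetical llm_config.yaml:
#   model_loading_config:
#     model_id: my-model
#     model_source: facebook/opt-125m
with open("llm_config.yaml") as f:
    config = LLMConfig.parse_yaml(f)
```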
- update_engine_kwargs(**kwargs: Any) None [source]#
Update engine_kwargs on both this config and its engine_config.
This is typically called during engine startup, when certain engine_kwargs (e.g., data_parallel_rank) become available.
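For illustration, a sketch mirroring the data_parallel_rank example above:

```python
# config is an LLMConfig instance, as in the earlier examples.
# Inject an engine kwarg that only becomes known at engine startup.
config.update_engine_kwargs(data_parallel_rank=0)
```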
- validator validate_accelerator_type » accelerator_type[source]#
- validator validate_deployment_config » deployment_config[source]#
Validates the deployment config dictionary.
- validator validate_experimental_configs » experimental_configs[source]#
Validates the experimental configs dictionary.
- validator validate_llm_engine » llm_engine[source]#
Validates the llm_engine string value.
- validator validate_lora_config » lora_config[source]#
Validates the lora config dictionary.
- validator validate_model_loading_config » model_loading_config[source]#
Validates the model loading config dictionary.