Fractional GPU serving#

Serve multiple small models on the same GPU for cost-efficient deployments.

Note

This feature hasn’t been extensively tested in production. If you encounter any issues, report them on GitHub with reproducible code.

Fractional GPU allocation allows you to run multiple model replicas on a single GPU by customizing placement groups. This approach maximizes GPU utilization and reduces costs when serving small models that don’t require a full GPU’s resources.

When to use fractional GPUs#

Consider fractional GPU allocation when:

  • You’re serving small models with low concurrency whose weights and KV cache don’t require a full GPU.

  • You have multiple models that fit this profile.

Deploy with fractional GPU allocation#

The following example shows how to serve 8 replicas of a small model on 4 L4 GPUs (2 replicas per GPU):

from ray.serve.llm import LLMConfig, ModelLoadingConfig
from ray.serve.llm import build_openai_app
from ray import serve


llm_config = LLMConfig(
    model_loading_config=ModelLoadingConfig(
        model_id="HuggingFaceTB/SmolVLM-256M-Instruct",
    ),
    engine_kwargs=dict(
        gpu_memory_utilization=0.4,
        use_tqdm_on_load=False,
        enforce_eager=True,
        max_model_len=2048,
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=8, max_replicas=8,
        )
    ),
    accelerator_type="L4",
    # Set fraction of GPU for each replica
    placement_group_config=dict(bundles=[dict(GPU=0.49)]),
    runtime_env=dict(
        env_vars={
            # Must match the GPU fraction in placement_group_config
            "VLLM_RAY_PER_WORKER_GPUS": "0.49",
            "VLLM_DISABLE_COMPILE_CACHE": "1",
        },
    ),
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)

Configuration parameters#

Use the following parameters to configure fractional GPU allocation. The placement group configuration is required; the memory management and performance settings are vLLM-specific optimizations that you can adjust for your model and workload.

Placement group configuration#

  • placement_group_config: Specifies the GPU fraction each replica uses. Set GPU to the fraction (for example, 0.49 for approximately half a GPU). Use slightly less than the theoretical fraction to account for system overhead—this headroom prevents out-of-memory errors.

  • VLLM_RAY_PER_WORKER_GPUS: Environment variable that tells vLLM GPU workers to claim the specified fraction of GPU resources. Its value must match the GPU fraction in placement_group_config; see the sketch after this list.
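
Because these two values must agree, it can help to derive both from a single variable. The following is a minimal sketch of that pattern, assuming 2 replicas per GPU; the replicas_per_gpu and gpu_fraction names are illustrative, not part of the Ray Serve API:

from ray.serve.llm import LLMConfig, ModelLoadingConfig

# Pack 2 replicas per GPU, shaving off a small margin for headroom.
replicas_per_gpu = 2
gpu_fraction = round(1.0 / replicas_per_gpu - 0.01, 2)  # 0.49

llm_config = LLMConfig(
    model_loading_config=ModelLoadingConfig(
        model_id="HuggingFaceTB/SmolVLM-256M-Instruct",
    ),
    accelerator_type="L4",
    # Reuse the same fraction in both places so they can't drift apart.
    placement_group_config=dict(bundles=[dict(GPU=gpu_fraction)]),
    runtime_env=dict(
        env_vars={
            "VLLM_RAY_PER_WORKER_GPUS": str(gpu_fraction),
            "VLLM_DISABLE_COMPILE_CACHE": "1",
        },
    ),
)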

Memory management#

  • gpu_memory_utilization: Controls how much GPU memory vLLM pre-allocates. vLLM allocates memory based on this setting regardless of Ray’s GPU scheduling. In the example, 0.4 means vLLM targets 40% of GPU memory for the model weights, KV cache, and CUDA graph memory.

Performance settings#

  • enforce_eager: Set to True to disable CUDA graphs and reduce memory overhead.

  • max_model_len: Limits the maximum sequence length, reducing memory requirements.

  • use_tqdm_on_load: Set to False to disable progress bars during model loading.

Best practices#

Calculate GPU allocation#

  • Leave headroom: Use slightly less than the theoretical fraction (for example, 0.49 instead of 0.5) to account for system overhead.

  • Match memory to workload: Ensure gpu_memory_utilization × GPU memory × number of replicas per GPU doesn’t exceed the GPU’s total memory (see the worked check after this list).

  • Account for all memory: Consider model weights, KV cache, CUDA graphs, and framework overhead.
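
As a rough sanity check for the example above (an L4 GPU has 24 GB of memory, gpu_memory_utilization is 0.4, and there are 2 replicas per GPU), you can do the arithmetic explicitly. This is a back-of-the-envelope sketch; the fits_on_gpu helper is illustrative, not a Ray or vLLM API:

def fits_on_gpu(gpu_memory_gb, gpu_memory_utilization, replicas_per_gpu, headroom=0.05):
    """Rough check that the combined vLLM allocations leave some headroom."""
    used_gb = gpu_memory_gb * gpu_memory_utilization * replicas_per_gpu
    return used_gb <= gpu_memory_gb * (1.0 - headroom)

# L4 example: 24 GB x 0.4 x 2 replicas = 19.2 GB, which fits with headroom to spare.
print(fits_on_gpu(24, 0.4, 2))  # True
# A third replica at the same setting would need 28.8 GB and not fit.
print(fits_on_gpu(24, 0.4, 3))  # False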

Optimize for your models#

  • Test memory requirements: Profile your model’s actual memory usage before setting gpu_memory_utilization. vLLM typically prints this information during engine initialization.

  • Start conservative: Begin with fewer replicas per GPU and increase gradually while monitoring memory usage.

  • Monitor OOM errors: Watch for out-of-memory errors that indicate you need to reduce replicas or lower gpu_memory_utilization.

Production considerations#

  • Validate performance: Test throughput and latency with your actual workload before production deployment.

  • Consider autoscaling carefully: Fractional GPU deployments work best with fixed replica counts rather than autoscaling (see the snippet after this list).
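
For reference, the deployment example above pins the replica count by setting min_replicas equal to max_replicas. The snippet below isolates that piece of the configuration:

deployment_config = dict(
    autoscaling_config=dict(
        # Fixed replica count: autoscaling never adds or removes replicas.
        min_replicas=8,
        max_replicas=8,
    )
)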

Troubleshooting#

Out of memory errors#

  • Reduce gpu_memory_utilization (for example, from 0.4 to 0.3)

  • Decrease the number of replicas per GPU

  • Lower max_model_len to reduce KV cache size

  • Set enforce_eager=True if it isn’t already enabled, so CUDA graph memory requirements don’t cause issues (the sketch after this list combines these adjustments)
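
For example, starting from the deployment above, a tightened set of engine arguments might look like the following; the specific values are illustrative starting points, not recommendations:

engine_kwargs = dict(
    gpu_memory_utilization=0.3,  # down from 0.4
    max_model_len=1024,          # down from 2048 to shrink the KV cache
    enforce_eager=True,          # keep CUDA graphs disabled
    use_tqdm_on_load=False,
)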

Replicas fail to start#

  • Verify that your fractional allocation matches your replica count (for example, 2 replicas per GPU with GPU=0.49 each); see the check after this list

  • Check that VLLM_RAY_PER_WORKER_GPUS matches placement_group_config GPU value

  • Ensure your model size is appropriate for fractional GPU allocation
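
One quick way to catch a mismatch before deploying is to compare the total fractional GPU demand against what the cluster reports. This is a minimal sketch using ray.cluster_resources(), with the replica count and fraction from the example above:

import ray

ray.init(ignore_reinit_error=True)

num_replicas = 8
gpu_fraction = 0.49  # must match placement_group_config and VLLM_RAY_PER_WORKER_GPUS

total_gpus = ray.cluster_resources().get("GPU", 0)
needed_gpus = num_replicas * gpu_fraction
print(f"Requesting {needed_gpus:.2f} GPUs; cluster reports {total_gpus:.2f}")
assert needed_gpus <= total_gpus, "Not enough GPU capacity for this replica count"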

Resource contention issues#

  • Ensure VLLM_DISABLE_COMPILE_CACHE=1 is set to avoid torch.compile cache conflicts

  • Check Ray logs for resource allocation errors

  • Verify placement group configuration is applied correctly
