Fractional GPU serving#
Serve multiple small models on the same GPU for cost-efficient deployments.
Note
This feature hasn’t been extensively tested in production. If you encounter any issues, report them on GitHub with reproducible code.
Fractional GPU allocation allows you to run multiple model replicas on a single GPU by customizing placement groups. This approach maximizes GPU utilization and reduces costs when serving small models that don’t require a full GPU’s resources.
When to use fractional GPUs#
Consider fractional GPU allocation when:
- You're serving small models with low concurrency that don't require a full GPU for model weights and KV cache.
- You have multiple models that fit this profile.
Deploy with fractional GPU allocation#
The following example shows how to serve 8 replicas of a small model on 4 L4 GPUs (2 replicas per GPU):
```python
from ray import serve
from ray.serve.llm import LLMConfig, ModelLoadingConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=ModelLoadingConfig(
        model_id="HuggingFaceTB/SmolVLM-256M-Instruct",
    ),
    engine_kwargs=dict(
        gpu_memory_utilization=0.4,
        use_tqdm_on_load=False,
        enforce_eager=True,
        max_model_len=2048,
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=8,
            max_replicas=8,
        )
    ),
    accelerator_type="L4",
    # Set the fraction of a GPU that each replica claims.
    placement_group_config=dict(bundles=[dict(GPU=0.49)]),
    runtime_env=dict(
        env_vars={
            # Must match the GPU fraction in placement_group_config.
            "VLLM_RAY_PER_WORKER_GPUS": "0.49",
            "VLLM_DISABLE_COMPILE_CACHE": "1",
        },
    ),
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)
```
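After the deployment is running, you can query it with any OpenAI-compatible client. The following sketch assumes the Serve application is reachable at the default `http://localhost:8000` address and that the `openai` package is installed; the API key is a placeholder because the endpoint doesn't enforce authentication by default.

```python
# Minimal client sketch; adjust base_url for your deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="FAKE_KEY")

response = client.chat.completions.create(
    model="HuggingFaceTB/SmolVLM-256M-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```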
Configuration parameters#
Use the following parameters to configure fractional GPU allocation. The placement group configuration is required; the memory management and performance settings are vLLM-specific optimizations that you can adjust for your model and workload.
Placement group configuration#
`placement_group_config`
: Specifies the GPU fraction each replica uses. Set `GPU` to the fraction (for example, `0.49` for approximately half a GPU). Use slightly less than the theoretical fraction to account for system overhead; this headroom prevents out-of-memory errors.

`VLLM_RAY_PER_WORKER_GPUS`
: Environment variable that tells vLLM GPU workers to claim the specified fraction of GPU resources. This value must match the `GPU` fraction in `placement_group_config`; the sketch below shows one way to keep the two in sync.
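Because these two settings must agree, a minimal sketch derives both from a single constant so they can't drift apart (`0.49` is an example value, not a requirement):

```python
# Define the fraction once; reuse it for the placement group and the env var.
GPU_FRACTION = 0.49

placement_group_config = dict(bundles=[dict(GPU=GPU_FRACTION)])
env_vars = {
    "VLLM_RAY_PER_WORKER_GPUS": str(GPU_FRACTION),
    "VLLM_DISABLE_COMPILE_CACHE": "1",
}
```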
Memory management#
`gpu_memory_utilization`
: Controls how much GPU memory vLLM pre-allocates. vLLM allocates memory based on this setting regardless of Ray's GPU scheduling. In the example, `0.4` means vLLM targets 40% of GPU memory for the model weights, KV cache, and CUDA graph memory. The worked example below shows the arithmetic for the example configuration.
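As a back-of-the-envelope check, assume an L4 with 24 GB of memory (substitute your GPU's actual capacity):

```python
# Worked example for the configuration above: 2 replicas per L4 GPU.
gpu_memory_gb = 24            # L4 capacity; an illustrative assumption
gpu_memory_utilization = 0.4
replicas_per_gpu = 2

per_replica_gb = gpu_memory_utilization * gpu_memory_gb  # 9.6 GB per replica
total_gb = per_replica_gb * replicas_per_gpu             # 19.2 GB in use
headroom_gb = gpu_memory_gb - total_gb                   # 4.8 GB of headroom
print(f"{per_replica_gb=} {total_gb=} {headroom_gb=}")
```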
Performance settings#
`enforce_eager`
: Set to `True` to disable CUDA graphs and reduce memory overhead.

`max_model_len`
: Limits the maximum sequence length, reducing memory requirements.

`use_tqdm_on_load`
: Set to `False` to disable progress bars during model loading.
Workarounds#
`VLLM_DISABLE_COMPILE_CACHE`
: Set to `1` to avoid a resource contention issue among workers during torch compile caching.
Best practices#
Calculate GPU allocation#
- **Leave headroom**: Use slightly less than the theoretical fraction (for example, `0.49` instead of `0.5`) to account for system overhead.
- **Match memory to workload**: Ensure `gpu_memory_utilization` × GPU memory × number of replicas per GPU doesn't exceed total GPU memory; see the sketch after this list.
- **Account for all memory**: Consider model weights, KV cache, CUDA graphs, and framework overhead.
Optimize for your models#
- **Test memory requirements**: Profile your model's actual memory usage before setting `gpu_memory_utilization`. vLLM often prints this information during engine initialization.
- **Start conservative**: Begin with fewer replicas per GPU and increase gradually while monitoring memory usage.
- **Monitor OOM errors**: Watch for out-of-memory errors that indicate you need to reduce replicas or lower `gpu_memory_utilization`.
Production considerations#
- **Validate performance**: Test throughput and latency with your actual workload before production deployment.
- **Consider autoscaling carefully**: Fractional GPU deployments work best with fixed replica counts rather than autoscaling, as shown below.
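One way to pin the replica count, as the main example does, is to collapse the autoscaling bounds to a single value:

```python
# Fixed replica count: min and max bounds are equal, so the deployment
# never scales up or down.
deployment_config = dict(
    autoscaling_config=dict(min_replicas=8, max_replicas=8),
)
```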
Troubleshooting#
Out of memory errors#
- Reduce `gpu_memory_utilization` (for example, from `0.4` to `0.3`); see the adjusted settings after this list.
- Decrease the number of replicas per GPU.
- Lower `max_model_len` to reduce KV cache size.
- Set `enforce_eager=True` if it isn't already, so that CUDA graph memory requirements don't cause issues.
Replicas fail to start#
- Verify that your fractional allocation matches your replica count (for example, 2 replicas with `GPU=0.49` each); the check after this list catches the most common mismatch.
- Check that `VLLM_RAY_PER_WORKER_GPUS` matches the `GPU` value in `placement_group_config`.
- Ensure your model size is appropriate for fractional GPU allocation.
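A quick sanity check on the raw configuration values before constructing the `LLMConfig` (a sketch using the dict shapes from the example above):

```python
placement_group_config = dict(bundles=[dict(GPU=0.49)])
env_vars = {"VLLM_RAY_PER_WORKER_GPUS": "0.49"}

# The placement group fraction and the vLLM worker fraction must agree.
assert placement_group_config["bundles"][0]["GPU"] == float(
    env_vars["VLLM_RAY_PER_WORKER_GPUS"]
), "GPU fraction mismatch between placement group and VLLM_RAY_PER_WORKER_GPUS"
```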
Resource contention issues#
- Ensure `VLLM_DISABLE_COMPILE_CACHE=1` is set to avoid torch compile caching conflicts.
- Check Ray logs for resource allocation errors.
- Verify that the placement group configuration is applied correctly.
See also#
- Quickstart - Basic LLM deployment examples
- Ray placement groups - Ray Core placement group documentation