Batch Inference with LoRA Adapters
In this example, we show how to perform batch inference using the Ray Data LLM API with vLLM and a LoRA adapter.
To run this example, we need to install the following dependencies:
pip install -qU "ray[data]" "vllm==0.7.2"
import ray
from ray.data.llm import build_llm_processor, vLLMEngineProcessorConfig
# 1. Construct a vLLM processor config.
processor_config = vLLMEngineProcessorConfig(
    # The base model.
    model="unsloth/Llama-3.2-1B-Instruct",
    # vLLM engine config.
    engine_kwargs=dict(
        # Enable LoRA in the vLLM engine; otherwise you won't be able to
        # process requests with LoRA adapters.
        enable_lora=True,
        # Set the LoRA rank for the adapter. The LoRA rank is the value
        # of "r" in the adapter's LoRA config. If you want to use multiple
        # LoRA adapters in this pipeline, specify the maximum LoRA rank
        # among all of them.
        max_lora_rank=32,
        # The maximum number of LoRA adapters vLLM caches. "1" means
        # vLLM only caches one LoRA adapter at a time, so if your dataset
        # needs more than one LoRA adapter, there will be context
        # switching. Increasing max_loras reduces the context switching
        # but increases the memory footprint.
        max_loras=1,
        # Older GPUs (e.g., T4) don't support bfloat16. Remove this line
        # if you're using newer GPUs.
        dtype="half",
        # Reduce the model length to fit small GPUs. Remove this line
        # if you're using large GPUs.
        max_model_len=1024,
    ),
    # The batch size used in Ray Data.
    batch_size=16,
    # Use one GPU in this example.
    concurrency=1,
    # If you store LoRA adapters in S3, you can set the following path.
    # dynamic_lora_loading_path="s3://your-lora-bucket/",
)
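
# If you're unsure of an adapter's rank, you can read it from the adapter's
# LoRA config before setting max_lora_rank. A minimal sketch, assuming the
# adapter lives on the Hugging Face Hub and huggingface_hub is installed:
#
#   import json
#   from huggingface_hub import hf_hub_download
#
#   cfg_path = hf_hub_download("EdBergJr/Llama32_Baha_3", "adapter_config.json")
#   print(json.load(open(cfg_path))["r"])  # the LoRA rank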
# 2. Construct a processor using the processor config.
processor = build_llm_processor(
    processor_config,
    # Convert the input data to the OpenAI chat format.
    preprocess=lambda row: dict(
        # If you specify a "model" in a request that differs from the
        # model in the processor config, it's treated as a LoRA adapter.
        # The "model" here can be a LoRA adapter available on the
        # Hugging Face Hub or a local path.
        #
        # If you set dynamic_lora_loading_path, specify only the LoRA
        # path relative to dynamic_lora_loading_path.
        model="EdBergJr/Llama32_Baha_3",
        messages=[
            {"role": "system",
             "content": "You are a calculator. Please only output the answer "
                        "of the given equation."},
            {"role": "user", "content": f"{row['id']} ** 3 = ?"},
        ],
        sampling_params=dict(
            temperature=0.3,
            max_tokens=20,
            detokenize=False,
        ),
    ),
    # Only keep the generated text in the output dataset.
    postprocess=lambda row: {
        "resp": row["generated_text"],
    },
)
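
# Because "model" is set per request in preprocess, different rows can be
# routed to different LoRA adapters. A hypothetical sketch (the adapter
# names below are made up):
#
#   adapters = ["your-org/adapter-a", "your-org/adapter-b"]
#   preprocess=lambda row: dict(
#       model=adapters[row["id"] % 2],
#       ...
#   )
#
# If you do this, set max_lora_rank to the largest rank among the adapters,
# and consider raising max_loras to reduce context switching.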
# 3. Synthesize a dataset with 30 rows.
ds = ray.data.range(30)
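# Each row produced by ray.data.range(30) looks like {"id": 0} through
# {"id": 29}; the preprocess function above reads row["id"].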
# 4. Apply the processor to the dataset. Note that this line doesn't kick off
# any computation because the processor executes lazily.
ds = processor(ds)
# Materialization kicks off the pipeline execution.
ds = ds.materialize()
# 5. Print all outputs.
for out in ds.take_all():
    print(out)
    print("==========")
# 6. Shutdown Ray to release resources.
ray.shutdown()
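
Instead of printing each row, you can persist the outputs before shutting down Ray, for example with Ray Data's Parquet writer. A minimal sketch (the output path is a placeholder):

ds.write_parquet("/tmp/lora_batch_outputs")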