ray.serve.llm.build_openai_app

ray.serve.llm.build_openai_app(llm_serving_args: LLMServingArgs) → Application

Helper to build an OpenAI-compatible app with the LLM deployment setup from the given LLM serving args. This is the main entry point for creating a Serve application that serves LLMs.

Examples

Example usage in code.
from ray import serve
from ray.serve.llm import LLMConfig, LLMServingArgs, build_openai_app

llm_config1 = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1, max_replicas=2,
        )
    ),
    accelerator_type="A10G",
)

llm_config2 = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-1.5b",
        model_source="Qwen/Qwen2.5-1.5B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1, max_replicas=2,
        )
    ),
    accelerator_type="A10G",
)

# Deploy the application
llm_app = build_openai_app(
    LLMServingArgs(
        llm_configs=[
            llm_config1,
            llm_config2,
        ]
    )
)
serve.run(llm_app)


# Querying the model via openai client
from openai import OpenAI

# Initialize client
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

# Basic completion
response = client.chat.completions.create(
    model="qwen-0.5b",
    messages=[{"role": "user", "content": "Hello!"}]
)
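
Because the app exposes an OpenAI-compatible API, standard client features such as streaming also work. The following is a minimal sketch of a streaming request; it assumes the app and client from the example above are already set up.

# Streaming completion (sketch; assumes the deployment above is running)
stream = client.chat.completions.create(
    model="qwen-0.5b",
    messages=[{"role": "user", "content": "Tell me a short story."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)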
Example usage in YAML.
# config.yaml
applications:
- args:
    llm_configs:
        - model_loading_config:
            model_id: qwen-0.5b
            model_source: Qwen/Qwen2.5-0.5B-Instruct
          accelerator_type: A10G
          deployment_config:
            autoscaling_config:
                min_replicas: 1
                max_replicas: 2
        - model_loading_config:
            model_id: qwen-1.5b
            model_source: Qwen/Qwen2.5-1.5B-Instruct
          accelerator_type: A10G
          deployment_config:
            autoscaling_config:
                min_replicas: 1
                max_replicas: 2
  import_path: ray.serve.llm:build_openai_app
  name: llm_app
  route_prefix: "/"
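
A YAML config like the one above is typically launched with the Serve CLI, for example serve run config.yaml. Separately, as described under Parameters below, llm_configs can also be given as paths to LLM config files instead of inline LLMConfig objects. The snippet below sketches that form in Python; the file paths are hypothetical placeholders.

# Sketch: build the app from LLM config file paths instead of inline
# LLMConfig objects. The paths below are hypothetical placeholders.
from ray import serve
from ray.serve.llm import LLMServingArgs, build_openai_app

llm_app = build_openai_app(
    LLMServingArgs(
        llm_configs=[
            "./configs/qwen-0.5b.yaml",
            "./configs/qwen-1.5b.yaml",
        ]
    )
)
serve.run(llm_app)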
Parameters:

llm_serving_args – The LLM serving args: a list of LLM configs, or paths to LLM config files, from which to build the app.

Returns:

The configured Ray Serve Application with an OpenAI-compatible router.

PublicAPI (alpha): This API is in alpha and may change before becoming stable.