ray.data.llm.SGLangEngineProcessorConfig#

class ray.data.llm.SGLangEngineProcessorConfig(*, batch_size: int = 32, resources_per_bundle: Dict[str, float] | None = None, accelerator_type: str | None = None, concurrency: int | Tuple[int, int] = 1, experimental: Dict[str, Any] = None, model_source: str, runtime_env: Dict[str, Any] | None = None, max_pending_requests: int | None = None, max_concurrent_batches: int = 8, should_continue_on_error: bool = False, apply_chat_template: bool = True, chat_template: str | None = None, tokenize: bool = True, detokenize: bool = True, has_image: bool = False, chat_template_stage: Any = True, tokenize_stage: Any = True, detokenize_stage: Any = True, prepare_image_stage: Any = False, prepare_multimodal_stage: Any = False, engine_kwargs: Dict[str, Any] = None, task_type: Literal['generate'] = 'generate')[source]#

The configuration for the SGLang engine processor.

Parameters:

model_source – The model source to use for the SGLang engine.
batch_size – The batch size to send to the SGLang engine. Large batch sizes are likely to saturate the compute resources and could achieve higher throughput. On the other hand, small batch sizes are more fault-tolerant and could reduce bubbles in the data pipeline. You can tune the batch size to balance the throughput and fault-tolerance based on your use case.
engine_kwargs – The kwargs to pass to the SGLang engine. Default engine kwargs are tp_size: 1, dp_size: 1, skip_tokenizer_init: True.
task_type – The task type to use. If not specified, will use ‘generate’ by default.
runtime_env – The runtime environment to use for the SGLang engine. See this doc for more details.
max_pending_requests – The maximum number of pending requests. If not specified, will use the default value from the SGLang engine.
max_concurrent_batches – The maximum number of concurrent batches in the engine. This is to overlap the batch processing to avoid the tail latency of each batch. The default value may not be optimal when the batch size or the batch processing latency is too small, but it should be good enough for batch size >= 64.
chat_template_stage – Chat templating stage config (bool | dict | ChatTemplateStageConfig). Defaults to True. Use nested config for per-stage control over batch_size, concurrency, runtime_env, num_cpus, and memory. Legacy apply_chat_template and chat_template fields are deprecated but still supported.
tokenize_stage – Tokenizer stage config (bool | dict | TokenizerStageConfig). Defaults to True. Use nested config for per-stage control over batch_size, concurrency, runtime_env, num_cpus, memory, and model_source. Legacy tokenize field is deprecated but still supported.
detokenize_stage – Detokenizer stage config (bool | dict | DetokenizeStageConfig). Defaults to True. Use nested config for per-stage control over batch_size, concurrency, runtime_env, num_cpus, memory, and model_source. Legacy detokenize field is deprecated but still supported.
accelerator_type – The accelerator type used by the LLM stage in a processor. Default to None, meaning that only the CPU will be used.
concurrency – The number of workers for data parallelism. Default to 1. If concurrency is a tuple (m, n), Ray creates an autoscaling actor pool that scales between m and n workers (1 <= m <= n). If concurrency is an int n, both CPU and GPU stages use an autoscaling pool from (1, n). Stage-specific concurrency can be set via nested stage configs.

Examples

import ray
from ray.data.llm import SGLangEngineProcessorConfig, build_processor

config = SGLangEngineProcessorConfig(
    model_source="meta-llama/Meta-Llama-3.1-8B-Instruct",
    engine_kwargs=dict(
        dtype="half",
    ),
    concurrency=1,
    batch_size=64,
)
processor = build_processor(
    config,
    preprocess=lambda row: dict(
        messages=[
            {"role": "system", "content": "You are a calculator"},
            {"role": "user", "content": f"{row['id']} ** 3 = ?"},
        ],
        sampling_params=dict(
            temperature=0.3,
            max_new_tokens=20,
        ),
    ),
    postprocess=lambda row: dict(
        resp=row["generated_text"],
    ),
)

ds = ray.data.range(300)
ds = processor(ds)
for row in ds.take_all():
    print(row)

PublicAPI (beta): This API is in beta and may change before becoming stable.

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'protected_namespaces': (), 'validate_assignment': True}#: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'accelerator_type': FieldInfo(annotation=Union[str, NoneType], required=False, description='The accelerator type used by the LLM stage in a processor. Default to None, meaning that only the CPU will be used.'), 'apply_chat_template': FieldInfo(annotation=bool, required=False, default=True, description='[DEPRECATED] Prefer `chat_template_stage`. Whether to apply chat template.'), 'batch_size': FieldInfo(annotation=int, required=False, default=32, description='Large batch sizes are likely to saturate the compute resources and could achieve higher throughput. On the other hand, small batch sizes are more fault-tolerant and could reduce bubbles in the data pipeline. You can tune the batch size to balance the throughput and fault-tolerance based on your use case. Defaults to 32.'), 'chat_template': FieldInfo(annotation=Union[str, NoneType], required=False, description='[DEPRECATED] Prefer `chat_template_stage.chat_template`. The chat template to use.'), 'chat_template_stage': FieldInfo(annotation=Any, required=False, default=True, description='Chat templating stage config (bool | dict | ChatTemplateStageConfig).'), 'concurrency': FieldInfo(annotation=Union[int, Tuple[int, int]], required=False, default=1, description='The number of workers for data parallelism. Default to 1. If ``concurrency`` is a ``tuple`` ``(m, n)``, Ray creates an autoscaling actor pool that scales between ``m`` and ``n`` workers (``1 <= m <= n``). If ``concurrency`` is an ``int`` ``n``, Ray uses either a fixed pool of ``n`` workers or an autoscaling pool from ``1`` to ``n`` workers, depending on the processor and stage.'), 'detokenize': FieldInfo(annotation=bool, required=False, default=True, description='[DEPRECATED] Prefer `detokenize_stage`. Whether to detokenize the output.'), 'detokenize_stage': FieldInfo(annotation=Any, required=False, default=True, description='Detokenizer stage config (bool | dict | DetokenizeStageConfig).'), 'engine_kwargs': FieldInfo(annotation=Dict[str, Any], required=False, default_factory=dict, description='The kwargs to pass to the SGLang engine. See https://docs.sglang.ai/backend/server_arguments.html for more details.'), 'experimental': FieldInfo(annotation=Dict[str, Any], required=False, default_factory=dict, description='[Experimental] Experimental configurations.Supported keys:\n`max_tasks_in_flight_per_actor`: The maximum number of tasks in flight per actor. Default to 4.'), 'has_image': FieldInfo(annotation=bool, required=False, default=False, description='[DEPRECATED] Prefer `prepare_multimodal_stage` for processing multimodal data. Whether the input messages have images.'), 'max_concurrent_batches': FieldInfo(annotation=int, required=False, default=8, description='The maximum number of concurrent batches in the engine. This is to overlap the batch processing to avoid the tail latency of each batch. The default value may not be optimal when the batch size or the batch processing latency is too small, but it should be good enough for batch size >= 32.'), 'max_pending_requests': FieldInfo(annotation=Union[int, NoneType], required=False, description='The maximum number of pending requests. If not specified, will use the default value from the backend engine.'), 'model_source': FieldInfo(annotation=str, required=True, description='The model source to use for the offline processing.'), 'prepare_image_stage': FieldInfo(annotation=Any, required=False, default=False, description='[DEPRECATED] Prefer `prepare_multimodal_stage` for processing multimodal data. Prepare image stage config (bool | dict | PrepareImageStageConfig).'), 'prepare_multimodal_stage': FieldInfo(annotation=Any, required=False, default=False, description='Prepare multimodal stage config (bool | dict | PrepareMultimodalStageConfig).'), 'resources_per_bundle': FieldInfo(annotation=Union[Dict[str, float], NoneType], required=False, description='[DEPRECATED] This parameter is deprecated and will be removed in a future version. ', json_schema_extra={'deprecated': True}), 'runtime_env': FieldInfo(annotation=Union[Dict[str, Any], NoneType], required=False, description='The runtime environment to use for the offline processing.'), 'should_continue_on_error': FieldInfo(annotation=bool, required=False, default=False, description="If True, continue processing when inference fails for a row instead of raising an exception. Failed rows will have a non-empty '__inference_error__' column containing the error message, and other output columns will be empty strings. Error rows bypass postprocess. If False (default), any inference error will raise an exception."), 'task_type': FieldInfo(annotation=Literal['generate'], required=False, default='generate', description="The task type to use. If not specified, will use 'generate' by default."), 'tokenize': FieldInfo(annotation=bool, required=False, default=True, description='[DEPRECATED] Prefer `tokenize_stage`. Whether to tokenize input before engine.'), 'tokenize_stage': FieldInfo(annotation=Any, required=False, default=True, description='Tokenizer stage config (bool | dict | TokenizerStageConfig).')}#

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.