ray.data.llm.TokenizerStageConfig

class ray.data.llm.TokenizerStageConfig(*, enabled: bool = True, batch_size: int | None = None, concurrency: int | Tuple[int, int] | None = None, runtime_env: Dict[str, Any] | None = None, num_cpus: float | None = None, memory: float | None = None, model_source: str | None = None)

The configuration for the tokenizer stage.

Parameters:
  • enabled – Whether this stage is enabled. Defaults to True.

  • model_source – Model source/identifier for this stage. If not specified, the processor-level model_source is used.

  • batch_size – Number of rows per batch. If not specified, the processor-level batch_size is used.

  • concurrency – Actor pool size or range for this stage. If not specified, the processor-level concurrency is used. If concurrency is a tuple (m, n), Ray creates an autoscaling actor pool that scales between m and n workers (1 <= m <= n). If concurrency is an int n, CPU stages use an autoscaling pool that scales between 1 and n workers.

  • runtime_env – Optional runtime environment for this stage. If not specified, the processor-level runtime_env is used. See the Ray runtime environments documentation for more details.

  • num_cpus – Number of CPUs to reserve for each map worker in this stage.

  • memory – Heap memory in bytes to reserve for each map worker in this stage.

PublicAPI (beta): This API is in beta and may change before becoming stable.
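
The fallback and pool-sizing rules described for concurrency can be sketched in plain Python. This is only an illustration of the documented behavior for a CPU stage, not Ray's implementation: resolve_concurrency is a hypothetical helper that does not exist in ray.data.llm, and the sketch assumes a processor-level value is always available.

```python
from typing import Optional, Tuple, Union

Concurrency = Union[int, Tuple[int, int]]


def resolve_concurrency(
    stage_value: Optional[Concurrency],
    processor_value: Concurrency,
) -> Tuple[int, int]:
    """Return the (min, max) worker range implied by the documented rules.

    Hypothetical helper for illustration only; not part of ray.data.llm.
    """
    # An unset stage-level value falls back to the processor-level value.
    value = processor_value if stage_value is None else stage_value

    # An int n on a CPU stage means an autoscaling pool from 1 to n.
    if isinstance(value, int):
        low, high = 1, value
    else:
        low, high = value

    # The documented constraint on a tuple (m, n) is 1 <= m <= n.
    if not 1 <= low <= high:
        raise ValueError(f"expected 1 <= m <= n, got ({low}, {high})")
    return low, high


print(resolve_concurrency(None, 4))      # (1, 4): falls back to the processor-level int
print(resolve_concurrency((2, 8), 4))    # (2, 8): the stage-level tuple is used as given
print(resolve_concurrency(3, (1, 16)))   # (1, 3): the stage-level int overrides the processor value
```

The same "stage value if set, otherwise processor value" pattern applies to model_source, batch_size, and runtime_env as described in the parameter list.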

model_config: ClassVar[ConfigDict] = {'extra': 'forbid', 'protected_namespaces': ()}

Configuration for the model. It should be a dictionary conforming to pydantic.config.ConfigDict. With 'extra': 'forbid', passing a field that the class does not define raises a validation error.