ray.data.preprocessors.Categorizer

class ray.data.preprocessors.Categorizer(columns: List[str], dtypes: Dict[str, pandas.CategoricalDtype] | None = None)

Bases: Preprocessor

Convert columns to pd.CategoricalDtype.

Use this preprocessor with frameworks that have built-in support for pd.CategoricalDtype, such as LightGBM.

Warning

If you don’t specify dtypes, fit this preprocessor before splitting your dataset into train and test splits. This ensures categories are consistent across splits.
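
The following pandas-only sketch illustrates why the warning matters. It does not call Categorizer; the toy frames and variable names are hypothetical. It shows that categories inferred separately on each split can disagree, so the same label ends up with different integer codes, whereas a dtype shared across splits keeps the codes aligned.

```python
import pandas as pd

# Hypothetical toy splits: "L3" appears only in the training rows.
train = pd.DataFrame({"level": ["L4", "L5", "L3"]})
test = pd.DataFrame({"level": ["L5", "L4"]})

# Inferring categories per split yields two different dtypes.
train_inferred = train["level"].astype("category")
test_inferred = test["level"].astype("category")
dtypes_match_when_inferred = train_inferred.dtype == test_inferred.dtype

# Deriving one dtype from all rows (the analogue of fitting before
# splitting) gives every split the same categories and codes.
all_levels = pd.concat([train, test])["level"].unique()
shared = pd.CategoricalDtype(sorted(all_levels))
train_shared = train["level"].astype(shared)
test_shared = test["level"].astype(shared)
dtypes_match_when_shared = train_shared.dtype == test_shared.dtype

print(list(test_inferred.cat.codes))  # [1, 0]: "L4" is code 0 here
print(list(test_shared.cat.codes))    # [2, 1]: "L4" is code 1, as in train
```

A model that consumes the integer codes would read "L4" differently in the two splits in the first case, which is the inconsistency the warning describes.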

Examples

>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import Categorizer
>>>
>>> df = pd.DataFrame(
... {
...     "sex": ["male", "female", "male", "female"],
...     "level": ["L4", "L5", "L3", "L4"],
... })
>>> ds = ray.data.from_pandas(df)  
>>> categorizer = Categorizer(columns=["sex", "level"])
>>> categorizer.fit_transform(ds).schema().types  
[CategoricalDtype(categories=['female', 'male'], ordered=False), CategoricalDtype(categories=['L3', 'L4', 'L5'], ordered=False)]

If you know the categories in advance, you can specify the categories with the dtypes parameter.

>>> categorizer = Categorizer(
...     columns=["sex", "level"],
...     dtypes={"level": pd.CategoricalDtype(["L3", "L4", "L5", "L6"], ordered=True)},
... )
>>> categorizer.fit_transform(ds).schema().types  
[CategoricalDtype(categories=['female', 'male'], ordered=False), CategoricalDtype(categories=['L3', 'L4', 'L5', 'L6'], ordered=True)]
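
As a rough illustration of what a predefined dtype like the one above means in plain pandas, the sketch below applies the same ordered dtype to a hypothetical series. It uses only pandas, not Categorizer, and how Categorizer itself treats values outside the declared categories is not covered here.

```python
import pandas as pd

# Same ordered dtype as in the example above; "L6" is declared up front
# even though it never appears in the data.
level_dtype = pd.CategoricalDtype(["L3", "L4", "L5", "L6"], ordered=True)

# Hypothetical values; "L7" is not one of the declared categories.
levels = pd.Series(["L4", "L5", "L3", "L7"]).astype(level_dtype)

# Declared but unseen categories stay part of the dtype.
unused_category_kept = "L6" in levels.cat.categories

# In plain pandas, a value outside the declared categories becomes missing.
unknown_became_missing = bool(levels.isna().iloc[3])

# ordered=True enables order comparisons against a category label.
at_least_l4 = (levels >= "L4").tolist()
```

Declaring the categories up front fixes both their order and their codes, regardless of which values happen to appear in a given dataset.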

Parameters:
  • columns – The columns to convert to pd.CategoricalDtype.

  • dtypes – An optional dictionary that maps column names to pd.CategoricalDtype objects. For any column in columns that you omit from dtypes, the categories are inferred from the dataset when the preprocessor is fit.

PublicAPI (alpha): This API is in alpha and may change before becoming stable.

Methods

deserialize

Load the original preprocessor serialized via self.serialize().

fit

Fit this Preprocessor to the Dataset.

fit_transform

Fit this Preprocessor to the Dataset and then transform the Dataset.

preferred_batch_format

Batch format hint that lets upstream producers try to yield the best block format.

serialize

Return this preprocessor serialized as a string.

transform

Transform the given dataset.

transform_batch

Transform a single batch of data.