class ray.data.preprocessors.Categorizer(columns: List[str], dtypes: Optional[Dict[str, pandas.core.dtypes.dtypes.CategoricalDtype]] = None)[source]#

Bases: ray.data.preprocessor.Preprocessor

Convert columns to pd.CategoricalDtype.

Use this preprocessor with frameworks that have built-in support for pd.CategoricalDtype like LightGBM.


If you don’t specify dtypes, fit this preprocessor before splitting your dataset into train and test splits. This ensures categories are consistent across splits.


>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import Categorizer
>>> df = pd.DataFrame(
... {
...     "sex": ["male", "female", "male", "female"],
...     "level": ["L4", "L5", "L3", "L4"],
... })
>>> ds = ray.data.from_pandas(df)  
>>> categorizer = Categorizer(columns=["sex", "level"])
>>> categorizer.fit_transform(ds).schema().types  
[CategoricalDtype(categories=['female', 'male'], ordered=False), CategoricalDtype(categories=['L3', 'L4', 'L5'], ordered=False)]

If you know the categories in advance, you can specify the categories with the dtypes parameter.

>>> categorizer = Categorizer(
...     columns=["sex", "level"],
...     dtypes={"level": pd.CategoricalDtype(["L3", "L4", "L5", "L6"], ordered=True)},
... )
>>> categorizer.fit_transform(ds).schema().types  
[CategoricalDtype(categories=['female', 'male'], ordered=False), CategoricalDtype(categories=['L3', 'L4', 'L5', 'L6'], ordered=True)]
  • columns – The columns to convert to pd.CategoricalDtype.

  • dtypes – An optional dictionary that maps columns to pd.CategoricalDtype objects. If you don’t include a column in dtypes, the categories are inferred.

PublicAPI (alpha): This API is in alpha and may change before becoming stable.