ray.data.preprocessors.Categorizer

class ray.data.preprocessors.Categorizer(columns: List[str], dtypes: Dict[str, pandas.CategoricalDtype] | None = None, output_columns: List[str] | None = None)

Bases: SerializablePreprocessorBase

Convert columns to pd.CategoricalDtype.

Use this preprocessor with frameworks that natively support pd.CategoricalDtype, such as LightGBM.

Warning

If you don’t specify dtypes, fit this preprocessor before splitting your dataset into train and test splits. This ensures categories are consistent across splits.
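The reason behind this warning can be sketched with plain pandas rather than the Ray API. The split by row position and all variable names below are illustrative assumptions. When categories are inferred separately for each split, the same value can map to a different integer code in each one. Inferring them once from the full data keeps the splits aligned.

```python
import pandas as pd

df = pd.DataFrame({"level": ["L4", "L5", "L3", "L4"]})

# Hypothetical split: first three rows for training, last row for testing.
train, test = df.iloc[:3], df.iloc[3:]

# Inferring categories independently per split gives mismatched dtypes.
train_cat = train["level"].astype("category")
test_cat = test["level"].astype("category")
train_categories = list(train_cat.cat.categories)  # ['L3', 'L4', 'L5']
test_categories = list(test_cat.cat.categories)    # ['L4']

# "L4" is encoded as 1 in the training split but as 0 in the test split.
train_code_l4 = int(train_cat.cat.codes.iloc[0])
test_code_l4 = int(test_cat.cat.codes.iloc[0])

# Inferring once from the full data and reusing the dtype keeps codes aligned.
shared_dtype = pd.CategoricalDtype(sorted(df["level"].unique()))
train_shared = train["level"].astype(shared_dtype)
test_shared = test["level"].astype(shared_dtype)
```

A model that consumes the integer codes would read "L4" differently in the two independently inferred splits, which is the inconsistency that fitting before splitting avoids.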

Examples

>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import Categorizer
>>>
>>> df = pd.DataFrame(
... {
...     "sex": ["male", "female", "male", "female"],
...     "level": ["L4", "L5", "L3", "L4"],
... })
>>> ds = ray.data.from_pandas(df)  
>>> categorizer = Categorizer(columns=["sex", "level"])
>>> categorizer.fit_transform(ds).schema().types  
[CategoricalDtype(categories=['female', 'male'], ordered=False), CategoricalDtype(categories=['L3', 'L4', 'L5'], ordered=False)]

Categorizer can also run in append mode: pass output_columns to name the new columns that hold the categorized values, and the input columns are left unchanged.

>>> categorizer = Categorizer(columns=["sex", "level"], output_columns=["sex_cat", "level_cat"])
>>> categorizer.fit_transform(ds).to_pandas()  
      sex level sex_cat level_cat
0    male    L4    male        L4
1  female    L5  female        L5
2    male    L3    male        L3
3  female    L4  female        L4

If you know the categories in advance, you can specify the categories with the dtypes parameter.

>>> categorizer = Categorizer(
...     columns=["sex", "level"],
...     dtypes={"level": pd.CategoricalDtype(["L3", "L4", "L5", "L6"], ordered=True)},
... )
>>> categorizer.fit_transform(ds).schema().types  
[CategoricalDtype(categories=['female', 'male'], ordered=False), CategoricalDtype(categories=['L3', 'L4', 'L5', 'L6'], ordered=True)]
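To show what an explicit, ordered dtype such as the one above implies, here is a minimal pandas-only sketch. It illustrates pandas behavior, not Categorizer internals; that Categorizer handles unlisted values the same way is an assumption. The value "L7" is a hypothetical level, added only to show a value that is missing from the declared categories.

```python
import pandas as pd

level_dtype = pd.CategoricalDtype(["L3", "L4", "L5", "L6"], ordered=True)

# "L7" is not among the declared categories, so pandas converts it to NaN.
levels = pd.Series(["L4", "L5", "L3", "L7"]).astype(level_dtype)

codes = levels.cat.codes.tolist()      # -1 marks the missing value
is_missing = levels.isna().tolist()

# ordered=True enables order comparisons against a category value.
above_l3 = (levels > "L3").tolist()
```

Declaring categories up front fixes the integer codes regardless of which values appear in a given dataset, at the cost of turning unlisted values into missing values.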

Parameters:

columns – The columns to convert to pd.CategoricalDtype.
dtypes – An optional dictionary that maps columns to pd.CategoricalDtype objects. If you don’t include a column in dtypes, its categories are inferred from the data when the preprocessor is fit.
output_columns – The names of the transformed columns. If None, the transformed columns replace the input columns under the same names. If not None, the length of output_columns must match the length of columns; otherwise, an error is raised.

PublicAPI (alpha): This API is in alpha and may change before becoming stable.

Methods

`deserialize`	Deserialize a preprocessor from serialized data.
`fit`	Fit this Preprocessor to the Dataset.
`fit_transform`	Fit this Preprocessor to the Dataset and then transform the Dataset.
`get_preprocessor_class_id`	Get the preprocessor class identifier for this preprocessor class.
`get_version`	Get the version number for this preprocessor class.
`preferred_batch_format`	Batch format hint that lets upstream producers try to yield the best block format.
`serialize`	Serialize this preprocessor to a string or bytes.
`set_preprocessor_class_id`	Set the preprocessor class identifier for this preprocessor class.
`set_version`	Set the version number for this preprocessor class.
`transform`	Transform the given dataset.
`transform_batch`	Transform a single batch of data.

Attributes

`MAGIC_CLOUDPICKLE`
`SERIALIZER_FORMAT_VERSION`
`columns`
`dtypes`
`output_columns`
`stat_computation_plan`