ray.data.preprocessors.Categorizer
class ray.data.preprocessors.Categorizer(columns: List[str], dtypes: Dict[str, pandas.CategoricalDtype] | None = None)
Bases: Preprocessor
Convert columns to pd.CategoricalDtype.
Use this preprocessor with frameworks that have built-in support for pd.CategoricalDtype, such as LightGBM.
Warning
If you don’t specify dtypes, fit this preprocessor before splitting your dataset into train and test splits. This ensures categories are consistent across splits.
Examples
>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import Categorizer
>>>
>>> df = pd.DataFrame(
...     {
...         "sex": ["male", "female", "male", "female"],
...         "level": ["L4", "L5", "L3", "L4"],
...     })
>>> ds = ray.data.from_pandas(df)
>>> categorizer = Categorizer(columns=["sex", "level"])
>>> categorizer.fit_transform(ds).schema().types
[CategoricalDtype(categories=['female', 'male'], ordered=False), CategoricalDtype(categories=['L3', 'L4', 'L5'], ordered=False)]
If you know the categories in advance, you can specify them with the dtypes parameter.
>>> categorizer = Categorizer(
...     columns=["sex", "level"],
...     dtypes={"level": pd.CategoricalDtype(["L3", "L4", "L5", "L6"], ordered=True)},
... )
>>> categorizer.fit_transform(ds).schema().types
[CategoricalDtype(categories=['female', 'male'], ordered=False), CategoricalDtype(categories=['L3', 'L4', 'L5', 'L6'], ordered=True)]
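The warning about fitting before splitting can be illustrated with pandas alone. The following is a minimal sketch, not the preprocessor's implementation: the `train` and `test` frames are hypothetical, and the categories are inferred by hand from the sorted unique values to mimic what happens when categories come from the data rather than from dtypes.

```python
import pandas as pd

# Hypothetical splits: "L5" only appears in the rows that landed in the test split.
train = pd.DataFrame({"level": ["L3", "L4", "L3"]})
test = pd.DataFrame({"level": ["L4", "L5"]})

# Inferring categories separately per split yields two different dtypes.
train_only = pd.CategoricalDtype(sorted(train["level"].unique()))
test_only = pd.CategoricalDtype(sorted(test["level"].unique()))
dtypes_differ = train_only != test_only

# Casting the test split to the train-only dtype turns the unseen "L5" into NaN.
test_cast = test["level"].astype(train_only)
num_missing = int(test_cast.isna().sum())

# Inferring categories from the combined data gives one dtype shared by both splits,
# so the integer codes mean the same thing in train and test.
full = pd.concat([train, test], ignore_index=True)
shared = pd.CategoricalDtype(sorted(full["level"].unique()))
train_codes = train["level"].astype(shared).cat.codes.tolist()
test_codes = test["level"].astype(shared).cat.codes.tolist()

print(dtypes_differ, num_missing, train_codes, test_codes)
```

With a shared dtype, a given category maps to the same code in every split, which is what frameworks that consume pd.CategoricalDtype columns typically rely on.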
Parameters:
columns – The columns to convert to pd.CategoricalDtype.
dtypes – An optional dictionary that maps columns to pd.CategoricalDtype objects. If you don’t include a column in dtypes, the categories are inferred.
PublicAPI (alpha): This API is in alpha and may change before becoming stable.
Methods
deserialize – Load the original preprocessor serialized via self.serialize().
fit – Fit this Preprocessor to the Dataset.
fit_transform – Fit this Preprocessor to the Dataset and then transform the Dataset.
preferred_batch_format – Batch format hint for upstream producers to try yielding best block format.
serialize – Return this preprocessor serialized as a string.
transform – Transform the given dataset.
transform_batch – Transform a single batch of data.