ray.data.preprocessors.Categorizer
- class ray.data.preprocessors.Categorizer(columns: List[str], dtypes: Dict[str, pandas.CategoricalDtype] | None = None, output_columns: List[str] | None = None)
Bases: Preprocessor
Convert columns to pd.CategoricalDtype.

Use this preprocessor with frameworks that have built-in support for pd.CategoricalDtype, like LightGBM.

Warning
If you don’t specify dtypes, fit this preprocessor before splitting your dataset into train and test splits. This ensures categories are consistent across splits.

Examples
>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import Categorizer
>>>
>>> df = pd.DataFrame(
... {
...     "sex": ["male", "female", "male", "female"],
...     "level": ["L4", "L5", "L3", "L4"],
... })
>>> ds = ray.data.from_pandas(df)
>>> categorizer = Categorizer(columns=["sex", "level"])
>>> categorizer.fit_transform(ds).schema().types
[CategoricalDtype(categories=['female', 'male'], ordered=False), CategoricalDtype(categories=['L3', 'L4', 'L5'], ordered=False)]
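The warning about fitting before splitting can be illustrated with plain pandas. The following is a minimal sketch that does not use Ray and is not how Categorizer is implemented; it only shows what goes wrong when categories are inferred separately on each split. The split point and variable names are assumptions chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({"level": ["L4", "L5", "L3", "L4"]})

# Split first, then infer categories independently on each split.
train, test = df.iloc[:2], df.iloc[2:]
train_cat = train["level"].astype("category")
test_cat = test["level"].astype("category")

# Each split sees a different set of values, so the inferred categories
# (and therefore the integer codes) disagree: "L4" is code 0 in train
# but code 1 in test.
print(list(train_cat.cat.categories), train_cat.cat.codes.tolist())
print(list(test_cat.cat.categories), test_cat.cat.codes.tolist())

# Inferring one dtype from the full column before splitting gives both
# splits the same categories and comparable codes.
shared = pd.CategoricalDtype(sorted(df["level"].unique()))
train_shared = train["level"].astype(shared)
test_shared = test["level"].astype(shared)
print(train_shared.cat.codes.tolist(), test_shared.cat.codes.tolist())
```

Fitting on the full dataset, or passing explicit dtypes, plays the role of the shared dtype in this sketch.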
Categorizer can also be used in append mode by providing the names of the output_columns that should hold the categorized values.

>>> categorizer = Categorizer(columns=["sex", "level"], output_columns=["sex_cat", "level_cat"])
>>> categorizer.fit_transform(ds).to_pandas()
      sex level sex_cat level_cat
0    male    L4    male        L4
1  female    L5  female        L5
2    male    L3    male        L3
3  female    L4  female        L4
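Because the printed frame looks the same for plain and categorical columns, it can help to inspect the dtypes directly. The sketch below mimics append mode in plain pandas, as an assumption about the effect rather than a copy of Ray's implementation: the source columns are left untouched and the categorized values are written to new columns.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "sex": ["male", "female", "male", "female"],
        "level": ["L4", "L5", "L3", "L4"],
    }
)

columns = ["sex", "level"]
output_columns = ["sex_cat", "level_cat"]  # one output name per input column

# Write the categorized values to new columns and keep the originals.
appended = df.copy()
for src, dst in zip(columns, output_columns):
    appended[dst] = df[src].astype("category")

# Only the appended columns carry a categorical dtype.
print(appended.dtypes)
```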
If you know the categories in advance, you can specify the categories with the dtypes parameter.

>>> categorizer = Categorizer(
...     columns=["sex", "level"],
...     dtypes={"level": pd.CategoricalDtype(["L3", "L4", "L5", "L6"], ordered=True)},
... )
>>> categorizer.fit_transform(ds).schema().types
[CategoricalDtype(categories=['female', 'male'], ordered=False), CategoricalDtype(categories=['L3', 'L4', 'L5', 'L6'], ordered=True)]
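To see what an explicit, ordered pd.CategoricalDtype provides, here is a small pandas-only sketch (no Ray involved; variable names are illustrative). It uses the same categories as the example above, including "L6", which never occurs in the data.

```python
import pandas as pd

level_dtype = pd.CategoricalDtype(["L3", "L4", "L5", "L6"], ordered=True)
levels = pd.Series(["L4", "L5", "L3", "L4"]).astype(level_dtype)

# "L6" is a valid category even though no row contains it, so the
# integer codes stay stable if "L6" shows up in later data.
print(list(levels.cat.categories))
print(levels.cat.codes.tolist())

# ordered=True makes comparisons and min/max meaningful.
above_l3 = (levels > "L3").tolist()
highest = levels.max()
print(above_l3, highest)
```

Declaring the categories up front also removes the dependence on which rows happen to be present when the preprocessor is fitted.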
- Parameters:
columns – The columns to convert to pd.CategoricalDtype.
dtypes – An optional dictionary that maps columns to pd.CategoricalDtype objects. If you don’t include a column in dtypes, the categories are inferred.
output_columns – The names of the transformed columns. If None, the transformed columns have the same names as the input columns. If not None, the length of output_columns must match the length of columns; otherwise, an error is raised.
PublicAPI (alpha): This API is in alpha and may change before becoming stable.
Methods
deserialize – Load the original preprocessor serialized via self.serialize().
fit – Fit this Preprocessor to the Dataset.
fit_transform – Fit this Preprocessor to the Dataset and then transform the Dataset.
preferred_batch_format – Batch format hint for upstream producers to try yielding best block format.
serialize – Return this preprocessor serialized as a string.
transform – Transform the given dataset.
transform_batch – Transform a single batch of data.