ray.data.preprocessors.MultiHotEncoder
- class ray.data.preprocessors.MultiHotEncoder(columns: List[str], *, max_categories: Dict[str, int] | None = None, output_columns: List[str] | None = None)
Bases: Preprocessor
Multi-hot encode categorical data.
This preprocessor replaces each list of categories with an \(m\)-length binary list, where \(m\) is the number of unique categories in the column or the value specified in max_categories. The \(i\text{-th}\) element of the binary list is \(1\) if category \(i\) is in the input list and \(0\) otherwise.

Columns must contain hashable objects or lists of hashable objects. A single column can’t mix the two types.
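The encoding rule described above can be sketched in plain Python. This is an illustration only, not Ray's implementation: the helper name `multi_hot` is hypothetical, and assigning indices in sorted category order is an assumption (it happens to reproduce the output shown in the Examples section).

```python
from typing import Hashable, List, Sequence


def multi_hot(rows: Sequence[Sequence[Hashable]]) -> List[List[int]]:
    """Hypothetical helper: encode each list of categories as a binary list."""
    # Assumption: categories are indexed in sorted order.
    categories = sorted({category for row in rows for category in row})
    index = {category: i for i, category in enumerate(categories)}

    encoded = []
    for row in rows:
        vector = [0] * len(categories)  # m = number of unique categories
        for category in row:
            vector[index[category]] = 1  # element i is 1 if category i is present
        encoded.append(vector)
    return encoded


genres = [
    ["comedy", "action", "sports"],
    ["animation", "comedy", "action"],
    ["documentary"],
]
encoded = multi_hot(genres)
print(encoded)
```

Each output list has length 5 because the three input lists contain five distinct genres between them.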
Note
The logic is similar to scikit-learn’s [MultiLabelBinarizer][1].
Examples
>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import MultiHotEncoder
>>>
>>> df = pd.DataFrame({
...     "name": ["Shaolin Soccer", "Moana", "The Smartest Guys in the Room"],
...     "genre": [
...         ["comedy", "action", "sports"],
...         ["animation", "comedy", "action"],
...         ["documentary"],
...     ],
... })
>>> ds = ray.data.from_pandas(df)
>>>
>>> encoder = MultiHotEncoder(columns=["genre"])
>>> encoder.fit_transform(ds).to_pandas()
                            name            genre
0                 Shaolin Soccer  [1, 0, 1, 0, 1]
1                          Moana  [1, 1, 1, 0, 0]
2  The Smartest Guys in the Room  [0, 0, 0, 1, 0]
MultiHotEncoder can also be used in append mode by providing the names of the output_columns that should hold the encoded values.

>>> encoder = MultiHotEncoder(columns=["genre"], output_columns=["genre_encoded"])
>>> encoder.fit_transform(ds).to_pandas()
                            name                        genre    genre_encoded
0                 Shaolin Soccer     [comedy, action, sports]  [1, 0, 1, 0, 1]
1                          Moana  [animation, comedy, action]  [1, 1, 1, 0, 0]
2  The Smartest Guys in the Room                [documentary]  [0, 0, 0, 1, 0]
If you specify max_categories, then MultiHotEncoder creates features for only the most frequent categories.

>>> encoder = MultiHotEncoder(columns=["genre"], max_categories={"genre": 3})
>>> encoder.fit_transform(ds).to_pandas()
                            name      genre
0                 Shaolin Soccer  [1, 1, 1]
1                          Moana  [1, 1, 0]
2  The Smartest Guys in the Room  [0, 0, 0]
>>> encoder.stats_
OrderedDict([('unique_values(genre)', {'comedy': 0, 'action': 1, 'sports': 2})])
- Parameters:
columns – The columns to separately encode.
max_categories – The maximum number of features to create for each column. If a value isn’t specified for a column, then a feature is created for every unique category in that column.
output_columns – The names of the transformed columns. If None, the transformed columns keep the same names as the input columns. If not None, the length of output_columns must match the length of columns; otherwise, an error is raised.
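The effect of max_categories can be sketched in plain Python as follows. This is a rough illustration, not Ray's implementation: the helper `top_k_multi_hot` is hypothetical, and breaking frequency ties by first appearance (the behavior of `collections.Counter.most_common`) is an assumption about how ties are resolved.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence, Tuple


def top_k_multi_hot(
    rows: Sequence[Sequence[Hashable]], k: int
) -> Tuple[Dict[Hashable, int], List[List[int]]]:
    """Hypothetical helper: keep only the k most frequent categories."""
    counts = Counter(category for row in rows for category in row)
    # Assumption: ties are broken by first appearance, as Counter.most_common does.
    index = {category: i for i, (category, _) in enumerate(counts.most_common(k))}

    encoded = []
    for row in rows:
        vector = [0] * len(index)
        for category in row:
            if category in index:  # categories outside the top k are dropped
                vector[index[category]] = 1
        encoded.append(vector)
    return index, encoded


genres = [
    ["comedy", "action", "sports"],
    ["animation", "comedy", "action"],
    ["documentary"],
]
index, encoded = top_k_multi_hot(genres, k=3)
print(index)
print(encoded)
```

A row whose categories all fall outside the retained set, such as `["documentary"]` here, encodes to an all-zero list.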
See also
OneHotEncoder
If you’re encoding individual categories instead of lists of categories, use OneHotEncoder.
OrdinalEncoder
If your categories are ordered, you may want to use OrdinalEncoder.
[1]: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html
PublicAPI (alpha): This API is in alpha and may change before becoming stable.
Methods
Load the original preprocessor serialized via self.serialize().
Fit this Preprocessor to the Dataset.
Fit this Preprocessor to the Dataset and then transform the Dataset.
Batch format hint for upstream producers to try yielding best block format.
Return this preprocessor serialized as a string.
Transform the given dataset.
Transform a single batch of data.