ray.data.preprocessors.MultiHotEncoder#

class ray.data.preprocessors.MultiHotEncoder(columns: List[str], *, max_categories: Dict[str, int] | None = None)[source]#

Bases: Preprocessor

Multi-hot encode categorical data.

This preprocessor replaces each list of categories with an \(m\)-length binary list, where \(m\) is the number of unique categories in the column or the value specified in max_categories. The \(i\text{-th}\) element of the binary list is \(1\) if category \(i\) is in the input list and \(0\) otherwise.

Columns must contain hashable objects or lists of hashable objects. Also, you can’t have both types in the same column.

Note

The logic is similar to scikit-learn’s MultiLabelBinarizer.

Examples

>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import MultiHotEncoder
>>>
>>> df = pd.DataFrame({
...     "name": ["Shaolin Soccer", "Moana", "The Smartest Guys in the Room"],
...     "genre": [
...         ["comedy", "action", "sports"],
...         ["animation", "comedy",  "action"],
...         ["documentary"],
...     ],
... })
>>> ds = ray.data.from_pandas(df)  
>>>
>>> encoder = MultiHotEncoder(columns=["genre"])
>>> encoder.fit_transform(ds).to_pandas()  
                            name            genre
0                 Shaolin Soccer  [1, 0, 1, 0, 1]
1                          Moana  [1, 1, 1, 0, 0]
2  The Smartest Guys in the Room  [0, 0, 0, 1, 0]

If you specify max_categories, then MultiHotEncoder creates features for only the most frequent categories.

>>> encoder = MultiHotEncoder(columns=["genre"], max_categories={"genre": 3})
>>> encoder.fit_transform(ds).to_pandas()  
                            name      genre
0                 Shaolin Soccer  [1, 1, 1]
1                          Moana  [1, 1, 0]
2  The Smartest Guys in the Room  [0, 0, 0]
>>> encoder.stats_  
OrderedDict([('unique_values(genre)', {'comedy': 0, 'action': 1, 'sports': 2})])
Parameters:
  • columns – The columns to separately encode.

  • max_categories – The maximum number of features to create for each column. If a value isn’t specified for a column, then a feature is created for every unique category in that column.

See also

OneHotEncoder

If you’re encoding individual categories instead of lists of categories, use OneHotEncoder.

OrdinalEncoder

If your categories are ordered, you may want to use OrdinalEncoder.

PublicAPI (alpha): This API is in alpha and may change before becoming stable.

Methods

deserialize

Load the original preprocessor serialized via self.serialize().

fit

Fit this Preprocessor to the Dataset.

fit_transform

Fit this Preprocessor to the Dataset and then transform the Dataset.

preferred_batch_format

Batch format hint for upstream producers to try yielding best block format.

serialize

Return this preprocessor serialized as a string.

transform

Transform the given dataset.

transform_batch

Transform a single batch of data.