ray.data.preprocessors.MultiHotEncoder
class ray.data.preprocessors.MultiHotEncoder(columns: List[str], *, max_categories: Dict[str, int] | None = None, output_columns: List[str] | None = None)
Bases: Preprocessor

Multi-hot encode categorical data.

This preprocessor replaces each list of categories with an \(m\)-length binary list, where \(m\) is the number of unique categories in the column or the value specified in max_categories. The \(i\text{-th}\) element of the binary list is \(1\) if category \(i\) is in the input list and \(0\) otherwise.

Columns must contain hashable objects or lists of hashable objects. Also, you can’t have both types in the same column.

Note

The logic is similar to scikit-learn’s [MultiLabelBinarizer][1].

Examples

    >>> import pandas as pd
    >>> import ray
    >>> from ray.data.preprocessors import MultiHotEncoder
    >>>
    >>> df = pd.DataFrame({
    ...     "name": ["Shaolin Soccer", "Moana", "The Smartest Guys in the Room"],
    ...     "genre": [
    ...         ["comedy", "action", "sports"],
    ...         ["animation", "comedy", "action"],
    ...         ["documentary"],
    ...     ],
    ... })
    >>> ds = ray.data.from_pandas(df)
    >>>
    >>> encoder = MultiHotEncoder(columns=["genre"])
    >>> encoder.fit_transform(ds).to_pandas()
                                name            genre
    0                 Shaolin Soccer  [1, 0, 1, 0, 1]
    1                          Moana  [1, 1, 1, 0, 0]
    2  The Smartest Guys in the Room  [0, 0, 0, 1, 0]

MultiHotEncoder can also be used in append mode by providing the name of the output_columns that should hold the encoded values.

    >>> encoder = MultiHotEncoder(columns=["genre"], output_columns=["genre_encoded"])
    >>> encoder.fit_transform(ds).to_pandas()
                                name                        genre    genre_encoded
    0                 Shaolin Soccer     [comedy, action, sports]  [1, 0, 1, 0, 1]
    1                          Moana  [animation, comedy, action]  [1, 1, 1, 0, 0]
    2  The Smartest Guys in the Room                [documentary]  [0, 0, 0, 1, 0]

If you specify max_categories, then MultiHotEncoder creates features for only the most frequent categories.

    >>> encoder = MultiHotEncoder(columns=["genre"], max_categories={"genre": 3})
    >>> encoder.fit_transform(ds).to_pandas()
                                name      genre
    0                 Shaolin Soccer  [1, 1, 1]
    1                          Moana  [1, 1, 0]
    2  The Smartest Guys in the Room  [0, 0, 0]
    >>> encoder.stats_
    OrderedDict([('unique_values(genre)', {'comedy': 0, 'action': 1, 'sports': 2})])

Parameters:
- columns – The columns to separately encode. 
- max_categories – The maximum number of features to create for each column. If a value isn’t specified for a column, then a feature is created for every unique category in that column. 
- output_columns – The names of the transformed columns. If None, the transformed columns will be the same as the input columns. If not None, the length of output_columns must match the length of columns, otherwise an error will be raised.
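
The encoding rule described above can be sketched in plain Python. This is not Ray's implementation, and the helper names `fit_categories` and `multi_hot` are hypothetical. The sketch assumes categories are indexed alphabetically when `max_categories` is unset and by descending frequency (ties kept in first-seen order) when it is set; those assumptions were chosen only because they reproduce the outputs shown in the examples.

```python
from collections import Counter


def fit_categories(rows, max_categories=None):
    """Map each kept category to a position in the output list.

    Assumption: alphabetical order without a cap, frequency order with one.
    """
    counts = Counter(category for row in rows for category in row)
    if max_categories is None:
        kept = sorted(counts)  # requires categories that can be ordered
    else:
        kept = [category for category, _ in counts.most_common(max_categories)]
    return {category: position for position, category in enumerate(kept)}


def multi_hot(row, index):
    """Return a binary list with a 1 for every kept category present in row."""
    encoded = [0] * len(index)
    for category in row:
        if category in index:  # categories outside the index are dropped
            encoded[index[category]] = 1
    return encoded


genres = [
    ["comedy", "action", "sports"],
    ["animation", "comedy", "action"],
    ["documentary"],
]

full_index = fit_categories(genres)
full = [multi_hot(row, full_index) for row in genres]
# full == [[1, 0, 1, 0, 1], [1, 1, 1, 0, 0], [0, 0, 0, 1, 0]]

top3_index = fit_categories(genres, max_categories=3)
top3 = [multi_hot(row, top3_index) for row in genres]
# top3_index == {'comedy': 0, 'action': 1, 'sports': 2}
# top3 == [[1, 1, 1], [1, 1, 0], [0, 0, 0]]
```

With a cap of 3, "documentary" falls outside the kept categories, so the third row encodes to all zeros, matching the max_categories example.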
 
See also

- OneHotEncoder: If you’re encoding individual categories instead of lists of categories, use OneHotEncoder.
- OrdinalEncoder: If your categories are ordered, you may want to use OrdinalEncoder.
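
As a conceptual illustration of the choice between the two encoders, the sketch below contrasts encoding a single category per row with encoding a list of categories per row. The helpers `one_hot` and `multi_hot` and the fixed `CATEGORIES` list are hypothetical and exist only for this illustration; they do not reflect how Ray lays out its output columns.

```python
# Hypothetical helpers for illustration only; not part of Ray's API.
CATEGORIES = ["action", "comedy", "sports"]


def one_hot(category, categories=CATEGORIES):
    """Encode one category per row: at most one position is set."""
    return [1 if c == category else 0 for c in categories]


def multi_hot(row, categories=CATEGORIES):
    """Encode a list of categories per row: one position per category present."""
    present = set(row)
    return [1 if c in present else 0 for c in categories]


single = one_hot("comedy")                 # row holds a single value
several = multi_hot(["comedy", "sports"])  # row holds a list of values
print(single, several)
```

If each row holds exactly one category, the one-hot form is enough; the multi-hot form is only needed when a row can belong to several categories at once.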
[1]: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html

PublicAPI (alpha): This API is in alpha and may change before becoming stable.

Methods

- deserialize – Load the original preprocessor serialized via self.serialize().
- fit – Fit this Preprocessor to the Dataset.
- fit_transform – Fit this Preprocessor to the Dataset and then transform the Dataset.
- preferred_batch_format – Batch format hint for upstream producers to try yielding best block format.
- serialize – Return this preprocessor serialized as a string.
- transform – Transform the given dataset.
- transform_batch – Transform a single batch of data.