ray.data.preprocessors.OneHotEncoder
- class ray.data.preprocessors.OneHotEncoder(columns: List[str], *, max_categories: Dict[str, int] | None = None)
Bases: Preprocessor
One-hot encode categorical data.
This preprocessor creates a column named {column}_{category} for each unique {category} in {column}. The value of a column is 1 if the category matches and 0 otherwise.
If you encode an infrequent category (see max_categories) or a category that isn’t in the fitted dataset, then the category is encoded as all 0s.
Columns must contain hashable objects or lists of hashable objects.
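The naming and zero-fill rules described above can be sketched in plain Python. This is a minimal illustration of the semantics only, not Ray's implementation; the helper names fit_categories and one_hot are hypothetical, and sorting the categories alphabetically is an assumption made for the example.

```python
from typing import Dict, Hashable, List


def fit_categories(values: List[Hashable]) -> List[Hashable]:
    """Collect the unique categories seen during fitting (sorted: an assumption)."""
    return sorted(set(values))


def one_hot(
    column: str, values: List[Hashable], categories: List[Hashable]
) -> Dict[str, List[int]]:
    """Build one {column}_{category} output column per fitted category."""
    return {
        # 1 where the value matches the category, 0 otherwise.
        f"{column}_{category}": [1 if value == category else 0 for value in values]
        for category in categories
    }


colors = ["red", "green", "red", "red", "blue", "green"]
categories = fit_categories(colors)

encoded = one_hot("color", colors, categories)
# "yellow" was never seen during fitting, so every output column is 0.
unseen = one_hot("color", ["yellow"], categories)

print(encoded)
print(unseen)
```

Because the output columns are derived only from the fitted categories, an unseen value cannot create a new column; it simply matches none of them.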
Note
Lists are treated as categories. If you want to encode individual list elements, use MultiHotEncoder.
Example
>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import OneHotEncoder
>>>
>>> df = pd.DataFrame({"color": ["red", "green", "red", "red", "blue", "green"]})
>>> ds = ray.data.from_pandas(df)
>>> encoder = OneHotEncoder(columns=["color"])
>>> encoder.fit_transform(ds).to_pandas()
   color_blue  color_green  color_red
0           0            0          1
1           0            1          0
2           0            0          1
3           0            0          1
4           1            0          0
5           0            1          0
If you one-hot encode a value that isn’t in the fitted dataset, then the value is encoded with zeros.
>>> df = pd.DataFrame({"color": ["yellow"]})
>>> batch = ray.data.from_pandas(df)
>>> encoder.transform(batch).to_pandas()
   color_blue  color_green  color_red
0           0            0          0
Likewise, if you one-hot encode an infrequent value, then the value is encoded with zeros.
>>> encoder = OneHotEncoder(columns=["color"], max_categories={"color": 2})
>>> encoder.fit_transform(ds).to_pandas()
   color_red  color_green
0          1            0
1          0            1
2          1            0
3          1            0
4          0            0
5          0            1
- Parameters:
columns – The columns to separately encode.
max_categories – The maximum number of features to create for each column. If a value isn’t specified for a column, then a feature is created for every category in that column.
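One way to picture the effect of max_categories is as keeping only the most frequent categories of a column and encoding everything else as zeros. The sketch below uses collections.Counter from the standard library; the helper name select_categories is hypothetical, and the tie-breaking behavior (first-seen order) is a property of Counter, not something this page specifies for Ray.

```python
from collections import Counter
from typing import Hashable, List, Optional


def select_categories(
    values: List[Hashable], max_categories: Optional[int] = None
) -> List[Hashable]:
    """Return the categories to encode, most frequent first.

    With max_categories=None, every category is kept.
    """
    counts = Counter(values)
    # most_common(None) returns all categories ordered by descending count.
    return [category for category, _ in counts.most_common(max_categories)]


colors = ["red", "green", "red", "red", "blue", "green"]

kept = select_categories(colors, max_categories=2)
# Categories outside `kept` (here "blue") match nothing and become all zeros.
rows = [[1 if value == category else 0 for category in kept] for value in colors]

print(kept)
print(rows)
```

With the color data from the example, red (3 occurrences) and green (2) are kept and blue (1) is dropped, which is consistent with the two-column output shown earlier.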
See also
MultiHotEncoder
If you want to encode individual list elements, use MultiHotEncoder.
OrdinalEncoder
If your categories are ordered, you may want to use OrdinalEncoder.
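The difference between treating a whole list as one category and encoding individual list elements can be illustrated with a small conceptual sketch. Neither function below is Ray code: one_hot_lists and multi_hot are hypothetical helpers, and converting lists to tuples to make them hashable is an assumption made for the example.

```python
from typing import List, Tuple


def one_hot_lists(values: List[List[str]]) -> Tuple[list, List[List[int]]]:
    """Treat each whole list as a single category."""
    # Tuples stand in for lists so they can be stored in a set.
    categories = sorted({tuple(value) for value in values})
    rows = [[1 if tuple(value) == c else 0 for c in categories] for value in values]
    return categories, rows


def multi_hot(values: List[List[str]]) -> Tuple[list, List[List[int]]]:
    """Treat each list element as its own category."""
    elements = sorted({element for value in values for element in value})
    rows = [[1 if e in value else 0 for e in elements] for value in values]
    return elements, rows


tags = [["a", "b"], ["b"], ["a", "b"]]

list_categories, list_rows = one_hot_lists(tags)
element_categories, element_rows = multi_hot(tags)

print(list_categories, list_rows)
print(element_categories, element_rows)
```

In the first case exactly one position is set per row, because ["a", "b"] and ["b"] are two distinct categories; in the second, a row can have several positions set, one per element it contains.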
PublicAPI (alpha): This API is in alpha and may change before becoming stable.
Methods
Load the original preprocessor serialized via self.serialize().
Fit this Preprocessor to the Dataset.
Fit this Preprocessor to the Dataset and then transform the Dataset.
Batch format hint for upstream producers to try yielding best block format.
Return this preprocessor serialized as a string.
Transform the given dataset.
Transform a single batch of data.