
class ray.data.preprocessors.OneHotEncoder(columns: List[str], *, max_categories: Dict[str, int] | None = None, output_columns: List[str] | None = None)

Bases: Preprocessor

One-hot encode categorical data.

This preprocessor transforms each specified column into a one-hot encoded vector. Each element in the vector corresponds to a unique category in the column, with a value of 1 if the category matches and 0 otherwise.

If a category is infrequent (based on max_categories) or not present in the fitted dataset, it is encoded as all 0s.

Columns must contain hashable objects or lists of hashable objects.


Lists are treated as categories. If you want to encode individual list elements, use MultiHotEncoder.
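The encoding rules above can be sketched in plain Python. This is a minimal illustration, not the Ray implementation: the helpers `fit_categories` and `one_hot` are hypothetical names, and the sorted category order and the tie-breaking among equally frequent categories are assumptions.

```python
from collections import Counter
from typing import Dict, Hashable, List, Optional


def fit_categories(
    values: List[Hashable], max_categories: Optional[int] = None
) -> Dict[Hashable, int]:
    """Map each kept category to its position in the one-hot vector."""
    counts = Counter(values)
    if max_categories is None:
        kept = list(counts)
    else:
        # Keep only the most frequent categories; the rest count as infrequent.
        kept = [value for value, _ in counts.most_common(max_categories)]
    # Assumption: positions follow sorted order of the kept categories.
    return {category: index for index, category in enumerate(sorted(kept))}


def one_hot(value: Hashable, index_of: Dict[Hashable, int]) -> List[int]:
    """Encode one value; unseen or infrequent values stay all zeros."""
    vector = [0] * len(index_of)
    if value in index_of:
        vector[index_of[value]] = 1
    return vector


colors = ["red", "green", "red", "red", "blue", "green"]

index_of = fit_categories(colors)
encoded = [one_hot(color, index_of) for color in colors]

# A value that was not in the fitted data is encoded as all zeros.
unseen = one_hot("yellow", index_of)

# With a cap of two categories, "blue" is infrequent and is encoded as all zeros.
top_two = fit_categories(colors, max_categories=2)
infrequent = one_hot("blue", top_two)
```

In this sketch the vector length equals the number of kept categories, so capping the categories shortens every vector in the column.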


>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import OneHotEncoder
>>> df = pd.DataFrame({"color": ["red", "green", "red", "red", "blue", "green"]})
>>> ds = ray.data.from_pandas(df)  
>>> encoder = OneHotEncoder(columns=["color"])
>>> encoder.fit_transform(ds).to_pandas()  
       color
0  [0, 0, 1]
1  [0, 1, 0]
2  [0, 0, 1]
3  [0, 0, 1]
4  [1, 0, 0]
5  [0, 1, 0]

OneHotEncoder can also be used in append mode by passing output_columns, the names of the columns that should hold the encoded values. The input columns are then kept alongside the encoded ones.

>>> encoder = OneHotEncoder(columns=["color"], output_columns=["color_encoded"])
>>> encoder.fit_transform(ds).to_pandas()  
   color color_encoded
0    red     [0, 0, 1]
1  green     [0, 1, 0]
2    red     [0, 0, 1]
3    red     [0, 0, 1]
4   blue     [1, 0, 0]
5  green     [0, 1, 0]

If you one-hot encode a value that isn’t in the fitted dataset, then the value is encoded with zeros.

>>> df = pd.DataFrame({"color": ["yellow"]})
>>> batch = ray.data.from_pandas(df)  
>>> encoder.transform(batch).to_pandas()  
    color color_encoded
0  yellow     [0, 0, 0]

Likewise, if you one-hot encode an infrequent value, then the value is encoded with zeros.

>>> encoder = OneHotEncoder(columns=["color"], max_categories={"color": 2})
>>> encoder.fit_transform(ds).to_pandas()  
   color_red  color_green
0          1            0
1          0            1
2          1            0
3          1            0
4          0            0
5          0            1

Parameters:

  • columns – The columns to separately encode.

  • max_categories – The maximum number of features to create for each column. If a value isn’t specified for a column, then a feature is created for every category in that column.

  • output_columns – The names of the transformed columns. If None, the transformed columns will be the same as the input columns. If not None, the length of output_columns must match the length of columns, otherwise an error will be raised.
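The output_columns rule can be sketched as a small check. This is a hedged illustration only: `resolve_output_columns` is a hypothetical helper, not part of the Ray API, and raising `ValueError` is an assumption, since the text above only says that an error is raised.

```python
from typing import List, Optional


def resolve_output_columns(
    columns: List[str], output_columns: Optional[List[str]] = None
) -> List[str]:
    """Return the column names that will hold the encoded values."""
    if output_columns is None:
        # In-place mode: the encoded values replace the input columns.
        return list(columns)
    if len(output_columns) != len(columns):
        # Assumption: the error type is ValueError.
        raise ValueError(
            f"Expected {len(columns)} output columns, got {len(output_columns)}."
        )
    # Append mode: each input column gets its own output column.
    return list(output_columns)


in_place = resolve_output_columns(["color"])
appended = resolve_output_columns(["color"], ["color_encoded"])
```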

See also


If you want to encode individual list elements, use MultiHotEncoder.


If your categories are ordered, you may want to use OrdinalEncoder.
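The difference between treating a list as one category and encoding its individual elements can be sketched in plain Python. This is a conceptual illustration with made-up data, not the behavior of either Ray class: converting lists to tuples, the sorted category order, and the use of binary presence flags for the per-element case are all assumptions.

```python
rows = [["comedy", "action"], ["action"], ["comedy", "action"]]

# Whole-list view: each distinct list is a single category.
# Lists are converted to tuples here only so that they can be hashed.
list_categories = sorted({tuple(row) for row in rows})
list_index = {category: i for i, category in enumerate(list_categories)}
one_hot_rows = [
    [1 if list_index[tuple(row)] == i else 0 for i in range(len(list_index))]
    for row in rows
]

# Per-element view: each distinct element is a category,
# so one row can set several positions at once.
element_categories = sorted({element for row in rows for element in row})
multi_hot_rows = [
    [1 if category in row else 0 for category in element_categories]
    for row in rows
]
```

In the whole-list view every row has exactly one 1, while in the per-element view a row has one 1 for each distinct element it contains.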

PublicAPI (alpha): This API is in alpha and may change before becoming stable.



Methods

  • deserialize – Load the original preprocessor serialized via self.serialize().

  • fit – Fit this Preprocessor to the Dataset.

  • fit_transform – Fit this Preprocessor to the Dataset and then transform the Dataset.

  • preferred_batch_format – Batch format hint for upstream producers to try yielding best block format.

  • serialize – Return this preprocessor serialized as a string.

  • transform – Transform the given dataset.

  • transform_batch – Transform a single batch of data.