ray.data.preprocessors.OrdinalEncoder

class ray.data.preprocessors.OrdinalEncoder(columns: List[str], *, encode_lists: bool = True, output_columns: List[str] | None = None)

Bases: SerializablePreprocessorBase

Encode values within columns as ordered integer values.

OrdinalEncoder encodes categorical features as integers that range from \(0\) to \(n - 1\), where \(n\) is the number of categories.

If you transform a value that isn’t in the fitted dataset, then the value is encoded as float("nan").

Columns must contain either hashable values or lists of hashable values. A single column can’t mix scalars and lists.
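The encoding rule described above can be sketched in plain Python. This is an illustration only, not Ray's implementation: the helper names fit_ordinal and transform_ordinal are hypothetical, and the sketch assumes categories are indexed in sorted order, which is consistent with the outputs shown in the examples on this page.

```python
import math


def fit_ordinal(values):
    """Map each distinct value to an integer index (sorted order is an assumption)."""
    return {value: index for index, value in enumerate(sorted(set(values)))}


def transform_ordinal(values, mapping):
    """Encode values with a fitted mapping; values not seen during fit become NaN."""
    return [mapping.get(value, float("nan")) for value in values]


levels = ["L4", "L5", "L3", "L4"]
level_mapping = fit_ordinal(levels)
encoded_levels = transform_ordinal(levels, level_mapping)
unseen = transform_ordinal(["L6"], level_mapping)

print(level_mapping)           # {'L3': 0, 'L4': 1, 'L5': 2}
print(encoded_levels)          # [1, 2, 0, 1]
print(math.isnan(unseen[0]))   # True
```

With n = 3 categories the codes run from 0 to 2, and the unseen value "L6" has no entry in the mapping, so it falls back to NaN.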

Examples

Use OrdinalEncoder to encode categorical features as integers.

>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import OrdinalEncoder
>>> df = pd.DataFrame({
...     "sex": ["male", "female", "male", "female"],
...     "level": ["L4", "L5", "L3", "L4"],
... })
>>> ds = ray.data.from_pandas(df)  
>>> encoder = OrdinalEncoder(columns=["sex", "level"])
>>> encoder.fit_transform(ds).to_pandas()  
   sex  level
0    1      1
1    0      2
2    1      0
3    0      1

OrdinalEncoder can also run in append mode: pass output_columns to name the new columns that hold the encoded values, and the input columns are left unchanged.

>>> encoder = OrdinalEncoder(columns=["sex", "level"], output_columns=["sex_encoded", "level_encoded"])
>>> encoder.fit_transform(ds).to_pandas()  
      sex level  sex_encoded  level_encoded
0    male    L4            1              1
1  female    L5            0              2
2    male    L3            1              0
3  female    L4            0              1

If you transform a value that wasn’t present in the dataset the encoder was fitted on, then the value is encoded as float("nan").

>>> df = pd.DataFrame({"sex": ["female"], "level": ["L6"]})
>>> ds = ray.data.from_pandas(df)  
>>> encoder.transform(ds).to_pandas()  
   sex  level
0    0    NaN

OrdinalEncoder can also encode categories in a list.

>>> df = pd.DataFrame({
...     "name": ["Shaolin Soccer", "Moana", "The Smartest Guys in the Room"],
...     "genre": [
...         ["comedy", "action", "sports"],
...         ["animation", "comedy",  "action"],
...         ["documentary"],
...     ],
... })
>>> ds = ray.data.from_pandas(df)  
>>> encoder = OrdinalEncoder(columns=["genre"])
>>> encoder.fit_transform(ds).to_pandas()  
                            name      genre
0                 Shaolin Soccer  [2, 0, 4]
1                          Moana  [1, 2, 0]
2  The Smartest Guys in the Room        [3]

Parameters:
  • columns – The columns to separately encode.

  • encode_lists – If True, encode list elements. If False, encode whole lists (i.e., replace each list with an integer). True by default.

  • output_columns – The names of the transformed columns. If None, the transformed columns keep the same names as the input columns. If not None, the length of output_columns must match the length of columns; otherwise, an error is raised.
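
The difference between the two encode_lists settings can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Ray's implementation: fit_list_column and transform_list_column are hypothetical helpers, whole lists are converted to tuples so they can be hashed, and categories are assumed to be indexed in sorted order. The integers Ray assigns when encode_lists=False may differ from the ones produced here.

```python
def fit_list_column(rows, encode_lists=True):
    """Build a category-to-index mapping for a column of lists."""
    if encode_lists:
        # Each list element is its own category.
        categories = {item for row in rows for item in row}
    else:
        # Each whole list is one category; tuples make the lists hashable.
        categories = {tuple(row) for row in rows}
    return {category: index for index, category in enumerate(sorted(categories))}


def transform_list_column(rows, mapping, encode_lists=True):
    """Encode a column of lists element-wise or as whole lists."""
    if encode_lists:
        return [[mapping.get(item, float("nan")) for item in row] for row in rows]
    return [mapping.get(tuple(row), float("nan")) for row in rows]


genres = [
    ["comedy", "action", "sports"],
    ["animation", "comedy", "action"],
    ["documentary"],
]

element_mapping = fit_list_column(genres, encode_lists=True)
per_element = transform_list_column(genres, element_mapping, encode_lists=True)

list_mapping = fit_list_column(genres, encode_lists=False)
per_list = transform_list_column(genres, list_mapping, encode_lists=False)

print(per_element)  # [[2, 0, 4], [1, 2, 0], [3]]
print(per_list)     # [1, 0, 2]
```

Element-wise encoding keeps the shape of each list, while whole-list encoding collapses each row to a single integer.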

See also

OneHotEncoder

Another preprocessor that encodes categorical data.

PublicAPI (alpha): This API is in alpha and may change before becoming stable.

Methods

deserialize

Deserialize a preprocessor from serialized data.

fit

Fit this Preprocessor to the Dataset.

fit_transform

Fit this Preprocessor to the Dataset and then transform the Dataset.

get_preprocessor_class_id

Get the preprocessor class identifier for this preprocessor class.

get_version

Get the version number for this preprocessor class.

preferred_batch_format

Batch format hint for upstream producers to try yielding best block format.

serialize

Serialize this preprocessor to a string or bytes.

set_preprocessor_class_id

Set the preprocessor class identifier for this preprocessor class.

set_version

Set the version number for this preprocessor class.

transform

Transform the given dataset.

transform_batch

Transform a single batch of data.

Attributes

MAGIC_CLOUDPICKLE

SERIALIZER_FORMAT_VERSION