ray.data.preprocessors.LabelEncoder

class ray.data.preprocessors.LabelEncoder(label_column: str, *, output_column: str | None = None)

Bases: Preprocessor

Encode labels as integer targets.

LabelEncoder encodes labels as integer targets that range from \(0\) to \(n - 1\), where \(n\) is the number of unique labels.

If you transform a label that isn’t in the fitted dataset, then the label is encoded as float("nan").
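The mapping described above can be sketched in plain Python. This is only an illustration, not Ray's implementation: the helper names fit_label_mapping and encode_labels are hypothetical, and assigning indices in sorted order is an assumption (it happens to match the example output shown on this page).

```python
import math


def fit_label_mapping(labels):
    """Map each unique label to an integer from 0 to n - 1.

    Assumption: indices follow the sorted order of the unique labels.
    """
    return {label: index for index, label in enumerate(sorted(set(labels)))}


def encode_labels(labels, mapping):
    """Encode labels; a label missing from the mapping becomes float("nan")."""
    return [mapping.get(label, float("nan")) for label in labels]


fitted_labels = ["setosa", "versicolor", "setosa", "virginica"]
mapping = fit_label_mapping(fitted_labels)

encoded = encode_labels(fitted_labels, mapping)
unseen = encode_labels(["bracteata"], mapping)

print(mapping)
print(encoded)
print(math.isnan(unseen[0]))
```

Because NaN is a float, a column that contains an unseen label can no longer be a pure integer column, which is why the encoded output for unseen labels appears as NaN rather than as an integer code.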

Examples

>>> import pandas as pd
>>> import ray
>>> df = pd.DataFrame({
...     "sepal_width": [5.1, 7, 4.9, 6.2],
...     "sepal_height": [3.5, 3.2, 3, 3.4],
...     "species": ["setosa", "versicolor", "setosa", "virginica"]
... })
>>> ds = ray.data.from_pandas(df)  
>>>
>>> from ray.data.preprocessors import LabelEncoder
>>> encoder = LabelEncoder(label_column="species")
>>> encoder.fit_transform(ds).to_pandas()  
   sepal_width  sepal_height  species
0          5.1           3.5        0
1          7.0           3.2        1
2          4.9           3.0        0
3          6.2           3.4        2

To keep the original label column and append the encoded labels as a new column (append mode), pass the name of the new column as output_column.

>>> encoder = LabelEncoder(label_column="species", output_column="species_encoded")
>>> encoder.fit_transform(ds).to_pandas()  
   sepal_width  sepal_height     species  species_encoded
0          5.1           3.5      setosa                0
1          7.0           3.2  versicolor                1
2          4.9           3.0      setosa                0
3          6.2           3.4   virginica                2

If you transform a label not present in the original dataset, then the new label is encoded as float("nan").

>>> df = pd.DataFrame({
...     "sepal_width": [4.2],
...     "sepal_height": [2.7],
...     "species": ["bracteata"]
... })
>>> ds = ray.data.from_pandas(df)  
>>> encoder.transform(ds).to_pandas()  
   sepal_width  sepal_height    species  species_encoded
0          4.2           2.7  bracteata              NaN

Parameters:
  • label_column – The name of the column containing the labels that you want to encode.

  • output_column – The name of the column that will contain the encoded labels. If None, the output column will have the same name as the input column.

See also

OrdinalEncoder

If you’re encoding ordered features, use OrdinalEncoder instead of LabelEncoder.

PublicAPI (alpha): This API is in alpha and may change before becoming stable.

Methods

  • deserialize – Load the original preprocessor serialized via self.serialize().

  • fit – Fit this Preprocessor to the Dataset.

  • fit_transform – Fit this Preprocessor to the Dataset and then transform the Dataset.

  • inverse_transform – Inverse transform the given dataset.

  • preferred_batch_format – Batch format hint for upstream producers to try yielding best block format.

  • serialize – Return this preprocessor serialized as a string.

  • transform – Transform the given dataset.

  • transform_batch – Transform a single batch of data.
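
To make the idea behind inverse_transform concrete, the following plain-Python sketch reverses a label-to-integer mapping. It is not Ray's implementation: the helper names invert_mapping and decode_labels are hypothetical, and decoding an unmatched code to None is a choice made for this sketch, since this page does not state how Ray handles that case.

```python
def invert_mapping(mapping):
    """Swap keys and values so integer codes map back to labels."""
    return {index: label for label, index in mapping.items()}


def decode_labels(codes, mapping):
    """Decode integer codes to labels; unmatched codes become None (sketch only)."""
    inverse = invert_mapping(mapping)
    return [inverse.get(code) for code in codes]


# Mapping consistent with the fitted example on this page.
label_to_code = {"setosa": 0, "versicolor": 1, "virginica": 2}

decoded = decode_labels([0, 1, 0, 2], label_to_code)
unmatched = decode_labels([7], label_to_code)

print(decoded)
print(unmatched)
```

The round trip only recovers labels that were present when the mapping was fitted. A label that was encoded as NaN carries no information about its original value.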