ray.data.preprocessors.LabelEncoder

class ray.data.preprocessors.LabelEncoder(label_column: str, *, output_column: str | None = None)

Bases: SerializablePreprocessorBase

Encode labels as integer targets.

LabelEncoder encodes labels as integer targets that range from \(0\) to \(n - 1\), where \(n\) is the number of unique labels.

If you transform a label that isn’t in the fitted dataset, the label is encoded as float("nan").
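
The encoding rule can be sketched in plain Python, without Ray. This is an illustration, not Ray's implementation: the helper names fit_label_index and encode are hypothetical, and assigning codes in sorted order of the unique labels is an assumption that happens to agree with the example outputs on this page.

```python
import math

def fit_label_index(labels):
    """Map each unique label to an integer code from 0 to n - 1.

    Assumption: codes follow the sorted order of the unique labels.
    """
    return {label: code for code, label in enumerate(sorted(set(labels)))}

def encode(labels, label_to_code):
    """Encode labels; labels missing from the mapping become float('nan')."""
    return [label_to_code.get(label, float("nan")) for label in labels]

species = ["setosa", "versicolor", "setosa", "virginica"]
label_to_code = fit_label_index(species)

encoded = encode(species, label_to_code)        # labels seen during fitting
unseen = encode(["bracteata"], label_to_code)   # label not seen during fitting

print(label_to_code)
print(encoded)
print(math.isnan(unseen[0]))
```

Note that a single unseen label turns an otherwise integer result into one that contains a float, which is worth keeping in mind if downstream code expects integer targets.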

Examples

>>> import pandas as pd
>>> import ray
>>> df = pd.DataFrame({
...     "sepal_width": [5.1, 7, 4.9, 6.2],
...     "sepal_height": [3.5, 3.2, 3, 3.4],
...     "species": ["setosa", "versicolor", "setosa", "virginica"]
... })
>>> ds = ray.data.from_pandas(df)  
>>>
>>> from ray.data.preprocessors import LabelEncoder
>>> encoder = LabelEncoder(label_column="species")
>>> encoder.fit_transform(ds).to_pandas()  
   sepal_width  sepal_height  species
0          5.1           3.5        0
1          7.0           3.2        1
2          4.9           3.0        0
3          6.2           3.4        2

To use LabelEncoder in append mode, provide the name of an output column. The encoded labels are written to that column and the original label column is left unchanged.

>>> append_encoder = LabelEncoder(label_column="species", output_column="species_encoded")
>>> append_encoder.fit_transform(ds).to_pandas()
   sepal_width  sepal_height     species  species_encoded
0          5.1           3.5      setosa                0
1          7.0           3.2  versicolor                1
2          4.9           3.0      setosa                0
3          6.2           3.4   virginica                2

If you transform a label that wasn’t present in the dataset the encoder was fitted on, the new label is encoded as float("nan").

>>> df = pd.DataFrame({
...     "sepal_width": [4.2],
...     "sepal_height": [2.7],
...     "species": ["bracteata"]
... })
>>> ds = ray.data.from_pandas(df)  
>>> encoder.transform(ds).to_pandas()  
   sepal_width  sepal_height  species
0          4.2           2.7      NaN
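
Because float("nan") never compares equal to itself, an equality check cannot find unseen labels in encoded output. A minimal sketch in plain Python, with hypothetical rows standing in for an encoded dataset and a hypothetical helper is_unseen, separates the rows whose label was not seen during fitting:

```python
import math

# Hypothetical encoded rows: (original label, encoded value).
rows = [("setosa", 0), ("bracteata", float("nan")), ("virginica", 2)]

def is_unseen(code):
    # NaN is never equal to itself, so test with math.isnan instead of ==.
    return isinstance(code, float) and math.isnan(code)

# Rows that received a valid integer code.
known = [(label, code) for label, code in rows if not is_unseen(code)]

# Labels that were encoded as NaN because they were absent at fit time.
unseen_labels = [label for label, code in rows if is_unseen(code)]

print(known)
print(unseen_labels)
```

Whether to drop such rows, map them to a reserved code, or refit the encoder depends on the application; this sketch only shows how to detect them.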

Parameters:

label_column – A column containing labels that you want to encode.
output_column – The name of the column that will contain the encoded labels. If None, the encoded labels overwrite the values in label_column.

See also

OrdinalEncoder: If you’re encoding ordered features, use OrdinalEncoder instead of LabelEncoder.

PublicAPI (alpha): This API is in alpha and may change before becoming stable.

Methods

`deserialize`	Deserialize a preprocessor from serialized data.
`fit`	Fit this Preprocessor to the Dataset.
`fit_transform`	Fit this Preprocessor to the Dataset and then transform the Dataset.
`get_preprocessor_class_id`	Get the preprocessor class identifier for this preprocessor class.
`get_version`	Get the version number for this preprocessor class.
`inverse_transform`	Inverse transform the given dataset.
`preferred_batch_format`	Batch format hint for upstream producers to try yielding best block format.
`serialize`	Serialize this preprocessor to a string or bytes.
`set_preprocessor_class_id`	Set the preprocessor class identifier for this preprocessor class.
`set_version`	Set the version number for this preprocessor class.
`transform`	Transform the given dataset.
`transform_batch`	Transform a single batch of data.
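
Conceptually, inverse_transform maps integer codes back to the original labels. The sketch below shows that idea in plain Python using a hypothetical label_to_code mapping and a hypothetical decode helper. It is not Ray's implementation, and returning None for codes without a matching label is a choice made for this sketch only.

```python
# Hypothetical fitted mapping from label to integer code.
label_to_code = {"setosa": 0, "versicolor": 1, "virginica": 2}

# Invert the mapping so that each code points back to its label.
code_to_label = {code: label for label, code in label_to_code.items()}

def decode(codes):
    """Map integer codes back to labels; codes with no label become None."""
    return [code_to_label.get(code) for code in codes]

decoded = decode([0, 1, 0, 2])
print(decoded)
```

The inversion is only well defined because each label receives a distinct code, so the mapping is one-to-one.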

Attributes

`MAGIC_CLOUDPICKLE`
`SERIALIZER_FORMAT_VERSION`