Normalizer

class ray.data.preprocessors.Normalizer(columns: List[str], norm: str = 'l2', *, output_columns: List[str] | None = None)

Bases: SerializablePreprocessorBase

Scales each sample to have unit norm.

This preprocessor works by dividing each sample (i.e., row) by the sample’s norm. The general formula is given by

\[s' = \frac{s}{\lVert s \rVert_p}\]

where \(s\) is the sample, \(s'\) is the transformed sample, \(\lVert s \rVert_p\) is the norm of the sample, and \(p\) is the norm type.

The following norms are supported:

"l1" (\(L^1\)): Sum of the absolute values.
"l2" (\(L^2\)): Square root of the sum of the squared values.
"max" (\(L^\infty\)): Maximum of the absolute values.
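
The formula and the three norms above can be sketched in plain Python. This is a minimal illustration only, not Ray's implementation: `NORM_FNS` and `normalize_row` are hypothetical names introduced here, and the sketch assumes each row is a list of numbers with at least one nonzero entry.

```python
import math

# Hypothetical helpers for illustration; these names are not part of Ray's API.
NORM_FNS = {
    "l1": lambda row: sum(abs(v) for v in row),            # sum of absolute values
    "l2": lambda row: math.sqrt(sum(v * v for v in row)),  # square root of sum of squares
    "max": lambda row: max(abs(v) for v in row),           # largest absolute value
}


def normalize_row(row, norm="l2"):
    """Divide every value in a row by the row's norm (s' = s / ||s||_p)."""
    if norm not in NORM_FNS:
        raise ValueError(f"Unsupported norm {norm!r}; expected one of {sorted(NORM_FNS)}.")
    scale = NORM_FNS[norm](row)
    # Assumption: the row is not all zeros, so scale is nonzero.
    return [v / scale for v in row]


rows = [[1, 1], [1, 0]]
l1_rows = [normalize_row(r, norm="l1") for r in rows]    # [[0.5, 0.5], [1.0, 0.0]]
max_rows = [normalize_row(r, norm="max") for r in rows]  # [[1.0, 1.0], [1.0, 0.0]]
```

An all-zero row has norm 0, so this sketch would raise `ZeroDivisionError` for it; how the preprocessor itself treats such rows is not stated on this page.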

Examples

>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import Normalizer
>>>
>>> df = pd.DataFrame({"X1": [1, 1], "X2": [1, 0], "X3": [0, 1]})
>>> ds = ray.data.from_pandas(df)
>>> ds.to_pandas()
   X1  X2  X3
0   1   1   0
1   1   0   1

Only "X1" and "X2" are normalized, so each norm is computed over those two columns. The \(L^2\)-norm of the first sample is \(\sqrt{2}\), and the \(L^2\)-norm of the second sample is \(1\).
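
For the first row, with \(s = (1, 1)\) taken from "X1" and "X2", the arithmetic works out as follows:

\[\lVert s \rVert_2 = \sqrt{1^2 + 1^2} = \sqrt{2}, \qquad s' = \left(\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}\right) \approx (0.707107, 0.707107)\]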

>>> preprocessor = Normalizer(columns=["X1", "X2"])
>>> preprocessor.fit_transform(ds).to_pandas()
         X1        X2  X3
0  0.707107  0.707107   0
1  1.000000  0.000000   1

The \(L^1\)-norm of the first sample is \(2\), and the \(L^1\)-norm of the second sample is \(1\).

>>> preprocessor = Normalizer(columns=["X1", "X2"], norm="l1")
>>> preprocessor.fit_transform(ds).to_pandas()
    X1   X2  X3
0  0.5  0.5   0
1  1.0  0.0   1

The \(L^\infty\)-norm of both samples is \(1\).

>>> preprocessor = Normalizer(columns=["X1", "X2"], norm="max")
>>> preprocessor.fit_transform(ds).to_pandas()
    X1   X2  X3
0  1.0  1.0   0
1  1.0  0.0   1

Normalizer can also be used in append mode by providing output_columns, the names of the columns that should hold the normalized values.

>>> preprocessor = Normalizer(columns=["X1", "X2"], output_columns=["X1_normalized", "X2_normalized"])
>>> preprocessor.fit_transform(ds).to_pandas()
   X1  X2  X3  X1_normalized  X2_normalized
0   1   1   0       0.707107       0.707107
1   1   0   1       1.000000       0.000000

Parameters:

columns – The columns to scale. For each row, these columns are scaled to unit-norm.
norm – The norm to use. The supported values are "l1", "l2", or "max". Defaults to "l2".
output_columns – The names of the transformed columns. If None, the transformed columns will be the same as the input columns. If not None, the length of output_columns must match the length of columns; otherwise, an error will be raised.

Raises:

ValueError – if norm is not "l1", "l2", or "max".

PublicAPI (alpha): This API is in alpha and may change before becoming stable.

Methods

`deserialize`	Deserialize a preprocessor from serialized data.
`fit`	Fit this Preprocessor to the Dataset.
`fit_transform`	Fit this Preprocessor to the Dataset and then transform the Dataset.
`get_preprocessor_class_id`	Get the preprocessor class identifier for this preprocessor class.
`get_version`	Get the version number for this preprocessor class.
`preferred_batch_format`	Batch format hint for upstream producers to try yielding best block format.
`serialize`	Serialize this preprocessor to a string or bytes.
`set_preprocessor_class_id`	Set the preprocessor class identifier for this preprocessor class.
`set_version`	Set the version number for this preprocessor class.
`transform`	Transform the given dataset.
`transform_batch`	Transform a single batch of data.

Attributes

`MAGIC_CLOUDPICKLE`
`SERIALIZER_FORMAT_VERSION`
`columns`
`norm`
`output_columns`
`stat_computation_plan`