ray.data.preprocessors.StandardScaler

class ray.data.preprocessors.StandardScaler(columns: List[str], output_columns: List[str] | None = None)

Bases: Preprocessor

Translate and scale each column by its mean and standard deviation, respectively.

The general formula is given by

\[x' = \frac{x - \bar{x}}{s}\]

where \(x\) is the column, \(x'\) is the transformed column, \(\bar{x}\) is the column mean, and \(s\) is the column’s standard deviation. The example output below is consistent with the population form of \(s\) (dividing by \(n\) rather than \(n - 1\)). If \(s = 0\) (i.e., the column is constant-valued), then the transformed column will contain zeros.
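
As a rough illustration of the formula, the following plain-Python sketch scales one column at a time. It is not Ray's implementation: `standard_scale` is a hypothetical helper, and the population standard deviation (`statistics.pstdev`) is an assumption chosen because it reproduces the example values shown in this section.

```python
from statistics import fmean, pstdev


def standard_scale(values):
    """Return (x - mean) / s for each x, or zeros when s == 0."""
    mean = fmean(values)
    # Assumption: pstdev divides by n (population form), which matches
    # the example output in this section.
    std = pstdev(values)
    if std == 0:
        # Constant-valued column: fill with zeros.
        return [0.0 for _ in values]
    return [(x - mean) / std for x in values]


print([round(v, 6) for v in standard_scale([-2, 0, 2])])   # [-1.224745, 0.0, 1.224745]
print([round(v, 6) for v in standard_scale([-3, -3, 3])])  # [-0.707107, -0.707107, 1.414214]
print(standard_scale([1, 1, 1]))                           # [0.0, 0.0, 0.0]
```

Dividing by \(n - 1\) instead would give \(-1.0, 0.0, 1.0\) for the first column, so the two conventions are easy to tell apart on small inputs.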

Warning

StandardScaler works best when your data is approximately normally distributed. If it isn’t, the transformed features may be less meaningful.

Examples

>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import StandardScaler
>>>
>>> df = pd.DataFrame({"X1": [-2, 0, 2], "X2": [-3, -3, 3], "X3": [1, 1, 1]})
>>> ds = ray.data.from_pandas(df)  
>>> ds.to_pandas()  
   X1  X2  X3
0  -2  -3   1
1   0  -3   1
2   2   3   1

Columns are scaled separately.

>>> preprocessor = StandardScaler(columns=["X1", "X2"])
>>> preprocessor.fit_transform(ds).to_pandas()  
         X1        X2  X3
0 -1.224745 -0.707107   1
1  0.000000 -0.707107   1
2  1.224745  1.414214   1

Constant-valued columns get filled with zeros.

>>> preprocessor = StandardScaler(columns=["X3"])
>>> preprocessor.fit_transform(ds).to_pandas()  
   X1  X2   X3
0  -2  -3  0.0
1   0  -3  0.0
2   2   3  0.0

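Pass output_columns to write the scaled values to new columns and keep the original columns unchanged.

>>> preprocessor = StandardScaler(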
...     columns=["X1", "X2"],
...     output_columns=["X1_scaled", "X2_scaled"]
... )
>>> preprocessor.fit_transform(ds).to_pandas()  
   X1  X2  X3  X1_scaled  X2_scaled
0  -2  -3   1  -1.224745  -0.707107
1   0  -3   1   0.000000  -0.707107
2   2   3   1   1.224745   1.414214

Parameters:

columns – The columns to separately scale.
output_columns – The names of the transformed columns. If None, the transformed columns will be the same as the input columns. If not None, the length of output_columns must match the length of columns; otherwise, an error is raised.

PublicAPI (alpha): This API is in alpha and may change before becoming stable.

Methods

`deserialize`	Load the original preprocessor serialized via `self.serialize()`.
`fit`	Fit this Preprocessor to the Dataset.
`fit_transform`	Fit this Preprocessor to the Dataset and then transform the Dataset.
`preferred_batch_format`	Batch format hint for upstream producers to try yielding best block format.
`serialize`	Return this preprocessor serialized as a string.
`transform`	Transform the given dataset.
`transform_batch`	Transform a single batch of data.
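
The split between `fit` and `transform` can be pictured with a small plain-Python stand-in. This is only a sketch under stated assumptions: `TinyStandardScaler` is a hypothetical class operating on dict-of-lists data, not part of Ray, and it assumes statistics are computed once during fitting and then reused for any later data.

```python
from statistics import fmean, pstdev


class TinyStandardScaler:
    """Hypothetical stand-in mimicking fit/transform on dict-of-lists data."""

    def __init__(self, columns):
        self.columns = columns
        self.stats_ = {}

    def fit(self, data):
        # Store the mean and (population) standard deviation per selected column.
        for col in self.columns:
            self.stats_[col] = (fmean(data[col]), pstdev(data[col]))
        return self

    def transform(self, data):
        # Reuse the stored statistics; unselected columns pass through unchanged.
        out = dict(data)
        for col in self.columns:
            mean, std = self.stats_[col]
            out[col] = [0.0 if std == 0 else (x - mean) / std for x in data[col]]
        return out

    def fit_transform(self, data):
        return self.fit(data).transform(data)


train = {"X1": [-2, 0, 2], "X3": [1, 1, 1]}
new_rows = {"X1": [4], "X3": [1]}

scaler = TinyStandardScaler(columns=["X1"]).fit(train)
scaled_new = scaler.transform(new_rows)
print(round(scaled_new["X1"][0], 6))  # 2.44949
```

In this sketch the new row is scaled with the mean and standard deviation learned from `train`, which is the usual reason to fit on training data and apply the same fitted preprocessor to validation or serving data.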