ray.data.preprocessors.StandardScaler
- class ray.data.preprocessors.StandardScaler(columns: List[str])
Bases: Preprocessor
Translate and scale each column by its mean and standard deviation, respectively.
The general formula is given by
\[x' = \frac{x - \bar{x}}{s}\]
where \(x\) is the column, \(x'\) is the transformed column, \(\bar{x}\) is the column average, and \(s\) is the column’s standard deviation. The outputs in the examples below match the population form of the standard deviation, which divides by \(n\) rather than \(n - 1\). If \(s = 0\) (i.e., the column is constant-valued), then the transformed column will contain zeros.
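As a quick check of the formula, here is a minimal pure-Python sketch. The helper `standardize` is hypothetical and is not part of Ray. It is only meant to show which standard deviation reproduces the numbers in the examples on this page: for the column X1 = [-2, 0, 2], dividing by the population standard deviation gives ±1.224745, while the \(n - 1\) form would give ±1.

```python
import statistics
from typing import List


def standardize(values: List[float], population: bool = True) -> List[float]:
    """Hypothetical helper: apply x' = (x - mean) / s to one column."""
    mean = statistics.fmean(values)
    # pstdev divides by n, stdev divides by n - 1.
    std = statistics.pstdev(values) if population else statistics.stdev(values)
    if std == 0:
        # Constant-valued column: fill with zeros instead of dividing by zero.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


x1 = [-2, 0, 2]
population_scaled = standardize(x1, population=True)
sample_scaled = standardize(x1, population=False)
constant_scaled = standardize([1, 1, 1])

print([round(v, 6) for v in population_scaled])  # [-1.224745, 0.0, 1.224745]
print(sample_scaled)                             # [-1.0, 0.0, 1.0]
print(constant_scaled)                           # [0.0, 0.0, 0.0]
```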
Warning
StandardScaler works best when your data is normal. If your data isn’t approximately normal, then the transformed features won’t be meaningful.

Examples
>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import StandardScaler
>>>
>>> df = pd.DataFrame({"X1": [-2, 0, 2], "X2": [-3, -3, 3], "X3": [1, 1, 1]})
>>> ds = ray.data.from_pandas(df)
>>> ds.to_pandas()
   X1  X2  X3
0  -2  -3   1
1   0  -3   1
2   2   3   1
Columns are scaled separately.
>>> preprocessor = StandardScaler(columns=["X1", "X2"])
>>> preprocessor.fit_transform(ds).to_pandas()
         X1        X2  X3
0 -1.224745 -0.707107   1
1  0.000000 -0.707107   1
2  1.224745  1.414214   1
Constant-valued columns get filled with zeros.
>>> preprocessor = StandardScaler(columns=["X3"])
>>> preprocessor.fit_transform(ds).to_pandas()
   X1  X2   X3
0  -2  -3  0.0
1   0  -3  0.0
2   2   3  0.0
- Parameters:
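columns – The columns to scale. Each listed column is scaled separately, using its own mean and standard deviation; columns not listed are left unchanged.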
PublicAPI (alpha): This API is in alpha and may change before becoming stable.
Methods
- deserialize: Load the original preprocessor serialized via self.serialize().
- fit: Fit this Preprocessor to the Dataset.
- fit_transform: Fit this Preprocessor to the Dataset and then transform the Dataset.
- preferred_batch_format: Batch format hint for upstream producers to try yielding best block format.
- serialize: Return this preprocessor serialized as a string.
- transform: Transform the given dataset.
- transform_batch: Transform a single batch of data.
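The split between fitting and transforming matters because the statistics learned during fitting are reused on any later data. The sketch below illustrates that pattern in plain Python. `TinyStandardScaler` and the dict-of-lists `Batch` type are hypothetical stand-ins introduced for illustration only; they are not Ray's implementation and do not use Ray Datasets. The population standard deviation is assumed, in line with the example outputs on this page.

```python
import statistics
from typing import Dict, List, Tuple

Batch = Dict[str, List[float]]  # hypothetical column-name -> values mapping


class TinyStandardScaler:
    """Hypothetical stand-in showing the fit/transform pattern (not Ray code)."""

    def __init__(self, columns: List[str]):
        self.columns = columns
        self.stats_: Dict[str, Tuple[float, float]] = {}

    def fit(self, data: Batch) -> "TinyStandardScaler":
        # Learn one (mean, std) pair per selected column.
        for col in self.columns:
            values = data[col]
            self.stats_[col] = (statistics.fmean(values), statistics.pstdev(values))
        return self

    def transform(self, data: Batch) -> Batch:
        if not self.stats_:
            raise RuntimeError("call fit() before transform()")
        # Copy every column, then overwrite only the selected ones.
        out = {name: list(values) for name, values in data.items()}
        for col in self.columns:
            mean, std = self.stats_[col]
            out[col] = [0.0 if std == 0 else (v - mean) / std for v in data[col]]
        return out

    def fit_transform(self, data: Batch) -> Batch:
        return self.fit(data).transform(data)


train = {"X1": [-2, 0, 2], "X3": [1, 1, 1]}
test = {"X1": [4, -4], "X3": [5, 5]}

# Statistics come from the training data and are then applied to new data.
scaler = TinyStandardScaler(columns=["X1"]).fit(train)
scaled_test = scaler.transform(test)
print(scaled_test)
```

Here the test values are scaled with the training mean and standard deviation, and the unselected column X3 passes through unchanged.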