ray.data.preprocessors.StandardScaler
class ray.data.preprocessors.StandardScaler(columns: List[str], output_columns: List[str] | None = None)
Bases: Preprocessor
Translate and scale each column by its mean and standard deviation, respectively.
The general formula is given by
\[x' = \frac{x - \bar{x}}{s}\]
where \(x\) is the column, \(x'\) is the transformed column, \(\bar{x}\) is the column average, and \(s\) is the column’s sample standard deviation. If \(s = 0\) (i.e., the column is constant-valued), then the transformed column will contain zeros.
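The formula can be checked by hand with a minimal standard-library sketch. The helper name `standard_scale` is hypothetical and is not part of Ray. One assumption is worth flagging: the numeric outputs shown in the examples on this page (for instance ±1.224745 for the column `[-2, 0, 2]`) are reproduced when \(s\) is computed by dividing by \(N\) (population form, `statistics.pstdev`); dividing by \(N - 1\) would give ±1.0 instead. The sketch therefore uses the population form so that it matches those outputs. It makes no claim about how the library computes \(s\) internally.

```python
from statistics import fmean, pstdev


def standard_scale(values):
    """Apply x' = (x - mean) / s to one column (hypothetical helper, not Ray API)."""
    mean = fmean(values)
    # Assumption: divide by N (population form), which reproduces the
    # example outputs shown on this page.
    std = pstdev(values)
    if std == 0:
        # Constant-valued column: the transformed column is all zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


x1_scaled = standard_scale([-2, 0, 2])
x2_scaled = standard_scale([-3, -3, 3])
x3_scaled = standard_scale([1, 1, 1])
```

Here `x1_scaled` and `x2_scaled` agree with the scaled `X1` and `X2` columns in the examples to six decimal places, and `x3_scaled` is all zeros.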
Warning
StandardScaler works best when your data is normal. If your data isn’t approximately normal, then the transformed features won’t be meaningful.
Examples
>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import StandardScaler
>>>
>>> df = pd.DataFrame({"X1": [-2, 0, 2], "X2": [-3, -3, 3], "X3": [1, 1, 1]})
>>> ds = ray.data.from_pandas(df)
>>> ds.to_pandas()
   X1  X2  X3
0  -2  -3   1
1   0  -3   1
2   2   3   1
Columns are scaled separately.
>>> preprocessor = StandardScaler(columns=["X1", "X2"])
>>> preprocessor.fit_transform(ds).to_pandas()
         X1        X2  X3
0 -1.224745 -0.707107   1
1  0.000000 -0.707107   1
2  1.224745  1.414214   1
Constant-valued columns get filled with zeros.
>>> preprocessor = StandardScaler(columns=["X3"])
>>> preprocessor.fit_transform(ds).to_pandas()
   X1  X2   X3
0  -2  -3  0.0
1   0  -3  0.0
2   2   3  0.0
Use output_columns to write the scaled values to new columns and keep the original columns unchanged.
>>> preprocessor = StandardScaler(
...     columns=["X1", "X2"],
...     output_columns=["X1_scaled", "X2_scaled"]
... )
>>> preprocessor.fit_transform(ds).to_pandas()
   X1  X2  X3  X1_scaled  X2_scaled
0  -2  -3   1  -1.224745  -0.707107
1   0  -3   1   0.000000  -0.707107
2   2   3   1   1.224745   1.414214
Parameters:
columns – The columns to separately scale.
output_columns – The names of the transformed columns. If None, the transformed columns keep the names of the input columns, so the input columns are overwritten. If not None, the length of output_columns must match the length of columns, otherwise an error will be raised.
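The relationship between columns and output_columns can be illustrated with a small standard-library sketch. Everything here is hypothetical and is not Ray's implementation: the helper `scale_table`, the dict-of-lists table representation, and the choice of `ValueError` (the parameter description above only says that an error is raised, not which one). The divide-by-\(N\) form of the standard deviation is also an assumption, chosen because it reproduces the example outputs on this page.

```python
from statistics import fmean, pstdev


def scale_table(table, columns, output_columns=None):
    """Scale selected columns of a dict-of-lists table (hypothetical helper)."""
    if output_columns is None:
        # No output names given: scaled values replace the input columns.
        output_columns = columns
    if len(output_columns) != len(columns):
        # Assumption: ValueError is this sketch's choice of error type.
        raise ValueError("output_columns must have the same length as columns")

    result = {name: list(values) for name, values in table.items()}
    for src, dst in zip(columns, output_columns):
        mean = fmean(table[src])
        std = pstdev(table[src])  # assumption: divide by N
        result[dst] = [0.0 if std == 0 else (v - mean) / std for v in table[src]]
    return result


table = {"X1": [-2, 0, 2], "X3": [1, 1, 1]}
in_place = scale_table(table, ["X1"])
side_by_side = scale_table(table, ["X1"], ["X1_scaled"])
```

In `in_place`, the `X1` column holds the scaled values. In `side_by_side`, `X1` keeps its original values and `X1_scaled` is added next to it, which mirrors the last example above.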
PublicAPI (alpha): This API is in alpha and may change before becoming stable.
Methods
deserialize – Load the original preprocessor serialized via self.serialize().
fit – Fit this Preprocessor to the Dataset.
fit_transform – Fit this Preprocessor to the Dataset and then transform the Dataset.
preferred_batch_format – Batch format hint for upstream producers to try yielding best block format.
serialize – Return this preprocessor serialized as a string.
transform – Transform the given dataset.
transform_batch – Transform a single batch of data.
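The split between fitting and transforming means the mean and standard deviation are computed once from one dataset and then reused on other data. The following standard-library sketch shows that idea only. `TinyScaler` is a hypothetical stand-in, not Ray's Preprocessor: it works on dict-of-lists tables, divides by \(N\) when computing the standard deviation (an assumption that matches the example outputs on this page), and its `RuntimeError` for an unfitted scaler is this sketch's own choice.

```python
from statistics import fmean, pstdev


class TinyScaler:
    """Hypothetical stand-in that illustrates the fit / transform split."""

    def __init__(self, columns):
        self.columns = columns
        self.stats_ = None  # filled in by fit()

    def fit(self, table):
        # Record (mean, std) per column; assumption: std divides by N.
        self.stats_ = {c: (fmean(table[c]), pstdev(table[c])) for c in self.columns}
        return self

    def transform(self, table):
        if self.stats_ is None:
            # Error type is this sketch's choice, not taken from Ray.
            raise RuntimeError("call fit() before transform()")
        out = {name: list(values) for name, values in table.items()}
        for c, (mean, std) in self.stats_.items():
            out[c] = [0.0 if std == 0 else (v - mean) / std for v in table[c]]
        return out

    def fit_transform(self, table):
        return self.fit(table).transform(table)


train = {"X1": [-2, 0, 2]}
unseen = {"X1": [4]}
scaler = TinyScaler(["X1"]).fit(train)
scaled_unseen = scaler.transform(unseen)
```

The value in `scaled_unseen` is computed from the statistics of `train`, not of `unseen`, which is the usual reason to call fit on training data and transform on everything else.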