ray.data.preprocessors.RobustScaler
class ray.data.preprocessors.RobustScaler(columns: List[str], quantile_range: Tuple[float, float] = (0.25, 0.75), output_columns: List[str] | None = None, quantile_precision: int = 800)
Bases: SerializablePreprocessorBase

Scale and translate each column using approximate quantiles.
The general formula is given by

\[x' = \frac{x - \mu_{1/2}}{\mu_h - \mu_l}\]

where \(x\) is the column, \(x'\) is the transformed column, and \(\mu_{1/2}\) is the column median. \(\mu_{h}\) and \(\mu_{l}\) are the high and low quantiles, respectively. By default, \(\mu_{h}\) is the third quartile and \(\mu_{l}\) is the first quartile.
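To make the formula concrete, here is a minimal NumPy sketch that applies it to a single column. It is an illustration under stated assumptions, not the library's implementation: the helper name `robust_scale` is hypothetical, and it uses exact quantiles from `np.quantile` where RobustScaler uses approximate ones.

```python
import numpy as np


def robust_scale(x, quantile_range=(0.25, 0.75)):
    """Apply x' = (x - median) / (high - low) using exact quantiles.

    Hypothetical helper for illustration only; RobustScaler itself
    computes approximate quantiles.
    """
    x = np.asarray(x, dtype=float)
    low, high = np.quantile(x, quantile_range)  # mu_l and mu_h
    median = np.median(x)                       # mu_{1/2}
    return (x - median) / (high - low)


# Column "X2" from the Examples section: median 8, quartiles 5 and 13.
scaled = robust_scale([13, 5, 14, 2, 8])
print(scaled)  # values: 0.625, -0.375, 0.75, -0.75, 0.0
```

On this small column the exact quantiles give the same values as the `X2` output shown in the Examples section. On larger datasets the approximate quantiles may differ slightly from exact ones.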
Internally, the ApproximateQuantile aggregator is used to calculate the approximate quantiles.

Tip
This scaler works well when your data contains many outliers.
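The following sketch illustrates why quantile-based scaling copes with outliers. It compares the formula above against mean and standard deviation scaling on a made-up column containing one extreme value. It uses plain NumPy with exact quantiles and is only an illustration, not a call into RobustScaler.

```python
import numpy as np

# Made-up column: four ordinary values and one extreme outlier.
values = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])

# Quantile-based scaling: the median and quartiles ignore the outlier's magnitude.
low, high = np.quantile(values, [0.25, 0.75])
robust = (values - np.median(values)) / (high - low)

# Mean/std scaling: both statistics are dominated by the outlier.
standard = (values - values.mean()) / values.std()

# Spread of the four ordinary values after each kind of scaling.
robust_spread = robust[3] - robust[0]
standard_spread = standard[3] - standard[0]
print(robust_spread, standard_spread)
```

With quantile-based scaling the four ordinary values stay well separated (a spread of 1.5), while mean and standard deviation scaling compresses them into a range smaller than 0.01.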
Examples
>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import RobustScaler
>>>
>>> df = pd.DataFrame({
...     "X1": [1, 2, 3, 4, 5],
...     "X2": [13, 5, 14, 2, 8],
...     "X3": [1, 2, 2, 2, 3],
... })
>>> ds = ray.data.from_pandas(df)
>>> ds.to_pandas()
   X1  X2  X3
0   1  13   1
1   2   5   2
2   3  14   2
3   4   2   2
4   5   8   3

RobustScaler separately scales each column.

>>> preprocessor = RobustScaler(columns=["X1", "X2"])
>>> preprocessor.fit_transform(ds).to_pandas()
    X1     X2  X3
0 -1.0  0.625   1
1 -0.5 -0.375   2
2  0.0  0.750   2
3  0.5 -0.750   2
4  1.0  0.000   3

>>> preprocessor = RobustScaler(
...     columns=["X1", "X2"],
...     output_columns=["X1_scaled", "X2_scaled"]
... )
>>> preprocessor.fit_transform(ds).to_pandas()
   X1  X2  X3  X1_scaled  X2_scaled
0   1  13   1       -1.0      0.625
1   2   5   2       -0.5     -0.375
2   3  14   2        0.0      0.750
3   4   2   2        0.5     -0.750
4   5   8   3        1.0      0.000
Parameters:
columns – The columns to separately scale.
quantile_range – A tuple that defines the lower and upper quantiles. Values must be between 0 and 1. Defaults to the 1st and 3rd quartiles: (0.25, 0.75).
output_columns – The names of the transformed columns. If None, the transformed columns will be the same as the input columns. If not None, the length of output_columns must match the length of columns; otherwise, an error will be raised.
quantile_precision – Controls the accuracy and memory footprint of the sketch (K in KLL); higher values yield lower error but use more memory. Defaults to 800. See https://datasketches.apache.org/docs/KLL/KLLAccuracyAndSize.html for details on accuracy and size.
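As a rough illustration of what quantile_range controls, the sketch below computes the scaling denominator for the default range and for a wider (0.1, 0.9) range on the `X1` column from the examples. It uses exact NumPy quantiles as a stand-in for the approximate ones, so treat the numbers as indicative only. The variable names are illustrative.

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
median = np.median(x1)

# Default range: first and third quartiles.
q_low, q_high = np.quantile(x1, (0.25, 0.75))
default_scaled = (x1 - median) / (q_high - q_low)

# Wider range: 10th and 90th percentiles give a larger denominator,
# so the scaled values end up closer to zero.
w_low, w_high = np.quantile(x1, (0.1, 0.9))
wide_scaled = (x1 - median) / (w_high - w_low)

print(default_scaled)
print(wide_scaled)
```

Widening the range increases the denominator in the formula, which shrinks the magnitude of every scaled value while leaving the median at zero.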
PublicAPI (alpha): This API is in alpha and may change before becoming stable.
Methods
Deserialize a preprocessor from serialized data.
Fit this Preprocessor to the Dataset.
Fit this Preprocessor to the Dataset and then transform the Dataset.
Get the preprocessor class identifier for this preprocessor class.
Get the version number for this preprocessor class.
Batch format hint for upstream producers to try yielding best block format.
Serialize this preprocessor to a string or bytes.
Set the preprocessor class identifier for this preprocessor class.
Set the version number for this preprocessor class.
Transform the given dataset.
Transform a single batch of data.
Attributes