ray.data.preprocessors.RobustScaler

class ray.data.preprocessors.RobustScaler(columns: List[str], quantile_range: Tuple[float, float] = (0.25, 0.75), output_columns: List[str] | None = None)

Bases: Preprocessor

Scale and translate each column using quantiles.

The general formula is given by

\[x' = \frac{x - \mu_{1/2}}{\mu_h - \mu_l}\]

where \(x\) is the column, \(x'\) is the transformed column, and \(\mu_{1/2}\) is the column median. \(\mu_{h}\) and \(\mu_{l}\) are the high and low quantiles, respectively. By default, \(\mu_{h}\) is the third quartile and \(\mu_{l}\) is the first quartile.
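
To make the formula concrete, here is a minimal pure-Python sketch that applies it to a single column. The helper names `quantile` and `robust_scale` are hypothetical, and linear interpolation between order statistics is an assumption; Ray's internal quantile computation may differ, for example on large or distributed datasets.

```python
def quantile(values, q):
    """Return the q-th quantile (0 <= q <= 1) using linear interpolation.

    Assumption: interpolation between neighboring order statistics.
    """
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)          # fractional index into the sorted data
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] + (ordered[hi] - ordered[lo]) * frac


def robust_scale(values, quantile_range=(0.25, 0.75)):
    """Apply x' = (x - median) / (high quantile - low quantile) to one column."""
    low_q, high_q = quantile_range
    median = quantile(values, 0.5)
    low = quantile(values, low_q)
    high = quantile(values, high_q)
    return [(v - median) / (high - low) for v in values]


x2 = [13, 5, 14, 2, 8]
scaled_x2 = robust_scale(x2)
print(scaled_x2)  # [0.625, -0.375, 0.75, -0.75, 0.0]
```

For this column the median is 8, the first quartile is 5, and the third quartile is 13, so every value is shifted by 8 and divided by 8.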

Tip

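This scaler works well when your data contains many outliers, because the median and quantiles are far less sensitive to extreme values than the mean and standard deviation.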
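
The effect of an outlier can be illustrated with a small standard-library sketch that compares quantile-based scaling against mean and standard-deviation scaling. The functions `robust_scale` and `standard_scale` are hypothetical helpers written for this comparison, not Ray APIs, and the quartiles come from `statistics.quantiles` with the inclusive method, which is an assumption about how quantiles are interpolated.

```python
import statistics


def robust_scale(values):
    # Quartiles via linear interpolation (assumption, not Ray's implementation).
    low, median, high = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (high - low) for v in values]


def standard_scale(values):
    # Classic z-score scaling, shown only for comparison.
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


clean = [1, 2, 3, 4, 5]
dirty = [1, 2, 3, 4, 500]   # same data with one extreme value

robust_clean = robust_scale(clean)
robust_dirty = robust_scale(dirty)
standard_dirty = standard_scale(dirty)

# The four ordinary values keep their robust-scaled positions.
print(robust_dirty[:4])    # [-1.0, -0.5, 0.0, 0.5]
# Under z-score scaling the same four values collapse into a narrow band.
print(standard_dirty[:4])
```

The outlier inflates the mean and standard deviation, so z-score scaling squeezes the ordinary values together, while the median and quartiles of this sample are unchanged.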

Examples

>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import RobustScaler
>>>
>>> df = pd.DataFrame({
...     "X1": [1, 2, 3, 4, 5],
...     "X2": [13, 5, 14, 2, 8],
...     "X3": [1, 2, 2, 2, 3],
... })
>>> ds = ray.data.from_pandas(df)  
>>> ds.to_pandas()  
   X1  X2  X3
0   1  13   1
1   2   5   2
2   3  14   2
3   4   2   2
4   5   8   3

RobustScaler separately scales each column.

>>> preprocessor = RobustScaler(columns=["X1", "X2"])
>>> preprocessor.fit_transform(ds).to_pandas()  
    X1     X2  X3
0 -1.0  0.625   1
1 -0.5 -0.375   2
2  0.0  0.750   2
3  0.5 -0.750   2
4  1.0  0.000   3

Pass output_columns to keep the original columns and write the scaled values to new columns.

>>> preprocessor = RobustScaler(
...    columns=["X1", "X2"],
...    output_columns=["X1_scaled", "X2_scaled"]
... )
>>> preprocessor.fit_transform(ds).to_pandas()  
   X1  X2  X3  X1_scaled  X2_scaled
0   1  13   1       -1.0      0.625
1   2   5   2       -0.5     -0.375
2   3  14   2        0.0      0.750
3   4   2   2        0.5     -0.750
4   5   8   3        1.0      0.000

Parameters:

columns – The columns to separately scale.
quantile_range – A tuple that defines the lower and upper quantiles. Values must be between 0 and 1. Defaults to the 1st and 3rd quartiles: (0.25, 0.75).
output_columns – The names of the transformed columns. If None, the transformed columns keep the same names as the input columns. If not None, the length of output_columns must match the length of columns, otherwise an error is raised.

PublicAPI (alpha): This API is in alpha and may change before becoming stable.

Methods

`deserialize`	Load the original preprocessor serialized via `self.serialize()`.
`fit`	Fit this Preprocessor to the Dataset.
`fit_transform`	Fit this Preprocessor to the Dataset and then transform the Dataset.
`preferred_batch_format`	Batch format hint that lets upstream producers try to yield the best block format.
`serialize`	Return this preprocessor serialized as a string.
`transform`	Transform the given dataset.
`transform_batch`	Transform a single batch of data.
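
The split between `fit` and `transform` can be sketched with a toy single-column class. `TinyRobustScaler` is a hypothetical stand-in written for illustration, not Ray's implementation; it assumes the default quartile range and operates on plain Python lists rather than Datasets. The point it shows is that statistics are computed once during fitting and then reused on any later data.

```python
import statistics


class TinyRobustScaler:
    """Hypothetical single-column illustration of the fit/transform pattern."""

    def __init__(self):
        self.stats_ = None  # (low quartile, median, high quartile) after fit()

    def fit(self, values):
        # Compute and store the statistics from the fitting data only.
        self.stats_ = tuple(
            statistics.quantiles(values, n=4, method="inclusive")
        )
        return self

    def transform(self, values):
        if self.stats_ is None:
            raise RuntimeError("call fit() before transform()")
        low, median, high = self.stats_
        # Reuse the stored statistics; nothing is recomputed from `values`.
        return [(v - median) / (high - low) for v in values]

    def fit_transform(self, values):
        return self.fit(values).transform(values)


train = [13, 5, 14, 2, 8]
scaler = TinyRobustScaler().fit(train)

# New data is scaled with the median (8) and quartiles (5, 13) of `train`.
new_scaled = scaler.transform([8, 21])
print(new_scaled)  # [0.0, 1.625]
```

Fitting on one dataset and transforming another keeps the scaling consistent between, for example, training and inference data.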