ray.data.preprocessors.SimpleImputer

class ray.data.preprocessors.SimpleImputer(columns: List[str], strategy: str = 'mean', fill_value: str | Number | None = None, *, output_columns: List[str] | None = None)

Bases: Preprocessor

Replace missing values with imputed values. If the column is missing from a batch, it will be filled with the imputed value.

Examples

>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import SimpleImputer
>>> df = pd.DataFrame({"X": [0, None, 3, 3], "Y": [None, "b", "c", "c"]})
>>> ds = ray.data.from_pandas(df)  
>>> ds.to_pandas()  
     X     Y
0  0.0  None
1  NaN     b
2  3.0     c
3  3.0     c

The "mean" strategy imputes missing values with the mean of non-missing values. This strategy doesn’t work with categorical data.

>>> preprocessor = SimpleImputer(columns=["X"], strategy="mean")
>>> preprocessor.fit_transform(ds).to_pandas()  
     X     Y
0  0.0  None
1  2.0     b
2  3.0     c
3  3.0     c

The "most_frequent" strategy imputes missing values with the most frequent value in each column.

>>> preprocessor = SimpleImputer(columns=["X", "Y"], strategy="most_frequent")
>>> preprocessor.fit_transform(ds).to_pandas()  
     X  Y
0  0.0  c
1  3.0  b
2  3.0  c
3  3.0  c

The "constant" strategy imputes missing values with the value specified by fill_value.

>>> preprocessor = SimpleImputer(
...     columns=["Y"],
...     strategy="constant",
...     fill_value="?",
... )
>>> preprocessor.fit_transform(ds).to_pandas()  
     X  Y
0  0.0  ?
1  NaN  b
2  3.0  c
3  3.0  c

SimpleImputer can also be used in append mode: pass output_columns to name the columns that should hold the imputed values. The input columns are then left unchanged.

>>> preprocessor = SimpleImputer(columns=["X"], output_columns=["X_imputed"], strategy="mean")
>>> preprocessor.fit_transform(ds).to_pandas()  
     X     Y  X_imputed
0  0.0  None        0.0
1  NaN     b        2.0
2  3.0     c        3.0
3  3.0     c        3.0
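
The difference between in-place and append mode can be sketched without Ray. The snippet below is a minimal illustration in plain Python, not Ray's implementation: `impute_mean_columns` is a hypothetical helper, the table is a dict of lists, and `None` stands in for a missing value.

```python
from statistics import mean


def impute_mean_columns(table, columns, output_columns=None):
    """Mean-impute `columns` of a dict-of-lists table.

    Hypothetical helper for illustration only; not part of Ray.
    """
    if output_columns is None:
        # No output names given: overwrite the input columns.
        output_columns = columns
    if len(output_columns) != len(columns):
        raise ValueError("output_columns must have the same length as columns")

    # Copy every column so the caller's table is not modified.
    result = {name: list(values) for name, values in table.items()}
    for src, dst in zip(columns, output_columns):
        present = [v for v in table[src] if v is not None]
        fill = mean(present)
        result[dst] = [fill if v is None else v for v in table[src]]
    return result


table = {"X": [0, None, 3, 3], "Y": [None, "b", "c", "c"]}

# Append mode: "X" is untouched and "X_imputed" is added.
appended = impute_mean_columns(table, ["X"], output_columns=["X_imputed"])

# Default mode: "X" itself is replaced.
replaced = impute_mean_columns(table, ["X"])
```

In the appended result the original column still contains the missing entry, which is useful when downstream code needs to know which values were imputed.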

Parameters:

columns – The columns to apply imputation to.
strategy –
How imputed values are chosen.
- "mean": The mean of non-missing values. This strategy only works with numeric columns.
- "most_frequent": The most common value.
- "constant": The value passed to fill_value.
fill_value – The value to use when strategy is "constant".
output_columns – The names of the transformed columns. If None, the transformed columns keep the same names as the input columns. If not None, the length of output_columns must match the length of columns; otherwise, an error is raised.
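
To make the three strategy values concrete, here is a small sketch of what each one computes, written in plain Python rather than with Ray. The helpers `compute_fill_value` and `impute` are hypothetical names introduced for illustration, and `None` is assumed to mark a missing value.

```python
from collections import Counter
from statistics import mean


def compute_fill_value(values, strategy, fill_value=None):
    """Return the replacement value a strategy would use.

    Hypothetical helper for illustration only; not part of Ray.
    """
    present = [v for v in values if v is not None]
    if strategy == "mean":
        # Only meaningful for numeric data.
        return mean(present)
    if strategy == "most_frequent":
        return Counter(present).most_common(1)[0][0]
    if strategy == "constant":
        return fill_value
    raise ValueError(f"Unsupported strategy: {strategy!r}")


def impute(values, strategy, fill_value=None):
    """Replace every missing entry with the strategy's fill value."""
    fill = compute_fill_value(values, strategy, fill_value)
    return [fill if v is None else v for v in values]


# Same data as the X and Y columns in the examples above.
x = [0, None, 3, 3]
y = [None, "b", "c", "c"]

x_mean = impute(x, "mean")
x_mode = impute(x, "most_frequent")
y_mode = impute(y, "most_frequent")
y_const = impute(y, "constant", fill_value="?")
```

With this data, the mean of the non-missing X values is 2, the most frequent values are 3 and "c", and the constant strategy simply inserts "?", matching the outputs shown in the examples.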

Raises:

ValueError – if strategy is not "mean", "most_frequent", or "constant".

PublicAPI (alpha): This API is in alpha and may change before becoming stable.

Methods

`deserialize`	Load the original preprocessor serialized via `self.serialize()`.
`fit`	Fit this Preprocessor to the Dataset.
`fit_transform`	Fit this Preprocessor to the Dataset and then transform the Dataset.
`preferred_batch_format`	Batch format hint for upstream producers to try yielding best block format.
`serialize`	Return this preprocessor serialized as a string.
`transform`	Transform the given dataset.
`transform_batch`	Transform a single batch of data.
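
The split between `fit` and `transform_batch` can be illustrated with a toy class. This is a sketch under stated assumptions, not Ray's implementation: `TinyMeanImputer` is a hypothetical stand-in, rows are plain dicts, and `None` marks a missing value. The point it shows is that the fill value is learned once during fitting and then reused on any later batch.

```python
from statistics import mean


class TinyMeanImputer:
    """Toy stand-in for a fitted preprocessor; not Ray's implementation."""

    def __init__(self, column):
        self.column = column
        self.fill_ = None  # learned by fit()

    def fit(self, rows):
        # Compute the statistic from the non-missing training values.
        present = [r[self.column] for r in rows if r[self.column] is not None]
        self.fill_ = mean(present)
        return self

    def transform_batch(self, rows):
        if self.fill_ is None:
            raise RuntimeError("call fit() before transform_batch()")
        # Reuse the fitted statistic; the batch's own values are not averaged.
        return [
            {**r, self.column: self.fill_ if r[self.column] is None else r[self.column]}
            for r in rows
        ]


train = [{"X": 0}, {"X": None}, {"X": 3}, {"X": 3}]
new_batch = [{"X": None}, {"X": 10}]

imputer = TinyMeanImputer("X").fit(train)
imputed = imputer.transform_batch(new_batch)
```

Here the missing entry in `new_batch` receives the mean learned from `train`, not a value derived from the new batch.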