ray.data.preprocessors.SimpleImputer
class ray.data.preprocessors.SimpleImputer(columns: List[str], strategy: str = 'mean', fill_value: Optional[Union[str, numbers.Number]] = None)
Bases: ray.data.preprocessor.Preprocessor
Replace missing values with imputed values.
Examples
>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import SimpleImputer
>>> df = pd.DataFrame({"X": [0, None, 3, 3], "Y": [None, "b", "c", "c"]})
>>> ds = ray.data.from_pandas(df)
>>> ds.to_pandas()
     X     Y
0  0.0  None
1  NaN     b
2  3.0     c
3  3.0     c
The "mean" strategy imputes missing values with the mean of non-missing values. This strategy doesn’t work with categorical data.

>>> preprocessor = SimpleImputer(columns=["X"], strategy="mean")
>>> preprocessor.fit_transform(ds).to_pandas()
     X     Y
0  0.0  None
1  2.0     b
2  3.0     c
3  3.0     c
The "most_frequent" strategy imputes missing values with the most frequent value in each column.

>>> preprocessor = SimpleImputer(columns=["X", "Y"], strategy="most_frequent")
>>> preprocessor.fit_transform(ds).to_pandas()
     X  Y
0  0.0  c
1  3.0  b
2  3.0  c
3  3.0  c
The "constant" strategy imputes missing values with the value specified by fill_value.

>>> preprocessor = SimpleImputer(
...     columns=["Y"],
...     strategy="constant",
...     fill_value="?",
... )
>>> preprocessor.fit_transform(ds).to_pandas()
     X  Y
0  0.0  ?
1  NaN  b
2  3.0  c
3  3.0  c
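
The fill values in the examples above can be checked by hand. The following is a minimal sketch that uses only Python's standard `statistics` module, not Ray, and assumes the non-missing values are exactly those shown in the example columns. The variable names are illustrative.

```python
import statistics

# Non-missing values copied from the example columns X and Y.
x_present = [0.0, 3.0, 3.0]
y_present = ["b", "c", "c"]

mean_x = statistics.mean(x_present)  # value used for X under "mean"
mode_x = statistics.mode(x_present)  # value used for X under "most_frequent"
mode_y = statistics.mode(y_present)  # value used for Y under "most_frequent"

print(mean_x, mode_x, mode_y)  # prints: 2.0 3.0 c
```

These match the values that replace the missing entries in the "mean" and "most_frequent" outputs above.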
- Parameters
  - columns – The columns to apply imputation to.
  - strategy – How imputed values are chosen.
    - "mean": The mean of non-missing values. This strategy only works with numeric columns.
    - "most_frequent": The most common value.
    - "constant": The value passed to fill_value.
  - fill_value – The value to use when strategy is "constant".
- Raises
  - ValueError – if strategy is not "mean", "most_frequent", or "constant".
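
The strategy semantics and the `ValueError` described above can be sketched as a small pure-Python function. This is an illustration, not Ray's implementation: `impute_column`, `is_missing`, and `VALID_STRATEGIES` are hypothetical names, the function works on a plain list rather than a dataset, and the tie-breaking for "most_frequent" (first value encountered wins) is an assumption rather than documented behavior.

```python
import math
from collections import Counter

# Hypothetical constant mirroring the three documented strategies.
VALID_STRATEGIES = ("mean", "most_frequent", "constant")


def is_missing(value):
    """Treat None and float NaN as missing (an assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def impute_column(values, strategy="mean", fill_value=None):
    """Return a copy of `values` with missing entries replaced."""
    if strategy not in VALID_STRATEGIES:
        raise ValueError(
            f"Unsupported strategy {strategy!r}; expected one of {VALID_STRATEGIES}"
        )

    present = [v for v in values if not is_missing(v)]
    if strategy == "mean":
        # Numeric data only; assumes at least one non-missing value.
        replacement = sum(present) / len(present)
    elif strategy == "most_frequent":
        # Ties resolve to the first value encountered (sketch assumption).
        replacement = Counter(present).most_common(1)[0][0]
    else:  # "constant"
        replacement = fill_value

    return [replacement if is_missing(v) else v for v in values]


scores = [1.0, float("nan"), 4.0, None]
print(impute_column(scores, strategy="mean"))  # prints: [1.0, 2.5, 4.0, 2.5]

try:
    impute_column(scores, strategy="median")
except ValueError as err:
    print(f"rejected: {err}")
```

Validating the strategy before touching the data means an unsupported name fails immediately instead of after a pass over the column.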
PublicAPI (alpha): This API is in alpha and may change before becoming stable.