ray.data.preprocessors.SimpleImputer
- class ray.data.preprocessors.SimpleImputer(columns: List[str], strategy: str = 'mean', fill_value: str | Number | None = None)
Bases: Preprocessor
Replace missing values with imputed values. If the column is missing from a batch, it will be filled with the imputed value.
Examples
>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import SimpleImputer
>>> df = pd.DataFrame({"X": [0, None, 3, 3], "Y": [None, "b", "c", "c"]})
>>> ds = ray.data.from_pandas(df)
>>> ds.to_pandas()
     X     Y
0  0.0  None
1  NaN     b
2  3.0     c
3  3.0     c
The "mean" strategy imputes missing values with the mean of non-missing values. This strategy doesn’t work with categorical data.
>>> preprocessor = SimpleImputer(columns=["X"], strategy="mean")
>>> preprocessor.fit_transform(ds).to_pandas()
     X     Y
0  0.0  None
1  2.0     b
2  3.0     c
3  3.0     c
The "most_frequent" strategy imputes missing values with the most frequent value in each column.
>>> preprocessor = SimpleImputer(columns=["X", "Y"], strategy="most_frequent")
>>> preprocessor.fit_transform(ds).to_pandas()
     X  Y
0  0.0  c
1  3.0  b
2  3.0  c
3  3.0  c
The "constant" strategy imputes missing values with the value specified by fill_value.
>>> preprocessor = SimpleImputer(
...     columns=["Y"],
...     strategy="constant",
...     fill_value="?",
... )
>>> preprocessor.fit_transform(ds).to_pandas()
     X  Y
0  0.0  ?
1  NaN  b
2  3.0  c
3  3.0  c
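To make the three strategies concrete, the following is a minimal pure-Python sketch of how each fill value could be derived from the example columns above. It is not Ray's implementation: `compute_fill_value` and `impute` are hypothetical helpers, missing values are modeled as `None`, and ties in `"most_frequent"` are broken by first occurrence, which the documentation above does not specify.

```python
from collections import Counter
from statistics import mean


def compute_fill_value(values, strategy, fill_value=None):
    """Hypothetical helper: choose the value that replaces None entries."""
    present = [v for v in values if v is not None]
    if strategy == "mean":
        # Numeric columns only: average the non-missing entries.
        return mean(present)
    if strategy == "most_frequent":
        # Most common non-missing entry (ties: first seen, an assumption).
        return Counter(present).most_common(1)[0][0]
    if strategy == "constant":
        # The caller-supplied value is used as-is.
        return fill_value
    raise ValueError(f"unsupported strategy: {strategy!r}")


def impute(values, strategy, fill_value=None):
    """Return a copy of `values` with None replaced by the chosen fill value."""
    fill = compute_fill_value(values, strategy, fill_value)
    return [fill if v is None else v for v in values]


# Same data as the X and Y columns in the examples above.
x = [0.0, None, 3.0, 3.0]
y = [None, "b", "c", "c"]

x_mean = impute(x, "mean")
x_mode = impute(x, "most_frequent")
y_mode = impute(y, "most_frequent")
y_const = impute(y, "constant", fill_value="?")
```

Under these assumptions the results line up with the tables above: the missing X entry becomes 2.0 under `"mean"` and 3.0 under `"most_frequent"`, and the missing Y entry becomes "c" or "?".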
- Parameters:
columns – The columns to apply imputation to.
strategy – How imputed values are chosen.
  - "mean": The mean of non-missing values. This strategy only works with numeric columns.
  - "most_frequent": The most common value.
  - "constant": The value passed to fill_value.
fill_value – The value to use when strategy is "constant".
- Raises:
ValueError – if strategy is not "mean", "most_frequent", or "constant".
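The documented error condition can be pictured as a simple membership check on the strategy name. The sketch below is an illustration only: `check_strategy` and `VALID_STRATEGIES` are hypothetical names, and the exact message and the point at which Ray raises the error are not stated above.

```python
# The three strategy names listed under Parameters.
VALID_STRATEGIES = ("mean", "most_frequent", "constant")


def check_strategy(strategy):
    """Hypothetical validator mirroring the documented ValueError condition."""
    if strategy not in VALID_STRATEGIES:
        raise ValueError(
            f"Unsupported strategy {strategy!r}; expected one of {VALID_STRATEGIES}."
        )
    return strategy


# An unsupported name such as "median" is rejected.
try:
    check_strategy("median")
except ValueError as err:
    message = str(err)
```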
PublicAPI (alpha): This API is in alpha and may change before becoming stable.
Methods
deserialize – Load the original preprocessor serialized via self.serialize().
fit – Fit this Preprocessor to the Dataset.
fit_transform – Fit this Preprocessor to the Dataset and then transform the Dataset.
preferred_batch_format – Batch format hint for upstream producers to try yielding best block format.
serialize – Return this preprocessor serialized as a string.
transform – Transform the given dataset.
transform_batch – Transform a single batch of data.
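The split between fitting and transforming can be illustrated with a small stand-in class. This is a sketch under stated assumptions, not Ray's code: `TinyMeanImputer` is hypothetical, a batch is modeled as a list of row dictionaries rather than Ray's batch formats, and only the mean strategy is shown. It also illustrates the behavior noted at the top of this page, where a column absent from a batch is filled with the imputed value.

```python
from statistics import mean


class TinyMeanImputer:
    """Hypothetical stand-in illustrating the fit / transform split."""

    def __init__(self, columns):
        self.columns = list(columns)
        self.stats_ = {}

    def fit(self, rows):
        # Learn one fill value per column from the non-missing training entries.
        for col in self.columns:
            present = [r[col] for r in rows if r.get(col) is not None]
            self.stats_[col] = mean(present)
        return self

    def transform_batch(self, batch):
        # Apply the stored fill values; a column absent from a row is filled too.
        out = []
        for row in batch:
            new_row = dict(row)
            for col in self.columns:
                if new_row.get(col) is None:
                    new_row[col] = self.stats_[col]
            out.append(new_row)
        return out


train = [{"X": 0.0}, {"X": None}, {"X": 3.0}, {"X": 3.0}]
imputer = TinyMeanImputer(columns=["X"]).fit(train)

# The fill value comes from the training data, not from the new batch.
new_batch = [{"X": None, "Y": "a"}, {"Y": "b"}, {"X": 10.0, "Y": "c"}]
result = imputer.transform_batch(new_batch)
```

In this sketch the statistics are computed once during `fit` and reused for every later batch, so the value 10.0 in the new batch does not shift the fill value.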