Preprocessor#

Preprocessor Interface#

Constructor#

Preprocessor()

Implements an ML preprocessing operation.

Fit/Transform APIs#

fit(dataset)

Fit this Preprocessor to the Dataset.

fit_transform(dataset)

Fit this Preprocessor to the Dataset and then transform the Dataset.

transform(dataset)

Transform the given dataset.

transform_batch(data)

Transform a single batch of data.

transform_stats()

Return Dataset stats for the most recent transform call, if any.
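These methods follow the familiar stateful fit/transform contract: `fit` learns statistics from a dataset, `transform` applies them to data, and `fit_transform` chains the two. A minimal pure-Python schematic of that contract (a toy stand-in, not the Ray implementation):

```python
class ToyStandardizer:
    """Toy schematic of the fit/transform contract: learns a mean on
    fit(), subtracts it on transform(). Not the Ray implementation."""

    def __init__(self):
        self.mean_ = None  # state learned by fit()

    def fit(self, values):
        # "Fit this Preprocessor to the Dataset": learn the statistic.
        self.mean_ = sum(values) / len(values)
        return self  # fit() returns the fitted preprocessor

    def transform(self, values):
        # Apply the learned statistic to (possibly new) data.
        if self.mean_ is None:
            raise RuntimeError("fit() must be called before transform()")
        return [v - self.mean_ for v in values]

    def fit_transform(self, values):
        # Convenience: fit on the data, then transform that same data.
        return self.fit(values).transform(values)


centered = ToyStandardizer().fit_transform([1.0, 2.0, 3.0])
# centered == [-1.0, 0.0, 1.0]
```

Note that a fitted preprocessor can transform data it never saw during `fit`; that is what makes the train-time/serve-time split work.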

Generic Preprocessors#

BatchMapper(fn, batch_format[, batch_size])

Apply an arbitrary operation to a dataset.

Chain(*preprocessors)

Combine multiple preprocessors into a single Preprocessor.

Concatenator([output_column_name, include, ...])

Combine numeric columns into a column of type TensorDtype.

SimpleImputer(columns[, strategy, fill_value])

Replace missing values with imputed values.
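As a rough sketch of what a mean-strategy imputer computes over a single numeric column (hand-rolled plain Python, not the Ray `SimpleImputer` API):

```python
from statistics import mean

def impute_mean(column):
    """Replace None entries with the mean of the observed values.

    Sketch of a "mean" imputation strategy for one numeric column;
    the real preprocessor operates on Dataset columns and supports
    other strategies and fill values.
    """
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]

imputed = impute_mean([1.0, None, 3.0])
# imputed == [1.0, 2.0, 3.0]
```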

Categorical Encoders#

Categorizer(columns[, dtypes])

Convert columns to pd.CategoricalDtype.

LabelEncoder(label_column)

Encode labels as integer targets.

MultiHotEncoder(columns, *[, max_categories])

Multi-hot encode categorical data.

OneHotEncoder(columns, *[, max_categories])

One-hot encode categorical data.

OrdinalEncoder(columns, *[, encode_lists])

Encode values within columns as ordered integer values.
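To make the encoder family concrete, here is a hand-rolled sketch of ordinal and one-hot encoding for a single column (plain Python, not the Ray API; the real encoders learn their category sets during `fit`, and the category ordering here is an illustrative choice):

```python
def ordinal_encode(column):
    # Map each category to its index in the sorted set of unique values.
    categories = sorted(set(column))
    index = {cat: i for i, cat in enumerate(categories)}
    return [index[v] for v in column]

def one_hot_encode(column):
    # One 0/1 indicator per category, per row.
    categories = sorted(set(column))
    return [[1 if v == cat else 0 for cat in categories] for v in column]

colors = ["red", "green", "red", "blue"]
ordinal = ordinal_encode(colors)   # blue=0, green=1, red=2 -> [2, 1, 2, 0]
one_hot = one_hot_encode(colors)   # indicator columns ordered blue, green, red
```

Multi-hot encoding extends the one-hot case to rows that hold a *list* of categories, setting every matching indicator to 1.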

Feature Scalers#

MaxAbsScaler(columns)

Scale each column by its absolute max value.

MinMaxScaler(columns)

Scale each column by its range.

Normalizer(columns[, norm])

Scale each sample to have unit norm.

PowerTransformer(columns, power[, method])

Apply a power transform to make your data more normally distributed.

RobustScaler(columns[, quantile_range])

Scale and translate each column using quantiles.

StandardScaler(columns)

Translate and scale each column by its mean and standard deviation, respectively.
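The scalers differ mainly in which statistics they learn during `fit`. A sketch of the two most common transforms, standard scaling `(x - mean) / std` and min-max scaling `(x - min) / (max - min)`, in plain Python (not the Ray API; population std is an illustrative choice here):

```python
from statistics import mean, pstdev

def standard_scale(column):
    # (x - mean) / std: zero mean, unit variance (population std here).
    mu, sigma = mean(column), pstdev(column)
    return [(x - mu) / sigma for x in column]

def min_max_scale(column):
    # (x - min) / (max - min): rescale the column's range to [0, 1].
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]

scaled = min_max_scale([2.0, 4.0, 6.0])
# scaled == [0.0, 0.5, 1.0]
```

The robust variant swaps mean/std for quantile-based statistics so that outliers move the learned center and spread far less.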

K-Bins Discretizers#

CustomKBinsDiscretizer(columns, bins, *[, ...])

Bin values into discrete intervals using custom bin edges.

UniformKBinsDiscretizer(columns, bins, *[, ...])

Bin values into discrete intervals (bins) of uniform width.
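A sketch of the difference between the two discretizers for one column (plain Python, not the Ray API): the uniform variant derives equal-width edges from the fitted min/max, while the custom variant takes the bin edges verbatim:

```python
import bisect

def uniform_edges(column, bins):
    # Equal-width interior edges between the column's min and max.
    lo, hi = min(column), max(column)
    width = (hi - lo) / bins
    return [lo + width * i for i in range(1, bins)]

def discretize(column, edges):
    # Bin index = number of edges at or below the value.
    return [bisect.bisect_right(edges, x) for x in column]

values = [0.0, 2.5, 5.0, 9.9]
binned = discretize(values, uniform_edges(values, bins=2))
# one interior edge at 4.95 -> bins [0, 0, 1, 1]
```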

Image Preprocessors#

TorchVisionPreprocessor(columns, transform)

Apply a TorchVision transform to image columns.
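The idea is simply to map a user-supplied transform over every image in the named columns. A schematic of that per-column mapping in plain Python, with a trivial pixel-scaling function standing in for a real TorchVision transform (all names here are illustrative, not the Ray API):

```python
def apply_image_transform(batch, columns, transform):
    # Apply `transform` to every image in each listed column,
    # leaving other columns untouched (schematic only).
    return {
        name: [transform(img) for img in col] if name in columns else col
        for name, col in batch.items()
    }

def to_unit_range(img):
    # Stand-in for a TorchVision transform: scale 0-255 pixels to [0, 1].
    return [[px / 255.0 for px in row] for row in img]

batch = {"image": [[[0, 255]]], "label": [1]}
out = apply_image_transform(batch, {"image"}, to_unit_range)
# out["image"] == [[[0.0, 1.0]]]; out["label"] is unchanged
```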

Text Encoders#

CountVectorizer(columns[, tokenization_fn, ...])

Count the frequency of tokens in a column of strings.

FeatureHasher(columns, num_features)

Apply the hashing trick to a table that describes token frequencies.

HashingVectorizer(columns, num_features[, ...])

Count the frequency of tokens using the hashing trick.

Tokenizer(columns[, tokenization_fn])

Replace each string with a list of tokens.
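A sketch of the token counting and hashing these encoders build on, using whitespace tokenization as an illustrative `tokenization_fn` (plain Python, not the Ray API):

```python
from collections import Counter

def tokenize(text):
    # Illustrative tokenization_fn: lowercase, split on whitespace.
    return text.lower().split()

def count_tokens(texts):
    # Aggregate token frequencies across a column of strings,
    # roughly the statistic a count vectorizer learns during fit().
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts

def hash_bucket(token, num_features):
    # The hashing trick: map a token into a fixed-size index space
    # without storing a vocabulary (cf. HashingVectorizer).
    return hash(token) % num_features

counts = count_tokens(["the cat", "the hat"])
# counts == {"the": 2, "cat": 1, "hat": 1}
```

The hashing variants trade exactness for memory: collisions can merge two tokens into one feature, but no vocabulary has to be fit or stored.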