ray.data.preprocessor.Preprocessor

class ray.data.preprocessor.Preprocessor

Bases: ABC

Implements an ML preprocessing operation.

Preprocessors are stateful objects that can be fitted against a Dataset and used to transform both local data batches and distributed data. For example, a Normalization preprocessor may calculate the mean and standard deviation of a field during fitting and then use these attributes to implement its normalization transform.

Preprocessors can also be stateless and transform data without needing to be fitted. For example, a preprocessor may simply remove a column, which requires no fitted state.
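The difference between the two kinds of preprocessor can be sketched in plain Python. This is only an illustration of the idea, not Ray code: `SketchStandardizer` and `drop_column` are hypothetical names, and batches are modeled as lists of dicts rather than Ray Datasets.

```python
import statistics


class SketchStandardizer:
    """Stateful stand-in: learns mean and stdev in fit(), reuses them later."""

    def fit(self, rows, column):
        values = [row[column] for row in rows]
        self.column = column
        # State computed once from the fitting data.
        self.mean_ = statistics.fmean(values)
        self.std_ = statistics.pstdev(values)
        return self

    def transform(self, rows):
        # Fall back to 1.0 so a constant column does not divide by zero.
        scale = self.std_ or 1.0
        return [
            {**row, self.column: (row[self.column] - self.mean_) / scale}
            for row in rows
        ]


def drop_column(rows, column):
    """Stateless stand-in: nothing to learn, so there is no fit step."""
    return [{k: v for k, v in row.items() if k != column} for row in rows]


train = [{"x": 1.0, "id": 0}, {"x": 2.0, "id": 1}, {"x": 3.0, "id": 2}]
scaler = SketchStandardizer().fit(train, "x")

# The fitted statistics are applied to data that was not seen during fitting.
scaled = scaler.transform([{"x": 4.0, "id": 3}])
dropped = drop_column(train, "id")
```

The fitted object carries `mean_` and `std_` from the training rows to any later batch, while `drop_column` can be applied at any time because it depends only on its arguments.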

If you are implementing your own Preprocessor subclass, override the following:

  • _fit if your preprocessor is stateful. Otherwise, set _is_fittable=False.

  • _transform_pandas and/or _transform_numpy. For best performance, implement both. Otherwise, the data will be converted to match the implemented method.
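The override pattern described above can be sketched without Ray as follows. Everything here is an assumption made for illustration: `SketchPreprocessor` is a hypothetical stand-in base class, and row-oriented batches (lists of dicts) and column-oriented batches (dicts of lists) stand in for the pandas and NumPy formats. It is not how Ray implements the dispatch.

```python
def rows_to_columns(rows):
    return {key: [row[key] for row in rows] for key in rows[0]} if rows else {}


def columns_to_rows(columns):
    keys = list(columns)
    return [dict(zip(keys, values)) for values in zip(*columns.values())]


class SketchPreprocessor:
    """Hypothetical stand-in base class; not Ray's implementation."""

    _is_fittable = True

    def fit(self, rows):
        # Stateless subclasses set _is_fittable = False and skip _fit.
        if self._is_fittable:
            self._fit(rows)
        return self

    def transform_batch(self, batch):
        cls = type(self)
        has_rows = cls._transform_rows is not SketchPreprocessor._transform_rows
        has_cols = cls._transform_columns is not SketchPreprocessor._transform_columns
        if not (has_rows or has_cols):
            raise NotImplementedError("override at least one _transform_* method")
        is_columnar = isinstance(batch, dict)
        if is_columnar and has_cols:
            return self._transform_columns(batch)
        if not is_columnar and has_rows:
            return self._transform_rows(batch)
        # Only the other format is implemented: convert the batch to match it.
        if has_cols:
            return self._transform_columns(rows_to_columns(batch))
        return self._transform_rows(columns_to_rows(batch))

    def _fit(self, rows):
        raise NotImplementedError

    def _transform_rows(self, rows):
        raise NotImplementedError

    def _transform_columns(self, columns):
        raise NotImplementedError


class MaxScaler(SketchPreprocessor):
    """Stateful: overrides _fit and implements only the columnar transform."""

    def __init__(self, column):
        self.column = column

    def _fit(self, rows):
        self.max_ = max(row[self.column] for row in rows)

    def _transform_columns(self, columns):
        scaled = [value / self.max_ for value in columns[self.column]]
        return {**columns, self.column: scaled}


class DropColumn(SketchPreprocessor):
    """Stateless: no _fit, so _is_fittable is set to False."""

    _is_fittable = False

    def __init__(self, column):
        self.column = column

    def _transform_rows(self, rows):
        return [{k: v for k, v in row.items() if k != self.column} for row in rows]


scaler = MaxScaler("x").fit([{"x": 2.0}, {"x": 4.0}])
scaled = scaler.transform_batch([{"x": 1.0}, {"x": 4.0}])  # rows are converted to columns
dropped = DropColumn("id").transform_batch({"x": [1.0], "id": [7]})  # columns are converted to rows
```

In this sketch each subclass implements a single transform method, so every batch in the other format pays a conversion cost. Implementing both methods would avoid that conversion, which is the motivation for the performance advice above. Returning the result in the implemented method's format is a choice made for the sketch only.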

PublicAPI (beta): This API is in beta and may change before becoming stable.

Methods

__init__

deserialize

Load the original preprocessor serialized via self.serialize().

fit

Fit this Preprocessor to the Dataset.

fit_transform

Fit this Preprocessor to the Dataset and then transform the Dataset.

preferred_batch_format

Batch format hint for upstream producers to try yielding best block format.

serialize

Return this preprocessor serialized as a string.

transform

Transform the given dataset.

transform_batch

Transform a single batch of data.