Using Preprocessors#

Data preprocessing is a common technique for transforming raw data into features for a machine learning model. In general, you may want to apply the same preprocessing logic to your offline training data and online inference data.

This page covers preprocessors, a higher-level API built on top of existing Ray Data operations like map_batches, targeted at tabular and structured data use cases.

If you're working with tabular data, use Ray Data preprocessors. For unstructured data, the recommended approach is to use existing Ray Data operations instead of preprocessors.



The Preprocessor class has four public methods:

  1. fit(): Compute state information about a Dataset (for example, the mean or standard deviation of a column) and save it to the Preprocessor. This information is used to perform transform(), and the method is typically called on a training dataset.

  2. transform(): Apply a transformation to a Dataset. If the Preprocessor is stateful, then fit() must be called first. This method is typically called on training, validation, and test datasets.

  3. transform_batch(): Apply a transformation to a single batch of data. This method is typically called on online or offline inference data.

  4. fit_transform(): Syntactic sugar for calling both fit() and transform() on a Dataset.

To show these methods in action, walk through a basic example. First, set up two simple Ray Datasets.

import pandas as pd
import ray
from ray.data.preprocessors import MinMaxScaler
from ray.data.preprocessors import StandardScaler

# Generate two simple datasets.
dataset = ray.data.range(8)
dataset1, dataset2 = dataset.split(2)

print(dataset1.take())
# [{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}]

print(dataset2.take())
# [{'id': 4}, {'id': 5}, {'id': 6}, {'id': 7}]

Next, fit the Preprocessor on one Dataset, and then transform both Datasets with this fitted information.

# Fit the preprocessor on dataset1, and transform both dataset1 and dataset2.
preprocessor = MinMaxScaler(["id"])

dataset1_transformed = preprocessor.fit_transform(dataset1)
print(dataset1_transformed.take())
# [{'id': 0.0}, {'id': 0.3333333333333333}, {'id': 0.6666666666666666}, {'id': 1.0}]

dataset2_transformed = preprocessor.transform(dataset2)
print(dataset2_transformed.take())
# [{'id': 1.3333333333333333}, {'id': 1.6666666666666667}, {'id': 2.0}, {'id': 2.3333333333333335}]

Finally, call transform_batch on a single batch of data.

batch = pd.DataFrame({"id": list(range(8, 12))})
batch_transformed = preprocessor.transform_batch(batch)
print(batch_transformed)
#          id
# 0  2.666667
# 1  3.000000
# 2  3.333333
# 3  3.666667

The most common way to use a preprocessor is to apply it to a Ray Data dataset, which is then passed to a Ray Train Trainer. See also:

  • Ray Train’s data preprocessing and ingest section for PyTorch

  • Ray Train’s data preprocessing and ingest section for LightGBM/XGBoost

Types of preprocessors#

Built-in preprocessors#

Ray Data provides a handful of preprocessors out of the box.

Generic preprocessors

| Preprocessor | Description |
| --- | --- |
| Concatenator | Combine numeric columns into a column of type TensorDtype. |
| Preprocessor | Implements an ML preprocessing operation. |
| SimpleImputer | Replace missing values with imputed values. |

Categorical encoders

| Preprocessor | Description |
| --- | --- |
| Categorizer | Convert columns to pd.CategoricalDtype. |
| LabelEncoder | Encode labels as integer targets. |
| MultiHotEncoder | Multi-hot encode categorical data. |
| OneHotEncoder | One-hot encode categorical data. |
| OrdinalEncoder | Encode values within columns as ordered integer values. |

Feature scalers

| Preprocessor | Description |
| --- | --- |
| MaxAbsScaler | Scale each column by its absolute max value. |
| MinMaxScaler | Scale each column by its range. |
| Normalizer | Scale each sample to have unit norm. |
| PowerTransformer | Apply a power transform to make your data more normally distributed. |
| RobustScaler | Scale and translate each column using quantiles. |
| StandardScaler | Translate and scale each column by its mean and standard deviation, respectively. |

Utilities

| Method | Description |
| --- | --- |
| train_test_split | Materialize and split the dataset into train and test subsets. |

Which preprocessor should you use?#

The type of preprocessor you use depends on what your data looks like. This section provides tips on handling common data formats.

Categorical data#

Most models expect numerical inputs. To represent your categorical data in a way your model can understand, encode categories using one of the preprocessors described below.

| Categorical Data Type | Example | Preprocessor |
| --- | --- | --- |
| Labels | "cat", "dog", "airplane" | LabelEncoder |
| Ordered categories | "bs", "md", "phd" | OrdinalEncoder |
| Unordered categories | "red", "green", "blue" | OneHotEncoder |
| Lists of categories | ("sci-fi", "action"), ("action", "comedy", "animated") | MultiHotEncoder |



If you’re using LightGBM, you don’t need to encode your categorical data. Instead, use Categorizer to convert your data to pandas.CategoricalDtype.

Numerical data#

To ensure your model behaves properly, normalize your numerical data. Reference the table below to determine which preprocessor to use.

| Data Property | Preprocessor |
| --- | --- |
| Your data is approximately normal | StandardScaler |
| Your data is sparse | MaxAbsScaler |
| Your data contains many outliers | RobustScaler |
| Your data isn't normal, but you need it to be | PowerTransformer |
| You need unit-norm rows | Normalizer |
| You aren't sure what your data looks like | MinMaxScaler |

These preprocessors operate on numeric columns. If your dataset contains columns of type TensorDtype, you may need to implement a custom preprocessor.

Additionally, if your model expects a tensor or ndarray, create a tensor using Concatenator.


Built-in feature scalers like StandardScaler don’t work on TensorDtype columns, so apply Concatenator after feature scaling.

import ray
from ray.data.preprocessors import Concatenator, StandardScaler

# Generate a simple dataset.
dataset = ray.data.from_items([{"X": 1.0, "Y": 2.0}, {"X": 4.0, "Y": 0.0}])
print(dataset.take())
# [{'X': 1.0, 'Y': 2.0}, {'X': 4.0, 'Y': 0.0}]

scaler = StandardScaler(columns=["X", "Y"])
concatenator = Concatenator(columns=["X", "Y"])
dataset_transformed = scaler.fit_transform(dataset)
dataset_transformed = concatenator.fit_transform(dataset_transformed)
print(dataset_transformed.take())
# [{'concat_out': array([-1.,  1.])}, {'concat_out': array([ 1., -1.])}]

Filling in missing values#

If your dataset contains missing values, replace them with SimpleImputer.

import ray
from ray.data.preprocessors import SimpleImputer

# Generate a simple dataset.
dataset = ray.data.from_items([{"id": 1.0}, {"id": None}, {"id": 3.0}])
print(dataset.take())
# [{'id': 1.0}, {'id': None}, {'id': 3.0}]

imputer = SimpleImputer(columns=["id"], strategy="mean")
dataset_transformed = imputer.fit_transform(dataset)
print(dataset_transformed.take())
# [{'id': 1.0}, {'id': 2.0}, {'id': 3.0}]

Chaining preprocessors#

If you need to apply more than one preprocessor, apply them in sequence on your dataset.

import ray
from ray.data.preprocessors import MinMaxScaler, SimpleImputer

# Generate one simple dataset.
dataset = ray.data.from_items(
    [{"id": 0}, {"id": 1}, {"id": 2}, {"id": 3}, {"id": None}]
)
print(dataset.take())
# [{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': None}]

preprocessor_1 = SimpleImputer(["id"])
preprocessor_2 = MinMaxScaler(["id"])

# Apply both preprocessors in sequence on the dataset.
dataset_transformed = preprocessor_1.fit_transform(dataset)
dataset_transformed = preprocessor_2.fit_transform(dataset_transformed)

print(dataset_transformed.take())

Implementing custom preprocessors#

If you want to implement a custom preprocessor that needs to be fit, extend the Preprocessor base class.

import ray
from pandas import DataFrame
from ray.data.preprocessor import Preprocessor
from ray.data import Dataset
from ray.data.aggregate import Max

class CustomPreprocessor(Preprocessor):
    def _fit(self, dataset: Dataset) -> Preprocessor:
        # Compute and store state on the preprocessor, then return it.
        self.stats_ = dataset.aggregate(Max("id"))
        return self

    def _transform_pandas(self, df: DataFrame) -> DataFrame:
        return df * self.stats_["max(id)"]

# Generate a simple dataset.
dataset = ray.data.range(4)
print(dataset.take())
# [{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}]

# Create a stateful preprocessor that finds the max id and scales each id by it.
preprocessor = CustomPreprocessor()
dataset_transformed = preprocessor.fit_transform(dataset)
print(dataset_transformed.take())
# [{'id': 0}, {'id': 3}, {'id': 6}, {'id': 9}]

If your preprocessor doesn’t need to be fit, use map_batches() to directly transform your dataset. For more details, see Transforming Data.