
class ray.data.preprocessors.Concatenator(output_column_name: str = 'concat_out', include: List[str] | None = None, exclude: str | List[str] | None = None, dtype: numpy.dtype | None = None, raise_if_missing: bool = False)

Bases: Preprocessor

Combine numeric columns into a column of type TensorDtype.

This preprocessor concatenates numeric columns and stores the result in a new column. The new column contains TensorArrayElement objects of shape \((m,)\), where \(m\) is the number of columns concatenated. The \(m\) concatenated columns are dropped after concatenation.
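The row-wise behavior described above can be sketched in plain Python. This is only an illustration of the idea, not Ray's implementation: `concat_columns` is a hypothetical helper, and the dict-of-lists table stands in for a real dataset.

```python
# Illustrative only: a plain-Python stand-in for the behavior described above,
# not Ray's implementation. `concat_columns` is a hypothetical helper.
from typing import Dict, List


def concat_columns(
    table: Dict[str, list],
    columns: List[str],
    output_column_name: str = "concat_out",
) -> Dict[str, list]:
    """Merge `columns` row by row into one vector column and drop the sources."""
    # Columns that are not concatenated are carried over unchanged.
    result = {name: values for name, values in table.items() if name not in columns}
    # Each output row is a vector of length m = len(columns).
    result[output_column_name] = [
        [float(v) for v in row] for row in zip(*(table[name] for name in columns))
    ]
    return result


table = {"X0": [0, 3, 1], "X1": [0.5, 0.2, 0.9], "Y": ["blue", "orange", "blue"]}
out = concat_columns(table, ["X0", "X1"])
print(out["concat_out"])  # [[0.0, 0.5], [3.0, 0.2], [1.0, 0.9]]
print(list(out))          # ['Y', 'concat_out']
```

Here m = 2, so every entry of the new column has length 2, and the source columns X0 and X1 no longer appear in the result.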


>>> import numpy as np
>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import Concatenator

Concatenator combines numeric columns into a column of TensorDtype.

>>> df = pd.DataFrame({"X0": [0, 3, 1], "X1": [0.5, 0.2, 0.9]})
>>> ds = ray.data.from_pandas(df)  
>>> concatenator = Concatenator()
>>> concatenator.fit_transform(ds).to_pandas()  
   concat_out
0  [0.0, 0.5]
1  [3.0, 0.2]
2  [1.0, 0.9]

By default, the created column is called "concat_out", but you can specify a different name.

>>> concatenator = Concatenator(output_column_name="tensor")
>>> concatenator.fit_transform(ds).to_pandas()  
       tensor
0  [0.0, 0.5]
1  [3.0, 0.2]
2  [1.0, 0.9]

Sometimes, you might not want to concatenate all of the columns in your dataset. In this case, you can exclude columns with the exclude parameter.

>>> df = pd.DataFrame({"X0": [0, 3, 1], "X1": [0.5, 0.2, 0.9], "Y": ["blue", "orange", "blue"]})
>>> ds = ray.data.from_pandas(df)  
>>> concatenator = Concatenator(exclude=["Y"])
>>> concatenator.fit_transform(ds).to_pandas()  
        Y  concat_out
0    blue  [0.0, 0.5]
1  orange  [3.0, 0.2]
2    blue  [1.0, 0.9]

Alternatively, you can specify which columns to concatenate with the include parameter.

>>> concatenator = Concatenator(include=["X0", "X1"])
>>> concatenator.fit_transform(ds).to_pandas()  
        Y  concat_out
0    blue  [0.0, 0.5]
1  orange  [3.0, 0.2]
2    blue  [1.0, 0.9]

Note that if a column is in both include and exclude, the column is excluded.

>>> concatenator = Concatenator(include=["X0", "X1", "Y"], exclude=["Y"])
>>> concatenator.fit_transform(ds).to_pandas()  
        Y  concat_out
0    blue  [0.0, 0.5]
1  orange  [3.0, 0.2]
2    blue  [1.0, 0.9]
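The precedence rule between include and exclude can be written down as a small selection function. This is a minimal sketch under stated assumptions, not the library's code: `resolve_columns` is a hypothetical name, and every requested column is assumed to exist in the dataset.

```python
# Hypothetical helper illustrating "exclude wins over include".
# Not part of Ray; assumes every named column exists.
from typing import List, Optional, Union


def resolve_columns(
    all_columns: List[str],
    include: Optional[List[str]] = None,
    exclude: Optional[Union[str, List[str]]] = None,
) -> List[str]:
    """Return the columns that would be concatenated, in order."""
    # A bare string is treated as a single column name.
    if isinstance(exclude, str):
        exclude = [exclude]
    excluded = set(exclude or [])
    # With no include list, every column is a candidate.
    candidates = all_columns if include is None else include
    # A column named in both lists is dropped, because exclusion is applied last.
    return [c for c in candidates if c not in excluded]


columns = ["X0", "X1", "Y"]
print(resolve_columns(columns, include=["X0", "X1", "Y"], exclude=["Y"]))  # ['X0', 'X1']
print(resolve_columns(columns, exclude="Y"))                               # ['X0', 'X1']
```

Applying the exclusion as the final filter is what makes the three examples above produce the same result.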

By default, the concatenated tensor has a dtype common to the input columns. However, you can also explicitly set the dtype with the dtype parameter.

>>> concatenator = Concatenator(include=["X0", "X1"], dtype=np.float32)
>>> concatenator.fit_transform(ds)  
Dataset(num_rows=3, schema={Y: object, concat_out: TensorDtype(shape=(2,), dtype=float32)})
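To see why mixing an integer column with a float column yields float values, it can help to look at NumPy's own type promotion. The sketch below assumes that the "common dtype" follows NumPy-style promotion; the source does not say which rules Concatenator applies internally, so treat this as an analogy rather than a description of the implementation.

```python
import numpy as np

# The same values as X0 and X1 in the examples above.
x0 = np.array([0, 3, 1], dtype=np.int64)
x1 = np.array([0.5, 0.2, 0.9], dtype=np.float64)

# NumPy's promotion rules pick a dtype that can represent both inputs.
common = np.result_type(x0.dtype, x1.dtype)
print(common)  # float64

# Stacking the columns side by side gives one row vector per record.
stacked = np.column_stack([x0, x1])
print(stacked.dtype, stacked.shape)  # float64 (3, 2)

# An explicit dtype overrides the promoted one (assumed analogous to dtype=np.float32).
as_float32 = stacked.astype(np.float32)
print(as_float32.dtype)  # float32
```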

Parameters:

  • output_column_name – The desired name for the new column. Defaults to "concat_out".

  • include – A list of columns to concatenate. If None, all columns are concatenated.

  • exclude – A column or list of columns to exclude from concatenation. If a column is in both include and exclude, the column is excluded from concatenation.

  • dtype – The dtype to convert the output tensors to. If unspecified, the dtype is determined by standard coercion rules.

  • raise_if_missing – If True, an error is raised if any of the columns in include or exclude don’t exist. Defaults to False.


Raises:

  • ValueError – If raise_if_missing is True and a column in include or exclude doesn’t exist in the dataset.
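The raise_if_missing check can be pictured as a simple membership test over the requested column names. The following is a hedged sketch only: `check_columns` is a hypothetical helper, and the error message is invented for illustration. The source states that a ValueError is raised but not what happens to missing columns when raise_if_missing is False, so the sketch merely reports them.

```python
# Hypothetical validation helper; not Ray's implementation.
from typing import List, Optional, Union


def check_columns(
    dataset_columns: List[str],
    include: Optional[List[str]] = None,
    exclude: Optional[Union[str, List[str]]] = None,
    raise_if_missing: bool = False,
) -> List[str]:
    """Return requested columns that are absent, raising if asked to."""
    # A bare string is treated as a single column name.
    if isinstance(exclude, str):
        exclude = [exclude]
    requested = list(include or []) + list(exclude or [])
    missing = [c for c in requested if c not in dataset_columns]
    if missing and raise_if_missing:
        # Message text is illustrative, not the library's wording.
        raise ValueError(f"Missing columns: {missing}")
    return missing


print(check_columns(["X0", "X1"], include=["X0", "Z"]))  # ['Z']
```

With raise_if_missing=True, the same call raises ValueError instead of returning the list.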

PublicAPI (alpha): This API is in alpha and may change before becoming stable.



Methods:

  • deserialize – Load the original preprocessor serialized via self.serialize().

  • fit – Fit this Preprocessor to the Dataset.

  • fit_transform – Fit this Preprocessor to the Dataset and then transform the Dataset.

  • preferred_batch_format – Batch format hint for upstream producers to try yielding best block format.

  • serialize – Return this preprocessor serialized as a string.

  • transform – Transform the given dataset.

  • transform_batch – Transform a single batch of data.