ray.data.preprocessors.Concatenator

class ray.data.preprocessors.Concatenator(output_column_name: str = 'concat_out', include: List[str] | None = None, exclude: str | List[str] | None = None, dtype: numpy.dtype | None = None, raise_if_missing: bool = False)

Bases: Preprocessor

Combine numeric columns into a column of type TensorDtype.

This preprocessor concatenates numeric columns and stores the result in a new column. The new column contains TensorArrayElement objects of shape \((m,)\), where \(m\) is the number of columns concatenated. The \(m\) concatenated columns are dropped after concatenation.
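As a rough illustration of the shape described above, the following sketch uses plain NumPy rather than Ray. It is not how the preprocessor is implemented; the variable names are made up for this example, and it only shows that stacking \(m\) numeric columns gives one length-\(m\) vector per row.

```python
import numpy as np

# Two numeric columns with three rows each (illustrative values).
x0 = np.array([0, 3, 1])
x1 = np.array([0.5, 0.2, 0.9])

# Stack the m = 2 columns side by side: one vector of shape (m,) per row.
rows = np.column_stack([x0, x1])

print(rows.shape)     # (3, 2): three rows, each of shape (2,)
print(rows[1].shape)  # (2,)
```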

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import Concatenator

Concatenator combines numeric columns into a column of TensorDtype.

>>> df = pd.DataFrame({"X0": [0, 3, 1], "X1": [0.5, 0.2, 0.9]})
>>> ds = ray.data.from_pandas(df)  
>>> concatenator = Concatenator()
>>> concatenator.fit_transform(ds).to_pandas()  
   concat_out
0  [0.0, 0.5]
1  [3.0, 0.2]
2  [1.0, 0.9]

By default, the created column is called "concat_out", but you can specify a different name.

>>> concatenator = Concatenator(output_column_name="tensor")
>>> concatenator.fit_transform(ds).to_pandas()  
       tensor
0  [0.0, 0.5]
1  [3.0, 0.2]
2  [1.0, 0.9]

Sometimes, you might not want to concatenate all of the columns in your dataset. In this case, you can exclude columns with the exclude parameter.

>>> df = pd.DataFrame({"X0": [0, 3, 1], "X1": [0.5, 0.2, 0.9], "Y": ["blue", "orange", "blue"]})
>>> ds = ray.data.from_pandas(df)  
>>> concatenator = Concatenator(exclude=["Y"])
>>> concatenator.fit_transform(ds).to_pandas()  
        Y  concat_out
0    blue  [0.0, 0.5]
1  orange  [3.0, 0.2]
2    blue  [1.0, 0.9]

Alternatively, you can specify which columns to concatenate with the include parameter.

>>> concatenator = Concatenator(include=["X0", "X1"])
>>> concatenator.fit_transform(ds).to_pandas()  
        Y  concat_out
0    blue  [0.0, 0.5]
1  orange  [3.0, 0.2]
2    blue  [1.0, 0.9]

Note that if a column is in both include and exclude, the column is excluded.

>>> concatenator = Concatenator(include=["X0", "X1", "Y"], exclude=["Y"])
>>> concatenator.fit_transform(ds).to_pandas()  
        Y  concat_out
0    blue  [0.0, 0.5]
1  orange  [3.0, 0.2]
2    blue  [1.0, 0.9]
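The precedence between include and exclude can be sketched in plain Python. The helper below, select_columns, is hypothetical and is not part of Ray; it only mirrors the documented rule that a column listed in both parameters is excluded, and it assumes that include=None means "all columns".

```python
def select_columns(all_columns, include=None, exclude=None):
    """Hypothetical helper: return the columns that would be concatenated."""
    # exclude may be a single column name or a list of names.
    if isinstance(exclude, str):
        exclude = [exclude]
    excluded = set(exclude or [])
    # With no include list, every column is a candidate.
    candidates = all_columns if include is None else include
    # exclude wins whenever a column appears in both lists.
    return [c for c in candidates if c not in excluded]


columns = ["X0", "X1", "Y"]
print(select_columns(columns, exclude="Y"))
print(select_columns(columns, include=["X0", "X1", "Y"], exclude=["Y"]))
```

Both calls print ['X0', 'X1'], matching the examples above, where Y is left untouched as a separate column.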

By default, the concatenated tensor has a dtype common to the input columns. However, you can also explicitly set the dtype with the dtype parameter.

>>> concatenator = Concatenator(include=["X0", "X1"], dtype=np.float32)
>>> concatenator.fit_transform(ds)  
Dataset(num_rows=3, schema={Y: object, concat_out: TensorDtype(shape=(2,), dtype=float32)})
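To see what a "common" dtype can look like, here is a small NumPy-only sketch. It assumes that the coercion in question follows NumPy's usual type promotion, which this page does not state explicitly, so treat it as an illustration rather than a description of Ray's internals.

```python
import numpy as np

# An integer column and a float column promote to float64 under NumPy's rules.
common = np.result_type(np.int64, np.float64)
print(common)  # float64

# Stacking the columns applies the same promotion to the values.
values = np.column_stack([np.array([0, 3, 1]), np.array([0.5, 0.2, 0.9])])
print(values.dtype)  # float64

# Casting explicitly is analogous to passing dtype=np.float32.
as_float32 = values.astype(np.float32)
print(as_float32.dtype)  # float32
```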

Parameters:

  • output_column_name – The desired name for the new column. Defaults to "concat_out".

  • include – A list of columns to concatenate. If None, all columns are concatenated.

  • exclude – A column name or a list of columns to exclude from concatenation. If a column is in both include and exclude, the column is excluded from concatenation.

  • dtype – The dtype to convert the output tensors to. If unspecified, the dtype is determined by standard coercion rules.

  • raise_if_missing – If True, an error is raised if any of the columns in include or exclude don’t exist. Defaults to False.

Raises:

ValueError – if raise_if_missing is True and a column in include or exclude doesn’t exist in the dataset.
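The check described here can be approximated with a short, self-contained sketch. The function check_columns and its error message are hypothetical and are not taken from Ray; the sketch only shows the documented contract, namely that a missing column raises ValueError when raise_if_missing is True and is tolerated otherwise.

```python
def check_columns(dataset_columns, include=None, exclude=None, raise_if_missing=False):
    """Hypothetical check: report include/exclude names absent from the dataset."""
    if isinstance(exclude, str):
        exclude = [exclude]
    requested = list(include or []) + list(exclude or [])
    missing = [c for c in requested if c not in dataset_columns]
    if missing and raise_if_missing:
        # Error text is illustrative only.
        raise ValueError(f"Missing columns: {missing}")
    return missing


# With the default raise_if_missing=False, the missing name is only reported.
print(check_columns(["X0", "X1"], include=["X0", "Z"]))

# With raise_if_missing=True, the same input raises ValueError.
try:
    check_columns(["X0", "X1"], include=["X0", "Z"], raise_if_missing=True)
except ValueError as err:
    print(err)
```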

PublicAPI (alpha): This API is in alpha and may change before becoming stable.

Methods

deserialize

Load the original preprocessor serialized via self.serialize().

fit

Fit this Preprocessor to the Dataset.

fit_transform

Fit this Preprocessor to the Dataset and then transform the Dataset.

preferred_batch_format

Batch format hint that lets upstream producers try to yield the best block format.

serialize

Return this preprocessor serialized as a string.

transform

Transform the given dataset.

transform_batch

Transform a single batch of data.