ray.data.preprocessors.Concatenator

class ray.data.preprocessors.Concatenator(columns: List[str], output_column_name: str = 'concat_out', dtype: numpy.dtype | None = None, raise_if_missing: bool = False)

Bases: Preprocessor

Combine numeric columns into a column of type TensorDtype. Only columns specified in columns will be concatenated.

This preprocessor concatenates numeric columns and stores the result in a new column. The new column contains TensorArrayElement objects of shape \((m,)\), where \(m\) is the number of columns concatenated. The \(m\) concatenated columns are dropped after concatenation. The preprocessor preserves the order of the columns provided in the columns argument and uses that order when calling transform() and transform_batch().
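
The shape, ordering, and column-dropping rules described above can be illustrated with plain NumPy. This is only a sketch of the documented semantics, not Ray's implementation: the helper concat_columns and the dict-of-arrays batch layout are hypothetical and exist only for illustration.

```python
import numpy as np


def concat_columns(batch, columns, output_column_name="concat_out"):
    """Hypothetical helper that mimics the documented behavior on a dict of 1-D arrays."""
    # Stack the selected columns side by side, in the order given by `columns`.
    # For n rows and m columns this yields an array of shape (n, m),
    # so each row is a tensor of shape (m,).
    stacked = np.stack([np.asarray(batch[c]) for c in columns], axis=1)
    # Drop the concatenated columns and keep every other column untouched.
    out = {name: values for name, values in batch.items() if name not in columns}
    out[output_column_name] = stacked
    return out


batch = {
    "X0": np.array([0, 3, 1]),
    "X1": np.array([0.5, 0.2, 0.9]),
    "Y": np.array(["a", "b", "c"]),
}

# The order in `columns` decides the order inside each output row.
result = concat_columns(batch, columns=["X1", "X0"])
print(result["concat_out"].shape)
print(sorted(result))
```

Here the requested order is ["X1", "X0"], so the first element of each output row comes from X1, and the untouched column Y is carried through alongside the new column.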

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import Concatenator

Concatenator combines numeric columns into a column of TensorDtype.

>>> df = pd.DataFrame({"X0": [0, 3, 1], "X1": [0.5, 0.2, 0.9]})
>>> ds = ray.data.from_pandas(df)  
>>> concatenator = Concatenator(columns=["X0", "X1"])
>>> concatenator.transform(ds).to_pandas()  
   concat_out
0  [0.0, 0.5]
1  [3.0, 0.2]
2  [1.0, 0.9]

By default, the created column is called "concat_out", but you can specify a different name.

>>> concatenator = Concatenator(columns=["X0", "X1"], output_column_name="tensor")
>>> concatenator.transform(ds).to_pandas()  
       tensor
0  [0.0, 0.5]
1  [3.0, 0.2]
2  [1.0, 0.9]

To control the dtype of the output tensors, pass the dtype argument.

>>> concatenator = Concatenator(columns=["X0", "X1"], dtype=np.float32)
>>> concatenator.transform(ds)
Dataset(num_rows=3, schema={concat_out: TensorDtype(shape=(2,), dtype=float32)})

Parameters:
  • columns – A list of columns to concatenate. The provided order of the columns will be retained during concatenation.

  • output_column_name – The desired name for the new column. Defaults to "concat_out".

  • dtype – The dtype to convert the output tensors to. If unspecified, the dtype is determined by standard coercion rules.

  • raise_if_missing – If True, an error is raised if any of the columns in columns don’t exist. Defaults to False.

Raises:

ValueError – if raise_if_missing is True and a column in columns doesn’t exist in the dataset.
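
The dtype and raise_if_missing parameters can be sketched with NumPy alone. This sketch assumes that the "standard coercion rules" behave like NumPy type promotion, which this page does not spell out, and the helper check_columns is hypothetical; it only mirrors the documented ValueError and is not Ray's code.

```python
import numpy as np

# With no explicit dtype, mixing int64 and float64 promotes to float64
# under NumPy's type promotion rules (an assumption about the default behavior).
default_dtype = np.result_type(np.int64, np.float64)

# Passing an explicit dtype corresponds to casting the combined array.
x0 = np.array([0, 3, 1])
x1 = np.array([0.5, 0.2, 0.9])
as_float32 = np.stack([x0, x1], axis=1).astype(np.float32)


def check_columns(available, requested):
    """Hypothetical check analogous to raise_if_missing=True."""
    missing = [name for name in requested if name not in available]
    if missing:
        raise ValueError(f"Missing columns: {missing}")


check_columns(available=["X0", "X1"], requested=["X0", "X1"])  # passes silently

try:
    check_columns(available=["X0", "X1"], requested=["X0", "X2"])
except ValueError as err:
    error_message = str(err)
```

With raise_if_missing left at its default of False, no such error is raised for a missing column.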

PublicAPI (alpha): This API is in alpha and may change before becoming stable.

Methods

deserialize

Load the original preprocessor serialized via self.serialize().

fit

Fit this Preprocessor to the Dataset.

fit_transform

Fit this Preprocessor to the Dataset and then transform the Dataset.

preferred_batch_format

Batch format hint for upstream producers to try yielding best block format.

serialize

Return this preprocessor serialized as a string.

transform

Transform the given dataset.

transform_batch

Transform a single batch of data.