ray.data.preprocessors.Concatenator

class ray.data.preprocessors.Concatenator(columns: List[str], output_column_name: str = 'concat_out', dtype: numpy.dtype | None = None, raise_if_missing: bool = False, flatten: bool = False)

Bases: SerializablePreprocessorBase

Combine numeric columns into a column of type TensorDtype. Only columns specified in columns will be concatenated.

This preprocessor concatenates numeric columns and stores the result in a new column. The new column contains TensorArrayElement objects of shape \((m,)\), where \(m\) is the number of columns concatenated. The \(m\) concatenated columns are dropped after concatenation. The preprocessor preserves the order of the columns given in the columns argument and uses that order in both transform() and transform_batch().
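
As a rough illustration of the behavior described above, the following NumPy-only sketch stacks the selected columns of a batch in the given order and drops the originals. It is not Ray's implementation: the helper `concat_columns` and the dict-of-arrays batch layout are assumptions made for this example.

```python
import numpy as np


def concat_columns(batch, columns, output_column_name="concat_out"):
    """Hypothetical helper: stack 1-D numeric columns into one (n, m) array."""
    # Stack in the order given by `columns`, so each row becomes a vector of shape (m,).
    stacked = np.column_stack([batch[name] for name in columns])
    # Keep every column that was not concatenated, then add the new column.
    out = {name: values for name, values in batch.items() if name not in columns}
    out[output_column_name] = stacked
    return out


batch = {
    "X0": np.array([0, 3, 1]),
    "X1": np.array([0.5, 0.2, 0.9]),
    "Y": np.array(["a", "b", "c"]),
}
result = concat_columns(batch, ["X0", "X1"])
print(list(result))           # remaining column names
print(result["concat_out"])   # one length-2 vector per row
```

Passing `["X1", "X0"]` instead would swap the two entries of each row vector, which mirrors the order-preserving behavior described above.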

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import Concatenator

Concatenator combines numeric columns into a column of TensorDtype.

>>> df = pd.DataFrame({"X0": [0, 3, 1], "X1": [0.5, 0.2, 0.9]})
>>> ds = ray.data.from_pandas(df)  
>>> concatenator = Concatenator(columns=["X0", "X1"])
>>> concatenator.transform(ds).to_pandas()  
   concat_out
0  [0.0, 0.5]
1  [3.0, 0.2]
2  [1.0, 0.9]

By default, the created column is called "concat_out", but you can specify a different name.

>>> concatenator = Concatenator(columns=["X0", "X1"], output_column_name="tensor")
>>> concatenator.transform(ds).to_pandas()  
       tensor
0  [0.0, 0.5]
1  [3.0, 0.2]
2  [1.0, 0.9]

To convert the output tensors to a specific dtype, pass dtype.

>>> concatenator = Concatenator(columns=["X0", "X1"], dtype=np.float32)
>>> concatenator.transform(ds)  
Dataset(num_rows=3, schema={concat_out: TensorDtype(shape=(2,), dtype=float32)})

When flatten=True, nested vectors in the columns will be flattened during concatenation:

>>> df = pd.DataFrame({"X0": [[1, 2], [3, 4]], "X1": [0.5, 0.2]})
>>> ds = ray.data.from_pandas(df)  
>>> concatenator = Concatenator(columns=["X0", "X1"], flatten=True)
>>> concatenator.transform(ds).to_pandas()  
   concat_out
0  [1.0, 2.0, 0.5]
1  [3.0, 4.0, 0.2]

Parameters:
  • columns – A list of columns to concatenate. The provided order of the columns will be retained during concatenation.

  • output_column_name – The desired name for the new column. Defaults to "concat_out".

  • dtype – The dtype to convert the output tensors to. If unspecified, the dtype is determined by standard coercion rules.

  • raise_if_missing – If True, an error is raised if any of the columns in columns don’t exist. Defaults to False.

  • flatten – If True, nested vectors in the columns will be flattened during concatenation. Defaults to False.
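
The dtype and flatten parameters can be pictured with a small NumPy sketch. It assumes that "standard coercion rules" behave like NumPy type promotion, which this page does not state explicitly, and `flatten_row` is a hypothetical helper rather than part of Ray.

```python
import numpy as np

# Assumption: coercion follows NumPy type promotion, so int64 with float64 gives float64.
promoted = np.result_type(np.int64, np.float64)


def flatten_row(values, dtype=None):
    """Hypothetical helper: flatten one row's values into a single 1-D vector."""
    # np.ravel turns scalars into length-1 arrays and nested vectors into 1-D arrays.
    parts = [np.ravel(np.asarray(value)) for value in values]
    row = np.concatenate(parts)
    # Cast only when a dtype is requested; otherwise keep the promoted dtype.
    return row if dtype is None else row.astype(dtype)


row = flatten_row([[1, 2], 0.5])                       # nested vector plus a scalar
row32 = flatten_row([[3, 4], 0.2], dtype=np.float32)   # same, with an explicit dtype
print(promoted, row, row32.dtype)
```

Without flattening, a row that mixes a length-2 vector with a scalar has no uniform shape to stack, which is presumably why the flatten option exists.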

Raises:

ValueError – if raise_if_missing is True and a column in columns doesn’t exist in the dataset.
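
A minimal sketch of such a check, in plain Python, might look like the following. The function name `find_missing` and the error message are hypothetical. This page does not say what happens to missing columns when raise_if_missing is False, so the sketch only reports them.

```python
def find_missing(requested, available, raise_if_missing=False):
    """Hypothetical helper: list requested columns that are absent from a dataset."""
    missing = [name for name in requested if name not in available]
    # Only escalate to an error when the caller asked for strict checking.
    if missing and raise_if_missing:
        raise ValueError(f"Missing columns to concatenate: {missing}")
    return missing


# Lenient mode returns the missing names instead of raising.
lenient = find_missing(["X0", "X2"], ["X0", "X1"])

# Strict mode raises ValueError for the same input.
try:
    find_missing(["X0", "X2"], ["X0", "X1"], raise_if_missing=True)
    strict_raised = False
except ValueError:
    strict_raised = True
```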

PublicAPI (alpha): This API is in alpha and may change before becoming stable.

Methods

deserialize

Deserialize a preprocessor from serialized data.

fit

Fit this Preprocessor to the Dataset.

fit_transform

Fit this Preprocessor to the Dataset and then transform the Dataset.

get_preprocessor_class_id

Get the preprocessor class identifier for this preprocessor class.

get_version

Get the version number for this preprocessor class.

preferred_batch_format

Batch format hint for upstream producers to try yielding best block format.

serialize

Serialize this preprocessor to a string or bytes.

set_preprocessor_class_id

Set the preprocessor class identifier for this preprocessor class.

set_version

Set the version number for this preprocessor class.

transform

Transform the given dataset.

transform_batch

Transform a single batch of data.

Attributes

MAGIC_CLOUDPICKLE

SERIALIZER_FORMAT_VERSION

columns

dtype

flatten

output_column_name

raise_if_missing

stat_computation_plan