ray.data.preprocessors.Concatenator

class ray.data.preprocessors.Concatenator(columns: List[str], output_column_name: str = 'concat_out', dtype: numpy.dtype | None = None, raise_if_missing: bool = False, flatten: bool = False)

Bases: SerializablePreprocessorBase

Combine numeric columns into a column of type TensorDtype. Only columns specified in columns will be concatenated.

This preprocessor concatenates numeric columns and stores the result in a new column. The new column contains TensorArrayElement objects of shape \((m,)\), where \(m\) is the number of columns concatenated. The \(m\) concatenated columns are dropped after concatenation. The preprocessor preserves the order of the columns given in the columns argument and uses that order in both transform() and transform_batch().
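The behavior described above can be pictured with a small pure-Python model. This is only a conceptual sketch, not Ray's implementation: the helper name concatenate_columns and the dict-of-lists batch are hypothetical, and the sketch assumes that columns not listed in columns are left untouched.

```python
def concatenate_columns(batch, columns, output_column_name="concat_out"):
    """Conceptual model only: merge `columns` of a column-oriented batch row by row."""
    num_rows = len(batch[columns[0]])
    # Build one vector per row, taking values in the order given by `columns`.
    merged = [[float(batch[name][i]) for name in columns] for i in range(num_rows)]
    # Drop the concatenated source columns; keep every other column as-is (assumption).
    result = {name: values for name, values in batch.items() if name not in columns}
    result[output_column_name] = merged
    return result


batch = {"X0": [0, 3, 1], "X1": [0.5, 0.2, 0.9], "Y": ["a", "b", "c"]}
# Listing X1 before X0 puts X1's values first in every output vector.
out = concatenate_columns(batch, columns=["X1", "X0"])
print(out["concat_out"])  # [[0.5, 0.0], [0.2, 3.0], [0.9, 1.0]]
print(sorted(out))        # ['Y', 'concat_out']
```

Each output vector has length \(m = 2\), and reversing the list passed as columns reverses the element order inside every vector.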

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import Concatenator

Concatenator combines numeric columns into a column of TensorDtype.

>>> df = pd.DataFrame({"X0": [0, 3, 1], "X1": [0.5, 0.2, 0.9]})
>>> ds = ray.data.from_pandas(df)  
>>> concatenator = Concatenator(columns=["X0", "X1"])
>>> concatenator.transform(ds).to_pandas()  
   concat_out
0  [0.0, 0.5]
1  [3.0, 0.2]
2  [1.0, 0.9]

By default, the created column is called "concat_out", but you can specify a different name.

>>> concatenator = Concatenator(columns=["X0", "X1"], output_column_name="tensor")
>>> concatenator.transform(ds).to_pandas()  
       tensor
0  [0.0, 0.5]
1  [3.0, 0.2]
2  [1.0, 0.9]

To convert the output tensors to a specific dtype, pass dtype.

>>> concatenator = Concatenator(columns=["X0", "X1"], dtype=np.float32)
>>> concatenator.transform(ds)  
Dataset(num_rows=3, schema={concat_out: TensorDtype(shape=(2,), dtype=float32)})

When flatten=True, nested vectors in the columns will be flattened during concatenation:

>>> df = pd.DataFrame({"X0": [[1, 2], [3, 4]], "X1": [0.5, 0.2]})
>>> ds = ray.data.from_pandas(df)  
>>> concatenator = Concatenator(columns=["X0", "X1"], flatten=True)
>>> concatenator.transform(ds).to_pandas()  
   concat_out
0  [1.0, 2.0, 0.5]
1  [3.0, 4.0, 0.2]
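As a rough mental model of the flattening step, the sketch below expands list-like entries in place and keeps scalars as single elements. The helper flatten_row is hypothetical and is not part of Ray; it only mirrors the output shown in the example above.

```python
def flatten_row(values):
    """Conceptual model only: expand list-like entries, keep scalars as one element."""
    flat = []
    for value in values:
        if isinstance(value, (list, tuple)):
            # A nested vector contributes all of its elements, in order.
            flat.extend(float(v) for v in value)
        else:
            # A scalar contributes exactly one element.
            flat.append(float(value))
    return flat


# Each tuple holds one row's values for (X0, X1), matching the DataFrame above.
rows = [([1, 2], 0.5), ([3, 4], 0.2)]
flattened = [flatten_row(row) for row in rows]
print(flattened)  # [[1.0, 2.0, 0.5], [3.0, 4.0, 0.2]]
```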

Parameters:

columns – A list of columns to concatenate. The provided order of the columns will be retained during concatenation.
output_column_name – The desired name for the new column. Defaults to "concat_out".
dtype – The dtype to convert the output tensors to. If unspecified, the dtype is determined by standard coercion rules.
raise_if_missing – If True, an error is raised if any of the columns in columns don’t exist. Defaults to False.
flatten – If True, nested vectors in the columns will be flattened during concatenation. Defaults to False.
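The dtype parameter can be illustrated with plain NumPy. This sketch assumes that the "standard coercion rules" mentioned above behave like NumPy's type promotion, which is consistent with the integer column X0 appearing as floats in the examples; it does not call Ray.

```python
import numpy as np

x0 = np.array([0, 3, 1])        # integer column
x1 = np.array([0.5, 0.2, 0.9])  # float64 column

# Stacking an integer column next to a float64 column promotes the result to float64.
stacked = np.column_stack([x0, x1])
print(stacked.dtype)  # float64
print(stacked.shape)  # (3, 2)

# An explicit dtype replaces the promoted one, analogous to dtype=np.float32.
as_float32 = stacked.astype(np.float32)
print(as_float32.dtype)  # float32
```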

Raises:

ValueError – if raise_if_missing is True and a column in columns doesn’t exist in the dataset.

PublicAPI (alpha): This API is in alpha and may change before becoming stable.

Methods

`deserialize`	Deserialize a preprocessor from serialized data.
`fit`	Fit this Preprocessor to the Dataset.
`fit_transform`	Fit this Preprocessor to the Dataset and then transform the Dataset.
`get_preprocessor_class_id`	Get the preprocessor class identifier for this preprocessor class.
`get_version`	Get the version number for this preprocessor class.
`preferred_batch_format`	Batch format hint for upstream producers to try yielding best block format.
`serialize`	Serialize this preprocessor to a string or bytes.
`set_preprocessor_class_id`	Set the preprocessor class identifier for this preprocessor class.
`set_version`	Set the version number for this preprocessor class.
`transform`	Transform the given dataset.
`transform_batch`	Transform a single batch of data.

Attributes

`MAGIC_CLOUDPICKLE`
`SERIALIZER_FORMAT_VERSION`
`columns`
`dtype`
`flatten`
`output_column_name`
`raise_if_missing`
`stat_computation_plan`