ray.data.preprocessors.Concatenator
- class ray.data.preprocessors.Concatenator(columns: List[str], output_column_name: str = 'concat_out', dtype: numpy.dtype | None = None, raise_if_missing: bool = False)
Bases: Preprocessor
Combine numeric columns into a column of type TensorDtype. Only columns specified in columns will be concatenated.

This preprocessor concatenates numeric columns and stores the result in a new column. The new column contains TensorArrayElement objects of shape \((m,)\), where \(m\) is the number of columns concatenated. The \(m\) concatenated columns are dropped after concatenation. The preprocessor preserves the order of the columns provided in the columns argument and uses that order when calling transform() and transform_batch().

Examples
>>> import numpy as np
>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import Concatenator

Concatenator combines numeric columns into a column of TensorDtype.

>>> df = pd.DataFrame({"X0": [0, 3, 1], "X1": [0.5, 0.2, 0.9]})
>>> ds = ray.data.from_pandas(df)
>>> concatenator = Concatenator(columns=["X0", "X1"])
>>> concatenator.transform(ds).to_pandas()
   concat_out
0  [0.0, 0.5]
1  [3.0, 0.2]
2  [1.0, 0.9]

By default, the created column is called "concat_out", but you can specify a different name.

>>> concatenator = Concatenator(columns=["X0", "X1"], output_column_name="tensor")
>>> concatenator.transform(ds).to_pandas()
       tensor
0  [0.0, 0.5]
1  [3.0, 0.2]
2  [1.0, 0.9]
To set the dtype of the output tensors, pass dtype.

>>> concatenator = Concatenator(columns=["X0", "X1"], dtype=np.float32)
>>> concatenator.transform(ds)
Dataset(num_rows=3, schema={concat_out: TensorDtype(shape=(2,), dtype=float32)})
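
The row-wise behavior described above (caller-specified column order, input columns dropped, one length-\(m\) vector per row) can be sketched with plain pandas and NumPy. This is an illustrative stand-in, not Ray's implementation: the helper name concat_columns and the extra "Y" column are hypothetical, and the result holds plain NumPy arrays rather than TensorArrayElement objects.

```python
import numpy as np
import pandas as pd


def concat_columns(df, columns, output_column_name="concat_out", dtype=None):
    """Hypothetical pandas-only stand-in for the documented behavior."""
    # Select the columns in the caller's order; this fixes the order of
    # the elements inside each output vector.
    stacked = df[columns].to_numpy(dtype=dtype)
    # Drop the input columns and store one length-m vector per row.
    out = df.drop(columns=columns)
    out[output_column_name] = pd.Series(list(stacked), index=df.index)
    return out


df = pd.DataFrame({"X0": [0, 3, 1], "X1": [0.5, 0.2, 0.9], "Y": ["a", "b", "c"]})

# Listing "X1" before "X0" puts the X1 value first in every vector.
result = concat_columns(df, ["X1", "X0"])
first_row = result["concat_out"].iloc[0]
```

Here first_row has shape (2,), and "Y" is left untouched because it was not listed in columns.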
- Parameters:
output_column_name – The desired name for the new column. Defaults to "concat_out".
columns – A list of columns to concatenate. The provided order of the columns will be retained during concatenation.
dtype – The dtype to convert the output tensors to. If unspecified, the dtype is determined by standard coercion rules.
raise_if_missing – If True, an error is raised if any of the columns in columns don’t exist. Defaults to False.
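
As a rough illustration of the dtype parameter, the sketch below assumes that "standard coercion rules" means NumPy type promotion applied when the selected columns are converted to a single array. That reading is an assumption, though it is consistent with the integer column X0 appearing as floats in the examples above. The sketch uses only pandas and NumPy, not the preprocessor itself.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"X0": [0, 3, 1], "X1": [0.5, 0.2, 0.9]})

# With no explicit dtype, an int64 column and a float64 column are
# promoted to a common dtype (float64) under NumPy's promotion rules.
promoted = df[["X0", "X1"]].to_numpy()

# An explicit dtype overrides promotion, analogous to dtype=np.float32.
as_float32 = df[["X0", "X1"]].to_numpy(dtype=np.float32)

# The promoted dtype can be predicted without touching the data.
predicted = np.result_type(df["X0"].dtype, df["X1"].dtype)
```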
- Raises:
ValueError – if raise_if_missing is True and a column in columns doesn’t exist in the dataset.
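
The check documented for raise_if_missing=True can be sketched as follows. The helper check_columns and its error message are hypothetical, and Ray's actual message may differ. Only the True case is modeled, because that is the only case the text above specifies.

```python
import pandas as pd


def check_columns(df, columns):
    """Hypothetical helper mirroring the documented raise_if_missing=True check."""
    missing = [name for name in columns if name not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")


df = pd.DataFrame({"X0": [0, 3, 1], "X1": [0.5, 0.2, 0.9]})

# Every requested column exists, so the check passes silently.
check_columns(df, ["X0", "X1"])

# "X2" is not in the frame, so the check raises ValueError.
try:
    check_columns(df, ["X0", "X2"])
except ValueError as exc:
    message = str(exc)
```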
PublicAPI (alpha): This API is in alpha and may change before becoming stable.
Methods
deserialize – Load the original preprocessor serialized via self.serialize().
fit – Fit this Preprocessor to the Dataset.
fit_transform – Fit this Preprocessor to the Dataset and then transform the Dataset.
preferred_batch_format – Batch format hint that lets upstream producers try to yield the best block format.
serialize – Return this preprocessor serialized as a string.
transform – Transform the given dataset.
transform_batch – Transform a single batch of data.