ray.data.preprocessors.Concatenator
- class ray.data.preprocessors.Concatenator(output_column_name: str = 'concat_out', include: List[str] | None = None, exclude: str | List[str] | None = None, dtype: numpy.dtype | None = None, raise_if_missing: bool = False)
Bases: Preprocessor
Combine numeric columns into a column of type TensorDtype.

This preprocessor concatenates numeric columns and stores the result in a new column. The new column contains TensorArrayElement objects of shape \((m,)\), where \(m\) is the number of columns concatenated. The \(m\) concatenated columns are dropped after concatenation.

Examples
>>> import numpy as np
>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import Concatenator
Concatenator combines numeric columns into a column of TensorDtype.

>>> df = pd.DataFrame({"X0": [0, 3, 1], "X1": [0.5, 0.2, 0.9]})
>>> ds = ray.data.from_pandas(df)
>>> concatenator = Concatenator()
>>> concatenator.fit_transform(ds).to_pandas()
   concat_out
0  [0.0, 0.5]
1  [3.0, 0.2]
2  [1.0, 0.9]
By default, the created column is called "concat_out", but you can specify a different name.

>>> concatenator = Concatenator(output_column_name="tensor")
>>> concatenator.fit_transform(ds).to_pandas()
       tensor
0  [0.0, 0.5]
1  [3.0, 0.2]
2  [1.0, 0.9]
Sometimes, you might not want to concatenate all of the columns in your dataset. In this case, you can exclude columns with the exclude parameter.

>>> df = pd.DataFrame({"X0": [0, 3, 1], "X1": [0.5, 0.2, 0.9], "Y": ["blue", "orange", "blue"]})
>>> ds = ray.data.from_pandas(df)
>>> concatenator = Concatenator(exclude=["Y"])
>>> concatenator.fit_transform(ds).to_pandas()
        Y  concat_out
0    blue  [0.0, 0.5]
1  orange  [3.0, 0.2]
2    blue  [1.0, 0.9]
Alternatively, you can specify which columns to concatenate with the include parameter.

>>> concatenator = Concatenator(include=["X0", "X1"])
>>> concatenator.fit_transform(ds).to_pandas()
        Y  concat_out
0    blue  [0.0, 0.5]
1  orange  [3.0, 0.2]
2    blue  [1.0, 0.9]
Note that if a column is in both include and exclude, the column is excluded.

>>> concatenator = Concatenator(include=["X0", "X1", "Y"], exclude=["Y"])
>>> concatenator.fit_transform(ds).to_pandas()
        Y  concat_out
0    blue  [0.0, 0.5]
1  orange  [3.0, 0.2]
2    blue  [1.0, 0.9]
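The precedence between include and exclude can be sketched in plain Python. This is only an illustration of the documented rule, not Ray's implementation: the helper name resolve_columns is hypothetical, and the sketch does not model columns that are missing from the dataset.

```python
from typing import List, Optional, Union


def resolve_columns(
    all_columns: List[str],
    include: Optional[List[str]] = None,
    exclude: Optional[Union[str, List[str]]] = None,
) -> List[str]:
    """Return the columns that would be concatenated (hypothetical helper)."""
    # The class signature allows exclude to be a single column name.
    if exclude is None:
        excluded = set()
    elif isinstance(exclude, str):
        excluded = {exclude}
    else:
        excluded = set(exclude)
    # With include=None, every column in the dataset is a candidate.
    candidates = all_columns if include is None else include
    # A column named in both include and exclude is dropped.
    return [c for c in candidates if c not in excluded]


columns = ["X0", "X1", "Y"]
print(resolve_columns(columns, include=["X0", "X1", "Y"], exclude=["Y"]))
print(resolve_columns(columns, exclude="Y"))
```

Both calls print ['X0', 'X1'], matching the columns that end up in concat_out in the examples above.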
By default, the concatenated tensor has a dtype common to the input columns. However, you can also explicitly set the dtype with the dtype parameter.

>>> concatenator = Concatenator(include=["X0", "X1"], dtype=np.float32)
>>> concatenator.fit_transform(ds)
Dataset(num_rows=3, schema={Y: object, concat_out: TensorDtype(shape=(2,), dtype=float32)})
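To see what a "common" dtype might look like, the following NumPy-only sketch stacks an integer column and a float column. It assumes that NumPy's type promotion is a reasonable stand-in for the coercion applied here; the page does not say which mechanism Ray actually uses, so treat this as an analogy rather than a description of the library's internals.

```python
import numpy as np

# Two input columns with different dtypes, mirroring X0 and X1 above.
x0 = np.array([0, 3, 1], dtype=np.int64)
x1 = np.array([0.5, 0.2, 0.9], dtype=np.float64)

# NumPy promotes int64 and float64 to float64 (assumed analogue of the
# default behavior).
common = np.result_type(x0, x1)

# Stacking the columns gives one row of shape (2,) per record.
stacked = np.column_stack([x0, x1])

# An explicit dtype overrides the promoted one, as with dtype=np.float32.
as_float32 = stacked.astype(np.float32)

print(common, stacked.shape, as_float32.dtype)
```

Each row of stacked corresponds to one tensor element of shape (m,) with m = 2, which is why the integer values in X0 appear as floats in the outputs above.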
- Parameters:
  output_column_name – The desired name for the new column. Defaults to "concat_out".
  include – A list of columns to concatenate. If None, all columns are concatenated.
  exclude – A column or list of columns to exclude from concatenation. If a column is in both include and exclude, the column is excluded from concatenation.
  dtype – The dtype to convert the output tensors to. If unspecified, the dtype is determined by standard coercion rules.
  raise_if_missing – If True, an error is raised if any of the columns in include or exclude don’t exist. Defaults to False.
- Raises:
  ValueError – if raise_if_missing is True and a column in include or exclude doesn’t exist in the dataset.
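The raise_if_missing check can be pictured with a small plain-Python sketch. The helper check_columns and its error message are hypothetical, and what the preprocessor does with missing columns when raise_if_missing is False is not stated on this page, so the sketch simply reports them.

```python
from typing import Iterable, List


def check_columns(
    dataset_columns: Iterable[str],
    requested: Iterable[str],
    raise_if_missing: bool = False,
) -> List[str]:
    """Return requested columns absent from the dataset (hypothetical helper)."""
    present = set(dataset_columns)
    missing = [c for c in requested if c not in present]
    # Only the strict mode turns a missing column into an error.
    if missing and raise_if_missing:
        raise ValueError(f"Missing columns: {missing}")
    return missing


try:
    check_columns(["X0", "X1", "Y"], ["X0", "Z"], raise_if_missing=True)
except ValueError as err:
    print(err)
```

Setting raise_if_missing=True is a reasonable choice when a typo in include or exclude should fail loudly instead of silently changing which columns are concatenated.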
PublicAPI (alpha): This API is in alpha and may change before becoming stable.
Methods
deserialize – Load the original preprocessor serialized via self.serialize().
fit – Fit this Preprocessor to the Dataset.
fit_transform – Fit this Preprocessor to the Dataset and then transform the Dataset.
preferred_batch_format – Batch format hint for upstream producers to try yielding best block format.
serialize – Return this preprocessor serialized as a string.
transform – Transform the given dataset.
transform_batch – Transform a single batch of data.