ray.data.preprocessors.Concatenator
- class ray.data.preprocessors.Concatenator(output_column_name: str = 'concat_out', include: Optional[List[str]] = None, exclude: Optional[Union[str, List[str]]] = None, dtype: Optional[numpy.dtype] = None, raise_if_missing: bool = False)
Bases: ray.data.preprocessor.Preprocessor
Combine numeric columns into a column of type TensorDtype.
This preprocessor concatenates numeric columns and stores the result in a new column. The new column contains TensorArrayElement objects of shape \((m,)\), where \(m\) is the number of columns concatenated. The \(m\) concatenated columns are dropped after concatenation.
Examples
>>> import numpy as np
>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import Concatenator
Concatenator combines numeric columns into a column of TensorDtype.
>>> df = pd.DataFrame({"X0": [0, 3, 1], "X1": [0.5, 0.2, 0.9]})
>>> ds = ray.data.from_pandas(df)
>>> concatenator = Concatenator()
>>> concatenator.fit_transform(ds).to_pandas()
   concat_out
0  [0.0, 0.5]
1  [3.0, 0.2]
2  [1.0, 0.9]
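As a rough illustration of the resulting layout, the following NumPy-only sketch stacks the same two columns so that each row becomes a vector of shape (m,). It does not use Ray, and np.column_stack is only a stand-in here; this page does not describe how Concatenator is implemented internally.

```python
import numpy as np

# Same values as the X0 and X1 columns in the example above.
x0 = np.array([0, 3, 1])
x1 = np.array([0.5, 0.2, 0.9])

# Stack the m = 2 columns side by side: one row per record.
stacked = np.column_stack([x0, x1])

print(stacked.shape)     # (3, 2)
print(stacked[0].shape)  # (2,)
print(stacked.tolist())  # [[0.0, 0.5], [3.0, 0.2], [1.0, 0.9]]
```

Each row of the stacked array corresponds to one entry of the concat_out column shown above.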
By default, the created column is called "concat_out", but you can specify a different name.
>>> concatenator = Concatenator(output_column_name="tensor")
>>> concatenator.fit_transform(ds).to_pandas()
       tensor
0  [0.0, 0.5]
1  [3.0, 0.2]
2  [1.0, 0.9]
Sometimes, you might not want to concatenate all of the columns in your dataset. In this case, you can exclude columns with the exclude parameter.
>>> df = pd.DataFrame({"X0": [0, 3, 1], "X1": [0.5, 0.2, 0.9], "Y": ["blue", "orange", "blue"]})
>>> ds = ray.data.from_pandas(df)
>>> concatenator = Concatenator(exclude=["Y"])
>>> concatenator.fit_transform(ds).to_pandas()
        Y  concat_out
0    blue  [0.0, 0.5]
1  orange  [3.0, 0.2]
2    blue  [1.0, 0.9]
Alternatively, you can specify which columns to concatenate with the include parameter.
>>> concatenator = Concatenator(include=["X0", "X1"])
>>> concatenator.fit_transform(ds).to_pandas()
        Y  concat_out
0    blue  [0.0, 0.5]
1  orange  [3.0, 0.2]
2    blue  [1.0, 0.9]
Note that if a column is in both include and exclude, the column is excluded.
>>> concatenator = Concatenator(include=["X0", "X1", "Y"], exclude=["Y"])
>>> concatenator.fit_transform(ds).to_pandas()
        Y  concat_out
0    blue  [0.0, 0.5]
1  orange  [3.0, 0.2]
2    blue  [1.0, 0.9]
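The precedence between include and exclude can be sketched in plain Python. The helper below, resolve_columns, is hypothetical and not part of Ray; it only mirrors the documented rules (include=None selects every column, and exclusion wins on conflict). Treating a single string passed as exclude like a one-element list is an assumption based on the type hint in the signature.

```python
from typing import List, Optional, Union


def resolve_columns(
    all_columns: List[str],
    include: Optional[List[str]] = None,
    exclude: Optional[Union[str, List[str]]] = None,
) -> List[str]:
    """Hypothetical helper: pick the columns that would be concatenated."""
    # Assumption: a single column name is treated like a one-element list.
    if isinstance(exclude, str):
        exclude = [exclude]
    excluded = set(exclude or [])
    # include=None means "start from every column".
    candidates = all_columns if include is None else include
    # Exclusion wins when a column appears in both lists.
    return [c for c in candidates if c not in excluded]


columns = ["X0", "X1", "Y"]
print(resolve_columns(columns, exclude=["Y"]))                            # ['X0', 'X1']
print(resolve_columns(columns, include=["X0", "X1"]))                     # ['X0', 'X1']
print(resolve_columns(columns, include=["X0", "X1", "Y"], exclude=["Y"]))  # ['X0', 'X1']
```

All three calls select the same columns, which matches the identical outputs of the three examples above.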
By default, the concatenated tensor has a dtype common to the input columns. However, you can also set the dtype explicitly with the dtype parameter.
>>> concatenator = Concatenator(include=["X0", "X1"], dtype=np.float32)
>>> concatenator.fit_transform(ds)
Dataset(num_blocks=1, num_rows=3, schema={Y: object, concat_out: TensorDtype(shape=(2,), dtype=float32)})
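To make "a dtype common to the input columns" more concrete, the sketch below uses NumPy's own type promotion. It assumes, without confirmation from this page, that the default behaves like np.result_type applied to the input dtypes; the explicit cast with astype is likewise only an analogy for the dtype parameter.

```python
import numpy as np

# An integer column and a float column, as in the examples above.
x0 = np.array([0, 3, 1], dtype=np.int64)
x1 = np.array([0.5, 0.2, 0.9], dtype=np.float64)

# NumPy promotes int64 and float64 to float64.
common = np.result_type(x0.dtype, x1.dtype)
print(common)  # float64

# Requesting a specific dtype is analogous to an explicit cast.
as_float32 = np.column_stack([x0, x1]).astype(np.float32)
print(as_float32.dtype)  # float32
```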
- Parameters
  output_column_name – The desired name for the new column. Defaults to "concat_out".
  include – A list of columns to concatenate. If None, all columns are concatenated.
  exclude – A column or list of columns to exclude from concatenation. If a column is in both include and exclude, the column is excluded from concatenation.
  dtype – The dtype to convert the output tensors to. If unspecified, the dtype is determined by standard coercion rules.
  raise_if_missing – If True, an error is raised if any of the columns in include or exclude don't exist. Defaults to False.
- Raises
  ValueError – if raise_if_missing is True and a column in include or exclude doesn't exist in the dataset.
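The effect of raise_if_missing can be sketched with a small, hypothetical validation helper. check_columns_exist is not a Ray function, and the error message is invented for illustration. Returning the list of missing names when raise_if_missing is False is also an assumption, since this page does not say what happens in that case.

```python
from typing import Iterable, List


def check_columns_exist(
    dataset_columns: Iterable[str],
    requested: Iterable[str],
    raise_if_missing: bool = False,
) -> List[str]:
    """Hypothetical helper: report requested columns absent from the dataset."""
    available = set(dataset_columns)
    missing = [c for c in requested if c not in available]
    if missing and raise_if_missing:
        # Mirrors the documented ValueError; the message text is made up.
        raise ValueError(f"Missing columns: {missing}")
    return missing


print(check_columns_exist(["X0", "X1", "Y"], ["X0", "Z"]))  # ['Z']

try:
    check_columns_exist(["X0", "X1", "Y"], ["X0", "Z"], raise_if_missing=True)
except ValueError as err:
    print(err)  # Missing columns: ['Z']
```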
PublicAPI (alpha): This API is in alpha and may change before becoming stable.