ray.data.preprocessors.CustomKBinsDiscretizer

class ray.data.preprocessors.CustomKBinsDiscretizer(columns: List[str], bins: Iterable[float] | pandas.IntervalIndex | Dict[str, Iterable[float] | pandas.IntervalIndex], *, right: bool = True, include_lowest: bool = False, duplicates: str = 'raise', dtypes: Dict[str, pandas.CategoricalDtype | Type[numpy.integer]] | None = None, output_columns: List[str] | None = None)

Bases: _AbstractKBinsDiscretizer

Bin values into discrete intervals using custom bin edges.

Columns must contain numerical values.

Examples

Use CustomKBinsDiscretizer to bin continuous features.

>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import CustomKBinsDiscretizer
>>> df = pd.DataFrame({
...     "value_1": [0.2, 1.4, 2.5, 6.2, 9.7, 2.1],
...     "value_2": [10, 15, 13, 12, 23, 25],
... })
>>> ds = ray.data.from_pandas(df)
>>> discretizer = CustomKBinsDiscretizer(
...     columns=["value_1", "value_2"],
...     bins=[0, 1, 4, 10, 25]
... )
>>> discretizer.transform(ds).to_pandas()
   value_1  value_2
0        0        2
1        1        3
2        1        3
3        2        3
4        2        3
5        1        3

CustomKBinsDiscretizer can also run in append mode: pass output_columns to name the columns that hold the encoded values, and the input columns are kept unchanged.

>>> discretizer = CustomKBinsDiscretizer(
...     columns=["value_1", "value_2"],
...     bins=[0, 1, 4, 10, 25],
...     output_columns=["value_1_discretized", "value_2_discretized"]
... )
>>> discretizer.fit_transform(ds).to_pandas()
   value_1  value_2  value_1_discretized  value_2_discretized
0      0.2       10                    0                    2
1      1.4       15                    1                    3
2      2.5       13                    1                    3
3      6.2       12                    2                    3
4      9.7       23                    2                    3
5      2.1       25                    1                    3

You can also specify different bin edges per column. Values that fall outside a column's bin edges, such as 6.2 and 9.7 in value_1 below, become NaN.

>>> discretizer = CustomKBinsDiscretizer(
...     columns=["value_1", "value_2"],
...     bins={"value_1": [0, 1, 4], "value_2": [0, 18, 35, 70]},
... )
>>> discretizer.transform(ds).to_pandas()
   value_1  value_2
0      0.0        0
1      1.0        0
2      1.0        0
3      NaN        0
4      NaN        1
5      1.0        1
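The bins parameter is also documented as accepting a pd.IntervalIndex, which must be non-overlapping. The following is a minimal pandas-only sketch of building such an index and checking which interval each value lands in. It uses pandas.cut directly as a stand-in and does not call Ray. The assumption that the discretizer bins values the same way is not confirmed by this page, and the names intervals, values, binned, and codes are illustrative.

```python
import pandas as pd

# Three right-closed intervals with a deliberate gap between 4 and 10.
intervals = pd.IntervalIndex.from_tuples([(0, 1), (1, 4), (10, 25)])

# Adjacent right-closed intervals share an endpoint but do not overlap.
assert not intervals.is_overlapping

values = pd.Series([0.2, 1.4, 6.2, 23.0])

# With an IntervalIndex, pandas.cut assigns each value to the interval containing it.
binned = pd.cut(values, intervals)

# Integer position of each value's interval; -1 marks values that match no interval.
codes = binned.cat.codes
print(codes.tolist())
```

An index built this way is the kind of object the bins parameter describes. Here the value 6.2 falls in the gap between intervals and is left unassigned.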

Parameters:

columns – The columns to discretize.
bins – Defines custom bin edges. Can be an iterable of numbers, a pd.IntervalIndex, or a dict mapping columns to either of them. Note that pd.IntervalIndex for bins must be non-overlapping.
right – Indicates whether bins include the rightmost edge.
include_lowest – Indicates whether the first interval should be left-inclusive.
duplicates – Either 'raise' or 'drop'. If bin edges are not unique, 'raise' raises a ValueError and 'drop' removes the duplicate edges.
dtypes – An optional dictionary that maps columns to pd.CategoricalDtype objects or np.integer types. If you omit a column from dtypes or give it an integer dtype, the output column consists of ordered integers corresponding to bins. If you use a pd.CategoricalDtype, the output column is categorical, with the categories mapped to bins. You can use pd.CategoricalDtype(categories, ordered=True) to preserve information about bin order.
output_columns – The names of the transformed columns. If None, the transformed columns keep the names of the input columns. If not None, the length of output_columns must match the length of columns; otherwise an error is raised.
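The right, include_lowest, duplicates, and dtypes parameters have the same names and descriptions as options of pandas.cut. As a rough illustration, assuming they behave like their pandas counterparts (an assumption, not something this page states), the sketch below shows their effect with pandas.cut alone. The variable names are illustrative, and the labels/ordered arguments only approximate what a pd.CategoricalDtype in dtypes is described as doing.

```python
import pandas as pd

values = pd.Series([0.0, 1.0, 2.5, 4.0])
edges = [0, 1, 4]

# labels=False returns integer bin indices instead of Interval labels.
# right=True: bins are (0, 1] and (1, 4], so 0.0 falls outside and becomes NaN.
right_closed = pd.cut(values, edges, right=True, labels=False)

# include_lowest=True also closes the first bin on the left: [0, 1], (1, 4].
with_lowest = pd.cut(values, edges, right=True, include_lowest=True, labels=False)

# right=False: bins are [0, 1) and [1, 4), so 4.0 falls outside and becomes NaN.
left_closed = pd.cut(values, edges, right=False, labels=False)

# duplicates="drop" removes the repeated edge; the default "raise" would raise ValueError.
deduplicated = pd.cut(values, [0, 1, 1, 4], labels=False, duplicates="drop")

# Categorical output with named, ordered bins (approximates a pd.CategoricalDtype in dtypes).
size_dtype = pd.CategoricalDtype(["small", "large"], ordered=True)
labeled = pd.cut(
    values,
    edges,
    labels=list(size_dtype.categories),
    ordered=size_dtype.ordered,
    include_lowest=True,
)
print(labeled.tolist())
```

Comparing right_closed with with_lowest shows why include_lowest matters when the data contains the smallest bin edge: without it, that value is left unbinned.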

Methods:

`deserialize`	Load the original preprocessor serialized via `self.serialize()`.
`fit`	Fit this Preprocessor to the Dataset.
`fit_transform`	Fit this Preprocessor to the Dataset and then transform the Dataset.
`preferred_batch_format`	Batch format hint for upstream producers to try yielding best block format.
`serialize`	Return this preprocessor serialized as a string.
`transform`	Transform the given dataset.
`transform_batch`	Transform a single batch of data.