
class ray.data.preprocessors.CustomKBinsDiscretizer(columns: List[str], bins: Iterable[float] | pandas.IntervalIndex | Dict[str, Iterable[float] | pandas.IntervalIndex], *, right: bool = True, include_lowest: bool = False, duplicates: str = 'raise', dtypes: Dict[str, pandas.CategoricalDtype | Type[numpy.integer]] | None = None, output_columns: List[str] | None = None)

Bases: _AbstractKBinsDiscretizer

Bin values into discrete intervals using custom bin edges.

Columns must contain numerical values.


Use CustomKBinsDiscretizer to bin continuous features.

>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import CustomKBinsDiscretizer
>>> df = pd.DataFrame({
...     "value_1": [0.2, 1.4, 2.5, 6.2, 9.7, 2.1],
...     "value_2": [10, 15, 13, 12, 23, 25],
... })
>>> ds = ray.data.from_pandas(df)
>>> discretizer = CustomKBinsDiscretizer(
...     columns=["value_1", "value_2"],
...     bins=[0, 1, 4, 10, 25]
... )
>>> discretizer.transform(ds).to_pandas()
   value_1  value_2
0        0        2
1        1        3
2        1        3
3        2        3
4        2        3
5        1        3

CustomKBinsDiscretizer can also be used in append mode: pass output_columns to name the columns that should hold the discretized values, and the input columns are left unchanged.

>>> discretizer = CustomKBinsDiscretizer(
...     columns=["value_1", "value_2"],
...     bins=[0, 1, 4, 10, 25],
...     output_columns=["value_1_discretized", "value_2_discretized"]
... )
>>> discretizer.fit_transform(ds).to_pandas()  
   value_1  value_2  value_1_discretized  value_2_discretized
0      0.2       10                    0                    2
1      1.4       15                    1                    3
2      2.5       13                    1                    3
3      6.2       12                    2                    3
4      9.7       23                    2                    3
5      2.1       25                    1                    3

You can also specify different bin edges per column.

>>> discretizer = CustomKBinsDiscretizer(
...     columns=["value_1", "value_2"],
...     bins={"value_1": [0, 1, 4], "value_2": [0, 18, 35, 70]},
... )
>>> discretizer.transform(ds).to_pandas()
   value_1  value_2
0      0.0        0
1      1.0        0
2      1.0        0
3      NaN        0
4      NaN        1
5      1.0        1
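
The indices in the examples above follow from the interval semantics of the defaults right=True and include_lowest=False: the edges [0, 1, 4] define the bins (0, 1] and (1, 4], and a value that falls outside every bin (6.2 and 9.7 for value_1) comes back as NaN. The pure-Python sketch below reproduces that index computation for illustration only. It is not Ray's implementation; bin_index is a hypothetical helper, and None stands in for NaN.

```python
from bisect import bisect_left, bisect_right
from typing import Optional, Sequence


def bin_index(
    value: float,
    edges: Sequence[float],
    right: bool = True,
    include_lowest: bool = False,
) -> Optional[int]:
    """Return the 0-based bin index of value, or None if it lies outside all bins.

    Hypothetical helper; edges must be sorted in increasing order.
    """
    if right:
        # Bins are (edges[i], edges[i + 1]]; the lowest edge is excluded
        # unless include_lowest is set.
        if include_lowest and value == edges[0]:
            return 0
        i = bisect_left(edges, value) - 1
    else:
        # Bins are [edges[i], edges[i + 1]); the highest edge is excluded.
        i = bisect_right(edges, value) - 1
    return i if 0 <= i < len(edges) - 1 else None


value_1 = [0.2, 1.4, 2.5, 6.2, 9.7, 2.1]
shared = [bin_index(v, [0, 1, 4, 10, 25]) for v in value_1]  # [0, 1, 1, 2, 2, 1]
narrow = [bin_index(v, [0, 1, 4]) for v in value_1]  # [0, 1, 1, None, None, 1]
```

Because the last edge in [0, 1, 4] is 4, widening the edges (for example to [0, 1, 4, 10]) is one way to avoid missing values in the output.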

Parameters

  • columns – The columns to discretize.

  • bins – Defines custom bin edges. Can be an iterable of numbers, a pd.IntervalIndex, or a dict mapping columns to either of them. Note that pd.IntervalIndex for bins must be non-overlapping.

  • right – Indicates whether bins include the rightmost edge.

  • include_lowest – Indicates whether the first interval should be left-inclusive.

  • duplicates – Either ‘raise’ or ‘drop’. If bin edges are not unique, ‘raise’ raises a ValueError and ‘drop’ drops the duplicate edges.

  • dtypes – An optional dictionary that maps columns to pd.CategoricalDtype objects or np.integer types. If you omit a column from dtypes or give it an integer dtype, the output column consists of ordered integers corresponding to bins. If you use a pd.CategoricalDtype, the output column is categorical, with the categories mapped to bins. Use pd.CategoricalDtype(categories, ordered=True) to preserve information about bin order.

  • output_columns – The names of the transformed columns. If None, the transformed columns have the same names as the input columns. If not None, the length of output_columns must match the length of columns; otherwise an error is raised.

See also

UniformKBinsDiscretizer: Use this preprocessor if you want to bin data into uniform-width bins.

PublicAPI (alpha): This API is in alpha and may change before becoming stable.



Methods

  • deserialize – Load the original preprocessor serialized via self.serialize().

  • fit – Fit this Preprocessor to the Dataset.

  • fit_transform – Fit this Preprocessor to the Dataset and then transform the Dataset.

  • preferred_batch_format – Batch format hint for upstream producers to try yielding the best block format.

  • serialize – Return this preprocessor serialized as a string.

  • transform – Transform the given dataset.

  • transform_batch – Transform a single batch of data.