ray.data.preprocessors.UniformKBinsDiscretizer

class ray.data.preprocessors.UniformKBinsDiscretizer(columns: List[str], bins: int | Dict[str, int], *, right: bool = True, include_lowest: bool = False, duplicates: str = 'raise', dtypes: Dict[str, pandas.CategoricalDtype | Type[numpy.integer]] | None = None)

Bases: _AbstractKBinsDiscretizer

Bin values into discrete intervals (bins) of uniform width.

Columns must contain numerical values.

Examples

Use UniformKBinsDiscretizer to bin continuous features.

>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import UniformKBinsDiscretizer
>>> df = pd.DataFrame({
...     "value_1": [0.2, 1.4, 2.5, 6.2, 9.7, 2.1],
...     "value_2": [10, 15, 13, 12, 23, 25],
... })
>>> ds = ray.data.from_pandas(df)
>>> discretizer = UniformKBinsDiscretizer(
...     columns=["value_1", "value_2"], bins=4
... )
>>> discretizer.fit_transform(ds).to_pandas()
   value_1  value_2
0        0        0
1        0        1
2        0        0
3        2        0
4        3        3
5        0        3

You can also specify a different number of bins for each column.

>>> discretizer = UniformKBinsDiscretizer(
...     columns=["value_1", "value_2"], bins={"value_1": 4, "value_2": 3}
... )
>>> discretizer.fit_transform(ds).to_pandas()
   value_1  value_2
0        0        0
1        0        0
2        0        0
3        2        0
4        3        2
5        0        2

Parameters:
  • columns – The columns to discretize.

  • bins – The number of equal-width bins. Either an integer, which applies to all columns, or a dict that maps column names to integers. The range is extended by 0.1% on each side to include the minimum and maximum values.

  • right – Whether each bin includes its rightmost edge.

  • include_lowest – Whether the first interval is left-inclusive.

  • duplicates – Either 'raise' or 'drop'. If bin edges are not unique, 'raise' raises a ValueError and 'drop' drops the non-unique edges.

  • dtypes – An optional dictionary that maps columns to pd.CategoricalDtype objects or np.integer types. If you omit a column from dtypes or give it an integer dtype, the output column consists of ordered integers corresponding to bins. If you give it a pd.CategoricalDtype, the output column has that categorical dtype, with its categories mapped to bins. Use pd.CategoricalDtype(categories, ordered=True) to preserve information about bin order.
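
To make the bins and right parameters concrete, the following is a minimal pure-Python sketch of equal-width binning. It is an illustration only, not Ray's implementation: the helper names uniform_bin_edges and assign_bins are hypothetical, and the sketch assumes the same interval semantics as the examples above (bins closed on the right when right=True). Instead of extending the range by 0.1%, it compares values against interior edges only, which likewise keeps the minimum and maximum inside the first and last bins.

```python
from bisect import bisect_left, bisect_right
from typing import List, Sequence


def uniform_bin_edges(values: Sequence[float], bins: int) -> List[float]:
    """Return bins + 1 equally spaced edges from min(values) to max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    # Pin the last edge to the maximum to avoid floating-point drift.
    return [lo + i * width for i in range(bins)] + [hi]


def assign_bins(values: Sequence[float], bins: int, right: bool = True) -> List[int]:
    """Map each value to the integer index of its equal-width bin.

    Hypothetical helper for illustration; not part of the Ray API.
    """
    edges = uniform_bin_edges(values, bins)
    interior = edges[1:-1]  # the outer edges never exclude a value
    # right=True  -> intervals (a, b]: a value on an edge goes to the lower bin.
    # right=False -> intervals [a, b): a value on an edge goes to the upper bin.
    locate = bisect_left if right else bisect_right
    return [locate(interior, v) for v in values]


value_1 = [0.2, 1.4, 2.5, 6.2, 9.7, 2.1]
value_2 = [10, 15, 13, 12, 23, 25]

print(assign_bins(value_1, 4))               # [0, 0, 0, 2, 3, 0]
print(assign_bins(value_2, 4))               # [0, 1, 0, 0, 3, 3]
print(assign_bins(value_2, 3))               # [0, 0, 0, 0, 2, 2]
print(assign_bins(value_2, 3, right=False))  # [0, 1, 0, 0, 2, 2]
```

The first three results match the columns shown in the examples above. The last call shows how right=False moves the value 15, which sits exactly on an edge, from bin 0 to bin 1.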

See also

CustomKBinsDiscretizer

Use this preprocessor if you want to specify your own bin edges.

PublicAPI (alpha): This API is in alpha and may change before becoming stable.

Methods

deserialize

Load the original preprocessor serialized via self.serialize().

fit

Fit this Preprocessor to the Dataset.

fit_transform

Fit this Preprocessor to the Dataset and then transform the Dataset.

preferred_batch_format

Batch format hint that lets upstream producers try to yield the best block format.

serialize

Return this preprocessor serialized as a string.

transform

Transform the given dataset.

transform_batch

Transform a single batch of data.