ray.data.preprocessors.UniformKBinsDiscretizer

class ray.data.preprocessors.UniformKBinsDiscretizer(columns: List[str], bins: int | Dict[str, int], *, right: bool = True, include_lowest: bool = False, duplicates: str = 'raise', dtypes: Dict[str, pandas.CategoricalDtype | Type[numpy.integer]] | None = None, output_columns: List[str] | None = None)

Bases: _AbstractKBinsDiscretizer

Bin values into discrete intervals (bins) of uniform width.

Columns must contain numerical values.

Examples

Use UniformKBinsDiscretizer to bin continuous features.

>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import UniformKBinsDiscretizer
>>> df = pd.DataFrame({
...     "value_1": [0.2, 1.4, 2.5, 6.2, 9.7, 2.1],
...     "value_2": [10, 15, 13, 12, 23, 25],
... })
>>> ds = ray.data.from_pandas(df)
>>> discretizer = UniformKBinsDiscretizer(
...     columns=["value_1", "value_2"], bins=4
... )
>>> discretizer.fit_transform(ds).to_pandas()
   value_1  value_2
0        0        0
1        0        1
2        0        0
3        2        0
4        3        3
5        0        3

UniformKBinsDiscretizer can also be used in append mode: pass output_columns to name the new columns that hold the discretized values, and the input columns are left unchanged.

>>> discretizer = UniformKBinsDiscretizer(
...     columns=["value_1", "value_2"],
...     bins=4,
...     output_columns=["value_1_discretized", "value_2_discretized"]
... )
>>> discretizer.fit_transform(ds).to_pandas()  
   value_1  value_2  value_1_discretized  value_2_discretized
0      0.2       10                    0                    0
1      1.4       15                    0                    1
2      2.5       13                    0                    0
3      6.2       12                    2                    0
4      9.7       23                    3                    3
5      2.1       25                    0                    3

You can also specify a different number of bins for each column.

>>> discretizer = UniformKBinsDiscretizer(
...     columns=["value_1", "value_2"], bins={"value_1": 4, "value_2": 3}
... )
>>> discretizer.fit_transform(ds).to_pandas()
   value_1  value_2
0        0        0
1        0        0
2        0        0
3        2        0
4        3        2
5        0        2

Parameters:

columns – The columns to discretize.
bins – Defines the number of equal-width bins. Can be either an integer (which will be applied to all columns), or a dict that maps columns to integers. The range is extended by .1% on each side to include the minimum and maximum values.
right – Whether each bin includes its rightmost edge.
include_lowest – Whether the first interval should be left-inclusive or not.
duplicates – Can be either ‘raise’ or ‘drop’. If bin edges are not unique, raise ValueError or drop non-uniques.
dtypes – An optional dictionary that maps columns to pd.CategoricalDtype objects or np.integer types. If you omit a column from dtypes or give it an integer dtype, the output column consists of ordered integers corresponding to bins. If you use a pd.CategoricalDtype, the output column is categorical, with the categories mapped to bins. Use pd.CategoricalDtype(categories, ordered=True) to preserve information about bin order.
output_columns – The names of the transformed columns. If None, the transformed columns keep the names of the input columns. If not None, the length of output_columns must match the length of columns; otherwise, an error is raised.
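
To make the equal-width binning described under bins more concrete, the following is a minimal pure-Python sketch of how the bin indices in the examples above can be derived by hand. It is not Ray's implementation: the helpers uniform_bin_edges and assign_bins are hypothetical names, the sketch assumes the default right=True (right-closed intervals), and it extends only the lowest edge by 0.1% of the range so that the minimum value lands in the first bin.

```python
from bisect import bisect_left
from typing import List, Sequence


def uniform_bin_edges(values: Sequence[float], bins: int) -> List[float]:
    """Return bins + 1 equally spaced edges covering min(values)..max(values).

    Hypothetical helper for illustration only, not part of Ray.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    edges = [lo + i * width for i in range(bins + 1)]
    # Pin the last edge to the exact maximum to avoid floating-point drift.
    edges[-1] = hi
    # Assumption: lower the first edge by 0.1% of the range so that the
    # minimum value falls inside the first right-closed interval.
    edges[0] = lo - 0.001 * (hi - lo)
    return edges


def assign_bins(values: Sequence[float], edges: Sequence[float]) -> List[int]:
    """Map each value to the index i of the interval (edges[i], edges[i + 1]]."""
    # bisect_left finds the first edge >= value, which is the right edge
    # of the interval containing it; subtracting 1 gives the bin index.
    return [bisect_left(edges, v) - 1 for v in values]


value_1 = [0.2, 1.4, 2.5, 6.2, 9.7, 2.1]
value_2 = [10, 15, 13, 12, 23, 25]

bins_1 = assign_bins(value_1, uniform_bin_edges(value_1, 4))
bins_2 = assign_bins(value_2, uniform_bin_edges(value_2, 4))
bins_2_three = assign_bins(value_2, uniform_bin_edges(value_2, 3))
```

With four bins, value_1 spans 0.2 to 9.7, so each bin is 2.375 wide. The values 0.2, 1.4, 2.5, and 2.1 all fall below the first interior edge at 2.575 and share bin 0, which matches the first example.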

`deserialize`	Load the original preprocessor serialized via `self.serialize()`.
`fit`	Fit this Preprocessor to the Dataset.
`fit_transform`	Fit this Preprocessor to the Dataset and then transform the Dataset.
`preferred_batch_format`	Batch format hint for upstream producers to try yielding best block format.
`serialize`	Return this preprocessor serialized as a string.
`transform`	Transform the given dataset.
`transform_batch`	Transform a single batch of data.