ray.data.preprocessors.UniformKBinsDiscretizer
class ray.data.preprocessors.UniformKBinsDiscretizer(columns: List[str], bins: Union[int, Dict[str, int]], *, right: bool = True, include_lowest: bool = False, duplicates: str = 'raise', dtypes: Optional[Dict[str, Union[pandas.core.dtypes.dtypes.CategoricalDtype, Type[numpy.integer]]]] = None)
Bases: ray.data.preprocessors.discretizer._AbstractKBinsDiscretizer
Bin values into discrete intervals (bins) of uniform width.
Columns must contain numerical values.
Examples
Use UniformKBinsDiscretizer to bin continuous features.

>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import UniformKBinsDiscretizer
>>> df = pd.DataFrame({
...     "value_1": [0.2, 1.4, 2.5, 6.2, 9.7, 2.1],
...     "value_2": [10, 15, 13, 12, 23, 25],
... })
>>> ds = ray.data.from_pandas(df)
>>> discretizer = UniformKBinsDiscretizer(
...     columns=["value_1", "value_2"], bins=4
... )
>>> discretizer.fit_transform(ds).to_pandas()
   value_1  value_2
0        0        0
1        0        1
2        0        0
3        2        0
4        3        3
5        0        3
You can also specify a different number of bins for each column.
>>> discretizer = UniformKBinsDiscretizer(
...     columns=["value_1", "value_2"], bins={"value_1": 4, "value_2": 3}
... )
>>> discretizer.fit_transform(ds).to_pandas()
   value_1  value_2
0        0        0
1        0        0
2        0        0
3        2        0
4        3        2
5        0        2
- Parameters
columns – The columns to discretize.
bins – Defines the number of equal-width bins. Can be either an integer (which will be applied to all columns), or a dict that maps columns to integers. The range is extended by .1% on each side to include the minimum and maximum values.
right – Indicates whether the bins include the rightmost edge.
include_lowest – Whether the first interval should be left-inclusive or not.
duplicates – Can be either ‘raise’ or ‘drop’. If bin edges are not unique, raise ValueError or drop non-uniques.
dtypes – An optional dictionary that maps columns to pd.CategoricalDtype objects or np.integer types. If you don’t include a column in dtypes or specify it as an integer dtype, the outputted column will consist of ordered integers corresponding to bins. If you use a pd.CategoricalDtype, the outputted column will be a pd.CategoricalDtype with the categories being mapped to bins. You can use pd.CategoricalDtype(categories, ordered=True) to preserve information about bin order.
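The equal-width binning described for the bins and right parameters can be sketched in plain Python. This is only an illustration under stated assumptions, not Ray's implementation: the helper names uniform_bin_edges and discretize are hypothetical, and the 0.1% padding of the outer edges simply follows the description of bins above. It reproduces the integer codes shown in the examples.

```python
from bisect import bisect_left, bisect_right


def uniform_bin_edges(values, n_bins):
    """Return n_bins + 1 equally spaced edges covering min(values)..max(values).

    Hypothetical helper for illustration; not part of the Ray API.
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("cannot build equal-width bins from a constant column")
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]
    edges[-1] = hi  # avoid floating-point drift on the last edge
    # Assumption taken from the `bins` description: widen the range by 0.1%
    # on each side so the minimum and maximum values land inside a bin.
    pad = 0.001 * (hi - lo)
    edges[0] -= pad
    edges[-1] += pad
    return edges


def discretize(values, n_bins, right=True):
    """Map each value to the integer index of its equal-width bin.

    right=True  uses intervals (a, b]; right=False uses [a, b).
    """
    edges = uniform_bin_edges(values, n_bins)
    locate = bisect_left if right else bisect_right
    return [locate(edges, v) - 1 for v in values]


value_1 = [0.2, 1.4, 2.5, 6.2, 9.7, 2.1]
value_2 = [10, 15, 13, 12, 23, 25]

print(discretize(value_1, 4))               # [0, 0, 0, 2, 3, 0]
print(discretize(value_2, 4))               # [0, 1, 0, 0, 3, 3]
print(discretize(value_2, 3))               # [0, 0, 0, 0, 2, 2]
print(discretize(value_2, 3, right=False))  # [0, 1, 0, 0, 2, 2]
```

The last call shows the effect of right: the value 15 sits exactly on a bin edge, so it belongs to bin 0 when intervals are closed on the right and to bin 1 when they are closed on the left.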
See also
CustomKBinsDiscretizer
If you want to specify your own bin edges.
PublicAPI (alpha): This API is in alpha and may change before becoming stable.