ray.data.preprocessors.CustomKBinsDiscretizer
- class ray.data.preprocessors.CustomKBinsDiscretizer(columns: List[str], bins: Union[Iterable[float], pandas.core.indexes.interval.IntervalIndex, Dict[str, Union[Iterable[float], pandas.core.indexes.interval.IntervalIndex]]], *, right: bool = True, include_lowest: bool = False, duplicates: str = 'raise', dtypes: Optional[Dict[str, Union[pandas.core.dtypes.dtypes.CategoricalDtype, Type[numpy.integer]]]] = None)
Bases: ray.data.preprocessors.discretizer._AbstractKBinsDiscretizer
Bin values into discrete intervals using custom bin edges.
Columns must contain numerical values.
Examples
Use CustomKBinsDiscretizer to bin continuous features.
>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import CustomKBinsDiscretizer
>>> df = pd.DataFrame({
...     "value_1": [0.2, 1.4, 2.5, 6.2, 9.7, 2.1],
...     "value_2": [10, 15, 13, 12, 23, 25],
... })
>>> ds = ray.data.from_pandas(df)
>>> discretizer = CustomKBinsDiscretizer(
...     columns=["value_1", "value_2"],
...     bins=[0, 1, 4, 10, 25]
... )
>>> discretizer.transform(ds).to_pandas()
   value_1  value_2
0        0        2
1        1        3
2        1        3
3        2        3
4        2        3
5        1        3
You can also specify different bin edges per column.
>>> discretizer = CustomKBinsDiscretizer(
...     columns=["value_1", "value_2"],
...     bins={"value_1": [0, 1, 4], "value_2": [0, 18, 35, 70]},
... )
>>> discretizer.transform(ds).to_pandas()
   value_1  value_2
0      0.0        0
1      1.0        0
2      1.0        0
3      NaN        0
4      NaN        1
5      1.0        1
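To see where the integer codes in the two examples come from, the lookup can be sketched in plain Python. This is only an illustration under the defaults right=True and include_lowest=False, not Ray's implementation. assign_bins is a hypothetical helper, and None stands in for the NaN that marks a value outside every bin.

```python
from bisect import bisect_left


def assign_bins(values, edges):
    """Return the bin index for each value, or None if it falls outside the edges.

    Hypothetical helper. Assumes sorted edges and the defaults right=True and
    include_lowest=False, so bin i covers the interval (edges[i], edges[i + 1]].
    """
    codes = []
    for x in values:
        # bisect_left counts the edges strictly below x; subtracting 1
        # turns that count into a zero-based bin index.
        i = bisect_left(edges, x) - 1
        codes.append(i if 0 <= i < len(edges) - 1 else None)
    return codes


value_1 = [0.2, 1.4, 2.5, 6.2, 9.7, 2.1]
value_2 = [10, 15, 13, 12, 23, 25]

# Shared edges, as in the first example.
shared = [0, 1, 4, 10, 25]
print(assign_bins(value_1, shared))  # [0, 1, 1, 2, 2, 1]
print(assign_bins(value_2, shared))  # [2, 3, 3, 3, 3, 3]

# Per-column edges, as in the second example.
print(assign_bins(value_1, [0, 1, 4]))        # [0, 1, 1, None, None, 1]
print(assign_bins(value_2, [0, 18, 35, 70]))  # [0, 0, 0, 0, 1, 1]
```

In the second example, 6.2 and 9.7 lie above the last edge for value_1, which is why that column contains missing values and is returned as floats.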
- Parameters
columns – The columns to discretize.
bins – Defines custom bin edges. Can be an iterable of numbers, a pd.IntervalIndex, or a dict mapping columns to either of them. Note that a pd.IntervalIndex used for bins must be non-overlapping.
right – Indicates whether bins include the rightmost edge.
include_lowest – Indicates whether the first interval should be left-inclusive.
duplicates – Can be either 'raise' or 'drop'. If bin edges are not unique, either raise a ValueError ('raise') or drop the non-unique edges ('drop').
dtypes – An optional dictionary that maps columns to pd.CategoricalDtype objects or np.integer types. If you don't include a column in dtypes or specify it as an integer dtype, the output column will consist of ordered integers corresponding to bins. If you use a pd.CategoricalDtype, the output column will have that pd.CategoricalDtype, with the categories mapped to bins. You can use pd.CategoricalDtype(categories, ordered=True) to preserve information about bin order.
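The edge-handling parameters are easiest to understand on boundary values. The following is a minimal pure-Python sketch, assuming right, include_lowest, and duplicates follow the same conventions as the identically named arguments of pandas.cut. bin_index is a hypothetical helper written for illustration, not part of Ray's API, and it expects the edges in ascending order.

```python
from bisect import bisect_left, bisect_right


def bin_index(x, edges, right=True, include_lowest=False, duplicates="raise"):
    """Hypothetical helper: return the bin index of x, or None if x is outside all bins."""
    # Assumes edges are given in ascending order.
    unique_edges = sorted(set(edges))
    if len(unique_edges) < len(edges):
        if duplicates == "raise":
            raise ValueError("bin edges must be unique")
        edges = unique_edges  # duplicates="drop": keep one copy of each edge

    # right=True  -> bins are (a, b]; right=False -> bins are [a, b).
    pos = bisect_left(edges, x) if right else bisect_right(edges, x)
    if include_lowest and x == edges[0]:
        pos = 1  # pull the lowest edge into the first bin
    if pos == 0 or pos == len(edges):
        return None  # x lies outside every bin
    return pos - 1


edges = [0, 1, 4]
print(bin_index(1, edges))                       # 0: 1 closes the bin (0, 1]
print(bin_index(1, edges, right=False))          # 1: 1 opens the bin [1, 4)
print(bin_index(0, edges))                       # None: 0 is excluded from (0, 1]
print(bin_index(0, edges, include_lowest=True))  # 0: first bin becomes [0, 1]
print(bin_index(4, edges, right=False))          # None: 4 is excluded from [1, 4)
print(bin_index(3, [0, 1, 1, 4], duplicates="drop"))  # 1: edges reduce to [0, 1, 4]
```

With the default duplicates="raise", the last call would raise a ValueError instead, because the edge 1 appears twice.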
See also
UniformKBinsDiscretizer
Use this preprocessor if you want to bin data into uniform-width bins.
PublicAPI (alpha): This API is in alpha and may change before becoming stable.