ray.data.preprocessors.UniformKBinsDiscretizer
- class ray.data.preprocessors.UniformKBinsDiscretizer(columns: List[str], bins: int | Dict[str, int], *, right: bool = True, include_lowest: bool = False, duplicates: str = 'raise', dtypes: Dict[str, pandas.CategoricalDtype | Type[numpy.integer]] | None = None)
Bases: _AbstractKBinsDiscretizer
Bin values into discrete intervals (bins) of uniform width.
Columns must contain numerical values.
Examples
Use UniformKBinsDiscretizer to bin continuous features.
>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import UniformKBinsDiscretizer
>>> df = pd.DataFrame({
...     "value_1": [0.2, 1.4, 2.5, 6.2, 9.7, 2.1],
...     "value_2": [10, 15, 13, 12, 23, 25],
... })
>>> ds = ray.data.from_pandas(df)
>>> discretizer = UniformKBinsDiscretizer(
...     columns=["value_1", "value_2"], bins=4
... )
>>> discretizer.fit_transform(ds).to_pandas()
   value_1  value_2
0        0        0
1        0        1
2        0        0
3        2        0
4        3        3
5        0        3
You can also specify a different number of bins for each column.
>>> discretizer = UniformKBinsDiscretizer(
...     columns=["value_1", "value_2"], bins={"value_1": 4, "value_2": 3}
... )
>>> discretizer.fit_transform(ds).to_pandas()
   value_1  value_2
0        0        0
1        0        0
2        0        0
3        2        0
4        3        2
5        0        2
- Parameters:
columns – The columns to discretize.
bins – The number of equal-width bins. Can be either an integer (applied to all columns) or a dict that maps columns to integers. The range is extended by 0.1% on each side to include the minimum and maximum values.
right – Whether each bin includes its rightmost edge.
include_lowest – Whether the first interval should be left-inclusive.
duplicates – Can be either 'raise' or 'drop'. If bin edges are not unique, raise ValueError or drop the non-unique edges.
dtypes – An optional dictionary that maps columns to pd.CategoricalDtype objects or np.integer types. If you don't include a column in dtypes or specify it as an integer dtype, the output column consists of ordered integers corresponding to bins. If you use a pd.CategoricalDtype, the output column is a pd.CategoricalDtype with the categories mapped to bins. You can use pd.CategoricalDtype(categories, ordered=True) to preserve information about bin order.
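The following is a minimal pure-Python sketch of how equal-width bins with the default right=True could be computed, based only on the parameter descriptions above. It is not Ray's implementation: the helper names uniform_bin_edges and assign_bins are hypothetical, and applying the 0.1% extension to the outermost edges after spacing them evenly is an assumption.

```python
from bisect import bisect_left


def uniform_bin_edges(values, bins):
    """Return bins + 1 equal-width edges spanning the values (hypothetical helper)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    edges = [lo + i * width for i in range(bins + 1)]
    # Assumption: widen the outermost edges by 0.1% of the range so that
    # the minimum and maximum values fall inside the first and last bins.
    pad = (hi - lo) * 0.001
    edges[0] -= pad
    edges[-1] += pad
    return edges


def assign_bins(values, edges):
    """Map each value to the index of its right-inclusive bin (e_i, e_{i+1}]."""
    # bisect_left returns the position of the first edge >= value, so a value
    # that sits exactly on an interior edge is placed in the bin to its left.
    return [bisect_left(edges, v) - 1 for v in values]


value_1 = [0.2, 1.4, 2.5, 6.2, 9.7, 2.1]
value_2 = [10, 15, 13, 12, 23, 25]

codes_1 = assign_bins(value_1, uniform_bin_edges(value_1, 4))
codes_2 = assign_bins(value_2, uniform_bin_edges(value_2, 3))
print(codes_1)  # [0, 0, 0, 2, 3, 0]
print(codes_2)  # [0, 0, 0, 0, 2, 2]
```

Under these assumptions, the sketch reproduces the integer codes shown in the examples above, including the value 15 landing in bin 0 of value_2 when three bins are used, because each bin includes its right edge.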
See also
CustomKBinsDiscretizer: Use this if you want to specify your own bin edges.
PublicAPI (alpha): This API is in alpha and may change before becoming stable.
Methods
Load the original preprocessor serialized via self.serialize().
Fit this Preprocessor to the Dataset.
Fit this Preprocessor to the Dataset and then transform the Dataset.
Batch format hint that lets upstream producers try to yield the best block format.
Return this preprocessor serialized as a string.
Transform the given dataset.
Transform a single batch of data.