ray.data.preprocessors.UniformKBinsDiscretizer
class ray.data.preprocessors.UniformKBinsDiscretizer(columns: List[str], bins: int | Dict[str, int], *, right: bool = True, include_lowest: bool = False, duplicates: str = 'raise', dtypes: Dict[str, pandas.CategoricalDtype | Type[numpy.integer]] | None = None, output_columns: List[str] | None = None)
Bases: _AbstractKBinsDiscretizer
Bin values into discrete intervals (bins) of uniform width.
Columns must contain numerical values.
Examples
Use UniformKBinsDiscretizer to bin continuous features.

>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import UniformKBinsDiscretizer
>>> df = pd.DataFrame({
...     "value_1": [0.2, 1.4, 2.5, 6.2, 9.7, 2.1],
...     "value_2": [10, 15, 13, 12, 23, 25],
... })
>>> ds = ray.data.from_pandas(df)
>>> discretizer = UniformKBinsDiscretizer(
...     columns=["value_1", "value_2"], bins=4
... )
>>> discretizer.fit_transform(ds).to_pandas()
   value_1  value_2
0        0        0
1        0        1
2        0        0
3        2        0
4        3        3
5        0        3
UniformKBinsDiscretizer can also be used in append mode: pass the names of the columns that should hold the encoded values as output_columns, and the input columns are kept unchanged.

>>> discretizer = UniformKBinsDiscretizer(
...     columns=["value_1", "value_2"],
...     bins=4,
...     output_columns=["value_1_discretized", "value_2_discretized"]
... )
>>> discretizer.fit_transform(ds).to_pandas()
   value_1  value_2  value_1_discretized  value_2_discretized
0      0.2       10                    0                    0
1      1.4       15                    0                    1
2      2.5       13                    0                    0
3      6.2       12                    2                    0
4      9.7       23                    3                    3
5      2.1       25                    0                    3
You can also specify a different number of bins for each column.

>>> discretizer = UniformKBinsDiscretizer(
...     columns=["value_1", "value_2"], bins={"value_1": 4, "value_2": 3}
... )
>>> discretizer.fit_transform(ds).to_pandas()
   value_1  value_2
0        0        0
1        0        0
2        0        0
3        2        0
4        3        2
5        0        2
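The bin indices in the examples above can be reproduced with plain pandas. The following is a minimal sketch, not the Ray implementation: it assumes the discretizer follows pandas.cut semantics for equal-width bins, which its right, include_lowest, and duplicates parameters suggest. It shows where the interior bin edges fall and how right changes the assignment of a value that sits exactly on an edge.

```python
import numpy as np
import pandas as pd

value_1 = pd.Series([0.2, 1.4, 2.5, 6.2, 9.7, 2.1])
value_2 = pd.Series([10, 15, 13, 12, 23, 25])

# Assumption: equal-width binning behaves like pandas.cut.
# labels=False returns integer bin indices; retbins=True also returns the edges.
codes_1, edges_1 = pd.cut(value_1, bins=4, labels=False, retbins=True)

# Interior edges split [min, max] into 4 equal widths of (9.7 - 0.2) / 4 = 2.375.
interior_edges = edges_1[1:-1]

# With right=True (the default) bins are (a, b], so 15 falls in the first of
# three bins (10, 15]. With right=False bins are [a, b), so 15 moves to [15, 20).
codes_right = pd.cut(value_2, bins=3, labels=False, right=True)
codes_left = pd.cut(value_2, bins=3, labels=False, right=False)

print(codes_1.tolist())
print(np.round(interior_edges, 3).tolist())
print(codes_right.tolist())
print(codes_left.tolist())
```

Only the second value (15) changes bins between the last two lines of output, because it is the only one lying exactly on an interior edge.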
- Parameters:
columns – The columns to discretize.
bins – The number of equal-width bins. Can be either an integer (applied to all columns) or a dict that maps columns to integers. The range is extended by 0.1% on each side to include the minimum and maximum values.
right – Whether each bin includes its rightmost edge.
include_lowest – Whether the first interval should be left-inclusive.
duplicates – Can be either 'raise' or 'drop'. If bin edges are not unique, raise ValueError or drop the non-unique edges.
dtypes – An optional dictionary that maps columns to pd.CategoricalDtype objects or np.integer types. If you don't include a column in dtypes, or you specify it as an integer dtype, the output column consists of ordered integers corresponding to bins. If you use a pd.CategoricalDtype, the output column is categorical, with the categories mapped to bins. You can use pd.CategoricalDtype(categories, ordered=True) to preserve information about bin order.
output_columns – The names of the transformed columns. If None, the transformed columns have the same names as the input columns. If not None, the length of output_columns must match the length of columns; otherwise an error is raised.
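To illustrate what the dtypes option describes, here is a pandas-only sketch of mapping integer bin indices onto an ordered categorical. It is an illustration rather than the Ray code path: the category labels "low", "mid", and "high" are hypothetical, and the sketch assumes that the i-th category labels the i-th bin in ascending order.

```python
import pandas as pd

value_2 = pd.Series([10, 15, 13, 12, 23, 25])

# Hypothetical labels; ordered=True keeps the bin order (low < mid < high).
dtype = pd.CategoricalDtype(["low", "mid", "high"], ordered=True)

# Integer bin indices for three equal-width bins (pandas.cut stands in for
# the discretizer here; this is an assumption, not the Ray implementation).
codes = pd.cut(value_2, bins=3, labels=False)

# Assumption: bin index i corresponds to dtype.categories[i].
binned = pd.Series(pd.Categorical.from_codes(codes.to_numpy(), dtype=dtype))

# Because the dtype is ordered, comparisons against a category are meaningful.
at_least_mid = binned >= "mid"

print(binned.tolist())
print(at_least_mid.tolist())
```

With an unordered dtype the same comparison would raise a TypeError, which is why ordered=True is worth setting when downstream code relies on bin order.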
See also
CustomKBinsDiscretizer: If you want to specify your own bin edges.
PublicAPI (alpha): This API is in alpha and may change before becoming stable.
Methods
deserialize – Load the original preprocessor serialized via self.serialize().
fit – Fit this Preprocessor to the Dataset.
fit_transform – Fit this Preprocessor to the Dataset and then transform the Dataset.
preferred_batch_format – Batch format hint for upstream producers to try yielding best block format.
serialize – Return this preprocessor serialized as a string.
transform – Transform the given dataset.
transform_batch – Transform a single batch of data.