ray.data.preprocessors.UniformKBinsDiscretizer
- class ray.data.preprocessors.UniformKBinsDiscretizer(columns: List[str], bins: int | Dict[str, int], *, right: bool = True, include_lowest: bool = False, duplicates: str = 'raise', dtypes: Dict[str, pandas.CategoricalDtype | Type[numpy.integer]] | None = None, output_columns: List[str] | None = None)
Bases: _AbstractKBinsDiscretizer
Bin values into discrete intervals (bins) of uniform width.
Columns must contain numerical values.
Examples
Use UniformKBinsDiscretizer to bin continuous features.

>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import UniformKBinsDiscretizer
>>> df = pd.DataFrame({
...     "value_1": [0.2, 1.4, 2.5, 6.2, 9.7, 2.1],
...     "value_2": [10, 15, 13, 12, 23, 25],
... })
>>> ds = ray.data.from_pandas(df)
>>> discretizer = UniformKBinsDiscretizer(
...     columns=["value_1", "value_2"], bins=4
... )
>>> discretizer.fit_transform(ds).to_pandas()
   value_1  value_2
0        0        0
1        0        1
2        0        0
3        2        0
4        3        3
5        0        3
UniformKBinsDiscretizer can also be used in append mode by providing output_columns, the names of the columns that should hold the discretized values.

>>> discretizer = UniformKBinsDiscretizer(
...     columns=["value_1", "value_2"],
...     bins=4,
...     output_columns=["value_1_discretized", "value_2_discretized"]
... )
>>> discretizer.fit_transform(ds).to_pandas()
   value_1  value_2  value_1_discretized  value_2_discretized
0      0.2       10                    0                    0
1      1.4       15                    0                    1
2      2.5       13                    0                    0
3      6.2       12                    2                    0
4      9.7       23                    3                    3
5      2.1       25                    0                    3
You can also specify a different number of bins per column.

>>> discretizer = UniformKBinsDiscretizer(
...     columns=["value_1", "value_2"], bins={"value_1": 4, "value_2": 3}
... )
>>> discretizer.fit_transform(ds).to_pandas()
   value_1  value_2
0        0        0
1        0        0
2        0        0
3        2        0
4        3        2
5        0        2
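The bin assignments in the examples above can be reproduced without Ray. The following is a minimal sketch that assumes the discretizer behaves like pandas.cut called with an integer bin count (the parameters right, include_lowest, and duplicates mirror that function). It illustrates equal-width binning and is not a copy of Ray's implementation.

```python
import numpy as np
import pandas as pd

# Same data as the "value_1" column in the examples above.
values = pd.Series([0.2, 1.4, 2.5, 6.2, 9.7, 2.1])
n_bins = 4

# Equal-width edges spanning [min, max]: n_bins intervals need n_bins + 1 edges.
plain_edges = np.linspace(values.min(), values.max(), n_bins + 1)

# pandas.cut builds the same edges, then nudges the outer edge slightly so the
# extreme value still falls inside a bin. labels=False returns integer bin
# indices; retbins=True also returns the edges that were actually used.
codes, used_edges = pd.cut(values, bins=n_bins, labels=False, retbins=True)

print(plain_edges)
print(codes.tolist())
```

The value 6.2 lands in bin 2 because it lies between the third and fourth edges (4.95 and 7.325), which matches the value_1 column in the first example.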
- Parameters:
columns – The columns to discretize.
bins – Defines the number of equal-width bins. Can be either an integer (which will be applied to all columns), or a dict that maps columns to integers. The range is extended by .1% on each side to include the minimum and maximum values.
right – Whether each bin includes its rightmost edge.
include_lowest – Whether the first interval should be left-inclusive.
duplicates – Can be either 'raise' or 'drop'. If bin edges are not unique, raise ValueError or drop non-uniques.
dtypes – An optional dictionary that maps columns to pd.CategoricalDtype objects or np.integer types. If you don't include a column in dtypes or specify it as an integer dtype, the output column will consist of ordered integers corresponding to bins. If you use a pd.CategoricalDtype, the output column will be a pd.CategoricalDtype with the categories being mapped to bins. You can use pd.CategoricalDtype(categories, ordered=True) to preserve information about bin order.
output_columns – The names of the transformed columns. If None, the transformed columns will be the same as the input columns. If not None, the length of output_columns must match the length of columns, otherwise an error will be raised.
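The effect of right and of a categorical dtype can be illustrated with pandas alone. This sketch again assumes pandas.cut semantics. The category names "low", "mid", and "high" are hypothetical labels chosen for illustration, and passing the categories as labels is an assumption about how a pd.CategoricalDtype maps onto bins.

```python
import pandas as pd

# Same data as the "value_2" column in the examples above.
values = pd.Series([10, 15, 13, 12, 23, 25])

# With 3 equal-width bins the interior edges are 15 and 20.
# right=True: intervals are closed on the right, so 15 belongs to the first bin.
right_closed = pd.cut(values, bins=3, right=True, labels=False)

# right=False: intervals are closed on the left, so 15 moves to the second bin.
left_closed = pd.cut(values, bins=3, right=False, labels=False)

# An ordered categorical dtype keeps the bin order while using readable names.
# The category names here are hypothetical.
dtype = pd.CategoricalDtype(["low", "mid", "high"], ordered=True)
binned = pd.cut(
    values, bins=3, labels=list(dtype.categories), ordered=dtype.ordered
)

print(right_closed.tolist())
print(left_closed.tolist())
print(binned.tolist())
```

Because the categories are ordered, comparisons such as binned > "low" remain meaningful after discretization.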
See also
CustomKBinsDiscretizer
If you want to specify your own bin edges.
PublicAPI (alpha): This API is in alpha and may change before becoming stable.
Methods
Load the original preprocessor serialized via self.serialize().
Fit this Preprocessor to the Dataset.
Fit this Preprocessor to the Dataset and then transform the Dataset.
Batch format hint for upstream producers to try yielding best block format.
Return this preprocessor serialized as a string.
Transform the given dataset.
Transform a single batch of data.