ray.data.preprocessors.CustomKBinsDiscretizer
- class ray.data.preprocessors.CustomKBinsDiscretizer(columns: List[str], bins: Iterable[float] | pandas.IntervalIndex | Dict[str, Iterable[float] | pandas.IntervalIndex], *, right: bool = True, include_lowest: bool = False, duplicates: str = 'raise', dtypes: Dict[str, pandas.CategoricalDtype | Type[numpy.integer]] | None = None, output_columns: List[str] | None = None)
Bases: _AbstractKBinsDiscretizer
Bin values into discrete intervals using custom bin edges.
Columns must contain numerical values.
Examples
Use CustomKBinsDiscretizer to bin continuous features.

>>> import pandas as pd
>>> import ray
>>> from ray.data.preprocessors import CustomKBinsDiscretizer
>>> df = pd.DataFrame({
...     "value_1": [0.2, 1.4, 2.5, 6.2, 9.7, 2.1],
...     "value_2": [10, 15, 13, 12, 23, 25],
... })
>>> ds = ray.data.from_pandas(df)
>>> discretizer = CustomKBinsDiscretizer(
...     columns=["value_1", "value_2"],
...     bins=[0, 1, 4, 10, 25]
... )
>>> discretizer.transform(ds).to_pandas()
   value_1  value_2
0        0        2
1        1        3
2        1        3
3        2        3
4        2        3
5        1        3
CustomKBinsDiscretizer can also be used in append mode by providing the names of the output_columns that should hold the encoded values.

>>> discretizer = CustomKBinsDiscretizer(
...     columns=["value_1", "value_2"],
...     bins=[0, 1, 4, 10, 25],
...     output_columns=["value_1_discretized", "value_2_discretized"]
... )
>>> discretizer.fit_transform(ds).to_pandas()
   value_1  value_2  value_1_discretized  value_2_discretized
0      0.2       10                    0                    2
1      1.4       15                    1                    3
2      2.5       13                    1                    3
3      6.2       12                    2                    3
4      9.7       23                    2                    3
5      2.1       25                    1                    3
You can also specify different bin edges per column.
>>> discretizer = CustomKBinsDiscretizer(
...     columns=["value_1", "value_2"],
...     bins={"value_1": [0, 1, 4], "value_2": [0, 18, 35, 70]},
... )
>>> discretizer.transform(ds).to_pandas()
   value_1  value_2
0      0.0        0
1      1.0        0
2      1.0        0
3      NaN        0
4      NaN        1
5      1.0        1
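The bin codes in the per-column example can be traced by hand. The following is a minimal pure-Python sketch, not Ray's implementation: `right_closed_bin` is a hypothetical helper that assumes the default right-closed intervals `(edge[i], edge[i+1]]` and returns `None` where the example output shows `NaN`, that is, for values that fall outside every interval.

```python
from bisect import bisect_left


def right_closed_bin(value, edges):
    """Return i such that edges[i] < value <= edges[i + 1], or None if no bin fits.

    Hypothetical helper for illustration; assumes `edges` is sorted ascending.
    """
    if value <= edges[0] or value > edges[-1]:
        return None  # outside every interval; appears as NaN in the example output
    return bisect_left(edges, value) - 1


# Same per-column edges and data as the example above.
bins = {"value_1": [0, 1, 4], "value_2": [0, 18, 35, 70]}
data = {
    "value_1": [0.2, 1.4, 2.5, 6.2, 9.7, 2.1],
    "value_2": [10, 15, 13, 12, 23, 25],
}

codes = {
    col: [right_closed_bin(v, bins[col]) for v in values]
    for col, values in data.items()
}
print(codes)
```

The values 6.2 and 9.7 exceed the last edge of `value_1` (4), so they receive no bin. This is also why that column is shown as floats in the example output: an integer column cannot hold `NaN`.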
- Parameters:
columns – The columns to discretize.
bins – Defines custom bin edges. Can be an iterable of numbers, a pd.IntervalIndex, or a dict mapping columns to either of them. Note that a pd.IntervalIndex for bins must be non-overlapping.
right – Indicates whether bins include the rightmost edge.
include_lowest – Indicates whether the first interval should be left-inclusive.
duplicates – Can be either ‘raise’ or ‘drop’. If bin edges are not unique, raise ValueError or drop non-uniques.
dtypes – An optional dictionary that maps columns to pd.CategoricalDtype objects or np.integer types. If you don’t include a column in dtypes or specify it as an integer dtype, the output column will consist of ordered integers corresponding to bins. If you use a pd.CategoricalDtype, the output column will be a pd.CategoricalDtype with the categories being mapped to bins. You can use pd.CategoricalDtype(categories, ordered=True) to preserve information about bin order.
output_columns – The names of the transformed columns. If None, the transformed columns will be the same as the input columns. If not None, the length of output_columns must match the length of columns, otherwise an error will be raised.
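To make the edge-handling parameters concrete, here is a small pure-Python sketch. It is not the library's code. It assumes that right, include_lowest, and duplicates behave like the same-named arguments of pandas.cut, and the helpers `prepare_edges` and `bin_index` are hypothetical names introduced only for this illustration.

```python
from bisect import bisect_left, bisect_right


def prepare_edges(edges, duplicates="raise"):
    """Reject or drop repeated edges (assumed semantics of `duplicates`)."""
    edges = list(edges)
    unique = list(dict.fromkeys(edges))  # de-duplicate while preserving order
    if len(unique) != len(edges):
        if duplicates == "raise":
            raise ValueError("Bin edges must be unique")
        edges = unique
    return edges


def bin_index(value, edges, right=True, include_lowest=False):
    """Return the bin index for `value`, or None if it falls outside all bins.

    Assumes `edges` is sorted ascending. In this sketch `include_lowest`
    only matters when right=True.
    """
    lowest, highest = edges[0], edges[-1]
    if right:
        # Intervals are (edges[i], edges[i + 1]].
        if include_lowest and value == lowest:
            return 0  # the first interval becomes [edges[0], edges[1]]
        if value <= lowest or value > highest:
            return None
        return bisect_left(edges, value) - 1
    # Intervals are [edges[i], edges[i + 1]).
    if value < lowest or value >= highest:
        return None
    return bisect_right(edges, value) - 1


edges = prepare_edges([0, 1, 1, 4], duplicates="drop")  # edges become [0, 1, 4]
print(bin_index(1, edges))                       # right-closed: 1 lies in (0, 1]
print(bin_index(1, edges, right=False))          # left-closed: 1 lies in [1, 4)
print(bin_index(0, edges))                       # lowest edge excluded by default
print(bin_index(0, edges, include_lowest=True))  # lowest edge included
```

A value that sits exactly on an edge is the case to check when choosing between right=True and right=False, because it moves to the neighbouring bin.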
See also
UniformKBinsDiscretizer
If you want to bin data into uniform width bins.
PublicAPI (alpha): This API is in alpha and may change before becoming stable.
Methods
deserialize – Load the original preprocessor serialized via self.serialize().
fit – Fit this Preprocessor to the Dataset.
fit_transform – Fit this Preprocessor to the Dataset and then transform the Dataset.
preferred_batch_format – Batch format hint for upstream producers to try yielding best block format.
serialize – Return this preprocessor serialized as a string.
transform – Transform the given dataset.
transform_batch – Transform a single batch of data.