ray.data.aggregate.ApproximateQuantile.__init__
- ApproximateQuantile.__init__(on: str, quantiles: List[float], quantile_precision: int = 800, alias_name: str | None = None)
Computes approximate quantiles of a column using a DataSketches kll_floats_sketch. See https://datasketches.apache.org/docs/KLL/KLLSketch.html
The accuracy of the KLL quantile sketch is a function of the configured quantile precision, which also determines the overall size of the sketch. The KLL sketch has absolute (additive) rank error, not relative error. For example, a specified rank accuracy of 1% at the median (rank = 0.50) means that the true quantile (if you could extract it from the set) should be between getQuantile(0.49) and getQuantile(0.51). The same 1% error applied at a rank of 0.95 means that the true quantile should be between getQuantile(0.94) and getQuantile(0.96). In other words, the error is a fixed +/- epsilon across the entire range of ranks.
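To make the fixed +/- epsilon rank window concrete, the following minimal sketch computes the bracketing values for rank 0.50 and rank 0.95 on a small, deterministic dataset. It uses exact sorting rather than a KLL sketch, and the helper names (exact_quantile, rank_error_bounds) are hypothetical illustrations, not part of Ray or DataSketches.

```python
def exact_quantile(sorted_values, rank):
    """Return the value at a normalized rank in [0, 1] from a sorted list."""
    index = round(rank * (len(sorted_values) - 1))
    return sorted_values[index]


def rank_error_bounds(sorted_values, rank, epsilon):
    """Return the (low, high) values at rank - epsilon and rank + epsilon."""
    low_rank = max(0.0, rank - epsilon)
    high_rank = min(1.0, rank + epsilon)
    return (
        exact_quantile(sorted_values, low_rank),
        exact_quantile(sorted_values, high_rank),
    )


# A skewed but deterministic dataset: the squares 0, 1, 4, ..., 1000**2.
values = [i * i for i in range(1001)]

# The same 1% rank error is applied at the median and at the 95th percentile.
median_low, median_high = rank_error_bounds(values, 0.50, 0.01)
p95_low, p95_high = rank_error_bounds(values, 0.95, 0.01)

print("median window:", median_low, median_high)
print("p95 window:   ", p95_low, p95_high)
```

The rank window has the same width (2% of the rows) in both cases, but the width of the corresponding value window depends on how the data is distributed around that rank. On this skewed dataset the value window at rank 0.95 is wider than the one at the median.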
- Typical single-sided rank error by quantile_precision (use for getQuantile/getRank):
quantile_precision=100 → ~2.61%
quantile_precision=200 → ~1.33%
quantile_precision=400 → ~0.68%
quantile_precision=800 → ~0.35%
See https://datasketches.apache.org/docs/KLL/KLLAccuracyAndSize.html for details on accuracy and size.
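As a rough aid for choosing quantile_precision, the sketch below encodes the typical error figures listed above and picks the smallest listed value that meets a target rank error. The helper names (smallest_precision_for, rank_window) are hypothetical, and the figures are typical values only, so treat the output as a starting point rather than a guarantee.

```python
# Typical single-sided rank error (as a fraction) by quantile_precision,
# copied from the figures listed above.
TYPICAL_RANK_ERROR = {100: 0.0261, 200: 0.0133, 400: 0.0068, 800: 0.0035}


def smallest_precision_for(target_error):
    """Return the smallest listed quantile_precision whose typical error
    is at most target_error, or None if no listed value is small enough."""
    for precision in sorted(TYPICAL_RANK_ERROR):
        if TYPICAL_RANK_ERROR[precision] <= target_error:
            return precision
    return None


def rank_window(n_rows, quantile, precision):
    """Return the approximate (low, high) row ranks that an estimate for
    `quantile` may correspond to in a dataset of n_rows rows."""
    eps = TYPICAL_RANK_ERROR[precision]
    low = max(0, round((quantile - eps) * n_rows))
    high = min(n_rows, round((quantile + eps) * n_rows))
    return low, high


print(smallest_precision_for(0.01))       # 400
print(rank_window(1_000_000, 0.5, 800))   # (496500, 503500)
```

With the default of 800, a median estimate over one million rows would typically correspond to a row somewhere between rank 496,500 and rank 503,500.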
Null values in the target column are ignored when constructing the sketch.
Example
import ray
from ray.data.aggregate import ApproximateQuantile

# Create a dataset with some values
ds = ray.data.from_items(
    [{"value": 20.0}, {"value": 40.0}, {"value": 60.0}, {"value": 80.0}, {"value": 100.0}]
)

result = ds.aggregate(ApproximateQuantile(on="value", quantiles=[0.1, 0.5, 0.9]))
# Result: {'approx_quantile(value)': [20.0, 60.0, 100.0]}
- Parameters:
on – The name of the column to calculate the quantile on. Must be a numeric column.
quantiles – The list of quantiles to compute. Each value must be between 0 and 1 inclusive. For example, quantiles=[0.5] computes the median. Null entries in the source column are skipped.
quantile_precision – Controls the accuracy and memory footprint of the sketch (K in KLL); higher values yield lower error but use more memory. Defaults to 800. See https://datasketches.apache.org/docs/KLL/KLLAccuracyAndSize.html for details on accuracy and size.
alias_name – Optional name for the resulting column. If not provided, defaults to "approx_quantile({column_name})".
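The documented constraints on quantiles and the default naming rule for alias_name can be checked before building the aggregation. The following is a minimal sketch in plain Python; validate_quantiles and result_column_name are hypothetical helpers written for illustration and are not part of the Ray API.

```python
from typing import List, Optional


def validate_quantiles(quantiles: List[float]) -> List[float]:
    """Raise ValueError unless every quantile lies in [0, 1] inclusive."""
    for q in quantiles:
        if not 0.0 <= q <= 1.0:
            raise ValueError(f"quantile {q!r} is outside the range [0, 1]")
    return quantiles


def result_column_name(on: str, alias_name: Optional[str] = None) -> str:
    """Return the documented result column name: alias_name if given,
    otherwise 'approx_quantile(<column>)'."""
    if alias_name is not None:
        return alias_name
    return f"approx_quantile({on})"


quantiles = validate_quantiles([0.1, 0.5, 0.9])
print(result_column_name("value"))             # approx_quantile(value)
print(result_column_name("value", "p10_50_90"))  # p10_50_90

# The validated arguments could then be passed to the constructor, e.g.:
# ApproximateQuantile(on="value", quantiles=quantiles, alias_name="p10_50_90")
```

Validating up front keeps an out-of-range quantile from surfacing later as an error inside a distributed job.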