ray.data.aggregate.ApproximateQuantile.__init__

ApproximateQuantile.__init__(on: str, quantiles: List[float], quantile_precision: int = 800, alias_name: str | None = None)

Computes approximate quantiles of a column using a DataSketches kll_floats_sketch. See https://datasketches.apache.org/docs/KLL/KLLSketch.html

The accuracy of the KLL quantile sketch is determined by the configured quantile precision, which also determines the overall size of the sketch. The KLL sketch has absolute rank error. For example, a rank accuracy of 1% at the median (rank = 0.50) means that the true quantile (if you could extract it from the data) lies between getQuantile(0.49) and getQuantile(0.51). The same 1% error at a rank of 0.95 means that the true quantile lies between getQuantile(0.94) and getQuantile(0.96). In other words, the error is a fixed +/- epsilon across the entire range of ranks.
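The fixed +/- epsilon window can be illustrated without a sketch. The following is a minimal illustration in plain Python, assuming an exact sorted list stands in for the data; the helpers `exact_quantile` and `rank_error_window` are hypothetical names and are not part of Ray or DataSketches.

```python
import math


def exact_quantile(sorted_values, rank):
    """Return the smallest value whose inclusive normalized rank is >= rank."""
    n = len(sorted_values)
    # round() guards against floating-point noise such as 489.99999999999994.
    position = math.ceil(round(rank * n, 9))
    index = min(max(position - 1, 0), n - 1)
    return sorted_values[index]


def rank_error_window(sorted_values, rank, epsilon):
    """Return the (low, high) values at rank - epsilon and rank + epsilon."""
    low = exact_quantile(sorted_values, max(0.0, rank - epsilon))
    high = exact_quantile(sorted_values, min(1.0, rank + epsilon))
    return low, high


values = list(range(1, 1001))  # 1..1000, already sorted

# A 1% rank error brackets the true quantile between these values.
median_window = rank_error_window(values, 0.50, 0.01)  # (490, 510)
p95_window = rank_error_window(values, 0.95, 0.01)     # (940, 960)
```

Both windows span the same number of ranks, which is what "absolute" error means here: the window does not narrow toward the tails.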

Typical single-sided rank error by quantile_precision (the error that applies to getQuantile/getRank queries):
  • quantile_precision=100 → ~2.61%

  • quantile_precision=200 → ~1.33%

  • quantile_precision=400 → ~0.68%

  • quantile_precision=800 → ~0.35%

See https://datasketches.apache.org/docs/KLL/KLLAccuracyAndSize.html for details on accuracy and size.
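One way to use the figures above is to pick the smallest listed quantile_precision that meets a target rank error. This is a sketch under the assumption that only the four listed values are considered; `TYPICAL_RANK_ERROR` and `smallest_precision_for` are hypothetical names, and the error figures are the typical values quoted above, not guarantees.

```python
# Typical single-sided rank error (as a fraction) per quantile_precision,
# copied from the list above.
TYPICAL_RANK_ERROR = {100: 0.0261, 200: 0.0133, 400: 0.0068, 800: 0.0035}


def smallest_precision_for(target_error):
    """Return the smallest listed precision whose typical error is <= target_error.

    Returns None when no listed precision is accurate enough.
    """
    for precision in sorted(TYPICAL_RANK_ERROR):
        if TYPICAL_RANK_ERROR[precision] <= target_error:
            return precision
    return None


chosen = smallest_precision_for(0.01)  # 400, since 200 gives ~1.33%
```

Because higher precision also means a larger sketch, choosing the smallest value that meets the target keeps memory use down.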

Null values in the target column are ignored when constructing the sketch.

Example

import ray
from ray.data.aggregate import ApproximateQuantile

# Create a dataset with some values
ds = ray.data.from_items(
    [{"value": 20.0}, {"value": 40.0}, {"value": 60.0},
    {"value": 80.0}, {"value": 100.0}]
)

result = ds.aggregate(ApproximateQuantile(on="value", quantiles=[0.1, 0.5, 0.9]))
# Result: {'approx_quantile(value)': [20.0, 60.0, 100.0]}

Parameters:
  • on – The name of the column to compute quantiles on. Must be a numeric column.

  • quantiles – The list of quantiles to compute. Each value must be between 0 and 1, inclusive. For example, quantiles=[0.5] computes the median. Null entries in the source column are skipped.

  • quantile_precision – Controls the accuracy and memory footprint of the sketch (K in KLL); higher values yield lower error but use more memory. Defaults to 800. See https://datasketches.apache.org/docs/KLL/KLLAccuracyAndSize.html for details on accuracy and size.

  • alias_name – Optional name for the resulting column. If not provided, defaults to “approx_quantile({column_name})”.
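The parameter rules above can be checked before constructing the aggregation. The following is a minimal sketch in plain Python; `approximate_quantile_kwargs` and `result_column_name` are hypothetical helpers, not part of Ray, and Ray's own validation may differ.

```python
def approximate_quantile_kwargs(on, quantiles, quantile_precision=800, alias_name=None):
    """Check that every quantile lies in [0, 1] and return the keyword arguments."""
    for q in quantiles:
        if not 0.0 <= q <= 1.0:
            raise ValueError(f"quantile {q!r} is outside [0, 1]")
    return {
        "on": on,
        "quantiles": list(quantiles),
        "quantile_precision": quantile_precision,
        "alias_name": alias_name,
    }


def result_column_name(on, alias_name=None):
    """Return the alias if given, otherwise the documented default name."""
    return alias_name if alias_name is not None else f"approx_quantile({on})"


kwargs = approximate_quantile_kwargs("value", [0.1, 0.5, 0.9], alias_name="value_q")
default_name = result_column_name("value")            # "approx_quantile(value)"
custom_name = result_column_name("value", "value_q")  # "value_q"
```

The returned dictionary matches the signature shown at the top of this page, so it could be passed as ApproximateQuantile(**kwargs).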