ray.data.aggregate.ApproximateTopK.__init__

ApproximateTopK.__init__(on: str, k: int, log_capacity: int = 15, alias_name: str | None = None)

Computes the approximate k most frequent items in a column using a DataSketches frequent_strings_sketch. For background on the algorithm, see https://datasketches.apache.org/docs/Frequency/FrequentItemsOverview.html

Guarantees (N is the total number of items processed by the sketch):
  • Any item with true frequency > N / (2^log_capacity) is guaranteed to appear in the results.

  • Reported counts may have an error of at most ± N / (2^log_capacity).
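
As a rough illustration of these bounds, the arithmetic can be worked through in plain Python. The helper `count_error_bound` is a hypothetical name, not part of Ray or DataSketches; it only evaluates the N / (2^log_capacity) expression quoted above, assuming N is the total number of items aggregated.

```python
def count_error_bound(n_items: int, log_capacity: int = 15) -> float:
    """Evaluate N / 2**log_capacity, the bound quoted in the guarantees.

    Hypothetical helper for illustration only; it does not call Ray.
    """
    if n_items < 0 or log_capacity < 0:
        raise ValueError("n_items and log_capacity must be non-negative")
    return n_items / (2 ** log_capacity)


# One million items with the default log_capacity of 15 (2**15 = 32768).
bound = count_error_bound(1_000_000)
print(f"inclusion threshold and max count error: about {bound:.1f}")
```

Under that assumption, with one million items and the default log_capacity, any item occurring more than about 30.5 times is guaranteed to appear, and each reported count is off by at most about 30.5.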

If log_capacity is too small for your data:
  • Low-frequency items may be evicted from the sketch, potentially causing the top-k results to miss items that should appear in the output.

  • The error bounds increase, reducing the accuracy of the reported counts.
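
One way to check whether log_capacity is large enough is to invert the bound: given an expected item count and the largest count error you can tolerate, find the smallest log_capacity for which N / (2^log_capacity) stays within that error. The sketch below does this with a hypothetical helper, `min_log_capacity`. It only applies the formula from the guarantees and knows nothing about any limits the sketch library itself may impose on this parameter.

```python
def min_log_capacity(n_items: int, max_count_error: float) -> int:
    """Smallest log_capacity with n_items / 2**log_capacity <= max_count_error.

    Hypothetical helper for illustration only; it does not call Ray.
    """
    if n_items < 0:
        raise ValueError("n_items must be non-negative")
    if max_count_error <= 0:
        raise ValueError("max_count_error must be positive")
    log_capacity = 0
    # Doubling the capacity halves the bound, so step up until it fits.
    while n_items / (2 ** log_capacity) > max_count_error:
        log_capacity += 1
    return log_capacity


# One million items, tolerating counts that are off by at most 100.
chosen = min_log_capacity(1_000_000, 100)
print(chosen)  # 14, since 1_000_000 / 2**14 is about 61 but / 2**13 is about 122
```

The returned value could then be passed as the log_capacity argument. Because memory use grows with log_capacity, it is worth picking the smallest value that meets the accuracy you need rather than a large one by default.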

Example

import ray
from ray.data.aggregate import ApproximateTopK

ds = ray.data.from_items([
    {"word": "apple"}, {"word": "banana"}, {"word": "apple"},
    {"word": "cherry"}, {"word": "apple"}
])

result = ds.aggregate(ApproximateTopK(on="word", k=2))
# Result: {'approx_topk(word)': [{'word': 'apple', 'count': 3}, {'word': 'banana', 'count': 1}]}
# Note: banana and cherry each occur once, so the second slot is a tie.

Parameters:
  • on – The name of the column to aggregate.

  • k – The number of top items to return.

  • log_capacity – Base 2 logarithm of the maximum size of the internal hash map, so the default of 15 allows up to 2^15 = 32,768 entries. Higher values increase accuracy but use more memory. Defaults to 15.

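  • alias_name – Optional name for the aggregate's output. Defaults to None, in which case a name is derived from the column, as with 'approx_topk(word)' in the example above.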