ray.data.aggregate.ApproximateTopK.__init__

ApproximateTopK.__init__(on: str, k: int, log_capacity: int = 15, alias_name: str | None = None, encode_lists: bool = False)

Computes the approximate top k items in a column using an Apache DataSketches frequent_strings_sketch. See https://datasketches.apache.org/docs/Frequency/FrequentItemsOverview.html

Guarantees:

Any item with true frequency > N / (2^log_capacity) is guaranteed to appear in the results, where N is the total number of items processed.
Reported counts may have an error of at most ± N / (2^log_capacity).
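To make the bound concrete, the following sketch simply evaluates N / (2^log_capacity) for a given input size. The helper name frequency_error_bound is hypothetical and is not part of Ray; it only restates the formula quoted above.

```python
def frequency_error_bound(n_items: int, log_capacity: int = 15) -> float:
    """Return N / 2**log_capacity, the bound quoted in the guarantees above.

    Hypothetical helper for illustration only; not part of the Ray API.
    """
    return n_items / (2 ** log_capacity)


# One million items with the default log_capacity of 15.
bound_default = frequency_error_bound(1_000_000)    # 30.517578125

# The same input with a much smaller sketch.
bound_small = frequency_error_bound(1_000_000, 10)  # 976.5625
```

Under these numbers, an item occurring more than about 31 times is covered by the guarantee at the default setting, while at log_capacity=10 the threshold rises to about 977.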

If log_capacity is too small for your data:

Low-frequency items may be evicted from the sketch, potentially causing the top-k results to miss items that should appear in the output.
The error bounds increase, reducing the accuracy of the reported counts.
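One way to choose log_capacity is to invert the bound N / (2^log_capacity) for a tolerated count error. The sketch below does only that arithmetic. The name min_log_capacity is hypothetical, and the code assumes nothing about the range of values that Ray or DataSketches actually accepts.

```python
import math


def min_log_capacity(n_items: int, max_error: float) -> int:
    """Smallest integer c such that n_items / 2**c <= max_error.

    Hypothetical helper for illustration only; it does not check which
    log_capacity values the underlying sketch accepts.
    """
    if n_items <= 0 or max_error <= 0:
        raise ValueError("n_items and max_error must be positive")
    return max(0, math.ceil(math.log2(n_items / max_error)))


# Tolerate a count error of at most 100 over one million items.
suggested = min_log_capacity(1_000_000, 100)  # 14
```

The returned value could then be passed as log_capacity when constructing ApproximateTopK, keeping in mind that larger values use more memory.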

Example

import ray
from ray.data.aggregate import ApproximateTopK

ds = ray.data.from_items([
    {"word": "apple"}, {"word": "banana"}, {"word": "apple"},
    {"word": "cherry"}, {"word": "apple"}
])

result = ds.aggregate(ApproximateTopK(on="word", k=2))
# Result: {'approx_topk(word)': [{'word': 'apple', 'count': 3}, {'word': 'banana', 'count': 1}]}

Parameters:

on – The name of the column to aggregate.
k – The number of top items to return.
log_capacity – Base 2 logarithm of the maximum size of the internal hash map. Higher values increase accuracy but use more memory. Defaults to 15.
alias_name – The name of the aggregate. Defaults to None.
encode_lists – If True, encode each list element individually. If False, encode whole lists (i.e., the entire list is considered as a single object). Defaults to False. Note that encoding list elements is a top-level flatten, not a recursive one.
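The difference between the two encode_lists modes can be pictured with exact counting in plain Python. This is only an illustration of the semantics described above, using collections.Counter instead of a sketch; it makes no claim about how Ray encodes values internally.

```python
from collections import Counter

# Values of a hypothetical list-typed column, one list per row.
rows = [["a", "b"], ["a", "b"], ["a", "c"]]

# Like encode_lists=False: each whole list counts as one item.
# Tuples are used here only because lists are not hashable.
whole_lists = Counter(tuple(row) for row in rows)

# Like encode_lists=True: flatten one level and count the elements.
elements = Counter(item for row in rows for item in row)

top_whole = whole_lists.most_common(1)  # [(('a', 'b'), 2)]
top_elements = elements.most_common(2)  # [('a', 3), ('b', 2)]
```

Because the flatten is top-level only, a list nested inside a row would still be treated as a single item in the element-wise mode.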
