ray.data.aggregate.ApproximateTopK.__init__
- ApproximateTopK.__init__(on: str, k: int, log_capacity: int = 15, alias_name: str | None = None)
- Computes the approximate top-k items in a column using a datasketches frequent_strings_sketch. See https://datasketches.apache.org/docs/Frequency/FrequentItemsOverview.html
- Guarantees:
- Any item with true frequency > N / (2^log_capacity), where N is the total number of items in the column, is guaranteed to appear in the results.
- Reported counts may have an error of at most ± N / (2^log_capacity).
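To make the guarantees concrete, the bound N / (2^log_capacity) can be evaluated for a given dataset size. The helper below is a hypothetical illustration and not part of Ray. It assumes N is the number of rows being aggregated and simply evaluates the formula as stated above.

```python
def sketch_error_bound(n_rows: int, log_capacity: int = 15) -> float:
    """Evaluate N / 2**log_capacity, the bound quoted in the guarantees.

    Hypothetical helper for illustration only; it is not a Ray API.
    """
    if n_rows < 0:
        raise ValueError("n_rows must be non-negative")
    if log_capacity < 0:
        raise ValueError("log_capacity must be non-negative")
    return n_rows / (2 ** log_capacity)


# With one million rows and the default log_capacity of 15:
bound = sketch_error_bound(1_000_000, log_capacity=15)
print(bound)  # 30.517578125
```

Under this reading, an item that occurs more than about 30 times in one million rows is guaranteed to be reported, and each reported count is within about 30 of the true count.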
 
- If log_capacity is too small for your data:
- Low-frequency items may be evicted from the sketch, potentially causing the top-k results to miss items that should appear in the output. 
- The error bounds increase, reducing the accuracy of the reported counts. 
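One way to choose log_capacity is to work backwards from the largest count error you can tolerate. The sketch below is a hypothetical helper, not a Ray API. It assumes the bound N / (2^log_capacity) stated above and does not check any limits the underlying sketch may place on log_capacity, since those are not given here.

```python
def min_log_capacity(n_rows: int, max_count_error: float) -> int:
    """Smallest log_capacity with n_rows / 2**log_capacity <= max_count_error.

    Hypothetical helper for illustration only; it is not a Ray API.
    """
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    if max_count_error <= 0:
        raise ValueError("max_count_error must be positive")
    log_capacity = 0
    # Double the capacity until the stated bound meets the target.
    while n_rows > max_count_error * (2 ** log_capacity):
        log_capacity += 1
    return log_capacity


print(min_log_capacity(1_000_000, max_count_error=100))  # 14
print(min_log_capacity(1_000_000, max_count_error=10))   # 17
```

Each increment of log_capacity halves the bound and doubles the maximum size of the internal hash map, so it is worth picking the smallest value that meets the accuracy target.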
 
- Example:

```python
import ray
from ray.data.aggregate import ApproximateTopK

ds = ray.data.from_items([
    {"word": "apple"},
    {"word": "banana"},
    {"word": "apple"},
    {"word": "cherry"},
    {"word": "apple"}
])

result = ds.aggregate(ApproximateTopK(on="word", k=2))
# Result: {'approx_topk(word)': [{'word': 'apple', 'count': 3}, {'word': 'banana', 'count': 1}]}
```

- Parameters:
- on – The name of the column to aggregate. 
- k – The number of top items to return. 
- log_capacity – Base 2 logarithm of the maximum size of the internal hash map. Higher values increase accuracy but use more memory. Defaults to 15. 
- alias_name – The name of the aggregate. Defaults to None.
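When tuning k and log_capacity, it can help to compare the approximate output against an exact count on a small sample that fits in memory. The following is a minimal sketch of such a baseline using the standard library `collections.Counter` instead of a sketch. The function name `exact_top_k` is hypothetical, and the output shape only mirrors the result shown in the example above.

```python
from collections import Counter


def exact_top_k(rows, on, k):
    """Exact top-k counts for column `on`; a baseline, not the sketch itself."""
    counts = Counter(row[on] for row in rows)
    # most_common(k) returns (value, count) pairs, highest count first.
    return [{on: value, "count": count} for value, count in counts.most_common(k)]


rows = [
    {"word": "apple"},
    {"word": "banana"},
    {"word": "apple"},
    {"word": "cherry"},
    {"word": "apple"},
]

baseline = exact_top_k(rows, on="word", k=2)
print(baseline[0])  # {'word': 'apple', 'count': 3}
```

If an item in the exact baseline is missing from the approximate result, or its count differs by more than the expected bound, that suggests log_capacity is too small for the data.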