ray.data.aggregate.ApproximateTopK.__init__
- ApproximateTopK.__init__(on: str, k: int, log_capacity: int = 15, alias_name: str | None = None, encode_lists: bool = False)
Computes the approximate top-k items in a column using a DataSketches frequent_strings_sketch. See https://datasketches.apache.org/docs/Frequency/FrequentItemsOverview.html for background.
- Guarantees:
Any item with true frequency > N / (2^log_capacity), where N is the total number of items aggregated, is guaranteed to appear in the results.
Reported counts may have an error of at most ± N / (2^log_capacity).
- If log_capacity is too small for your data:
Low-frequency items may be evicted from the sketch, potentially causing the top-k results to miss items that should appear in the output.
The error bounds increase, reducing the accuracy of the reported counts.
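The bound N / (2^log_capacity) can be worked through numerically to pick a log_capacity. The following is a minimal sketch, not part of Ray: error_bound is a hypothetical helper, and it assumes N is the total number of items aggregated.

```python
def error_bound(n_items: int, log_capacity: int = 15) -> float:
    """Worst-case count error N / 2**log_capacity (hypothetical helper, not a Ray API)."""
    return n_items / (2 ** log_capacity)


# Assumed dataset size, for illustration only.
n_items = 10_000_000

for log_capacity in (10, 15, 20):
    bound = error_bound(n_items, log_capacity)
    # Items occurring more often than `bound` are guaranteed to be reported,
    # and reported counts are off by at most `bound`.
    print(f"log_capacity={log_capacity}: bound = {bound:.2f}")
```

With the default log_capacity of 15 and 10 million items, the bound is roughly 305, so items that occur less often than that may be missing from the results.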
Example
import ray
from ray.data.aggregate import ApproximateTopK

ds = ray.data.from_items([
    {"word": "apple"},
    {"word": "banana"},
    {"word": "apple"},
    {"word": "cherry"},
    {"word": "apple"}
])
result = ds.aggregate(ApproximateTopK(on="word", k=2))
# Result: {'approx_topk(word)': [{'word': 'apple', 'count': 3}, {'word': 'banana', 'count': 1}]}
- Parameters:
on – The name of the column to aggregate.
k – The number of top items to return.
log_capacity – Base 2 logarithm of the maximum size of the internal hash map. Higher values increase accuracy but use more memory. Defaults to 15.
alias_name – The name of the aggregate. Defaults to None.
encode_lists – If True, encode list elements. If False, encode whole lists (i.e., the entire list is considered as a single object). False by default. Note that this is a top-level flatten (not a recursive flatten) operation.
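The difference between the two encode_lists modes can be illustrated with exact counting in plain Python. This is only a sketch of the semantics described above, not Ray's implementation: count_items is a hypothetical helper, and using repr() as the counting key is an assumption made for illustration.

```python
from collections import Counter


def count_items(rows, encode_lists: bool) -> Counter:
    """Count items exactly, mimicking the described encode_lists semantics.

    Hypothetical helper: repr() stands in for whatever encoding Ray applies.
    """
    counts = Counter()
    for value in rows:
        if encode_lists and isinstance(value, list):
            # Top-level flatten only: nested lists stay whole.
            for element in value:
                counts[repr(element)] += 1
        else:
            # The entire value (including a whole list) is one item.
            counts[repr(value)] += 1
    return counts


rows = [["a", "b"], ["a", "b"], ["a", ["c", "d"]]]

per_element = count_items(rows, encode_lists=True)
whole_lists = count_items(rows, encode_lists=False)

print(per_element.most_common())
print(whole_lists.most_common())
```

In this sketch, the nested list ["c", "d"] is counted as a single item when encode_lists is True, which is what a non-recursive flatten implies.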