ray.data.aggregate.CountDistinct
- class ray.data.aggregate.CountDistinct(on: str, ignore_nulls: bool = True, alias_name: str | None = None)
Bases: Unique
Defines distinct count aggregation.
This aggregation computes the count of distinct values in a column. It is similar to SQL’s COUNT(DISTINCT column_name) operation.
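To make the SQL comparison concrete, here is a minimal sketch that runs COUNT(DISTINCT ...) against an in-memory SQLite table using Python's standard library. The table name `items` and its contents are made up for illustration; the sketch only shows the SQL behavior and says nothing about how Ray implements the aggregation.

```python
import sqlite3

# Hypothetical table and rows, used only to illustrate the SQL analogy.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (category TEXT)")
conn.executemany(
    "INSERT INTO items (category) VALUES (?)",
    [("A",), ("B",), ("A",), ("C",), (None,)],
)

# COUNT(DISTINCT col) counts each non-NULL value once; NULLs are skipped.
(distinct_count,) = conn.execute(
    "SELECT COUNT(DISTINCT category) FROM items"
).fetchone()
conn.close()

print(distinct_count)  # 3
```

SQL skips NULLs in COUNT(DISTINCT ...), which matches the default ignore_nulls=True described for this class.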
Example
    import ray
    from ray.data.aggregate import CountDistinct

    # Create a dataset with repeated values
    ds = ray.data.from_items([
        {"category": "A"},
        {"category": "B"},
        {"category": "A"},
        {"category": "C"},
        {"category": "A"},
        {"category": "B"}
    ])

    # Count distinct categories
    result = ds.aggregate(CountDistinct(on="category"))
    # result: {'count_distinct(category)': 3}

    # Using with groupby
    ds = ray.data.from_items([
        {"group": "X", "category": "A"},
        {"group": "X", "category": "B"},
        {"group": "Y", "category": "A"},
        {"group": "Y", "category": "A"}
    ])
    result = ds.groupby("group").aggregate(CountDistinct(on="category")).take_all()
    # result: [{'group': 'X', 'count_distinct(category)': 2},
    #          {'group': 'Y', 'count_distinct(category)': 1}]
- Parameters:
  - on – The name of the column to count distinct values on.
  - ignore_nulls – Whether to ignore null values when counting distinct items. Default is True (nulls are excluded from the count).
  - alias_name – Optional name for the resulting column. If not provided, defaults to "count_distinct({on})".
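The way the three parameters interact can be sketched in plain Python. The function below, `count_distinct_reference`, is a hypothetical reference model and not part of Ray. It assumes that with ignore_nulls=False a null (None) counts as one additional distinct value, which the text above does not state explicitly, and it does not model NaN handling or distributed execution.

```python
from typing import Any, Dict, Iterable, Mapping, Optional


def count_distinct_reference(
    rows: Iterable[Mapping[str, Any]],
    on: str,
    ignore_nulls: bool = True,
    alias_name: Optional[str] = None,
) -> Dict[str, int]:
    """Hypothetical single-process model of the documented parameters."""
    values = [row.get(on) for row in rows]
    if ignore_nulls:
        # Default behavior: nulls are excluded from the count.
        values = [v for v in values if v is not None]
    # Assumption: when nulls are kept, None counts as one distinct value.
    distinct = set(values)
    # Default output column name follows the documented pattern.
    name = alias_name if alias_name is not None else f"count_distinct({on})"
    return {name: len(distinct)}


rows = [
    {"category": "A"},
    {"category": None},
    {"category": "B"},
    {"category": "A"},
]

default_result = count_distinct_reference(rows, on="category")
with_nulls = count_distinct_reference(rows, on="category", ignore_nulls=False)
aliased = count_distinct_reference(rows, on="category", alias_name="n_categories")

print(default_result)  # {'count_distinct(category)': 2}
print(with_nulls)      # {'count_distinct(category)': 3}
print(aliased)         # {'n_categories': 2}
```

To confirm the actual null behavior, run CountDistinct itself with ignore_nulls=False on a small dataset rather than relying on this model.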
Methods
Return the count of distinct values.
Return the agg name (e.g., 'sum', 'mean', 'count').