ray.data.aggregate.CountDistinct

class ray.data.aggregate.CountDistinct(on: str, ignore_nulls: bool = True, alias_name: str | None = None)

Bases: Unique

Defines a distinct-count aggregation.

This aggregation computes the count of distinct values in a column. It is similar to SQL’s COUNT(DISTINCT column_name) operation.
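For comparison, the SQL operation mentioned above can be tried with Python's built-in sqlite3 module. This is a minimal sketch that does not use Ray; the table name `items` and the helper `sql_count_distinct` are made up for illustration. SQL's COUNT(DISTINCT ...) skips NULLs, which corresponds to the default ignore_nulls=True.

```python
import sqlite3


def sql_count_distinct(values):
    """Run COUNT(DISTINCT category) over `values` in an in-memory SQLite table.

    Hypothetical helper for illustration only; not part of Ray.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE items (category TEXT)")
        conn.executemany(
            "INSERT INTO items (category) VALUES (?)",
            [(v,) for v in values],
        )
        (count,) = conn.execute(
            "SELECT COUNT(DISTINCT category) FROM items"
        ).fetchone()
        return count
    finally:
        conn.close()


# Same values as the Ray example: three distinct categories.
print(sql_count_distinct(["A", "B", "A", "C", "A", "B"]))  # 3

# None is stored as NULL, and COUNT(DISTINCT ...) does not count NULLs.
print(sql_count_distinct(["A", None, "B", None]))  # 2
```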

Example

import ray
from ray.data.aggregate import CountDistinct

# Create a dataset with repeated values
ds = ray.data.from_items([
    {"category": "A"}, {"category": "B"}, {"category": "A"},
    {"category": "C"}, {"category": "A"}, {"category": "B"}
])

# Count distinct categories
result = ds.aggregate(CountDistinct(on="category"))
# result: {'count_distinct(category)': 3}

# Using with groupby
ds = ray.data.from_items([
    {"group": "X", "category": "A"}, {"group": "X", "category": "B"},
    {"group": "Y", "category": "A"}, {"group": "Y", "category": "A"}
])
result = ds.groupby("group").aggregate(CountDistinct(on="category")).take_all()
# result: [{'group': 'X', 'count_distinct(category)': 2},
#          {'group': 'Y', 'count_distinct(category)': 1}]

Parameters:

on – The name of the column to count distinct values on.
ignore_nulls – Whether to ignore null values when counting distinct items. Default is True (nulls are excluded from the count).
alias_name – Optional name for the resulting column. If not provided, defaults to “count_distinct({on})”.
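The effect of these parameters can be pictured with a plain-Python sketch. This is not Ray's implementation: the helper `count_distinct_rows` is hypothetical, and it assumes that with ignore_nulls=False a null (None) counts as one additional distinct value.

```python
def count_distinct_rows(rows, on, ignore_nulls=True, alias_name=None):
    """Mimic the documented parameters on a list of dict rows.

    Hypothetical helper for illustration only; not part of Ray.
    """
    # Collect the distinct values of the column; a missing key is treated as None.
    values = {row.get(on) for row in rows}
    if ignore_nulls:
        # Assumption: nulls are simply dropped before counting.
        values.discard(None)
    # Default output name follows the documented "count_distinct({on})" pattern.
    name = alias_name if alias_name is not None else f"count_distinct({on})"
    return {name: len(values)}


rows = [
    {"category": "A"}, {"category": None},
    {"category": "B"}, {"category": "A"},
]

print(count_distinct_rows(rows, on="category"))
# {'count_distinct(category)': 2}

print(count_distinct_rows(rows, on="category", ignore_nulls=False))
# {'count_distinct(category)': 3}

print(count_distinct_rows(rows, on="category", alias_name="n_categories"))
# {'n_categories': 2}
```

With Ray itself, the same arguments are passed to the constructor, for example CountDistinct(on="category", ignore_nulls=False, alias_name="n_categories").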

Methods

`finalize`	Return the count of distinct values.
`get_agg_name`	Return the aggregation name (e.g., 'sum', 'mean', 'count').