ray.data.aggregate.AggregateFnV2#

class ray.data.aggregate.AggregateFnV2(name: str, zero_factory: Callable[[], AggType], *, on: str | None, ignore_nulls: bool)[source]#

Bases: AggregateFn, ABC

Provides an interface to implement efficient aggregations to be applied to the dataset.

AggregateFnV2 instances are passed to a Dataset’s .aggregate(...) method to perform distributed aggregations. To create a custom aggregation, you should subclass AggregateFnV2 and implement the aggregate_block and combine methods. The _finalize method can also be overridden if the final accumulated state needs further transformation.

Aggregation follows these steps:

  1. Initialization: For each group (if grouping) or for the entire dataset, an initial accumulator is created using zero_factory.

  2. Block Aggregation: The aggregate_block method is applied to each block independently, producing a partial aggregation result for that block.

  3. Combination: The combine method is used to merge these partial results (or an existing accumulated result with a new partial result) into a single, combined accumulator.

  4. Finalization: Optionally, the _finalize method transforms the final combined accumulator into the desired output format.

Parameters:
  • name – The name of the aggregation. This will be used as the column name in the output, e.g., “sum(my_col)”.

  • zero_factory – A callable that returns the initial “zero” value for the accumulator. For example, for a sum, this would be lambda: 0; for finding a minimum, lambda: float("inf"), for finding a maximum, lambda: float("-inf").

  • on – The name of the column to perform the aggregation on. If None, the aggregation is performed over the entire row (e.g., for Count()).

  • ignore_nulls – Whether to ignore null values during aggregation. If True, nulls are skipped. If False, the presence of a null value might result in a null output, depending on the aggregation logic.

PublicAPI (alpha): This API is in alpha and may change before becoming stable.

Methods

aggregate_block

Aggregates data within a single block.

combine

Combines a new partial aggregation result with the current accumulator.

finalize

Transforms the final accumulated state into the desired output.