GroupedData API

GroupedData objects are returned by a groupby call: Dataset.groupby().

Computations or Descriptive Stats

GroupedData.count

Compute count aggregation.

GroupedData.max

Compute grouped max aggregation.

GroupedData.mean

Compute grouped mean aggregation.

GroupedData.min

Compute grouped min aggregation.

GroupedData.std

Compute grouped standard deviation aggregation.

GroupedData.sum

Compute grouped sum aggregation.
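
To make the semantics of these aggregations concrete, here is a minimal plain-Python sketch that computes the same six statistics per group over a small list of rows. It does not use Ray and is not how Ray implements them. The rows, the column names, and the `grouped_stats` helper are hypothetical, and the use of the sample standard deviation (ddof=1) is an assumption that should be checked against the `GroupedData.std` reference.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical input rows; "species" is the group key, "length" the value column.
rows = [
    {"species": "a", "length": 1.0},
    {"species": "a", "length": 3.0},
    {"species": "b", "length": 2.0},
    {"species": "b", "length": 4.0},
    {"species": "b", "length": 6.0},
]


def grouped_stats(rows, key, on):
    """Collect the values of column `on` per group, then summarize each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[on])
    return {
        k: {
            "count": len(v),
            "max": max(v),
            "mean": mean(v),
            "min": min(v),
            "std": stdev(v),  # sample standard deviation; needs >= 2 values
            "sum": sum(v),
        }
        for k, v in sorted(groups.items())
    }


stats = grouped_stats(rows, key="species", on="length")
for species, values in stats.items():
    print(species, values)
```

With Ray itself, the corresponding calls would look roughly like `ds.groupby("species").mean("length")`, each returning a new Dataset with one row per group.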

Function Application

GroupedData.aggregate

Implements an accumulator-based aggregation.

GroupedData.map_groups

Apply the given function to each group of records of this dataset.
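
The behavior described for map_groups can be read as a group-then-apply pattern. The following plain-Python model illustrates that pattern only; it is not Ray's implementation, and `map_groups_sketch`, `add_share`, and the sample rows are hypothetical names introduced for illustration.

```python
from itertools import groupby
from operator import itemgetter


def map_groups_sketch(rows, key, fn):
    """Apply fn to each group's rows and concatenate the per-group results."""
    out = []
    # itertools.groupby only groups adjacent items, so sort by the key first.
    ordered = sorted(rows, key=itemgetter(key))
    for _, group in groupby(ordered, key=itemgetter(key)):
        out.extend(fn(list(group)))
    return out


def add_share(group):
    """Add each row's share of its group's total value."""
    total = sum(r["value"] for r in group)
    return [{**r, "share": r["value"] / total} for r in group]


rows = [
    {"team": "x", "value": 1},
    {"team": "y", "value": 4},
    {"team": "x", "value": 3},
]
result = map_groups_sketch(rows, "team", add_share)
for row in result:
    print(row)
```

In this sketch the function sees a list of dicts. The real method hands each group to the function as a batch whose format depends on how the call is configured, so the function body would need to be adapted accordingly.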

AggregateFn

class ray.data.aggregate.AggregateFn(init: Callable[[KeyType], AggType], merge: Callable[[AggType, AggType], AggType], name: str, accumulate_row: Callable[[AggType, T], AggType] = None, accumulate_block: Callable[[AggType, pyarrow.Table | pandas.DataFrame], AggType] = None, finalize: Callable[[AggType], U] | None = None)

Defines an aggregate function in the accumulator style.

Aggregates a collection of inputs of type T into a single output value of type U. See https://www.sigops.org/s/conferences/sosp/2009/papers/yu-sosp09.pdf for more details about accumulator-based aggregation.

Parameters:
  • init – This is called once for each group to return the empty accumulator. For example, an empty accumulator for a sum would be 0.

  • merge – This may be called multiple times, each time to merge two accumulators into one.

  • name – The name of the aggregation. This will be used as the column name in the output Dataset.

  • accumulate_row – This is called once per row of the same group. It combines the accumulator and the row, and returns the updated accumulator. Exactly one of accumulate_row and accumulate_block must be provided.

  • accumulate_block – This is used to calculate the aggregation for a single block, and is a vectorized alternative to accumulate_row. It is given a base accumulator and the entire block, allowing for vectorized accumulation of the block. Exactly one of accumulate_row and accumulate_block must be provided.

  • finalize – This is called once to compute the final aggregation result from the fully merged accumulator.
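
The way the parameters above fit together can be sketched in plain Python. This is a simplified model of the accumulator style under stated assumptions, not Ray's execution engine: `SketchAggregateFn` and `run_aggregation` are hypothetical stand-ins that reuse the parameter names, blocks are modeled as lists of dicts, and only the accumulate_row path is shown. A mean is used because it needs a (sum, count) accumulator, which shows why finalize is separate from merge.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class SketchAggregateFn:
    """Hypothetical stand-in that mirrors the parameter names listed above."""
    init: Callable[[Any], Any]
    merge: Callable[[Any, Any], Any]
    name: str
    accumulate_row: Callable[[Any, Any], Any]
    finalize: Optional[Callable[[Any], Any]] = None


def run_aggregation(blocks, key, agg):
    # Phase 1: within each block, build one accumulator per group.
    partials = []
    for block in blocks:
        accs = {}
        for row in block:
            k = row[key]
            if k not in accs:
                accs[k] = agg.init(k)  # empty accumulator for this group
            accs[k] = agg.accumulate_row(accs[k], row)
        partials.append(accs)

    # Phase 2: merge the per-block accumulators of the same group.
    merged = {}
    for accs in partials:
        for k, acc in accs.items():
            merged[k] = agg.merge(merged[k], acc) if k in merged else acc

    # Phase 3: finalize once per group; the result column is named agg.name.
    finalize = agg.finalize or (lambda acc: acc)
    return [{key: k, agg.name: finalize(merged[k])} for k in sorted(merged)]


mean_agg = SketchAggregateFn(
    init=lambda k: (0.0, 0),  # (running sum, row count)
    accumulate_row=lambda acc, row: (acc[0] + row["v"], acc[1] + 1),
    merge=lambda a, b: (a[0] + b[0], a[1] + b[1]),
    finalize=lambda acc: acc[0] / acc[1],
    name="mean(v)",
)

# Two hypothetical blocks; group "a" spans both, so merge is exercised.
blocks = [
    [{"g": "a", "v": 1.0}, {"g": "b", "v": 10.0}],
    [{"g": "a", "v": 3.0}, {"g": "a", "v": 5.0}],
]
result = run_aggregation(blocks, "g", mean_agg)
print(result)
```

Because merge must give the same answer regardless of how rows were split across blocks, the accumulator carries the sum and the count rather than a partial mean; the division happens only in finalize.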