GroupedDataset.sum(on: Union[None, str, Callable[[ray.data.block.T], Any], List[Union[None, str, Callable[[ray.data.block.T], Any]]]] = None, ignore_nulls: bool = True) → ray.data.dataset.Dataset[ray.data.block.U]

Compute grouped sum aggregation.


>>> import ray
>>> ray.data.range(100).groupby(lambda x: x % 3).sum()
>>> ray.data.from_items([
...     (i % 3, i, i**2)
...     for i in range(100)]) \
...     .groupby(lambda x: x[0] % 3) \
...     .sum(lambda x: x[2])
>>> ray.data.range_table(100).groupby("value").sum()
>>> ray.data.from_items([
...     {"A": i % 3, "B": i, "C": i**2}
...     for i in range(100)]) \
...     .groupby("A") \
...     .sum(["B", "C"])
  • on – The data subset on which to compute the sum.

    • For a simple dataset: it can be a callable or a list thereof, and the default is to take a sum of all rows.

    • For an Arrow dataset: it can be a column name or a list thereof, and the default is to do a column-wise sum of all columns.

  • ignore_nulls – Whether to ignore null values. If True, null values are skipped when computing the sum; if False, the output is null whenever a null value is encountered. np.nan, None, and pd.NaT are treated as null values. Default is True.
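
The ignore_nulls behavior described above can be modeled in plain Python. This is a minimal sketch, not Ray's implementation: the helper names `is_null` and `grouped_sum` are hypothetical, only None and float NaN are treated as null (pd.NaT is left out to keep the example free of dependencies), and it does not model what Ray returns for a group whose values are all null.

```python
import math
from collections import defaultdict


def is_null(value):
    """Simplified null check: None or float NaN (pd.NaT is not modeled here)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def grouped_sum(rows, key, on, ignore_nulls=True):
    """Illustrative model of a grouped sum; hypothetical helper, not a Ray API."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(on(row))

    result = {}
    for k, values in groups.items():
        if ignore_nulls:
            # Null values are skipped and the remaining values are summed.
            result[k] = sum(v for v in values if not is_null(v))
        elif any(is_null(v) for v in values):
            # A single null makes the whole group's output null.
            result[k] = None
        else:
            result[k] = sum(values)
    return result


rows = [(0, 1), (0, None), (1, 2), (1, 3)]
skipped = grouped_sum(rows, key=lambda r: r[0], on=lambda r: r[1])
strict = grouped_sum(rows, key=lambda r: r[0], on=lambda r: r[1], ignore_nulls=False)
print(skipped)  # {0: 1, 1: 5}
print(strict)   # {0: None, 1: 5}
```

Group 0 contains a None, so it sums to 1 when nulls are ignored and becomes None when they are not; group 1 has no nulls and is unaffected by the flag.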


Returns the sum result.

For a simple dataset, the output is:

  • on=None: a simple dataset of (k, sum) tuples where k is the groupby key and sum is the sum of all rows in that group.

  • on=[callable_1, ..., callable_n]: a simple dataset of (k, sum_1, ..., sum_n) tuples where k is the groupby key and sum_i is the sum of the outputs of the ith callable called on each row in that group.
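
The tuple shapes listed above for a simple dataset can be illustrated without Ray. The following is a rough sketch in plain Python: `simple_grouped_sum` is a hypothetical helper, and sorting the output by key is an assumption made here for readability, not something the source states about Ray's ordering.

```python
from collections import defaultdict


def simple_grouped_sum(rows, key, on=None):
    """Hypothetical model of the (k, sum_1, ..., sum_n) output shape."""
    # on=None means "sum the rows themselves".
    callables = [lambda row: row] if on is None else list(on)

    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)

    # One output tuple per group: the key followed by one sum per callable.
    return [
        (k, *(sum(fn(row) for row in members) for fn in callables))
        for k, members in sorted(groups.items())
    ]


# on=None: (k, sum) tuples.
plain = simple_grouped_sum(range(10), key=lambda x: x % 3)
print(plain)  # [(0, 18), (1, 12), (2, 15)]

# on=[callable_1, callable_2]: (k, sum_1, sum_2) tuples.
rows = [(i % 3, i, i**2) for i in range(10)]
multi = simple_grouped_sum(
    rows, key=lambda r: r[0], on=[lambda r: r[1], lambda r: r[2]]
)
print(multi)  # [(0, 18, 126), (1, 12, 66), (2, 15, 93)]
```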

For an Arrow dataset, the output is:

  • on=None: an Arrow dataset containing a groupby key column, "k", and a column-wise sum column for each original column in the dataset.

  • on=["col_1", ..., "col_n"]: an Arrow dataset of n + 1 columns where the first column is the groupby key and columns 2 through n + 1 hold the results of the aggregations.

If the groupby key is None, the key part of the return value is omitted.
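
The column-oriented output, including the case where the groupby key is None, can be sketched with a list of dicts standing in for an Arrow dataset. Everything here is illustrative: `columnar_grouped_sum` is a hypothetical helper, and the output column names (the key column reusing the input column name, and the `sum(...)` labels) are assumptions of this sketch rather than names confirmed by the text above.

```python
from collections import defaultdict


def columnar_grouped_sum(rows, key, on):
    """Hypothetical model: one output record per group, one sum per column."""
    groups = defaultdict(list)
    for row in rows:
        # With key=None every row falls into a single, unnamed group.
        groups[None if key is None else row[key]].append(row)

    out = []
    for k, members in sorted(groups.items()):
        # The key column is omitted entirely when there is no groupby key.
        record = {} if key is None else {key: k}
        for col in on:
            record[f"sum({col})"] = sum(m[col] for m in members)
        out.append(record)
    return out


rows = [{"A": i % 3, "B": i, "C": i**2} for i in range(10)]

by_a = columnar_grouped_sum(rows, key="A", on=["B", "C"])
print(by_a[0])  # {'A': 0, 'sum(B)': 18, 'sum(C)': 126}

no_key = columnar_grouped_sum(rows, key=None, on=["B", "C"])
print(no_key)  # [{'sum(B)': 45, 'sum(C)': 285}]
```

With a key, each record has n + 1 fields (the key plus one sum per requested column); with key=None the result collapses to a single record holding only the sums.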