ray.data.Dataset.groupby

Dataset.groupby(key: Union[None, str, Callable[[ray.data.block.T], Any]]) -> GroupedDataset[T]

Group the dataset by the key function or column name.

This is a lazy operation.

Examples

>>> import ray
>>> # Group by a key function and aggregate.
>>> ray.data.range(100).groupby(lambda x: x % 3).count()
Aggregate
+- Dataset(num_blocks=..., num_rows=100, schema=<class 'int'>)
>>> # Group by an Arrow table column and aggregate.
>>> ray.data.from_items([
...     {"A": x % 3, "B": x} for x in range(100)]).groupby(
...     "A").count()
Dataset(num_blocks=..., num_rows=3, schema={A: int64, count(): int64})

Time complexity: O(dataset size * log(dataset size / parallelism))
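
The log factor in this bound is what one would expect from a sort-based grouping, where rows are first sorted by key and runs of equal keys are then collected in one linear pass. The sketch below illustrates that general idea in plain Python on an in-memory sequence. It is an assumption about the technique, not Ray's actual implementation, and `sort_based_group_count` is a hypothetical helper name that is not part of the Ray API.

```python
from itertools import groupby


def sort_based_group_count(rows, key_fn):
    """Count rows per key by sorting first, then scanning runs of equal keys.

    Hypothetical helper for illustration only; not part of the Ray API.
    """
    # Sorting costs O(n log n) and places rows with equal keys next to each other.
    ordered = sorted(rows, key=key_fn)
    # itertools.groupby then needs a single O(n) pass over the sorted rows.
    return {k: sum(1 for _ in group) for k, group in groupby(ordered, key=key_fn)}


counts = sort_based_group_count(range(100), lambda x: x % 3)
print(counts)  # {0: 34, 1: 33, 2: 33}
```

In a distributed setting the sort would presumably be split across partitions, which is one way to read the `dataset size / parallelism` term inside the logarithm.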

Parameters

key – A key function or Arrow column name. If None, the grouping is global: all rows form a single group.

Returns

A lazy GroupedDataset that can be aggregated later.
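
To make the three accepted forms of `key` concrete, here is a minimal plain-Python sketch of the grouping semantics on a small list of rows. `group_rows` is a hypothetical helper, not a Ray function, and it groups eagerly in memory, whereas `Dataset.groupby` returns a lazy `GroupedDataset`. Treating `None` as "every row in one group" is this sketch's reading of "the grouping is global".

```python
from collections import defaultdict


def group_rows(rows, key=None):
    """Group rows by a callable, a column name, or globally when key is None.

    Hypothetical helper for illustration only; not part of the Ray API.
    """
    if key is None:
        # Global grouping: every row maps to the same (None) group.
        key_fn = lambda row: None
    elif isinstance(key, str):
        # Column name: look the key up in each row.
        key_fn = lambda row: row[key]
    else:
        # Key function: call it on each row.
        key_fn = key

    groups = defaultdict(list)
    for row in rows:
        groups[key_fn(row)].append(row)
    return dict(groups)


rows = [{"A": x % 3, "B": x} for x in range(6)]

by_column = group_rows(rows, "A")                         # keys 0, 1, 2
by_function = group_rows(rows, lambda row: row["B"] % 2)  # keys 0, 1
global_group = group_rows(rows)                           # single key: None

print({k: len(v) for k, v in by_column.items()})  # {0: 2, 1: 2, 2: 2}
print(len(global_group[None]))                    # 6
```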