ray.data.Dataset.groupby

Dataset.groupby(key: Optional[str]) -> GroupedData

Group the rows of the dataset by the values of a column.

Examples

>>> import ray
>>> # Group by a table column and aggregate.
>>> ray.data.from_items([
...     {"A": x % 3, "B": x} for x in range(100)]).groupby(
...     "A").count()
Aggregate
+- Dataset(num_blocks=100, num_rows=100, schema={A: int64, B: int64})

Time complexity: O(dataset size * log(dataset size / parallelism))

Parameters

key – The name of the column to group by. If this is None, the grouping is global: all rows are treated as a single group.

Returns

A lazy GroupedData that can be aggregated later.
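
The semantics of the key parameter can be pictured with a small pure-Python sketch that does not use Ray. The helper `count_by_key` and its `"count"` output field are hypothetical names introduced only for illustration. The sort-then-scan strategy is an assumption chosen to show the idea, not a description of Ray's internals. With a column name the helper counts rows per distinct value; with None it treats the whole input as one group.

```python
from itertools import groupby
from typing import Any, Dict, List, Optional


def count_by_key(rows: List[Dict[str, Any]], key: Optional[str]) -> List[Dict[str, Any]]:
    """Hypothetical helper: count rows per group (not part of Ray)."""
    if key is None:
        # Global grouping: every row belongs to the same single group.
        return [{"count": len(rows)}]
    # Sort by the key column so equal values become adjacent, then scan once.
    ordered = sorted(rows, key=lambda row: row[key])
    return [
        {key: value, "count": sum(1 for _ in group)}
        for value, group in groupby(ordered, key=lambda row: row[key])
    ]


# Same input rows as the example above, held in a plain Python list.
rows = [{"A": x % 3, "B": x} for x in range(100)]
per_group = count_by_key(rows, "A")   # one result row per distinct value of "A"
overall = count_by_key(rows, None)    # a single result row for the whole input
```

Unlike this eager sketch, the GroupedData returned by groupby is lazy: no grouping work happens until an aggregation is applied and its result is consumed.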