ray.data.Dataset.groupby

Dataset.groupby(key: str | List[str] | None, num_partitions: int | None = None) → GroupedData

Group rows of a Dataset by one or more columns.

Use this method to transform data based on a categorical variable.

Note

This operation requires all inputs to be materialized in the object store before it can execute.

Examples

import pandas as pd
import ray

def normalize_variety(group: pd.DataFrame) -> pd.DataFrame:
    # Drop the grouping column by name; drop("variety") alone would look for a row label.
    for feature in group.drop(columns="variety").columns:
        group[feature] = group[feature] / group[feature].abs().max()
    return group

ds = (
    ray.data.read_parquet("s3://anonymous@ray-example-data/iris.parquet")
    .groupby("variety")
    .map_groups(normalize_variety, batch_format="pandas")
)
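To see what the per-group function receives and returns, it can be called directly on a small pandas DataFrame without starting Ray. The following is a minimal sketch: the `sample` frame and its values are hypothetical stand-ins for the rows of one group, and the function copies its input, which the example above does not do.

```python
import pandas as pd


def normalize_variety(group: pd.DataFrame) -> pd.DataFrame:
    # Copy so the caller's frame is left untouched (a choice made for this sketch).
    group = group.copy()
    # Scale every feature column by its largest absolute value within the group.
    for feature in group.drop(columns="variety").columns:
        group[feature] = group[feature] / group[feature].abs().max()
    return group


# Hypothetical rows standing in for a single "variety" group.
sample = pd.DataFrame(
    {
        "sepal.length": [5.0, 4.0, 2.5],
        "petal.width": [0.2, 0.4, 0.1],
        "variety": ["Setosa", "Setosa", "Setosa"],
    }
)

normalized = normalize_variety(sample)
print(normalized)
```

After normalization, the largest absolute value in each feature column of the group is 1.0, and the grouping column is passed through unchanged.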

Time complexity: O(dataset size * log(dataset size / parallelism))

Parameters:

key – A column name or list of column names. If this is None, place all rows in a single group.
num_partitions – Number of partitions to split the data into. Only relevant when the hash-shuffling strategy is used. If not set, defaults to DataContext.min_parallelism.
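The three accepted forms of key can be pictured with a plain-Python model. This is a conceptual sketch only, not how Ray implements grouping: `group_rows` and the sample rows are hypothetical names introduced for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple, Union

Row = Dict[str, Any]


def group_rows(
    rows: List[Row], key: Union[str, List[str], None]
) -> Dict[Tuple[Any, ...], List[Row]]:
    """Conceptual model of how `key` selects groups (hypothetical helper)."""
    if key is None:
        columns: List[str] = []      # no key columns: every row shares one group
    elif isinstance(key, str):
        columns = [key]              # single column name
    else:
        columns = list(key)          # several column names
    groups: Dict[Tuple[Any, ...], List[Row]] = defaultdict(list)
    for row in rows:
        # Rows with identical values in all key columns land in the same group.
        groups[tuple(row[c] for c in columns)].append(row)
    return dict(groups)


rows = [
    {"variety": "Setosa", "site": "A", "width": 0.2},
    {"variety": "Setosa", "site": "B", "width": 0.4},
    {"variety": "Virginica", "site": "A", "width": 2.1},
]

by_variety = group_rows(rows, "variety")
by_variety_and_site = group_rows(rows, ["variety", "site"])
single_group = group_rows(rows, None)
```

In this model, grouping by "variety" yields two groups, grouping by both columns yields three, and passing None yields a single group that contains every row.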

Returns:

A lazy GroupedData.