ray.data.grouped_data.GroupedData.map_groups

Apply the given function to each group of records of this dataset.

While map_groups() is very flexible, note that it comes with downsides:

- It may be slower than more specific methods such as min() or max().
- It requires that each group fits in memory on a single node.

In general, prefer to use aggregate() instead of map_groups().
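As a minimal sketch of that trade-off, the snippet below computes a per-group sum in two ways: with map_groups() and with a built-in aggregation. The names group_sum and compare_with_ray are hypothetical and exist only for this illustration. The sketch assumes that ray and pandas are installed; Ray is imported lazily so that the group function can be tried on a plain DataFrame first.

```python
import pandas as pd


def group_sum(group: pd.DataFrame) -> pd.DataFrame:
    """Illustrative group function: one output row holding the group's sum."""
    return pd.DataFrame(
        {"group": [group["group"].iloc[0]], "total": [group["value"].sum()]}
    )


def compare_with_ray():
    # Hypothetical helper; Ray is imported here so group_sum stays usable
    # without a running Ray installation.
    import ray

    ds = ray.data.from_items(
        [
            {"group": 1, "value": 1},
            {"group": 1, "value": 2},
            {"group": 2, "value": 3},
            {"group": 2, "value": 4},
        ]
    )
    # Flexible: each whole group is handed to group_sum as one pandas batch,
    # so every group has to fit in memory on a single node.
    via_map_groups = ds.groupby("group").map_groups(
        group_sum, batch_format="pandas"
    )
    # Usually preferable for simple reductions: a built-in aggregation.
    via_aggregate = ds.groupby("group").sum("value")
    return via_map_groups.take_all(), via_aggregate.take_all()
```

Both paths should produce one row per group; the built-in aggregation names its output column differently (for example sum(value)), so the rows are not byte-for-byte identical.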

Warning

Specifying both num_cpus and num_gpus for map tasks is experimental, and may result in scheduling or stability issues. Please report any issues to the Ray team.

Examples

>>> # Return a single record per group (list of multiple records in,
>>> # list of a single record out).
>>> import ray
>>> import pandas as pd
>>> import numpy as np
>>> # Get first value per group.
>>> ds = ray.data.from_items([ 
...     {"group": 1, "value": 1},
...     {"group": 1, "value": 2},
...     {"group": 2, "value": 3},
...     {"group": 2, "value": 4}])
>>> ds.groupby("group").map_groups( 
...     lambda g: {"result": np.array([g["value"][0]])})

>>> # Return multiple records per group (dataframe in, dataframe out).
>>> df = pd.DataFrame(
...     {"A": ["a", "a", "b"], "B": [1, 1, 3], "C": [4, 6, 5]}
... )
>>> ds = ray.data.from_pandas(df) 
>>> grouped = ds.groupby("A") 
>>> grouped.map_groups( 
...     lambda g: g.apply(
...         lambda c: c / g[c.name].sum() if c.name in ["B", "C"] else c
...     )
... ) 

Parameters:

fn – The function to apply to each group of records, or a class type that can be instantiated to create such a callable. It takes as input a batch of all records from a single group, and returns a batch of zero or more records, similar to map_batches().
zero_copy_batch – If True, each group of rows (batch) is provided without making an additional copy.
compute – This argument is deprecated. Use the concurrency argument instead.
batch_format – Specify "default" to use the default block format (NumPy), "pandas" to select pandas.DataFrame, "pyarrow" to select pyarrow.Table, "numpy" to select Dict[str, numpy.ndarray], or None to return the underlying block exactly as is with no additional formatting.
fn_args – Arguments to fn.
fn_kwargs – Keyword arguments to fn.
fn_constructor_args – Positional arguments to pass to fn’s constructor. You can only provide this if fn is a callable class. These arguments are top-level arguments in the underlying Ray actor construction task.
fn_constructor_kwargs – Keyword arguments to pass to fn’s constructor. You can only provide this if fn is a callable class. These arguments are top-level arguments in the underlying Ray actor construction task.
num_cpus – The number of CPUs to reserve for each parallel map worker.
num_gpus – The number of GPUs to reserve for each parallel map worker. For example, specify num_gpus=1 to request 1 GPU for each parallel map worker.
memory – The heap memory in bytes to reserve for each parallel map worker.
ray_remote_args_fn – A function that returns a dictionary of remote args passed to each map worker. Its purpose is to generate dynamic arguments for each actor or task, and it is called each time before a worker is initialized. Arguments in the returned dictionary always override those in ray_remote_args. Note: this is an advanced, experimental feature.
concurrency – The semantics of this argument depend on the type of fn:
- If fn is a function and concurrency isn’t set (default), the actual concurrency is implicitly determined by the available resources and number of input blocks.
- If fn is a function and concurrency is an int n, Ray Data launches at most n concurrent tasks.
- If fn is a class and concurrency is an int n, Ray Data uses an actor pool with exactly n workers.
- If fn is a class and concurrency is a tuple (m, n), Ray Data uses an autoscaling actor pool from m to n workers.
- If fn is a class and concurrency isn’t set (default), this method raises an error.
ray_remote_args – Additional resource requirements to request from Ray (e.g., num_gpus=1 to request GPUs for the map tasks). See ray.remote() for details.
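
The parameters for callable classes can be sketched as follows. AddOffset and run_with_ray are hypothetical names made up for this illustration. The sketch assumes that fn_constructor_kwargs reach __init__ and that fn_kwargs are forwarded to __call__ after the group batch, in the same way as for map_batches(); treat that forwarding as an assumption to verify against your Ray version.

```python
import pandas as pd


class AddOffset:
    """Hypothetical callable class; __init__ runs once per actor."""

    def __init__(self, offset: int):
        self.offset = offset

    def __call__(self, group: pd.DataFrame, column: str = "B") -> pd.DataFrame:
        # Return a modified copy so the input batch is left untouched.
        out = group.copy()
        out[column] = out[column] + self.offset
        return out


def run_with_ray():
    # Hypothetical helper; Ray is imported lazily so AddOffset can be
    # exercised on a plain DataFrame without Ray.
    import ray

    ds = ray.data.from_pandas(
        pd.DataFrame({"A": ["a", "a", "b"], "B": [1, 1, 3], "C": [4, 6, 5]})
    )
    return ds.groupby("A").map_groups(
        AddOffset,
        batch_format="pandas",
        fn_constructor_kwargs={"offset": 10},  # assumed to reach __init__
        fn_kwargs={"column": "C"},  # assumed to reach __call__
        concurrency=2,  # fn is a class, so this requests a pool of 2 actors
    ).take_all()
```

Because fn is a class here, concurrency has to be set; passing a tuple such as (1, 4) instead of an int would request an autoscaling actor pool, as described in the list above.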

Returns:

The return type is determined by the return type of fn, and the return value is combined from results of all groups.
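
To illustrate how per-group results are combined, here is a small sketch of a group function that returns zero or more rows per group. It uses a plain pandas groupby and concat as a local stand-in for the combination step, not Ray's actual implementation; keep_above_group_mean is a hypothetical name. With Ray, the same function could be passed as ds.groupby("group").map_groups(keep_above_group_mean, batch_format="pandas").

```python
import pandas as pd


def keep_above_group_mean(group: pd.DataFrame) -> pd.DataFrame:
    """Illustrative group function: keep rows above the group's mean value.

    The result may be empty, for example when all values in a group are equal.
    """
    return group[group["value"] > group["value"].mean()]


df = pd.DataFrame(
    {"group": [1, 1, 2, 2, 3, 3, 3], "value": [1, 2, 3, 3, 1, 5, 9]}
)

# Local stand-in for the combination step: apply the function to each group
# and concatenate whatever each call returns.
combined = pd.concat(
    [keep_above_group_mean(g) for _, g in df.groupby("group")],
    ignore_index=True,
)
print(combined)
```

Group 2 contributes no rows, so the combined result contains only the rows from groups 1 and 3.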