Dataset.map(fn: Union[Callable[[Dict[str, Any]], Dict[str, Any]], Callable[[Dict[str, Any]], Iterator[Dict[str, Any]]], _CallableClassProtocol], *, compute: Optional[ray.data._internal.compute.ComputeStrategy] = None, fn_constructor_args: Optional[Iterable[Any]] = None, num_cpus: Optional[float] = None, num_gpus: Optional[float] = None, **ray_remote_args) -> Dataset

Apply the given function to each row of this dataset.

Use this method to transform your data. To learn more, see Transforming rows.


Tip

If your transformation is vectorized, like most NumPy or pandas operations, map_batches() might be faster.


Examples

import os
from typing import Any, Dict
import ray

def parse_filename(row: Dict[str, Any]) -> Dict[str, Any]:
    row["filename"] = os.path.basename(row["path"])
    return row

ds = (
    ray.data.read_images("s3://anonymous@ray-example-data/image-datasets/simple", include_paths=True)
    .map(parse_filename)
)
print(ds.schema())

Column    Type
------    ----
image     numpy.ndarray(shape=(32, 32, 3), dtype=uint8)
path      string
filename  string

Time complexity: O(dataset size / parallelism)

Parameters

  • fn – The function to apply to each row, or a class type that can be instantiated to create such a callable. Callable classes are only supported for the actor compute strategy.

  • compute – The compute strategy, either None (default) to use Ray tasks, ray.data.ActorPoolStrategy(size=n) to use a fixed-size actor pool, or ray.data.ActorPoolStrategy(min_size=m, max_size=n) for an autoscaling actor pool.

  • fn_constructor_args – Positional arguments to pass to fn’s constructor. You can only provide this if fn is a callable class. These arguments are top-level arguments in the underlying Ray actor construction task.

  • num_cpus – The number of CPUs to reserve for each parallel map worker.

  • num_gpus – The number of GPUs to reserve for each parallel map worker. For example, specify num_gpus=1 to request 1 GPU for each parallel map worker.

  • ray_remote_args – Additional resource requirements to request from Ray for each map worker.
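The parameters above can be combined as in the following minimal sketch, which passes a callable class to map() with a fixed-size actor pool. The class name AddOffset, the offset value, and the helper run_with_ray() are hypothetical and chosen only for illustration; the sketch also assumes that ray.data.range() yields rows with an "id" column, which may differ between Ray versions.

```python
from typing import Any, Dict


class AddOffset:
    """Hypothetical callable class that adds a fixed offset to each row's "id"."""

    def __init__(self, offset: int):
        # Receives the values passed through fn_constructor_args.
        self.offset = offset

    def __call__(self, row: Dict[str, Any]) -> Dict[str, Any]:
        # Called once per row; must return a row (a dict).
        row["id_plus_offset"] = row["id"] + self.offset
        return row


def run_with_ray() -> None:
    # Imported here so the class above can be exercised without a Ray cluster.
    import ray

    ds = ray.data.range(8).map(
        AddOffset,                                    # callable class, not an instance
        compute=ray.data.ActorPoolStrategy(size=2),   # callable classes need actors
        fn_constructor_args=(100,),                   # forwarded to AddOffset.__init__
        num_cpus=1,                                   # CPUs reserved per map worker
    )
    print(ds.take(3))
```

Calling run_with_ray() on a machine with Ray installed runs the transformation. Keeping state in __init__ is the usual reason to prefer a callable class over a plain function, since setup then happens once per actor rather than once per row.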

See also


flat_map(): Call this method to create new rows from existing ones. Unlike map(), a function passed to flat_map() can return multiple rows.
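As a rough illustration of the difference between map() and flat_map(), the sketch below defines one function of each kind. The function names, the "text" and "word" columns, and the helper run_with_ray() are hypothetical and not part of the Ray API.

```python
from typing import Any, Dict, List


def count_words(row: Dict[str, Any]) -> Dict[str, Any]:
    # Suitable for map(): one row in, exactly one row out.
    return {"text": row["text"], "num_words": len(row["text"].split())}


def split_words(row: Dict[str, Any]) -> List[Dict[str, Any]]:
    # Suitable for flat_map(): one row in, any number of rows out.
    return [{"word": word} for word in row["text"].split()]


def run_with_ray() -> None:
    # Imported here so the functions above can be checked without a Ray cluster.
    import ray

    ds = ray.data.from_items([{"text": "hello ray"}, {"text": "row wise map"}])
    print(ds.map(count_words).take_all())       # same number of rows as the input
    print(ds.flat_map(split_words).take_all())  # one row per word
```

Calling run_with_ray() on a machine with Ray installed prints both results.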


map_batches(): Call this method to transform batches of data.
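To make the contrast with map_batches() concrete, here is a small sketch of the same transformation written row-wise and batch-wise. It assumes that batches arrive as a dict of NumPy arrays and that ray.data.range() yields an "id" column; both can vary with the Ray version and the batch format in use. The function names and run_with_ray() are hypothetical.

```python
from typing import Any, Dict

import numpy as np


def double_row(row: Dict[str, Any]) -> Dict[str, Any]:
    # Row-wise version for map(): called once per row.
    row["doubled"] = row["id"] * 2
    return row


def double_batch(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
    # Vectorized version for map_batches(): called once per batch,
    # so the multiplication runs over a whole array at a time.
    batch["doubled"] = batch["id"] * 2
    return batch


def run_with_ray() -> None:
    # Imported here so the functions above can be checked without a Ray cluster.
    import ray

    ds = ray.data.range(1000)
    print(ds.map(double_row).take(2))
    print(ds.map_batches(double_batch).take(2))
```

Both calls are expected to produce the same rows; the batch version may be faster because it avoids a Python function call per row.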