ray.data.Dataset.map#

Dataset.map(fn: Union[Callable[[ray.data.block.T], ray.data.block.U], _CallableClassProtocol[T, U]], *, compute: Union[str, ray.data._internal.compute.ComputeStrategy] = None, **ray_remote_args) Dataset[U][source]#

Apply the given function to each record of this dataset.

This is a blocking operation. Note that mapping individual records can be quite slow. Consider using map_batches() for performance.

Examples

>>> import ray
>>> # Transform python objects.
>>> ds = ray.data.range(1000)
>>> ds.map(lambda x: x * 2)
Map
+- Dataset(num_blocks=..., num_rows=1000, schema=<class 'int'>)
>>> # Transform Arrow records.
>>> ds = ray.data.from_items(
...     [{"value": i} for i in range(1000)])
>>> ds.map(lambda record: {"v2": record["value"] * 2})
Dataset(num_blocks=..., num_rows=1000, schema={v2: int64})
>>> # Define a callable class that persists state across
>>> # function invocations for efficiency.
>>> init_model = ... 
>>> class CachedModel:
...    def __init__(self):
...        self.model = init_model()
...    def __call__(self, batch):
...        return self.model(batch)
>>> # Apply the transform in parallel on GPUs. Since
>>> # compute=ActorPoolStrategy(2, 8) the transform will be applied on an
>>> # autoscaling pool of 2-8 Ray actors, each allocated 1 GPU by Ray.
>>> from ray.data._internal.compute import ActorPoolStrategy
>>> ds.map(CachedModel, 
...        compute=ActorPoolStrategy(2, 8),
...        num_gpus=1)

Time complexity: O(dataset size / parallelism)

Parameters
  • fn – The function to apply to each record, or a class type that can be instantiated to create such a callable. Callable classes are only supported for the actor compute strategy.

  • compute – The compute strategy, either “tasks” (default) to use Ray tasks, or “actors” to use an autoscaling actor pool. If wanting to configure the min or max size of the autoscaling actor pool, you can provide an ActorPoolStrategy(min, max) instance. If using callable classes for fn, the actor compute strategy must be used.

  • ray_remote_args – Additional resource requirements to request from ray (e.g., num_gpus=1 to request GPUs for the map tasks).

See also

flat_map():

Call this method to create new records from existing ones. Unlike map(), a function passed to flat_map() can return multiple records.

flat_map() isn’t recommended because it’s slow; call map_batches() instead.

map_batches()

Call this method to transform batches of data. It’s faster and more flexible than map() and flat_map().