ray.data.Dataset.map
- Dataset.map(fn: Union[Callable[[ray.data.block.T], ray.data.block.U], _CallableClassProtocol[T, U]], *, compute: Union[str, ray.data._internal.compute.ComputeStrategy] = None, **ray_remote_args) → Dataset[U]
Apply the given function to each record of this dataset.
Note that mapping individual records can be quite slow. Consider using map_batches() for performance.

Examples
>>> import ray
>>> # Transform python objects.
>>> ds = ray.data.range(1000)
>>> ds.map(lambda x: x * 2)
Map
+- Dataset(num_blocks=..., num_rows=1000, schema=<class 'int'>)
>>> # Transform Arrow records.
>>> ds = ray.data.from_items(
...     [{"value": i} for i in range(1000)])
>>> ds.map(lambda record: {"v2": record["value"] * 2})
Map
+- Dataset(num_blocks=..., num_rows=1000, schema={value: int64})
>>> # Define a callable class that persists state across
>>> # function invocations for efficiency.
>>> init_model = ...
>>> class CachedModel:
...     def __init__(self):
...         self.model = init_model()
...     def __call__(self, record):
...         return self.model(record)
>>> # Apply the transform in parallel on GPUs. Since
>>> # compute=ActorPoolStrategy(2, 8) the transform will be applied on an
>>> # autoscaling pool of 2-8 Ray actors, each allocated 1 GPU by Ray.
>>> from ray.data._internal.compute import ActorPoolStrategy
>>> ds.map(CachedModel,
...        compute=ActorPoolStrategy(2, 8),
...        num_gpus=1)
Time complexity: O(dataset size / parallelism)
- Parameters
fn – The function to apply to each record, or a class type that can be instantiated to create such a callable. Callable classes are only supported for the actor compute strategy.
compute – The compute strategy, either “tasks” (default) to use Ray tasks, or “actors” to use an autoscaling actor pool. To configure the min or max size of the autoscaling actor pool, provide an ActorPoolStrategy(min, max) instance. If using callable classes for fn, the actor compute strategy must be used.
ray_remote_args – Additional resource requirements to request from Ray (e.g., num_gpus=1 to request GPUs for the map tasks).
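The requirement that callable classes run under the actor compute strategy follows from how they are used: the class is instantiated once per long-lived worker and the resulting instance is then called for every record, so expensive setup is paid once rather than per record. The following is a minimal pure-Python sketch of that pattern, not Ray code. `run_on_worker` is a hypothetical stand-in for a single actor, and the `scale` attribute is a placeholder for a loaded model.

```python
class CachedModel:
    """Callable class: setup runs once in __init__, not once per record."""

    init_calls = 0  # counts how many times the setup ran

    def __init__(self):
        CachedModel.init_calls += 1
        self.scale = 2  # hypothetical stand-in for an expensive model load

    def __call__(self, record):
        return record * self.scale


def run_on_worker(fn_or_class, records):
    """Hypothetical model of one actor: instantiate a class once, then reuse it."""
    fn = fn_or_class() if isinstance(fn_or_class, type) else fn_or_class
    return [fn(r) for r in records]


doubled = run_on_worker(CachedModel, range(5))
```

With a plain function the `isinstance` check is skipped and the function is applied directly, which corresponds to the stateless “tasks” strategy described above.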
See also
flat_map(): Call this method to create new records from existing ones. Unlike map(), a function passed to flat_map() can return multiple records. flat_map() isn’t recommended because it’s slow; call map_batches() instead.
map_batches(): Call this method to transform batches of data. It’s faster and more flexible than map() and flat_map().
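To make the difference between the three methods concrete, here is a small pure-Python model of their record-level semantics. It is an illustrative sketch only: the helper names `map_records`, `flat_map_records`, and `map_batches_records` are hypothetical, plain lists stand in for datasets, and nothing here calls Ray or reflects its internal implementation.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def map_records(records: Iterable[T], fn: Callable[[T], U]) -> List[U]:
    # map(): exactly one output record per input record.
    return [fn(r) for r in records]


def flat_map_records(records: Iterable[T], fn: Callable[[T], Iterable[U]]) -> List[U]:
    # flat_map(): fn returns zero or more records, which are flattened.
    return [out for r in records for out in fn(r)]


def map_batches_records(
    records: Iterable[T], fn: Callable[[List[T]], List[U]], batch_size: int
) -> List[U]:
    # map_batches(): fn is called once per batch, amortizing call overhead.
    items = list(records)
    out: List[U] = []
    for start in range(0, len(items), batch_size):
        out.extend(fn(items[start:start + batch_size]))
    return out


data = [1, 2, 3, 4]
mapped = map_records(data, lambda x: x * 2)
flat = flat_map_records(data, lambda x: [x, -x])
batched = map_batches_records(data, lambda b: [x * 2 for x in b], batch_size=3)
```

In this model `mapped` and `batched` hold the same values, but the batched version invokes the user function twice instead of four times, which is the intuition behind preferring map_batches() for performance.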