ray.data.Dataset.flat_map
- Dataset.flat_map(fn: Union[Callable[[Dict[str, Any]], List[Dict[str, Any]]], Callable[[Dict[str, Any]], Iterator[List[Dict[str, Any]]]], _CallableClassProtocol], *, compute: Optional[ray.data._internal.compute.ComputeStrategy] = None, num_cpus: Optional[float] = None, num_gpus: Optional[float] = None, **ray_remote_args) -> Dataset
Apply the given function to each record and then flatten results.
Consider using map_batches() for better performance (the batch size can be altered in map_batches()).
Examples
>>> import ray
>>> ds = ray.data.range(1000)
>>> ds.flat_map(lambda x: [{"id": 1}, {"id": 2}, {"id": 4}])
FlatMap
+- Dataset(num_blocks=..., num_rows=1000, schema={id: int64})
Time complexity: O(dataset size / parallelism)
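To make the apply-then-flatten behavior concrete, here is a minimal sketch. The names `expand_row`, `flat_map_reference`, and `run_with_ray` are hypothetical and are not part of Ray. `flat_map_reference` is only a pure-Python stand-in for the semantics described above, not how Ray implements the operation, and `run_with_ray` assumes Ray is installed.

```python
from typing import Any, Callable, Dict, Iterable, List

Row = Dict[str, Any]


def expand_row(row: Row) -> List[Row]:
    """Hypothetical row function: emit the input id and its double."""
    return [{"id": row["id"]}, {"id": row["id"] * 2}]


def flat_map_reference(
    rows: Iterable[Row], fn: Callable[[Row], Iterable[Row]]
) -> List[Row]:
    """Pure-Python stand-in for the semantics: apply fn, then flatten."""
    out: List[Row] = []
    for row in rows:
        out.extend(fn(row))  # each call may yield zero, one, or many rows
    return out


def run_with_ray() -> List[Row]:
    """The same transformation through Ray Data (requires ray)."""
    import ray

    ds = ray.data.range(3)  # rows have a single "id" column
    return ds.flat_map(expand_row).take_all()


# Local check of the semantics, without starting Ray.
local_rows = [{"id": i} for i in range(3)]
flattened = flat_map_reference(local_rows, expand_row)
```

Because each input row can produce any number of output rows, the resulting dataset may be larger or smaller than the input.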
- Parameters
fn – The function or generator to apply to each record, or a class type that can be instantiated to create such a callable. Callable classes are only supported for the actor compute strategy.
compute – The compute strategy, either “tasks” (default) to use Ray tasks, ray.data.ActorPoolStrategy(size=n) to use a fixed-size actor pool, or ray.data.ActorPoolStrategy(min_size=m, max_size=n) for an autoscaling actor pool.
num_cpus – The number of CPUs to reserve for each parallel map worker.
num_gpus – The number of GPUs to reserve for each parallel map worker. For example, specify num_gpus=1 to request 1 GPU for each parallel map worker.
ray_remote_args – Additional resource requirements to request from Ray for each map worker.
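The parameters above can be combined as in the following sketch, which passes a callable class together with an actor pool, as the fn description requires. `Repeater` and `run_with_actor_pool` are hypothetical names; the call follows the signature documented on this page and may differ in other Ray versions.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


class Repeater:
    """Hypothetical callable class; its state is built once per instance."""

    def __init__(self) -> None:
        # Stands in for expensive setup, such as loading a model.
        self.copies = 2

    def __call__(self, row: Row) -> List[Row]:
        # Return one copy of the input row per configured repetition.
        return [dict(row) for _ in range(self.copies)]


def run_with_actor_pool() -> List[Row]:
    """Apply Repeater on a fixed-size actor pool (requires ray)."""
    import ray

    ds = ray.data.range(4)
    return ds.flat_map(
        Repeater,  # pass the class itself, not an instance
        compute=ray.data.ActorPoolStrategy(size=2),
        num_cpus=1,  # reserve one CPU per map worker
    ).take_all()


# Local check of the callable, without starting Ray.
local_result = Repeater()({"id": 3})
```

Using a class keeps the setup cost out of the per-row call, which is the usual reason to choose the actor strategy over tasks.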
See also
map_batches()
Call this method to transform batches of data. It’s faster and more flexible than map() and flat_map().
map()
Call this method to transform one record at a time.
This method isn’t recommended because it’s slow; call map_batches() instead.
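As a rough illustration of the map_batches() recommendation, the sketch below expresses a simple one-to-many row expansion as a batch function. `expand_batch` and `run_with_map_batches` are hypothetical names. The sketch assumes NumPy is installed and that batches arrive as a dict of NumPy arrays when batch_format="numpy" is requested; check the map_batches() reference for the Ray version in use.

```python
from typing import Dict

import numpy as np


def expand_batch(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
    """Batch version of emitting every row twice."""
    # np.repeat duplicates each element in place: [0, 1] -> [0, 0, 1, 1].
    return {"id": np.repeat(batch["id"], 2)}


def run_with_map_batches():
    """Apply expand_batch through Ray Data (requires ray)."""
    import ray

    ds = ray.data.range(1000)
    return ds.map_batches(
        expand_batch,
        batch_format="numpy",  # assumed: dict of column name -> ndarray
        batch_size=256,        # the batch size is tunable here
    ).take(4)


# Local check of the batch function, without starting Ray.
sample = expand_batch({"id": np.array([0, 1, 2])})
```

The batch function processes a whole column at once instead of being called once per row, which is where the performance difference is expected to come from.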