ray.data.Dataset.filter
- Dataset.filter(fn: Union[Callable[[Dict[str, Any]], bool], Callable[[Dict[str, Any]], Iterator[bool]], _CallableClassProtocol], *, compute: Union[str, ray.data._internal.compute.ComputeStrategy] = None, **ray_remote_args) → Dataset
Filter out rows that don’t satisfy the given predicate.
Tip
If you can represent your predicate with NumPy or pandas operations, Dataset.map_batches() might be faster. You can implement filter by dropping rows.
Examples
>>> import ray
>>> ds = ray.data.range(100)
>>> ds.filter(lambda row: row["id"] % 2 == 0).take_all()
[{'id': 0}, {'id': 2}, {'id': 4}, ...]
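The tip about Dataset.map_batches() can be sketched as follows. This is a minimal illustration, not the library's own implementation. It assumes the "id" column produced by ray.data.range and the "numpy" batch format, in which each batch is a dict mapping column names to NumPy arrays. The names keep_even_ids and filter_even_with_map_batches are hypothetical helpers introduced here.

```python
import numpy as np


def keep_even_ids(batch):
    """Drop rows whose "id" is odd, one whole batch at a time."""
    # Assumes a "numpy"-format batch: dict of column name -> np.ndarray.
    mask = batch["id"] % 2 == 0
    # Apply the same boolean mask to every column so rows stay aligned.
    return {name: column[mask] for name, column in batch.items()}


def filter_even_with_map_batches(ds):
    """Hypothetical wrapper: a vectorized stand-in for the row-wise filter."""
    return ds.map_batches(keep_even_ids, batch_format="numpy")


# The batch function itself can be checked without a Ray cluster.
sample = {"id": np.arange(6)}
filtered = keep_even_ids(sample)
```

Applied to a dataset such as ray.data.range(100), filter_even_with_map_batches(ds) should keep the same rows as the row-wise predicate, while evaluating the condition once per batch rather than once per row.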
Time complexity: O(dataset size / parallelism)
- Parameters
fn – The predicate to apply to each row, or a class type that can be instantiated to create such a callable. Callable classes are only supported for the actor compute strategy.
compute – The compute strategy: either “tasks” (default) to use Ray tasks, ray.data.ActorPoolStrategy(size=n) to use a fixed-size actor pool, or ray.data.ActorPoolStrategy(min_size=m, max_size=n) for an autoscaling actor pool.
ray_remote_args – Additional resource requirements to request from Ray (e.g., num_gpus=1 to request GPUs for the map tasks).
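The parameters above might be combined as in the following sketch, which passes a callable class together with an actor pool and one extra resource argument. It assumes the signature documented on this page; other Ray releases may expose different arguments. IdBelowThreshold, its threshold of 10, and filter_with_actor_pool are hypothetical names chosen for illustration.

```python
class IdBelowThreshold:
    """Hypothetical callable-class predicate with a no-argument constructor."""

    def __init__(self):
        # With the actor strategy this runs when the class is instantiated,
        # so costly setup can live here instead of in the per-row call.
        self.threshold = 10  # illustrative value, not from the Ray docs

    def __call__(self, row):
        # Keep only rows whose "id" is below the threshold.
        return row["id"] < self.threshold


def filter_with_actor_pool(ds, pool_size=2):
    # Imported lazily so the predicate can be exercised without Ray installed.
    import ray

    return ds.filter(
        IdBelowThreshold,  # pass the class itself, not an instance
        compute=ray.data.ActorPoolStrategy(size=pool_size),
        num_cpus=1,  # forwarded through **ray_remote_args
    )


# The predicate can be checked locally on plain dict rows.
predicate = IdBelowThreshold()
kept = [row for row in ({"id": i} for i in range(20)) if predicate(row)]
```

Because callable classes are only supported for the actor compute strategy, the class is paired with ActorPoolStrategy here rather than the default “tasks” strategy.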