Dataset.filter(fn: Union[Callable[[Dict[str, Any]], bool], Callable[[Dict[str, Any]], Iterator[bool]], _CallableClassProtocol], *, compute: Union[str, ray.data._internal.compute.ComputeStrategy] = None, **ray_remote_args) → Dataset

Filter out records that do not satisfy the given predicate.

For better performance, consider using .map_batches(), which processes records in batches; you can implement a filter there by returning each batch with the unwanted records dropped.


>>> import ray
>>> ds = ray.data.range(100)
>>> ds.filter(lambda x: x["id"] % 2 == 0)
Filter
+- Dataset(num_blocks=..., num_rows=100, schema={id: int64})
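
The .map_batches() alternative mentioned above can be sketched as follows. This is a minimal illustration rather than the library's recommended implementation: keep_even_batch and main are hypothetical names, and the sketch assumes batches arrive as a dict of NumPy arrays (requested here with batch_format="numpy").

```python
from typing import Dict

import numpy as np


def keep_even_batch(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
    """Drop rows whose "id" is odd, one whole batch at a time (hypothetical helper)."""
    mask = batch["id"] % 2 == 0
    # Apply the same boolean mask to every column so rows stay aligned.
    return {name: column[mask] for name, column in batch.items()}


def main() -> None:
    # Ray is imported lazily so the helper above can be exercised without it.
    import ray

    ds = ray.data.range(100)
    # Assumption: batches are passed as a dict of NumPy arrays.
    evens = ds.map_batches(keep_even_batch, batch_format="numpy")
    print(evens.count())
```

Because the predicate is evaluated once per batch with a vectorized mask instead of once per record, the per-record Python call overhead is avoided.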

Time complexity: O(dataset size / parallelism)

  • fn – The predicate to apply to each record, or a class type that can be instantiated to create such a callable. Callable classes are only supported for the actor compute strategy.

  • compute – The compute strategy: either "tasks" (default) to use Ray tasks, ray.data.ActorPoolStrategy(size=n) to use a fixed-size actor pool, or ray.data.ActorPoolStrategy(min_size=m, max_size=n) to use an autoscaling actor pool.

  • ray_remote_args – Additional resource requirements to request from Ray (e.g., num_gpus=1 to request GPUs for the map tasks).
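
As a rough sketch of how fn and compute combine, a callable class can be passed as the predicate together with an actor pool. IsEven and main are hypothetical names, and the sketch assumes the class is constructible without arguments; the pool size of 2 is arbitrary.

```python
from typing import Any, Dict


class IsEven:
    """Callable-class predicate (hypothetical); built with no constructor arguments."""

    def __init__(self) -> None:
        # Stand-in for one-time setup that each actor would perform once.
        self.divisor = 2

    def __call__(self, row: Dict[str, Any]) -> bool:
        # Keep the record only when its "id" is divisible by the divisor.
        return row["id"] % self.divisor == 0


def main() -> None:
    # Ray is imported lazily so the class above can be exercised without it.
    import ray

    ds = ray.data.range(100)
    # Pass the class itself, not an instance; callable classes need the
    # actor compute strategy.
    evens = ds.filter(IsEven, compute=ray.data.ActorPoolStrategy(size=2))
    print(evens.count())
```

With the default "tasks" strategy, a plain function or lambda is the appropriate form of fn.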