Dataset.filter(fn: Union[Callable[[ray.data.block.T], ray.data.block.U], _CallableClassProtocol[T, U]], *, compute: Union[str, ray.data._internal.compute.ComputeStrategy] = None, **ray_remote_args) → Dataset[T]

Filter out records that do not satisfy the given predicate.

Consider using .map_batches() for better performance; a filter can be implemented there by returning each batch with the unwanted records dropped.
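As a minimal sketch of the batch-level approach, the function below drops records from a whole batch at once. The name keep_even is hypothetical, and the code assumes each batch arrives as a plain Python list of ints; the batch type that .map_batches() actually passes depends on the Ray version and its batch format setting, so this is an illustration rather than the library's prescribed pattern.

```python
from typing import List


def keep_even(batch: List[int]) -> List[int]:
    """Hypothetical batch-level filter: return only the records that pass.

    Assumes the batch is a plain list of ints.
    """
    return [x for x in batch if x % 2 == 0]


# Local check without Ray: apply the function to one hand-made batch.
result = keep_even(list(range(10)))
print(result)  # [0, 2, 4, 6, 8]

# With Ray, a function of this shape could be passed to map_batches,
# assuming batches are delivered as lists:
#   ds.map_batches(keep_even)
```

The per-record predicate runs inside a single call per batch, which is where the performance benefit would come from.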


>>> import ray
>>> ds = ray.data.range(100)
>>> ds.filter(lambda x: x % 2 == 0)
+- Dataset(num_blocks=..., num_rows=100, schema=<class 'int'>)

Time complexity: O(dataset size / parallelism)

  • fn – The predicate to apply to each record, or a class type that can be instantiated to create such a callable. Callable classes are only supported for the actor compute strategy.

  • compute – The compute strategy: either "tasks" (the default) to use Ray tasks, or "actors" to use an autoscaling actor pool. To configure the minimum or maximum size of the autoscaling actor pool, pass an ActorPoolStrategy(min, max) instance instead. If fn is a callable class, the actor compute strategy must be used.

  • ray_remote_args – Additional resource requirements to request from Ray (e.g., num_gpus=1 to request GPUs for the map tasks).
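The callable-class form of fn described above can be sketched as follows. The class name IsEven is hypothetical, and the sketch assumes the class is constructed with no arguments. The Ray calls appear only in comments, and the ray.data.ActorPoolStrategy import path is an assumption.

```python
class IsEven:
    """Hypothetical callable-class predicate with a no-argument constructor."""

    def __init__(self) -> None:
        # One-time setup would go here (for example, loading a lookup table).
        self.divisor = 2

    def __call__(self, record: int) -> bool:
        # Return True to keep the record, False to drop it.
        return record % self.divisor == 0


# Local check without Ray: instantiate once, then apply per record.
predicate = IsEven()
kept = [x for x in range(10) if predicate(x)]
print(kept)  # [0, 2, 4, 6, 8]

# With Ray, the class itself (not an instance) would be passed, together
# with the actor compute strategy:
#   ds.filter(IsEven, compute="actors")
#   ds.filter(IsEven, compute=ray.data.ActorPoolStrategy(2, 4))  # min=2, max=4
```

A callable class is mainly useful when the predicate needs setup that is expensive to repeat for every record; for a simple check like this one, a lambda with the default "tasks" strategy is enough.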