ray.data.Dataset.filter#

Dataset.filter(fn: Callable[[Dict[str, Any]], bool] | Callable[[Dict[str, Any]], Iterator[bool]] | _CallableClassProtocol | None = None, expr: str | Expr | None = None, *, compute: str | ComputeStrategy = None, fn_args: Iterable[Any] | None = None, fn_kwargs: Dict[str, Any] | None = None, fn_constructor_args: Iterable[Any] | None = None, fn_constructor_kwargs: Dict[str, Any] | None = None, num_cpus: float | None = None, num_gpus: float | None = None, memory: float | None = None, concurrency: int | Tuple[int, int] | Tuple[int, int, int] | None = None, ray_remote_args_fn: Callable[[], Dict[str, Any]] | None = None, **ray_remote_args) Dataset[source]#

Filter out rows that don’t satisfy the given predicate.

You can use either a function or a callable class or an expression to perform the transformation. For functions, Ray Data uses stateless Ray tasks. For classes, Ray Data uses stateful Ray actors. For more information, see Stateful Transforms.

Tip

If you use the expr parameter with a predicate expression, Ray Data optimizes your filter with native Arrow interfaces.

Deprecated since version String: expressions are deprecated and will be removed in a future version. Use predicate expressions from ray.data.expressions instead.

Examples

>>> import ray
>>> from ray.data.expressions import col
>>> ds = ray.data.range(100)
>>> # String expressions (deprecated - will warn)
>>> ds.filter(expr="id <= 4").take_all()
[{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}]
>>> # Using predicate expressions (preferred)
>>> ds.filter(expr=(col("id") > 10) & (col("id") < 20)).take_all()
[{'id': 11}, {'id': 12}, {'id': 13}, {'id': 14}, {'id': 15}, {'id': 16}, {'id': 17}, {'id': 18}, {'id': 19}]

Time complexity: O(dataset size / parallelism)

Parameters:
  • fn – The predicate to apply to each row, or a class type that can be instantiated to create such a callable.

  • expr – An expression that represents a predicate (boolean condition) for filtering. Can be either a string expression (deprecated) or a predicate expression from ray.data.expressions.

  • fn_args – Positional arguments to pass to fn after the first argument. These arguments are top-level arguments to the underlying Ray task.

  • fn_kwargs – Keyword arguments to pass to fn. These arguments are top-level arguments to the underlying Ray task.

  • fn_constructor_args – Positional arguments to pass to fn’s constructor. You can only provide this if fn is a callable class. These arguments are top-level arguments in the underlying Ray actor construction task.

  • fn_constructor_kwargs – Keyword arguments to pass to fn’s constructor. This can only be provided if fn is a callable class. These arguments are top-level arguments in the underlying Ray actor construction task.

  • compute

    The compute strategy to use for the map operation.

    • If compute is not specified for a function, will use ray.data.TaskPoolStrategy() to launch concurrent tasks based on the available resources and number of input blocks.

    • Use ray.data.TaskPoolStrategy(size=n) to launch at most n concurrent Ray tasks.

    • If compute is not specified for a callable class, will use ray.data.ActorPoolStrategy(min_size=1, max_size=None) to launch an autoscaling actor pool from 1 to unlimited workers.

    • Use ray.data.ActorPoolStrategy(size=n) to use a fixed size actor pool of n workers.

    • Use ray.data.ActorPoolStrategy(min_size=m, max_size=n) to use an autoscaling actor pool from m to n workers.

    • Use ray.data.ActorPoolStrategy(min_size=m, max_size=n, initial_size=initial) to use an autoscaling actor pool from m to n workers, with an initial size of initial.

  • num_cpus – The number of CPUs to reserve for each parallel map worker.

  • num_gpus – The number of GPUs to reserve for each parallel map worker. For example, specify num_gpus=1 to request 1 GPU for each parallel map worker.

  • memory – The heap memory in bytes to reserve for each parallel map worker.

  • concurrency – This argument is deprecated. Use compute argument.

  • ray_remote_args_fn – A function that returns a dictionary of remote args passed to each map worker. The purpose of this argument is to generate dynamic arguments for each actor/task, and will be called each time prior to initializing the worker. Args returned from this dict will always override the args in ray_remote_args. Note: this is an advanced, experimental feature.

  • ray_remote_args – Additional resource requirements to request from Ray (e.g., num_gpus=1 to request GPUs for the map tasks). See ray.remote() for details.