ray.data.datasource.PathPartitionFilter

class ray.data.datasource.PathPartitionFilter(path_partition_parser: ray.data.datasource.partitioning.PathPartitionParser, filter_fn: Callable[[Dict[str, str]], bool])

Bases: object

Partition filter for path-based partition formats.

Used to explicitly keep or reject files based on a custom filter function that takes partition keys and values parsed from the file's path as input.

PublicAPI (beta): This API is in beta and may change before becoming stable.

static of(filter_fn: Callable[[Dict[str, str]], bool], style: ray.data.datasource.partitioning.PartitionStyle = PartitionStyle.HIVE, base_dir: Optional[str] = None, field_names: Optional[List[str]] = None, filesystem: Optional[pyarrow.fs.FileSystem] = None) -> PathPartitionFilter

Creates a path-based partition filter using a flattened argument list.

Parameters
  • filter_fn –

    Callback used to filter partitions. Takes a dictionary mapping partition keys to values as input. Unpartitioned files are denoted with an empty input dictionary. Returns True to read a file for that partition or False to skip it. Partition keys and values are always strings read from the filesystem path. For example, this removes all unpartitioned files:

    lambda d: True if d else False
    

    This raises an assertion error for any unpartitioned file found:

    def do_assert(val, msg):
        assert val, msg
        return True  # keep the file once the assertion passes
    
    lambda d: do_assert(d, "Expected all files to be partitioned!")
    

    And this reads only files from January 2022 partitions:

    lambda d: d["month"] == "January" and d["year"] == "2022"
    

  • style – The partition style: either HIVE or DIRECTORY.

  • base_dir – "/"-delimited base directory to start searching for partitions (exclusive). File paths outside of this directory will be considered unpartitioned. Specify None or an empty string to search for partitions in all file path directories.

  • field_names – The partition key names. Required for DIRECTORY partitioning. Optional for HIVE partitioning. When non-empty, the order and length of partition key field names must match the order and length of partition directories discovered. Partition key field names are not required to exist in the dataset schema.

  • filesystem – Filesystem that will be used for partition path file I/O.
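To make the interaction between style, base_dir, field_names, and filter_fn concrete, here is a minimal pure-Python sketch of the behavior described above. It is an illustration only, not Ray's implementation: the helpers parse_partitions and keep_paths, the string style names, and the sample paths are all hypothetical.

```python
from typing import Callable, Dict, List, Optional


def parse_partitions(
    path: str,
    style: str = "hive",
    base_dir: Optional[str] = None,
    field_names: Optional[List[str]] = None,
) -> Dict[str, str]:
    """Hypothetical helper: derive partition keys and values from a file path."""
    base = (base_dir or "").strip("/")
    rel = path.strip("/")
    if base:
        if not rel.startswith(base + "/"):
            return {}  # outside base_dir: treated as unpartitioned
        rel = rel[len(base):].strip("/")
    dirs = rel.split("/")[:-1]  # drop the file name, keep the directories
    if style == "hive":
        # Hive style: only "key=value" directories contribute partitions.
        return dict(d.split("=", 1) for d in dirs if "=" in d)
    # Directory style: names come from field_names, values from directories.
    if not dirs:
        return {}
    if field_names is None or len(field_names) != len(dirs):
        raise ValueError("field_names must match the partition directories")
    return dict(zip(field_names, dirs))


def keep_paths(
    paths: List[str],
    filter_fn: Callable[[Dict[str, str]], bool],
    **parse_kwargs,
) -> List[str]:
    """Hypothetical helper: keep the paths whose partitions pass filter_fn."""
    return [p for p in paths if filter_fn(parse_partitions(p, **parse_kwargs))]


paths = [
    "data/year=2022/month=January/part-0.csv",
    "data/year=2021/month=March/part-0.csv",
    "data/notes.csv",  # unpartitioned: filter_fn receives {}
]
jan_2022 = lambda d: d.get("year") == "2022" and d.get("month") == "January"
kept = keep_paths(paths, jan_2022)
```

Here kept contains only the January 2022 path. The unpartitioned file is rejected because the filter receives an empty dictionary for it.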

Returns

The new path-based partition filter.
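Because unpartitioned files are passed to filter_fn as an empty dictionary, a filter that indexes keys directly (for example d["month"]) raises a KeyError on such files. The sketch below shows one defensive way to write the January 2022 filter. The function name january_2022_only is hypothetical, and the "month" and "year" keys are assumed to match the partition names in your paths.

```python
from typing import Dict


def january_2022_only(partitions: Dict[str, str]) -> bool:
    """Keep only files under month=January, year=2022 partitions.

    Uses dict.get so that unpartitioned files (empty dict) are skipped
    instead of raising a KeyError.
    """
    return (
        partitions.get("month") == "January"
        and partitions.get("year") == "2022"
    )
```

A function like this can be passed as the filter_fn argument of PathPartitionFilter.of. Partition values are compared as strings because they are read from the path.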

property parser: ray.data.datasource.partitioning.PathPartitionParser

Returns the path partition parser for this filter.