ray.data.datasource.PathPartitionParser

class ray.data.datasource.PathPartitionParser(partitioning: ray.data.datasource.partitioning.Partitioning)

Bases: object

Partition parser for path-based partition formats.

Path-based partition formats embed all partition keys and values directly in their dataset file paths.

Two path partition formats are currently supported: HIVE and DIRECTORY.

For HIVE partitioning, all partition directories under the base directory are discovered based on the “{key1}={value1}/{key2}={value2}” naming convention. Key/value pairs do not need to appear in the same order across all paths. Directory names nested under the base directory that don’t follow this naming convention are considered unpartitioned. If a partition filter is defined, it is called with an empty input dictionary for each unpartitioned file.
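The HIVE rules above can be sketched in plain Python. This is a minimal illustration of the naming convention only, not Ray's implementation: the helper `parse_hive_path`, the filter `keep_2024`, and the sample paths are all hypothetical, and the sketch assumes a directory name counts as a partition only when it contains exactly one “=”.

```python
import posixpath
from typing import Dict


def parse_hive_path(path: str, base_dir: str = "") -> Dict[str, str]:
    """Illustrative sketch: collect key=value directory names under base_dir."""
    base = base_dir.strip("/")
    directory = posixpath.dirname(path.strip("/"))
    if base:
        if directory != base and not directory.startswith(base + "/"):
            return {}  # path lies outside the base directory: unpartitioned
        directory = directory[len(base):].strip("/")
    partitions = {}
    for name in directory.split("/"):
        # Assumption: names without exactly one "=" contribute no partition key.
        if name.count("=") == 1:
            key, value = name.split("=")
            partitions[key] = value
    return partitions


def keep_2024(partitions: Dict[str, str]) -> bool:
    """Hypothetical partition filter; unpartitioned files arrive as {}."""
    return partitions.get("year") == "2024"


# Key order may differ between paths; both yield the same keys and values.
print(parse_hive_path("data/year=2024/month=01/a.csv", base_dir="data"))
print(parse_hive_path("data/month=01/year=2024/b.csv", base_dir="data"))
# A directory that does not follow key=value naming yields an empty dict.
print(parse_hive_path("data/raw/c.csv", base_dir="data"))
print(keep_2024(parse_hive_path("data/raw/c.csv", base_dir="data")))
```

Because an unpartitioned file reaches the filter as an empty dictionary, a filter written with `dict.get` (as here) rejects such files instead of raising `KeyError`.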

For DIRECTORY partitioning, all directories under the base directory are interpreted as partition values of the form “{value1}/{value2}”. An ordered list of partition field names must also be provided, and the order and length of the partition values in every path must match the order and length of the field names. Files stored directly in the base directory are considered unpartitioned. If a partition filter is defined, it is called with an empty input dictionary for each unpartitioned file.

For example, if the base directory is “foo”, then “foo.csv” and “foo/bar.csv” are unpartitioned files, but “foo/bar/baz.csv” is associated with partition “bar”. If the base directory is undefined, then “foo.csv” is unpartitioned, “foo/bar.csv” is associated with partition “foo”, and “foo/bar/baz.csv” is associated with partition (“foo”, “bar”).
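The DIRECTORY examples above can be reproduced with a small standalone sketch. It is not Ray's implementation: the helper `parse_dir_path` and the field names “a” and “b” are hypothetical, and raising `ValueError` on a length mismatch is an assumption about how the "must match" requirement might be enforced.

```python
import posixpath
from typing import Dict, List


def parse_dir_path(
    path: str, field_names: List[str], base_dir: str = ""
) -> Dict[str, str]:
    """Illustrative sketch: map directory names under base_dir to field names."""
    base = base_dir.strip("/")
    directory = posixpath.dirname(path.strip("/"))
    if base:
        if directory != base and not directory.startswith(base + "/"):
            return {}  # path lies outside the base directory: unpartitioned
        directory = directory[len(base):].strip("/")
    values = [v for v in directory.split("/") if v]
    if not values:
        return {}  # file sits directly in the base directory: unpartitioned
    if len(values) != len(field_names):
        # Assumption: a mismatch is treated as an error in this sketch.
        raise ValueError(
            f"expected {len(field_names)} partition values, got {len(values)}"
        )
    return dict(zip(field_names, values))


# Base directory "foo": only the nested file is partitioned.
print(parse_dir_path("foo.csv", ["a"], base_dir="foo"))
print(parse_dir_path("foo/bar.csv", ["a"], base_dir="foo"))
print(parse_dir_path("foo/bar/baz.csv", ["a"], base_dir="foo"))
# Undefined base directory: every directory in the path is a partition value.
print(parse_dir_path("foo/bar.csv", ["a"]))
print(parse_dir_path("foo/bar/baz.csv", ["a", "b"]))
```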

DeveloperAPI: This API may change across minor Ray releases.

static of(style: ray.data.datasource.partitioning.PartitionStyle = PartitionStyle.HIVE, base_dir: Optional[str] = None, field_names: Optional[List[str]] = None, filesystem: Optional[pyarrow.fs.FileSystem] = None) PathPartitionParser

Creates a path-based partition parser using a flattened argument list.

Parameters
  • style – The partition style: either HIVE or DIRECTORY.

  • base_dir – “/”-delimited base directory to start searching for partitions (exclusive). File paths outside of this directory will be considered unpartitioned. Specify None or an empty string to search for partitions in all file path directories.

  • field_names – The partition key names. Required for DIRECTORY partitioning. Optional for HIVE partitioning. When non-empty, the order and length of partition key field names must match the order and length of partition directories discovered. Partition key field names are not required to exist in the dataset schema.

  • filesystem – Filesystem that will be used for partition path file I/O.

Returns

The new path-based partition parser.

property scheme: ray.data.datasource.partitioning.Partitioning

Returns the partitioning for this parser.