ray.data.datasource.Partitioning

class ray.data.datasource.Partitioning(style: PartitionStyle, base_dir: str | None = None, field_names: List[str] | None = None, field_types: Dict[str, Type[int | float | str | bool]] | None = None, filesystem: pyarrow.fs.FileSystem | None = None)

Partition scheme used to describe path-based partitions.

Path-based partition formats embed all partition keys and values directly in their dataset file paths.

For example, to read a dataset with Hive-style partitions:

>>> import ray
>>> from ray.data.datasource.partitioning import Partitioning
>>> ds = ray.data.read_csv(
...     "s3://anonymous@ray-example-data/iris.csv",
...     partitioning=Partitioning("hive"),
... )

Instead, if your files are arranged in a directory structure such as:

root/dog/dog_0.jpeg
root/dog/dog_1.jpeg
...

root/cat/cat_0.jpeg
root/cat/cat_1.jpeg
...

Then you can use directory-based partitioning:

>>> import ray
>>> from ray.data.datasource.partitioning import Partitioning
>>> root = "s3://anonymous@air-example-data/cifar-10/images"
>>> partitioning = Partitioning("dir", field_names=["class"], base_dir=root)
>>> ds = ray.data.read_images(root, partitioning=partitioning)
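
To make the two layouts concrete, the following pure-Python sketch shows how a file path might be mapped to partition fields under each style. It is only an illustration and not Ray's parser: `parse_hive_path`, `parse_dir_path`, and `_relative_dirs` are hypothetical helpers, and the `year=.../month=...` path is a made-up example.

```python
from typing import Dict, List


def _relative_dirs(path: str, base_dir: str) -> List[str]:
    """Return the directory components of `path` below `base_dir` (file name dropped)."""
    base = base_dir.strip("/")
    trimmed = path.strip("/")
    if base:
        if not trimmed.startswith(base + "/"):
            # Path lies outside the base directory: this sketch treats it as unpartitioned.
            return []
        trimmed = trimmed[len(base) + 1:]
    return trimmed.split("/")[:-1]


def parse_hive_path(path: str, base_dir: str = "") -> Dict[str, str]:
    """Hive style: every partition directory is named `key=value`."""
    fields = {}
    for part in _relative_dirs(path, base_dir):
        key, sep, value = part.partition("=")
        if sep:  # Skip directories that are not of the form key=value.
            fields[key] = value
    return fields


def parse_dir_path(path: str, field_names: List[str], base_dir: str = "") -> Dict[str, str]:
    """Directory style: directory names are the values; keys come from `field_names`."""
    dirs = _relative_dirs(path, base_dir)
    if not dirs:
        return {}
    if len(dirs) != len(field_names):
        raise ValueError(f"expected {len(field_names)} partition directories, got {len(dirs)}")
    return dict(zip(field_names, dirs))


hive_fields = parse_hive_path("root/year=2024/month=01/data.csv", base_dir="root")
dir_fields = parse_dir_path("root/dog/dog_0.jpeg", ["class"], base_dir="root")
print(hive_fields)  # {'year': '2024', 'month': '01'}
print(dir_fields)   # {'class': 'dog'}
```

In the Hive layout the keys travel with the path, so no field names are needed. In the directory layout only the values appear in the path, which is why the example above passes `field_names=["class"]`.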

DeveloperAPI: This API may change across minor Ray releases.

Attributes

base_dir

"/"-delimited base directory that all partitioned paths should exist under (exclusive).

field_names

The partition key field names (i.e. column names for tabular datasets).

field_types

A dictionary that maps partition key names to their desired data type.

filesystem

Filesystem that will be used for partition path file I/O.

normalized_base_dir

Returns the base directory normalized for compatibility with a filesystem.

resolved_filesystem

Returns the filesystem resolved for compatibility with a base directory.

style

The partition style: either HIVE or DIRECTORY (passed as "hive" and "dir" in the examples above).
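
Values taken from a path are plain strings, so a field_types mapping like the one described above implies a conversion step. The sketch below shows one way that step could look. `cast_fields` is a hypothetical helper, not part of Ray. Two behaviors are assumptions of this sketch rather than documented Ray behavior: bool values are compared against "true", and fields missing from the mapping stay as strings.

```python
from typing import Dict, Type, Union

FieldType = Type[Union[int, float, str, bool]]


def cast_fields(raw: Dict[str, str], field_types: Dict[str, FieldType]) -> Dict[str, object]:
    """Cast string partition values to the types requested in `field_types`."""
    typed = {}
    for name, value in raw.items():
        target = field_types.get(name, str)  # Assumption: unmapped fields stay strings.
        if target is bool:
            # bool("False") would be True, so compare the text instead (sketch assumption).
            typed[name] = value.strip().lower() == "true"
        else:
            typed[name] = target(value)
    return typed


typed = cast_fields(
    {"year": "2024", "score": "0.5", "valid": "False", "label": "dog"},
    {"year": int, "score": float, "valid": bool},
)
print(typed)  # {'year': 2024, 'score': 0.5, 'valid': False, 'label': 'dog'}
```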