ray.data.datasource.Partitioning
class ray.data.datasource.Partitioning(style: PartitionStyle, base_dir: str | None = None, field_names: List[str] | None = None, field_types: Dict[str, Type[int | float | str | bool]] | None = None, filesystem: pyarrow.fs.FileSystem | None = None)
Partition scheme used to describe path-based partitions.
Path-based partition formats embed all partition keys and values directly in their dataset file paths.
For example, to read a dataset with Hive-style partitions:
>>> import ray
>>> from ray.data.datasource.partitioning import Partitioning
>>> ds = ray.data.read_csv(
...     "s3://anonymous@ray-example-data/iris.csv",
...     partitioning=Partitioning("hive"),
... )
Instead, if your files are arranged in a directory structure such as:
root/dog/dog_0.jpeg
root/dog/dog_1.jpeg
...
root/cat/cat_0.jpeg
root/cat/cat_1.jpeg
...
Then you can use directory-based partitioning:
>>> import ray
>>> from ray.data.datasource.partitioning import Partitioning
>>> root = "s3://anonymous@air-example-data/cifar-10/images"
>>> partitioning = Partitioning("dir", field_names=["class"], base_dir=root)
>>> ds = ray.data.read_images(root, partitioning=partitioning)
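To make the difference between the two styles concrete, the following is a minimal pure-Python sketch of how partition keys and values could be recovered from a file path. It is an illustration only, not Ray's implementation: the helper names `parse_hive_path`, `parse_dir_path`, and `_strip_base` are hypothetical, and the bucket path in the usage lines is made up for the example.

```python
from typing import Dict, List


def _strip_base(path: str, base_dir: str) -> str:
    """Return the part of `path` below `base_dir` (hypothetical helper)."""
    base = base_dir.rstrip("/")
    if base and path.startswith(base + "/"):
        return path[len(base) + 1:]
    return path


def parse_hive_path(path: str, base_dir: str = "") -> Dict[str, str]:
    """Hive style: every directory segment of the form key=value is a partition."""
    directories = _strip_base(path, base_dir).split("/")[:-1]  # drop the file name
    partitions = {}
    for segment in directories:
        if "=" in segment:
            key, value = segment.split("=", 1)
            partitions[key] = value
    return partitions


def parse_dir_path(path: str, field_names: List[str], base_dir: str = "") -> Dict[str, str]:
    """Directory style: segments carry only values; names come from field_names."""
    directories = [d for d in _strip_base(path, base_dir).split("/")[:-1] if d]
    if len(directories) != len(field_names):
        raise ValueError(
            f"Expected {len(field_names)} partition directories, got {len(directories)}"
        )
    return dict(zip(field_names, directories))


# Illustrative paths; the bucket name is an assumption made for this sketch.
hive_result = parse_hive_path(
    "s3://bucket/data/year=2022/month=01/part-0.csv", base_dir="s3://bucket/data"
)
dir_result = parse_dir_path("root/dog/dog_0.jpeg", field_names=["class"], base_dir="root")
```

In this sketch the Hive-style path carries its own key names, so no `field_names` are needed, whereas the directory-style path only carries values and relies on `field_names` and `base_dir` to interpret them. All values come back as strings.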
DeveloperAPI: This API may change across minor Ray releases.
Attributes
base_dir: "/"-delimited base directory that all partitioned paths should exist under (exclusive).
field_names: The partition key field names (i.e. column names for tabular datasets).
field_types: A dictionary that maps partition key names to their desired data type.
filesystem: Filesystem that will be used for partition path file I/O.
normalized_base_dir: Returns the base directory normalized for compatibility with a filesystem.
resolved_filesystem: Returns the filesystem resolved for compatibility with a base directory.
style: The partition style, either HIVE or DIRECTORY.
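Partition values are embedded in paths as text, so a mapping from partition key names to data types implies a conversion step. The sketch below shows one way such a conversion could work in plain Python. The function `cast_partition_values` is hypothetical and is not part of Ray; leaving unmapped keys as strings is an assumption of this sketch, not documented behavior.

```python
from typing import Any, Dict, Type, Union

PartitionType = Type[Union[int, float, str]]


def cast_partition_values(
    raw: Dict[str, str], field_types: Dict[str, PartitionType]
) -> Dict[str, Any]:
    """Cast string partition values to the types requested per key.

    Assumption of this sketch: keys missing from field_types stay as str.
    """
    typed = {}
    for key, value in raw.items():
        target_type = field_types.get(key, str)
        typed[key] = target_type(value)
    return typed


raw_values = {"year": "2022", "month": "01", "country": "US"}
typed_values = cast_partition_values(raw_values, {"year": int, "month": int})
```

Boolean types are left out of the sketch on purpose: calling `bool` on any non-empty string, including `"False"`, returns `True`, so a real conversion would need explicit parsing for that case.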