ray.data.datasource.ParquetMetadataProvider

class ray.data.datasource.ParquetMetadataProvider

Bases: ray.data.datasource.file_meta_provider.FileMetadataProvider

Abstract callable that provides block metadata for Arrow Parquet file fragments.

All file fragments should belong to a single dataset block.

Supports optional pre-fetching of ordered metadata for all file fragments in a single batch to help optimize metadata resolution.

Current subclasses:

DefaultParquetMetadataProvider

DeveloperAPI: This API may change across minor Ray releases.

prefetch_file_metadata(pieces: List[pyarrow.dataset.ParquetFileFragment], **ray_remote_args) → Optional[List[Any]]

Pre-fetches file metadata for all Parquet file fragments in a single batch.

Subsets of the returned metadata are passed as input to subsequent calls to _get_block_metadata(), together with their corresponding Parquet file fragments.

Implementations that don’t support pre-fetching file metadata shouldn’t override this method.

Parameters

pieces – The Parquet file fragments to fetch metadata for.

Returns

Metadata resolved for each input file fragment, or None. Metadata must be returned in the same order as the input file fragments, such that metadata[i] always contains the metadata for pieces[i].
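
The ordering requirement can be illustrated with a minimal sketch. It deliberately does not import Ray or PyArrow: FakeFragment and OrderedPrefetchSketch are hypothetical stand-ins that mirror only the prefetch_file_metadata signature shown above, and the dictionary used as "metadata" is an assumption made for illustration, not the type a real provider returns.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class FakeFragment:
    """Hypothetical stand-in for pyarrow.dataset.ParquetFileFragment."""

    path: str
    num_rows: int


class OrderedPrefetchSketch:
    """Hypothetical class; it mirrors the method signature only, not Ray's base class."""

    def prefetch_file_metadata(
        self, pieces: List[FakeFragment], **ray_remote_args
    ) -> Optional[List[Any]]:
        # Resolve metadata for every fragment in one pass, preserving input
        # order so that the i-th result describes the i-th fragment.
        return [{"path": p.path, "num_rows": p.num_rows} for p in pieces]


pieces = [
    FakeFragment("a.parquet", 10),
    FakeFragment("b.parquet", 20),
    FakeFragment("c.parquet", 5),
]
metadata = OrderedPrefetchSketch().prefetch_file_metadata(pieces)

# Because metadata[i] describes pieces[i], the fragments of one block and
# their metadata can be selected with the same slice.
block_pieces, block_metadata = pieces[1:], metadata[1:]
block_rows = sum(m["num_rows"] for m in block_metadata)
```

Keeping the two lists index-aligned is what allows a subset of the pre-fetched metadata to be handed over alongside the matching fragments without any lookup by path.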