ray.data.datasource.ParquetMetadataProvider
- class ray.data.datasource.ParquetMetadataProvider
Bases: ray.data.datasource.file_meta_provider.FileMetadataProvider
Abstract callable that provides block metadata for Arrow Parquet file fragments.
All file fragments should belong to a single dataset block.
Supports optional pre-fetching of ordered metadata for all file fragments in a single batch to help optimize metadata resolution.
- Current subclasses: DefaultParquetMetadataProvider
DeveloperAPI: This API may change across minor Ray releases.
- prefetch_file_metadata(pieces: List[pyarrow.dataset.ParquetFileFragment], **ray_remote_args) -> Optional[List[Any]]
Pre-fetches file metadata for all Parquet file fragments in a single batch.
Subsets of the metadata returned will be provided as input to subsequent calls to _get_block_metadata() together with their corresponding Parquet file fragments.
Implementations that don’t support pre-fetching file metadata shouldn’t override this method.
- Parameters
pieces – The Parquet file fragments to fetch metadata for.
- Returns
Metadata resolved for each input file fragment, or None. Metadata must be returned in the same order as all input file fragments, such that metadata[i] always contains the metadata for pieces[i].
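The ordering requirement on the return value can be illustrated with a small sketch that does not depend on Ray or PyArrow. Everything named here is hypothetical: `FakeFragment` stands in for `pyarrow.dataset.ParquetFileFragment`, `fetch_one` stands in for whatever per-file metadata lookup a real provider performs, and `prefetch_in_order` is a plain function rather than an override of `prefetch_file_metadata`. Returning `None` for an empty input is a choice made for this sketch only, not documented behavior.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class FakeFragment:
    """Hypothetical stand-in for pyarrow.dataset.ParquetFileFragment."""

    path: str
    num_rows: int


def fetch_one(fragment: FakeFragment) -> Dict[str, Any]:
    # Placeholder for a per-file metadata lookup (hypothetical helper).
    return {"path": fragment.path, "num_rows": fragment.num_rows}


def prefetch_in_order(
    pieces: List[FakeFragment],
) -> Optional[List[Dict[str, Any]]]:
    """Fetch metadata for all fragments in one batch, preserving input order."""
    if not pieces:
        # Sketch-only choice: signal "nothing pre-fetched" with None.
        return None
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Executor.map yields results in the order of its inputs, even when
        # the underlying tasks complete out of order.
        return list(pool.map(fetch_one, pieces))


pieces = [
    FakeFragment("a.parquet", 10),
    FakeFragment("b.parquet", 20),
    FakeFragment("c.parquet", 5),
]
metadata = prefetch_in_order(pieces)

# Each metadata entry lines up with the fragment at the same index.
for piece, meta in zip(pieces, metadata):
    print(piece.path, meta["num_rows"])
```

The sketch uses `Executor.map` rather than collecting futures as they complete, because completion order is not guaranteed to match input order. A real provider would need the same guarantee from whatever batching mechanism it uses.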