ray.data.datasource.DefaultParquetMetadataProvider

class ray.data.datasource.DefaultParquetMetadataProvider

Bases: ray.data.datasource.file_meta_provider.ParquetMetadataProvider

The default file metadata provider for ParquetDatasource.

Aggregates total block bytes and number of rows using the Parquet file metadata associated with a list of Arrow Parquet dataset file fragments.

DeveloperAPI: This API may change across minor Ray releases.
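As a rough illustration of the aggregation described above, the sketch below sums row counts and row-group byte sizes over a list of per-file metadata objects. It is not Ray's implementation. FakeRowGroup, FakeFileMetaData, and aggregate_rows_and_bytes are hypothetical names introduced here; the stand-in classes only mimic the num_rows, num_row_groups, and row_group(i).total_byte_size accessors of pyarrow.parquet.FileMetaData so that the example runs without any Parquet files.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class FakeRowGroup:
    # Hypothetical stand-in for a Parquet row group's metadata.
    total_byte_size: int


@dataclass
class FakeFileMetaData:
    # Hypothetical stand-in exposing the same accessors as
    # pyarrow.parquet.FileMetaData that this sketch relies on.
    num_rows: int
    row_groups: List[FakeRowGroup]

    @property
    def num_row_groups(self) -> int:
        return len(self.row_groups)

    def row_group(self, i: int) -> FakeRowGroup:
        return self.row_groups[i]


def aggregate_rows_and_bytes(metadata: Sequence[FakeFileMetaData]) -> Tuple[int, int]:
    """Sum row counts and row-group byte sizes across all files."""
    total_rows = sum(m.num_rows for m in metadata)
    total_bytes = sum(
        m.row_group(i).total_byte_size
        for m in metadata
        for i in range(m.num_row_groups)
    )
    return total_rows, total_bytes


files = [
    FakeFileMetaData(100, [FakeRowGroup(4000), FakeRowGroup(6000)]),
    FakeFileMetaData(50, [FakeRowGroup(5000)]),
]
total_rows, total_bytes = aggregate_rows_and_bytes(files)
print(total_rows, total_bytes)  # 150 15000
```

Because the function only touches those three accessors, real pyarrow.parquet.FileMetaData objects should work in place of the stand-ins, though that is untested here.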

prefetch_file_metadata(pieces: List[pyarrow.dataset.ParquetFileFragment], **ray_remote_args) -> Optional[List[pyarrow.parquet.FileMetaData]]

Pre-fetches file metadata for all Parquet file fragments in a single batch.

Subsets of the metadata returned will be provided as input to subsequent calls to _get_block_metadata() together with their corresponding Parquet file fragments.

Implementations that don’t support pre-fetching file metadata shouldn’t override this method.

Parameters

pieces – The Parquet file fragments to fetch metadata for.

Returns

Metadata resolved for each input file fragment, or None. Metadata must be returned in the same order as the input file fragments, such that metadata[i] always contains the metadata for pieces[i].
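The ordering contract can be sketched in plain Python. The example below is an assumption-laden illustration, not the Ray implementation: prefetch_in_order and fake_fetch are hypothetical helpers, strings stand in for ParquetFileFragment objects, and dicts stand in for FileMetaData. It fetches metadata concurrently while keeping metadata[i] aligned with pieces[i], then slices both lists identically, which is how matching subsets could later be handed to _get_block_metadata().

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List, Sequence


def prefetch_in_order(
    pieces: Sequence[Any],
    fetch_one: Callable[[Any], Any],
    max_workers: int = 4,
) -> List[Any]:
    """Fetch metadata for every piece, preserving input order.

    Executor.map yields results in the order of its inputs, regardless of
    which worker finishes first, so result[i] belongs to pieces[i].
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, pieces))


def fake_fetch(piece: str) -> dict:
    # Hypothetical fetcher: a real one would read the Parquet footer.
    return {"path": piece, "num_rows": len(piece)}


pieces = ["a.parquet", "bb.parquet", "ccc.parquet", "dddd.parquet"]
metadata = prefetch_in_order(pieces, fake_fetch)

# Slice fragments and metadata with the same indices so each subset
# stays paired with its own metadata.
chunk_size = 2
chunks = [
    (pieces[i:i + chunk_size], metadata[i:i + chunk_size])
    for i in range(0, len(pieces), chunk_size)
]
for chunk_pieces, chunk_metadata in chunks:
    print(chunk_pieces, [m["num_rows"] for m in chunk_metadata])
```

An order-preserving primitive such as Executor.map avoids having to re-sort results afterwards; a completion-ordered approach would need to carry each piece's index along with its result.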