ray.data.datasource.DefaultParquetMetadataProvider
- class ray.data.datasource.DefaultParquetMetadataProvider
Bases: ray.data.datasource.file_meta_provider.ParquetMetadataProvider
The default file metadata provider for ParquetDatasource.
Aggregates total block bytes and number of rows using the Parquet file metadata associated with a list of Arrow Parquet dataset file fragments.
DeveloperAPI: This API may change across minor Ray releases.
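The aggregation described above can be pictured with a short sketch. This is not Ray's implementation. It assumes metadata objects that expose `num_rows`, `num_row_groups`, and `row_group(i).total_byte_size`, as `pyarrow.parquet.FileMetaData` does. The stand-in classes `FakeRowGroup` and `FakeFileMetaData` and the helper `aggregate_metadata` are hypothetical names, used so the example runs without reading any Parquet files.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FakeRowGroup:
    """Hypothetical stand-in for a Parquet row group's metadata."""
    total_byte_size: int


@dataclass
class FakeFileMetaData:
    """Hypothetical stand-in mirroring the FileMetaData attributes used below."""
    num_rows: int
    row_groups: List[FakeRowGroup]

    @property
    def num_row_groups(self) -> int:
        return len(self.row_groups)

    def row_group(self, i: int) -> FakeRowGroup:
        return self.row_groups[i]


def aggregate_metadata(metadata) -> Tuple[int, int]:
    """Sum row counts and row-group byte sizes across per-file metadata."""
    total_rows = sum(m.num_rows for m in metadata)
    total_bytes = sum(
        m.row_group(i).total_byte_size
        for m in metadata
        for i in range(m.num_row_groups)
    )
    return total_rows, total_bytes


# Two files: one with two row groups, one with a single row group.
files = [
    FakeFileMetaData(100, [FakeRowGroup(4000), FakeRowGroup(6000)]),
    FakeFileMetaData(50, [FakeRowGroup(2500)]),
]
total_rows, total_bytes = aggregate_metadata(files)
```

Because the helper only relies on attribute names, real `FileMetaData` objects could presumably be passed in place of the stand-ins.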
- prefetch_file_metadata(pieces: List[pyarrow.dataset.ParquetFileFragment], **ray_remote_args) → Optional[List[pyarrow.parquet.FileMetaData]]
Pre-fetches file metadata for all Parquet file fragments in a single batch.
Subsets of the metadata returned will be provided as input to subsequent calls to _get_block_metadata() together with their corresponding Parquet file fragments.
Implementations that don’t support pre-fetching file metadata shouldn’t override this method.
- Parameters
pieces – The Parquet file fragments to fetch metadata for.
- Returns
Metadata resolved for each input file fragment, or None. Metadata must be returned in the same order as all input file fragments, such that metadata[i] always contains the metadata for pieces[i].
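The ordering requirement can be illustrated with a minimal sketch of a batched fetch that keeps results aligned with inputs. It is not how Ray fetches metadata. The helper `prefetch_in_order` and the `fetch_one` callback are hypothetical, the pieces are plain strings rather than `ParquetFileFragment` objects, and returning `None` for empty input is a choice made only for this sketch. The sketch relies on `Executor.map` yielding results in input order even when the calls run concurrently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence, TypeVar

P = TypeVar("P")  # piece type (a file fragment in the real API)
M = TypeVar("M")  # metadata type


def prefetch_in_order(
    pieces: Sequence[P],
    fetch_one: Callable[[P], M],
    max_workers: int = 4,
) -> Optional[List[M]]:
    """Fetch metadata for all pieces in one batch, preserving input order."""
    if not pieces:
        # Sketch-only convention: nothing to fetch, so return None.
        return None
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the order of its inputs,
        # so result[i] corresponds to pieces[i].
        return list(pool.map(fetch_one, pieces))


# Hypothetical pieces and a fake fetch function that echoes the piece name.
pieces = ["a.parquet", "b.parquet", "c.parquet"]
metadata = prefetch_in_order(pieces, lambda path: {"path": path})
```

A caller can then pair `pieces[i]` with `metadata[i]` directly, which is the property the return contract asks implementations to preserve.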