class ray.data.datasource.ParquetDatasource(*args, **kwds)

Bases: ray.data.datasource.parquet_base_datasource.ParquetBaseDatasource

Parquet datasource, for reading and writing Parquet files.

The primary difference from ParquetBaseDatasource is that this class uses PyArrow’s ParquetDataset abstraction for dataset reads. This provides automatic Arrow dataset schema inference and row count collection, at the cost of potential performance and/or compatibility penalties.


>>> import ray
>>> from ray.data.datasource import ParquetDatasource
>>> source = ParquetDatasource() 
>>> ray.data.read_datasource( 
...     source, paths="/path/to/dir").take()
[{"a": 1, "b": "foo"}, ...]

PublicAPI: This API is stable across Ray releases.


Return a Reader for the given read arguments.

The reader object is responsible for querying the read metadata and for generating the read tasks that retrieve the data blocks on request.


read_args – Additional keyword arguments to pass to the datasource implementation.
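
The division of labor described above (a reader that inspects metadata first and hands out deferred read tasks) can be illustrated with a small, self-contained sketch. The classes SketchReader and SketchReadTask, the estimate_num_rows method, and the in-memory "files" are hypothetical stand-ins invented for this example; they are not Ray classes and do not reflect Ray's actual implementation. Only the general pattern is meant to carry over.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SketchReadTask:
    """Hypothetical read task: a deferred read plus metadata known up front."""
    read_fn: Callable[[], List[dict]]
    num_rows: int


class SketchReader:
    """Hypothetical reader: answers metadata queries and plans read tasks."""

    def __init__(self, files: Dict[str, List[dict]]):
        # Maps a made-up file path to the rows it contains.
        self._files = files

    def estimate_num_rows(self) -> int:
        # Metadata query: answered without executing any read task.
        return sum(len(rows) for rows in self._files.values())

    def get_read_tasks(self, parallelism: int) -> List[SketchReadTask]:
        paths = sorted(self._files)
        n = max(1, min(parallelism, len(paths)))
        # Distribute the paths round-robin across n tasks.
        groups = [paths[i::n] for i in range(n)]
        tasks = []
        for group in groups:
            # Bind the group as a default argument so each task keeps its own.
            def read_fn(group=group):
                return [row for p in group for row in self._files[p]]

            num_rows = sum(len(self._files[p]) for p in group)
            tasks.append(SketchReadTask(read_fn, num_rows))
        return tasks


reader = SketchReader({
    "f0.parquet": [{"a": 1, "b": "foo"}],
    "f1.parquet": [{"a": 2, "b": "bar"}, {"a": 3, "b": "baz"}],
    "f2.parquet": [{"a": 4, "b": "qux"}],
})
tasks = reader.get_read_tasks(parallelism=2)

# Nothing has been read yet; each block is produced only when its task runs.
blocks = [task.read_fn() for task in tasks]
```

Keeping the row count on the task itself means a scheduler can reason about block sizes before any task executes, which is the role the read metadata plays in the description above.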