ray.data.datasource.ParquetDatasource

class ray.data.datasource.ParquetDatasource(*args, **kwds)

Parquet datasource, for reading and writing Parquet files.

The primary difference from ParquetBaseDatasource is that this datasource uses PyArrow's ParquetDataset abstraction for dataset reads, which enables automatic Arrow dataset schema inference and row-count collection at the cost of potential performance and compatibility penalties.

Examples

>>> import ray
>>> from ray.data.datasource import ParquetDatasource
>>> source = ParquetDatasource()
>>> ray.data.read_datasource(
...     source, paths="/path/to/dir").take()
[{'a': 1, 'b': 'foo'}, ...]

PublicAPI: This API is stable across Ray releases.

__init__()

Methods

__init__()

create_reader(**kwargs)

Return a Reader for the given read arguments. See the read-path sketch after this list.

do_write(blocks, metadata, path, dataset_uuid)

Creates and returns write tasks for a file-based datasource. See the write-path sketch after this list.

file_extension_filter()

on_write_complete(write_results, **kwargs)

Callback for when a write job completes.

on_write_failed(write_results, error, **kwargs)

Callback for when a write job fails.

prepare_read(parallelism, **read_args)

Deprecated: Please implement create_reader() instead.
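
In normal use, ray.data.read_datasource() drives the read path; the sketch below calls the pieces directly only to illustrate the flow. It assumes that create_reader() accepts the same read arguments as read_datasource(), and that the returned Reader exposes estimate_inmemory_data_size() and get_read_tasks() per the ray.data.datasource.Reader interface:

>>> source = ParquetDatasource()
>>> reader = source.create_reader(paths="/path/to/dir")
>>> reader.estimate_inmemory_data_size()
>>> # Each ReadTask, when executed, yields blocks of the dataset.
>>> read_tasks = reader.get_read_tasks(parallelism=4)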
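
On the write side, do_write() and the on_write_* callbacks are invoked by Dataset.write_datasource() rather than called directly. A minimal sketch of a subclass that logs write outcomes; the logging is illustrative and not part of Ray, and the output path is a placeholder:

>>> class LoggingParquetDatasource(ParquetDatasource):
...     def on_write_complete(self, write_results, **kwargs):
...         # Illustrative bookkeeping, then hand off to the base class.
...         print(f"Wrote {len(write_results)} blocks")
...         super().on_write_complete(write_results, **kwargs)
...     def on_write_failed(self, write_results, error, **kwargs):
...         print(f"Write failed: {error}")
...         super().on_write_failed(write_results, error, **kwargs)
>>> ds = ray.data.from_items([{"a": 1, "b": "foo"}])
>>> ds.write_datasource(
...     LoggingParquetDatasource(), path="/path/to/output")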