class ray.data.datasource.ParquetDatasource[source]#

Bases: ray.data.datasource.parquet_base_datasource.ParquetBaseDatasource

Parquet datasource, for reading and writing Parquet files.

The primary difference from ParquetBaseDatasource is that this datasource uses PyArrow’s ParquetDataset abstraction for reads, which provides automatic schema inference and row count collection from Parquet file metadata, at the cost of potential performance and/or compatibility penalties.


>>> import ray
>>> from ray.data.datasource import ParquetDatasource
>>> source = ParquetDatasource() 
>>> ray.data.read_datasource( 
...     source, paths="/path/to/dir").take()
[{'a': 1, 'b': 'foo'}, ...]

PublicAPI: This API is stable across Ray releases.



do_write(blocks, metadata, ray_remote_args, ...)

Launch Ray tasks for writing blocks out to the datasource.


get_name()

Return a human-readable name for this datasource.

on_write_complete(write_results, **kwargs)

Callback for when a write job completes.

on_write_failed(write_results, error, **kwargs)

Callback for when a write job fails.

prepare_read(parallelism, **read_args)

Deprecated: Please implement create_reader() instead.

write(blocks, ctx, path, dataset_uuid[, ...])

Write blocks for a file-based datasource.