ray.data.read_parquet

ray.data.read_parquet(paths: Union[str, List[str]], *, filesystem: Optional[pyarrow.fs.FileSystem] = None, columns: Optional[List[str]] = None, parallelism: int = -1, ray_remote_args: Dict[str, Any] = None, tensor_column_schema: Optional[Dict[str, Tuple[numpy.dtype, Tuple[int, ...]]]] = None, meta_provider: ray.data.datasource.file_meta_provider.ParquetMetadataProvider = <ray.data.datasource.file_meta_provider.DefaultParquetMetadataProvider object>, **arrow_parquet_args) → ray.data.dataset.Dataset

Create an Arrow Dataset from Parquet files.

Examples

>>> import ray
>>> # Read a directory of files in remote storage.
>>> ray.data.read_parquet("s3://bucket/path") 
>>> # Read multiple local files.
>>> ray.data.read_parquet(["/path/to/file1", "/path/to/file2"]) 
>>> # Specify a schema for the parquet file.
>>> import pyarrow as pa
>>> fields = [("sepal.length", pa.float64()),
...           ("sepal.width", pa.float64()),
...           ("petal.length", pa.float64()),
...           ("petal.width", pa.float64()),
...           ("variety", pa.string())]
>>> ray.data.read_parquet("example://iris.parquet",
...     schema=pa.schema(fields))
Dataset(
   num_blocks=1,
   num_rows=150,
   schema={
      sepal.length: double,
      sepal.width: double,
      petal.length: double,
      petal.width: double,
      variety: string
   }
)

For additional arguments that can be passed to PyArrow as keyword arguments, see https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html#pyarrow.dataset.Scanner.from_fragment.
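
For example, a PyArrow filter expression can be passed through as a keyword argument to prune rows at read time (a minimal sketch; filter is one of the Scanner.from_fragment options, and the iris file is the same example dataset used above):

>>> import ray
>>> import pyarrow.dataset as pds
>>> # Sketch: the filter keyword is forwarded to the PyArrow scanner,
>>> # so only matching rows are materialized.
>>> ray.data.read_parquet("example://iris.parquet",
...     filter=(pds.field("variety") == "Versicolor"))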

Parameters
  • paths – A single file path or directory, or a list of file paths. Multiple directories are not supported.

  • filesystem – The filesystem implementation to read from. The available implementations are listed at https://arrow.apache.org/docs/python/api/filesystems.html#filesystem-implementations.

  • columns – A list of column names to read.

  • parallelism – The requested parallelism of the read. Parallelism may be limited by the number of files in the dataset.

  • ray_remote_args – kwargs passed to ray.remote in the read tasks.

  • tensor_column_schema – A dict mapping column names to tensor dtype and shape, used to convert a Parquet column whose elements are serialized tensors (ndarrays) into the tensor column extension type. This assumes that the tensors were serialized in the raw NumPy array format in C-contiguous order (e.g. via arr.tobytes()); see the sketch after this parameter list.

  • meta_provider – File metadata provider. Custom metadata providers may be able to resolve file metadata more quickly and/or accurately.

  • arrow_parquet_args – Other Parquet read options to pass to PyArrow; see https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html#pyarrow.dataset.Scanner.from_fragment.
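
As a sketch of the tensor_column_schema conversion described above (the file path and the "image" column name are hypothetical; the column is assumed to hold raw bytes produced by arr.tobytes()):

>>> import ray
>>> import numpy as np
>>> # Hypothetical file and column: each "image" value is a 28x28 uint8
>>> # tensor serialized with arr.tobytes() in C-contiguous order.
>>> ray.data.read_parquet("/path/to/images.parquet",
...     tensor_column_schema={"image": (np.uint8, (28, 28))})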

Returns

Dataset producing Arrow records read from the specified paths.

PublicAPI: This API is stable across Ray releases.