ray.data.read_parquet
- ray.data.read_parquet(paths: Union[str, List[str]], *, filesystem: Optional[pyarrow.fs.FileSystem] = None, columns: Optional[List[str]] = None, parallelism: int = -1, ray_remote_args: Dict[str, Any] = None, tensor_column_schema: Optional[Dict[str, Tuple[numpy.dtype, Tuple[int, ...]]]] = None, meta_provider: ray.data.datasource.file_meta_provider.ParquetMetadataProvider = DefaultParquetMetadataProvider(), **arrow_parquet_args) → ray.data.dataset.Dataset
Create an Arrow dataset from parquet files.
Examples
>>> import ray
>>> # Read a directory of files in remote storage.
>>> ray.data.read_parquet("s3://bucket/path")

>>> # Read multiple local files.
>>> ray.data.read_parquet(["/path/to/file1", "/path/to/file2"])

>>> # Specify a schema for the parquet file.
>>> import pyarrow as pa
>>> fields = [("sepal.length", pa.float64()),
...           ("sepal.width", pa.float64()),
...           ("petal.length", pa.float64()),
...           ("petal.width", pa.float64()),
...           ("variety", pa.string())]
>>> ray.data.read_parquet("example://iris.parquet",
...                       schema=pa.schema(fields))
Dataset(
   num_blocks=1,
   num_rows=150,
   schema={
      sepal.length: double,
      sepal.width: double,
      petal.length: double,
      petal.width: double,
      variety: string
   }
)
For the full set of keyword arguments that can be passed through to pyarrow, see https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html#pyarrow.dataset.Scanner.from_fragment
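For example, a pyarrow filter expression can be passed through to prune rows at read time. A minimal sketch using the iris example file from above (the 5.0 threshold is illustrative):

>>> import pyarrow.dataset as pds
>>> # Push a row filter down into the pyarrow scanner so that only
>>> # matching rows are materialized during the read.
>>> ray.data.read_parquet("example://iris.parquet",
...                       filter=pds.field("sepal.length") > 5.0)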
- Parameters
paths – A single file path or directory, or a list of file paths. Multiple directories are not supported.
filesystem – The filesystem implementation to read from. These are specified in https://arrow.apache.org/docs/python/api/filesystems.html#filesystem-implementations.
columns – A list of column names to read.
parallelism – The requested parallelism of the read. Parallelism may be limited by the number of files of the dataset.
ray_remote_args – kwargs passed to ray.remote in the read tasks.
tensor_column_schema – A dict of column name -> tensor dtype and shape mappings for converting a Parquet column containing serialized tensors (ndarrays) as their elements to our tensor column extension type. This assumes that the tensors were serialized in the raw NumPy array format in C-contiguous order (e.g. via arr.tobytes()). See the sketch after this parameter list.
meta_provider – File metadata provider. Custom metadata providers may be able to resolve file metadata more quickly and/or accurately.
arrow_parquet_args – Other parquet read options to pass to pyarrow, see https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html#pyarrow.dataset.Scanner.from_fragment
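A minimal sketch of tensor_column_schema, assuming a hypothetical file whose "image" column stores arr.tobytes() payloads of uint8 arrays of shape (28, 28); each entry maps a column name to the (dtype, shape) pair used to decode it:

>>> import numpy as np
>>> # The path and the "image" column name are hypothetical; each
>>> # serialized buffer is reinterpreted as a (28, 28) uint8 tensor.
>>> ray.data.read_parquet(
...     "/path/to/images.parquet",
...     tensor_column_schema={"image": (np.uint8, (28, 28))})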
- Returns
Dataset producing Arrow records read from the specified paths.
PublicAPI: This API is stable across Ray releases.
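Putting the remaining arguments together in one sketch (the parallelism and num_cpus values below are illustrative assumptions, not recommendations): read a column subset of the iris example with at most 4 read tasks, each reserving half a CPU.

>>> # Column projection plus an explicit read parallelism;
>>> # ray_remote_args is forwarded to ray.remote for each read task.
>>> ray.data.read_parquet(
...     "example://iris.parquet",
...     columns=["sepal.length", "variety"],
...     parallelism=4,
...     ray_remote_args={"num_cpus": 0.5})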