ray.data.read_parquet_bulk(paths: Union[str, List[str]], *, filesystem: Optional[pyarrow.fs.FileSystem] = None, columns: Optional[List[str]] = None, parallelism: int = -1, ray_remote_args: Dict[str, Any] = None, arrow_open_file_args: Optional[Dict[str, Any]] = None, tensor_column_schema: Optional[Dict[str, Tuple[numpy.dtype, Tuple[int, ...]]]] = None, meta_provider: Optional[ray.data.datasource.file_meta_provider.BaseFileMetadataProvider] = None, partition_filter: Optional[ray.data.datasource.partitioning.PathPartitionFilter] = FileExtensionFilter(extensions=['.parquet'], allow_if_no_extensions=False), shuffle: Optional[Literal['files']] = None, **arrow_parquet_args) -> ray.data.dataset.Dataset
Create a Dataset from parquet files without reading metadata.
Use read_parquet() for most cases.
Performance slowdowns are possible when using this method with parquet files that are very large.
Only provide file paths as input (i.e., no directory paths). An OSError is raised if one or more paths point to directories. If your use case requires directory paths, use read_parquet() instead.
Read multiple local files. You should always provide only input file paths (i.e. no directory paths) when known to minimize read latency.
>>> import ray
>>> ray.data.read_parquet_bulk(
...     ["/path/to/file1", "/path/to/file2"])
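Because this function accepts only file paths, a directory has to be expanded into its files before the call. The sketch below shows one way to do that with the standard library. `list_parquet_files` is a hypothetical helper, not part of Ray, and it assumes the files sit directly inside a local directory and end in `.parquet`.

```python
from pathlib import Path
from typing import List


def list_parquet_files(directory: str) -> List[str]:
    """Return sorted paths of the .parquet files directly inside `directory`.

    Hypothetical helper for illustration; not part of the Ray API.
    Subdirectories are skipped, since directory paths are not accepted.
    """
    root = Path(directory)
    if not root.is_dir():
        raise NotADirectoryError(f"{directory!r} is not a directory")
    return sorted(
        str(p)
        for p in root.iterdir()
        if p.is_file() and p.suffix == ".parquet"
    )
```

The returned list could then be passed as the `paths` argument, for example `ray.data.read_parquet_bulk(list_parquet_files("/path/to/dir"))`.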
paths – A single file path or a list of file paths.
filesystem – The PyArrow filesystem implementation to read from. These filesystems are specified in the PyArrow docs. Specify this parameter if you need to provide specific configurations to the filesystem. By default, the filesystem is automatically selected based on the scheme of the paths. For example, if the path begins with
columns – A list of column names to read. Only the specified columns are read during the file scan.
parallelism – The amount of parallelism to use for the dataset. Defaults to -1, which automatically determines the optimal parallelism for your configuration. You should not need to manually set this value in most cases. For details on how the parallelism is automatically determined and guidance on how to tune it, see Tuning read parallelism. Parallelism is upper bounded by the total number of records in all the parquet files.
ray_remote_args – kwargs passed to remote() in the read tasks.
arrow_open_file_args – kwargs passed to pyarrow.fs.FileSystem.open_input_file when opening input files to read.
tensor_column_schema – A dict of column name to PyArrow dtype and shape mappings for converting a Parquet column containing serialized tensors (ndarrays) as their elements to PyArrow tensors. This function assumes that the tensors are serialized in the raw NumPy array format in C-contiguous order (e.g. via
meta_provider – A file metadata provider. Custom metadata providers may be able to resolve file metadata more quickly and/or accurately. In most cases, you do not need to set this. If None, this function uses a system-chosen implementation.
partition_filter – A PathPartitionFilter. Use with a custom callback to read only selected partitions of a dataset. By default, this filters out any file paths whose file extension does not match “.parquet”.
shuffle – If set to “files”, randomly shuffles the order of the input files before reading. Defaults to None, which does not shuffle.
arrow_parquet_args – Other parquet read options to pass to PyArrow. For the full set of arguments, see the PyArrow API
Dataset producing records read from the specified paths.
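The tensor_column_schema description assumes each tensor is stored as raw bytes in C-contiguous order, with the dtype and shape supplied separately. The NumPy-only sketch below illustrates that round trip. It assumes `ndarray.tobytes()` as the serializer, the column name "image" is hypothetical, and it is not Ray's internal conversion code.

```python
from typing import Tuple

import numpy as np

# Hypothetical schema entry: column name -> (NumPy dtype, tensor shape).
tensor_column_schema = {"image": (np.dtype(np.float32), (2, 3))}


def serialize_tensor(arr: np.ndarray) -> bytes:
    """Dump the array as raw bytes in C (row-major) order."""
    return np.ascontiguousarray(arr).tobytes()


def deserialize_tensor(
    raw: bytes, dtype: np.dtype, shape: Tuple[int, ...]
) -> np.ndarray:
    """Rebuild the array from raw bytes; dtype and shape come from the schema."""
    return np.frombuffer(raw, dtype=dtype).reshape(shape)


dtype, shape = tensor_column_schema["image"]
original = np.arange(6, dtype=dtype).reshape(shape)
restored = deserialize_tensor(serialize_tensor(original), dtype, shape)
```

The raw bytes carry no dtype or shape information of their own, which is why the schema has to provide both for every tensor column.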