ray.data.read_parquet_bulk
- ray.data.read_parquet_bulk(paths: Union[str, List[str]], *, filesystem: Optional[pyarrow.fs.FileSystem] = None, columns: Optional[List[str]] = None, parallelism: int = -1, ray_remote_args: Dict[str, Any] = None, arrow_open_file_args: Optional[Dict[str, Any]] = None, tensor_column_schema: Optional[Dict[str, Tuple[numpy.dtype, Tuple[int, ...]]]] = None, meta_provider: ray.data.datasource.file_meta_provider.BaseFileMetadataProvider = <ray.data.datasource.file_meta_provider.FastFileMetadataProvider object>, partition_filter: Optional[ray.data.datasource.partitioning.PathPartitionFilter] = FileExtensionFilter(extensions=['.parquet'], allow_if_no_extensions=False), **arrow_parquet_args) → ray.data.dataset.Dataset
Create an Arrow dataset from a large number (such as >1K) of Parquet files quickly.

By default, ONLY file paths should be provided as input (i.e. no directory paths), and an OSError will be raised if one or more paths point to directories. If your use case requires directory paths, then the metadata provider should be changed to one that supports directory expansion (e.g. DefaultFileMetadataProvider).

Offers improved performance vs. read_parquet() due to not using PyArrow’s ParquetDataset abstraction, whose latency scales linearly with the number of input files due to collecting all file metadata on a single node.

Also supports a wider variety of input Parquet file types than read_parquet() due to not trying to merge and resolve a unified schema for all files.

However, unlike read_parquet(), this does not offer file metadata resolution by default, so a custom metadata provider should be provided if your use case requires a unified schema, block sizes, row counts, etc.

Examples
>>> # Read multiple local files. You should always provide only input file
>>> # paths (i.e. no directory paths) when known to minimize read latency.
>>> ray.data.read_parquet_bulk( 
...     ["/path/to/file1", "/path/to/file2"]) 

>>> # Read a directory of files in remote storage. Caution should be taken
>>> # when providing directory paths, since the time to both check each path
>>> # type and expand its contents may result in greatly increased latency
>>> # and/or request rate throttling from cloud storage service providers.
>>> ray.data.read_parquet_bulk( 
...     "s3://bucket/path",
...     meta_provider=DefaultFileMetadataProvider()) 
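A minimal additional sketch (the file paths and column names below are hypothetical): column pruning via columns and an explicit read parallelism can be combined with the bulk reader.

>>> # Read only the columns you need and request a specific read parallelism.
>>> # Paths and column names are placeholders for illustration.
>>> ray.data.read_parquet_bulk( 
...     ["/path/to/file1", "/path/to/file2"],
...     columns=["col1", "col2"],
...     parallelism=100) 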
- Parameters
paths – A single file path or a list of file paths. If one or more directories are provided, then meta_provider should also be set to an implementation that supports directory expansion (e.g. DefaultFileMetadataProvider).
filesystem – The filesystem implementation to read from.
columns – A list of column names to read.
parallelism – The requested parallelism of the read. Parallelism may be limited by the number of files of the dataset.
ray_remote_args – kwargs passed to ray.remote in the read tasks.
arrow_open_file_args – kwargs passed to pyarrow.fs.FileSystem.open_input_file.
tensor_column_schema – A dict of column name -> tensor dtype and shape mappings for converting a Parquet column containing serialized tensors (ndarrays) as their elements to our tensor column extension type. This assumes that the tensors were serialized in the raw NumPy array format in C-contiguous order (e.g. via arr.tobytes()). See the sketch after the Returns section.
meta_provider – File metadata provider. Defaults to a fast file metadata provider that skips file size collection and requires all input paths to be files. Change to DefaultFileMetadataProvider or a custom metadata provider if directory expansion and/or file metadata resolution is required.
partition_filter – Path-based partition filter, if any. Can be used with a custom callback to read only selected partitions of a dataset. By default, this filters out any file paths whose file extension does not match “.parquet”.
arrow_parquet_args – Other Parquet read options to pass to PyArrow.
- Returns
Dataset producing Arrow records read from the specified paths.
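The following is a minimal sketch of tensor_column_schema usage, assuming a hypothetical Parquet column named "image" that stores 28x28 uint8 ndarrays serialized with arr.tobytes(); adjust the column name, dtype, and shape to match your data.

>>> import numpy as np
>>> # Hypothetical column "image" holding serialized 28x28 uint8 tensors.
>>> ray.data.read_parquet_bulk( 
...     ["/path/to/file1", "/path/to/file2"],
...     tensor_column_schema={"image": (np.uint8, (28, 28))}) 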
PublicAPI: This API is stable across Ray releases.