ray.data.datasource.DefaultFileMetadataProvider

class ray.data.datasource.DefaultFileMetadataProvider

Bases: ray.data.datasource.file_meta_provider.BaseFileMetadataProvider

Default metadata provider for FileBasedDatasource implementations that reuse the base prepare_read method.

Calculates block size in bytes as the sum of its constituent file sizes, and assumes a fixed number of rows per file.

DeveloperAPI: This API may change across minor Ray releases.
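The block-size rule described above can be sketched in a few lines. This is an illustrative helper, not Ray's implementation: the name estimate_block_size is hypothetical, and returning None when any file size is unknown is an assumption made for this sketch.

```python
from typing import List, Optional


def estimate_block_size(file_sizes: List[Optional[int]]) -> Optional[int]:
    """Hypothetical helper: block size as the sum of its file sizes.

    Assumption for this sketch: if any file size is unknown (None),
    the block size is reported as unknown (None) as well.
    """
    if any(size is None for size in file_sizes):
        return None
    return sum(file_sizes)


# A block made of three files with known sizes.
known = estimate_block_size([1024, 2048, 512])

# A block where one file size has not been resolved yet.
partially_unknown = estimate_block_size([1024, None])
```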

expand_paths(paths: List[str], filesystem: pyarrow.fs.FileSystem) -> Tuple[List[str], List[Optional[int]]]

Expands all paths into concrete file paths by walking directories.

Also returns a sidecar of file sizes.

The input paths must be normalized for compatibility with the input filesystem prior to invocation.

Args:
    paths: A list of file and/or directory paths compatible with the given filesystem.
    filesystem: The filesystem implementation that should be used for expanding all paths and reading their files.

Returns:
    A tuple whose first item is the list of discovered file paths and whose second item is the list of corresponding file sizes. A size entry may be None if the file size is unknown or will be fetched later by _get_block_metadata(), but both lists must have the same length.
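The return contract can be illustrated with a local-filesystem sketch that uses only the Python standard library. It is not the provider's actual code: the function expand_local_paths is hypothetical, it uses os.walk rather than a pyarrow.fs.FileSystem, and recording None for a path whose size cannot be read is an assumption chosen to show the "unknown size" case.

```python
import os
from typing import List, Optional, Tuple


def _size_or_none(path: str) -> Optional[int]:
    """Return the file size in bytes, or None if it cannot be determined."""
    try:
        return os.path.getsize(path)
    except OSError:
        return None


def expand_local_paths(paths: List[str]) -> Tuple[List[str], List[Optional[int]]]:
    """Hypothetical sketch: expand files and directories into concrete file
    paths, returning a parallel list of file sizes."""
    file_paths: List[str] = []
    file_sizes: List[Optional[int]] = []
    for path in paths:
        if os.path.isdir(path):
            # Walk the directory tree and collect every file beneath it.
            for root, dirs, files in os.walk(path):
                dirs.sort()  # visit subdirectories in a deterministic order
                for name in sorted(files):
                    full_path = os.path.join(root, name)
                    file_paths.append(full_path)
                    file_sizes.append(_size_or_none(full_path))
        else:
            # Treat anything else as a file path; its size may be unknown.
            file_paths.append(path)
            file_sizes.append(_size_or_none(path))
    # Both lists always have the same length, one size entry per path.
    return file_paths, file_sizes
```

Each path gets exactly one size entry, so callers can zip the two lists together even when some sizes are None.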