ray.data.datasource.FastFileMetadataProvider

class ray.data.datasource.FastFileMetadataProvider

Bases: ray.data.datasource.file_meta_provider.DefaultFileMetadataProvider

Fast metadata provider for FileBasedDatasource implementations.

Offers improved performance vs. DefaultFileMetadataProvider by skipping directory path expansion and file size collection. While this performance improvement may be negligible for local filesystems, it can be substantial for cloud storage service providers.

This should only be used when all input paths are known to be files.

DeveloperAPI: This API may change across minor Ray releases.
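As a minimal sketch of how this provider might be selected, the helper below passes it to a file-based read. It assumes a Ray release in which `ray.data.read_csv` accepts a `meta_provider` argument. The wrapper name `read_csv_files_fast` is hypothetical and not part of Ray.

```python
from typing import List


def read_csv_files_fast(file_paths: List[str]):
    """Read CSV files whose paths are all known to be files, not directories.

    Hypothetical helper for illustration; not part of Ray.
    """
    # Imported lazily so this module can be loaded where Ray is not installed.
    import ray
    from ray.data.datasource import FastFileMetadataProvider

    # Assumption: this Ray release's read_csv accepts `meta_provider`.
    # Every entry in file_paths must be a file, because this provider
    # skips directory path expansion.
    return ray.data.read_csv(file_paths, meta_provider=FastFileMetadataProvider())
```

Passing a directory path here would go against the restriction stated above, so callers that cannot guarantee file-only input should probably stay with the default provider.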

expand_paths(paths: List[str], filesystem: pyarrow.fs.FileSystem) → Tuple[List[str], List[Optional[int]]]

Expands all paths into concrete file paths by walking directories.

Also returns a sidecar of file sizes.

The input paths must be normalized for compatibility with the input filesystem prior to invocation.

Args:
paths: A list of file and/or directory paths compatible with the given filesystem.

filesystem: The filesystem implementation that should be used for expanding all paths and reading their files.

Returns:

A tuple whose first item contains the list of file paths discovered, and whose second item contains the size of each file. A size entry may be None if the file size is either unknown or will be fetched later by _get_block_metadata(), but the length of both lists must be equal.
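The return contract can be illustrated without Ray. The sketch below uses only the standard library and two hypothetical helpers, `expand_paths_with_sizes` and `expand_paths_fast`, neither of which is part of Ray. The first walks directories and collects sizes. The second assumes, based on the class description, that a fast provider returns the input paths unchanged with None for every size. Local `os` calls stand in for the `pyarrow.fs.FileSystem` argument.

```python
import os
import tempfile
from typing import List, Optional, Tuple

PathsAndSizes = Tuple[List[str], List[Optional[int]]]


def expand_paths_with_sizes(paths: List[str]) -> PathsAndSizes:
    """Walk directories and stat every file (illustrative default-style strategy)."""
    files: List[str] = []
    sizes: List[Optional[int]] = []
    for path in paths:
        if os.path.isdir(path):
            for root, _dirs, names in os.walk(path):
                for name in sorted(names):
                    full_path = os.path.join(root, name)
                    files.append(full_path)
                    sizes.append(os.path.getsize(full_path))
        else:
            files.append(path)
            sizes.append(os.path.getsize(path))
    return files, sizes


def expand_paths_fast(paths: List[str]) -> PathsAndSizes:
    """Skip expansion and size lookup (assumed fast-path behavior)."""
    # Sizes are unknown here, so each entry is None; both lists keep equal length.
    return list(paths), [None] * len(paths)


# Usage: a temporary directory containing one 5-byte file.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "part-0.bin")
    with open(sample, "wb") as fh:
        fh.write(b"12345")
    walked_files, walked_sizes = expand_paths_with_sizes([tmp])
    fast_files, fast_sizes = expand_paths_fast([sample])
```

In this sketch the fast variant never touches the filesystem, which is where the savings on cloud storage would come from. It is also why a directory path would pass through unexpanded.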