ray.data.datasource.FileBasedDatasource

class ray.data.datasource.FileBasedDatasource(paths: str | List[str], *, filesystem: pyarrow.fs.FileSystem | None = None, schema: type | pyarrow.lib.Schema | None = None, open_stream_args: Dict[str, Any] | None = None, meta_provider: BaseFileMetadataProvider = <ray.data.datasource.file_meta_provider.DefaultFileMetadataProvider object>, partition_filter: PathPartitionFilter = None, partitioning: Partitioning = None, ignore_missing_paths: bool = False, shuffle: Literal['files'] | FileShuffleConfig | None = None, include_paths: bool = False, file_extensions: List[str] | None = None)

Bases: Datasource

File-based datasource for reading files.

Don’t use this class directly. Instead, subclass it and implement `_read_stream()`.

DeveloperAPI: This API may change across minor Ray releases.

Methods

`create_reader`	Deprecated: Implement `get_read_tasks()` and `estimate_inmemory_data_size()` instead.
`get_name`	Return a human-readable name for this datasource.
`prepare_read`	Deprecated: Implement `get_read_tasks()` and `estimate_inmemory_data_size()` instead.

Attributes

`should_create_reader`
`supports_distributed_reads`	If `False`, only launch read tasks on the driver's node.