ray.data.datasource.FileBasedDatasource
- class ray.data.datasource.FileBasedDatasource(*args, **kwds)
Bases: ray.data.datasource.datasource.Datasource[Union[ray.data._internal.arrow_block.ArrowRow, Any]]
File-based datasource, for reading and writing files.
This class should not be used directly, and should instead be subclassed and tailored to particular file formats. Classes deriving from this class must implement _read_file().
If _FILE_EXTENSION is defined, only files with that extension are read by default. If it is None, no default filter is applied.
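The subclassing contract described above can be illustrated with a small standalone sketch. It does not import Ray: ToyFileBasedSource, _select_paths() and read_all() are hypothetical stand-ins invented for this example, and the exact-suffix match is an assumption rather than Ray's documented filtering rule. Only the names _read_file() and _FILE_EXTENSION come from the description above.

```python
import io
from typing import Dict, List, Optional


class ToyFileBasedSource:
    """Stand-in for the pattern described above; NOT Ray's FileBasedDatasource."""

    # Subclasses set this (without the leading dot) to enable the default
    # filter. None means no default filter is applied.
    _FILE_EXTENSION: Optional[str] = None

    def _read_file(self, f: io.TextIOBase, path: str) -> List[str]:
        # Format-specific parsing lives in subclasses.
        raise NotImplementedError("Subclasses must implement _read_file().")

    def _select_paths(self, paths: List[str]) -> List[str]:
        # Assumption: a plain, case-sensitive suffix match.
        if self._FILE_EXTENSION is None:
            return list(paths)
        suffix = "." + self._FILE_EXTENSION
        return [p for p in paths if p.endswith(suffix)]

    def read_all(self, files: Dict[str, str]) -> List[str]:
        # `files` maps path -> contents so the sketch needs no real filesystem.
        rows: List[str] = []
        for path in self._select_paths(list(files)):
            with io.StringIO(files[path]) as f:
                rows.extend(self._read_file(f, path))
        return rows


class ToyLinesSource(ToyFileBasedSource):
    """Hypothetical subclass tailored to one format: one row per text line."""

    _FILE_EXTENSION = "txt"

    def _read_file(self, f: io.TextIOBase, path: str) -> List[str]:
        return [line.rstrip("\n") for line in f]


files = {"a.txt": "x\ny\n", "b.csv": "1,2\n", "c.txt": "z\n"}
rows = ToyLinesSource().read_all(files)
print(rows)  # b.csv is skipped by the default extension filter
```

The base class owns path selection and iteration, while each subclass supplies only the per-file parsing step.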
- Current subclasses:
JSONDatasource, CSVDatasource, NumpyDatasource, BinaryDatasource
DeveloperAPI: This API may change across minor Ray releases.
- create_reader(**kwargs)
Return a Reader for the given read arguments.
The reader object will be responsible for querying the read metadata, and generating the actual read tasks to retrieve the data blocks upon request.
- Parameters
read_args – Additional keyword arguments to pass to the datasource implementation.
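The idea of a reader that plans work first and reads data only when asked can be sketched without Ray. ToyReader below is a hypothetical stand-in, not Ray's Reader class. Its round-robin split of paths across tasks and the placeholder "block" strings are assumptions made only for illustration.

```python
from typing import Callable, List


class ToyReader:
    """Hypothetical reader: plans tasks up front, reads only when a task runs."""

    def __init__(self, paths: List[str]):
        self._paths = list(paths)

    def get_read_tasks(self, parallelism: int) -> List[Callable[[], List[str]]]:
        # Never create more tasks than there are paths, and always at least one.
        n = max(1, min(parallelism, len(self._paths)))
        # Assumption: distribute paths round-robin across the n tasks.
        chunks = [self._paths[i::n] for i in range(n)]

        def make_task(chunk: List[str]) -> Callable[[], List[str]]:
            # Bind `chunk` now; the "read" itself is deferred until the call.
            return lambda: [f"block-from:{p}" for p in chunk]

        return [make_task(c) for c in chunks]


reader = ToyReader(["a.csv", "b.csv", "c.csv"])
tasks = reader.get_read_tasks(parallelism=2)  # nothing is read yet
blocks = [task() for task in tasks]           # data is produced on request
print(blocks)
```

Separating planning from execution lets a scheduler decide where and when each task runs.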
- do_write(blocks: List[ray.types.ObjectRef[Union[List[ray.data.block.T], pyarrow.Table, pandas.DataFrame, bytes]]], metadata: List[ray.data.block.BlockMetadata], path: str, dataset_uuid: str, filesystem: Optional[pyarrow.fs.FileSystem] = None, try_create_dir: bool = True, open_stream_args: Optional[Dict[str, Any]] = None, block_path_provider: ray.data.datasource.file_based_datasource.BlockWritePathProvider = <ray.data.datasource.file_based_datasource.DefaultBlockWritePathProvider object>, write_args_fn: Callable[[], Dict[str, Any]] = <function FileBasedDatasource.<lambda>>, _block_udf: Optional[Callable[[Union[List[ray.data.block.T], pyarrow.Table, pandas.DataFrame, bytes]], Union[List[ray.data.block.T], pyarrow.Table, pandas.DataFrame, bytes]]] = None, ray_remote_args: Dict[str, Any] = None, **write_args) → List[ray.types.ObjectRef[Any]]
Creates and returns write tasks for a file-based datasource.