ray.data.read_numpy#
- ray.data.read_numpy(paths: str | List[str], *, filesystem: pyarrow.fs.FileSystem | None = None, parallelism: int = -1, arrow_open_stream_args: Dict[str, Any] | None = None, meta_provider: BaseFileMetadataProvider | None = None, partition_filter: PathPartitionFilter | None = None, partitioning: Partitioning = None, include_paths: bool = False, ignore_missing_paths: bool = False, shuffle: Literal['files'] | None = None, file_extensions: List[str] | None = ['npy'], **numpy_load_args) → Dataset [source]#
Create an Arrow dataset from numpy files.
Examples
Read a directory of files in remote storage.
>>> import ray
>>> ray.data.read_numpy("s3://bucket/path")
Read multiple local files.
>>> ray.data.read_numpy(["/path/to/file1", "/path/to/file2"])
Read multiple directories.
>>> ray.data.read_numpy(
...     ["s3://bucket/path1", "s3://bucket/path2"])
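The examples above read existing files from storage. As a minimal local sketch (not part of the original examples), the snippet below writes small .npy files with np.save, the format read_numpy consumes, and verifies them with NumPy; the temporary directory and file names are illustrative, and the final Ray read is shown commented out since it assumes Ray is installed.

```python
import os
import tempfile

import numpy as np

# Write two small .npy files. read_numpy treats each file as an array
# of records; the paths below are illustrative temporaries.
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(2):
    p = os.path.join(tmpdir, f"chunk{i}.npy")
    np.save(p, np.arange(4) + 4 * i)
    paths.append(p)

# Verify the files round-trip with NumPy before handing them to Ray.
arrays = [np.load(p) for p in paths]
total_rows = sum(a.shape[0] for a in arrays)  # 4 + 4 = 8

# Reading them back (assumes Ray is installed):
# import ray
# ds = ray.data.read_numpy(paths)
```
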
- Parameters:
paths – A single file/directory path or a list of file/directory paths. A list of paths can contain both files and directories.
filesystem – The filesystem implementation to read from.
parallelism – The requested parallelism of the read. Parallelism may be limited by the number of files of the dataset.
arrow_open_stream_args – kwargs passed to pyarrow.fs.FileSystem.open_input_stream.
numpy_load_args – Other options to pass to np.load.
meta_provider – File metadata provider. Custom metadata providers may be able to resolve file metadata more quickly and/or accurately. If None, this function uses a system-chosen implementation.
partition_filter – Path-based partition filter, if any. Can be used with a custom callback to read only selected partitions of a dataset. By default, this filters out any file paths whose file extension does not match ".npy".
partitioning – A Partitioning object that describes how paths are organized. Defaults to None.
include_paths – If True, include the path to each file. File paths are stored in the 'path' column.
ignore_missing_paths – If True, ignores any file paths in paths that are not found. Defaults to False.
shuffle – If set to "files", randomly shuffles the input file order before the read. Defaults to None (no shuffle).
file_extensions – A list of file extensions to filter files by.
- Returns:
Dataset holding Tensor records read from the specified paths.