ray.data.read_numpy

ray.data.read_numpy(paths: Union[str, List[str]], *, filesystem: Optional[pyarrow.fs.FileSystem] = None, parallelism: int = -1, arrow_open_stream_args: Optional[Dict[str, Any]] = None, meta_provider: Optional[ray.data.datasource.file_meta_provider.BaseFileMetadataProvider] = None, partition_filter: Optional[ray.data.datasource.partitioning.PathPartitionFilter] = FileExtensionFilter(extensions=['.npy'], allow_if_no_extensions=False), partitioning: ray.data.datasource.partitioning.Partitioning = None, ignore_missing_paths: bool = False, shuffle: Optional[Literal['files']] = None, **numpy_load_args) -> ray.data.dataset.Dataset

Create an Arrow dataset from NumPy files.

Examples

Read a directory of files in remote storage.

>>> import ray
>>> ray.data.read_numpy("s3://bucket/path") 

Read multiple local files.

>>> ray.data.read_numpy(["/path/to/file1", "/path/to/file2"]) 

Read multiple directories.

>>> ray.data.read_numpy( 
...     ["s3://bucket/path1", "s3://bucket/path2"])

Parameters

  • paths – A single file/directory path or a list of file/directory paths. A list of paths can contain both files and directories.

  • filesystem – The filesystem implementation to read from.

  • parallelism – The requested parallelism of the read. Parallelism may be limited by the number of files of the dataset.

  • arrow_open_stream_args – kwargs passed to pyarrow.fs.FileSystem.open_input_stream.

  • numpy_load_args – Other options to pass to np.load.

  • meta_provider – File metadata provider. Custom metadata providers may be able to resolve file metadata more quickly and/or accurately. If None, this function uses a system-chosen implementation.

  • partition_filter – Path-based partition filter, if any. Can be used with a custom callback to read only selected partitions of a dataset. By default, this filters out any file paths whose file extension does not match “.npy”.

  • partitioning – A Partitioning object that describes how paths are organized. Defaults to None.

  • ignore_missing_paths – If True, ignores any file paths in paths that are not found. Defaults to False.

  • shuffle – If set to “files”, randomly shuffles the order of the input files before reading. Defaults to None, which does not shuffle.

Returns

Dataset holding Tensor records read from the specified paths.
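
The default `partition_filter` described above keeps only paths ending in “.npy” and, with `allow_if_no_extensions=False`, also drops paths that have no extension at all. The sketch below imitates that behavior in plain Python to make the rule concrete. It is an illustrative stand-in, not Ray's implementation: the function `filter_by_extension`, its parameter names, and the case-insensitive comparison are all assumptions of this sketch.

```python
import os
from typing import Iterable, List, Sequence


def filter_by_extension(
    paths: Iterable[str],
    extensions: Sequence[str] = (".npy",),
    allow_if_no_extension: bool = False,
) -> List[str]:
    """Keep only the paths whose file extension is in `extensions`.

    Hypothetical helper for illustration; not part of Ray.
    """
    # Compare case-insensitively (a choice made by this sketch).
    wanted = {ext.lower() for ext in extensions}
    kept = []
    for path in paths:
        _, ext = os.path.splitext(path)
        if not ext:
            # Paths without any extension are kept only when allowed.
            if allow_if_no_extension:
                kept.append(path)
            continue
        if ext.lower() in wanted:
            kept.append(path)
    return kept


candidates = ["data/a.npy", "data/b.NPY", "data/notes.txt", "data/README"]
selected = filter_by_extension(candidates)
print(selected)
```

Under these assumptions, only the two array files survive the filter. To read files that do not carry a “.npy” extension, pass a different `partition_filter` to `read_numpy`.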