ray.data.read_images

ray.data.read_images(paths: Union[str, List[str]], *, filesystem: Optional[pyarrow.fs.FileSystem] = None, parallelism: int = -1, meta_provider: ray.data.datasource.file_meta_provider.BaseFileMetadataProvider = <ray.data.datasource.image_datasource._ImageFileMetadataProvider object>, ray_remote_args: Dict[str, Any] = None, arrow_open_file_args: Optional[Dict[str, Any]] = None, partition_filter: Optional[ray.data.datasource.partitioning.PathPartitionFilter] = FileExtensionFilter(extensions=['.png', '.jpg', '.jpeg', '.tiff', '.bmp', '.gif'], allow_if_no_extensions=False), partitioning: ray.data.datasource.partitioning.Partitioning = None, size: Optional[Tuple[int, int]] = None, mode: Optional[str] = None, include_paths: bool = False, ignore_missing_paths: bool = False) → ray.data.dataset.Dataset

Read images from the specified paths.

Examples

>>> import ray
>>> path = "s3://anonymous@air-example-data-2/movie-image-small-filesize-1GB"
>>> ds = ray.data.read_images(path)  
>>> ds  
Dataset(num_blocks=200, num_rows=41979, schema={image: numpy.ndarray(ndim=3, dtype=uint8)})

If you need image file paths, set include_paths=True.

>>> ds = ray.data.read_images(path, include_paths=True)  
>>> ds  
Dataset(num_blocks=200, num_rows=41979, schema={image: numpy.ndarray(ndim=3, dtype=uint8), path: string})
>>> ds.take(1)[0]["path"]  
'air-example-data-2/movie-image-small-filesize-1GB/0.jpg'
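
The path column holds plain strings, so ordinary string handling applies to it. As a minimal sketch, assuming each row behaves like a dictionary shaped like the output above, a hypothetical helper add_filename (not part of Ray) derives a file name from the path:

```python
import posixpath


def add_filename(row):
    """Return a copy of the row with a 'filename' key derived from 'path'.

    Hypothetical helper for illustration; it is not part of the Ray API.
    """
    out = dict(row)  # copy so the input row is left untouched
    out["filename"] = posixpath.basename(row["path"])
    return out


# Stand-in row; a real row would carry the image array under "image".
row = {"image": None, "path": "air-example-data-2/movie-image-small-filesize-1GB/0.jpg"}
print(add_filename(row)["filename"])  # 0.jpg
```

A function of this shape could plausibly be passed to a per-row transform such as ds.map, but that usage is an assumption and is not verified here.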

If your images are arranged like:

root/dog/xxx.png
root/dog/xxy.png

root/cat/123.png
root/cat/nsdf3.png

Then you can include the labels by specifying a Partitioning.

>>> import ray
>>> from ray.data.datasource.partitioning import Partitioning
>>> root = "example://tiny-imagenet-200/train"
>>> partitioning = Partitioning("dir", field_names=["class"], base_dir=root)
>>> ds = ray.data.read_images(root, size=(224, 224), partitioning=partitioning)  
>>> ds  
Dataset(num_blocks=176, num_rows=94946, schema={image: TensorDtype(shape=(224, 224, 3), dtype=uint8), class: object})
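
To make the directory layout above concrete, the following sketch shows how a "dir" style partitioning could map the directories between base_dir and the file name onto field_names. The function parse_dir_partition is a hypothetical illustration written with the standard library only; it is not Ray's implementation, and Ray's handling of edge cases may differ.

```python
import posixpath


def parse_dir_partition(path, base_dir, field_names):
    """Map each directory level under base_dir to the matching field name.

    Illustrative only; assumes one directory level per field name.
    """
    rel = posixpath.relpath(path, base_dir)
    dirs = rel.split("/")[:-1]  # drop the trailing file name
    if len(dirs) != len(field_names):
        raise ValueError(
            f"expected {len(field_names)} directory levels, got {len(dirs)}"
        )
    return dict(zip(field_names, dirs))


print(parse_dir_partition("root/dog/xxx.png", "root", ["class"]))  # {'class': 'dog'}
print(parse_dir_partition("root/cat/123.png", "root", ["class"]))  # {'class': 'cat'}
```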

Parameters
  • paths – A single file/directory path or a list of file/directory paths. A list of paths can contain both files and directories.

  • filesystem – The filesystem implementation to read from.

  • parallelism – The requested parallelism of the read. Parallelism may be limited by the number of files in the dataset.

  • meta_provider – File metadata provider. Custom metadata providers may be able to resolve file metadata more quickly and/or accurately.

  • ray_remote_args – kwargs passed to ray.remote in the read tasks.

  • arrow_open_file_args – kwargs passed to pyarrow.fs.FileSystem.open_input_file.

  • partition_filter – Path-based partition filter, if any. Can be used with a custom callback to read only selected partitions of a dataset. By default, this filters out any file paths whose file extension does not match *.png, *.jpg, *.jpeg, *.tiff, *.bmp, or *.gif.

  • partitioning – A Partitioning object that describes how paths are organized. Defaults to None.

  • size – The desired height and width of loaded images, as a (height, width) tuple. If unspecified, images retain their original shape.

  • mode – A Pillow mode describing the desired type and depth of pixels. If unspecified, image modes are inferred by Pillow.

  • include_paths – If True, include the path to each image. File paths are stored in the 'path' column.

  • ignore_missing_paths – If True, ignores any file/directory paths in paths that are not found. Defaults to False.
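
The default partition_filter described above keeps only paths with a recognized image extension. The sketch below imitates that behavior with the standard library. The name keep_path is hypothetical, and the case-insensitive comparison is an assumption of this sketch, not something the text above states.

```python
import pathlib

# Extensions listed in the default partition_filter of the signature.
DEFAULT_EXTENSIONS = [".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".gif"]


def keep_path(path, extensions=DEFAULT_EXTENSIONS, allow_if_no_extensions=False):
    """Return True if the path would pass an extension-based filter.

    Illustrative only; lower-casing the suffix is an assumption.
    """
    suffix = pathlib.PurePosixPath(path).suffix.lower()
    if not suffix:
        # Paths without any extension are governed by the flag.
        return allow_if_no_extensions
    return suffix in extensions


paths = ["root/dog/xxx.png", "root/dog/notes.txt", "root/cat/123.JPG", "root/README"]
kept = [p for p in paths if keep_path(p)]
print(kept)  # ['root/dog/xxx.png', 'root/cat/123.JPG']
```

To read files with other extensions, a different filter can be passed as partition_filter, as the parameter description notes.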

Returns

A Dataset producing tensors that represent the images at the specified paths. For information on working with tensors, read the tensor data guide.

Raises
  • ValueError – if size contains non-positive numbers.

  • ValueError – if mode is unsupported.
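
The size condition above can be checked before a read is started. This is a minimal sketch of such a pre-check; validate_size is a hypothetical helper, and the exact check and error message inside Ray may differ.

```python
def validate_size(size):
    """Raise ValueError if size is given and contains a non-positive number.

    Hypothetical pre-check mirroring the documented condition.
    """
    if size is None:
        return None  # unspecified: images retain their original shape
    height, width = size
    if height <= 0 or width <= 0:
        raise ValueError(f"size must contain positive numbers, got {size}")
    return (height, width)


print(validate_size((224, 224)))  # (224, 224)

try:
    validate_size((0, 224))
except ValueError as err:
    print(f"rejected: {err}")
```

The same try/except ValueError pattern applies around a call to ray.data.read_images itself when size or mode comes from user input.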

PublicAPI (beta): This API is in beta and may change before becoming stable.