ray.data.read_images
ray.data.read_images(paths: Union[str, List[str]], *, filesystem: Optional[pyarrow.fs.FileSystem] = None, parallelism: int = -1, meta_provider: ray.data.datasource.file_meta_provider.BaseFileMetadataProvider = <ray.data.datasource.image_datasource._ImageFileMetadataProvider object>, ray_remote_args: Dict[str, Any] = None, arrow_open_file_args: Optional[Dict[str, Any]] = None, partition_filter: Optional[ray.data.datasource.partitioning.PathPartitionFilter] = FileExtensionFilter(extensions=['.png', '.jpg', '.jpeg', '.tiff', '.bmp', '.gif'], allow_if_no_extensions=False), partitioning: ray.data.datasource.partitioning.Partitioning = None, size: Optional[Tuple[int, int]] = None, mode: Optional[str] = None, include_paths: bool = False, ignore_missing_paths: bool = False) -> ray.data.dataset.Dataset
Read images from the specified paths.
Examples
>>> import ray
>>> path = "s3://anonymous@air-example-data-2/movie-image-small-filesize-1GB"
>>> ds = ray.data.read_images(path)
>>> ds
Dataset(num_blocks=200, num_rows=41979, schema={image: numpy.ndarray(ndim=3, dtype=uint8)})
If you need image file paths, set include_paths=True.

>>> ds = ray.data.read_images(path, include_paths=True)
>>> ds
Dataset(num_blocks=200, num_rows=41979, schema={image: numpy.ndarray(ndim=3, dtype=uint8), path: string})
>>> ds.take(1)[0]["path"]
'air-example-data-2/movie-image-small-filesize-1GB/0.jpg'
If your images are arranged like:

root/dog/xxx.png
root/dog/xxy.png
root/cat/123.png
root/cat/nsdf3.png
Then you can include the labels by specifying a Partitioning.

>>> import ray
>>> from ray.data.datasource.partitioning import Partitioning
>>> root = "example://tiny-imagenet-200/train"
>>> partitioning = Partitioning("dir", field_names=["class"], base_dir=root)
>>> ds = ray.data.read_images(root, size=(224, 224), partitioning=partitioning)
>>> ds
Dataset(num_blocks=176, num_rows=94946, schema={image: TensorDtype(shape=(224, 224, 3), dtype=uint8), class: object})
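To make the directory-to-label mapping concrete, here is a minimal pure-Python sketch of the idea behind a "dir" partitioning with field_names=["class"]. It is not Ray's implementation: the helper name labels_from_path is hypothetical, and it assumes plain POSIX-style paths rather than URIs.

```python
import posixpath


def labels_from_path(path, base_dir, field_names):
    """Hypothetical helper: map directory names below base_dir to field names."""
    # Path of the file relative to the partitioning root, e.g. "dog/xxx.png".
    relative = posixpath.relpath(path, base_dir)
    # Drop the file name; the remaining components are the partition values.
    parent = posixpath.dirname(relative)
    directories = parent.split("/") if parent else []
    if len(directories) != len(field_names):
        raise ValueError(
            f"Expected {len(field_names)} directory level(s) below {base_dir!r}, "
            f"found {len(directories)} in {path!r}."
        )
    return dict(zip(field_names, directories))


labels = labels_from_path("root/dog/xxx.png", "root", ["class"])
print(labels)  # {'class': 'dog'}
```

With the layout shown earlier, every file under root/dog would get class 'dog' and every file under root/cat would get class 'cat', which corresponds to the class column in the dataset schema above.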
- Parameters
paths – A single file/directory path or a list of file/directory paths. A list of paths can contain both files and directories.
filesystem – The filesystem implementation to read from.
parallelism – The requested parallelism of the read. Parallelism may be limited by the number of files in the dataset.
meta_provider – File metadata provider. Custom metadata providers may be able to resolve file metadata more quickly and/or accurately.
ray_remote_args – kwargs passed to ray.remote in the read tasks.
arrow_open_file_args – kwargs passed to pyarrow.fs.FileSystem.open_input_file.
partition_filter – Path-based partition filter, if any. Can be used with a custom callback to read only selected partitions of a dataset. By default, this filters out any file paths whose file extension does not match *.png, *.jpg, *.jpeg, *.tiff, *.bmp, or *.gif.
partitioning – A Partitioning object that describes how paths are organized. Defaults to None.
.size – The desired height and width of loaded images. If unspecified, images retain their original shape.
mode – A Pillow mode describing the desired type and depth of pixels. If unspecified, image modes are inferred by Pillow.
include_paths – If True, include the path to each image. File paths are stored in the 'path' column.
ignore_missing_paths – If True, ignores any file/directory paths in paths that are not found. Defaults to False.
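The default partition_filter described above can be pictured with a short standalone sketch. This is an illustration under stated assumptions, not the FileExtensionFilter source: keep_path is a hypothetical name, and lowercasing the extension before comparing is an assumption that this page does not confirm.

```python
import posixpath

# Extensions from the default FileExtensionFilter shown in the signature.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".gif")


def keep_path(path, extensions=IMAGE_EXTENSIONS, allow_if_no_extensions=False):
    """Hypothetical helper: decide whether a path passes the extension filter."""
    _, extension = posixpath.splitext(path)
    if not extension:
        # Paths without any extension follow the allow_if_no_extensions flag.
        return allow_if_no_extensions
    # Assumption: compare case-insensitively.
    return extension.lower() in extensions


paths = ["root/dog/xxx.png", "root/dog/notes.txt", "root/README"]
kept = [p for p in paths if keep_path(p)]
print(kept)  # ['root/dog/xxx.png']
```

Under this sketch, non-image files that sit next to the images are skipped rather than causing a decoding error, which is why a directory path can be passed to read_images even when it holds other file types.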
- Returns
A Dataset producing tensors that represent the images at the specified paths. For information on working with tensors, read the tensor data guide.
- Raises
ValueError – if size contains non-positive numbers.
ValueError – if mode is unsupported.
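A caller who wants to fail fast before launching a read could run a similar check locally. The sketch below is only an approximation: check_image_args is a hypothetical helper, and the mode set is Pillow's documented list of standard modes, which may not be exactly the set that read_images accepts.

```python
# Pillow's standard modes; assumed here, not taken from this page.
PILLOW_STANDARD_MODES = {
    "1", "L", "P", "RGB", "RGBA", "CMYK", "YCbCr", "LAB", "HSV", "I", "F",
}


def check_image_args(size=None, mode=None):
    """Hypothetical pre-check mirroring the two documented ValueError cases."""
    if size is not None:
        # size is (height, width); both values must be positive.
        if len(size) != 2 or any(dim <= 0 for dim in size):
            raise ValueError(f"Expected `size` to hold two positive integers, got {size}.")
    if mode is not None and mode not in PILLOW_STANDARD_MODES:
        raise ValueError(f"Unsupported image mode: {mode!r}.")


check_image_args(size=(224, 224), mode="RGB")  # passes silently

try:
    check_image_args(size=(0, 224))
except ValueError as error:
    print(error)
```

Note that size is given as height then width, matching the parameter description above.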
PublicAPI (beta): This API is in beta and may change before becoming stable.