ray.data.read_webdataset

ray.data.read_webdataset(paths: Union[str, List[str]], *, filesystem: Optional[pyarrow.fs.FileSystem] = None, parallelism: int = -1, arrow_open_stream_args: Optional[Dict[str, Any]] = None, meta_provider: Optional[ray.data.datasource.file_meta_provider.BaseFileMetadataProvider] = None, partition_filter: Optional[ray.data.datasource.partitioning.PathPartitionFilter] = None, decoder: Optional[Union[bool, str, callable, list]] = True, fileselect: Optional[Union[list, callable]] = None, filerename: Optional[Union[list, callable]] = None, suffixes: Optional[Union[list, callable]] = None, verbose_open: bool = False, shuffle: Optional[Literal['files']] = None) → ray.data.dataset.Dataset

Create a Dataset from WebDataset files.

Parameters
  • paths – A single file/directory path or a list of file/directory paths. A list of paths can contain both files and directories.

  • filesystem – The filesystem implementation to read from.

  • parallelism – The requested parallelism of the read. Parallelism may be limited by the number of files in the dataset.

  • arrow_open_stream_args – Keyword arguments passed to pyarrow.fs.FileSystem.open_input_stream. To read a compressed WebDataset file, pass the corresponding compression type (e.g. for GZIP or ZLIB, use arrow_open_stream_args={'compression': 'gzip'}).

  • meta_provider – File metadata provider. Custom metadata providers may be able to resolve file metadata more quickly and/or accurately. If None, this function uses a system-chosen implementation.

  • partition_filter – Path-based partition filter, if any. Can be used with a custom callback to read only selected partitions of a dataset.

  • decoder – A function or list of functions to decode the data.

  • fileselect – A callable or list of glob patterns to select files.

  • filerename – A function or list of tuples to rename files prior to grouping.

  • suffixes – A function or list of suffixes to select for creating samples.

  • verbose_open – Whether to print the file names as they are opened.

  • shuffle – If set to “files”, randomly shuffles the order of the input files before reading. Defaults to None, which does not shuffle.
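
The parameters above can be illustrated with a minimal sketch. It builds a tiny WebDataset-style tar shard using only the standard library, then wraps the read call in a function. The helper names write_shard and read_shard, the shard file name, and the sample keys are hypothetical and chosen for illustration; passing suffixes as a plain list of suffix strings is an assumption based on the parameter description above.

```python
import io
import os
import tarfile
import tempfile


def write_shard(path, samples):
    """Write a WebDataset-style tar shard: one tar member per (key, suffix) pair."""
    with tarfile.open(path, "w") as tar:
        for key, fields in samples.items():
            for suffix, payload in fields.items():
                info = tarfile.TarInfo(name=f"{key}.{suffix}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))


def read_shard(path):
    """Read the shard with Ray Data (hypothetical helper, not part of Ray)."""
    # Imported lazily so write_shard can be used without Ray installed.
    import ray

    # Assumption: a list of suffix strings keeps only those fields in each sample.
    return ray.data.read_webdataset(path, suffixes=["txt", "cls"], shuffle="files")


# Build a two-sample shard in a temporary directory.
shard_dir = tempfile.mkdtemp()
shard_path = os.path.join(shard_dir, "shard-000000.tar")
write_shard(
    shard_path,
    {
        "sample0000": {"txt": b"hello", "cls": b"0"},
        "sample0001": {"txt": b"world", "cls": b"1"},
    },
)

# Inspect the shard layout without Ray.
with tarfile.open(shard_path) as tar:
    member_names = sorted(tar.getnames())
```

With Ray installed, calling read_shard(shard_path) should return a Dataset built from the two samples. Its exact row layout depends on the decoder setting and is not shown here.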

Returns

A Dataset that contains the samples read from the WebDataset files.
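
As a rough illustration of how files inside a tar archive become samples, the sketch below follows the usual WebDataset convention: files that share a key (the path up to the first dot of the file name) are grouped into one sample, and the rest of the name is the suffix. This is an assumption about the convention, not Ray's actual implementation. The names split_key_suffix and group_members are hypothetical.

```python
import re

# Assumed convention: key = path up to the first dot of the file name,
# suffix = everything after that dot.
_KEY_SUFFIX = re.compile(r"^((?:.*/)?[^/.]+)\.([^/]*)$")


def split_key_suffix(name):
    """Split a tar member name into (key, suffix); (None, None) if it has no suffix."""
    match = _KEY_SUFFIX.match(name)
    if match is None:
        return None, None
    return match.group(1), match.group(2)


def group_members(names):
    """Group member names by key, collecting the suffixes seen for each key."""
    samples = {}
    for name in names:
        key, suffix = split_key_suffix(name)
        if key is None:
            continue  # names without a suffix cannot be assigned to a sample
        samples.setdefault(key, []).append(suffix)
    return samples


grouped = group_members(["a/0001.jpg", "a/0001.seg.png", "a/0002.jpg", "README"])
```

Under this convention, the suffixes and filerename parameters can be read as operating on the suffix and name parts before the grouping step.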

Raises

ValueError – If a file contains a message that isn’t a tf.train.Example.

PublicAPI (alpha): This API is in alpha and may change before becoming stable.