ray.data.Dataset.write_webdataset#
- Dataset.write_webdataset(path: str, *, filesystem: pyarrow.fs.FileSystem | None = None, try_create_dir: bool = True, arrow_open_stream_args: Dict[str, Any] | None = None, filename_provider: FilenameProvider | None = None, num_rows_per_file: int | None = None, ray_remote_args: Dict[str, Any] = None, encoder: bool | str | callable | list | None = True, concurrency: int | None = None) None [source]#
Writes the dataset to WebDataset files.
The TFRecord files will contain tf.train.Example # noqa: E501 records, with one Example record for each row in the dataset.
Warning
tf.train.Feature only natively stores ints, floats, and bytes, so this function only supports datasets with these data types, and will error if the dataset contains unsupported types.
This is only supported for datasets convertible to Arrow records. To control the number of files, use
Dataset.repartition()
.Unless a custom filename provider is given, the format of the output files is
{uuid}_{block_idx}.tfrecords
, whereuuid
is a unique id for the dataset.Note
This operation will trigger execution of the lazy transformations performed on this dataset.
Examples
import ray ds = ray.data.range(100) ds.write_webdataset("s3://bucket/folder/")
Time complexity: O(dataset size / parallelism)
- Parameters:
path – The path to the destination root directory, where tfrecords files are written to.
filesystem – The filesystem implementation to write to.
try_create_dir – If
True
, attempts to create all directories in the destination path. Does nothing if all directories already exist. Defaults toTrue
.arrow_open_stream_args – kwargs passed to
pyarrow.fs.FileSystem.open_output_stream
filename_provider – A
FilenameProvider
implementation. Use this parameter to customize what your filenames look like.num_rows_per_file – The target number of rows to write to each file. If
None
, Ray Data writes a system-chosen number of rows to each file. The specified value is a hint, not a strict limit. Ray Data might write more or fewer rows to each file.ray_remote_args – Kwargs passed to
ray.remote
in the write tasks.concurrency – The maximum number of Ray tasks to run concurrently. Set this to control number of tasks to run concurrently. This doesn’t change the total number of tasks run. By default, concurrency is dynamically decided based on the available resources.
PublicAPI (alpha): This API is in alpha and may change before becoming stable.