ray.data.Dataset.write_tfrecords
- Dataset.write_tfrecords(path: str, *, tf_schema: Optional[schema_pb2.Schema] = None, filesystem: Optional[pyarrow.fs.FileSystem] = None, try_create_dir: bool = True, arrow_open_stream_args: Optional[Dict[str, Any]] = None, block_path_provider: BlockWritePathProvider = DefaultBlockWritePathProvider(), ray_remote_args: Dict[str, Any] = None) -> None
Write the dataset to TFRecord files.
The TFRecord files will contain tf.train.Example records, with one Example record for each row in the dataset.
Warning
tf.train.Feature only natively stores ints, floats, and bytes, so this function only supports datasets with these data types, and will error if the dataset contains unsupported types.
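For example, here is a minimal sketch of a dataset that stays within these types, using only int, float, and bytes columns (the local output path is illustrative):
>>> import ray
>>> ds = ray.data.from_items([
...     {"id": 1, "score": 0.5, "payload": b"abc"},
...     {"id": 2, "score": 0.75, "payload": b"def"},
... ])
>>> ds.write_tfrecords("/tmp/supported_types")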
This is only supported for datasets convertible to Arrow records. To control the number of files, use Dataset.repartition(), as sketched below. Unless a custom block path provider is given, the format of the output files is {uuid}_{block_idx}.tfrecords, where uuid is a unique id for the dataset.
Note
This operation will trigger execution of the lazy transformations performed on this dataset.
Examples
>>> import ray
>>> ds = ray.data.from_items([
...     {"name": "foo", "score": 42},
...     {"name": "bar", "score": 43},
... ])
>>> ds.write_tfrecords("s3://bucket/path")
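If needed, the written files can be read back into a dataset; this assumes the same Ray version also provides ray.data.read_tfrecords (the path mirrors the example above):
>>> ds_back = ray.data.read_tfrecords("s3://bucket/path")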
Time complexity: O(dataset size / parallelism)
- Parameters
path – The path to the destination root directory, where the TFRecord files are written.
tf_schema – (Optional) A tensorflow_metadata schema_pb2.Schema; if provided, it is used when writing the records.
filesystem – The filesystem implementation to write to.
try_create_dir – Try to create all directories in the destination path if True. Does nothing if all directories already exist.
arrow_open_stream_args – kwargs passed to pyarrow.fs.FileSystem.open_output_stream.
block_path_provider – BlockWritePathProvider implementation to write each dataset block to a custom output path.
ray_remote_args – Kwargs passed to ray.remote in the write tasks.
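To illustrate several of these parameters together, here is a sketch under illustrative assumptions (the local path, buffer size, and CPU request are arbitrary choices, not recommendations):
>>> import pyarrow.fs
>>> ds.write_tfrecords(
...     "/tmp/tfrecords_out",
...     filesystem=pyarrow.fs.LocalFileSystem(),
...     try_create_dir=True,
...     arrow_open_stream_args={"buffer_size": 4 * 1024 * 1024},
...     ray_remote_args={"num_cpus": 1},
... )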