ray.data.Dataset.write_tfrecords

Dataset.write_tfrecords(path: str, *, filesystem: Optional[pyarrow.fs.FileSystem] = None, try_create_dir: bool = True, arrow_open_stream_args: Optional[Dict[str, Any]] = None, block_path_provider: ray.data.datasource.file_based_datasource.BlockWritePathProvider = <ray.data.datasource.file_based_datasource.DefaultBlockWritePathProvider object>, ray_remote_args: Dict[str, Any] = None) -> None

Write the dataset to TFRecord files.

The TFRecord files contain tf.train.Example records, with one Example record for each row in the dataset.

Warning

tf.train.Feature natively stores only ints, floats, and bytes, so this function supports only datasets with these data types and raises an error if the dataset contains unsupported types.
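
Because unsupported types only surface as an error at write time, a pre-flight check over a sample of rows can be handy. The following is a minimal sketch in plain Python, not part of Ray: unsupported_columns and SUPPORTED_SCALARS are hypothetical names, and treating str values (as in the "name" column of the example below) and lists of scalars as acceptable is an assumption about how values are converted.

```python
from typing import Any, Dict, List

# Hypothetical helper, not part of Ray. Assumption: these Python types map
# onto the int, float, and bytes lists that tf.train.Feature can store
# (str is assumed to be encoded to bytes).
SUPPORTED_SCALARS = (int, float, bytes, str)


def unsupported_columns(rows: List[Dict[str, Any]]) -> List[str]:
    """Return the names of columns holding values outside SUPPORTED_SCALARS."""
    bad = set()
    for row in rows:
        for column, value in row.items():
            # Assumption: a list or tuple is acceptable if every element is.
            values = value if isinstance(value, (list, tuple)) else [value]
            if not all(isinstance(v, SUPPORTED_SCALARS) for v in values):
                bad.add(column)
    return sorted(bad)


rows = [
    {"name": "foo", "score": 42, "meta": {"a": 1}},
    {"name": "bar", "score": 43, "meta": {"b": 2}},
]
print(unsupported_columns(rows))  # ['meta']
```

Here the dict-valued "meta" column is flagged, so it would need to be dropped or serialized to bytes before writing.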

This is only supported for datasets convertible to Arrow records. To control the number of files, use .repartition().
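
As a sketch of the repartitioning advice above, the wrapper below chains .repartition() with write_tfrecords(). The function name write_tfrecords_in_n_files is hypothetical, and the sketch assumes that each block of the repartitioned dataset is written to its own file.

```python
def write_tfrecords_in_n_files(ds, path: str, num_files: int) -> None:
    """Hypothetical wrapper: repartition a dataset, then write it out.

    `ds` is expected to be a ray.data.Dataset (or any object exposing the
    same repartition() and write_tfrecords() methods).
    """
    # Assumption: one output file is produced per block, so the block count
    # chosen here determines the number of TFRecord files.
    ds.repartition(num_files).write_tfrecords(path)
```

With the dataset from the example below, write_tfrecords_in_n_files(ds, "s3://bucket/path", 4) would be expected to produce four files under that path.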

Unless a custom block path provider is given, the output files are named {uuid}_{block_idx}.tfrecords, where uuid is a unique ID for the dataset.
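
To make the default naming pattern concrete, the plain-Python sketch below generates names of that shape. It does not call Ray: sketch_output_names is a hypothetical helper, and the ID format (a uuid4 hex string), the zero-based block index, and the absence of zero-padding are all assumptions made for illustration.

```python
import uuid
from typing import List


def sketch_output_names(num_blocks: int) -> List[str]:
    """Build illustrative file names of the form {uuid}_{block_idx}.tfrecords."""
    # Assumption: one ID is shared by every file of the same dataset write;
    # the real ID format may differ from uuid4().hex.
    dataset_uuid = uuid.uuid4().hex
    return [
        f"{dataset_uuid}_{block_idx}.tfrecords"
        for block_idx in range(num_blocks)
    ]


for name in sketch_output_names(3):
    print(name)
```

All names from one call share the same prefix and differ only in the block index, which is what makes the files of a single write easy to group with a glob such as {uuid}_*.tfrecords.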

Examples

>>> import ray
>>> ds = ray.data.from_items([
...     { "name": "foo", "score": 42 },
...     { "name": "bar", "score": 43 },
... ])
>>> ds.write_tfrecords("s3://bucket/path") 

Time complexity: O(dataset size / parallelism)

Parameters
  • path – The path to the destination root directory where the TFRecord files are written.

  • filesystem – The filesystem implementation to write to.

  • try_create_dir – If True, try to create all directories in the destination path. Does nothing if all directories already exist.

  • arrow_open_stream_args – Kwargs passed to pyarrow.fs.FileSystem.open_output_stream.

  • block_path_provider – BlockWritePathProvider implementation to write each dataset block to a custom output path.

  • ray_remote_args – Kwargs passed to ray.remote in the write tasks.