Dataset.write_parquet(path: str, *, filesystem: Optional[pyarrow.fs.FileSystem] = None, try_create_dir: bool = True, arrow_open_stream_args: Optional[Dict[str, Any]] = None, block_path_provider: ray.data.datasource.block_path_provider.BlockWritePathProvider = <ray.data.datasource.block_path_provider.DefaultBlockWritePathProvider object>, arrow_parquet_args_fn: Callable[[], Dict[str, Any]] = <function Dataset.<lambda>>, ray_remote_args: Dict[str, Any] = None, **arrow_parquet_args) -> None

Writes the Dataset to parquet files under the provided path.

The number of output files equals the number of blocks in the dataset. To control the number of blocks, call repartition().

If pyarrow can’t represent your data, this method errors.

By default, the format of the output files is {uuid}_{block_idx}.parquet, where uuid is a unique id for the dataset. To modify this behavior, implement a custom BlockWritePathProvider and pass it in as the block_path_provider argument.


This operation will trigger execution of the lazy transformations performed on this dataset.


Examples:

>>> import ray
>>> ds = ray.data.range(100)
>>> ds.write_parquet("local:///tmp/data/")

Time complexity: O(dataset size / parallelism)

Parameters:

  • path – The path to the destination root directory where the Parquet files are written.

  • filesystem – The pyarrow filesystem implementation to write to. These filesystems are specified in the pyarrow docs. Specify this if you need to provide specific configurations to the filesystem. By default, the filesystem is automatically selected based on the scheme of the paths. For example, if the path begins with s3://, the S3FileSystem is used.

  • try_create_dir – If True, attempts to create all directories in the destination path. Does nothing if all directories already exist. Defaults to True.

  • arrow_open_stream_args – kwargs passed to pyarrow.fs.FileSystem.open_output_stream, which is used when opening the file to write to.

  • block_path_provider – A BlockWritePathProvider implementation specifying the filename structure for each output parquet file. By default, the format of the output files is {uuid}_{block_idx}.parquet, where uuid is a unique id for the dataset.

  • arrow_parquet_args_fn – Callable that returns a dictionary of write arguments that are provided to pyarrow.parquet.write_table() when writing each block to a file. Overrides any duplicate keys from arrow_parquet_args. Use this argument instead of arrow_parquet_args if any of your write arguments can’t be pickled, or if you’d like to lazily resolve the write arguments for each dataset block.

  • ray_remote_args – Kwargs passed to remote() in the write tasks.

  • arrow_parquet_args – Options to pass to pyarrow.parquet.write_table(), which is used to write out each block to a file.