ray.data.Dataset.write_json#

Dataset.write_json(path: str, *, filesystem: pyarrow.fs.FileSystem | None = None, try_create_dir: bool = True, arrow_open_stream_args: Dict[str, Any] | None = None, filename_provider: FilenameProvider | None = None, pandas_json_args_fn: Callable[[], Dict[str, Any]] | None = None, num_rows_per_file: int | None = None, ray_remote_args: Dict[str, Any] = None, concurrency: int | None = None, **pandas_json_args) None[source]#

Writes the Dataset to JSON and JSONL files.

The number of files is determined by the number of blocks in the dataset. To control the number of number of blocks, call repartition().

This method is only supported for datasets with records that are convertible to pandas dataframes.

By default, the format of the output files is {uuid}_{block_idx}.json, where uuid is a unique id for the dataset. To modify this behavior, implement a custom FilenameProvider and pass it in as the filename_provider argument.

Note

This operation will trigger execution of the lazy transformations performed on this dataset.

Examples

Write the dataset as JSON file to a local directory.

>>> import ray
>>> import pandas as pd
>>> ds = ray.data.from_pandas([pd.DataFrame({"one": [1], "two": ["a"]})])
>>> ds.write_json("local:///tmp/data")

Write the dataset as JSONL files to a local directory.

>>> ds = ray.data.read_json("s3://anonymous@ray-example-data/train.jsonl")
>>> ds.write_json("local:///tmp/data")

Time complexity: O(dataset size / parallelism)

Parameters:
  • path – The path to the destination root directory, where the JSON files are written to.

  • filesystem – The pyarrow filesystem implementation to write to. These filesystems are specified in the pyarrow docs. Specify this if you need to provide specific configurations to the filesystem. By default, the filesystem is automatically selected based on the scheme of the paths. For example, if the path begins with s3://, the S3FileSystem is used.

  • try_create_dir – If True, attempts to create all directories in the destination path. Does nothing if all directories already exist. Defaults to True.

  • arrow_open_stream_args – kwargs passed to pyarrow.fs.FileSystem.open_output_stream, which is used when opening the file to write to.

  • filename_provider – A FilenameProvider implementation. Use this parameter to customize what your filenames look like.

  • pandas_json_args_fn – Callable that returns a dictionary of write arguments that are provided to pandas.DataFrame.to_json() when writing each block to a file. Overrides any duplicate keys from pandas_json_args. Use this parameter instead of pandas_json_args if any of your write arguments can’t be pickled, or if you’d like to lazily resolve the write arguments for each dataset block.

  • num_rows_per_file – [Experimental] The target number of rows to write to each file. If None, Ray Data writes a system-chosen number of rows to each file. The specified value is a hint, not a strict limit. Ray Data might write more or fewer rows to each file. In specific, if the number of rows per block is larger than the specified value, Ray Data writes the number of rows per block to each file.

  • ray_remote_args – kwargs passed to ray.remote() in the write tasks.

  • concurrency – The maximum number of Ray tasks to run concurrently. Set this to control number of tasks to run concurrently. This doesn’t change the total number of tasks run. By default, concurrency is dynamically decided based on the available resources.

  • pandas_json_args

    These args are passed to pandas.DataFrame.to_json(), which is used under the hood to write out each Dataset block. These are dict(orient=”records”, lines=True) by default.