ray.data.read_json
- ray.data.read_json(paths: Union[str, List[str]], *, filesystem: Optional[pyarrow.fs.FileSystem] = None, parallelism: int = -1, ray_remote_args: Dict[str, Any] = None, arrow_open_stream_args: Optional[Dict[str, Any]] = None, meta_provider: ray.data.datasource.file_meta_provider.BaseFileMetadataProvider = <ray.data.datasource.file_meta_provider.DefaultFileMetadataProvider object>, partition_filter: Optional[ray.data.datasource.partitioning.PathPartitionFilter] = FileExtensionFilter(extensions=['.json'], allow_if_no_extensions=False), partitioning: ray.data.datasource.partitioning.Partitioning = Partitioning(style='hive', base_dir='', field_names=None, filesystem=None), ignore_missing_paths: bool = False, **arrow_json_args) → ray.data.dataset.Dataset [source]
Create an Arrow dataset from JSON files.
Examples

>>> import ray
>>> # Read a directory of files in remote storage.
>>> ray.data.read_json("s3://bucket/path")

>>> # Read multiple local files.
>>> ray.data.read_json(["/path/to/file1", "/path/to/file2"])

>>> # Read multiple directories.
>>> ray.data.read_json(
...     ["s3://bucket/path1", "s3://bucket/path2"])
By default, read_json parses Hive-style partitions from file paths. If your data adheres to a different partitioning scheme, set the partitioning parameter.

>>> ds = ray.data.read_json("example://year=2022/month=09/sales.json")
>>> ds.take(1)
[{'order_number': 10107, 'quantity': 30, 'year': '2022', 'month': '09'}]
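For paths that encode partition values as bare directory names instead of Hive-style key=value pairs, a "dir"-style Partitioning maps directory components to named fields. The sketch below mimics that extraction with the standard library to show what the reader produces; the file layout and the commented-out Ray call are illustrative assumptions, not part of this API's examples.

```python
from pathlib import PurePosixPath

def parse_dir_partitions(path: str, base_dir: str, field_names: list) -> dict:
    # Directory components between base_dir and the file name map to
    # field names in order, e.g. /data/sales/2022/09/sales.json ->
    # {"year": "2022", "month": "09"}.
    rel = PurePosixPath(path).relative_to(base_dir)
    return dict(zip(field_names, rel.parts[:-1]))

print(parse_dir_partitions(
    "/data/sales/2022/09/sales.json", "/data/sales", ["year", "month"]))

# With Ray, the equivalent (hypothetical path) would be:
# from ray.data.datasource.partitioning import Partitioning
# ds = ray.data.read_json(
#     "/data/sales",
#     partitioning=Partitioning("dir", field_names=["year", "month"]),
# )
```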
- Parameters
paths – A single file/directory path or a list of file/directory paths. A list of paths can contain both files and directories.
filesystem – The filesystem implementation to read from.
parallelism – The requested parallelism of the read. Parallelism may be limited by the number of files of the dataset.
ray_remote_args – kwargs passed to ray.remote in the read tasks.
arrow_open_stream_args – kwargs passed to pyarrow.fs.FileSystem.open_input_stream.
meta_provider – File metadata provider. Custom metadata providers may be able to resolve file metadata more quickly and/or accurately.
partition_filter – Path-based partition filter, if any. Can be used with a custom callback to read only selected partitions of a dataset. By default, this filters out any file paths whose file extension does not match “.json”.
partitioning – A Partitioning object that describes how paths are organized. By default, this function parses Hive-style partitions.
ignore_missing_paths – If True, ignores any file paths in paths that are not found. Defaults to False.
arrow_json_args – Other JSON read options to pass to PyArrow.
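As noted for partition_filter, a custom callback can restrict a read to selected partitions. A minimal sketch, assuming Hive-style paths like ".../year=2022/month=09/sales.json": the callback itself is plain Python, while the commented-out wiring through PathPartitionFilter.of and the bucket path are assumptions for illustration.

```python
# Callback receives a dict of partition field names to values and
# returns True for partitions that should be read.
def keep_year_2022(partitions: dict) -> bool:
    return partitions.get("year") == "2022"

print(keep_year_2022({"year": "2022", "month": "09"}))  # True

# Hypothetical usage with read_json:
# from ray.data.datasource.partitioning import PathPartitionFilter
# ds = ray.data.read_json(
#     "s3://bucket/path",
#     partition_filter=PathPartitionFilter.of(keep_year_2022),
# )
```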
- Returns
Dataset producing Arrow records read from the specified paths.
PublicAPI: This API is stable across Ray releases.