ray.data.read_csv#

ray.data.read_csv(paths: Union[str, List[str]], *, filesystem: Optional[pyarrow.fs.FileSystem] = None, parallelism: int = -1, ray_remote_args: Dict[str, Any] = None, arrow_open_stream_args: Optional[Dict[str, Any]] = None, meta_provider: ray.data.datasource.file_meta_provider.BaseFileMetadataProvider = <ray.data.datasource.file_meta_provider.DefaultFileMetadataProvider object>, partition_filter: Optional[ray.data.datasource.partitioning.PathPartitionFilter] = None, partitioning: ray.data.datasource.partitioning.Partitioning = Partitioning(style='hive', base_dir='', field_names=None, filesystem=None), ignore_missing_paths: bool = False, **arrow_csv_args) ray.data.dataset.Dataset[source]#

Create an Arrow Dataset from CSV files.

Examples

>>> import ray
>>> # Read a directory of files in remote storage.
>>> ray.data.read_csv("s3://bucket/path") 
>>> # Read multiple local files.
>>> ray.data.read_csv(["/path/to/file1", "/path/to/file2"]) 
>>> # Read multiple directories.
>>> ray.data.read_csv( 
...     ["s3://bucket/path1", "s3://bucket/path2"])
>>> # Read files that use a different delimiter. For more uses of ParseOptions see
>>> # https://arrow.apache.org/docs/python/generated/pyarrow.csv.ParseOptions.html
>>> from pyarrow import csv
>>> parse_options = csv.ParseOptions(delimiter="\t")
>>> ray.data.read_csv( 
...     "example://iris.tsv",
...     parse_options=parse_options)
>>> # Convert a date column with a custom format from a CSV file.
>>> # For more uses of ConvertOptions see
>>> # https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html
>>> from pyarrow import csv
>>> convert_options = csv.ConvertOptions(
...     timestamp_parsers=["%m/%d/%Y"])
>>> ray.data.read_csv( 
...     "example://dow_jones_index.csv",
...     convert_options=convert_options)

By default, read_csv parses Hive-style partitions from file paths. If your data adheres to a different partitioning scheme, set the partitioning parameter.

>>> ds = ray.data.read_csv("example://year=2022/month=09/sales.csv")  
>>> ds.take(1)  
[{'order_number': 10107, 'quantity': 30, 'year': '2022', 'month': '09'}]
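Hive-style partitioning encodes column values as key=value directory names in the path. A hypothetical stdlib-only sketch of the per-path parsing this implies (illustration only, not the actual Ray implementation):

```python
import posixpath


def parse_hive_partitions(path):
    """Extract key=value directory segments from a file path."""
    fields = {}
    for segment in posixpath.dirname(path).split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            fields[key] = value
    return fields


print(parse_hive_partitions("s3://bucket/year=2022/month=09/sales.csv"))
# {'year': '2022', 'month': '09'}
```

Note that the parsed values are strings ('2022', '09'), matching the record shown above; partition values are not cast to numeric types.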

By default, read_csv reads all files under the given paths. If you want to read only files with particular extensions, set the partition_filter parameter.

>>> # Read only *.csv files from multiple directories.
>>> from ray.data.datasource import FileExtensionFilter
>>> ray.data.read_csv( 
...     ["s3://bucket/path1", "s3://bucket/path2"],
...     partition_filter=FileExtensionFilter("csv"))
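FileExtensionFilter keeps only paths whose suffix matches the given extension. A hypothetical stdlib sketch of the equivalent filtering logic (illustration only, not the Ray class; the example paths are made up):

```python
def filter_by_extension(paths, extension):
    """Keep only paths whose suffix matches the given extension."""
    suffix = "." + extension.lstrip(".")
    return [p for p in paths if p.endswith(suffix)]


paths = [
    "s3://bucket/path1/sales.csv",
    "s3://bucket/path1/_SUCCESS",  # marker file, no extension
    "s3://bucket/path2/inventory.csv",
    "s3://bucket/path2/notes.txt",
]
print(filter_by_extension(paths, "csv"))
# ['s3://bucket/path1/sales.csv', 's3://bucket/path2/inventory.csv']
```

This is useful when directories mix data files with metadata or marker files (such as _SUCCESS) that would otherwise fail to parse as CSV.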
Parameters
  • paths – A single file/directory path or a list of file/directory paths. A list of paths can contain both files and directories.

  • filesystem – The filesystem implementation to read from.

  • parallelism – The requested parallelism of the read. Parallelism may be limited by the number of files in the dataset.

  • ray_remote_args – kwargs passed to ray.remote in the read tasks.

  • arrow_open_stream_args – kwargs passed to pyarrow.fs.FileSystem.open_input_stream.

  • meta_provider – File metadata provider. Custom metadata providers may be able to resolve file metadata more quickly and/or accurately.

  • partition_filter – Path-based partition filter, if any. Can be used with a custom callback to read only selected partitions of a dataset. By default, no files are filtered out. To read only files with a given extension, e.g. “.csv”, pass FileExtensionFilter("csv").

  • partitioning – A Partitioning object that describes how paths are organized. By default, this function parses Hive-style partitions.

  • ignore_missing_paths – If True, ignores any file paths in paths that are not found. Defaults to False.

  • arrow_csv_args – Other CSV read options to pass to pyarrow.

Returns

Dataset producing Arrow records read from the specified paths.

PublicAPI: This API is stable across Ray releases.