Input/Output#

Synthetic Data#

range(n, *[, parallelism])

Create a dataset from a range of integers [0..n).

range_table(n, *[, parallelism])

Create a tabular dataset from a range of integers [0..n).

range_tensor(n, *[, shape, parallelism])

Create a Tensor dataset from a range of integers [0..n).
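
These generators are handy for smoke tests and benchmarks that need no external files. A minimal sketch (the sizes and shapes are arbitrary):

    import ray

    ds = ray.data.range(1000)                             # integers 0..999
    table_ds = ray.data.range_table(1000)                 # tabular, one "value" column
    tensor_ds = ray.data.range_tensor(100, shape=(2, 2))  # one 2x2 tensor per record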

Python Objects#

from_items(items, *[, parallelism])

Create a dataset from a list of local Python objects.
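
For example, building a small dataset from in-memory records (the records here are made up):

    import ray

    ds = ray.data.from_items([
        {"name": "a", "score": 1},
        {"name": "b", "score": 2},
    ])
    print(ds.take(2))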

Parquet#

read_parquet(paths, *[, filesystem, ...])

Create an Arrow dataset from Parquet files.

read_parquet_bulk(paths, *[, filesystem, ...])

Create an Arrow dataset from a large number (such as >1K) of Parquet files quickly.

Dataset.write_parquet(path, *[, filesystem, ...])

Write the dataset to Parquet files.
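
A minimal round trip; both paths are placeholders, and local paths, directories, and S3/GCS URIs all work:

    import ray

    ds = ray.data.read_parquet("/tmp/input_parquet/")  # placeholder path
    ds.write_parquet("/tmp/output_parquet/")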

CSV#

read_csv(paths, *[, filesystem, ...])

Create an Arrow dataset from CSV files.

Dataset.write_csv(path, *[, filesystem, ...])

Write the dataset to CSV files.
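
A minimal sketch with placeholder paths:

    import ray

    ds = ray.data.read_csv("/tmp/data.csv")  # placeholder path
    ds.write_csv("/tmp/csv_out/")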

JSON#

read_json(paths, *[, filesystem, ...])

Create an Arrow dataset from JSON files.

Dataset.write_json(path, *[, filesystem, ...])

Write the dataset to JSON files.
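
A minimal sketch with placeholder paths (the input is typically newline-delimited JSON records):

    import ray

    ds = ray.data.read_json("/tmp/records.json")  # placeholder path
    ds.write_json("/tmp/json_out/")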

Text#

read_text(paths, *[, encoding, errors, ...])

Create a dataset from lines stored in text files.
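
Each line of each file becomes one record. A sketch with a placeholder path:

    import ray

    ds = ray.data.read_text("/tmp/logs.txt", encoding="utf-8")  # placeholder path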

Images#

read_images(paths, *[, filesystem, ...])

Read images from the specified paths.
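
A sketch with a placeholder directory; image files are decoded into tensor records:

    import ray

    ds = ray.data.read_images("/tmp/images/")  # placeholder directory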

Binary#

read_binary_files(paths, *[, include_paths, ...])

Create a dataset from binary files of arbitrary contents.
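
A sketch with a placeholder directory; include_paths=True keeps each source path alongside its raw bytes:

    import ray

    ds = ray.data.read_binary_files("/tmp/blobs/", include_paths=True)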

TFRecords#

read_tfrecords(paths, *[, filesystem, ...])

Create a dataset from TFRecord files that contain tf.train.Example messages.

Dataset.write_tfrecords(path, *[, ...])

Write the dataset to TFRecord files.
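
A minimal round trip with placeholder paths; writing requires records whose values map onto tf.train.Example features:

    import ray

    ds = ray.data.read_tfrecords("/tmp/data.tfrecords")  # placeholder path
    ds.write_tfrecords("/tmp/tfrecords_out/")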

Pandas#

from_pandas(dfs)

Create a dataset from a list of Pandas DataFrames.

from_pandas_refs(dfs)

Create a dataset from a list of Ray object references to Pandas DataFrames.

Dataset.to_pandas([limit])

Convert this dataset into a single Pandas DataFrame.

Dataset.to_pandas_refs()

Convert this dataset into a distributed set of Pandas DataFrames.
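
A round trip between Pandas and Ray; note that to_pandas() gathers the whole dataset onto the driver, so the optional limit guards against out-of-memory:

    import pandas as pd
    import ray

    df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
    ds = ray.data.from_pandas(df)  # a single DataFrame or a list of them
    df_back = ds.to_pandas()       # pulls all blocks onto the driver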

NumPy#

read_numpy(paths, *[, filesystem, ...])

Create an Arrow dataset from NumPy files.

from_numpy(ndarrays)

Create a dataset from a list of NumPy ndarrays.

from_numpy_refs(ndarrays)

Create a dataset from a list of Ray object references to NumPy ndarrays.

Dataset.write_numpy(path, *[, column, ...])

Write a tensor column of the dataset to .npy files.

Dataset.to_numpy_refs(*[, column])

Convert this dataset into a distributed set of NumPy ndarrays.
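
A sketch converting in-memory ndarrays; one record is created per outer slice of the array:

    import numpy as np
    import ray

    ds = ray.data.from_numpy(np.arange(12).reshape(3, 4))  # 3 records of shape (4,)
    refs = ds.to_numpy_refs()  # Ray object references, one ndarray per block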

Arrow#

from_arrow(tables)

Create a dataset from a list of Arrow tables.

from_arrow_refs(tables)

Create a dataset from a list of Ray object references to Arrow tables.

Dataset.to_arrow_refs()

Convert this dataset into a distributed set of Arrow tables.
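
A sketch converting an in-memory Arrow table:

    import pyarrow as pa
    import ray

    table = pa.table({"a": [1, 2], "b": ["x", "y"]})
    ds = ray.data.from_arrow(table)  # a single table or a list of tables
    refs = ds.to_arrow_refs()        # object references, one table per block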

MongoDB#

read_mongo(uri, database, collection, *[, ...])

Create an Arrow dataset from MongoDB.

Dataset.write_mongo(uri, database, collection)

Write the dataset to a MongoDB datasource.
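
A sketch; the URI, database, and collection names below are placeholders:

    import ray

    ds = ray.data.read_mongo(
        uri="mongodb://localhost:27017",  # placeholder URI
        database="my_db",                 # placeholder database
        collection="my_collection",       # placeholder collection
    )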

SQL Databases#

read_sql(sql, connection_factory, *[, ...])

Read from a database that provides a Python DB API2-compliant connector.
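
The connection factory is any zero-argument callable that returns a DB API2 connection. A sketch using sqlite3 from the standard library (the database file and table are hypothetical):

    import sqlite3
    import ray

    def create_connection():
        return sqlite3.connect("example.db")  # hypothetical database file

    ds = ray.data.read_sql("SELECT * FROM movies", create_connection)  # hypothetical table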

Dask#

from_dask(df)

Create a dataset from a Dask DataFrame.

Dataset.to_dask([meta])

Convert this dataset into a Dask DataFrame.
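
A round trip with a small Dask DataFrame; from_dask executes the underlying Dask graph on Ray:

    import dask.dataframe as dd
    import pandas as pd
    import ray

    ddf = dd.from_pandas(pd.DataFrame({"a": range(10)}), npartitions=2)
    ds = ray.data.from_dask(ddf)
    ddf_back = ds.to_dask()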

Spark#

from_spark(df, *[, parallelism])

Create a dataset from a Spark DataFrame.

Dataset.to_spark(spark)

Convert this dataset into a Spark DataFrame.

Modin#

from_modin(df)

Create a dataset from a Modin DataFrame.

Dataset.to_modin()

Convert this dataset into a Modin DataFrame.

Mars#

from_mars(df)

Create a dataset from a Mars DataFrame.

Dataset.to_mars()

Convert this dataset into a Mars DataFrame.

Torch#

from_torch(dataset)

Create a dataset from a Torch dataset.
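
A sketch using torchvision's FakeData so nothing has to be downloaded; any map-style Torch dataset works the same way:

    import torchvision
    import ray

    torch_ds = torchvision.datasets.FakeData(size=16)
    ds = ray.data.from_torch(torch_ds)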

Hugging Face#

from_huggingface(dataset)

Create a dataset from a Hugging Face Datasets Dataset.
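
A sketch that builds a tiny in-memory Hugging Face dataset to avoid a download:

    import datasets
    import ray

    hf_ds = datasets.Dataset.from_dict({"text": ["hello", "world"]})
    ds = ray.data.from_huggingface(hf_ds)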

TensorFlow#

from_tf(dataset)

Create a dataset from a TensorFlow dataset.
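
A sketch with a small in-memory tf.data pipeline; the TensorFlow dataset is materialized eagerly, so this suits small datasets:

    import tensorflow as tf
    import ray

    tf_ds = tf.data.Dataset.from_tensor_slices({"x": [1, 2, 3]})
    ds = ray.data.from_tf(tf_ds)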

WebDataset#

read_webdataset(paths, *[, filesystem, ...])

Create a dataset from WebDataset files.
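
A sketch with a placeholder shard path; WebDataset files are tar archives whose member names group files into records:

    import ray

    ds = ray.data.read_webdataset("/tmp/shards/shard-000000.tar")  # placeholder path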

Datasource API#

read_datasource(datasource, *[, ...])

Read a dataset from a custom data source.

Dataset.write_datasource(datasource, *[, ...])

Write the dataset to a custom datasource.

Datasource(*args, **kwds)

Interface for defining a custom ray.data.Dataset datasource.

ReadTask(read_fn, metadata)

A function used to read blocks from the dataset.

datasource.Reader(*args, **kwds)

A bound read operation for a datasource.
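
For example, reading through a built-in datasource explicitly, which is equivalent in spirit to the corresponding read_csv call; the path is a placeholder, and read arguments such as paths are forwarded to the datasource:

    import ray
    from ray.data.datasource import CSVDatasource

    ds = ray.data.read_datasource(CSVDatasource(), paths="/tmp/data.csv")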

Built-in Datasources#

datasource.BinaryDatasource(*args, **kwds)

Binary datasource, for reading and writing binary files.

datasource.CSVDatasource(*args, **kwds)

CSV datasource, for reading and writing CSV files.

datasource.FileBasedDatasource(*args, **kwds)

File-based datasource, for reading and writing files.

datasource.ImageDatasource(*args, **kwds)

A datasource that lets you read images.

datasource.JSONDatasource(*args, **kwds)

JSON datasource, for reading and writing JSON files.

datasource.NumpyDatasource(*args, **kwds)

NumPy datasource, for reading and writing NumPy files.

datasource.ParquetDatasource(*args, **kwds)

Parquet datasource, for reading and writing Parquet files.

datasource.RangeDatasource(*args, **kwds)

An example datasource that generates ranges of numbers from [0..n).

datasource.TFRecordDatasource(*args, **kwds)

TFRecord datasource, for reading and writing TFRecord files.

datasource.MongoDatasource(*args, **kwds)

Datasource for reading from and writing to MongoDB.

datasource.WebDatasetDatasource(*args, **kwds)

A Datasource for WebDataset datasets (tar format with naming conventions).

Partitioning API#

datasource.Partitioning(style[, base_dir, ...])

Partition scheme used to describe path-based partitions.

datasource.PartitionStyle(value)

Supported dataset partition styles.

datasource.PathPartitionEncoder(partitioning)

Callable that generates directory path strings for path-based partition formats.

datasource.PathPartitionParser(partitioning)

Partition parser for path-based partition formats.

datasource.PathPartitionFilter(...)

Partition filter for path-based partition formats.
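
For example, reading hive-style key=value directories under a base directory (the layout and paths below are hypothetical):

    import ray
    from ray.data.datasource import Partitioning

    # Hypothetical layout: /tmp/sales/year=2023/month=01/part-0.csv
    partitioning = Partitioning("hive", base_dir="/tmp/sales")
    ds = ray.data.read_csv("/tmp/sales", partitioning=partitioning)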

MetadataProvider API#

datasource.FileMetadataProvider()

Abstract callable that provides metadata for the files of a single dataset block.

datasource.BaseFileMetadataProvider()

Abstract callable that provides metadata for FileBasedDatasource implementations.

datasource.ParquetMetadataProvider()

Abstract callable that provides block metadata for Arrow Parquet file fragments.

datasource.DefaultFileMetadataProvider()

Default metadata provider for FileBasedDatasource implementations that reuse the base prepare_read method.

datasource.DefaultParquetMetadataProvider()

The default file metadata provider for ParquetDatasource.

datasource.FastFileMetadataProvider()

Fast metadata provider for FileBasedDatasource implementations.
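
For example, swapping in the fast provider to skip per-file metadata fetches when reading many small files (the path is a placeholder):

    import ray
    from ray.data.datasource import FastFileMetadataProvider

    ds = ray.data.read_binary_files(
        "/tmp/many_small_files/",
        meta_provider=FastFileMetadataProvider(),
    )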