Loading Data API#

Synthetic Data#

range

Creates a Dataset from a range of integers [0..n).

range_tensor

Creates a Dataset of tensors of the provided shape from the range [0..n).
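As a rough illustration of the two synthetic sources above, the sketch below emulates the expected row layout in plain Python and then wraps the corresponding Ray calls. The column names `id` and `data`, the `shape` keyword, and all helper names are assumptions that this page does not state, so check them against your installed Ray version.

```python
def filled(shape, value):
    """Build a nested list of the given shape with every element set to value."""
    if not shape:
        return value
    return [filled(shape[1:], value) for _ in range(shape[0])]


def emulate_range(n):
    """Plain-Python stand-in for range: one row per integer in [0..n)."""
    # Assumption: the integer column is named "id".
    return [{"id": i} for i in range(n)]


def emulate_range_tensor(n, shape):
    """Plain-Python stand-in for range_tensor: row i holds a tensor filled with i."""
    # Assumption: the tensor column is named "data".
    return [{"data": filled(shape, i)} for i in range(n)]


def load_with_ray(n, shape):
    """Build the real Datasets; this needs Ray Data to be installed."""
    import ray  # imported lazily so the emulation above runs without Ray

    return ray.data.range(n), ray.data.range_tensor(n, shape=shape)


print(emulate_range(3))
print(emulate_range_tensor(2, (2, 2)))
```

Both emulations stop at n - 1, which matches the half-open interval in the descriptions above. Call the hypothetical `load_with_ray(3, (2, 2))` helper on a machine with Ray installed to compare against the real Datasets.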

Python Objects#

from_items

Create a Dataset from a list of local Python objects.

Parquet#

read_parquet

Creates a Dataset from parquet files.

CSV#

read_csv

Creates a Dataset from CSV files.

JSON#

read_json

Creates a Dataset from JSON and JSONL files.
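To make the JSON versus JSONL distinction concrete, here is a minimal sketch that writes the same records in both layouts using only the standard library. The file names and helper functions are hypothetical, and whether a particular layout (for example a top-level array) is accepted by `read_json` may depend on your Ray version.

```python
import json
import tempfile
from pathlib import Path

records = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]


def write_samples(directory):
    """Write the same records as one JSON array and as JSONL (one object per line)."""
    directory = Path(directory)
    json_path = directory / "records.json"
    jsonl_path = directory / "records.jsonl"
    json_path.write_text(json.dumps(records))
    jsonl_path.write_text("\n".join(json.dumps(r) for r in records) + "\n")
    return json_path, jsonl_path


def read_jsonl(path):
    """Parse a JSONL file line by line, skipping blank lines."""
    lines = Path(path).read_text().splitlines()
    return [json.loads(line) for line in lines if line.strip()]


def load_with_ray(path):
    """Hand a file or directory to Ray Data; this needs Ray to be installed."""
    import ray  # imported lazily so the rest of the sketch runs without Ray

    return ray.data.read_json(str(path))


with tempfile.TemporaryDirectory() as tmp:
    json_path, jsonl_path = write_samples(tmp)
    from_json = json.loads(json_path.read_text())
    from_jsonl = read_jsonl(jsonl_path)

print(from_json == from_jsonl)
```

The two files carry identical records; only the framing differs, which is why a single reader can reasonably cover both.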

Text#

read_text

Create a Dataset from lines stored in text files.

Audio#

read_audio

Creates a Dataset from audio files.

Avro#

read_avro

Create a Dataset from records stored in Avro files.

Images#

read_images

Creates a Dataset from image files.

Binary#

read_binary_files

Create a Dataset from binary files of arbitrary contents.

TFRecords#

read_tfrecords

Create a Dataset from TFRecord files that contain tf.train.Example messages.

TFXReadOptions

Specifies read options when reading TFRecord files with TFX.

Pandas#

from_pandas

Create a Dataset from a list of pandas dataframes.

from_pandas_refs

Create a Dataset from a list of Ray object references to pandas dataframes.

NumPy#

read_numpy

Create an Arrow dataset from NumPy files.

from_numpy

Creates a Dataset from a list of NumPy ndarrays.

from_numpy_refs

Creates a Dataset from a list of Ray object references to NumPy ndarrays.

Arrow#

from_arrow

Create a Dataset from a list of PyArrow tables.

from_arrow_refs

Create a Dataset from a list of Ray object references to PyArrow tables.

MongoDB#

read_mongo

Create a Dataset from a MongoDB database.

BigQuery#

read_bigquery

Create a Dataset from BigQuery.

SQL Databases#

read_sql

Read from a database that provides a Python DB API2-compliant connector.
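A DB API2-compliant connector is one that exposes the standard `connect` / `cursor` / `execute` / `fetchall` protocol. The sketch below uses `sqlite3` from the standard library as one such connector. The table, the database path, and the helper names are made up for illustration, and passing a zero-argument connection factory as the second argument to `read_sql` is an assumption to verify against your Ray version.

```python
import os
import sqlite3
import tempfile

# Keep the temporary directory object alive so the database file persists
# for the lifetime of the process.
_tmp_dir = tempfile.TemporaryDirectory()
DB_PATH = os.path.join(_tmp_dir.name, "example.db")


def create_connection():
    """Zero-argument factory returning a DB API2 connection."""
    return sqlite3.connect(DB_PATH)


def seed_database():
    """Create and populate a small illustrative table."""
    conn = create_connection()
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS movie(title TEXT, year INT)")
        conn.execute("DELETE FROM movie")
        conn.executemany(
            "INSERT INTO movie VALUES (?, ?)", [("A", 1999), ("B", 2005)]
        )
        conn.commit()
    finally:
        conn.close()


def fetch_rows(sql):
    """Run a query through the plain DB API2 cursor protocol."""
    conn = create_connection()
    try:
        cursor = conn.cursor()
        cursor.execute(sql)
        return cursor.fetchall()
    finally:
        conn.close()


def load_with_ray(sql):
    """Read the query result into a Dataset; this needs Ray to be installed."""
    import ray  # imported lazily so the rest of the sketch runs without Ray

    return ray.data.read_sql(sql, create_connection)


seed_database()
rows = fetch_rows("SELECT title, year FROM movie ORDER BY year")
print(rows)
```

A factory is used rather than a live connection because each reader can then open its own connection instead of sharing one.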

Databricks#

read_databricks_tables

Read a Databricks Unity Catalog table or Databricks SQL execution result.

Snowflake#

read_snowflake

Read data from a Snowflake data set.

Unity Catalog#

read_unity_catalog

Loads a Unity Catalog table or files into a Ray Dataset using Databricks Unity Catalog credential vending. Short-lived cloud credentials are handed off automatically, giving external engines secure, parallel, distributed access.

Delta Sharing#

read_delta_sharing_tables

Read data from a Delta Sharing table.

Hudi#

read_hudi

Create a Dataset from an Apache Hudi table.

Iceberg#

read_iceberg

Create a Dataset from an Iceberg table.

Delta Lake#

read_delta

Creates a Dataset from Delta Lake files.

Lance#

read_lance

Create a Dataset from a Lance Dataset.

MCAP (Message Capture)#

read_mcap

Create a Dataset from MCAP (Message Capture) files.

ClickHouse#

read_clickhouse

Create a Dataset from a ClickHouse table or view.

Daft#

from_daft

Create a Dataset from a Daft DataFrame.

Dask#

from_dask

Create a Dataset from a Dask DataFrame.

Spark#

from_spark

Create a Dataset from a Spark DataFrame.

Modin#

from_modin

Create a Dataset from a Modin DataFrame.

Mars#

from_mars

Create a Dataset from a Mars DataFrame.

Torch#

from_torch

Create a Dataset from a Torch Dataset.

Hugging Face#

from_huggingface

Read a Hugging Face Dataset into a Ray Dataset.

TensorFlow#

from_tf

Create a Dataset from a TensorFlow Dataset.

Video#

read_videos

Creates a Dataset from video files.

WebDataset#

read_webdataset

Create a Dataset from WebDataset files.

Kafka#

read_kafka

Read data from Kafka topics.

Datasource API#

read_datasource

Read a stream from a custom Datasource.

Datasource

Interface for defining a custom Dataset datasource.

ReadTask

A function used to read blocks from the Dataset.

datasource.FilenameProvider

Generates filenames when you write a Dataset.

Partitioning API#

datasource.Partitioning

Partition scheme used to describe path-based partitions.

datasource.PartitionStyle

Supported dataset partition styles.

datasource.PathPartitionParser

Partition parser for path-based partition formats.

datasource.PathPartitionFilter

Partition filter for path-based partition formats.
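To show what "path-based partitions" means in practice, the following sketch parses hive-style `key=value` directory names and filters paths on them. It is a conceptual stand-in written in plain Python, not Ray's implementation. The paths and helper names are hypothetical, and the `Partitioning("hive")` call and `partitioning` keyword are assumptions to check against your Ray version.

```python
from pathlib import PurePosixPath


def parse_hive_partitions(path):
    """Extract key=value pairs from the directory components of a path."""
    partitions = {}
    for part in PurePosixPath(path).parent.parts:
        if "=" in part:
            key, value = part.split("=", 1)
            partitions[key] = value
    return partitions


def keep_path(path, wanted):
    """Return True when every wanted partition value matches the path."""
    parsed = parse_hive_partitions(path)
    return all(parsed.get(key) == value for key, value in wanted.items())


def load_with_ray(root):
    """Read partitioned CSV files; this needs Ray to be installed."""
    import ray  # imported lazily so the parser above runs without Ray
    from ray.data.datasource import Partitioning

    return ray.data.read_csv(root, partitioning=Partitioning("hive"))


paths = [
    "data/year=2023/month=12/part-0.csv",
    "data/year=2024/month=01/part-0.csv",
]
print([parse_hive_partitions(p) for p in paths])
print([p for p in paths if keep_path(p, {"year": "2024"})])
```

Partition values parsed this way are strings; any type conversion would be a separate step.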

MetadataProvider API#

datasource.FileMetadataProvider

Abstract callable that provides metadata for the files of a single dataset block.

datasource.BaseFileMetadataProvider

Abstract callable that provides metadata for FileBasedDatasource implementations that reuse the base prepare_read() method.

datasource.DefaultFileMetadataProvider

Default metadata provider for FileBasedDatasource implementations that reuse the base prepare_read() method.

Shuffling API#

FileShuffleConfig

Configuration for file shuffling.
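This page does not spell out what file shuffling does. One plausible reading is that the order of input files is randomized before they are read, optionally with a fixed seed for reproducibility. The sketch below illustrates that idea in plain Python and then shows how such a configuration might be passed to a reader. The `seed` parameter, the `shuffle` keyword, and the helper names are all assumptions to verify against your Ray version.

```python
import random


def shuffle_files(paths, seed=None):
    """Return a shuffled copy of the file list; a fixed seed gives a repeatable order."""
    shuffled = list(paths)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def load_with_ray(paths, seed):
    """Read CSV files with file-level shuffling; this needs Ray to be installed."""
    import ray  # imported lazily so the sketch above runs without Ray
    from ray.data import FileShuffleConfig

    # Assumption: readers accept a FileShuffleConfig through `shuffle`.
    return ray.data.read_csv(paths, shuffle=FileShuffleConfig(seed=seed))


files = [f"part-{i}.csv" for i in range(5)]
first = shuffle_files(files, seed=7)
second = shuffle_files(files, seed=7)
print(first == second)
```

Shuffling at the file level only reorders whole files; rows inside each file keep their original order.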