Dataset API#


Dataset(plan, epoch[, lazy, logical_plan])

A Dataset is a distributed data collection for data loading and processing.

Basic Transformations#

Dataset.map(fn, *[, compute, num_cpus, num_gpus])

Apply the given function to each record of this dataset.

Dataset.map_batches(fn, *[, batch_size, ...])

Apply the given function to batches of data.

Dataset.flat_map(fn, *[, compute, num_cpus, ...])

Apply the given function to each record and then flatten results.

Dataset.filter(fn, *[, compute])

Filter out records that do not satisfy the given predicate.

Dataset.add_column(col, fn, *[, compute])

Add the given column to the dataset.

Dataset.drop_columns(cols, *[, compute])

Drop one or more columns from the dataset.

Dataset.select_columns(cols, *[, compute])

Select one or more columns from the dataset.

Dataset.random_sample(fraction, *[, seed])

Randomly samples a fraction of the elements of this dataset.


Dataset.limit(limit)

Materialize and truncate the dataset to the first limit records.

Sorting, Shuffling, Repartitioning#

Dataset.sort([key, descending])

Sort the dataset by the specified key column or key function.

Dataset.random_shuffle(*[, seed, num_blocks])

Randomly shuffle the elements of this dataset.

Dataset.randomize_block_order(*[, seed])

Randomly shuffle the blocks of this dataset.

Dataset.repartition(num_blocks, *[, shuffle])

Repartition the dataset into exactly this number of blocks.

Splitting and Merging Datasets#

Dataset.split(n, *[, equal, locality_hints])

Materialize and split the dataset into n disjoint pieces.


Dataset.split_at_indices(indices)

Materialize and split the dataset at the given indices (like np.split).


Dataset.split_proportionately(proportions)

Materialize and split the dataset using proportions.

Dataset.streaming_split(n, *[, equal, ...])

Returns n DataIterators that can be used to read disjoint subsets of the dataset in parallel.

Dataset.train_test_split(test_size, *[, ...])

Materialize and split the dataset into train and test subsets.


Dataset.union(*other)

Materialize and combine this dataset with others of the same type.

Dataset.zip(other)

Materialize and zip this dataset with the elements of another.

Grouped and Global Aggregations#


Dataset.groupby(key)

Group the dataset by the key function or column name.


Dataset.aggregate(*aggs)

Aggregate the entire dataset as one group.

Dataset.sum([on, ignore_nulls])

Compute sum over entire dataset.

Dataset.min([on, ignore_nulls])

Compute minimum over entire dataset.

Dataset.max([on, ignore_nulls])

Compute maximum over entire dataset.

Dataset.mean([on, ignore_nulls])

Compute mean over entire dataset.

Dataset.std([on, ddof, ignore_nulls])

Compute standard deviation over entire dataset.

Consuming Data#

Dataset.show([limit])

Print up to the given number of records from the dataset.


Dataset.take([limit])

Return up to limit records from the dataset.

Dataset.take_batch([batch_size, batch_format])

Return up to batch_size records from the dataset in a batch.


Dataset.take_all([limit])

Return all of the records in the dataset.


Dataset.iterator()

Return a DataIterator that can be used to repeatedly iterate over the dataset.

Dataset.iter_rows(*[, prefetch_blocks])

Return a local row iterator over the dataset.

Dataset.iter_batches(*[, prefetch_batches, ...])

Return a local batched iterator over the dataset.

Dataset.iter_torch_batches(*[, ...])

Return a local batched iterator of Torch Tensors over the dataset.

Dataset.iter_tf_batches(*[, ...])

Return a local batched iterator of TensorFlow Tensors over the dataset.

I/O and Conversion#

Dataset.write_parquet(path, *[, filesystem, ...])

Write the dataset to Parquet files.

Dataset.write_json(path, *[, filesystem, ...])

Write the dataset to JSON files.

Dataset.write_csv(path, *[, filesystem, ...])

Write the dataset to CSV files.

Dataset.write_numpy(path, *[, column, ...])

Write a tensor column of the dataset to npy files.

Dataset.write_tfrecords(path, *[, ...])

Write the dataset to TFRecord files.

Dataset.write_webdataset(path, *[, ...])

Write the dataset to WebDataset files.

Dataset.write_mongo(uri, database, collection)

Write the dataset to a MongoDB datasource.

Dataset.write_datasource(datasource, *[, ...])

Write the dataset to a custom datasource.

Dataset.to_torch(*[, label_column, ...])

Return a Torch IterableDataset over this dataset.

Dataset.to_tf(feature_columns, label_columns, *)

Return a TF Dataset over this dataset.


Dataset.to_dask([meta])

Convert this dataset into a Dask DataFrame.


Dataset.to_mars()

Convert this dataset into a Mars DataFrame.


Dataset.to_modin()

Convert this dataset into a Modin DataFrame.


Dataset.to_spark(spark)

Convert this dataset into a Spark DataFrame.


Dataset.to_pandas([limit])

Convert this dataset into a single Pandas DataFrame.


Dataset.to_pandas_refs()

Convert this dataset into a distributed set of Pandas DataFrames.

Dataset.to_numpy_refs(*[, column])

Convert this dataset into a distributed set of NumPy ndarrays.


Dataset.to_arrow_refs()

Convert this dataset into a distributed set of Arrow tables.

Dataset.to_random_access_dataset(key[, ...])

Convert this dataset into a distributed RandomAccessDataset (EXPERIMENTAL).

Inspecting Metadata#


Dataset.count()

Count the number of records in the dataset.


Dataset.columns([fetch_if_missing])

Returns the columns of this Dataset.


Dataset.schema([fetch_if_missing])

Return the schema of the dataset.


Dataset.num_blocks()

Return the number of blocks of this dataset.


Dataset.size_bytes()

Return the in-memory size of the dataset.


Dataset.input_files()

Return the list of input files for the dataset.


Dataset.stats()

Returns a string containing execution timing information.


Dataset.get_internal_block_refs()

Get a list of references to the underlying blocks of this dataset.



Execution#

Dataset.materialize()

Execute and materialize this dataset into object store memory.

ActorPoolStrategy([legacy_min_size, ...])

Specify the compute strategy for a Dataset transform.



Serialization#

Dataset.has_serializable_lineage

Whether this dataset's lineage is able to be serialized for storage and later deserialized, possibly on a different cluster.


Dataset.serialize_lineage()

Serialize this dataset's lineage, not the actual data or the existing data futures, to bytes that can be stored and later deserialized, possibly on a different cluster.


Dataset.deserialize_lineage(serialized_ds)

Deserialize the provided lineage-serialized Dataset.