Dataset API#

Constructor#

Dataset(plan, epoch[, lazy, logical_plan])

A Dataset is a distributed data collection for data loading and processing.

Basic Transformations#

Dataset.map(fn, *[, compute])

Apply the given function to each record of this dataset.

Dataset.map_batches(fn, *[, batch_size, ...])

Apply the given function to batches of data.
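For intuition: where `map` calls the function once per record, `map_batches` chunks records into batches and calls the function once per batch, which amortizes per-call overhead. A plain-Python sketch of those semantics, assuming an already materialized list of records (`map_batches_sketch` is an illustrative name, not part of the API; the real method runs distributed over blocks):

```python
def map_batches_sketch(records, fn, batch_size):
    """Apply fn to each batch of records and flatten the results."""
    out = []
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]  # up to batch_size records
        out.extend(fn(batch))                      # fn maps a batch to a batch
    return out

# Example: double every value, two records per batch.
doubled = map_batches_sketch([1, 2, 3, 4, 5], lambda b: [x * 2 for x in b], batch_size=2)
print(doubled)  # [2, 4, 6, 8, 10]
```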

Dataset.flat_map(fn, *[, compute])

Apply the given function to each record and then flatten results.

Dataset.filter(fn, *[, compute])

Filter out records that do not satisfy the given predicate.

Dataset.add_column(col, fn, *[, compute])

Add a column, computed by the given function, to the dataset.

Dataset.drop_columns(cols, *[, compute])

Drop one or more columns from the dataset.

Dataset.select_columns(cols, *[, compute])

Select one or more columns from the dataset.

Dataset.random_sample(fraction, *[, seed])

Randomly sample a fraction of the elements of this dataset.

Dataset.limit(limit)

Truncate the dataset to the first limit records.
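Taken together, these transformations compose like their eager list counterparts. A plain-Python sketch of the record-level semantics (illustrative only; the real methods each return a new, distributed Dataset rather than a list):

```python
records = [1, 2, 3, 4, 5, 6]

mapped   = [x + 1 for x in records]                # Dataset.map: one output per record
filtered = [x for x in mapped if x % 2 == 0]       # Dataset.filter: keep matching records
flat     = [y for x in filtered for y in (x, -x)]  # Dataset.flat_map: expand, then flatten
limited  = flat[:4]                                # Dataset.limit: first 4 records

print(limited)  # [2, -2, 4, -4]
```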

Sorting, Shuffling, Repartitioning#

Dataset.sort([key, descending])

Sort the dataset by the specified key column or key function.

Dataset.random_shuffle(*[, seed, num_blocks])

Randomly shuffle the elements of this dataset.

Dataset.randomize_block_order(*[, seed])

Randomly shuffle the blocks of this dataset.

Dataset.repartition(num_blocks, *[, shuffle])

Repartition the dataset into exactly this number of blocks.
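Repartitioning redistributes the same records into exactly `num_blocks` blocks of near-equal size. A plain-Python sketch of that rebalancing, assuming a materialized list (`repartition_sketch` is an illustrative name; the real method moves data between distributed blocks and can optionally shuffle):

```python
def repartition_sketch(records, num_blocks):
    """Split records into num_blocks contiguous, near-equal blocks."""
    n = len(records)
    blocks = []
    for i in range(num_blocks):
        # Spread the remainder over the first n % num_blocks blocks.
        start = i * (n // num_blocks) + min(i, n % num_blocks)
        end = start + n // num_blocks + (1 if i < n % num_blocks else 0)
        blocks.append(records[start:end])
    return blocks

print(repartition_sketch(list(range(7)), 3))  # [[0, 1, 2], [3, 4], [5, 6]]
```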

Splitting and Merging Datasets#

Dataset.split(n, *[, equal, locality_hints])

Split the dataset into n disjoint pieces.

Dataset.split_at_indices(indices)

Split the dataset at the given indices (like np.split).

Dataset.split_proportionately(proportions)

Split the dataset using proportions.

Dataset.train_test_split(test_size, *[, ...])

Split the dataset into train and test subsets.
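A plain-Python sketch of the splitting semantics, assuming a materialized list of records (function names are illustrative; `split_at_indices` follows np.split-style boundaries, and `train_test_split` cuts a test suffix of the given fraction):

```python
def split_at_indices_sketch(records, indices):
    """Split records at the given indices, like np.split."""
    bounds = [0, *indices, len(records)]
    return [records[a:b] for a, b in zip(bounds, bounds[1:])]

def train_test_split_sketch(records, test_size):
    """Split records into a train prefix and a test suffix of fraction test_size."""
    cut = int(len(records) * (1 - test_size))
    return records[:cut], records[cut:]

print(split_at_indices_sketch([10, 20, 30, 40, 50], [2, 4]))  # [[10, 20], [30, 40], [50]]
print(train_test_split_sketch(list(range(10)), test_size=0.3))  # 7 train, 3 test records
```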

Dataset.union(*other)

Combine this dataset with others of the same type.

Dataset.zip(other)

Zip this dataset with the elements of another.

Grouped and Global Aggregations#

Dataset.groupby(key)

Group the dataset by the key function or column name.

Dataset.aggregate(*aggs)

Aggregate the entire dataset as one group.

Dataset.sum([on, ignore_nulls])

Compute the sum over the entire dataset.

Dataset.min([on, ignore_nulls])

Compute the minimum over the entire dataset.

Dataset.max([on, ignore_nulls])

Compute the maximum over the entire dataset.

Dataset.mean([on, ignore_nulls])

Compute the mean over the entire dataset.

Dataset.std([on, ddof, ignore_nulls])

Compute the standard deviation over the entire dataset.
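A plain-Python sketch of grouped aggregation over key-value records (illustrative names only; the real `groupby` returns a GroupedDataset that supports the same sum/min/max/mean/std aggregations per group):

```python
from collections import defaultdict

def groupby_sum_sketch(records, key, on):
    """Group records by the key field and sum the `on` field within each group."""
    groups = defaultdict(float)
    for record in records:
        groups[record[key]] += record[on]
    return dict(groups)

rows = [
    {"color": "red", "value": 1},
    {"color": "blue", "value": 2},
    {"color": "red", "value": 3},
]
print(groupby_sum_sketch(rows, key="color", on="value"))  # {'red': 4.0, 'blue': 2.0}
```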

Converting to Pipeline#

Dataset.repeat([times])

Convert this dataset into a DatasetPipeline by looping over it.

Dataset.window(*[, blocks_per_window, ...])

Convert this dataset into a DatasetPipeline by windowing over its data blocks.
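The point of windowing is to process a few blocks at a time instead of the whole dataset at once, bounding memory use. A plain-Python sketch of that idea (`window_sketch` is an illustrative name; the real method produces a DatasetPipeline whose windows are themselves Datasets):

```python
def window_sketch(blocks, blocks_per_window):
    """Yield successive windows, each a small list of consecutive blocks."""
    for start in range(0, len(blocks), blocks_per_window):
        yield blocks[start:start + blocks_per_window]

blocks = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(list(window_sketch(blocks, blocks_per_window=2)))
# [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
```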

Consuming Datasets#

Dataset.show([limit])

Print up to the given number of records from the dataset.

Dataset.take([limit])

Return up to limit records from the dataset.

Dataset.take_all([limit])

Return all of the records in the dataset.

Dataset.iterator()

Return a DatasetIterator that can be used to repeatedly iterate over the dataset.

Dataset.iter_rows(*[, prefetch_blocks])

Return a local row iterator over the dataset.

Dataset.iter_batches(*[, prefetch_blocks, ...])

Return a local batched iterator over the dataset.
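A plain-Python sketch of batched row iteration (`iter_batches_sketch` is an illustrative name; the real iterator additionally prefetches blocks and can yield pandas-, NumPy-, or Arrow-formatted batches):

```python
def iter_batches_sketch(rows, batch_size):
    """Yield successive lists of up to batch_size rows."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch

print(list(iter_batches_sketch(range(5), batch_size=2)))  # [[0, 1], [2, 3], [4]]
```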

Dataset.iter_torch_batches(*[, ...])

Return a local batched iterator of Torch Tensors over the dataset.

Dataset.iter_tf_batches(*[, ...])

Return a local batched iterator of TensorFlow Tensors over the dataset.

I/O and Conversion#

Dataset.write_parquet(path, *[, filesystem, ...])

Write the dataset to Parquet files.

Dataset.write_json(path, *[, filesystem, ...])

Write the dataset to JSON files.

Dataset.write_csv(path, *[, filesystem, ...])

Write the dataset to CSV files.

Dataset.write_numpy(path, *[, column, ...])

Write a tensor column of the dataset to npy files.

Dataset.write_tfrecords(path, *[, ...])

Write the dataset to TFRecord files.

Dataset.write_mongo(uri, database, collection)

Write the dataset to a MongoDB datasource.

Dataset.write_datasource(datasource, *[, ...])

Write the dataset to a custom datasource.

Dataset.to_torch(*[, label_column, ...])

Return a Torch IterableDataset over this dataset.

Dataset.to_tf(feature_columns, label_columns, *)

Return a TF Dataset over this dataset.

Dataset.to_dask([meta])

Convert this dataset into a Dask DataFrame.

Dataset.to_mars()

Convert this dataset into a Mars DataFrame.

Dataset.to_modin()

Convert this dataset into a Modin DataFrame.

Dataset.to_spark(spark)

Convert this dataset into a Spark DataFrame.

Dataset.to_pandas([limit])

Convert this dataset into a single Pandas DataFrame.

Dataset.to_pandas_refs()

Convert this dataset into a distributed set of Pandas dataframes.

Dataset.to_numpy_refs(*[, column])

Convert this dataset into a distributed set of NumPy ndarrays.

Dataset.to_arrow_refs()

Convert this dataset into a distributed set of Arrow tables.

Dataset.to_random_access_dataset(key[, ...])

Convert this Dataset into a distributed RandomAccessDataset (EXPERIMENTAL).

Inspecting Metadata#

Dataset.count()

Count the number of records in the dataset.

Dataset.schema([fetch_if_missing])

Return the schema of the dataset.

Dataset.default_batch_format()

Return this dataset's default batch format.

Dataset.num_blocks()

Return the number of blocks of this dataset.

Dataset.size_bytes()

Return the in-memory size of the dataset.

Dataset.input_files()

Return the list of input files for the dataset.

Dataset.stats()

Return a string containing execution timing information.

Dataset.get_internal_block_refs()

Get a list of references to the underlying blocks of this dataset.

Execution#

Dataset.fully_executed()

Force full evaluation of the blocks of this dataset.

Dataset.is_fully_executed()

Return whether this Dataset has been fully executed.

Dataset.lazy()

Enable lazy evaluation.

Serialization#

Dataset.has_serializable_lineage()

Whether this dataset's lineage can be serialized for storage and later deserialized, possibly on a different cluster.

Dataset.serialize_lineage()

Serialize this dataset's lineage, but not the actual data or the existing data futures, to bytes that can be stored and later deserialized, possibly on a different cluster.

Dataset.deserialize_lineage(serialized_ds)

Deserialize the provided lineage-serialized Dataset.
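For intuition: lineage serialization stores the *recipe* that produced the dataset (its source and transformations), not the materialized data, so the dataset can be recomputed elsewhere. A heavily simplified plain-Python sketch of that idea using pickle (the dict layout and names here are invented for illustration and are not Ray's internal representation):

```python
import pickle

# A "lineage" is just the description of how to rebuild the dataset.
lineage = {
    "source": {"kind": "read_csv", "path": "data/input.csv"},
    "transforms": [("map", "parse_row"), ("filter", "is_valid")],
}

blob = pickle.dumps(lineage)       # cf. serialize_lineage: recipe -> bytes
restored = pickle.loads(blob)      # cf. deserialize_lineage: bytes -> recipe
print(restored == lineage)  # True
```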