DatasetPipeline API#

Constructor#

DatasetPipeline(base_iterable[, stages, ...])

Implements a pipeline of Datasets.
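A DatasetPipeline is usually created from an existing Dataset via Dataset.window() or Dataset.repeat() rather than by calling this constructor directly. A minimal sketch, assuming a Ray 2.x release in which DatasetPipeline is still available:

```python
import ray

ds = ray.data.range(1000)                # a regular Dataset
pipe = ds.window(blocks_per_window=10)   # -> DatasetPipeline over the same rows
print(pipe)
```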

Basic Transformations#

DatasetPipeline.map(fn, *[, compute])

Apply Dataset.map to each dataset/window in this pipeline.

DatasetPipeline.map_batches(fn, *[, ...])

Apply Dataset.map_batches to each dataset/window in this pipeline.

DatasetPipeline.flat_map(fn, *[, compute])

Apply Dataset.flat_map to each dataset/window in this pipeline.

DatasetPipeline.foreach_window(fn)

Apply a transform to each dataset/window in this pipeline.
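A minimal sketch of the row-level transformations above, assuming ray.data.range yields integer rows (as it does in the Ray 2.x line where DatasetPipeline exists):

```python
import ray

pipe = ray.data.range(1000).window(blocks_per_window=10)

pipe = pipe.map(lambda x: x * 2)                               # per row
pipe = pipe.flat_map(lambda x: [x, -x])                        # 1 row -> many rows
pipe = pipe.map_batches(lambda batch: [v + 1 for v in batch])  # per batch
pipe = pipe.foreach_window(lambda ds: ds.limit(100))           # arbitrary Dataset op
```

Like Dataset transformations, these are lazy: they execute window by window as the pipeline is consumed.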

DatasetPipeline.filter(fn, *[, compute])

Apply Dataset.filter to each dataset/window in this pipeline.

DatasetPipeline.add_column(col, fn, *[, compute])

Apply Dataset.add_column to each dataset/window in this pipeline.

DatasetPipeline.drop_columns(cols, *[, compute])

Apply Dataset.drop_columns to each dataset/window in this pipeline.

DatasetPipeline.select_columns(cols, *[, ...])

Apply Dataset.select_columns to each dataset/window in this pipeline.
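The column-oriented helpers above assume tabular (Arrow or pandas) windows. A hedged sketch, assuming ray.data.range_table (available alongside DatasetPipeline in Ray 2.x), which produces a single "value" column:

```python
import ray

pipe = ray.data.range_table(1000).window(blocks_per_window=10)

pipe = pipe.filter(lambda row: row["value"] % 2 == 0)
pipe = pipe.add_column("double", lambda df: df["value"] * 2)  # fn receives a pandas batch
pipe = pipe.drop_columns(["value"])
pipe = pipe.select_columns(["double"])
```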

Sorting, Shuffling, Repartitioning#

DatasetPipeline.sort_each_window([key, ...])

Apply Dataset.sort to each dataset/window in this pipeline.

DatasetPipeline.random_shuffle_each_window(*)

Apply Dataset.random_shuffle to each dataset/window in this pipeline.

DatasetPipeline.randomize_block_order_each_window(*)

Apply Dataset.randomize_block_order to each dataset/window in this pipeline.

DatasetPipeline.repartition_each_window(...)

Apply Dataset.repartition to each dataset/window in this pipeline.
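Each of these operates within each window, not across the whole pipeline; there is no global shuffle or sort. A minimal sketch:

```python
import ray

pipe = ray.data.range(1000).window(blocks_per_window=10)

pipe = pipe.sort_each_window()                   # sort rows within each window
pipe = pipe.random_shuffle_each_window()         # full per-window row shuffle
pipe = pipe.randomize_block_order_each_window()  # cheaper: reorder blocks only
pipe = pipe.repartition_each_window(5)           # 5 blocks per window
```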

Splitting DatasetPipelines#

DatasetPipeline.split(n, *[, equal, ...])

Split the pipeline into n disjoint pipeline shards.

DatasetPipeline.split_at_indices(indices)

Split the datasets within the pipeline at the given indices (like np.split).
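Splitting is the standard way to fan a pipeline out to multiple concurrent consumers, for example one shard per training worker. A minimal sketch; each shard is itself a DatasetPipeline and must be read by exactly one consumer:

```python
import ray

pipe = ray.data.range(1000).window(blocks_per_window=10)
shards = pipe.split(2, equal=True)

@ray.remote
def consume(shard):
    # Consume this shard's stream; here we just count its records.
    return shard.count()

print(ray.get([consume.remote(s) for s in shards]))  # [500, 500]
```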

Creating DatasetPipelines#

DatasetPipeline.repeat([times])

Repeat this pipeline a given number of times, or indefinitely.

DatasetPipeline.rewindow(*, blocks_per_window)

Change the windowing (blocks per dataset) of this pipeline.

DatasetPipeline.from_iterable(iterable)

Create a pipeline from a sequence of Dataset-producing functions.
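A minimal sketch of the three creation paths, assuming the Ray 2.x import location ray.data.DatasetPipeline:

```python
import ray
from ray.data import DatasetPipeline

# Multi-epoch reading: window a Dataset, then repeat it.
pipe = ray.data.range(1000).window(blocks_per_window=10).repeat(times=3)

# Change the window size of an existing pipeline.
pipe = pipe.rewindow(blocks_per_window=20)

# Build a pipeline from Dataset-producing functions.
pipe2 = DatasetPipeline.from_iterable(
    [lambda: ray.data.range(100), lambda: ray.data.range(100)]
)
```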

Consuming DatasetPipelines#

DatasetPipeline.show([limit])

Call Dataset.show over the stream of output batches from the pipeline.

DatasetPipeline.show_windows([limit_per_dataset])

Print up to the given number of records from each window/dataset.

DatasetPipeline.take([limit])

Call Dataset.take over the stream of output batches from the pipeline.

DatasetPipeline.take_all([limit])

Call Dataset.take_all over the stream of output batches from the pipeline.

DatasetPipeline.iterator()

Return a DatasetIterator that can be used to repeatedly iterate over the dataset.
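A pipeline's output stream can be read only once, so each one-shot consumer below gets a fresh pipeline. A minimal sketch:

```python
import ray

ray.data.range(100).window(blocks_per_window=2).show(5)          # print 5 records

first_ten = ray.data.range(100).window(blocks_per_window=2).take(10)

# DatasetIterator supports repeated, epoch-by-epoch iteration:
it = ray.data.range(100).window(blocks_per_window=2).repeat(2).iterator()
```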

DatasetPipeline.iter_rows(*[, prefetch_blocks])

Return a local row iterator over the data in the pipeline.

DatasetPipeline.iter_batches(*[, ...])

Return a local batched iterator over the data in the pipeline.

DatasetPipeline.iter_torch_batches(*[, ...])

Call Dataset.iter_torch_batches over the stream of output batches from the pipeline.

DatasetPipeline.iter_tf_batches(*[, ...])

Call Dataset.iter_tf_batches over the stream of output batches from the pipeline.
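A minimal sketch of streaming batch iteration; the torch variant assumes PyTorch is installed:

```python
import ray

pipe = ray.data.range(1000).window(blocks_per_window=10)
for batch in pipe.iter_batches(batch_size=128):
    ...  # feed the batch to a training or inference step

pipe = ray.data.range(1000).window(blocks_per_window=10)
for tensor_batch in pipe.iter_torch_batches(batch_size=128):
    ...  # a torch.Tensor (or a dict of tensors for tabular data)
```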

I/O and Conversion#

DatasetPipeline.write_json(path, *[, ...])

Call Dataset.write_json on each output dataset of this pipeline.

DatasetPipeline.write_csv(path, *[, ...])

Call Dataset.write_csv on each output dataset of this pipeline.

DatasetPipeline.write_parquet(path, *[, ...])

Call Dataset.write_parquet on each output dataset of this pipeline.

DatasetPipeline.write_datasource(datasource, *)

Call Dataset.write_datasource on each output dataset of this pipeline.
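The write APIs run once per output window, so the target directory accumulates one set of files per window as the pipeline executes. A minimal sketch:

```python
import ray

pipe = ray.data.range(1000).window(blocks_per_window=10)
pipe.write_parquet("/tmp/pipeline_out")  # write_json / write_csv are analogous
```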

DatasetPipeline.to_tf(feature_columns, ...)

Call Dataset.to_tf over the stream of output batches from the pipeline.

DatasetPipeline.to_torch(*[, label_column, ...])

Call Dataset.to_torch over the stream of output batches from the pipeline.
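A hedged sketch of framework conversion, assuming tabular windows (via ray.data.range_table) and a hypothetical "label" column added for illustration:

```python
import ray

pipe = ray.data.range_table(1000).window(blocks_per_window=10)
pipe = pipe.add_column("label", lambda df: df["value"] % 2)  # illustrative label

# Yields (features, label) tensor pairs over the whole output stream.
torch_ds = pipe.to_torch(label_column="label", batch_size=128)
for features, labels in torch_ds:
    ...
```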

Inspecting Metadata#

DatasetPipeline.schema([fetch_if_missing])

Return the schema of the dataset pipeline.

DatasetPipeline.count()

Count the number of records in the dataset pipeline.

DatasetPipeline.stats([exclude_first_window])

Return a string containing execution timing information.

DatasetPipeline.sum()

Sum the records in the dataset pipeline.
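A minimal sketch; count() and sum() consume the pipeline, so each is called on a fresh pipeline, while stats() can be read after consumption:

```python
import ray

print(ray.data.range(1000).window(blocks_per_window=10).schema())
print(ray.data.range(1000).window(blocks_per_window=10).count())  # 1000

pipe = ray.data.range(1000).window(blocks_per_window=10)
print(pipe.sum())    # 499500
print(pipe.stats())  # execution timing summary
```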