Dataset API
Constructor

Dataset: A Dataset is a distributed data collection for data loading and processing.
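
Datasets are normally created with the read and creation helpers in the ray.data module rather than by calling the constructor directly. A minimal sketch (the printed schema varies by Ray version):

```python
import ray

# Create Datasets via ray.data helpers, not Dataset() itself.
ds = ray.data.range(1000)                          # synthetic integer rows
items = ray.data.from_items([{"x": 1}, {"x": 2}])  # from in-memory items
print(ds)
```
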
Basic Transformations

map: Apply the given function to each record of this dataset.
map_batches: Apply the given function to batches of data.
flat_map: Apply the given function to each record and then flatten results.
filter: Filter out records that do not satisfy the given predicate.
add_column: Add the given column to the dataset.
drop_columns: Drop one or more columns from the dataset.
select_columns: Select one or more columns from the dataset.
random_sample: Randomly sample a fraction of the elements of this dataset.
limit: Materialize and truncate the dataset to the first limit rows.
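
A short sketch chaining these transformations, assuming a recent Ray 2.x where records are dicts and the add_column function receives a pandas batch:

```python
import ray

ds = ray.data.range(8)  # records look like {"id": 0}, {"id": 1}, ...

ds = ds.map(lambda row: {"id": row["id"] * 2})         # per-record transform
ds = ds.filter(lambda row: row["id"] >= 4)             # keep matching records
ds = ds.add_column("parity", lambda df: df["id"] % 2)  # df is a pandas batch
ds = ds.limit(3)                                       # first 3 rows only

print(ds.take_all())
```
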
Sorting, Shuffling, Repartitioning

sort: Sort the dataset by the specified key column or key function.
random_shuffle: Randomly shuffle the elements of this dataset.
randomize_block_order: Randomly shuffle the blocks of this dataset.
repartition: Repartition the dataset into exactly this number of blocks.
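
A sketch contrasting the full row-level shuffle with the much cheaper block-order shuffle:

```python
import ray

ds = ray.data.range(100)

ds = ds.repartition(10)             # exactly 10 blocks
ds = ds.random_shuffle(seed=42)     # row-level shuffle: moves all data
cheap = ds.randomize_block_order()  # only permutes block order, no data movement
ds = ds.sort("id", descending=True)
```
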
Splitting and Merging Datasets

split: Materialize and split the dataset into n disjoint pieces.
split_at_indices: Materialize and split the dataset at the given indices (like np.split).
split_proportionately: Materialize and split the dataset using proportions.
streaming_split: Return n DataIterators that can be used to read disjoint subsets of the dataset in parallel.
train_test_split: Materialize and split the dataset into train and test subsets.
union: Materialize and combine this dataset with others of the same type.
zip: Materialize and zip this dataset with the elements of another.
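
A sketch of the common splitting patterns (row counts assume the 100-row dataset created below):

```python
import ray

ds = ray.data.range(100)

train, test = ds.train_test_split(test_size=0.25)  # 75 / 25 rows
a, b = ds.split(2)                                 # two disjoint pieces
head, tail = ds.split_at_indices([10])             # rows [0, 10) and [10, 100)

combined = a.union(b)    # concatenate pieces of the same type
print(combined.count())  # 100
```
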
Grouped and Global Aggregations

groupby: Group the dataset by the key function or column name.
aggregate: Aggregate the entire dataset as one group.
sum: Compute the sum over the entire dataset.
min: Compute the minimum over the entire dataset.
max: Compute the maximum over the entire dataset.
mean: Compute the mean over the entire dataset.
std: Compute the standard deviation over the entire dataset.
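
A sketch of a global aggregation (one scalar) versus a grouped aggregation (one output row per group):

```python
import ray

ds = ray.data.from_items([{"g": i % 3, "x": i} for i in range(9)])

print(ds.mean("x"))  # global: a single scalar, 4.0 here

# Grouped: one output row per distinct value of "g".
print(ds.groupby("g").sum("x").take_all())
```
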
Consuming Data

show: Print up to the given number of records from the dataset.
take: Return up to limit records from the dataset.
take_batch: Return up to batch_size records from the dataset in a batch.
take_all: Return all of the records in the dataset.
iterator: Return a DataIterator over this dataset.
iter_rows: Return a local row iterator over the dataset.
iter_batches: Return a local batched iterator over the dataset.
iter_torch_batches: Return a local batched iterator of Torch Tensors over the dataset.
iter_tf_batches: Return a local batched iterator of TensorFlow Tensors over the dataset.
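
The take_* methods pull data to the driver eagerly, while the iter_* methods stream it batch by batch. A sketch:

```python
import ray

ds = ray.data.range(1000)

print(ds.take(5))          # small list of records on the driver
batch = ds.take_batch(32)  # one batch (format depends on the underlying data)

# Streaming consumption: batches are fetched as needed.
for batch in ds.iter_batches(batch_size=128, batch_format="numpy"):
    pass  # use batch["id"] here
```
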
I/O and Conversion

write_parquet: Write the dataset to parquet.
write_json: Write the dataset to json.
write_csv: Write the dataset to csv.
write_numpy: Write a tensor column of the dataset to npy files.
write_tfrecords: Write the dataset to TFRecord files.
write_webdataset: Write the dataset to WebDataset files.
write_mongo: Write the dataset to a MongoDB datasource.
write_datasource: Write the dataset to a custom datasource.
to_torch: Return a Torch IterableDataset over this dataset.
to_tf: Return a TF Dataset over this dataset.
to_dask: Convert this dataset into a Dask DataFrame.
to_mars: Convert this dataset into a MARS dataframe.
to_modin: Convert this dataset into a Modin dataframe.
to_spark: Convert this dataset into a Spark dataframe.
to_pandas: Convert this dataset into a single Pandas DataFrame.
to_pandas_refs: Convert this dataset into a distributed set of Pandas dataframes.
to_numpy_refs: Convert this dataset into a distributed set of NumPy ndarrays.
to_arrow_refs: Convert this dataset into a distributed set of Arrow tables.
to_random_access_dataset: Convert this dataset into a distributed RandomAccessDataset (EXPERIMENTAL).
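
A sketch of a write, a round-trip read, and a local conversion (the output path is illustrative):

```python
import ray

ds = ray.data.range(100)

# Writes one file per block; the argument is a directory path.
ds.write_parquet("/tmp/ds_out")

# Round-trip back into a Dataset.
ds2 = ray.data.read_parquet("/tmp/ds_out")

# Pull everything into a single local DataFrame; only safe when
# the dataset fits in driver memory.
df = ds2.to_pandas()
```
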
Inspecting Metadata

count: Count the number of records in the dataset.
columns: Return the columns of this Dataset.
schema: Return the schema of the dataset.
num_blocks: Return the number of blocks of this dataset.
size_bytes: Return the in-memory size of the dataset.
input_files: Return the list of input files for the dataset.
stats: Return a string containing execution timing information.
get_internal_block_refs: Get a list of references to the underlying blocks of this dataset.
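
Most metadata calls are cheap, though some (such as count on a not-yet-executed dataset) may trigger execution. A sketch:

```python
import ray

ds = ray.data.range(100)

print(ds.count())        # number of records
print(ds.schema())       # column names and types
print(ds.num_blocks())   # parallelism of the underlying block list
print(ds.size_bytes())   # in-memory size estimate
print(ds.stats())        # per-stage timing breakdown
```
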
Execution

materialize: Execute and materialize this dataset into object store memory.
ActorPoolStrategy: Specify the compute strategy for a Dataset transform.
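
Execution is lazy by default; materialize forces it and pins the result in the object store. ActorPoolStrategy runs a callable-class transform on a pool of actors. A sketch (the exact ActorPoolStrategy arguments vary across Ray 2.x releases):

```python
import ray

class AddOne:
    # A callable class is constructed once per actor and reused
    # across batches, amortizing expensive setup (e.g. model loading).
    def __call__(self, batch):
        batch["id"] = batch["id"] + 1
        return batch

ds = ray.data.range(100).map_batches(
    AddOne,
    batch_format="numpy",
    compute=ray.data.ActorPoolStrategy(min_size=1, max_size=2),
)
ds = ds.materialize()  # run the pipeline now and keep the blocks
```
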
Serialization

has_serializable_lineage: Whether this dataset's lineage is able to be serialized for storage and later deserialized, possibly on a different cluster.
serialize_lineage: Serialize this dataset's lineage, not the actual data or the existing data futures, to bytes that can be stored and later deserialized, possibly on a different cluster.
deserialize_lineage: Deserialize the provided lineage-serialized Dataset.
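
Lineage serialization stores how to recompute the dataset rather than the data itself, so it generally applies only to datasets whose source can be re-read. A guarded sketch:

```python
import ray

ds = ray.data.range(100).map(lambda row: {"id": row["id"] * 2})

if ds.has_serializable_lineage():
    blob = ds.serialize_lineage()  # bytes describing how to recompute ds
    restored = ray.data.Dataset.deserialize_lineage(blob)
```
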