Dataset API
Constructor
A Dataset is a distributed data collection for data loading and processing. |
Basic Transformations
Apply the given function to each row of this dataset. |
|
Apply the given function to batches of data. |
|
Apply the given function to each row and then flatten results. |
|
Filter out rows that don't satisfy the given predicate. |
|
Add the given column to the dataset. |
|
Drop one or more columns from the dataset. |
|
Select one or more columns from the dataset. |
|
Returns a new |
|
Truncate the dataset to the first |
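The row-wise, batch-wise, and flattening transformations listed above differ mainly in what the user function receives and returns. The following is a plain-Python sketch of those semantics on an in-memory list of dict rows. It does not call the library; the helper names (`map_rows`, `map_batches`, `flat_map`, `filter_rows`) and the `batch_size` parameter are hypothetical stand-ins chosen for illustration.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def map_rows(rows: List[Row], fn: Callable[[Row], Row]) -> List[Row]:
    """Row-wise: fn sees one row and returns exactly one row."""
    return [fn(row) for row in rows]


def map_batches(rows: List[Row], fn: Callable[[List[Row]], List[Row]],
                batch_size: int) -> List[Row]:
    """Batch-wise: fn sees a slice of rows and returns a list of rows."""
    out: List[Row] = []
    for start in range(0, len(rows), batch_size):
        out.extend(fn(rows[start:start + batch_size]))
    return out


def flat_map(rows: List[Row], fn: Callable[[Row], List[Row]]) -> List[Row]:
    """Row-wise, then flatten: fn may return zero or more rows per input."""
    out: List[Row] = []
    for row in rows:
        out.extend(fn(row))
    return out


def filter_rows(rows: List[Row], predicate: Callable[[Row], bool]) -> List[Row]:
    """Keep only the rows for which the predicate is true."""
    return [row for row in rows if predicate(row)]


def center(batch: List[Row]) -> List[Row]:
    """Subtract the batch mean; needs the whole batch, not a single row."""
    mean = sum(r["id"] for r in batch) / len(batch)
    return [{"id": r["id"] - mean} for r in batch]


rows = [{"id": i} for i in range(5)]
doubled = map_rows(rows, lambda r: {"id": r["id"] * 2})
centered = map_batches(rows, center, batch_size=2)
duplicated = flat_map(rows[:2], lambda r: [r, r])
evens = filter_rows(rows, lambda r: r["id"] % 2 == 0)
```

The `center` function shows why a batch-wise transform is sometimes needed: it depends on a statistic of the whole batch, which a per-row function cannot see.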
Sorting, Shuffling, Repartitioning
Sort the dataset by the specified key column or key function. |
|
Randomly shuffle the rows of this Dataset. |
|
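As a rough illustration of sorting by a key column and shuffling rows, here is a plain-Python sketch over a list of dict rows. It is not library code; `sort_rows`, `shuffle_rows`, and the `descending` and `seed` parameters are hypothetical names used only for this example.

```python
import random
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def sort_rows(rows: List[Row], key: str, descending: bool = False) -> List[Row]:
    """Return a new list ordered by the given key column."""
    return sorted(rows, key=lambda r: r[key], reverse=descending)


def shuffle_rows(rows: List[Row], seed: Optional[int] = None) -> List[Row]:
    """Return a shuffled copy; a fixed seed makes the order reproducible."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled


rows = [
    {"name": "c", "score": 3},
    {"name": "a", "score": 1},
    {"name": "b", "score": 2},
]
by_score = sort_rows(rows, "score")
by_name_desc = sort_rows(rows, "name", descending=True)
shuffled_a = shuffle_rows(rows, seed=0)
shuffled_b = shuffle_rows(rows, seed=0)  # same seed, same order
```

Both helpers return new lists and leave the input untouched, which mirrors the idea that a transformation yields a new dataset rather than mutating the existing one.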
Splitting and Merging Datasets
Materialize and split the dataset into |
|
Materialize and split the dataset at the given indices (like np.split). |
|
Materialize and split the dataset using proportions. |
|
Returns |
|
Materialize and split the dataset into train and test subsets. |
|
Concatenate |
|
Materialize and zip the columns of this dataset with the columns of another. |
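The splitting and merging operations above can be pictured with ordinary list slicing. The sketch below is plain Python, not library code, and all function names are hypothetical. It assumes that proportions sum to less than 1 with the remainder forming a final piece, and that the train/test split takes the test subset from the end without shuffling; neither detail is stated in this section.

```python
from typing import Any, Dict, List, Sequence, Tuple

Row = Dict[str, Any]


def split_at_indices(rows: List[Row], indices: Sequence[int]) -> List[List[Row]]:
    """Cut the row list at each index, yielding len(indices) + 1 pieces."""
    bounds = [0, *indices, len(rows)]
    return [rows[lo:hi] for lo, hi in zip(bounds, bounds[1:])]


def split_proportionately(rows: List[Row],
                          proportions: Sequence[float]) -> List[List[Row]]:
    """Assumption: the fractions sum to < 1 and the rest is the last piece."""
    cumulative, indices = 0.0, []
    for fraction in proportions:
        cumulative += fraction
        indices.append(round(len(rows) * cumulative))
    return split_at_indices(rows, indices)


def train_test_split(rows: List[Row],
                     test_size: float) -> Tuple[List[Row], List[Row]]:
    """Assumption: no shuffling; the test subset is taken from the end."""
    n_test = round(len(rows) * test_size)
    cut = len(rows) - n_test
    return rows[:cut], rows[cut:]


def zip_columns(left: List[Row], right: List[Row]) -> List[Row]:
    """Merge the columns of row i on the left with row i on the right."""
    if len(left) != len(right):
        raise ValueError("both sides must have the same number of rows")
    return [{**a, **b} for a, b in zip(left, right)]


def union(*datasets: List[Row]) -> List[Row]:
    """Concatenate row lists end to end."""
    return [row for rows in datasets for row in rows]


rows = [{"id": i} for i in range(8)]
pieces = split_at_indices(rows, [2, 5])
by_fraction = split_proportionately(rows, [0.25, 0.5])
train, test = train_test_split(rows, test_size=0.25)
zipped = zip_columns(rows[:2], [{"label": "x"}, {"label": "y"}])
merged = union(rows[:1], rows[7:])
```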
Grouped and Global Aggregations
Group rows of a Dataset according to a column. |
|
List the unique elements in a given column. |
|
Aggregate values using one or more functions. |
|
Compute the sum of one or more columns. |
|
Return the minimum of one or more columns. |
|
Return the maximum of one or more columns. |
|
Compute the mean of one or more columns. |
|
Compute the standard deviation of one or more columns. |
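To make the difference between global and grouped aggregations concrete, here is a small plain-Python sketch using only the standard library. It is not library code, and the column names `group` and `value` are made up for the example. The standard deviation here is the sample standard deviation from `statistics.stdev`; which variant the Dataset method computes by default is not stated in this section.

```python
from collections import defaultdict
from statistics import mean, stdev

rows = [
    {"group": "a", "value": 1.0},
    {"group": "a", "value": 3.0},
    {"group": "b", "value": 2.0},
    {"group": "b", "value": 4.0},
    {"group": "b", "value": 6.0},
]

# Global aggregations: one result for the whole column.
values = [r["value"] for r in rows]
global_stats = {
    "sum": sum(values),
    "min": min(values),
    "max": max(values),
    "mean": mean(values),
    "std": stdev(values),  # sample standard deviation (n - 1 denominator)
}

# Unique elements of a column.
unique_groups = sorted({r["group"] for r in rows})

# Grouped aggregations: bucket rows by key, then aggregate each bucket.
buckets = defaultdict(list)
for r in rows:
    buckets[r["group"]].append(r["value"])

grouped_sum = {key: sum(vals) for key, vals in buckets.items()}
grouped_mean = {key: mean(vals) for key, vals in buckets.items()}
```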
Consuming Data
Print up to the given number of rows from the Dataset. |
|
Return up to |
|
Return up to |
|
Return all of the rows in this Dataset. |
|
Return a |
|
Return an iterable over the rows in this dataset. |
|
Return an iterable over batches of data. |
|
Return an iterable over batches of data represented as Torch tensors. |
|
Return an iterable over batches of data represented as TensorFlow tensors. |
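The consumption methods above fall into two shapes: pulling a bounded number of rows, and iterating lazily over rows or batches. A minimal plain-Python sketch of both follows. It is not library code; the function names and the column-oriented batch layout (a dict mapping each column name to a list of values) are assumptions made for illustration.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


def take(rows: Iterable[Row], limit: int) -> List[Row]:
    """Return up to `limit` rows without consuming more than needed."""
    return list(islice(rows, limit))


def iter_batches(rows: Iterable[Row],
                 batch_size: int) -> Iterator[Dict[str, List[Any]]]:
    """Lazily yield column-oriented batches of at most batch_size rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        # Pivot the chunk from a list of rows to a dict of columns.
        yield {key: [row[key] for row in chunk] for key in chunk[0]}


rows = [{"id": i, "x": i * i} for i in range(5)]
first_two = take(rows, 2)
batches = list(iter_batches(rows, batch_size=2))
```

The final batch can be smaller than `batch_size` when the row count is not an exact multiple, as happens here with five rows and a batch size of two.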
I/O and Conversion
Writes the |
|
Writes the |
|
Writes the |
|
Writes a column of the |
|
Write the |
|
Writes the dataset to WebDataset files. |
|
Writes the |
|
Writes the dataset to a custom |
|
Return a Torch IterableDataset over this Dataset. |
|
Return a TensorFlow Dataset over this Dataset. |
|
Convert this |
|
Convert this |
|
Convert this |
|
Convert this |
|
Convert this |
|
Converts this |
|
Converts this |
|
Convert this |
|
Convert this dataset into a distributed RandomAccessDataset (EXPERIMENTAL). |
Inspecting Metadata
Count the number of records in the dataset. |
|
Returns the columns of this Dataset. |
|
Return the schema of the dataset. |
|
Return the number of blocks of this dataset. |
|
Return the in-memory size of the dataset. |
|
Return the list of input files for the dataset. |
|
Returns a string containing execution timing information. |
|
Get a list of references to the underlying blocks of this dataset. |
Execution
Execute and materialize this dataset into object store memory. |
|
Specify the compute strategy for a Dataset transform. |
Serialization
Whether this dataset's lineage is able to be serialized for storage and later deserialized, possibly on a different cluster. |
|
Serialize this dataset's lineage, not the actual data or the existing data futures, to bytes that can be stored and later deserialized, possibly on a different cluster. |
|
Deserialize the provided lineage-serialized Dataset. |
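The distinction between serializing lineage and serializing data can be illustrated with a toy model: the "lineage" is a recipe (a source description plus a list of named operations), and the rows are only produced when the recipe is executed. The sketch below is plain Python and is not how the library encodes lineage; the plan layout, the `OPS` registry, and all function names are hypothetical.

```python
import json
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]

# Registry of named operations. Only the names travel in the serialized
# plan; neither the functions' results nor any rows are stored.
OPS: Dict[str, Callable[[List[Row]], List[Row]]] = {
    "double": lambda rows: [{"id": r["id"] * 2} for r in rows],
    "drop_zero": lambda rows: [r for r in rows if r["id"] != 0],
}


def serialize_lineage(plan: Dict[str, Any]) -> bytes:
    """Encode the recipe, not the data, as bytes."""
    return json.dumps(plan).encode("utf-8")


def deserialize_lineage(blob: bytes) -> Dict[str, Any]:
    """Rebuild the recipe from bytes; nothing is executed yet."""
    return json.loads(blob.decode("utf-8"))


def execute(plan: Dict[str, Any]) -> List[Row]:
    """Re-create the rows by replaying the recipe from its source."""
    rows: List[Row] = [{"id": i} for i in range(plan["source"]["range"])]
    for name in plan["ops"]:
        rows = OPS[name](rows)
    return rows


plan = {"source": {"range": 4}, "ops": ["double", "drop_zero"]}
blob = serialize_lineage(plan)
restored = deserialize_lineage(blob)
result = execute(restored)
```

Because the bytes hold only the recipe, they stay small regardless of dataset size, and the receiving side must be able to reach the original source and operations in order to recompute the rows.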
Internals
alias of |
|
Execution stats for this block. |
|
Metadata about the block. |
|
Provides accessor methods for a specific block. |