ray.data.Dataset.unique

Dataset.unique(column: str) → List[Any]

List the unique elements in a given column.

Note

This operation triggers execution of the lazy transformations applied to this dataset.

Note

This operation requires all inputs to be materialized in the object store before it can execute.

Examples

>>> import ray
>>> ds = ray.data.from_items([1, 2, 3, 2, 3])
>>> ds.unique("item")
[1, 2, 3]

This method is useful for finding the set of labels in a machine learning dataset:

>>> import ray
>>> ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv")
>>> ds.unique("target")
[0, 1, 2]

One common use case is to map the integer class labels to readable class names:

>>> classes = {0: 'Setosa', 1: 'Versicolor', 2: 'Virginica'}
>>> def preprocessor(df, classes):
...     df["variety"] = df["target"].map(classes)
...     return df
>>> train_ds = ds.map_batches(
...     preprocessor, fn_kwargs={"classes": classes}, batch_format="pandas")
>>> train_ds.sort("sepal length (cm)").take(1)  # Sort to make it deterministic
[{'sepal length (cm)': 4.3, ..., 'variety': 'Setosa'}]
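For the opposite direction, turning string labels into integer indices, the list returned by unique can seed a lookup table. The following is a minimal sketch in plain Python and is not part of Ray's API: build_label_index is a hypothetical helper, and the labels list is a stand-in for the result of a call such as ds.unique("variety"). The sketch sorts the labels because it does not assume any particular order in the returned list.

```python
from typing import Any, Dict, List


def build_label_index(unique_labels: List[Any]) -> Dict[Any, int]:
    """Assign each distinct label an integer index (hypothetical helper)."""
    # Sort so the mapping does not depend on the order of the input list.
    return {label: index for index, label in enumerate(sorted(unique_labels))}


# Stand-in for the list a call like ds.unique("variety") might return.
labels = ["Virginica", "Setosa", "Versicolor"]
label_to_index = build_label_index(labels)
print(label_to_index)  # {'Setosa': 0, 'Versicolor': 1, 'Virginica': 2}
```

A mapping like this could then be passed to map_batches through fn_kwargs, in the same way as classes in the example above.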

Time complexity: O(dataset size / parallelism)

Parameters: column – The column to collect unique elements over.
Returns: A list of the unique elements in the given column.