ray.data.Dataset.unique

Dataset.unique(column: str) → List[Any]

List the unique elements in a given column.

Note

This operation requires all inputs to be materialized in the object store before it can execute.

Examples

>>> import ray
>>> ds = ray.data.from_items([1, 2, 3, 2, 3])
>>> ds.unique("item")
[1, 2, 3]
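
As a rough mental model only, the result corresponds to de-duplicating the values of one column. The sketch below is plain Python with no Ray dependency; `unique_values` and `rows` are hypothetical names used for illustration, and it says nothing about how Ray computes the result across blocks or in what order Ray returns it.

```python
# Rows shaped like those produced by ray.data.from_items([1, 2, 3, 2, 3]),
# where each value is stored under the "item" column.
rows = [{"item": v} for v in [1, 2, 3, 2, 3]]


def unique_values(rows, column):
    """Return the distinct values of `column`, in first-seen order.

    Hypothetical single-process helper; not part of the Ray API.
    """
    seen = set()
    result = []
    for row in rows:
        value = row[column]
        if value not in seen:
            seen.add(value)
            result.append(value)
    return result


values = unique_values(rows, "item")
print(values)
```

Unlike this in-memory loop, the Ray operation works on a distributed dataset, which is why it needs the inputs materialized first.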

This method is useful for discovering the set of labels in a machine learning dataset:

>>> import ray
>>> ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv")
>>> ds.unique("target")
[0, 1, 2]

A common follow-up is to map the integer class labels to human-readable class names:

>>> classes = {0: 'Setosa', 1: 'Versicolor', 2: 'Virginica'}
>>> def preprocessor(df, classes):
...     df["variety"] = df["target"].map(classes)
...     return df
>>> train_ds = ds.map_batches(
...     preprocessor, fn_kwargs={"classes": classes}, batch_format="pandas")
>>> train_ds.sort("sepal length (cm)").take(1)  # Sort to make it deterministic
[{'sepal length (cm)': 4.3, ..., 'variety': 'Setosa'}]
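
When a label column holds strings rather than integers, the list of unique values can be used to build an integer encoding. The following is a minimal sketch in plain Python, assuming `unique_labels` stands in for a list returned by a call such as `ds.unique("variety")`; the names `label_to_index`, `index_to_label`, and `encode` are hypothetical and not part of Ray.

```python
# Hypothetical stand-in for the list returned by ds.unique("variety").
# The values are hard-coded so the sketch runs without a Ray cluster.
unique_labels = ["Virginica", "Setosa", "Versicolor"]

# Sort first so the mapping does not depend on the order of the input list.
label_to_index = {label: i for i, label in enumerate(sorted(unique_labels))}

# Inverse mapping, useful for turning predictions back into names.
index_to_label = {i: label for label, i in label_to_index.items()}


def encode(labels, mapping):
    """Replace each string label with its integer index."""
    return [mapping[label] for label in labels]


encoded = encode(["Setosa", "Virginica", "Setosa"], label_to_index)
print(encoded)
```

Sorting the labels before enumerating them keeps the encoding stable across runs, so a model trained with one mapping can be served with the same one.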

Time complexity: O(dataset size * log(dataset size / parallelism))

Parameters:

column – The column to collect unique elements over.

Returns:

A list of the unique elements in the given column.