ray.data.Dataset.unique
- Dataset.unique(column: str) → List[Any]
List the unique elements in a given column.
Note
This operation requires all inputs to be materialized in the object store before it can execute.
Examples
>>> import ray
>>> ds = ray.data.from_items([1, 2, 3, 2, 3])
>>> ds.unique("item")
[1, 2, 3]
This function is useful for listing the distinct labels in a machine learning dataset:
>>> import ray
>>> ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv")
>>> ds.unique("target")
[0, 1, 2]
One common use case is to map the integer class labels to readable class names for training and inference:
>>> classes = {0: 'Setosa', 1: 'Versicolor', 2: 'Virginica'}
>>> def preprocessor(df, classes):
...     df["variety"] = df["target"].map(classes)
...     return df
>>> train_ds = ds.map_batches(
...     preprocessor, fn_kwargs={"classes": classes}, batch_format="pandas")
>>> train_ds.sort("sepal length (cm)").take(1)  # Sort to make it deterministic
[{'sepal length (cm)': 4.3, ..., 'variety': 'Setosa'}]
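Going the other way, from class names to integer indices, can be sketched from the list that unique returns. The block below is a minimal illustration in plain Python: `unique_labels` is a hypothetical stand-in for the result of a call such as `ds.unique("variety")`, and `label_to_index`, `index_to_label`, and `encode` are names introduced here for illustration, not part of Ray. It sorts the labels first because it does not assume any particular order in the returned list.

```python
# Hypothetical stand-in for the list returned by ds.unique("variety").
# With Ray installed, this value would come from the Dataset call itself.
unique_labels = ["Virginica", "Setosa", "Versicolor"]

# Sort so the mapping is deterministic regardless of the order of the input list.
label_to_index = {label: i for i, label in enumerate(sorted(unique_labels))}

# Inverse mapping, useful for turning predicted indices back into names.
index_to_label = {i: label for label, i in label_to_index.items()}


def encode(labels):
    """Convert a list of class names into their integer indices."""
    return [label_to_index[label] for label in labels]


encoded = encode(["Virginica", "Setosa", "Setosa"])
```

A dictionary like `label_to_index` could then be passed through `fn_kwargs` to a batch function, in the same way `classes` is passed in the example above.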
Time complexity: O(dataset size * log(dataset size / parallelism))
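To give some intuition for a bound of this shape, the sketch below deduplicates by sorting independent chunks. It is an illustration in plain Python under stated assumptions, not a description of how Ray implements unique: `unique_by_sorting` and its `parallelism` argument are hypothetical names, and the chunks are processed sequentially here rather than in parallel. With n values split into p chunks, sorting each chunk costs O((n/p) * log(n/p)), which sums to O(n * log(n/p)) across all chunks.

```python
from itertools import groupby


def unique_by_sorting(values, parallelism=2):
    """Return the sorted distinct values, deduplicating chunk by chunk."""
    # Split the input into `parallelism` interleaved chunks of roughly equal size.
    chunks = [values[i::parallelism] for i in range(parallelism)]

    # Sort each chunk, then collapse runs of equal neighbours into one value.
    per_chunk = [[key for key, _ in groupby(sorted(chunk))] for chunk in chunks]

    # Combine the per-chunk results, dropping duplicates that span chunks.
    merged = set()
    for chunk_uniques in per_chunk:
        merged.update(chunk_uniques)

    # Sort the final result so the output order is deterministic.
    return sorted(merged)


result = unique_by_sorting([1, 2, 3, 2, 3], parallelism=2)
```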
- Parameters:
column – The column to collect unique elements over.
- Returns:
A list with unique elements in the given column.