ray.data.Dataset.sort

Dataset.sort(key: str | List[str] | None = None, descending: bool | List[bool] = False, boundaries: List[int | float] = None) → Dataset

Sort the dataset by the specified key column or columns.

Note

The descending parameter must be a boolean, or a list of booleans. If it is a list, all items in the list must share the same direction. Multi-directional sort is not supported yet.
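The rule above can be made concrete with a small pure-Python sketch. The helper name `normalize_descending` is hypothetical and is not part of Ray; it only mirrors the documented constraints (a single boolean, or a list with one flag per key column, all pointing the same way) and does not claim to reproduce Ray's internal validation.

```python
from typing import List, Union


def normalize_descending(
    key: Union[str, List[str]],
    descending: Union[bool, List[bool]],
) -> List[bool]:
    """Hypothetical helper: expand `descending` to one flag per key column.

    Illustrates the documented rule only; not Ray's implementation.
    """
    keys = [key] if isinstance(key, str) else list(key)

    # A single boolean applies to every key column.
    if isinstance(descending, bool):
        return [descending] * len(keys)

    flags = list(descending)
    # A list must supply exactly one flag per key column.
    if len(flags) != len(keys):
        raise ValueError("descending must have one flag per key column")
    # Mixed directions (for example [True, False]) are rejected.
    if len(set(flags)) > 1:
        raise ValueError("mixed sort directions are not supported")
    return flags


print(normalize_descending(["a", "b"], True))          # [True, True]
print(normalize_descending(["a", "b"], [False, False]))  # [False, False]
```

Under these assumptions, a call such as `normalize_descending(["a", "b"], [True, False])` raises `ValueError`, which corresponds to the unsupported multi-directional case.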

Note

This operation requires all inputs to be materialized in the object store before it can execute.

Examples

>>> import ray
>>> ds = ray.data.range(15)
>>> ds = ds.sort("id", descending=False, boundaries=[5, 10])
>>> for df in ray.get(ds.to_pandas_refs()):
...     print(df)
   id
0   0
1   1
2   2
3   3
4   4
   id
0   5
1   6
2   7
3   8
4   9
   id
0  10
1  11
2  12
3  13
4  14

Time complexity: O(dataset size * log(dataset size / parallelism))

Parameters:
  • key – The column or a list of columns to sort by.

  • descending – Whether to sort in descending order. Must be a boolean or a list of booleans with one entry per key column.

  • boundaries – A list of values that define how the dataset is repartitioned into blocks. For example, with boundaries [10, 20], rows with values less than 10 go to the first block, rows with values greater than or equal to 10 and less than 20 go to the second block, and rows with values greater than or equal to 20 go to the third block. If not provided, the boundaries are sampled from the input blocks. Currently only numeric columns are supported.

Returns:

A new, sorted Dataset.
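
The block assignment rule described for boundaries can be sketched in plain Python. This is only an illustration of the documented rule using the standard library's `bisect_right`; the helper `assign_block` is a hypothetical name, and nothing here is meant to describe how Ray implements the partitioning internally. The values and boundaries match the example above (`ray.data.range(15)` with `boundaries=[5, 10]`).

```python
from bisect import bisect_right
from typing import List, Union

Number = Union[int, float]


def assign_block(value: Number, boundaries: List[Number]) -> int:
    """Hypothetical helper: 0-based index of the block a value falls into.

    bisect_right counts the boundaries that are <= value, so a value equal
    to a boundary lands in the higher block, as the documentation describes.
    """
    return bisect_right(boundaries, value)


boundaries = [5, 10]
# N boundaries produce N + 1 blocks.
blocks: List[List[int]] = [[] for _ in range(len(boundaries) + 1)]
for value in range(15):
    blocks[assign_block(value, boundaries)].append(value)

for block in blocks:
    print(block)
# [0, 1, 2, 3, 4]
# [5, 6, 7, 8, 9]
# [10, 11, 12, 13, 14]
```

The three printed lists line up with the three DataFrames shown in the example output: values below 5, values from 5 up to but not including 10, and values of 10 or more.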