ray.data.Dataset.sort

Dataset.sort(key: str | List[str], descending: bool | List[bool] = False, boundaries: List[int | float] = None) -> Dataset

Sort the dataset by the specified key column or list of key columns. The key parameter is required; it cannot be None.

Note

If provided, the boundaries parameter can only be used to partition the first sort key.

Note

This operation requires all inputs to be materialized in the object store before it can execute.

Examples

>>> import ray
>>> ds = ray.data.range(15)
>>> ds = ds.sort("id", descending=False, boundaries=[5, 10])
>>> for df in ray.get(ds.to_pandas_refs()):
...     print(df)
   id
0   0
1   1
2   2
3   3
4   4
   id
0   5
1   6
2   7
3   8
4   9
   id
0  10
1  11
2  12
3  13
4  14

Time complexity: O(dataset size * log(dataset size / parallelism))

Parameters:
  • key – The column or a list of columns to sort by.

  • descending – Whether to sort in descending order. Must be a boolean, or a list of booleans whose length matches the number of key columns.

  • boundaries – The list of values used to repartition the dataset; it applies to the first sort key only. For example, with boundaries [10, 20], rows with values less than 10 go to the first block, rows with values greater than or equal to 10 and less than 20 go to the second block, and rows with values greater than or equal to 20 go to the third block. If not provided, the boundaries are sampled from the input blocks. Only numeric columns are currently supported.
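
The block assignment described for boundaries can be illustrated without Ray. The sketch below is a plain-Python model of the documented rule only (a value equal to a boundary goes to the later block); it is not Ray's implementation, and the helper name partition_by_boundaries is hypothetical.

```python
from bisect import bisect_right


def partition_by_boundaries(values, boundaries):
    """Hypothetical helper: model how sorted values fall into blocks.

    Assumes `boundaries` is sorted in ascending order. Produces
    len(boundaries) + 1 blocks; a value equal to a boundary is placed
    in the block that starts at that boundary.
    """
    blocks = [[] for _ in range(len(boundaries) + 1)]
    for value in sorted(values):
        # bisect_right counts the boundaries that are <= value,
        # which is exactly the index of the target block.
        blocks[bisect_right(boundaries, value)].append(value)
    return blocks


# Same inputs as the example above: 15 ids split at 5 and 10.
blocks = partition_by_boundaries(range(15), [5, 10])
for block in blocks:
    print(block)
```

With these inputs the three blocks hold the ids 0 to 4, 5 to 9, and 10 to 14, matching the three DataFrames printed in the example.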

Returns:

A new, sorted Dataset.

Raises:

ValueError – If the sort key is None.