Dataset.std(on: Optional[Union[str, List[str]]] = None, ddof: int = 1, ignore_nulls: bool = True) -> Union[Any, Dict[str, Any]]

Compute standard deviation over entire dataset.


This operation will trigger execution of the lazy transformations performed on this dataset.


>>> import ray
>>> round(ray.data.range(100).std("id", ddof=0), 5)
28.86607
>>> ray.data.from_items([
...     {"A": i, "B": i**2}
...     for i in range(100)]).std(["A", "B"])
{'std(A)': 29.011491975882016, 'std(B)': 2968.1748039269296}


This uses Welford’s online method for an accumulator-style computation of the standard deviation. This method was chosen for its numerical stability and because it computes the result in a single pass. It may give different (but more accurate) results than NumPy, Pandas, and sklearn, which use a less numerically stable two-pass algorithm. See https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford’s_online_algorithm
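Welford’s single-pass update can be sketched in a few lines of plain Python. This is a minimal illustration of the algorithm, not Ray’s internal implementation; the helper name `welford_std` is ours:

```python
import math

def welford_std(values, ddof=0):
    # Welford's online algorithm: maintain a running count, mean, and
    # sum of squared deviations (m2) in a single pass over the data.
    count = 0
    mean = 0.0
    m2 = 0.0
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # note: uses the *updated* mean
    if count - ddof <= 0:
        return None  # mirror the null result for empty/degenerate input
    return math.sqrt(m2 / (count - ddof))

# Matches the doctest above: std of 0..99 with ddof=0.
print(round(welford_std(range(100)), 5))  # 28.86607
```

Because each element updates the accumulator exactly once, partial results from different blocks can be merged, which is what makes the method a good fit for distributed aggregation.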

  • on – A column name or a list of column names to aggregate.

  • ddof – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

  • ignore_nulls – Whether to ignore null values. If True, null values will be ignored when computing the std; if False, if a null value is encountered, the output will be None. We consider np.nan, None, and pd.NaT to be null values. Default is True.
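The effect of ddof on the divisor N - ddof can be shown with plain Python (no Ray dependency; the numbers below are exact for this small input):

```python
import math

data = [0, 1, 2, 3, 4]
n = len(data)
mean = sum(data) / n
ss = sum((x - mean) ** 2 for x in data)  # sum of squared deviations = 10.0

pop_std = math.sqrt(ss / (n - 0))     # ddof=0: population std, sqrt(10/5)
sample_std = math.sqrt(ss / (n - 1))  # ddof=1 (the default): sample std, sqrt(10/4)
print(pop_std, sample_std)
```

With the default ddof=1, the sample estimate is slightly larger than the population value, since the divisor shrinks from N to N - 1.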


The standard deviation result.

For different values of on, the return varies:

  • on=None: a dict containing the column-wise std of all columns,

  • on="col": a scalar representing the std of all items in column "col",

  • on=["col_1", ..., "col_n"]: an n-column dict containing the column-wise std of the provided columns.

If the dataset is empty, if all values are null, or if any value is null and ignore_nulls is False, the output is None.
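The three return shapes above can be mimicked with a small pure-Python dispatcher. This is a hedged sketch of the shape logic only, not the Ray API; `std_by_on` and `_std` are illustrative names:

```python
import math

def _std(values, ddof=1):
    # Two-pass sample std over a plain list (for shape illustration only).
    n = len(values)
    mean = sum(values) / n
    ss = sum((v - mean) ** 2 for v in values)
    return math.sqrt(ss / (n - ddof))

def std_by_on(columns, on=None, ddof=1):
    # columns: dict mapping column name -> list of values.
    if on is None:
        # on=None -> dict with the std of every column
        return {f"std({c})": _std(v, ddof) for c, v in columns.items()}
    if isinstance(on, str):
        # on="col" -> a single scalar
        return _std(columns[on], ddof)
    # on=["col_1", ..., "col_n"] -> dict restricted to the listed columns
    return {f"std({c})": _std(columns[c], ddof) for c in on}

cols = {"A": [0, 1, 2, 3], "B": [0, 1, 4, 9]}
scalar = std_by_on(cols, on="A")        # float
all_cols = std_by_on(cols)              # {'std(A)': ..., 'std(B)': ...}
subset = std_by_on(cols, on=["A", "B"])  # dict over the listed columns
```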