ray.data.Dataset.summary

Dataset.summary(columns: List[str] | None = None, override_dtype_agg_mapping: Dict[DataType, Callable[[str], List[AggregateFnV2]]] | None = None) → DatasetSummary

Generate a statistical summary of the dataset, organized by data type.

This method computes various statistics for different column dtypes:

For numerical dtypes (int*, float*, decimal, bool): count, mean, min, max, std, approx_quantile (median), missing%, zero%
For string and binary dtypes: count, missing%, approx_top_k (top 10 values)
For temporal dtypes (timestamp, date, time, duration): count, min, max, missing%
For other dtypes: count, missing%, approx_top_k

You can customize the aggregations performed for specific data types using the override_dtype_agg_mapping parameter.

The summary separates statistics into two tables:

Schema-matching stats: Statistics that preserve the original column type (e.g., min/max for integers)
Schema-changing stats: Statistics that change the type (e.g., mean converts int to float)
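
The distinction between the two tables can be illustrated without Ray. The following is a minimal sketch in plain Python, not Ray's implementation; the `ages` list is a hypothetical stand-in for an integer column.

```python
import statistics

# Hypothetical stand-in for an integer column.
ages = [25, 30, 0]

# Schema-matching: min and max are drawn from the column's own values,
# so the result keeps the column's type (int here).
lowest, highest = min(ages), max(ages)

# Schema-changing: the mean of integers is generally not an integer,
# so the result is a float.
average = statistics.fmean(ages)

print(type(lowest).__name__, type(highest).__name__, type(average).__name__)
```

Keeping the two groups in separate tables presumably lets the schema-matching table reuse the dataset's original column types.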

Note

This operation will trigger execution of the lazy transformations performed on this dataset.

Note

This operation requires all inputs to be materialized in the object store before it can execute.

Examples

>>> import ray
>>> ds = ray.data.from_items([
...     {"age": 25, "salary": 50000, "name": "Alice", "city": "NYC"},
...     {"age": 30, "salary": 60000, "name": None, "city": "LA"},
...     {"age": 0, "salary": None, "name": "Bob", "city": None},
... ])
>>> summary = ds.summary()
>>> # Get combined pandas DataFrame with all statistics
>>> summary.to_pandas()  
          statistic        age                         city                           name        salary
0  approx_quantile[0]  25.000000                         None                           None  60000.000000
1      approx_topk[0]        NaN   {'city': 'LA', 'count': 1}    {'count': 1, 'name': 'Bob'}           NaN
2      approx_topk[1]        NaN  {'city': 'NYC', 'count': 1}  {'count': 1, 'name': 'Alice'}           NaN
3               count   3.000000                            3                              3      3.000000
4                 max  30.000000                          NaN                            NaN  60000.000000
5                mean  18.333333                         None                           None  55000.000000
6                 min   0.000000                          NaN                            NaN  50000.000000
7         missing_pct   0.000000                    33.333333                      33.333333     33.333333
8                 std  13.123346                         None                           None   5000.000000
9            zero_pct  33.333333                         None                           None      0.000000

>>> # Access individual column statistics
>>> summary.get_column_stats("age")  
statistic               value
 approx_quantile[0]  25.000000
     approx_topk[0]        NaN
     approx_topk[1]        NaN
              count   3.000000
                max  30.000000
               mean  18.333333
                min   0.000000
        missing_pct   0.000000
                std  13.123346
          zero_pct  33.333333
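
As a cross-check, the numeric statistics in the output above can be recomputed with the Python standard library. This is a sketch under assumptions inferred from that output, not Ray's implementation: count includes null rows, std is the population standard deviation, and missing_pct is taken over all rows. The denominator used for zero_pct (non-null rows) is also an assumption; this example cannot distinguish it from a denominator of all rows. The helper `column_stats` is hypothetical.

```python
import statistics

# The numeric columns from the example dataset above.
rows = [
    {"age": 25, "salary": 50000},
    {"age": 30, "salary": 60000},
    {"age": 0, "salary": None},
]


def column_stats(rows, col):
    """Recompute a few summary statistics for one numeric column."""
    values = [row[col] for row in rows]
    present = [v for v in values if v is not None]
    return {
        # Assumption: count is the number of rows, nulls included.
        "count": len(values),
        "min": min(present),
        "max": max(present),
        "mean": statistics.fmean(present),
        # Assumption: population standard deviation (ddof=0).
        "std": statistics.pstdev(present),
        "missing_pct": 100 * (len(values) - len(present)) / len(values),
        # Assumption: zeros are counted relative to non-null values.
        "zero_pct": 100 * sum(v == 0 for v in present) / len(present),
    }


age = column_stats(rows, "age")
salary = column_stats(rows, "salary")

for name, stats in (("age", age), ("salary", salary)):
    print(name, stats)
```

Under these assumptions, the recomputed values agree with the age and salary columns shown above.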

Custom aggregations for specific types:

>>> from ray.data.datatype import DataType
>>> from ray.data.aggregate import Sum, Count
>>> # Override aggregations for int64 columns
>>> custom_mapping = {
...     DataType.int64(): lambda col: [Count(on=col), Sum(on=col)]
... }
>>> summary = ds.summary(override_dtype_agg_mapping=custom_mapping)

Parameters:

columns – Optional list of column names to include in the summary. If None, all columns will be included.
override_dtype_agg_mapping – Optional mapping from DataType to factory functions. Each factory function takes a column name and returns a list of aggregators for that column. This will be merged with the default mapping, with user-provided mappings taking precedence.
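
The merge behavior described for override_dtype_agg_mapping can be pictured as a dictionary update. The sketch below uses plain Python with hypothetical stand-ins (strings in place of DataType keys and aggregator objects) and assumes the merge happens per dtype key; it is not Ray's implementation.

```python
# Hypothetical stand-ins: strings replace DataType keys and aggregator objects.
default_mapping = {
    "int64": lambda col: [f"Count({col})", f"Mean({col})", f"Min({col})", f"Max({col})"],
    "string": lambda col: [f"Count({col})", f"ApproxTopK({col})"],
}
override_mapping = {
    "int64": lambda col: [f"Count({col})", f"Sum({col})"],
}

# User-provided entries replace the default factory for the same key;
# keys that are not overridden keep their default factory.
merged = {**default_mapping, **override_mapping}

int_aggs = merged["int64"]("age")
str_aggs = merged["string"]("name")

print(int_aggs)
print(str_aggs)
```

Under this reading, overriding a dtype replaces its whole list of aggregators rather than appending to the defaults, so any default statistic you still want for that dtype must be listed again in the factory.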

Returns:

A DatasetSummary object with methods to access statistics and the original dataset schema. Use to_pandas() to get all statistics as a DataFrame, or get_column_stats(col) to get the statistics for a specific column.

PublicAPI (alpha): This API is in alpha and may change before becoming stable.