Aggregation#

Aggregations refer to any data transformation that produces scalar values from arrays. In the previous examples, several of them were used, including count and sum. You may now be wondering what happens when you apply sum() to a GroupBy object. Optimised implementations exist for many common aggregations, such as the one in the following table. However, they are not limited to this set of methods.

Function name Description

Function name

Description

any, all

Returns True if one (or more) or all of the non-NA values are truthy

count

Number of non-NA values

cummin, cummax

Cumulative minimum and maximum of the non-NA values

cumsum

Cumulative sum of the non-NA values

cumprod

Cumulative product of non-NA values

first, last

First and last non-NA values

mean

Mean of the non-NA values

median

Arithmetic median of the non-NA values

min, max

Minimum and maximum of the non-NA values

nth

Retrieval of the nth largest value

ohlc

calculates the four open-high-low-close statistics for time series-like data

prod

Product of the non-NA values

quantile

calculates the sample quantile

rank

Ordinal ranks of non-NA values, as when calling Series.rank

sum

Sum of non-NA values

std, var

Standard deviation and variance of the sample

You can use your own aggregations and also call any method that is also defined for the grouped object. For example, the Series method nsmallest selects the smallest requested number of values from the data.

Although nsmallest is not explicitly implemented for GroupBy, we can still use it with a non-optimised implementation. Internally, GroupBy decomposes the Series, calls df.nsmallest(n) for each part and then merges these results in the result object:

[1]:
import numpy as np
import pandas as pd
[2]:
df = pd.DataFrame(
    {
        "Title": [
            "Jupyter Tutorial",
            "Jupyter Tutorial",
            "PyViz Tutorial",
            None,
            "Python Basics",
            "Python Basics",
        ],
        "2021-12": [30134, 6073, 4873, None, 427, 95],
        "2022-01": [33295, 7716, 3930, None, 276, 226],
        "2022-02": [19651, 6547, 2573, None, 525, 157],
    }
)

df
[2]:
Title 2021-12 2022-01 2022-02
0 Jupyter Tutorial 30134.0 33295.0 19651.0
1 Jupyter Tutorial 6073.0 7716.0 6547.0
2 PyViz Tutorial 4873.0 3930.0 2573.0
3 None NaN NaN NaN
4 Python Basics 427.0 276.0 525.0
5 Python Basics 95.0 226.0 157.0
[3]:
grouped = df.groupby("Title")
[4]:
grouped["2022-01"].nsmallest(1)
[4]:
Title
Jupyter Tutorial  1    7716.0
PyViz Tutorial    2    3930.0
Python Basics     5     226.0
Name: 2022-01, dtype: float64

To use a custom aggregation function, pass any function that aggregates an array to the aggregate or agg method:

[5]:
def range(arr):
    return arr.max() - arr.min()

grouped.agg(range)
[5]:
2021-12 2022-01 2022-02
Title
Jupyter Tutorial 24061.0 25579.0 13104.0
PyViz Tutorial 0.0 0.0 0.0
Python Basics 332.0 50.0 368.0

You will find that some methods like describe also work, even though they are not strictly speaking aggregations:

[6]:
grouped.describe()
[6]:
2021-12 2022-01 2022-02
count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max
Title
Jupyter Tutorial 2.0 18103.5 17013.696262 6073.0 12088.25 18103.5 24118.75 30134.0 2.0 20505.5 ... 26900.25 33295.0 2.0 13099.0 9265.927261 6547.0 9823.0 13099.0 16375.0 19651.0
PyViz Tutorial 1.0 4873.0 NaN 4873.0 4873.00 4873.0 4873.00 4873.0 1.0 3930.0 ... 3930.00 3930.0 1.0 2573.0 NaN 2573.0 2573.0 2573.0 2573.0 2573.0
Python Basics 2.0 261.0 234.759451 95.0 178.00 261.0 344.00 427.0 2.0 251.0 ... 263.50 276.0 2.0 341.0 260.215295 157.0 249.0 341.0 433.0 525.0

3 rows × 24 columns

Note:

Custom aggregation functions are generally much slower than the optimised functions in the table above. This is because there is some extra work involved in creating the intermediate data sets for the group (function calls, reordering of data).

Additional functions column by column#

As we have already seen, aggregating a Series or all columns of a DataFrame is a matter of using aggregate (or agg) with the desired function or calling a method such as mean or std. However, it is more common to aggregate simultaneously with another function depending on the column or with multiple functions.

[7]:
grouped.agg("mean")
[7]:
2021-12 2022-01 2022-02
Title
Jupyter Tutorial 18103.5 20505.5 13099.0
PyViz Tutorial 4873.0 3930.0 2573.0
Python Basics 261.0 251.0 341.0

If you pass a list of functions or function names instead, you will get back a DataFrame with column names from the functions:

[8]:
grouped.agg(["mean", "std", range])
[8]:
2021-12 2022-01 2022-02
mean std range mean std range mean std range
Title
Jupyter Tutorial 18103.5 17013.696262 24061.0 20505.5 18087.084356 25579.0 13099.0 9265.927261 13104.0
PyViz Tutorial 4873.0 NaN 0.0 3930.0 NaN 0.0 2573.0 NaN 0.0
Python Basics 261.0 234.759451 332.0 251.0 35.355339 50.0 341.0 260.215295 368.0

Here we have passed agg a list of aggregation functions to be evaluated independently for the data groups.

You don’t need to accept the names that GroupBy gives to the columns; in particular, lambda functions have the name <lambda>, which makes them difficult to identify. When you pass a list of tuples, the first element of each tuple is used as the column name in the DataFrame:

[9]:
grouped.agg(
    [("Mean", "mean"), ("Standard deviation", "std"), ("Range", range)]
)
[9]:
2021-12 2022-01 2022-02
Mean Standard deviation Range Mean Standard deviation Range Mean Standard deviation Range
Title
Jupyter Tutorial 18103.5 17013.696262 24061.0 20505.5 18087.084356 25579.0 13099.0 9265.927261 13104.0
PyViz Tutorial 4873.0 NaN 0.0 3930.0 NaN 0.0 2573.0 NaN 0.0
Python Basics 261.0 234.759451 332.0 251.0 35.355339 50.0 341.0 260.215295 368.0

With a DataFrame, you have the option of specifying a list of functions to be applied to all columns or to different functions per column. Let’s say we want to calculate the same three statistics for the columns:

[10]:
stats = ["count", "mean", "max"]

evaluations = grouped.agg(stats)

evaluations
[10]:
2021-12 2022-01 2022-02
count mean max count mean max count mean max
Title
Jupyter Tutorial 2 18103.5 30134.0 2 20505.5 33295.0 2 13099.0 19651.0
PyViz Tutorial 1 4873.0 4873.0 1 3930.0 3930.0 1 2573.0 2573.0
Python Basics 2 261.0 427.0 2 251.0 276.0 2 341.0 525.0

As you can see, the resulting DataFrame has hierarchical columns, just as you would get if you aggregated each column separately and used pandas.concat to join the results together, using the column names as key arguments:

[11]:
evaluations["2021-12"]
[11]:
count mean max
Title
Jupyter Tutorial 2 18103.5 30134.0
PyViz Tutorial 1 4873.0 4873.0
Python Basics 2 261.0 427.0

As before, a list of tuples with user-defined names can be passed:

[12]:
tuples = [("Mean", "mean"), ("Variance", np.var)]

grouped[["2021-12", "2022-01"]].agg(tuples)
[12]:
2021-12 2022-01
Mean Variance Mean Variance
Title
Jupyter Tutorial 18103.5 289465860.5 20505.5 327142620.5
PyViz Tutorial 4873.0 NaN 3930.0 NaN
Python Basics 261.0 55112.0 251.0 1250.0

If we now assume that potentially different functions are to be applied to one or more of the columns, we pass a dict to agg that contains an assignment of column names to one of the function specifications:

[13]:
grouped.agg({"2021-12": "mean", "2022-01": np.var})
[13]:
2021-12 2022-01
Title
Jupyter Tutorial 18103.5 327142620.5
PyViz Tutorial 4873.0 NaN
Python Basics 261.0 1250.0
[14]:
grouped.agg({"2021-12": ["min", "max", "mean", "std"], "2022-01": "sum"})
[14]:
2021-12 2022-01
min max mean std sum
Title
Jupyter Tutorial 6073.0 30134.0 18103.5 17013.696262 41011.0
PyViz Tutorial 4873.0 4873.0 4873.0 NaN 3930.0
Python Basics 95.0 427.0 261.0 234.759451 502.0

Return aggregated data without row indices#

In all the examples so far, the aggregated data is returned with an index. Since this is not always desired, you can disable this behaviour in most cases by passing as_index=False to groupby:

[15]:
grouped.agg([range], as_index=False).mean()
[15]:
2021-12  8131.000000
2022-01  8543.000000
2022-02  4490.666667
dtype: float64

By using the method as_index=False, some unnecessary calculations are avoided. Of course, it is always possible to get the result back with index by calling reset_index for the result.