Aggregation¶

Aggregations refer to any data transformation that produces scalar values from arrays. In the previous examples, several of them were used, including count and sum. You may now be wondering what happens when you apply sum() to a GroupBy object. Optimised implementations exist for many common aggregations, such as the one in the following table. However, they are not limited to this set of methods.

Function name Description

Function name	Description
`any`, `all`	Returns `True` if one (or more) or all of the non-NA values are truthy
`count`	Number of non-NA values
`cummin`, `cummax`	Cumulative minimum and maximum of the non-NA values
`cumsum`	Cumulative sum of the non-NA values
`cumprod`	Cumulative product of non-NA values
`first`, `last`	First and last non-NA values
`mean`	Mean of the non-NA values
`median`	Arithmetic median of the non-NA values
`min`, `max`	Minimum and maximum of the non-NA values
`nth`	Retrieval of the nth largest value
`ohlc`	calculates the four open-high-low-close statistics for time series-like data
`prod`	Product of the non-NA values
`quantile`	calculates the sample quantile
`rank`	Ordinal ranks of non-NA values, as when calling `Series.rank`
`sum`	Sum of non-NA values
`std`, `var`	Standard deviation and variance of the sample

You can use your own aggregations and also call any method that is also defined for the grouped object. For example, the Series method nsmallest selects the smallest requested number of values from the data.

Although nsmallest is not explicitly implemented for GroupBy, we can still use it with a non-optimised implementation. Internally, GroupBy decomposes the Series, calls df.nsmallest(n) for each part and then merges these results in the result object:

[1]:

import pandas as pd

[2]:

df = pd.DataFrame(
    {
        "Title": [
            "Jupyter Tutorial",
            "Jupyter Tutorial",
            "PyViz Tutorial",
            None,
            "Python Basics",
            "Python Basics",
        ],
        "2021-12": [30134, 6073, 4873, None, 427, 95],
        "2022-01": [33295, 7716, 3930, None, 276, 226],
        "2022-02": [19651, 6547, 2573, None, 525, 157],
    },
)

df

[2]:

	Title	2021-12	2022-01	2022-02
0	Jupyter Tutorial	30134.0	33295.0	19651.0
1	Jupyter Tutorial	6073.0	7716.0	6547.0
2	PyViz Tutorial	4873.0	3930.0	2573.0
3	None	NaN	NaN	NaN
4	Python Basics	427.0	276.0	525.0
5	Python Basics	95.0	226.0	157.0

[3]:

grouped = df.groupby("Title")

[4]:

grouped["2022-01"].nsmallest(1)

[4]:

Title
Jupyter Tutorial  1    7716.0
PyViz Tutorial    2    3930.0
Python Basics     5     226.0
Name: 2022-01, dtype: float64

To use a custom aggregation function, pass any function that aggregates an array to the aggregate or agg method:

[5]:

def minmaxrange(arr):
    return arr.max() - arr.min()


grouped.agg(minmaxrange)

[5]:

	2021-12	2022-01	2022-02
Title
Jupyter Tutorial	24061.0	25579.0	13104.0
PyViz Tutorial	0.0	0.0	0.0
Python Basics	332.0	50.0	368.0

You will find that some methods like describe also work, even though they are not strictly speaking aggregations:

[6]:

grouped.describe()

[6]:

	2021-12								2022-01					2022-02
	count	mean	std	min	25%	50%	75%	max	count	mean	...	75%	max	count	mean	std	min	25%	50%	75%	max
Title
Jupyter Tutorial	2.0	18103.5	17013.696262	6073.0	12088.25	18103.5	24118.75	30134.0	2.0	20505.5	...	26900.25	33295.0	2.0	13099.0	9265.927261	6547.0	9823.0	13099.0	16375.0	19651.0
PyViz Tutorial	1.0	4873.0	NaN	4873.0	4873.00	4873.0	4873.00	4873.0	1.0	3930.0	...	3930.00	3930.0	1.0	2573.0	NaN	2573.0	2573.0	2573.0	2573.0	2573.0
Python Basics	2.0	261.0	234.759451	95.0	178.00	261.0	344.00	427.0	2.0	251.0	...	263.50	276.0	2.0	341.0	260.215295	157.0	249.0	341.0	433.0	525.0

3 rows × 24 columns

Note:

Custom aggregation functions are generally much slower than the optimised functions in the table above. This is because there is some extra work involved in creating the intermediate data sets for the group (function calls, reordering of data).

Additional functions column by column¶

As we have already seen, aggregating a Series or all columns of a DataFrame is a matter of using aggregate (or agg) with the desired function or calling a method such as mean or std. However, it is more common to aggregate simultaneously with another function depending on the column or with multiple functions.

[7]:

grouped.agg("mean")

[7]:

	2021-12	2022-01	2022-02
Title
Jupyter Tutorial	18103.5	20505.5	13099.0
PyViz Tutorial	4873.0	3930.0	2573.0
Python Basics	261.0	251.0	341.0

If you pass a list of functions or function names instead, you will get back a DataFrame with column names from the functions:

[8]:

grouped.agg(["mean", "std", minmaxrange])

[8]:

	2021-12			2022-01			2022-02
	mean	std	minmaxrange	mean	std	minmaxrange	mean	std	minmaxrange
Title
Jupyter Tutorial	18103.5	17013.696262	24061.0	20505.5	18087.084356	25579.0	13099.0	9265.927261	13104.0
PyViz Tutorial	4873.0	NaN	0.0	3930.0	NaN	0.0	2573.0	NaN	0.0
Python Basics	261.0	234.759451	332.0	251.0	35.355339	50.0	341.0	260.215295	368.0

Here we have passed agg a list of aggregation functions to be evaluated independently for the data groups.

You don’t need to accept the names that GroupBy gives to the columns; in particular, lambda functions have the name <lambda>, which makes them difficult to identify. When you pass a list of tuples, the first element of each tuple is used as the column name in the DataFrame:

[9]:

grouped.agg(
    [("Mean", "mean"), ("Standard deviation", "std"), ("Range", minmaxrange)],
)

[9]:

	2021-12			2022-01			2022-02
	Mean	Standard deviation	Range	Mean	Standard deviation	Range	Mean	Standard deviation	Range
Title
Jupyter Tutorial	18103.5	17013.696262	24061.0	20505.5	18087.084356	25579.0	13099.0	9265.927261	13104.0
PyViz Tutorial	4873.0	NaN	0.0	3930.0	NaN	0.0	2573.0	NaN	0.0
Python Basics	261.0	234.759451	332.0	251.0	35.355339	50.0	341.0	260.215295	368.0

With a DataFrame, you have the option of specifying a list of functions to be applied to all columns or to different functions per column. Let’s say we want to calculate the same three statistics for the columns:

[10]:

stats = ["count", "mean", "max"]

evaluations = grouped.agg(stats)

evaluations

[10]:

	2021-12			2022-01			2022-02
	count	mean	max	count	mean	max	count	mean	max
Title
Jupyter Tutorial	2	18103.5	30134.0	2	20505.5	33295.0	2	13099.0	19651.0
PyViz Tutorial	1	4873.0	4873.0	1	3930.0	3930.0	1	2573.0	2573.0
Python Basics	2	261.0	427.0	2	251.0	276.0	2	341.0	525.0

As you can see, the resulting DataFrame has hierarchical columns, just as you would get if you aggregated each column separately and used pandas.concat to join the results together, using the column names as key arguments:

[11]:

evaluations["2021-12"]

[11]:

	count	mean	max
Title
Jupyter Tutorial	2	18103.5	30134.0
PyViz Tutorial	1	4873.0	4873.0
Python Basics	2	261.0	427.0

As before, a list of tuples with user-defined names can be passed:

[12]:

tuples = [("Mean", "mean"), ("Variance", "var")]

grouped[["2021-12", "2022-01"]].agg(tuples)

[12]:

	2021-12		2022-01
	Mean	Variance	Mean	Variance
Title
Jupyter Tutorial	18103.5	289465860.5	20505.5	327142620.5
PyViz Tutorial	4873.0	NaN	3930.0	NaN
Python Basics	261.0	55112.0	251.0	1250.0

If we now assume that potentially different functions are to be applied to one or more of the columns, we pass a dict to agg that contains an assignment of column names to one of the function specifications:

[13]:

grouped.agg({"2021-12": "mean", "2022-01": "var"})

[13]:

	2021-12	2022-01
Title
Jupyter Tutorial	18103.5	327142620.5
PyViz Tutorial	4873.0	NaN
Python Basics	261.0	1250.0

[14]:

grouped.agg({"2021-12": ["min", "max", "mean", "std"], "2022-01": "sum"})

[14]:

	2021-12				2022-01
	min	max	mean	std	sum
Title
Jupyter Tutorial	6073.0	30134.0	18103.5	17013.696262	41011.0
PyViz Tutorial	4873.0	4873.0	4873.0	NaN	3930.0
Python Basics	95.0	427.0	261.0	234.759451	502.0

Return aggregated data without row indices¶

In all the examples so far, the aggregated data is returned with an index:

Since this is not always desired, you can disable this behaviour in most cases by passing as_index=False to groupby:

[15]:

grouped.agg([minmaxrange]).mean()

[15]:

2021-12  minmaxrange    8131.000000
2022-01  minmaxrange    8543.000000
2022-02  minmaxrange    4490.666667
dtype: float64

Since this is not always desired, you can disable this behaviour by using reset_index(drop=True):

[16]:

grouped.agg([minmaxrange]).mean().reset_index(drop=True)

[16]:

0    8131.000000
1    8543.000000
2    4490.666667
dtype: float64