Aggregation#
Aggregations refer to any data transformation that produces scalar values from arrays. In the previous examples, several of them were used, including count
and sum
. You may now be wondering what happens when you apply sum()
to a GroupBy
object. Optimised implementations exist for many common aggregations, such as the one in the following table. However, they are not limited to this set of methods.
Function name Description
Function name |
Description |
---|---|
|
Returns |
|
Number of non-NA values |
|
Cumulative minimum and maximum of the non-NA values |
|
Cumulative sum of the non-NA values |
|
Cumulative product of non-NA values |
|
First and last non-NA values |
|
Mean of the non-NA values |
|
Arithmetic median of the non-NA values |
|
Minimum and maximum of the non-NA values |
|
Retrieval of the nth largest value |
|
calculates the four open-high-low-close statistics for time series-like data |
|
Product of the non-NA values |
|
calculates the sample quantile |
|
Ordinal ranks of non-NA values, as when calling |
|
Sum of non-NA values |
|
Standard deviation and variance of the sample |
You can use your own aggregations and also call any method that is also defined for the grouped object. For example, the Series
method nsmallest
selects the smallest requested number of values from the data.
Although nsmallest
is not explicitly implemented for GroupBy
, we can still use it with a non-optimised implementation. Internally, GroupBy
decomposes the Series
, calls df.nsmallest(n)
for each part and then merges these results in the result object:
[1]:
import numpy as np
import pandas as pd
[2]:
df = pd.DataFrame(
{
"Title": [
"Jupyter Tutorial",
"Jupyter Tutorial",
"PyViz Tutorial",
None,
"Python Basics",
"Python Basics",
],
"2021-12": [30134, 6073, 4873, None, 427, 95],
"2022-01": [33295, 7716, 3930, None, 276, 226],
"2022-02": [19651, 6547, 2573, None, 525, 157],
}
)
df
[2]:
Title | 2021-12 | 2022-01 | 2022-02 | |
---|---|---|---|---|
0 | Jupyter Tutorial | 30134.0 | 33295.0 | 19651.0 |
1 | Jupyter Tutorial | 6073.0 | 7716.0 | 6547.0 |
2 | PyViz Tutorial | 4873.0 | 3930.0 | 2573.0 |
3 | None | NaN | NaN | NaN |
4 | Python Basics | 427.0 | 276.0 | 525.0 |
5 | Python Basics | 95.0 | 226.0 | 157.0 |
[3]:
grouped = df.groupby("Title")
[4]:
grouped["2022-01"].nsmallest(1)
[4]:
Title
Jupyter Tutorial 1 7716.0
PyViz Tutorial 2 3930.0
Python Basics 5 226.0
Name: 2022-01, dtype: float64
To use a custom aggregation function, pass any function that aggregates an array to the aggregate
or agg
method:
[5]:
def range(arr):
return arr.max() - arr.min()
grouped.agg(range)
[5]:
2021-12 | 2022-01 | 2022-02 | |
---|---|---|---|
Title | |||
Jupyter Tutorial | 24061.0 | 25579.0 | 13104.0 |
PyViz Tutorial | 0.0 | 0.0 | 0.0 |
Python Basics | 332.0 | 50.0 | 368.0 |
You will find that some methods like describe
also work, even though they are not strictly speaking aggregations:
[6]:
grouped.describe()
[6]:
2021-12 | 2022-01 | 2022-02 | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | mean | std | min | 25% | 50% | 75% | max | count | mean | ... | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | |
Title | |||||||||||||||||||||
Jupyter Tutorial | 2.0 | 18103.5 | 17013.696262 | 6073.0 | 12088.25 | 18103.5 | 24118.75 | 30134.0 | 2.0 | 20505.5 | ... | 26900.25 | 33295.0 | 2.0 | 13099.0 | 9265.927261 | 6547.0 | 9823.0 | 13099.0 | 16375.0 | 19651.0 |
PyViz Tutorial | 1.0 | 4873.0 | NaN | 4873.0 | 4873.00 | 4873.0 | 4873.00 | 4873.0 | 1.0 | 3930.0 | ... | 3930.00 | 3930.0 | 1.0 | 2573.0 | NaN | 2573.0 | 2573.0 | 2573.0 | 2573.0 | 2573.0 |
Python Basics | 2.0 | 261.0 | 234.759451 | 95.0 | 178.00 | 261.0 | 344.00 | 427.0 | 2.0 | 251.0 | ... | 263.50 | 276.0 | 2.0 | 341.0 | 260.215295 | 157.0 | 249.0 | 341.0 | 433.0 | 525.0 |
3 rows × 24 columns
Note:
Custom aggregation functions are generally much slower than the optimised functions in the table above. This is because there is some extra work involved in creating the intermediate data sets for the group (function calls, reordering of data).
Additional functions column by column#
As we have already seen, aggregating a Series
or all columns of a DataFrame
is a matter of using aggregate
(or agg
) with the desired function or calling a method such as mean
or std
. However, it is more common to aggregate simultaneously with another function depending on the column or with multiple functions.
[7]:
grouped.agg("mean")
[7]:
2021-12 | 2022-01 | 2022-02 | |
---|---|---|---|
Title | |||
Jupyter Tutorial | 18103.5 | 20505.5 | 13099.0 |
PyViz Tutorial | 4873.0 | 3930.0 | 2573.0 |
Python Basics | 261.0 | 251.0 | 341.0 |
If you pass a list of functions or function names instead, you will get back a DataFrame
with column names from the functions:
[8]:
grouped.agg(["mean", "std", range])
[8]:
2021-12 | 2022-01 | 2022-02 | |||||||
---|---|---|---|---|---|---|---|---|---|
mean | std | range | mean | std | range | mean | std | range | |
Title | |||||||||
Jupyter Tutorial | 18103.5 | 17013.696262 | 24061.0 | 20505.5 | 18087.084356 | 25579.0 | 13099.0 | 9265.927261 | 13104.0 |
PyViz Tutorial | 4873.0 | NaN | 0.0 | 3930.0 | NaN | 0.0 | 2573.0 | NaN | 0.0 |
Python Basics | 261.0 | 234.759451 | 332.0 | 251.0 | 35.355339 | 50.0 | 341.0 | 260.215295 | 368.0 |
Here we have passed agg
a list of aggregation functions to be evaluated independently for the data groups.
You don’t need to accept the names that GroupBy
gives to the columns; in particular, lambda functions have the name <lambda>
, which makes them difficult to identify. When you pass a list of tuples, the first element of each tuple is used as the column name in the DataFrame:
[9]:
grouped.agg(
[("Mean", "mean"), ("Standard deviation", "std"), ("Range", range)]
)
[9]:
2021-12 | 2022-01 | 2022-02 | |||||||
---|---|---|---|---|---|---|---|---|---|
Mean | Standard deviation | Range | Mean | Standard deviation | Range | Mean | Standard deviation | Range | |
Title | |||||||||
Jupyter Tutorial | 18103.5 | 17013.696262 | 24061.0 | 20505.5 | 18087.084356 | 25579.0 | 13099.0 | 9265.927261 | 13104.0 |
PyViz Tutorial | 4873.0 | NaN | 0.0 | 3930.0 | NaN | 0.0 | 2573.0 | NaN | 0.0 |
Python Basics | 261.0 | 234.759451 | 332.0 | 251.0 | 35.355339 | 50.0 | 341.0 | 260.215295 | 368.0 |
With a DataFrame
, you have the option of specifying a list of functions to be applied to all columns or to different functions per column. Let’s say we want to calculate the same three statistics for the columns:
[10]:
stats = ["count", "mean", "max"]
evaluations = grouped.agg(stats)
evaluations
[10]:
2021-12 | 2022-01 | 2022-02 | |||||||
---|---|---|---|---|---|---|---|---|---|
count | mean | max | count | mean | max | count | mean | max | |
Title | |||||||||
Jupyter Tutorial | 2 | 18103.5 | 30134.0 | 2 | 20505.5 | 33295.0 | 2 | 13099.0 | 19651.0 |
PyViz Tutorial | 1 | 4873.0 | 4873.0 | 1 | 3930.0 | 3930.0 | 1 | 2573.0 | 2573.0 |
Python Basics | 2 | 261.0 | 427.0 | 2 | 251.0 | 276.0 | 2 | 341.0 | 525.0 |
As you can see, the resulting DataFrame
has hierarchical columns, just as you would get if you aggregated each column separately and used pandas.concat to join the results together, using the column names as key arguments:
[11]:
evaluations["2021-12"]
[11]:
count | mean | max | |
---|---|---|---|
Title | |||
Jupyter Tutorial | 2 | 18103.5 | 30134.0 |
PyViz Tutorial | 1 | 4873.0 | 4873.0 |
Python Basics | 2 | 261.0 | 427.0 |
As before, a list of tuples with user-defined names can be passed:
[12]:
tuples = [("Mean", "mean"), ("Variance", np.var)]
grouped[["2021-12", "2022-01"]].agg(tuples)
[12]:
2021-12 | 2022-01 | |||
---|---|---|---|---|
Mean | Variance | Mean | Variance | |
Title | ||||
Jupyter Tutorial | 18103.5 | 289465860.5 | 20505.5 | 327142620.5 |
PyViz Tutorial | 4873.0 | NaN | 3930.0 | NaN |
Python Basics | 261.0 | 55112.0 | 251.0 | 1250.0 |
If we now assume that potentially different functions are to be applied to one or more of the columns, we pass a dict
to agg
that contains an assignment of column names to one of the function specifications:
[13]:
grouped.agg({"2021-12": "mean", "2022-01": np.var})
[13]:
2021-12 | 2022-01 | |
---|---|---|
Title | ||
Jupyter Tutorial | 18103.5 | 327142620.5 |
PyViz Tutorial | 4873.0 | NaN |
Python Basics | 261.0 | 1250.0 |
[14]:
grouped.agg({"2021-12": ["min", "max", "mean", "std"], "2022-01": "sum"})
[14]:
2021-12 | 2022-01 | ||||
---|---|---|---|---|---|
min | max | mean | std | sum | |
Title | |||||
Jupyter Tutorial | 6073.0 | 30134.0 | 18103.5 | 17013.696262 | 41011.0 |
PyViz Tutorial | 4873.0 | 4873.0 | 4873.0 | NaN | 3930.0 |
Python Basics | 95.0 | 427.0 | 261.0 | 234.759451 | 502.0 |
Return aggregated data without row indices#
In all the examples so far, the aggregated data is returned with an index. Since this is not always desired, you can disable this behaviour in most cases by passing as_index=False
to groupby
:
[15]:
grouped.agg([range], as_index=False).mean()
[15]:
2021-12 8131.000000
2022-01 8543.000000
2022-02 4490.666667
dtype: float64
By using the method as_index=False
, some unnecessary calculations are avoided. Of course, it is always possible to get the result back with index by calling reset_index
for the result.