Group operations¶
The term groupby describes a process that involves one or more of the following steps:
Split divides the data into groups according to certain criteria
Apply applies a function independently to each group
Combine combines the results in a data structure
In the first phase of the process, the data contained in a pandas object, be it a Series, a DataFrame or something else, is split into groups based on one or more keys. The splitting is performed on a particular axis of the object; for example, a DataFrame can be grouped by its rows (axis=0) or by its columns (axis=1). Then a function is applied to each group, producing a new value. Finally, the results of all these function applications are combined into a result object. The shape of the result object usually depends on what is done with the data.
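To make these three phases a little more concrete before moving on, here is a minimal sketch that performs them by hand on a toy Series and then lets groupby do the same work in a single step (the names ratings, pieces, applied and combined are purely illustrative):

import pandas as pd

ratings = pd.Series([4, 3, 5, 2], index=["a", "b", "a", "b"])

# Split: collect the values that belong to each key
pieces = {key: ratings[ratings.index == key] for key in ratings.index.unique()}
# Apply: run a function on each piece independently, here the sum
applied = {key: piece.sum() for key, piece in pieces.items()}
# Combine: gather the per-group results into a new pandas object
combined = pd.Series(applied)

# groupby performs the same split-apply-combine in a single step
ratings.groupby(level=0).sum()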
Each grouping key can take many forms, and the keys do not all have to be of the same type:
a list or array of values that have the same length as the axis being grouped
a value that specifies a column name in a DataFrame
a dict or series that is a correspondence between the values on the axis being grouped and the group names
a function that is called on the axis index or the individual labels in the index
Note:
The latter three methods are shortcuts to create an array of values that will be used to divide the object.
Don’t worry if this all seems abstract. Throughout this chapter I will give many examples of all these methods. For starters, here is a small table dataset as a DataFrame:
[1]:
import numpy as np
import pandas as pd
[2]:
df = pd.DataFrame(
    {
        "Title": [
            "Jupyter Tutorial",
            "Jupyter Tutorial",
            "PyViz Tutorial",
            None,
            "Python Basics",
            "Python Basics",
        ],
        "Language": ["de", "en", "de", None, "de", "en"],
        "2021-12": [19651, 4722, 2573, None, 525, 157],
        "2022-01": [30134, 3497, 4873, None, 427, 85],
        "2022-02": [33295, 4009, 3930, None, 276, 226],
    }
)
df
[2]:
  | Title | Language | 2021-12 | 2022-01 | 2022-02
---|---|---|---|---|---
0 | Jupyter Tutorial | de | 19651.0 | 30134.0 | 33295.0
1 | Jupyter Tutorial | en | 4722.0 | 3497.0 | 4009.0
2 | PyViz Tutorial | de | 2573.0 | 4873.0 | 3930.0
3 | None | None | NaN | NaN | NaN
4 | Python Basics | de | 525.0 | 427.0 | 276.0
5 | Python Basics | en | 157.0 | 85.0 | 226.0
Suppose you want to calculate the sum of the 2022-02 column using the labels from Title. There are several ways to do this. One is to access 2022-02 and call groupby with the column (a Series) in Title:
[3]:
grouped = df["2022-02"].groupby(df["Title"])
grouped
[3]:
<pandas.core.groupby.generic.SeriesGroupBy object at 0x11f32c8d0>
This grouped variable is now a special SeriesGroupBy object. It has not yet calculated anything except some intermediate data about the group key df['Title']. The idea is that this object has all the information needed to apply an operation to each of the groups. For example, to calculate the group sums, we can call the sum method of the GroupBy object:
[4]:
grouped.sum()
[4]:
Title
Jupyter Tutorial 37304.0
PyViz Tutorial 3930.0
Python Basics 502.0
Name: 2022-02, dtype: float64
Later I will explain more about what happens when you call .sum(). The important thing to note here is that the data (a Series) has been aggregated by splitting it along the group key, producing a new Series that is now indexed by the unique values in the Title column. The resulting index is named Title because groupby(df['Title']) did this.
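sum is only one of many aggregations that can be applied to the same GroupBy object; methods such as mean, count or max work in exactly the same way. A short sketch, not executed here:

grouped.mean()   # average hits per title
grouped.count()  # number of non-missing values per title
grouped.max()    # largest monthly value per title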
If we had passed multiple arrays as a list instead, we would get something different:
[5]:
sums = df["2021-12"].groupby([df["Language"], df["Title"]]).sum()
sums
[5]:
Language Title
de Jupyter Tutorial 19651.0
PyViz Tutorial 2573.0
Python Basics 525.0
en Jupyter Tutorial 4722.0
Python Basics 157.0
Name: 2021-12, dtype: float64
Here we have grouped the data based on two keys, and the resulting series now has a hierarchical index consisting of the observed unique key pairs:
[6]:
sums.unstack()
[6]:
Title | Jupyter Tutorial | PyViz Tutorial | Python Basics
---|---|---|---
Language | | |
de | 19651.0 | 2573.0 | 525.0
en | 4722.0 | NaN | 157.0
Often the grouping information is in the same DataFrame as the data you want to work with. In this case, you can pass column names (whether they are strings, numbers or other Python objects) as group keys:
[7]:
df.groupby("Title").sum()
[7]:
Title | Language | 2021-12 | 2022-01 | 2022-02
---|---|---|---|---
Jupyter Tutorial | deen | 24373.0 | 33631.0 | 37304.0
PyViz Tutorial | de | 2573.0 | 4873.0 | 3930.0
Python Basics | deen | 682.0 | 512.0 | 502.0
Here it is noticeable that the result still contains a Language column: since df['Language'] holds strings, sum simply concatenates the values within each group, for example deen for Jupyter Tutorial. By default, all columns are aggregated; if you only want the numeric ones, you have to say so explicitly.
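One way is the numeric_only keyword of the sum aggregation; selecting only the desired columns, as shown later in this chapter, works as well. A minimal sketch, not executed here:

# Leave the non-numeric Language column out of the aggregation
df.groupby("Title").sum(numeric_only=True)

If Language is made part of the group keys instead, as in the next cell, it moves into the index and is not aggregated at all.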
[8]:
df.groupby(["Title","Language"]).sum()
[8]:
Title | Language | 2021-12 | 2022-01 | 2022-02
---|---|---|---|---
Jupyter Tutorial | de | 19651.0 | 30134.0 | 33295.0
  | en | 4722.0 | 3497.0 | 4009.0
PyViz Tutorial | de | 2573.0 | 4873.0 | 3930.0
Python Basics | de | 525.0 | 427.0 | 276.0
  | en | 157.0 | 85.0 | 226.0
Regardless of the goal of using groupby, a generally useful groupby method is size, which returns a Series with the group sizes:
[9]:
df.groupby(["Language"]).size()
[9]:
Language
de 3
en 2
dtype: int64
Note:
All missing values in a group key are excluded from the result by default. This behaviour can be disabled by passing dropna=False to groupby:
[10]:
df.groupby("Language", dropna=False).size()
[10]:
Language
de 3
en 2
NaN 1
dtype: int64
[11]:
df.groupby(["Title", "Language"], dropna=False).size()
[11]:
Title Language
Jupyter Tutorial de 1
en 1
PyViz Tutorial de 1
Python Basics de 1
en 1
NaN NaN 1
dtype: int64
Iteration over groups¶
The object returned by groupby supports iteration: it produces a sequence of 2-tuples containing the group name together with the corresponding chunk of data. Consider the following:
[12]:
for name, group in df.groupby("Title"):
    print(name)
    print(group)
Jupyter Tutorial
Title Language 2021-12 2022-01 2022-02
0 Jupyter Tutorial de 19651.0 30134.0 33295.0
1 Jupyter Tutorial en 4722.0 3497.0 4009.0
PyViz Tutorial
Title Language 2021-12 2022-01 2022-02
2 PyViz Tutorial de 2573.0 4873.0 3930.0
Python Basics
Title Language 2021-12 2022-01 2022-02
4 Python Basics de 525.0 427.0 276.0
5 Python Basics en 157.0 85.0 226.0
With multiple keys, the first element of the tuple is a tuple of key values:
[13]:
for (i1, i2), group in df.groupby(["Title", "Language"]):
    print((i1, i2))
    print(group)
('Jupyter Tutorial', 'de')
Title Language 2021-12 2022-01 2022-02
0 Jupyter Tutorial de 19651.0 30134.0 33295.0
('Jupyter Tutorial', 'en')
Title Language 2021-12 2022-01 2022-02
1 Jupyter Tutorial en 4722.0 3497.0 4009.0
('PyViz Tutorial', 'de')
Title Language 2021-12 2022-01 2022-02
2 PyViz Tutorial de 2573.0 4873.0 3930.0
('Python Basics', 'de')
Title Language 2021-12 2022-01 2022-02
4 Python Basics de 525.0 427.0 276.0
('Python Basics', 'en')
Title Language 2021-12 2022-01 2022-02
5 Python Basics en 157.0 85.0 226.0
Next, we want to build a dict of the data pieces as a one-liner:
[14]:
books = dict(list(df.groupby("Title")))
books
[14]:
{'Jupyter Tutorial': Title Language 2021-12 2022-01 2022-02
0 Jupyter Tutorial de 19651.0 30134.0 33295.0
1 Jupyter Tutorial en 4722.0 3497.0 4009.0,
'PyViz Tutorial': Title Language 2021-12 2022-01 2022-02
2 PyViz Tutorial de 2573.0 4873.0 3930.0,
'Python Basics': Title Language 2021-12 2022-01 2022-02
4 Python Basics de 525.0 427.0 276.0
5 Python Basics en 157.0 85.0 226.0}
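If only a single group is needed, it is not necessary to build the whole dict: get_group returns the corresponding chunk directly. A short sketch, not executed here:

df.groupby("Title").get_group("PyViz Tutorial")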
By default, groupby groups on axis=0, but you can also group on any of the other axes. For example, we could group the columns of our example df here by dtype as follows:
[15]:
df.dtypes
[15]:
Title object
Language object
2021-12 float64
2022-01 float64
2022-02 float64
dtype: object
[16]:
grouped = df.groupby(df.dtypes, axis=1)
[17]:
for dtype, group in grouped:
    print(dtype)
    print(group)
float64
2021-12 2022-01 2022-02
0 19651.0 30134.0 33295.0
1 4722.0 3497.0 4009.0
2 2573.0 4873.0 3930.0
3 NaN NaN NaN
4 525.0 427.0 276.0
5 157.0 85.0 226.0
object
Title Language
0 Jupyter Tutorial de
1 Jupyter Tutorial en
2 PyViz Tutorial de
3 None None
4 Python Basics de
5 Python Basics en
Selecting a column or subset of columns¶
Indexing a GroupBy object created from a DataFrame with a column name or an array of column names has the effect of selecting those columns for aggregation. This means that:
[18]:
df.groupby("Title")["2021-12"]
df.groupby("Title")[["2022-01"]]
[18]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11f2f08d0>
are shorthand for:
[19]:
df["2021-12"].groupby(df["Title"])
df[["2022-01"]].groupby(df["Title"])
[19]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11f362a50>
Especially for large datasets, it may be desirable to aggregate only some columns. For example, to calculate the sum for only the 2022-01 column in the previous dataset and get the result as a DataFrame, we could write:
[20]:
df.groupby(["Title", "Language"])[["2022-01"]].sum()
[20]:
Title | Language | 2022-01
---|---|---
Jupyter Tutorial | de | 30134.0
  | en | 3497.0
PyViz Tutorial | de | 4873.0
Python Basics | de | 427.0
  | en | 85.0
The object returned by this indexing operation is a grouped DataFrame if a list or array is passed, or a grouped series if only a single column name is passed as a scalar:
[21]:
series_grouped = df.groupby(["Title", "Language"])["2022-01"]
series_grouped
[21]:
<pandas.core.groupby.generic.SeriesGroupBy object at 0x11f36d150>
[22]:
series_grouped.sum()
[22]:
Title Language
Jupyter Tutorial de 30134.0
en 3497.0
PyViz Tutorial de 4873.0
Python Basics de 427.0
en 85.0
Name: 2022-01, dtype: float64
Grouping with dicts and Series¶
Grouping information can also be in a form other than an array:
[23]:
df.iloc[2:3, [2, 3]] = np.nan
Suppose I have a group correspondence for the columns and want to sum the columns by group:
[24]:
mapping = {
    "2021-12": "Dec 2021",
    "2022-01": "Jan 2022",
    "2022-02": "Feb 2022",
}
Now an array could be constructed from this dict to pass to groupby, but instead we can just pass the dict:
[25]:
by_column = df.groupby(mapping, axis=1)
by_column.sum()
[25]:
  | Dec 2021 | Feb 2022 | Jan 2022
---|---|---|---
0 | 19651.0 | 33295.0 | 30134.0
1 | 4722.0 | 4009.0 | 3497.0
2 | 0.0 | 3930.0 | 0.0
3 | 0.0 | 0.0 | 0.0
4 | 525.0 | 276.0 | 427.0
5 | 157.0 | 226.0 | 85.0
The same functionality applies to Series:
[26]:
map_series = pd.Series(mapping)
map_series
[26]:
2021-12 Dec 2021
2022-01 Jan 2022
2022-02 Feb 2022
dtype: object
[27]:
df.groupby(map_series, axis=1).sum()
[27]:
  | Dec 2021 | Feb 2022 | Jan 2022
---|---|---|---
0 | 19651.0 | 33295.0 | 30134.0
1 | 4722.0 | 4009.0 | 3497.0
2 | 0.0 | 3930.0 | 0.0
3 | 0.0 | 0.0 | 0.0
4 | 525.0 | 276.0 | 427.0
5 | 157.0 | 226.0 | 85.0
Grouping with functions¶
Using Python functions is a more general way of defining a group assignment than a dict or Series. Each function passed as a group key is called once per index value, and the return values are used as the group names. Specifically, consider the example data from the previous sections, this time with the titles as index values. If you want to group by the length of the names, you could compute an array with the lengths of the strings, but it is easier to simply pass the len function:
[28]:
df = pd.DataFrame(
    [
        [19651, 30134, 33295],
        [4722, 3497, 4009],
        [2573, 4873, 3930],
        [525, 427, 276],
        [157, 85, 226],
    ],
    index=[
        "Jupyter Tutorial",
        "Jupyter Tutorial",
        "PyViz Tutorial",
        "Python Basics",
        "Python Basics",
    ],
    columns=["2021-12", "2022-01", "2022-02"],
)
[29]:
df.groupby(len).count()
[29]:
  | 2021-12 | 2022-01 | 2022-02
---|---|---|---
13 | 2 | 2 | 2
14 | 1 | 1 | 1
16 | 2 | 2 | 2
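Passing len is just a shortcut for computing the lengths yourself and passing them as an array; the name name_lengths in the following sketch is purely illustrative:

# Equivalent to df.groupby(len).count()
name_lengths = [len(name) for name in df.index]
df.groupby(name_lengths).count()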
Mixing functions with arrays, dicts or series is no problem, as everything is converted internally into arrays:
[30]:
languages = ["de", "en", "de", "de", "en"]
[31]:
df.groupby([len, languages]).count()
[31]:
  |   | 2021-12 | 2022-01 | 2022-02
---|---|---|---|---
13 | de | 1 | 1 | 1
  | en | 1 | 1 | 1
14 | de | 1 | 1 | 1
16 | de | 1 | 1 | 1
  | en | 1 | 1 | 1
Grouping by index levels¶
A final practical feature for hierarchically indexed datasets is the ability to aggregate by one of the index levels of an axis. Let’s look at an example:
[32]:
version_hits = [
    [19651, 0, 30134, 0, 33295, 0],
    [4722, 1825, 3497, 2576, 4009, 3707],
    [2573, 0, 4873, 0, 3930, 0],
    [None, None, None, None, None, None],
    [525, 0, 427, 0, 276, 0],
    [157, 0, 85, 0, 226, 0],
]
df = pd.DataFrame(
    version_hits,
    index=[
        [
            "Jupyter Tutorial",
            "Jupyter Tutorial",
            "PyViz Tutorial",
            None,
            "Python Basics",
            "Python Basics",
        ],
        ["de", "en", "de", None, "de", "en"],
    ],
    columns=[
        ["2021-12", "2021-12", "2022-01", "2022-01", "2022-02", "2022-02"],
        ["latest", "stable", "latest", "stable", "latest", "stable"],
    ],
)
df.columns.names = ["Month", "Version"]
df
[32]:
Month | | 2021-12 | 2021-12 | 2022-01 | 2022-01 | 2022-02 | 2022-02
Version | | latest | stable | latest | stable | latest | stable
---|---|---|---|---|---|---|---
Jupyter Tutorial | de | 19651.0 | 0.0 | 30134.0 | 0.0 | 33295.0 | 0.0
  | en | 4722.0 | 1825.0 | 3497.0 | 2576.0 | 4009.0 | 3707.0
PyViz Tutorial | de | 2573.0 | 0.0 | 4873.0 | 0.0 | 3930.0 | 0.0
NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
Python Basics | de | 525.0 | 0.0 | 427.0 | 0.0 | 276.0 | 0.0
  | en | 157.0 | 0.0 | 85.0 | 0.0 | 226.0 | 0.0
[33]:
df.groupby(level="Month", axis=1).sum()
[33]:
Month | | 2021-12 | 2022-01 | 2022-02
---|---|---|---|---
Jupyter Tutorial | de | 19651.0 | 30134.0 | 33295.0
  | en | 6547.0 | 6073.0 | 7716.0
PyViz Tutorial | de | 2573.0 | 4873.0 | 3930.0
NaN | NaN | 0.0 | 0.0 | 0.0
Python Basics | de | 525.0 | 427.0 | 276.0
  | en | 157.0 | 85.0 | 226.0
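Grouping by an index level works on the rows as well. The row index levels in this example have no names, so they are addressed by position; a short sketch, not executed here:

# Sum all versions and months per title by grouping on the outer row index level
df.groupby(level=0).sum()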