Group operations

By groupby is meant a process that involves one or more of the following steps:

  • Split divides the data into groups according to certain criteria

  • Apply applies a function independently to each group

  • Combine combines the results in a data structure

In the first phase of the process, the data contained in a pandas object, be it a Series, a DataFrame or something else, is split into groups based on one or more keys. The division is done on a particular axis of an object. For example, a DataFrame can be grouped by its rows (axis=0) or its columns (axis=1). Then, a function is applied to each group to create a new value. Finally, the results of all these function applications are combined in a result object. The shape of the result object usually depends on what is done with the data.

Each grouping key can take many forms, and the keys do not all have to be of the same type:

  • a list or array of values that have the same length as the axis being grouped

  • a value that specifies a column name in a DataFrame

  • a dict or series that is a correspondence between the values on the axis being grouped and the group names

  • a function that is called on the axis index or the individual labels in the index

Note:

The latter three methods are shortcuts to create an array of values that will be used to divide the object.

Don’t worry if this all seems abstract. Throughout this chapter I will give many examples of all these methods. For starters, here is a small table dataset as a DataFrame:

[1]:
import numpy as np
import pandas as pd
[2]:
df = pd.DataFrame(
    {
        "Title": [
            "Jupyter Tutorial",
            "Jupyter Tutorial",
            "PyViz Tutorial",
            None,
            "Python Basics",
            "Python Basics",
        ],
        "Language": ["de", "en", "de", None, "de", "en"],
        "2021-12": [19651, 4722, 2573, None, 525, 157],
        "2022-01": [30134, 3497, 4873, None, 427, 85],
        "2022-02": [33295, 4009, 3930, None, 276, 226],
    }
)

df
[2]:
Title Language 2021-12 2022-01 2022-02
0 Jupyter Tutorial de 19651.0 30134.0 33295.0
1 Jupyter Tutorial en 4722.0 3497.0 4009.0
2 PyViz Tutorial de 2573.0 4873.0 3930.0
3 None None NaN NaN NaN
4 Python Basics de 525.0 427.0 276.0
5 Python Basics en 157.0 85.0 226.0

Suppose you want to calculate the sum of column 02/2022 using the labels of Title. There are several ways to do this. One is to access 02/2022 and call groupby with the column (a Series) in Title:

[3]:
grouped = df["2022-02"].groupby(df["Title"])

grouped
[3]:
<pandas.core.groupby.generic.SeriesGroupBy object at 0x11f32c8d0>

This grouped variable is now a special SeriesGroupBy object. It has not yet calculated anything except some intermediate data about the group key df['Title']. The idea is that this object has all the information needed to apply an operation to each of the groups. For example, to calculate the group averages, we can call the sum method of the GroupBy object:

[4]:
grouped.sum()
[4]:
Title
Jupyter Tutorial    37304.0
PyViz Tutorial       3930.0
Python Basics         502.0
Name: 2022-02, dtype: float64

Later I will explain more about what happens when you call .sum(). The important thing to note here is that the data (a row) has been aggregated by splitting the data across the group key, creating a new row that is now indexed by the unique values in the Title column. The resulting index is Title because groupby(df['Title'] did this.

If we had passed multiple arrays as a list instead, we would get something different:

[5]:
sums = df["2021-12"].groupby([df["Language"], df["Title"]]).sum()

sums
[5]:
Language  Title
de        Jupyter Tutorial    19651.0
          PyViz Tutorial       2573.0
          Python Basics         525.0
en        Jupyter Tutorial     4722.0
          Python Basics         157.0
Name: 2021-12, dtype: float64

Here we have grouped the data based on two keys, and the resulting series now has a hierarchical index consisting of the observed unique key pairs:

[6]:
sums.unstack()
[6]:
Title Jupyter Tutorial PyViz Tutorial Python Basics
Language
de 19651.0 2573.0 525.0
en 4722.0 NaN 157.0

Often the grouping information is in the same DataFrame as the data you want to edit. In this case, you can pass column names (whether they are strings, numbers or other Python objects) as group keys:

[7]:
df.groupby("Title").sum()
[7]:
Language 2021-12 2022-01 2022-02
Title
Jupyter Tutorial deen 24373.0 33631.0 37304.0
PyViz Tutorial de 2573.0 4873.0 3930.0
Python Basics deen 682.0 512.0 502.0

Here it is noticeable that the result does not contain a Language column. Since df['Language'] is not numeric data, it interferes with the table layout and is therefore automatically excluded from the result. By default, all numeric columns are aggregated.

[8]:
df.groupby(["Title","Language"]).sum()
[8]:
2021-12 2022-01 2022-02
Title Language
Jupyter Tutorial de 19651.0 30134.0 33295.0
en 4722.0 3497.0 4009.0
PyViz Tutorial de 2573.0 4873.0 3930.0
Python Basics de 525.0 427.0 276.0
en 157.0 85.0 226.0

Regardless of the goal of using groupby, a generally useful groupby method is size, which returns a series with the group sizes:

[9]:
df.groupby(["Language"]).size()
[9]:
Language
de    3
en    2
dtype: int64

Note:

All missing values in a group key are excluded from the result by default. This behaviour can be disabled by passing dropna=False to groupby:

[10]:
df.groupby("Language", dropna=False).size()
[10]:
Language
de     3
en     2
NaN    1
dtype: int64
[11]:
df.groupby(["Title", "Language"], dropna=False).size()
[11]:
Title             Language
Jupyter Tutorial  de          1
                  en          1
PyViz Tutorial    de          1
Python Basics     de          1
                  en          1
NaN               NaN         1
dtype: int64

Iteration over groups

The object returned by groupby supports iteration and produces a sequence of 2-tuples containing the group name along with the data packet. Consider the following:

[12]:
for name, group in df.groupby("Title"):
    print(name)
    print(group)
Jupyter Tutorial
              Title Language  2021-12  2022-01  2022-02
0  Jupyter Tutorial       de  19651.0  30134.0  33295.0
1  Jupyter Tutorial       en   4722.0   3497.0   4009.0
PyViz Tutorial
            Title Language  2021-12  2022-01  2022-02
2  PyViz Tutorial       de   2573.0   4873.0   3930.0
Python Basics
           Title Language  2021-12  2022-01  2022-02
4  Python Basics       de    525.0    427.0    276.0
5  Python Basics       en    157.0     85.0    226.0

With multiple keys, the first element of the tuple is a tuple of key values:

[13]:
for (i1, i2), group in df.groupby(["Title", "Language"]):
    print((i1, i2))
    print(group)
('Jupyter Tutorial', 'de')
              Title Language  2021-12  2022-01  2022-02
0  Jupyter Tutorial       de  19651.0  30134.0  33295.0
('Jupyter Tutorial', 'en')
              Title Language  2021-12  2022-01  2022-02
1  Jupyter Tutorial       en   4722.0   3497.0   4009.0
('PyViz Tutorial', 'de')
            Title Language  2021-12  2022-01  2022-02
2  PyViz Tutorial       de   2573.0   4873.0   3930.0
('Python Basics', 'de')
           Title Language  2021-12  2022-01  2022-02
4  Python Basics       de    525.0    427.0    276.0
('Python Basics', 'en')
           Title Language  2021-12  2022-01  2022-02
5  Python Basics       en    157.0     85.0    226.0

Next, we want to output a dict of the data as a one-liner:

[14]:
books = dict(list(df.groupby("Title")))

books
[14]:
{'Jupyter Tutorial':               Title Language  2021-12  2022-01  2022-02
 0  Jupyter Tutorial       de  19651.0  30134.0  33295.0
 1  Jupyter Tutorial       en   4722.0   3497.0   4009.0,
 'PyViz Tutorial':             Title Language  2021-12  2022-01  2022-02
 2  PyViz Tutorial       de   2573.0   4873.0   3930.0,
 'Python Basics':            Title Language  2021-12  2022-01  2022-02
 4  Python Basics       de    525.0    427.0    276.0
 5  Python Basics       en    157.0     85.0    226.0}

By default, groupby groups on axis=0, but you can also group on any of the other axes. For example, we could group the columns of our example df here by dtype as follows:

[15]:
df.dtypes
[15]:
Title        object
Language     object
2021-12     float64
2022-01     float64
2022-02     float64
dtype: object
[16]:
grouped = df.groupby(df.dtypes, axis=1)
[17]:
for dtype, group in grouped:
    print(dtype)
    print(group)
float64
   2021-12  2022-01  2022-02
0  19651.0  30134.0  33295.0
1   4722.0   3497.0   4009.0
2   2573.0   4873.0   3930.0
3      NaN      NaN      NaN
4    525.0    427.0    276.0
5    157.0     85.0    226.0
object
              Title Language
0  Jupyter Tutorial       de
1  Jupyter Tutorial       en
2    PyViz Tutorial       de
3              None     None
4     Python Basics       de
5     Python Basics       en

Selecting a column or subset of columns

Indexing a GroupBy object created from a DataFrame with a column name or an array of column names has the effect of subdividing columns for aggregation. This means that:

[18]:
df.groupby("Title")["2021-12"]
df.groupby("Title")[["2022-01"]]
[18]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11f2f08d0>

are simplified spellings for:

[19]:
df["2021-12"].groupby(df["Title"])
df[["2022-01"]].groupby(df["Title"])
[19]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11f362a50>

Especially for large datasets, it may be desirable to aggregate only some columns. For example, to calculate the sum for only column 01/2022 in the previous dataset and get the result as a DataFrame, we could write:

[20]:
df.groupby(["Title", "Language"])[["2022-01"]].sum()
[20]:
2022-01
Title Language
Jupyter Tutorial de 30134.0
en 3497.0
PyViz Tutorial de 4873.0
Python Basics de 427.0
en 85.0

The object returned by this indexing operation is a grouped DataFrame if a list or array is passed, or a grouped series if only a single column name is passed as a scalar:

[21]:
series_grouped = df.groupby(["Title", "Language"])["2022-01"]

series_grouped
[21]:
<pandas.core.groupby.generic.SeriesGroupBy object at 0x11f36d150>
[22]:
series_grouped.sum()
[22]:
Title             Language
Jupyter Tutorial  de          30134.0
                  en           3497.0
PyViz Tutorial    de           4873.0
Python Basics     de            427.0
                  en             85.0
Name: 2022-01, dtype: float64

Grouping with dicts and series

Grouping information can also be in a form other than an array:

[23]:
df.iloc[2:3, [2, 3]] = np.nan

Suppose I have a group correspondence for the columns and want to group the columns together by group:

[24]:
mapping = {"2021-12": "Dec 2021",
           "2022-01": "Jan 2022",
           "2022-02": "Feb 2022"}

Now an array could be constructed from this dict to pass to groupby, but instead we can just pass the dict:

[25]:
by_column = df.groupby(mapping, axis=1)

by_column.sum()
[25]:
Dec 2021 Feb 2022 Jan 2022
0 19651.0 33295.0 30134.0
1 4722.0 4009.0 3497.0
2 0.0 3930.0 0.0
3 0.0 0.0 0.0
4 525.0 276.0 427.0
5 157.0 226.0 85.0

The same functionality applies to Series:

[26]:
map_series = pd.Series(mapping)

map_series
[26]:
2021-12    Dec 2021
2022-01    Jan 2022
2022-02    Feb 2022
dtype: object
[27]:
df.groupby(map_series, axis=1).sum()
[27]:
Dec 2021 Feb 2022 Jan 2022
0 19651.0 33295.0 30134.0
1 4722.0 4009.0 3497.0
2 0.0 3930.0 0.0
3 0.0 0.0 0.0
4 525.0 276.0 427.0
5 157.0 226.0 85.0

Grouping with Functions

Using Python functions is a more general method of defining a group assignment compared to a Dict or Series. Each function passed as a group key is called once per index value, with the return values used as group names. Specifically, consider the example DataFrame from the previous section, which contains the titles as index values. Suppose If you want to group by the length of the names, you can calculate an array with the lengths of the strings, but it is easier to pass the len function:

[28]:
df = pd.DataFrame(
    [
        [19651, 30134, 33295],
        [4722, 3497, 4009],
        [2573, 4873, 3930],
        [525, 427, 276],
        [157, 85, 226],
    ],
    index=[
        "Jupyter Tutorial",
        "Jupyter Tutorial",
        "PyViz Tutorial",
        "Python Basics",
        "Python Basics",
    ],
    columns=["2021-12", "2022-01", "2022-02"],
)
[29]:
df.groupby(len).count()
[29]:
2021-12 2022-01 2022-02
13 2 2 2
14 1 1 1
16 2 2 2

Mixing functions with arrays, dicts or series is no problem, as everything is converted internally into arrays:

[30]:
languages = ["de", "en", "de", "de", "en"]
[31]:
df.groupby([len, languages]).count()
[31]:
2021-12 2022-01 2022-02
13 de 1 1 1
en 1 1 1
14 de 1 1 1
16 de 1 1 1
en 1 1 1

Grouping by index levels

A final practical feature for hierarchically indexed datasets is the ability to aggregate by one of the index levels of an axis. Let’s look at an example:

[32]:
version_hits = [
    [19651, 0, 30134, 0, 33295, 0],
    [4722, 1825, 3497, 2576, 4009, 3707],
    [2573, 0, 4873, 0, 3930, 0],
    [None, None, None, None, None, None],
    [525, 0, 427, 0, 276, 0],
    [157, 0, 85, 0, 226, 0],
]

df = pd.DataFrame(
    version_hits,
    index=[
        [
            "Jupyter Tutorial",
            "Jupyter Tutorial",
            "PyViz Tutorial",
            None,
            "Python Basics",
            "Python Basics",
        ],
        ["de", "en", "de", None, "de", "en"],
    ],
    columns=[
        ["2021-12", "2021-12", "2022-01", "2022-01", "2022-02", "2022-02"],
        ["latest", "stable", "latest", "stable", "latest", "stable"],
    ],
)

df.columns.names = ["Month", "Version"]

df
[32]:
Month 2021-12 2022-01 2022-02
Version latest stable latest stable latest stable
Jupyter Tutorial de 19651.0 0.0 30134.0 0.0 33295.0 0.0
en 4722.0 1825.0 3497.0 2576.0 4009.0 3707.0
PyViz Tutorial de 2573.0 0.0 4873.0 0.0 3930.0 0.0
NaN NaN NaN NaN NaN NaN NaN NaN
Python Basics de 525.0 0.0 427.0 0.0 276.0 0.0
en 157.0 0.0 85.0 0.0 226.0 0.0
[33]:
df.groupby(level="Month", axis=1).sum()
[33]:
Month 2021-12 2022-01 2022-02
Jupyter Tutorial de 19651.0 30134.0 33295.0
en 6547.0 6073.0 7716.0
PyViz Tutorial de 2573.0 4873.0 3930.0
NaN NaN 0.0 0.0 0.0
Python Basics de 525.0 427.0 276.0
en 157.0 85.0 226.0