Group operations¶
The term groupby describes a process that involves one or more of the following steps:
Split divides the data into groups according to certain criteria
Apply applies a function independently to each group
Combine combines the results in a data structure
In the first phase of the process, the data contained in a pandas object, be it a Series, a DataFrame or something else, is split into groups based on one or more keys. The splitting is performed on a particular axis of the object; for example, a DataFrame can be grouped by its rows (axis=0) or by its columns (axis=1). Then a function is applied to each group, producing a new value. Finally, the results of all these function applications are combined into a result object. The shape of the result object usually depends on what is done with the data.
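To make these three phases a little more concrete before moving on, here is a minimal sketch that performs them by hand on a toy Series and then lets groupby do the same work in a single step (the names ratings, pieces, applied and combined are purely illustrative):

import pandas as pd

ratings = pd.Series([4, 3, 5, 2], index=["a", "b", "a", "b"])

# Split: collect the values that belong to each key
pieces = {key: ratings[ratings.index == key] for key in ratings.index.unique()}
# Apply: run a function on each piece independently, here the sum
applied = {key: piece.sum() for key, piece in pieces.items()}
# Combine: gather the per-group results into a new pandas object
combined = pd.Series(applied)

# groupby performs the same split-apply-combine in a single step
ratings.groupby(level=0).sum()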
Each grouping key can take many forms, and the keys do not all have to be of the same type:
a list or array of values that have the same length as the axis being grouped
a value that specifies a column name in a DataFrame
a dict or series that is a correspondence between the values on the axis being grouped and the group names
a function that is called on the axis index or the individual labels in the index
Note:
The latter three methods are shortcuts to create an array of values that will be used to divide the object.
Don’t worry if this all seems abstract. Throughout this chapter I will give many examples of all these methods. For starters, here is a small table dataset as a DataFrame:
[1]:
import numpy as np
import pandas as pd
[2]:
df = pd.DataFrame(
    {
        "Title": [
            "Jupyter Tutorial",
            "Jupyter Tutorial",
            "PyViz Tutorial",
            None,
            "Python Basics",
            "Python Basics",
        ],
        "Language": ["de", "en", "de", None, "de", "en"],
        "2021-12": [19651, 4722, 2573, None, 525, 157],
        "2022-01": [30134, 3497, 4873, None, 427, 85],
        "2022-02": [33295, 4009, 3930, None, 276, 226],
    }
)
df
[2]:
  | Title | Language | 2021-12 | 2022-01 | 2022-02
---|---|---|---|---|---
0 | Jupyter Tutorial | de | 19651.0 | 30134.0 | 33295.0
1 | Jupyter Tutorial | en | 4722.0 | 3497.0 | 4009.0
2 | PyViz Tutorial | de | 2573.0 | 4873.0 | 3930.0
3 | None | None | NaN | NaN | NaN
4 | Python Basics | de | 525.0 | 427.0 | 276.0
5 | Python Basics | en | 157.0 | 85.0 | 226.0
Suppose you want to calculate the sum of the 2022-02 column using the labels from Title. There are several ways to do this. One is to access 2022-02 and call groupby with the column (a Series) in Title:
[3]:
grouped = df["2022-02"].groupby(df["Title"])
grouped
[3]:
<pandas.core.groupby.generic.SeriesGroupBy object at 0x11f32c8d0>
This grouped variable is now a special SeriesGroupBy object. It has not yet calculated anything except some intermediate data about the group key df['Title']. The idea is that this object has all the information needed to apply an operation to each of the groups. For example, to calculate the group sums, we can call the sum method of the GroupBy object:
[4]:
grouped.sum()
[4]:
Title
Jupyter Tutorial 37304.0
PyViz Tutorial 3930.0
Python Basics 502.0
Name: 2022-02, dtype: float64
Later I will explain more about what happens when you call .sum(). The important thing to note here is that the data (a Series) has been aggregated by splitting it along the group key, producing a new Series that is now indexed by the unique values in the Title column. The resulting index is named Title because groupby(df['Title']) did this.
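sum is only one of many aggregations that can be applied to the same GroupBy object; methods such as mean, count or max work in exactly the same way. A short sketch, not executed here:

grouped.mean()   # average hits per title
grouped.count()  # number of non-missing values per title
grouped.max()    # largest monthly value per title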
If we had passed multiple arrays as a list instead, we would get something different:
[5]:
sums = df["2021-12"].groupby([df["Language"], df["Title"]]).sum()
sums
[5]:
Language Title
de Jupyter Tutorial 19651.0
PyViz Tutorial 2573.0
Python Basics 525.0
en Jupyter Tutorial 4722.0
Python Basics 157.0
Name: 2021-12, dtype: float64
Here we have grouped the data based on two keys, and the resulting series now has a hierarchical index consisting of the observed unique key pairs:
[6]:
sums.unstack()
[6]:
Title | Jupyter Tutorial | PyViz Tutorial | Python Basics
---|---|---|---
Language | | |
de | 19651.0 | 2573.0 | 525.0
en | 4722.0 | NaN | 157.0
Often the grouping information is in the same DataFrame as the data you want to work with. In this case, you can pass column names (whether they are strings, numbers or other Python objects) as group keys:
[7]:
df.groupby("Title").sum()
[7]:
Title | Language | 2021-12 | 2022-01 | 2022-02
---|---|---|---|---
Jupyter Tutorial | deen | 24373.0 | 33631.0 | 37304.0
PyViz Tutorial | de | 2573.0 | 4873.0 | 3930.0
Python Basics | deen | 682.0 | 512.0 | 502.0
Here it is noticeable that the result still contains a Language column: since df['Language'] holds strings, sum simply concatenates the values within each group, for example deen for Jupyter Tutorial. By default, all columns are aggregated; if you only want the numeric ones, you have to say so explicitly.
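One way is the numeric_only keyword of the sum aggregation; selecting only the desired columns, as shown later in this chapter, works as well. A minimal sketch, not executed here:

# Leave the non-numeric Language column out of the aggregation
df.groupby("Title").sum(numeric_only=True)

If Language is made part of the group keys instead, as in the next cell, it moves into the index and is not aggregated at all.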
[8]:
df.groupby(["Title","Language"]).sum()
[8]:
Title | Language | 2021-12 | 2022-01 | 2022-02
---|---|---|---|---
Jupyter Tutorial | de | 19651.0 | 30134.0 | 33295.0
  | en | 4722.0 | 3497.0 | 4009.0
PyViz Tutorial | de | 2573.0 | 4873.0 | 3930.0
Python Basics | de | 525.0 | 427.0 | 276.0
  | en | 157.0 | 85.0 | 226.0
Regardless of the goal of using groupby, a generally useful groupby method is size, which returns a Series with the group sizes:
[9]:
df.groupby(["Language"]).size()
[9]:
Language
de 3
en 2
dtype: int64
Note:
All missing values in a group key are excluded from the result by default. This behaviour can be disabled by passing dropna=False to groupby:
[10]:
df.groupby("Language", dropna=False).size()
[10]:
Language
de 3
en 2
NaN 1
dtype: int64
[11]:
df.groupby(["Title", "Language"], dropna=False).size()
[11]:
Title Language
Jupyter Tutorial de 1
en 1
PyViz Tutorial de 1
Python Basics de 1
en 1
NaN NaN 1
dtype: int64
Iteration over groups¶
The object returned by groupby supports iteration: it produces a sequence of 2-tuples containing the group name together with the corresponding chunk of data. Consider the following:
[12]:
for name, group in df.groupby("Title"):
    print(name)
    print(group)
Jupyter Tutorial
Title Language 2021-12 2022-01 2022-02
0 Jupyter Tutorial de 19651.0 30134.0 33295.0
1 Jupyter Tutorial en 4722.0 3497.0 4009.0
PyViz Tutorial
Title Language 2021-12 2022-01 2022-02
2 PyViz Tutorial de 2573.0 4873.0 3930.0
Python Basics
Title Language 2021-12 2022-01 2022-02
4 Python Basics de 525.0 427.0 276.0
5 Python Basics en 157.0 85.0 226.0
With multiple keys, the first element of the tuple is a tuple of key values:
[13]:
for (i1, i2), group in df.groupby(["Title", "Language"]):
    print((i1, i2))
    print(group)
('Jupyter Tutorial', 'de')
Title Language 2021-12 2022-01 2022-02
0 Jupyter Tutorial de 19651.0 30134.0 33295.0
('Jupyter Tutorial', 'en')
Title Language 2021-12 2022-01 2022-02
1 Jupyter Tutorial en 4722.0 3497.0 4009.0
('PyViz Tutorial', 'de')
Title Language 2021-12 2022-01 2022-02
2 PyViz Tutorial de 2573.0 4873.0 3930.0
('Python Basics', 'de')
Title Language 2021-12 2022-01 2022-02
4 Python Basics de 525.0 427.0 276.0
('Python Basics', 'en')
Title Language 2021-12 2022-01 2022-02
5 Python Basics en 157.0 85.0 226.0
Next, we want to build a dict of the data pieces as a one-liner:
[14]:
books = dict(list(df.groupby("Title")))
books
[14]:
{'Jupyter Tutorial': Title Language 2021-12 2022-01 2022-02
0 Jupyter Tutorial de 19651.0 30134.0 33295.0
1 Jupyter Tutorial en 4722.0 3497.0 4009.0,
'PyViz Tutorial': Title Language 2021-12 2022-01 2022-02
2 PyViz Tutorial de 2573.0 4873.0 3930.0,
'Python Basics': Title Language 2021-12 2022-01 2022-02
4 Python Basics de 525.0 427.0 276.0
5 Python Basics en 157.0 85.0 226.0}
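If only a single group is needed, it is not necessary to build the whole dict: get_group returns the corresponding chunk directly. A short sketch, not executed here:

df.groupby("Title").get_group("PyViz Tutorial")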
By default, groupby groups on axis=0, but you can also group on any of the other axes. For example, we could group the columns of our example df here by dtype as follows:
[15]:
df.dtypes
[15]:
Title object
Language object
2021-12 float64
2022-01 float64
2022-02 float64
dtype: object
[16]:
grouped = df.groupby(df.dtypes, axis=1)
[17]:
for dtype, group in grouped:
    print(dtype)
    print(group)
float64
2021-12 2022-01 2022-02
0 19651.0 30134.0 33295.0
1 4722.0 3497.0 4009.0
2 2573.0 4873.0 3930.0
3 NaN NaN NaN
4 525.0 427.0 276.0
5 157.0 85.0 226.0
object
Title Language
0 Jupyter Tutorial de
1 Jupyter Tutorial en
2 PyViz Tutorial de
3 None None
4 Python Basics de
5 Python Basics en
Selecting a column or subset of columns¶
Indexing a GroupBy object created from a DataFrame with a column name or an array of column names has the effect of selecting those columns for aggregation. This means that:
[18]:
df.groupby("Title")["2021-12"]
df.groupby("Title")[["2022-01"]]
[18]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11f2f08d0>
are shorthand for:
[19]:
df["2021-12"].groupby(df["Title"])
df[["2022-01"]].groupby(df["Title"])
[19]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11f362a50>
Especially for large datasets, it may be desirable to aggregate only some columns. For example, to calculate the sum for only the 2022-01 column in the previous dataset and get the result as a DataFrame, we could write:
[20]:
df.groupby(["Title", "Language"])[["2022-01"]].sum()
[20]:
Title | Language | 2022-01
---|---|---
Jupyter Tutorial | de | 30134.0
  | en | 3497.0
PyViz Tutorial | de | 4873.0
Python Basics | de | 427.0
  | en | 85.0
The object returned by this indexing operation is a grouped DataFrame if a list or array is passed, or a grouped series if only a single column name is passed as a scalar:
[21]:
series_grouped = df.groupby(["Title", "Language"])["2022-01"]
series_grouped
[21]:
<pandas.core.groupby.generic.SeriesGroupBy object at 0x11f36d150>
[22]:
series_grouped.sum()
[22]:
Title Language
Jupyter Tutorial de 30134.0
en 3497.0
PyViz Tutorial de 4873.0
Python Basics de 427.0
en 85.0
Name: 2022-01, dtype: float64
Grouping with dicts and Series¶
Grouping information can also be in a form other than an array:
[23]:
df.iloc[2:3, [2, 3]] = np.nan
Suppose I have a group correspondence for the columns and want to sum the columns by group:
[24]:
mapping = {
    "2021-12": "Dec 2021",
    "2022-01": "Jan 2022",
    "2022-02": "Feb 2022",
}
Now an array could be constructed from this dict to pass to groupby, but instead we can just pass the dict:
[25]:
by_column = df.groupby(mapping, axis=1)
by_column.sum()
[25]:
  | Dec 2021 | Feb 2022 | Jan 2022
---|---|---|---
0 | 19651.0 | 33295.0 | 30134.0
1 | 4722.0 | 4009.0 | 3497.0
2 | 0.0 | 3930.0 | 0.0
3 | 0.0 | 0.0 | 0.0
4 | 525.0 | 276.0 | 427.0
5 | 157.0 | 226.0 | 85.0
The same functionality applies to Series:
[26]:
map_series = pd.Series(mapping)
map_series
[26]:
2021-12 Dec 2021
2022-01 Jan 2022
2022-02 Feb 2022
dtype: object
[27]:
df.groupby(map_series, axis=1).sum()
[27]:
  | Dec 2021 | Feb 2022 | Jan 2022
---|---|---|---
0 | 19651.0 | 33295.0 | 30134.0
1 | 4722.0 | 4009.0 | 3497.0
2 | 0.0 | 3930.0 | 0.0
3 | 0.0 | 0.0 | 0.0
4 | 525.0 | 276.0 | 427.0
5 | 157.0 | 226.0 | 85.0
Grouping with functions¶
Using Python functions is a more general way of defining a group assignment than a dict or Series. Each function passed as a group key is called once per index value, and the return values are used as the group names. Specifically, consider the example data from the previous sections, this time with the titles as index values. If you want to group by the length of the names, you could compute an array with the lengths of the strings, but it is easier to simply pass the len function:
[28]:
df = pd.DataFrame(
    [
        [19651, 30134, 33295],
        [4722, 3497, 4009],
        [2573, 4873, 3930],
        [525, 427, 276],
        [157, 85, 226],
    ],
    index=[
        "Jupyter Tutorial",
        "Jupyter Tutorial",
        "PyViz Tutorial",
        "Python Basics",
        "Python Basics",
    ],
    columns=["2021-12", "2022-01", "2022-02"],
)
[29]:
df.groupby(len).count()
[29]:
  | 2021-12 | 2022-01 | 2022-02
---|---|---|---
13 | 2 | 2 | 2
14 | 1 | 1 | 1
16 | 2 | 2 | 2
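Passing len is just a shortcut for computing the lengths yourself and passing them as an array; the name name_lengths in the following sketch is purely illustrative:

# Equivalent to df.groupby(len).count()
name_lengths = [len(name) for name in df.index]
df.groupby(name_lengths).count()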
Mixing functions with arrays, dicts or series is no problem, as everything is converted internally into arrays:
[30]:
languages = ["de", "en", "de", "de", "en"]
[31]:
df.groupby([len, languages]).count()
[31]:
  |   | 2021-12 | 2022-01 | 2022-02
---|---|---|---|---
13 | de | 1 | 1 | 1
  | en | 1 | 1 | 1
14 | de | 1 | 1 | 1
16 | de | 1 | 1 | 1
  | en | 1 | 1 | 1
Grouping by index levels¶
A final practical feature for hierarchically indexed datasets is the ability to aggregate by one of the index levels of an axis. Let’s look at an example:
[32]:
version_hits = [
    [19651, 0, 30134, 0, 33295, 0],
    [4722, 1825, 3497, 2576, 4009, 3707],
    [2573, 0, 4873, 0, 3930, 0],
    [None, None, None, None, None, None],
    [525, 0, 427, 0, 276, 0],
    [157, 0, 85, 0, 226, 0],
]
df = pd.DataFrame(
    version_hits,
    index=[
        [
            "Jupyter Tutorial",
            "Jupyter Tutorial",
            "PyViz Tutorial",
            None,
            "Python Basics",
            "Python Basics",
        ],
        ["de", "en", "de", None, "de", "en"],
    ],
    columns=[
        ["2021-12", "2021-12", "2022-01", "2022-01", "2022-02", "2022-02"],
        ["latest", "stable", "latest", "stable", "latest", "stable"],
    ],
)
df.columns.names = ["Month", "Version"]
df
[32]:
Month | | 2021-12 | 2021-12 | 2022-01 | 2022-01 | 2022-02 | 2022-02
Version | | latest | stable | latest | stable | latest | stable
---|---|---|---|---|---|---|---
Jupyter Tutorial | de | 19651.0 | 0.0 | 30134.0 | 0.0 | 33295.0 | 0.0
  | en | 4722.0 | 1825.0 | 3497.0 | 2576.0 | 4009.0 | 3707.0
PyViz Tutorial | de | 2573.0 | 0.0 | 4873.0 | 0.0 | 3930.0 | 0.0
NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
Python Basics | de | 525.0 | 0.0 | 427.0 | 0.0 | 276.0 | 0.0
  | en | 157.0 | 0.0 | 85.0 | 0.0 | 226.0 | 0.0
[33]:
df.groupby(level="Month", axis=1).sum()
[33]:
Month | | 2021-12 | 2022-01 | 2022-02
---|---|---|---|---
Jupyter Tutorial | de | 19651.0 | 30134.0 | 33295.0
  | en | 6547.0 | 6073.0 | 7716.0
PyViz Tutorial | de | 2573.0 | 4873.0 | 3930.0
NaN | NaN | 0.0 | 0.0 | 0.0
Python Basics | de | 525.0 | 427.0 | 276.0
  | en | 157.0 | 85.0 | 226.0
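Grouping by an index level works on the rows as well. The row index levels in this example have no names, so they are addressed by position; a short sketch, not executed here:

# Sum all versions and months per title by grouping on the outer row index level
df.groupby(level=0).sum()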