Apply#

The most general GroupBy method is apply. It splits the object to be processed into pieces, calls the passed function on each piece and then tries to concatenate the pieces back together.

Suppose we want to select the five largest hit values by group. To do this, we first write a function that selects the rows with the largest values in a particular column:

[1]:
import numpy as np
import pandas as pd
[2]:
df = pd.DataFrame(
    {
        "2021-12": [30134, 6073, 4873, None, 427, 95],
        "2022-01": [33295, 7716, 3930, None, 276, 226],
        "2022-02": [19651, 6547, 2573, None, 525, 157],
    },
    index=[
        [
            "Jupyter Tutorial",
            "Jupyter Tutorial",
            "PyViz Tutorial",
            "PyViz Tutorial",
            "Python Basics",
            "Python Basics",
        ],
        ["de", "en", "de", "en", "de", "en"],
    ],
)
df.index.names = ["Title", "Language"]

df
[2]:
2021-12 2022-01 2022-02
Title Language
Jupyter Tutorial de 30134.0 33295.0 19651.0
en 6073.0 7716.0 6547.0
PyViz Tutorial de 4873.0 3930.0 2573.0
en NaN NaN NaN
Python Basics de 427.0 276.0 525.0
en 95.0 226.0 157.0
[3]:
def top(df, n=5, column="2021-12"):
    return df.sort_values(by=column, ascending=False)[:n]


top(df, n=3)
[3]:
2021-12 2022-01 2022-02
Title Language
Jupyter Tutorial de 30134.0 33295.0 19651.0
en 6073.0 7716.0 6547.0
PyViz Tutorial de 4873.0 3930.0 2573.0

If we now group by titles, for example, and call apply with this function, we get the following:

[4]:
grouped_titles = df.groupby("Title", as_index=False)

grouped_titles.apply(top)
[4]:
2021-12 2022-01 2022-02
Title Language
0 Jupyter Tutorial de 30134.0 33295.0 19651.0
en 6073.0 7716.0 6547.0
1 PyViz Tutorial de 4873.0 3930.0 2573.0
en NaN NaN NaN
2 Python Basics de 427.0 276.0 525.0
en 95.0 226.0 157.0

What happened here? The top function is called on each group of the DataFrame, and the results are then concatenated with pandas.concat, labelling the parts with the group keys. Because we passed as_index=False, the outer level of the result is a plain integer index; the inner level contains index values from the original DataFrame.
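
Conceptually, this is nothing more than splitting, applying and concatenating by hand. A minimal sketch of the equivalent manual steps (for illustration only, not pandas’ actual implementation):

# Split by title, apply top to each piece and concatenate the results,
# labelling the parts with the group keys:
pieces = {name: top(group) for name, group in df.groupby("Title")}
pd.concat(pieces, names=["Title"])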

If you pass a function to apply that takes other arguments or keywords, you can pass them after the function:

[5]:
grouped_titles = df.groupby("Title", as_index=False)

grouped_titles.apply(top, n=1)
[5]:
2021-12 2022-01 2022-02
Title Language
0 Jupyter Tutorial de 30134.0 33295.0 19651.0
1 PyViz Tutorial de 4873.0 3930.0 2573.0
2 Python Basics de 427.0 276.0 525.0

We have now seen the basic usage of apply. What happens inside the passed function is very versatile and up to you; it only has to return a pandas object or a single value. In the following, we will therefore mainly show examples that can give you ideas on how to solve various problems with groupby.
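
For example, a function that returns a single value per group yields a Series when applied; a small sketch, here taking the maximum of the 2021-12 column per title:

# A scalar return value per group produces a Series indexed by the group keys:
df.groupby("Title").apply(lambda g: g["2021-12"].max())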

First, let’s look again at describe, called on the GroupBy object:

[6]:
result = grouped_titles.describe()

result
[6]:
2021-12 2022-01 2022-02
count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max
0 2.0 18103.5 17013.696262 6073.0 12088.25 18103.5 24118.75 30134.0 2.0 20505.5 ... 26900.25 33295.0 2.0 13099.0 9265.927261 6547.0 9823.0 13099.0 16375.0 19651.0
1 1.0 4873.0 NaN 4873.0 4873.00 4873.0 4873.00 4873.0 1.0 3930.0 ... 3930.00 3930.0 1.0 2573.0 NaN 2573.0 2573.0 2573.0 2573.0 2573.0
2 2.0 261.0 234.759451 95.0 178.00 261.0 344.00 427.0 2.0 251.0 ... 263.50 276.0 2.0 341.0 260.215295 157.0 249.0 341.0 433.0 525.0

3 rows × 24 columns

When you call a method like describe on a GroupBy object, it is actually just shorthand for:

[7]:
f = lambda x: x.describe()
grouped_titles.apply(f)
[7]:
2021-12 2022-01 2022-02
0 count 2.000000 2.000000 2.000000
mean 18103.500000 20505.500000 13099.000000
std 17013.696262 18087.084356 9265.927261
min 6073.000000 7716.000000 6547.000000
25% 12088.250000 14110.750000 9823.000000
50% 18103.500000 20505.500000 13099.000000
75% 24118.750000 26900.250000 16375.000000
max 30134.000000 33295.000000 19651.000000
1 count 1.000000 1.000000 1.000000
mean 4873.000000 3930.000000 2573.000000
std NaN NaN NaN
min 4873.000000 3930.000000 2573.000000
25% 4873.000000 3930.000000 2573.000000
50% 4873.000000 3930.000000 2573.000000
75% 4873.000000 3930.000000 2573.000000
max 4873.000000 3930.000000 2573.000000
2 count 2.000000 2.000000 2.000000
mean 261.000000 251.000000 341.000000
std 234.759451 35.355339 260.215295
min 95.000000 226.000000 157.000000
25% 178.000000 238.500000 249.000000
50% 261.000000 251.000000 341.000000
75% 344.000000 263.500000 433.000000
max 427.000000 276.000000 525.000000

Suppression of the group keys#

By default, apply gives the resulting object a hierarchical index formed from the group keys together with the indices of the individual parts of the original object (in the examples above, the keys were replaced by integers because of as_index=False). You can disable this extra grouping level by passing group_keys=False to groupby:

[8]:
grouped_lang = df.groupby("Language", group_keys=False)

grouped_lang.apply(top)
[8]:
2021-12 2022-01 2022-02
Title Language
Jupyter Tutorial de 30134.0 33295.0 19651.0
PyViz Tutorial de 4873.0 3930.0 2573.0
Python Basics de 427.0 276.0 525.0
Jupyter Tutorial en 6073.0 7716.0 6547.0
Python Basics en 95.0 226.0 157.0
PyViz Tutorial en NaN NaN NaN

Quantile and bucket analysis#

As described in discretisation and grouping, pandas has some tools, especially cut and qcut, to split data into buckets with bins of your choice or by sample quantiles. Combine these functions with groupby and you can conveniently perform bucket or quantile analysis on a dataset. Consider a simple random data set and a bucket categorisation of equal length with cut:

[9]:
rng = np.random.default_rng()
df2 = pd.DataFrame(
    {
        "data1": rng.normal(size=1000),
        "data2": rng.normal(size=1000)
    }
)

quartiles = pd.cut(df2.data1, 4)

quartiles[:10]
[9]:
0    (-1.38, 0.0424]
1    (-1.38, 0.0424]
2    (0.0424, 1.464]
3    (0.0424, 1.464]
4    (-1.38, 0.0424]
5    (0.0424, 1.464]
6    (-1.38, 0.0424]
7    (-1.38, 0.0424]
8    (-1.38, 0.0424]
9    (-1.38, 0.0424]
Name: data1, dtype: category
Categories (4, interval[float64, right]): [(-2.807, -1.38] < (-1.38, 0.0424] < (0.0424, 1.464] < (1.464, 2.886]]

The Categorical object returned by cut can be passed directly to groupby. So we could calculate a set of group statistics for these buckets as follows:

[10]:
def stats(group):
    return pd.DataFrame(
        {
            "min": group.min(),
            "max": group.max(),
            "count": group.count(),
            "mean": group.mean(),
        }
    )


grouped_quart = df2.groupby(quartiles)

grouped_quart.apply(stats)
[10]:
min max count mean
data1
(-2.807, -1.38] data1 -2.801488 -1.389215 74 -1.791025
data2 -2.435057 2.248734 74 -0.052099
(-1.38, 0.0424] data1 -1.369729 0.042258 452 -0.557848
data2 -2.815141 2.555586 452 0.007063
(0.0424, 1.464] data1 0.043187 1.453391 401 0.661546
data2 -2.680655 3.002200 401 0.055687
(1.464, 2.886] data1 1.471820 2.886283 73 1.832881
data2 -2.108169 3.566785 73 0.145811
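
For simple reductions like min, max, count and mean, a possible alternative without apply is agg with a list of function names; the statistics then appear as a column level rather than a row level:

# Equivalent group statistics via agg; the result is transposed
# relative to the stats variant above:
df2.groupby(quartiles).agg(["min", "max", "count", "mean"])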

These were buckets of equal length; to calculate buckets of equal size based on sample quantiles, we can use qcut instead. We pass labels=False to get only the quantile numbers:

[11]:
quartiles_samp = pd.qcut(df2.data1, 4, labels=False)

grouped_quart_samp = df2.groupby(quartiles_samp)

grouped_quart_samp.apply(stats)
[11]:
min max count mean
data1
0 data1 -2.801488 -0.651380 250 -1.207350
data2 -2.815141 2.555586 250 -0.048382
1 data1 -0.648977 -0.027545 250 -0.332227
data2 -2.529867 2.478078 250 0.069973
2 data1 -0.026572 0.691897 250 0.325120
data2 -2.638173 3.002200 250 0.063670
3 data1 0.692362 2.886283 250 1.272046
data2 -2.680655 3.566785 250 0.043986

Filling missing values with group-specific values#

When cleaning up missing data, in some cases you will remove observations with dropna, but in other cases you may want to fill the null values (NA) with a fixed value or a value derived from the data. fillna is the right tool for this; here, for example, we fill the null values with the mean:

[12]:
s = pd.Series(rng.normal(size=8))
s[::3] = np.nan

s
[12]:
0         NaN
1    0.835698
2   -0.262870
3         NaN
4   -1.345111
5   -0.266797
6         NaN
7    0.550379
dtype: float64
[13]:
s.fillna(s.mean())
[13]:
0   -0.097740
1    0.835698
2   -0.262870
3   -0.097740
4   -1.345111
5   -0.266797
6   -0.097740
7    0.550379
dtype: float64

Suppose you want the fill value to vary by group. Our tutorial data is already divided into German and English editions, so you can predefine a fill value per language; since the groups have an internal name attribute, you can use it with apply:

[14]:
fill_values = {"de": 10632, "en": 3469}

fill_func = lambda g: g.fillna(fill_values[g.name])

df.groupby("Language").apply(fill_func)
[14]:
2021-12 2022-01 2022-02
Language Title Language
de Jupyter Tutorial de 30134.0 33295.0 19651.0
PyViz Tutorial de 4873.0 3930.0 2573.0
Python Basics de 427.0 276.0 525.0
en Jupyter Tutorial en 6073.0 7716.0 6547.0
PyViz Tutorial en 3469.0 3469.0 3469.0
Python Basics en 95.0 226.0 157.0

You can also group the data and use apply with a function that calls fillna on each group:

[15]:
fill_mean = lambda g: g.fillna(g.mean())

df.groupby("Language").apply(fill_mean)
[15]:
2021-12 2022-01 2022-02
Language Title Language
de Jupyter Tutorial de 30134.0 33295.0 19651.0
PyViz Tutorial de 4873.0 3930.0 2573.0
Python Basics de 427.0 276.0 525.0
en Jupyter Tutorial en 6073.0 7716.0 6547.0
PyViz Tutorial en 3084.0 3971.0 3352.0
Python Basics en 95.0 226.0 157.0
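
The same group-wise filling can also be sketched without apply by combining fillna with transform, which broadcasts the group means back to the shape of the original DataFrame; the result then keeps the original index instead of gaining a group-key level:

# transform("mean") returns a DataFrame of group means aligned with df,
# which fillna then uses element-wise:
df.fillna(df.groupby("Language").transform("mean"))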

Group weighted average#

Since operations between columns in a DataFrame or two Series are possible, we can calculate the group-weighted average, for example:

[16]:
df3 = pd.DataFrame(
    {
        "category": ["de", "de", "de", "de", "en", "en", "en", "en"],
        "data": rng.integers(100000, size=8),
        "weights": rng.random(8),
    }
)

df3
[16]:
category data weights
0 de 41970 0.967458
1 de 53639 0.605162
2 de 16329 0.007546
3 de 14668 0.033338
4 en 99258 0.826135
5 en 7727 0.861027
6 en 13388 0.005460
7 en 27957 0.276577

The group average weighted by category would then be:

[17]:
grouped_cat = df3.groupby("category")
get_wavg = lambda g: np.average(g["data"], weights=g["weights"])

grouped_cat.apply(get_wavg)
[17]:
category
de    45662.558991
en    48983.872414
dtype: float64
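
np.average with weights computes sum(data × weights) / sum(weights); written out per group, the equivalent calculation would be:

# The weighted average per group, spelled out without np.average:
grouped_cat.apply(lambda g: (g["data"] * g["weights"]).sum() / g["weights"].sum())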

Correlation#

An interesting task could be to examine how the percentage changes of the monthly columns correlate with each other within each language group.

For this purpose, we first create a function that calculates the pairwise correlation of the 2021-12 column with the other columns:

[18]:
corr = lambda x: x.corrwith(x["2021-12"])

Next, we calculate the percentage changes and drop the rows that contain missing values:

[19]:
pcts = df.pct_change().dropna()

Finally, we group these percentage changes by language and apply our correlation function:

[20]:
grouped_lang = pcts.groupby("Language")

grouped_lang.apply(corr)
[20]:
2021-12 2022-01 2022-02
Language
de 1.0 1.000000 1.00000
en 1.0 0.699088 0.99781
[21]:
grouped_lang.apply(lambda g: g["2021-12"].corr(g["2022-01"]))
[21]:
Language
de    1.000000
en    0.699088
dtype: float64

Performance problems with apply#

apply calls the passed function once for each group (and Series.apply even once for each value), so on large data sets the function may be called very often. Inside the function, slow Python code runs instead of pandas’ fast vectorised operations, unless you use only NumPy or pandas functions there. For example, we previously grouped the data by title and then called our top function with apply. Let’s measure the time for this:

[22]:
%%timeit
grouped_titles.apply(top)
566 µs ± 8.04 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

We can get a comparable result without apply by passing the entire DataFrame directly to our top function:

[23]:
%%timeit
top(df)
43.8 µs ± 693 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

This calculation is almost 13 times faster.
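
If you need the top rows per group rather than of the whole DataFrame, one possible vectorised alternative is to sort once and then take the first n rows of each group:

# Sort once, then take up to five rows per title group:
df.sort_values("2021-12", ascending=False).groupby("Title").head(5)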

Optimising apply with Cython#

It is not always easy to find an alternative for apply. However, operations like our top function can potentially be made faster with Cython. To use Cython in Jupyter, we use the following IPython magic:

[24]:
%load_ext Cython

Then we can define our top function with Cython:

[25]:
%%cython
def top_cy(df, n=5, column="2021-12"):
    return df.sort_values(by=column, ascending=False)[:n]
[26]:
%%timeit
grouped_titles.apply(top_cy)
565 µs ± 7.08 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

We haven’t really gained anything with this yet. A further optimisation possibility would be to declare static types in the Cython code with cpdef. For this, however, we would have to rewrite our function, because a DataFrame cannot be passed as a typed argument.
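
A sketch of what such a typed function could look like, operating on a NumPy array instead of a DataFrame (a hypothetical example, not a drop-in replacement for top):

%%cython
cpdef double cy_sum(double[:] values):
    # Typed C loop over a float64 array
    cdef Py_ssize_t i
    cdef double total = 0.0
    for i in range(values.shape[0]):
        total += values[i]
    return total

It could then be called, for example, with cy_sum(df["2021-12"].dropna().to_numpy()).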