Apply#

The most general GroupBy method is apply. It splits the object to be processed into pieces, calls the passed function on each piece and then tries to concatenate the pieces back together.

Suppose we want to select the five largest hit values by group. To do this, we first write a function that selects the rows with the largest values in a particular column:

[1]:
import numpy as np
import pandas as pd
[2]:
df = pd.DataFrame(
    {
        "2021-12": [30134, 6073, 4873, None, 427, 95],
        "2022-01": [33295, 7716, 3930, None, 276, 226],
        "2022-02": [19651, 6547, 2573, None, 525, 157],
    },
    index=[
        [
            "Jupyter Tutorial",
            "Jupyter Tutorial",
            "PyViz Tutorial",
            "PyViz Tutorial",
            "Python Basics",
            "Python Basics",
        ],
        ["de", "en", "de", "en", "de", "en"],
    ],
)
df.index.names = ["Title", "Language"]

df
[2]:
2021-12 2022-01 2022-02
Title Language
Jupyter Tutorial de 30134.0 33295.0 19651.0
en 6073.0 7716.0 6547.0
PyViz Tutorial de 4873.0 3930.0 2573.0
en NaN NaN NaN
Python Basics de 427.0 276.0 525.0
en 95.0 226.0 157.0
[3]:
def top(df, n=5, column="2021-12"):
    return df.sort_values(by=column, ascending=False)[:n]


top(df, n=3)
[3]:
2021-12 2022-01 2022-02
Title Language
Jupyter Tutorial de 30134.0 33295.0 19651.0
en 6073.0 7716.0 6547.0
PyViz Tutorial de 4873.0 3930.0 2573.0

If we now group by titles, for example, and call apply with this function, we get the following:

[4]:
grouped_titles = df.groupby("Title", as_index=False)

grouped_titles.apply(top)
[4]:
2021-12 2022-01 2022-02
Title Language
0 Jupyter Tutorial de 30134.0 33295.0 19651.0
en 6073.0 7716.0 6547.0
1 PyViz Tutorial de 4873.0 3930.0 2573.0
en NaN NaN NaN
2 Python Basics de 427.0 276.0 525.0
en 95.0 226.0 157.0

What happened here? The top function is called on each group of the DataFrame, and the results are then concatenated with pandas.concat, labelling the parts with the group keys. Because we passed as_index=False, the outer level of the result is a plain integer index; the inner level contains index values from the original DataFrame.
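
Conceptually, this is nothing more than splitting, applying and concatenating by hand. A minimal sketch of the equivalent manual steps (for illustration only, not pandas’ actual implementation):

# Split by title, apply top to each piece and concatenate the results,
# labelling the parts with the group keys:
pieces = {name: top(group) for name, group in df.groupby("Title")}
pd.concat(pieces, names=["Title"])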

If you pass a function to apply that takes other arguments or keywords, you can pass them after the function:

[5]:
grouped_titles = df.groupby("Title", as_index=False)

grouped_titles.apply(top, n=1)
[5]:
2021-12 2022-01 2022-02
Title Language
0 Jupyter Tutorial de 30134.0 33295.0 19651.0
1 PyViz Tutorial de 4873.0 3930.0 2573.0
2 Python Basics de 427.0 276.0 525.0

We have now seen the basic usage of apply. What happens inside the passed function is very versatile and up to you; it only has to return a pandas object or a single value. In the following, we will therefore mainly show examples that can give you ideas on how to solve various problems with groupby.
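
For example, a function that returns a single value per group yields a Series when applied; a small sketch, here taking the maximum of the 2021-12 column per title:

# A scalar return value per group produces a Series indexed by the group keys:
df.groupby("Title").apply(lambda g: g["2021-12"].max())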

First, let’s look again at describe, called on the GroupBy object:

[6]:
result = grouped_titles.describe()

result
[6]:
2021-12 2022-01 2022-02
count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max
0 2.0 18103.5 17013.696262 6073.0 12088.25 18103.5 24118.75 30134.0 2.0 20505.5 ... 26900.25 33295.0 2.0 13099.0 9265.927261 6547.0 9823.0 13099.0 16375.0 19651.0
1 1.0 4873.0 NaN 4873.0 4873.00 4873.0 4873.00 4873.0 1.0 3930.0 ... 3930.00 3930.0 1.0 2573.0 NaN 2573.0 2573.0 2573.0 2573.0 2573.0
2 2.0 261.0 234.759451 95.0 178.00 261.0 344.00 427.0 2.0 251.0 ... 263.50 276.0 2.0 341.0 260.215295 157.0 249.0 341.0 433.0 525.0

3 rows × 24 columns

When you call a method like describe on a GroupBy object, it is actually just shorthand for:

[7]:
f = lambda x: x.describe()
grouped_titles.apply(f)
[7]:
2021-12 2022-01 2022-02
0 count 2.000000 2.000000 2.000000
mean 18103.500000 20505.500000 13099.000000
std 17013.696262 18087.084356 9265.927261
min 6073.000000 7716.000000 6547.000000
25% 12088.250000 14110.750000 9823.000000
50% 18103.500000 20505.500000 13099.000000
75% 24118.750000 26900.250000 16375.000000
max 30134.000000 33295.000000 19651.000000
1 count 1.000000 1.000000 1.000000
mean 4873.000000 3930.000000 2573.000000
std NaN NaN NaN
min 4873.000000 3930.000000 2573.000000
25% 4873.000000 3930.000000 2573.000000
50% 4873.000000 3930.000000 2573.000000
75% 4873.000000 3930.000000 2573.000000
max 4873.000000 3930.000000 2573.000000
2 count 2.000000 2.000000 2.000000
mean 261.000000 251.000000 341.000000
std 234.759451 35.355339 260.215295
min 95.000000 226.000000 157.000000
25% 178.000000 238.500000 249.000000
50% 261.000000 251.000000 341.000000
75% 344.000000 263.500000 433.000000
max 427.000000 276.000000 525.000000

Suppression of the group keys#

By default, apply gives the resulting object a hierarchical index formed from the group keys together with the indices of the individual parts of the original object (in the examples above, the keys were replaced by integers because of as_index=False). You can disable this extra grouping level by passing group_keys=False to groupby:

[8]:
grouped_lang = df.groupby("Language", group_keys=False)

grouped_lang.apply(top)
[8]:
2021-12 2022-01 2022-02
Title Language
Jupyter Tutorial de 30134.0 33295.0 19651.0
PyViz Tutorial de 4873.0 3930.0 2573.0
Python Basics de 427.0 276.0 525.0
Jupyter Tutorial en 6073.0 7716.0 6547.0
Python Basics en 95.0 226.0 157.0
PyViz Tutorial en NaN NaN NaN

Quantile and bucket analysis#

As described in discretisation and grouping, pandas has some tools, especially cut and qcut, to split data into buckets with bins of your choice or by sample quantiles. Combine these functions with groupby and you can conveniently perform bucket or quantile analysis on a dataset. Consider a simple random data set and a bucket categorisation of equal length with cut:

[9]:
rng = np.random.default_rng()
df2 = pd.DataFrame(
    {
        "data1": rng.normal(size=1000),
        "data2": rng.normal(size=1000)
    }
)

quartiles = pd.cut(df2.data1, 4)

quartiles[:10]
[9]:
0    (-1.38, 0.0424]
1    (-1.38, 0.0424]
2    (0.0424, 1.464]
3    (0.0424, 1.464]
4    (-1.38, 0.0424]
5    (0.0424, 1.464]
6    (-1.38, 0.0424]
7    (-1.38, 0.0424]
8    (-1.38, 0.0424]
9    (-1.38, 0.0424]
Name: data1, dtype: category
Categories (4, interval[float64, right]): [(-2.807, -1.38] < (-1.38, 0.0424] < (0.0424, 1.464] < (1.464, 2.886]]

The Categorical object returned by cut can be passed directly to groupby. So we could calculate a set of group statistics for these buckets as follows:

[10]:
def stats(group):
    return pd.DataFrame(
        {
            "min": group.min(),
            "max": group.max(),
            "count": group.count(),
            "mean": group.mean(),
        }
    )


grouped_quart = df2.groupby(quartiles)

grouped_quart.apply(stats)
[10]:
min max count mean
data1
(-2.807, -1.38] data1 -2.801488 -1.389215 74 -1.791025
data2 -2.435057 2.248734 74 -0.052099
(-1.38, 0.0424] data1 -1.369729 0.042258 452 -0.557848
data2 -2.815141 2.555586 452 0.007063
(0.0424, 1.464] data1 0.043187 1.453391 401 0.661546
data2 -2.680655 3.002200 401 0.055687
(1.464, 2.886] data1 1.471820 2.886283 73 1.832881
data2 -2.108169 3.566785 73 0.145811
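
For simple reductions like min, max, count and mean, a possible alternative without apply is agg with a list of function names; the statistics then appear as a column level rather than a row level:

# Equivalent group statistics via agg; the result is transposed
# relative to the stats variant above:
df2.groupby(quartiles).agg(["min", "max", "count", "mean"])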

These were buckets of equal length; to calculate buckets of equal size based on sample quantiles, we can use qcut instead. We pass labels=False to get only the quantile numbers:

[11]:
quartiles_samp = pd.qcut(df2.data1, 4, labels=False)

grouped_quart_samp = df2.groupby(quartiles_samp)

grouped_quart_samp.apply(stats)
[11]:
min max count mean
data1
0 data1 -2.801488 -0.651380 250 -1.207350
data2 -2.815141 2.555586 250 -0.048382
1 data1 -0.648977 -0.027545 250 -0.332227
data2 -2.529867 2.478078 250 0.069973
2 data1 -0.026572 0.691897 250 0.325120
data2 -2.638173 3.002200 250 0.063670
3 data1 0.692362 2.886283 250 1.272046
data2 -2.680655 3.566785 250 0.043986

Filling missing values with group-specific values#

When cleaning up missing data, in some cases you will remove observations with dropna, but in other cases you may want to fill the null values (NA) with a fixed value or a value derived from the data. fillna is the right tool for this; here, for example, we fill the null values with the mean:

[12]:
s = pd.Series(rng.normal(size=8))
s[::3] = np.nan

s
[12]:
0         NaN
1    0.835698
2   -0.262870
3         NaN
4   -1.345111
5   -0.266797
6         NaN
7    0.550379
dtype: float64
[13]:
s.fillna(s.mean())
[13]:
0   -0.097740
1    0.835698
2   -0.262870
3   -0.097740
4   -1.345111
5   -0.266797
6   -0.097740
7    0.550379
dtype: float64

Suppose you want the fill value to vary by group. Our tutorial data is already divided into German and English editions, so you can predefine a fill value per language; since the groups have an internal name attribute, you can use it with apply:

[14]:
fill_values = {"de": 10632, "en": 3469}

fill_func = lambda g: g.fillna(fill_values[g.name])

df.groupby("Language").apply(fill_func)
[14]:
2021-12 2022-01 2022-02
Language Title Language
de Jupyter Tutorial de 30134.0 33295.0 19651.0
PyViz Tutorial de 4873.0 3930.0 2573.0
Python Basics de 427.0 276.0 525.0
en Jupyter Tutorial en 6073.0 7716.0 6547.0
PyViz Tutorial en 3469.0 3469.0 3469.0
Python Basics en 95.0 226.0 157.0

You can also group the data and use apply with a function that calls fillna on each group:

[15]:
fill_mean = lambda g: g.fillna(g.mean())

df.groupby("Language").apply(fill_mean)
[15]:
2021-12 2022-01 2022-02
Language Title Language
de Jupyter Tutorial de 30134.0 33295.0 19651.0
PyViz Tutorial de 4873.0 3930.0 2573.0
Python Basics de 427.0 276.0 525.0
en Jupyter Tutorial en 6073.0 7716.0 6547.0
PyViz Tutorial en 3084.0 3971.0 3352.0
Python Basics en 95.0 226.0 157.0
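
The same group-wise filling can also be sketched without apply by combining fillna with transform, which broadcasts the group means back to the shape of the original DataFrame; the result then keeps the original index instead of gaining a group-key level:

# transform("mean") returns a DataFrame of group means aligned with df,
# which fillna then uses element-wise:
df.fillna(df.groupby("Language").transform("mean"))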

Group weighted average#

Since operations between columns in a DataFrame or two Series are possible, we can calculate the group-weighted average, for example:

[16]:
df3 = pd.DataFrame(
    {
        "category": ["de", "de", "de", "de", "en", "en", "en", "en"],
        "data": rng.integers(100000, size=8),
        "weights": rng.random(8),
    }
)

df3
[16]:
category data weights
0 de 41970 0.967458
1 de 53639 0.605162
2 de 16329 0.007546
3 de 14668 0.033338
4 en 99258 0.826135
5 en 7727 0.861027
6 en 13388 0.005460
7 en 27957 0.276577

The group average weighted by category would then be:

[17]:
grouped_cat = df3.groupby("category")
get_wavg = lambda g: np.average(g["data"], weights=g["weights"])

grouped_cat.apply(get_wavg)
[17]:
category
de    45662.558991
en    48983.872414
dtype: float64
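
np.average with weights computes sum(data × weights) / sum(weights); written out per group, the equivalent calculation would be:

# The weighted average per group, spelled out without np.average:
grouped_cat.apply(lambda g: (g["data"] * g["weights"]).sum() / g["weights"].sum())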

Correlation#

An interesting task could be to examine how the percentage changes of the monthly columns correlate with each other within each language group.

For this purpose, we first create a function that calculates the pairwise correlation of the 2021-12 column with the other columns:

[18]:
corr = lambda x: x.corrwith(x["2021-12"])

Next, we calculate the percentage changes and drop the rows that contain missing values:

[19]:
pcts = df.pct_change().dropna()

Finally, we group these percentage changes by language and apply our correlation function:

[20]:
grouped_lang = pcts.groupby("Language")

grouped_lang.apply(corr)
[20]:
2021-12 2022-01 2022-02
Language
de 1.0 1.000000 1.00000
en 1.0 0.699088 0.99781
[21]:
grouped_lang.apply(lambda g: g["2021-12"].corr(g["2022-01"]))
[21]:
Language
de    1.000000
en    0.699088
dtype: float64

Performance problems with apply#

apply calls the passed function once for each group (and Series.apply even once for each value), so on large data sets the function may be called very often. Inside the function, slow Python code runs instead of pandas’ fast vectorised operations, unless you use only NumPy or pandas functions there. For example, we previously grouped the data by title and then called our top function with apply. Let’s measure the time for this:

[22]:
%%timeit
grouped_titles.apply(top)
566 µs ± 8.04 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

We can get a comparable result without apply by passing the entire DataFrame directly to our top function:

[23]:
%%timeit
top(df)
43.8 µs ± 693 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

This calculation is almost 13 times faster.
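
If you need the top rows per group rather than of the whole DataFrame, one possible vectorised alternative is to sort once and then take the first n rows of each group:

# Sort once, then take up to five rows per title group:
df.sort_values("2021-12", ascending=False).groupby("Title").head(5)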

Optimising apply with Cython#

It is not always easy to find an alternative for apply. However, operations like our top function can potentially be made faster with Cython. To use Cython in Jupyter, we use the following IPython magic:

[24]:
%load_ext Cython

Then we can define our top function with Cython:

[25]:
%%cython
def top_cy(df, n=5, column="2021-12"):
    return df.sort_values(by=column, ascending=False)[:n]
[26]:
%%timeit
grouped_titles.apply(top_cy)
565 µs ± 7.08 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

We haven’t really gained anything with this yet. A further optimisation possibility would be to declare static types in the Cython code with cpdef. For this, however, we would have to rewrite our function, because a DataFrame cannot be passed as a typed argument.
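
A sketch of what such a typed function could look like, operating on a NumPy array instead of a DataFrame (a hypothetical example, not a drop-in replacement for top):

%%cython
cpdef double cy_sum(double[:] values):
    # Typed C loop over a float64 array
    cdef Py_ssize_t i
    cdef double total = 0.0
    for i in range(values.shape[0]):
        total += values[i]
    return total

It could then be called, for example, with cy_sum(df["2021-12"].dropna().to_numpy()).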