{ "cells": [ { "cell_type": "markdown", "id": "7261fb06", "metadata": {}, "source": [ "# Apply\n", "\n", "The most general `GroupBy` method is `apply`. It splits the object to be processed, calls the passed function on each part and then tries to chain the parts together." ] }, { "cell_type": "markdown", "id": "744d704b", "metadata": {}, "source": [ "Suppose we want to select the five largest `hit` values by group. To do this, we first write a function that selects the rows with the largest values in a particular column:" ] }, { "cell_type": "code", "execution_count": 1, "id": "d764b7b9", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "id": "6344dbd3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2021-122022-012022-02
TitleLanguage
Jupyter Tutorialde30134.033295.019651.0
en6073.07716.06547.0
PyViz Tutorialde4873.03930.02573.0
enNaNNaNNaN
Python Basicsde427.0276.0525.0
en95.0226.0157.0
\n", "
" ], "text/plain": [ " 2021-12 2022-01 2022-02\n", "Title Language \n", "Jupyter Tutorial de 30134.0 33295.0 19651.0\n", " en 6073.0 7716.0 6547.0\n", "PyViz Tutorial de 4873.0 3930.0 2573.0\n", " en NaN NaN NaN\n", "Python Basics de 427.0 276.0 525.0\n", " en 95.0 226.0 157.0" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(\n", " {\n", " \"2021-12\": [30134, 6073, 4873, None, 427, 95],\n", " \"2022-01\": [33295, 7716, 3930, None, 276, 226],\n", " \"2022-02\": [19651, 6547, 2573, None, 525, 157],\n", " },\n", " index=[\n", " [\n", " \"Jupyter Tutorial\",\n", " \"Jupyter Tutorial\",\n", " \"PyViz Tutorial\",\n", " \"PyViz Tutorial\",\n", " \"Python Basics\",\n", " \"Python Basics\",\n", " ],\n", " [\"de\", \"en\", \"de\", \"en\", \"de\", \"en\"],\n", " ],\n", ")\n", "df.index.names = [\"Title\", \"Language\"]\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": 3, "id": "32423696", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2021-122022-012022-02
TitleLanguage
Jupyter Tutorialde30134.033295.019651.0
en6073.07716.06547.0
PyViz Tutorialde4873.03930.02573.0
\n", "
" ], "text/plain": [ " 2021-12 2022-01 2022-02\n", "Title Language \n", "Jupyter Tutorial de 30134.0 33295.0 19651.0\n", " en 6073.0 7716.0 6547.0\n", "PyViz Tutorial de 4873.0 3930.0 2573.0" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def top(df, n=5, column=\"2021-12\"):\n", " return df.sort_values(by=column, ascending=False)[:n]\n", "\n", "\n", "top(df, n=3)" ] }, { "cell_type": "markdown", "id": "f4bfc5a5", "metadata": {}, "source": [ "If we now group by titles, for example, and call `apply` with this function, we get the following:" ] }, { "cell_type": "code", "execution_count": 4, "id": "58ac5a5b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2021-122022-012022-02
TitleLanguage
0Jupyter Tutorialde30134.033295.019651.0
en6073.07716.06547.0
1PyViz Tutorialde4873.03930.02573.0
enNaNNaNNaN
2Python Basicsde427.0276.0525.0
en95.0226.0157.0
\n", "
" ], "text/plain": [ " 2021-12 2022-01 2022-02\n", " Title Language \n", "0 Jupyter Tutorial de 30134.0 33295.0 19651.0\n", " en 6073.0 7716.0 6547.0\n", "1 PyViz Tutorial de 4873.0 3930.0 2573.0\n", " en NaN NaN NaN\n", "2 Python Basics de 427.0 276.0 525.0\n", " en 95.0 226.0 157.0" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped_titles = df.groupby(\"Title\", as_index=False)\n", "\n", "grouped_titles.apply(top)" ] }, { "cell_type": "markdown", "id": "8e959413", "metadata": {}, "source": [ "What happened here? The upper function is called for each row group of the `DataFrame`, and then the results are concatenated with [pandas.concat](https://pandas.pydata.org/docs/reference/api/pandas.concat.html), labelling the parts with the group names. The result therefore has a hierarchical index whose inner level contains index values from the original `DataFrame`." ] }, { "cell_type": "markdown", "id": "275064af", "metadata": {}, "source": [ "If you pass a function to `apply` that takes other arguments or keywords, you can pass them after the function:" ] }, { "cell_type": "code", "execution_count": 5, "id": "a9bba923", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2021-122022-012022-02
TitleLanguage
0Jupyter Tutorialde30134.033295.019651.0
1PyViz Tutorialde4873.03930.02573.0
2Python Basicsde427.0276.0525.0
\n", "
" ], "text/plain": [ " 2021-12 2022-01 2022-02\n", " Title Language \n", "0 Jupyter Tutorial de 30134.0 33295.0 19651.0\n", "1 PyViz Tutorial de 4873.0 3930.0 2573.0\n", "2 Python Basics de 427.0 276.0 525.0" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped_titles = df.groupby(\"Title\", as_index=False)\n", "\n", "grouped_titles.apply(top, n=1)" ] }, { "cell_type": "markdown", "id": "ed60be92", "metadata": {}, "source": [ "We have now seen the basic usage of `apply`. What happens inside the passed function is very versatile and up to you; it only has to return a pandas object or a single value. In the following, we will therefore mainly show examples that can give you ideas on how to solve various problems with `groupby`." ] }, { "cell_type": "markdown", "id": "4cbddfa4", "metadata": {}, "source": [ "First, let’s look again at `describe`, called over the `GroupBy` object:" ] }, { "cell_type": "code", "execution_count": 6, "id": "3a534db6", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2021-122022-012022-02
countmeanstdmin25%50%75%maxcountmean...75%maxcountmeanstdmin25%50%75%max
Title
Jupyter Tutorial2.018103.517013.6962626073.012088.2518103.524118.7530134.02.020505.5...26900.2533295.02.013099.09265.9272616547.09823.013099.016375.019651.0
PyViz Tutorial1.04873.0NaN4873.04873.004873.04873.004873.01.03930.0...3930.003930.01.02573.0NaN2573.02573.02573.02573.02573.0
Python Basics2.0261.0234.75945195.0178.00261.0344.00427.02.0251.0...263.50276.02.0341.0260.215295157.0249.0341.0433.0525.0
\n", "

3 rows × 24 columns

\n", "
" ], "text/plain": [ " 2021-12 \\\n", " count mean std min 25% 50% \n", "Title \n", "Jupyter Tutorial 2.0 18103.5 17013.696262 6073.0 12088.25 18103.5 \n", "PyViz Tutorial 1.0 4873.0 NaN 4873.0 4873.00 4873.0 \n", "Python Basics 2.0 261.0 234.759451 95.0 178.00 261.0 \n", "\n", " 2022-01 ... \\\n", " 75% max count mean ... 75% max \n", "Title ... \n", "Jupyter Tutorial 24118.75 30134.0 2.0 20505.5 ... 26900.25 33295.0 \n", "PyViz Tutorial 4873.00 4873.0 1.0 3930.0 ... 3930.00 3930.0 \n", "Python Basics 344.00 427.0 2.0 251.0 ... 263.50 276.0 \n", "\n", " 2022-02 \\\n", " count mean std min 25% 50% \n", "Title \n", "Jupyter Tutorial 2.0 13099.0 9265.927261 6547.0 9823.0 13099.0 \n", "PyViz Tutorial 1.0 2573.0 NaN 2573.0 2573.0 2573.0 \n", "Python Basics 2.0 341.0 260.215295 157.0 249.0 341.0 \n", "\n", " \n", " 75% max \n", "Title \n", "Jupyter Tutorial 16375.0 19651.0 \n", "PyViz Tutorial 2573.0 2573.0 \n", "Python Basics 433.0 525.0 \n", "\n", "[3 rows x 24 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped_titles = df.groupby(\"Title\")\n", "\n", "result = grouped_titles.describe()\n", "\n", "result" ] }, { "cell_type": "markdown", "id": "4a471ff4", "metadata": {}, "source": [ "When you call a method like `describe` within `GroupBy`, it is actually just an abbreviation for:" ] }, { "cell_type": "code", "execution_count": 7, "id": "b3f40be1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2021-122022-012022-02
Title
Jupyter Tutorialcount2.0000002.0000002.000000
mean18103.50000020505.50000013099.000000
std17013.69626218087.0843569265.927261
min6073.0000007716.0000006547.000000
25%12088.25000014110.7500009823.000000
50%18103.50000020505.50000013099.000000
75%24118.75000026900.25000016375.000000
max30134.00000033295.00000019651.000000
PyViz Tutorialcount1.0000001.0000001.000000
mean4873.0000003930.0000002573.000000
stdNaNNaNNaN
min4873.0000003930.0000002573.000000
25%4873.0000003930.0000002573.000000
50%4873.0000003930.0000002573.000000
75%4873.0000003930.0000002573.000000
max4873.0000003930.0000002573.000000
Python Basicscount2.0000002.0000002.000000
mean261.000000251.000000341.000000
std234.75945135.355339260.215295
min95.000000226.000000157.000000
25%178.000000238.500000249.000000
50%261.000000251.000000341.000000
75%344.000000263.500000433.000000
max427.000000276.000000525.000000
\n", "
" ], "text/plain": [ " 2021-12 2022-01 2022-02\n", "Title \n", "Jupyter Tutorial count 2.000000 2.000000 2.000000\n", " mean 18103.500000 20505.500000 13099.000000\n", " std 17013.696262 18087.084356 9265.927261\n", " min 6073.000000 7716.000000 6547.000000\n", " 25% 12088.250000 14110.750000 9823.000000\n", " 50% 18103.500000 20505.500000 13099.000000\n", " 75% 24118.750000 26900.250000 16375.000000\n", " max 30134.000000 33295.000000 19651.000000\n", "PyViz Tutorial count 1.000000 1.000000 1.000000\n", " mean 4873.000000 3930.000000 2573.000000\n", " std NaN NaN NaN\n", " min 4873.000000 3930.000000 2573.000000\n", " 25% 4873.000000 3930.000000 2573.000000\n", " 50% 4873.000000 3930.000000 2573.000000\n", " 75% 4873.000000 3930.000000 2573.000000\n", " max 4873.000000 3930.000000 2573.000000\n", "Python Basics count 2.000000 2.000000 2.000000\n", " mean 261.000000 251.000000 341.000000\n", " std 234.759451 35.355339 260.215295\n", " min 95.000000 226.000000 157.000000\n", " 25% 178.000000 238.500000 249.000000\n", " 50% 261.000000 251.000000 341.000000\n", " 75% 344.000000 263.500000 433.000000\n", " max 427.000000 276.000000 525.000000" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f = lambda x: x.describe()\n", "grouped_titles.apply(f)" ] }, { "cell_type": "markdown", "id": "90a07bfc", "metadata": {}, "source": [ "## Suppression of the group keys\n", "\n", "In the previous examples, you saw that the resulting object has a hierarchical index formed by the group keys together with the indices of the individual parts of the original object. You can disable this by passing `group_keys=False` to `groupby`:" ] }, { "cell_type": "code", "execution_count": 8, "id": "0f51bcb0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2021-122022-012022-02
TitleLanguage
Jupyter Tutorialde30134.033295.019651.0
PyViz Tutorialde4873.03930.02573.0
Python Basicsde427.0276.0525.0
Jupyter Tutorialen6073.07716.06547.0
Python Basicsen95.0226.0157.0
PyViz TutorialenNaNNaNNaN
\n", "
" ], "text/plain": [ " 2021-12 2022-01 2022-02\n", "Title Language \n", "Jupyter Tutorial de 30134.0 33295.0 19651.0\n", "PyViz Tutorial de 4873.0 3930.0 2573.0\n", "Python Basics de 427.0 276.0 525.0\n", "Jupyter Tutorial en 6073.0 7716.0 6547.0\n", "Python Basics en 95.0 226.0 157.0\n", "PyViz Tutorial en NaN NaN NaN" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped_lang = df.groupby(\"Language\", group_keys=False)\n", "\n", "grouped_lang.apply(top)" ] }, { "cell_type": "markdown", "id": "2131c5cf", "metadata": {}, "source": [ "## Quantile and bucket analysis\n", "\n", "As described in [discretisation and grouping](discretisation.ipynb), pandas has some tools, especially `cut` and `qcut`, to split data into buckets with bins of your choice or by sample quantiles. Combine these functions with `groupby` and you can conveniently perform bucket or quantile analysis on a dataset. Consider a simple random data set and a bucket categorisation of equal length with `cut`:" ] }, { "cell_type": "code", "execution_count": 9, "id": "8b54db53", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 (-1.985, -0.309]\n", "1 (-3.667, -1.985]\n", "2 (-1.985, -0.309]\n", "3 (-1.985, -0.309]\n", "4 (-1.985, -0.309]\n", "5 (-0.309, 1.367]\n", "6 (-1.985, -0.309]\n", "7 (1.367, 3.043]\n", "8 (-1.985, -0.309]\n", "9 (-0.309, 1.367]\n", "Name: data1, dtype: category\n", "Categories (4, interval[float64, right]): [(-3.667, -1.985] < (-1.985, -0.309] < (-0.309, 1.367] < (1.367, 3.043]]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rng = np.random.default_rng()\n", "df2 = pd.DataFrame(\n", " {\n", " \"data1\": rng.normal(size=1000),\n", " \"data2\": rng.normal(size=1000)\n", " }\n", ")\n", "\n", "quartiles = pd.cut(df2.data1, 4)\n", "\n", "quartiles[:10]" ] }, { "cell_type": "markdown", "id": "ab66f0c8", "metadata": {}, "source": [ "The `category` object returned by `cut` can be passed directly to `groupby`. So we could calculate a set of group statistics for the quartiles as follows:" ] }, { "cell_type": "code", "execution_count": 10, "id": "6157be57", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
minmaxcountmean
data1
(-3.667, -1.985]data1-3.660514-1.99739631-2.422460
data2-1.7290692.380827310.193483
(-1.985, -0.309]data1-1.974745-0.310654356-0.941064
data2-2.7721653.0499643560.006150
(-0.309, 1.367]data1-0.3066601.3664185250.398290
data2-3.6782673.307415525-0.013217
(1.367, 3.043]data11.3692703.042972881.813557
data2-2.8854122.047945880.045750
\n", "
" ], "text/plain": [ " min max count mean\n", "data1 \n", "(-3.667, -1.985] data1 -3.660514 -1.997396 31 -2.422460\n", " data2 -1.729069 2.380827 31 0.193483\n", "(-1.985, -0.309] data1 -1.974745 -0.310654 356 -0.941064\n", " data2 -2.772165 3.049964 356 0.006150\n", "(-0.309, 1.367] data1 -0.306660 1.366418 525 0.398290\n", " data2 -3.678267 3.307415 525 -0.013217\n", "(1.367, 3.043] data1 1.369270 3.042972 88 1.813557\n", " data2 -2.885412 2.047945 88 0.045750" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def stats(group):\n", " return pd.DataFrame(\n", " {\n", " \"min\": group.min(),\n", " \"max\": group.max(),\n", " \"count\": group.count(),\n", " \"mean\": group.mean(),\n", " }\n", " )\n", "\n", "\n", "grouped_quart = df2.groupby(quartiles, observed=False)\n", "\n", "grouped_quart.apply(stats)" ] }, { "cell_type": "markdown", "id": "a000b84b", "metadata": {}, "source": [ "These were buckets of equal length; to calculate buckets of equal size based on sample quantiles, we can use `qcut`. I pass `labels=False` to get only quantile numbers:" ] }, { "cell_type": "code", "execution_count": 11, "id": "70da8f32", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
minmaxcountmean
data1
0data1-3.660514-0.743539250-1.365801
data2-2.7721652.830985250-0.040007
1data1-0.742822-0.019690250-0.345565
data2-2.2099693.0499642500.146497
2data1-0.0179010.6276292500.301905
data2-2.7646953.307415250-0.077263
3data10.6287143.0429722501.243782
data2-3.6782672.260646250-0.008129
\n", "
" ], "text/plain": [ " min max count mean\n", "data1 \n", "0 data1 -3.660514 -0.743539 250 -1.365801\n", " data2 -2.772165 2.830985 250 -0.040007\n", "1 data1 -0.742822 -0.019690 250 -0.345565\n", " data2 -2.209969 3.049964 250 0.146497\n", "2 data1 -0.017901 0.627629 250 0.301905\n", " data2 -2.764695 3.307415 250 -0.077263\n", "3 data1 0.628714 3.042972 250 1.243782\n", " data2 -3.678267 2.260646 250 -0.008129" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "quartiles_samp = pd.qcut(df2.data1, 4, labels=False)\n", "\n", "grouped_quart_samp = df2.groupby(quartiles_samp)\n", "\n", "grouped_quart_samp.apply(stats)" ] }, { "cell_type": "markdown", "id": "d4e7e305", "metadata": {}, "source": [ "## Populating data with group-specific values\n", "\n", "When cleaning missing data, in some cases you will replace data observations with `dropna`, but in other cases you may want to fill the null values (`NA`) with a fixed value or a value derived from the data. `fillna` is the right tool for this; here, for example, I fill the null values with the mean:" ] }, { "cell_type": "code", "execution_count": 12, "id": "812aebbf", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 NaN\n", "1 0.682415\n", "2 -0.463491\n", "3 NaN\n", "4 0.397419\n", "5 0.607853\n", "6 NaN\n", "7 0.060891\n", "dtype: float64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = pd.Series(rng.normal(size=8))\n", "s[::3] = np.nan\n", "\n", "s" ] }, { "cell_type": "code", "execution_count": 13, "id": "627d58df", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0.257018\n", "1 0.682415\n", "2 -0.463491\n", "3 0.257018\n", "4 0.397419\n", "5 0.607853\n", "6 0.257018\n", "7 0.060891\n", "dtype: float64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.fillna(s.mean())" ] }, { "cell_type": "markdown", "id": "f2a06552", "metadata": {}, "source": [ "Here are some sample data for my tutorials, divided into German and English editions:" ] }, { "cell_type": "markdown", "id": "88c3bb07", "metadata": {}, "source": [ "Suppose you want the fill value to vary by group. These values can be predefined, and since the groups have an internal `name` attribute, you can use this with `apply`:" ] }, { "cell_type": "code", "execution_count": 14, "id": "3f29009c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2021-122022-012022-02
LanguageTitleLanguage
deJupyter Tutorialde30134.033295.019651.0
PyViz Tutorialde4873.03930.02573.0
Python Basicsde427.0276.0525.0
enJupyter Tutorialen6073.07716.06547.0
PyViz Tutorialen3469.03469.03469.0
Python Basicsen95.0226.0157.0
\n", "
" ], "text/plain": [ " 2021-12 2022-01 2022-02\n", "Language Title Language \n", "de Jupyter Tutorial de 30134.0 33295.0 19651.0\n", " PyViz Tutorial de 4873.0 3930.0 2573.0\n", " Python Basics de 427.0 276.0 525.0\n", "en Jupyter Tutorial en 6073.0 7716.0 6547.0\n", " PyViz Tutorial en 3469.0 3469.0 3469.0\n", " Python Basics en 95.0 226.0 157.0" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_values = {\"de\": 10632, \"en\": 3469}\n", "\n", "fill_func = lambda g: g.fillna(fill_values[g.name])\n", "\n", "df.groupby(\"Language\").apply(fill_func)" ] }, { "cell_type": "markdown", "id": "aa803757", "metadata": {}, "source": [ "You can also group the data and use `apply` with a function that calls `fillna` for each data packet:" ] }, { "cell_type": "code", "execution_count": 15, "id": "c737aa0b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2021-122022-012022-02
LanguageTitleLanguage
deJupyter Tutorialde30134.033295.019651.0
PyViz Tutorialde4873.03930.02573.0
Python Basicsde427.0276.0525.0
enJupyter Tutorialen6073.07716.06547.0
PyViz Tutorialen3084.03971.03352.0
Python Basicsen95.0226.0157.0
\n", "
" ], "text/plain": [ " 2021-12 2022-01 2022-02\n", "Language Title Language \n", "de Jupyter Tutorial de 30134.0 33295.0 19651.0\n", " PyViz Tutorial de 4873.0 3930.0 2573.0\n", " Python Basics de 427.0 276.0 525.0\n", "en Jupyter Tutorial en 6073.0 7716.0 6547.0\n", " PyViz Tutorial en 3084.0 3971.0 3352.0\n", " Python Basics en 95.0 226.0 157.0" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fill_mean = lambda g: g.fillna(g.mean())\n", "\n", "df.groupby(\"Language\").apply(fill_mean)" ] }, { "cell_type": "markdown", "id": "8c569fb0", "metadata": {}, "source": [ "## Group weighted average\n", "\n", "Since operations between columns in a `DataFrame` or two `Series` are possible, we can calculate the group-weighted average, for example:" ] }, { "cell_type": "code", "execution_count": 16, "id": "7af3e1cc", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
categorydataweights
0de303540.283774
1de284080.416312
2de216300.128192
3de759740.006033
4en143340.391982
5en561440.028325
6en461080.326988
7en337590.714362
\n", "
" ], "text/plain": [ " category data weights\n", "0 de 30354 0.283774\n", "1 de 28408 0.416312\n", "2 de 21630 0.128192\n", "3 de 75974 0.006033\n", "4 en 14334 0.391982\n", "5 en 56144 0.028325\n", "6 en 46108 0.326988\n", "7 en 33759 0.714362" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df3 = pd.DataFrame(\n", " {\n", " \"category\": [\"de\", \"de\", \"de\", \"de\", \"en\", \"en\", \"en\", \"en\"],\n", " \"data\": np.random.randint(100000, size=8),\n", " \"weights\": np.random.rand(8),\n", " }\n", ")\n", "\n", "df3" ] }, { "cell_type": "markdown", "id": "d43fc161", "metadata": {}, "source": [ "The group average weighted by category would then be:" ] }, { "cell_type": "code", "execution_count": 17, "id": "733ad0dc", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "category\n", "de 28372.421657\n", "en 31746.068449\n", "dtype: float64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped_cat = df3.groupby(\"category\")\n", "get_wavg = lambda g: np.average(g[\"data\"], weights=g[\"weights\"])\n", "\n", "grouped_cat.apply(get_wavg, include_groups=False)" ] }, { "cell_type": "markdown", "id": "f6ae93b7", "metadata": {}, "source": [ "## Correlation" ] }, { "cell_type": "markdown", "id": "ab9453bf", "metadata": {}, "source": [ "An interesting task could be to calculate a `DataFrame` consisting of the percentage changes." ] }, { "cell_type": "markdown", "id": "c4793e0b", "metadata": {}, "source": [ "For this purpose, we first create a function that calculates the pairwise correlation of the `2021-12` column with the subsequent columns:" ] }, { "cell_type": "code", "execution_count": 18, "id": "9964169b", "metadata": {}, "outputs": [], "source": [ "corr = lambda x: x.corrwith(x[\"2021-12\"])" ] }, { "cell_type": "markdown", "id": "6e3c83be", "metadata": {}, "source": [ "Next, we calculate the percentage change:" ] }, { "cell_type": "code", "execution_count": 19, "id": "383eedf7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2021-122022-012022-02
TitleLanguage
Jupyter Tutorialen-0.798467-0.768253-0.666836
PyViz Tutorialde-0.197596-0.490669-0.606996
en0.0000000.0000000.000000
Python Basicsde-0.912374-0.929771-0.795958
en-0.777518-0.181159-0.700952
\n", "
" ], "text/plain": [ " 2021-12 2022-01 2022-02\n", "Title Language \n", "Jupyter Tutorial en -0.798467 -0.768253 -0.666836\n", "PyViz Tutorial de -0.197596 -0.490669 -0.606996\n", " en 0.000000 0.000000 0.000000\n", "Python Basics de -0.912374 -0.929771 -0.795958\n", " en -0.777518 -0.181159 -0.700952" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pcts = df.pct_change().dropna()\n", "\n", "pcts" ] }, { "cell_type": "markdown", "id": "f70c60fb", "metadata": {}, "source": [ "Finally, we group these percentage changes by year, which can be extracted from each row label with a one-line function that returns the year attribute of each date label:" ] }, { "cell_type": "code", "execution_count": 20, "id": "c4e9b858", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2021-122022-012022-02
Language
de1.01.0000001.00000
en1.00.6990880.99781
\n", "
" ], "text/plain": [ " 2021-12 2022-01 2022-02\n", "Language \n", "de 1.0 1.000000 1.00000\n", "en 1.0 0.699088 0.99781" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped_lang = pcts.groupby(\"Language\")\n", "\n", "grouped_lang.apply(corr)" ] }, { "cell_type": "code", "execution_count": 21, "id": "19097802", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Language\n", "de 1.000000\n", "en 0.699088\n", "dtype: float64" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped_lang.apply(lambda g: g[\"2021-12\"].corr(g[\"2022-01\"]))" ] }, { "cell_type": "markdown", "id": "0eadd87e", "metadata": {}, "source": [ "## Performance problems with `apply`\n", "\n", "Since the `apply` method typically acts on each individual value in a `Series`, the function is called once for each value. If you have thousands of values, the function will be called thousands of times. This ignores the fast vectorisations of pandas unless you are using NumPy functions and slow Python is used. For example, we previously grouped the data by title and then called our `top` method with `apply`. Let’s measure the time for this:" ] }, { "cell_type": "code", "execution_count": 22, "id": "4a1d44c9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "422 μs ± 14.5 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)\n" ] } ], "source": [ "%%timeit\n", "grouped_titles.apply(top)" ] }, { "cell_type": "markdown", "id": "eec0e168", "metadata": {}, "source": [ "We can get the same result without applying by passing the `DataFrame` to our `top` method:" ] }, { "cell_type": "code", "execution_count": 23, "id": "b7d1f960", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "39.2 μs ± 586 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)\n" ] } ], "source": [ "%%timeit\n", "top(df)" ] }, { "cell_type": "markdown", "id": "406badf8", "metadata": {}, "source": [ "This calculation is 18 times faster." ] }, { "cell_type": "markdown", "id": "f1020ebb", "metadata": {}, "source": [ "## Optimising `apply` with Cython\n", "\n", "It is not always easy to find an alternative for `apply`. However, numerical operations like our `top` method can be made faster with [Cython](https://cython.org/). To use Cython in Jupyyter, we use the following [IPython magic](../ipython/magics.ipynb):" ] }, { "cell_type": "code", "execution_count": 24, "id": "a596fc35", "metadata": {}, "outputs": [], "source": [ "%load_ext Cython" ] }, { "cell_type": "markdown", "id": "cb4b4b6c", "metadata": {}, "source": [ "Then we can define our `top` function with Cython:" ] }, { "cell_type": "code", "execution_count": 25, "id": "f28ec4a1", "metadata": {}, "outputs": [], "source": [ "%%cython\n", "def top_cy(df, n=5, column=\"2021-12\"):\n", " return df.sort_values(by=column, ascending=False)[:n]" ] }, { "cell_type": "code", "execution_count": 26, "id": "1e95211e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "472 μs ± 27.6 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)\n" ] } ], "source": [ "%%timeit\n", "grouped_titles.apply(top_cy)" ] }, { "cell_type": "markdown", "id": "c57cd1fa", "metadata": {}, "source": [ "We haven’t really gained much with this yet. Further optimisation possibilities would be to define the type in the Cython code with `cpdef`. For this, however, we would have to modify our method, because then no `DataFrame` can be passed." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.13 Kernel", "language": "python", "name": "python313" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.0" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }