{ "cells": [ { "cell_type": "markdown", "id": "00236537", "metadata": {}, "source": [ "# Group operations\n", "\n", "By `groupby` is meant a process that involves one or more of the following steps:\n", "\n", "* **Split** divides the data into groups according to certain criteria\n", "* **Apply** applies a function independently to each group\n", "* **Combine** combines the results in a data structure\n", "\n", "In the first phase of the process, the data contained in a pandas object, be it a `Series`, a `DataFrame` or something else, is split into groups based on one or more keys. The division is done on a particular axis of an object. For example, a DataFrame can be grouped by its rows (`axis=0`) or its columns (`axis=1`). Then, a function is applied to each group to create a new value. Finally, the results of all these function applications are combined in a result object. The shape of the result object usually depends on what is done with the data.\n", "\n", "Each grouping key can take many forms, and the keys do not all have to be of the same type:\n", "\n", "* a list or array of values that have the same length as the axis being grouped\n", "* a value that specifies a column name in a DataFrame\n", "* a dict or series that is a correspondence between the values on the axis being grouped and the group names\n", "* a function that is called on the axis index or the individual labels in the index\n", "\n", "
\n", "\n", "**Note:**\n", "\n", "The latter three methods are shortcuts to create an array of values that will be used to divide the object.\n", "\n", "Don't worry if this all seems abstract. Throughout this chapter I will give many examples of all these methods. For starters, here is a small table dataset as a DataFrame:\n", "
" ] }, { "cell_type": "code", "execution_count": 1, "id": "7f8d4433", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "id": "653a5337", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TitleLanguage2021-122022-012022-02
0Jupyter Tutorialde19651.030134.033295.0
1Jupyter Tutorialen4722.03497.04009.0
2PyViz Tutorialde2573.04873.03930.0
3NoneNoneNaNNaNNaN
4Python Basicsde525.0427.0276.0
5Python Basicsen157.085.0226.0
\n", "
" ], "text/plain": [ " Title Language 2021-12 2022-01 2022-02\n", "0 Jupyter Tutorial de 19651.0 30134.0 33295.0\n", "1 Jupyter Tutorial en 4722.0 3497.0 4009.0\n", "2 PyViz Tutorial de 2573.0 4873.0 3930.0\n", "3 None None NaN NaN NaN\n", "4 Python Basics de 525.0 427.0 276.0\n", "5 Python Basics en 157.0 85.0 226.0" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(\n", " {\n", " \"Title\": [\n", " \"Jupyter Tutorial\",\n", " \"Jupyter Tutorial\",\n", " \"PyViz Tutorial\",\n", " None,\n", " \"Python Basics\",\n", " \"Python Basics\",\n", " ],\n", " \"Language\": [\"de\", \"en\", \"de\", None, \"de\", \"en\"],\n", " \"2021-12\": [19651, 4722, 2573, None, 525, 157],\n", " \"2022-01\": [30134, 3497, 4873, None, 427, 85],\n", " \"2022-02\": [33295, 4009, 3930, None, 276, 226],\n", " }\n", ")\n", "\n", "df" ] }, { "cell_type": "markdown", "id": "793372a4", "metadata": {}, "source": [ "Suppose you want to calculate the sum of column `02/2022` using the labels of Title. There are several ways to do this. One is to access `02/2022` and call `groupby` with the column (a `Series`) in `Title`:" ] }, { "cell_type": "code", "execution_count": 3, "id": "8b4b9dbb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped = df[\"2022-02\"].groupby(df[\"Title\"])\n", "\n", "grouped" ] }, { "cell_type": "markdown", "id": "14141021", "metadata": {}, "source": [ "This `grouped` variable is now a special `SeriesGroupBy` object. It has not yet calculated anything except some intermediate data about the group key `df['Title']`. The idea is that this object has all the information needed to apply an operation to each of the groups. For example, to calculate the group averages, we can call the `sum` method of the `GroupBy` object:" ] }, { "cell_type": "code", "execution_count": 4, "id": "bab6893d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Title\n", "Jupyter Tutorial 37304.0\n", "PyViz Tutorial 3930.0\n", "Python Basics 502.0\n", "Name: 2022-02, dtype: float64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped.sum()" ] }, { "cell_type": "markdown", "id": "10512d81", "metadata": {}, "source": [ "Later I will explain more about what happens when you call `.sum()`. The important thing to note here is that the data (a row) has been aggregated by splitting the data across the group key, creating a new row that is now indexed by the unique values in the `Title` column. The resulting index is `Title` because `groupby(df['Title']` did this." ] }, { "cell_type": "markdown", "id": "9a22d532", "metadata": {}, "source": [ "If we had passed multiple arrays as a list instead, we would get something different:" ] }, { "cell_type": "code", "execution_count": 5, "id": "4a117a7a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Language Title \n", "de Jupyter Tutorial 19651.0\n", " PyViz Tutorial 2573.0\n", " Python Basics 525.0\n", "en Jupyter Tutorial 4722.0\n", " Python Basics 157.0\n", "Name: 2021-12, dtype: float64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sums = df[\"2021-12\"].groupby([df[\"Language\"], df[\"Title\"]]).sum()\n", "\n", "sums" ] }, { "cell_type": "markdown", "id": "8d6b3003", "metadata": {}, "source": [ "Here we have grouped the data based on two keys, and the resulting series now has a hierarchical index consisting of the observed unique key pairs:" ] }, { "cell_type": "code", "execution_count": 6, "id": "55a47106", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TitleJupyter TutorialPyViz TutorialPython Basics
Language
de19651.02573.0525.0
en4722.0NaN157.0
\n", "
" ], "text/plain": [ "Title Jupyter Tutorial PyViz Tutorial Python Basics\n", "Language \n", "de 19651.0 2573.0 525.0\n", "en 4722.0 NaN 157.0" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sums.unstack()" ] }, { "cell_type": "markdown", "id": "587c25bf", "metadata": {}, "source": [ "Often the grouping information is in the same DataFrame as the data you want to edit. In this case, you can pass column names (whether they are strings, numbers or other Python objects) as group keys:" ] }, { "cell_type": "code", "execution_count": 7, "id": "5b121b4d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Language2021-122022-012022-02
Title
Jupyter Tutorialdeen24373.033631.037304.0
PyViz Tutorialde2573.04873.03930.0
Python Basicsdeen682.0512.0502.0
\n", "
" ], "text/plain": [ " Language 2021-12 2022-01 2022-02\n", "Title \n", "Jupyter Tutorial deen 24373.0 33631.0 37304.0\n", "PyViz Tutorial de 2573.0 4873.0 3930.0\n", "Python Basics deen 682.0 512.0 502.0" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(\"Title\").sum()" ] }, { "cell_type": "markdown", "id": "43b62319", "metadata": {}, "source": [ "Here it is noticeable that the result does not contain a `Language` column. Since `df['Language']` is not numeric data, it interferes with the table layout and is therefore automatically excluded from the result. By default, all numeric columns are aggregated." ] }, { "cell_type": "code", "execution_count": 8, "id": "ad498488", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2021-122022-012022-02
TitleLanguage
Jupyter Tutorialde19651.030134.033295.0
en4722.03497.04009.0
PyViz Tutorialde2573.04873.03930.0
Python Basicsde525.0427.0276.0
en157.085.0226.0
\n", "
" ], "text/plain": [ " 2021-12 2022-01 2022-02\n", "Title Language \n", "Jupyter Tutorial de 19651.0 30134.0 33295.0\n", " en 4722.0 3497.0 4009.0\n", "PyViz Tutorial de 2573.0 4873.0 3930.0\n", "Python Basics de 525.0 427.0 276.0\n", " en 157.0 85.0 226.0" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby([\"Title\",\"Language\"]).sum()" ] }, { "cell_type": "markdown", "id": "0db3a462", "metadata": {}, "source": [ "Regardless of the goal of using `groupby`, a generally useful `groupby` method is `size`, which returns a series with the group sizes:" ] }, { "cell_type": "code", "execution_count": 9, "id": "491272d6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Language\n", "de 3\n", "en 2\n", "dtype: int64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby([\"Language\"]).size()" ] }, { "cell_type": "markdown", "id": "977da668", "metadata": {}, "source": [ "
\n", "\n", "**Note:**\n", "\n", "All missing values in a group key are excluded from the result by default. This behaviour can be disabled by passing `dropna=False` to `groupby`:\n", "
" ] }, { "cell_type": "code", "execution_count": 10, "id": "2d4df323", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Language\n", "de 3\n", "en 2\n", "NaN 1\n", "dtype: int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(\"Language\", dropna=False).size()" ] }, { "cell_type": "code", "execution_count": 11, "id": "cd057369", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Title Language\n", "Jupyter Tutorial de 1\n", " en 1\n", "PyViz Tutorial de 1\n", "Python Basics de 1\n", " en 1\n", "NaN NaN 1\n", "dtype: int64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby([\"Title\", \"Language\"], dropna=False).size()" ] }, { "cell_type": "markdown", "id": "4fa74c34", "metadata": {}, "source": [ "## Iteration over groups\n", "\n", "The object returned by `groupby` supports iteration and produces a sequence of 2-tuples containing the group name along with the data packet. Consider the following:" ] }, { "cell_type": "code", "execution_count": 12, "id": "279606c0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Jupyter Tutorial\n", " Title Language 2021-12 2022-01 2022-02\n", "0 Jupyter Tutorial de 19651.0 30134.0 33295.0\n", "1 Jupyter Tutorial en 4722.0 3497.0 4009.0\n", "PyViz Tutorial\n", " Title Language 2021-12 2022-01 2022-02\n", "2 PyViz Tutorial de 2573.0 4873.0 3930.0\n", "Python Basics\n", " Title Language 2021-12 2022-01 2022-02\n", "4 Python Basics de 525.0 427.0 276.0\n", "5 Python Basics en 157.0 85.0 226.0\n" ] } ], "source": [ "for name, group in df.groupby(\"Title\"):\n", " print(name)\n", " print(group)" ] }, { "cell_type": "markdown", "id": "3944439c", "metadata": {}, "source": [ "With multiple keys, the first element of the tuple is a tuple of key values:" ] }, { "cell_type": "code", "execution_count": 13, "id": "c8778dcd", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('Jupyter Tutorial', 'de')\n", " Title Language 2021-12 2022-01 2022-02\n", "0 Jupyter Tutorial de 19651.0 30134.0 33295.0\n", "('Jupyter Tutorial', 'en')\n", " Title Language 2021-12 2022-01 2022-02\n", "1 Jupyter Tutorial en 4722.0 3497.0 4009.0\n", "('PyViz Tutorial', 'de')\n", " Title Language 2021-12 2022-01 2022-02\n", "2 PyViz Tutorial de 2573.0 4873.0 3930.0\n", "('Python Basics', 'de')\n", " Title Language 2021-12 2022-01 2022-02\n", "4 Python Basics de 525.0 427.0 276.0\n", "('Python Basics', 'en')\n", " Title Language 2021-12 2022-01 2022-02\n", "5 Python Basics en 157.0 85.0 226.0\n" ] } ], "source": [ "for (i1, i2), group in df.groupby([\"Title\", \"Language\"]):\n", " print((i1, i2))\n", " print(group)" ] }, { "cell_type": "markdown", "id": "4e84aa84", "metadata": {}, "source": [ "Next, we want to output a `dict` of the data as a one-liner:" ] }, { "cell_type": "code", "execution_count": 14, "id": "d4c85b5a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'Jupyter Tutorial': Title Language 2021-12 2022-01 2022-02\n", " 0 Jupyter Tutorial de 19651.0 30134.0 33295.0\n", " 1 Jupyter Tutorial en 4722.0 3497.0 4009.0,\n", " 'PyViz Tutorial': Title Language 2021-12 2022-01 2022-02\n", " 2 PyViz Tutorial de 2573.0 4873.0 3930.0,\n", " 'Python Basics': Title Language 2021-12 2022-01 2022-02\n", " 4 Python Basics de 525.0 427.0 276.0\n", " 5 Python Basics en 157.0 85.0 226.0}" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "books = dict(list(df.groupby(\"Title\")))\n", "\n", "books" ] }, { "cell_type": "markdown", "id": "b64c82f4", "metadata": {}, "source": [ "By default, `groupby` groups on `axis=0`, but you can also group on any of the other axes. For example, we could group the columns of our example `df` here by `dtype` as follows:" ] }, { "cell_type": "code", "execution_count": 15, "id": "b3ffcaba", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Title object\n", "Language object\n", "2021-12 float64\n", "2022-01 float64\n", "2022-02 float64\n", "dtype: object" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dtypes" ] }, { "cell_type": "code", "execution_count": 16, "id": "f15cb9fc", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/hk/s8m0bblj0g10hw885gld52mc0000gn/T/ipykernel_6958/521057107.py:1: FutureWarning: DataFrame.groupby with axis=1 is deprecated. Do `frame.T.groupby(...)` without axis instead.\n", " grouped = df.groupby(df.dtypes, axis=1)\n" ] } ], "source": [ "grouped = df.groupby(df.dtypes, axis=1)" ] }, { "cell_type": "code", "execution_count": 17, "id": "9a12d376", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "float64\n", " 2021-12 2022-01 2022-02\n", "0 19651.0 30134.0 33295.0\n", "1 4722.0 3497.0 4009.0\n", "2 2573.0 4873.0 3930.0\n", "3 NaN NaN NaN\n", "4 525.0 427.0 276.0\n", "5 157.0 85.0 226.0\n", "object\n", " Title Language\n", "0 Jupyter Tutorial de\n", "1 Jupyter Tutorial en\n", "2 PyViz Tutorial de\n", "3 None None\n", "4 Python Basics de\n", "5 Python Basics en\n" ] } ], "source": [ "for dtype, group in grouped:\n", " print(dtype)\n", " print(group)" ] }, { "cell_type": "markdown", "id": "f351b3fb", "metadata": {}, "source": [ "## Selecting a column or subset of columns\n", "\n", "Indexing a `GroupBy` object created from a DataFrame with a column name or an array of column names has the effect of subdividing columns for aggregation. This means that:" ] }, { "cell_type": "code", "execution_count": 18, "id": "7278c1ad", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(\"Title\")[\"2021-12\"]\n", "df.groupby(\"Title\")[[\"2022-01\"]]" ] }, { "cell_type": "markdown", "id": "f7982339", "metadata": {}, "source": [ "are simplified spellings for:" ] }, { "cell_type": "code", "execution_count": 19, "id": "cc63848f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"2021-12\"].groupby(df[\"Title\"])\n", "df[[\"2022-01\"]].groupby(df[\"Title\"])" ] }, { "cell_type": "markdown", "id": "473b9f20", "metadata": {}, "source": [ "Especially for large datasets, it may be desirable to aggregate only some columns. For example, to calculate the sum for only column `01/2022` in the previous dataset and get the result as a DataFrame, we could write:" ] }, { "cell_type": "code", "execution_count": 20, "id": "5cdb6ba9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2022-01
TitleLanguage
Jupyter Tutorialde30134.0
en3497.0
PyViz Tutorialde4873.0
Python Basicsde427.0
en85.0
\n", "
" ], "text/plain": [ " 2022-01\n", "Title Language \n", "Jupyter Tutorial de 30134.0\n", " en 3497.0\n", "PyViz Tutorial de 4873.0\n", "Python Basics de 427.0\n", " en 85.0" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby([\"Title\", \"Language\"])[[\"2022-01\"]].sum()" ] }, { "cell_type": "markdown", "id": "3b765a64", "metadata": {}, "source": [ "The object returned by this indexing operation is a grouped DataFrame if a list or array is passed, or a grouped series if only a single column name is passed as a scalar:" ] }, { "cell_type": "code", "execution_count": 21, "id": "99c22f19", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "series_grouped = df.groupby([\"Title\", \"Language\"])[\"2022-01\"]\n", "\n", "series_grouped" ] }, { "cell_type": "code", "execution_count": 22, "id": "3ac020d4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Title Language\n", "Jupyter Tutorial de 30134.0\n", " en 3497.0\n", "PyViz Tutorial de 4873.0\n", "Python Basics de 427.0\n", " en 85.0\n", "Name: 2022-01, dtype: float64" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "series_grouped.sum()" ] }, { "cell_type": "markdown", "id": "3cbabf46", "metadata": {}, "source": [ "## Grouping with `dicts` and `series`\n", "\n", "Grouping information can also be in a form other than an array:" ] }, { "cell_type": "code", "execution_count": 23, "id": "e5140e19", "metadata": {}, "outputs": [], "source": [ "df.iloc[2:3, [2, 3]] = np.nan" ] }, { "cell_type": "markdown", "id": "09dfaa32", "metadata": {}, "source": [ "Suppose I have a group correspondence for the columns and want to group the columns together by group:" ] }, { "cell_type": "code", "execution_count": 24, "id": "4a252010", "metadata": {}, "outputs": [], "source": [ "mapping = {\"2021-12\": \"Dec 2021\", \n", " \"2022-01\": \"Jan 2022\",\n", " \"2022-02\": \"Feb 2022\"}" ] }, { "cell_type": "markdown", "id": "575135cd", "metadata": {}, "source": [ "Now an array could be constructed from this `dict` to pass to `groupby`, but instead we can just pass the `dict`:" ] }, { "cell_type": "code", "execution_count": 25, "id": "49c9e3b7", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/hk/s8m0bblj0g10hw885gld52mc0000gn/T/ipykernel_6958/2568016302.py:1: FutureWarning: DataFrame.groupby with axis=1 is deprecated. Do `frame.T.groupby(...)` without axis instead.\n", " by_column = df.groupby(mapping, axis=1)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Dec 2021Feb 2022Jan 2022
019651.033295.030134.0
14722.04009.03497.0
20.03930.00.0
30.00.00.0
4525.0276.0427.0
5157.0226.085.0
\n", "
" ], "text/plain": [ " Dec 2021 Feb 2022 Jan 2022\n", "0 19651.0 33295.0 30134.0\n", "1 4722.0 4009.0 3497.0\n", "2 0.0 3930.0 0.0\n", "3 0.0 0.0 0.0\n", "4 525.0 276.0 427.0\n", "5 157.0 226.0 85.0" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "by_column = df.groupby(mapping, axis=1)\n", "\n", "by_column.sum()" ] }, { "cell_type": "markdown", "id": "66651487", "metadata": {}, "source": [ "The same functionality applies to `Series`:" ] }, { "cell_type": "code", "execution_count": 26, "id": "eb0556f4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2021-12 Dec 2021\n", "2022-01 Jan 2022\n", "2022-02 Feb 2022\n", "dtype: object" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "map_series = pd.Series(mapping)\n", "\n", "map_series" ] }, { "cell_type": "code", "execution_count": 27, "id": "2e9dd6a0", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/hk/s8m0bblj0g10hw885gld52mc0000gn/T/ipykernel_6958/3115622287.py:1: FutureWarning: DataFrame.groupby with axis=1 is deprecated. Do `frame.T.groupby(...)` without axis instead.\n", " df.groupby(map_series, axis=1).sum()\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Dec 2021Feb 2022Jan 2022
019651.033295.030134.0
14722.04009.03497.0
20.03930.00.0
30.00.00.0
4525.0276.0427.0
5157.0226.085.0
\n", "
" ], "text/plain": [ " Dec 2021 Feb 2022 Jan 2022\n", "0 19651.0 33295.0 30134.0\n", "1 4722.0 4009.0 3497.0\n", "2 0.0 3930.0 0.0\n", "3 0.0 0.0 0.0\n", "4 525.0 276.0 427.0\n", "5 157.0 226.0 85.0" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(map_series, axis=1).sum()" ] }, { "cell_type": "markdown", "id": "8e7fec0f", "metadata": {}, "source": [ "## Grouping with Functions\n", "\n", "Using Python functions is a more general method of defining a group assignment compared to a `Dict` or `Series`. Each function passed as a group key is called once per index value, with the return values used as group names. Specifically, consider the example DataFrame from the previous section, which contains the titles as index values. Suppose If you want to group by the length of the names, you can calculate an array with the lengths of the strings, but it is easier to pass the `len` function:" ] }, { "cell_type": "code", "execution_count": 28, "id": "e55f29d7", "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame(\n", " [\n", " [19651, 30134, 33295],\n", " [4722, 3497, 4009],\n", " [2573, 4873, 3930],\n", " [525, 427, 276],\n", " [157, 85, 226],\n", " ],\n", " index=[\n", " \"Jupyter Tutorial\",\n", " \"Jupyter Tutorial\",\n", " \"PyViz Tutorial\",\n", " \"Python Basics\",\n", " \"Python Basics\",\n", " ],\n", " columns=[\"2021-12\", \"2022-01\", \"2022-02\"],\n", ")" ] }, { "cell_type": "code", "execution_count": 29, "id": "7c33d027", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2021-122022-012022-02
13222
14111
16222
\n", "
" ], "text/plain": [ " 2021-12 2022-01 2022-02\n", "13 2 2 2\n", "14 1 1 1\n", "16 2 2 2" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(len).count()" ] }, { "cell_type": "markdown", "id": "2481357a", "metadata": {}, "source": [ "Mixing functions with arrays, dicts or series is no problem, as everything is converted internally into arrays:" ] }, { "cell_type": "code", "execution_count": 30, "id": "f9d15089", "metadata": {}, "outputs": [], "source": [ "languages = [\"de\", \"en\", \"de\", \"de\", \"en\"]" ] }, { "cell_type": "code", "execution_count": 31, "id": "2c3b8863", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2021-122022-012022-02
13de111
en111
14de111
16de111
en111
\n", "
" ], "text/plain": [ " 2021-12 2022-01 2022-02\n", "13 de 1 1 1\n", " en 1 1 1\n", "14 de 1 1 1\n", "16 de 1 1 1\n", " en 1 1 1" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby([len, languages]).count()" ] }, { "cell_type": "markdown", "id": "9b58d0cc", "metadata": {}, "source": [ "## Grouping by index levels\n", "\n", "A final practical feature for hierarchically indexed datasets is the ability to aggregate by one of the index levels of an axis. Let’s look at an example:" ] }, { "cell_type": "code", "execution_count": 32, "id": "66b260b3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Month2021-122022-012022-02
Versionlateststablelateststablelateststable
Jupyter Tutorialde19651.00.030134.00.033295.00.0
en4722.01825.03497.02576.04009.03707.0
PyViz Tutorialde2573.00.04873.00.03930.00.0
NaNNaNNaNNaNNaNNaNNaNNaN
Python Basicsde525.00.0427.00.0276.00.0
en157.00.085.00.0226.00.0
\n", "
" ], "text/plain": [ "Month 2021-12 2022-01 2022-02 \n", "Version latest stable latest stable latest stable\n", "Jupyter Tutorial de 19651.0 0.0 30134.0 0.0 33295.0 0.0\n", " en 4722.0 1825.0 3497.0 2576.0 4009.0 3707.0\n", "PyViz Tutorial de 2573.0 0.0 4873.0 0.0 3930.0 0.0\n", "NaN NaN NaN NaN NaN NaN NaN NaN\n", "Python Basics de 525.0 0.0 427.0 0.0 276.0 0.0\n", " en 157.0 0.0 85.0 0.0 226.0 0.0" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "version_hits = [\n", " [19651, 0, 30134, 0, 33295, 0],\n", " [4722, 1825, 3497, 2576, 4009, 3707],\n", " [2573, 0, 4873, 0, 3930, 0],\n", " [None, None, None, None, None, None],\n", " [525, 0, 427, 0, 276, 0],\n", " [157, 0, 85, 0, 226, 0],\n", "]\n", "\n", "df = pd.DataFrame(\n", " version_hits,\n", " index=[\n", " [\n", " \"Jupyter Tutorial\",\n", " \"Jupyter Tutorial\",\n", " \"PyViz Tutorial\",\n", " None,\n", " \"Python Basics\",\n", " \"Python Basics\",\n", " ],\n", " [\"de\", \"en\", \"de\", None, \"de\", \"en\"],\n", " ],\n", " columns=[\n", " [\"2021-12\", \"2021-12\", \"2022-01\", \"2022-01\", \"2022-02\", \"2022-02\"],\n", " [\"latest\", \"stable\", \"latest\", \"stable\", \"latest\", \"stable\"],\n", " ],\n", ")\n", "\n", "df.columns.names = [\"Month\", \"Version\"]\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": 33, "id": "311f84d8", "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/hk/s8m0bblj0g10hw885gld52mc0000gn/T/ipykernel_6958/3343480277.py:1: FutureWarning: DataFrame.groupby with axis=1 is deprecated. Do `frame.T.groupby(...)` without axis instead.\n", " df.groupby(level=\"Month\", axis=1).sum()\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Month2021-122022-012022-02
Jupyter Tutorialde19651.030134.033295.0
en6547.06073.07716.0
PyViz Tutorialde2573.04873.03930.0
NaNNaN0.00.00.0
Python Basicsde525.0427.0276.0
en157.085.0226.0
\n", "
" ], "text/plain": [ "Month 2021-12 2022-01 2022-02\n", "Jupyter Tutorial de 19651.0 30134.0 33295.0\n", " en 6547.0 6073.0 7716.0\n", "PyViz Tutorial de 2573.0 4873.0 3930.0\n", "NaN NaN 0.0 0.0 0.0\n", "Python Basics de 525.0 427.0 276.0\n", " en 157.0 85.0 226.0" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(level=\"Month\", axis=1).sum()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.13 Kernel", "language": "python", "name": "python313" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.0" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }