{ "cells": [ { "cell_type": "markdown", "id": "42cca328", "metadata": {}, "source": [ "# Pivot-Tabellen und Kreuztabellen\n", "\n", "Eine [Pivot-Tabelle](https://de.wikipedia.org/wiki/Pivot-Tabelle) ist ein Werkzeug zur Datenzusammenfassung, das häufig in Tabellenkalkulationsprogrammen und anderer Datenanalysesoftware zu finden ist. Sie fasst eine Tabelle mit Daten nach einem oder mehreren Schlüsseln zusammen und ordnet die Daten in einem Rechteck an, wobei einige der Gruppenschlüssel entlang der Zeilen und einige entlang der Spalten angeordnet sind. Pivot-Tabellen in Python mit pandas werden durch die [groupby](group-operations.ipynb)-Funktion in Kombination mit Umformungsoperationen unter Verwendung [hierarchischer Indizierung](indexing.ipynb#Hierarchische-Indizierung) ermöglicht. DataFrame hat eine [pivot_table](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot_table.html)-Methode, und es gibt auch eine Top-Level-Funktion [pandas.pivot_table](https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html). `pivot_table` bietet nicht nur eine bequeme Schnittstelle zu `groupby`, sondern kann auch Teilsummen (`margins`) hinzufügen." ] }, { "cell_type": "markdown", "id": "d8b5005a", "metadata": {}, "source": [ "Nehmen wir an, wir wollten eine Tabelle mit Gruppenmittelwerten (der Standardaggregationstyp von `pivot_table`) berechnen, die nach Titel und Sprache in den Zeilen geordnet ist:" ] }, { "cell_type": "code", "execution_count": 1, "id": "dbc4929e", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "id": "d1c34df4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TitleLanguage2021-122022-012022-02
0Jupyter Tutorialde30134.033295.019651.0
1Jupyter Tutorialen6073.07716.06547.0
2PyViz Tutorialde4873.03930.02573.0
3PyViz TutorialenNaNNaNNaN
4Python Basicsde427.0276.0525.0
5Python Basicsen95.0226.0157.0
\n", "
" ], "text/plain": [ " Title Language 2021-12 2022-01 2022-02\n", "0 Jupyter Tutorial de 30134.0 33295.0 19651.0\n", "1 Jupyter Tutorial en 6073.0 7716.0 6547.0\n", "2 PyViz Tutorial de 4873.0 3930.0 2573.0\n", "3 PyViz Tutorial en NaN NaN NaN\n", "4 Python Basics de 427.0 276.0 525.0\n", "5 Python Basics en 95.0 226.0 157.0" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(\n", " {\n", " \"Title\": [\n", " \"Jupyter Tutorial\",\n", " \"Jupyter Tutorial\",\n", " \"PyViz Tutorial\",\n", " \"PyViz Tutorial\",\n", " \"Python Basics\",\n", " \"Python Basics\",\n", " ],\n", " \"Language\": [\"de\", \"en\", \"de\", \"en\", \"de\", \"en\"],\n", " \"2021-12\": [30134, 6073, 4873, np.nan, 427, 95],\n", " \"2022-01\": [33295, 7716, 3930, np.nan, 276, 226],\n", " \"2022-02\": [19651, 6547, 2573, np.nan, 525, 157],\n", " }\n", ")\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": 3, "id": "95f677aa", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2021-122022-012022-02
TitleLanguage
Jupyter Tutorialde30134.033295.019651.0
en6073.07716.06547.0
PyViz Tutorialde4873.03930.02573.0
Python Basicsde427.0276.0525.0
en95.0226.0157.0
\n", "
" ], "text/plain": [ " 2021-12 2022-01 2022-02\n", "Title Language \n", "Jupyter Tutorial de 30134.0 33295.0 19651.0\n", " en 6073.0 7716.0 6547.0\n", "PyViz Tutorial de 4873.0 3930.0 2573.0\n", "Python Basics de 427.0 276.0 525.0\n", " en 95.0 226.0 157.0" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.pivot_table(index=[\"Title\", \"Language\"])" ] }, { "cell_type": "markdown", "id": "613af605", "metadata": {}, "source": [ "Das hätte man auch direkt mit `groupby` machen können." ] }, { "cell_type": "markdown", "id": "533e5955", "metadata": {}, "source": [ "Nehmen wir nun an, wir wollen den Durchschnitt jeden Monats nehmen und zusätzlich nach `Title` gruppieren. Ich werde `Title` und `Language` in die Tabellenspalten und die Monate in die Zeilen setzen:" ] }, { "cell_type": "code", "execution_count": 4, "id": "b8fde770", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TitleJupyter TutorialPyViz TutorialPython Basics
Languagedeendedeen
2021-1230134.06073.04873.0427.095.0
2022-0133295.07716.03930.0276.0226.0
2022-0219651.06547.02573.0525.0157.0
\n", "
" ], "text/plain": [ "Title Jupyter Tutorial PyViz Tutorial Python Basics \n", "Language de en de de en\n", "2021-12 30134.0 6073.0 4873.0 427.0 95.0\n", "2022-01 33295.0 7716.0 3930.0 276.0 226.0\n", "2022-02 19651.0 6547.0 2573.0 525.0 157.0" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.pivot_table(columns=[\"Title\", \"Language\"])" ] }, { "cell_type": "markdown", "id": "e414fbc9", "metadata": {}, "source": [ "Um eine andere Aggregationsfunktion als `mean` zu verwenden, übergebt diese an das Schlüsselwortargument `aggfunc`. Mit `sum` zum Beispiel erhaltet ihr die Summe:" ] }, { "cell_type": "code", "execution_count": 5, "id": "b24bf26e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TitleJupyter TutorialPyViz TutorialPython Basics
LanguagedeenAlldeenAlldeenAll
2021-1230134.06073.036207.04873.00.04873.0427.095.0522.0
2022-0133295.07716.041011.03930.00.03930.0276.0226.0502.0
2022-0219651.06547.026198.02573.00.02573.0525.0157.0682.0
\n", "
" ], "text/plain": [ "Title Jupyter Tutorial PyViz Tutorial \\\n", "Language de en All de en All \n", "2021-12 30134.0 6073.0 36207.0 4873.0 0.0 4873.0 \n", "2022-01 33295.0 7716.0 41011.0 3930.0 0.0 3930.0 \n", "2022-02 19651.0 6547.0 26198.0 2573.0 0.0 2573.0 \n", "\n", "Title Python Basics \n", "Language de en All \n", "2021-12 427.0 95.0 522.0 \n", "2022-01 276.0 226.0 502.0 \n", "2022-02 525.0 157.0 682.0 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.pivot_table(columns=[\"Title\", \"Language\"], aggfunc=\"sum\", margins=True)" ] }, { "cell_type": "markdown", "id": "064b60b9", "metadata": {}, "source": [ "In diesem Fall sind die Werte für `All` die Mittelwerte." ] }, { "cell_type": "markdown", "id": "61481099", "metadata": {}, "source": [ "Wenn einige Kombinationen leer (oder anderweitig `NA`) sind, könnt ihr `fill_value` übergeben:" ] }, { "cell_type": "markdown", "id": "234bfe56", "metadata": {}, "source": [ "`pivot_table`-Optionen:\n", "\n", "Funktionsname | Beschreibung\n", ":------------ | :-----------\n", "`values` | Spaltenname(n) zum Aggregieren; standardmäßig werden alle numerischen Spalten aggregiert\n", "`index` | Spaltennamen oder andere Gruppenschlüssel, die in den Zeilen der resultierenden Pivot-Tabelle gruppiert werden sollen\n", "`columns` | Spaltennamen oder andere Gruppenschlüssel, die in den Spalten der resultierenden Pivot-Tabelle gruppiert werden sollen\n", "`aggfunc` | Aggregationsfunktion oder Liste von Funktionen (standardmäßig `mean`); kann eine beliebige Funktion sein, die in einem `groupby`-Kontext gültig ist\n", "`fill_value` | ersetzt fehlende Werte in der Ergebnistabelle\n", "`dropna` | wenn `True`, werden Spalten, deren Einträge alle `NA` sind, nicht berücksichtigt\n", "`margins` | fügt Zeilen-/Spalten-Zwischensummen und Gesamtsummen ein (Standardeinstellung: `False`)\n", "`margins_name` | Name, der für die Zeilen-/Spaltenbeschriftung verwendet wird, wenn `margins=True` übergeben wird, Standardwert ist `All`.\n", "`observed` | Bei kategorialen Gruppenschlüsseln werden bei `True` nur die beobachteten Kategoriewerte in den Schlüsseln angezeigt und nicht alle Kategorien" ] }, { "cell_type": "markdown", "id": "6a1d0597", "metadata": {}, "source": [ "## Kreuztabellen\n", "\n", "Eine Kreuztabelle ist ein Spezialfall einer Pivot-Tabelle, die die Häufigkeit von Gruppen berechnet. Wollen wir z.B. im Rahmen einer Analyse diese Daten vielleicht ermitteln, welcher Titel in welcher Sprache erschienen ist, so könnten wir dafür `pivot_table` verwenden, aber die Funktion [pandas.crosstab](https://pandas.pydata.org/docs/reference/api/pandas.crosstab.html) ist bequemer." ] }, { "cell_type": "markdown", "id": "07a1c0e0", "metadata": {}, "source": [ " Hierfür setzen wir zunächst die bestehenden Index zurück:" ] }, { "cell_type": "code", "execution_count": 6, "id": "f3209837", "metadata": {}, "outputs": [], "source": [ "df.reset_index(inplace=True)" ] }, { "cell_type": "code", "execution_count": 7, "id": "ec1f31fd", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Languagedeen
Title
Jupyter Tutorial11
PyViz Tutorial11
Python Basics11
\n", "
" ], "text/plain": [ "Language de en\n", "Title \n", "Jupyter Tutorial 1 1\n", "PyViz Tutorial 1 1\n", "Python Basics 1 1" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.crosstab(df.Title, df.Language)" ] }, { "cell_type": "markdown", "id": "c96f76fc", "metadata": {}, "source": [ "Die ersten beiden Argumente für `crosstab` können jeweils entweder ein Array oder eine Series oder eine Liste von Arrays sein." ] }, { "cell_type": "markdown", "id": "ad0621b7", "metadata": {}, "source": [ "Mit `margins=True` können wir uns auch die Summen der Spalten und Zeilen sowie die Gesamtsumme berechnen lassen:" ] }, { "cell_type": "code", "execution_count": 8, "id": "6b8e9d80", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
LanguagedeenAll
Title
Jupyter Tutorial112
PyViz Tutorial112
Python Basics112
All336
\n", "
" ], "text/plain": [ "Language de en All\n", "Title \n", "Jupyter Tutorial 1 1 2\n", "PyViz Tutorial 1 1 2\n", "Python Basics 1 1 2\n", "All 3 3 6" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.crosstab(df.Title, df.Language, margins=True)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.13 Kernel", "language": "python", "name": "python313" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.0" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }