{ "cells": [ { "cell_type": "markdown", "id": "290ea5a6", "metadata": {}, "source": [ "# Unterteilen und Kategorisieren von Daten\n", "\n", "Kontinuierliche Daten werden häufig in Bereiche unterteilt oder auf andere Weise für die Analyse gruppiert." ] }, { "cell_type": "markdown", "id": "391c3176", "metadata": {}, "source": [ "Angenommen, ihr habt Daten über eine Gruppe von Personen in einer Studie, die ihr in diskrete Altersgruppen einteilen möchtet. Hierfür generieren wir uns einen Dataframe mit 250 Einträgen zwischen `0` und `99`:" ] }, { "cell_type": "code", "execution_count": 1, "id": "deec7725", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Age
077
133
250
383
449
......
24541
24695
24797
24874
24923
\n", "

250 rows × 1 columns

\n", "
" ], "text/plain": [ " Age\n", "0 77\n", "1 33\n", "2 50\n", "3 83\n", "4 49\n", ".. ...\n", "245 41\n", "246 95\n", "247 97\n", "248 74\n", "249 23\n", "\n", "[250 rows x 1 columns]" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "\n", "ages = np.random.randint(0, 99, 250)\n", "df = pd.DataFrame({\"Age\": ages})\n", "\n", "df" ] }, { "cell_type": "markdown", "id": "2c991e35", "metadata": {}, "source": [ "Anschließend bietet uns pandas mit [pandas.cut](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html) eine einfache Möglichkeit, die Ergebnisse in zehn Bereiche aufzuteilen. Um nur ganze Jahre zu erhalten, setzen wir zusätzlich `precision=0`:" ] }, { "cell_type": "code", "execution_count": 2, "id": "e9d30508", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[(68.0, 78.0], (29.0, 39.0], (48.0, 58.0], (78.0, 87.0], (48.0, 58.0], ..., (39.0, 48.0], (87.0, 97.0], (87.0, 97.0], (68.0, 78.0], (19.0, 29.0]]\n", "Length: 250\n", "Categories (10, interval[float64, right]): [(-0.1, 10.0] < (10.0, 19.0] < (19.0, 29.0] < (29.0, 39.0] ... (58.0, 68.0] < (68.0, 78.0] < (78.0, 87.0] < (87.0, 97.0]]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cats = pd.cut(ages, 10, precision=0)\n", "\n", "cats" ] }, { "cell_type": "markdown", "id": "ca81953c", "metadata": {}, "source": [ "Mit [pandas.Categorical.categories](https://pandas.pydata.org/docs/reference/api/pandas.Categorical.categories.html) könnt ihr euch die Kategorien anzeigen lassen:" ] }, { "cell_type": "code", "execution_count": 3, "id": "1b8fb7b8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "IntervalIndex([(-0.1, 10.0], (10.0, 19.0], (19.0, 29.0], (29.0, 39.0],\n", " (39.0, 48.0], (48.0, 58.0], (58.0, 68.0], (68.0, 78.0],\n", " (78.0, 87.0], (87.0, 97.0]],\n", " dtype='interval[float64, right]')" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cats.categories" ] }, { "cell_type": "markdown", "id": "ed9275b5", "metadata": {}, "source": [ "…oder auch nur eine einzelne Kategorie:" ] }, { "cell_type": "code", "execution_count": 4, "id": "ab486178", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Interval(-0.1, 10.0, closed='right')" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cats.categories[0]" ] }, { "cell_type": "markdown", "id": "077d8eeb", "metadata": {}, "source": [ "Mit [pandas.Categorical.codes](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Categorical.codes.html) könnt ihr euch ein Array anzeigen lassen, in dem für jeden Wert die zugehörige Kategorie angezeigt wird:" ] }, { "cell_type": "code", "execution_count": 5, "id": "27faf8a0", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([7, 3, 5, 8, 5, 3, 7, 5, 2, 4, 9, 4, 0, 5, 4, 8, 4, 3, 8, 9, 0, 6,\n", " 8, 6, 4, 0, 2, 0, 4, 2, 4, 8, 3, 6, 1, 5, 0, 7, 9, 8, 2, 6, 0, 1,\n", " 4, 0, 0, 8, 4, 7, 7, 6, 9, 5, 6, 7, 5, 8, 6, 1, 3, 8, 8, 0, 6, 9,\n", " 2, 3, 7, 1, 6, 1, 7, 4, 1, 3, 4, 7, 5, 0, 5, 9, 7, 8, 1, 9, 9, 6,\n", " 5, 2, 0, 8, 9, 8, 9, 8, 9, 6, 3, 6, 3, 4, 6, 9, 7, 4, 3, 9, 5, 9,\n", " 2, 2, 8, 4, 4, 0, 3, 9, 8, 6, 6, 9, 0, 6, 3, 1, 8, 6, 5, 4, 8, 1,\n", " 2, 4, 9, 9, 0, 6, 6, 8, 5, 8, 6, 2, 2, 9, 2, 8, 9, 7, 6, 6, 2, 1,\n", " 5, 7, 7, 0, 1, 8, 4, 9, 9, 4, 7, 8, 7, 2, 3, 5, 4, 8, 3, 6, 2, 9,\n", " 2, 9, 3, 5, 1, 2, 8, 6, 2, 0, 7, 3, 2, 1, 8, 1, 3, 2, 6, 6, 6, 5,\n", " 9, 3, 8, 8, 9, 3, 9, 8, 3, 7, 2, 9, 6, 6, 1, 9, 8, 4, 8, 0, 6, 8,\n", " 4, 2, 1, 9, 8, 1, 6, 4, 5, 7, 7, 8, 1, 3, 4, 6, 2, 1, 9, 3, 9, 3,\n", " 3, 9, 9, 4, 9, 9, 7, 2], dtype=int8)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cats.codes" ] }, { "cell_type": "markdown", "id": "5629393d", "metadata": {}, "source": [ "Mit `value_counts` können wir uns nun anschauen, wie sich die Anzahl auf die einzelnen Bereiche verteilt:" ] }, { "cell_type": "code", "execution_count": 6, "id": "b9656f7f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(87.0, 97.0] 36\n", "(78.0, 87.0] 34\n", "(58.0, 68.0] 32\n", "(39.0, 48.0] 25\n", "(19.0, 29.0] 24\n", "(29.0, 39.0] 24\n", "(68.0, 78.0] 21\n", "(10.0, 19.0] 19\n", "(48.0, 58.0] 18\n", "(-0.1, 10.0] 17\n", "Name: count, dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(cats).value_counts()" ] }, { "cell_type": "markdown", "id": "a3574d82", "metadata": {}, "source": [ "Auffalend ist, dass die Altersbereiche nicht gleich viele Jahre enthalten, sondern mit `20.0, 29.0` und `69.0, 78.0` zwei Bereiche nur 9 Jahre umfassen. Dies hängt damit zusammen, dass der Altersumfang nur von `0`bis `98` reicht:" ] }, { "cell_type": "code", "execution_count": 7, "id": "cfefc1cd", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Age 0\n", "dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.min()" ] }, { "cell_type": "code", "execution_count": 8, "id": "d2f2e234", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Age 97\n", "dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.max()" ] }, { "cell_type": "markdown", "id": "b5ca6122", "metadata": {}, "source": [ "Mit [pandas.qcut](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html) wird die Menge hingegen in Bereiche unterteilt, die annähernd gleich groß sind:" ] }, { "cell_type": "code", "execution_count": 9, "id": "1c7cba64", "metadata": {}, "outputs": [], "source": [ "cats = pd.qcut(ages, 10, precision=0)" ] }, { "cell_type": "code", "execution_count": 10, "id": "92f7c0a6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(-1.0, 14.0] 27\n", "(14.0, 26.0] 27\n", "(58.0, 65.0] 26\n", "(75.0, 84.0] 26\n", "(84.0, 90.0] 26\n", "(26.0, 34.0] 25\n", "(65.0, 75.0] 24\n", "(34.0, 45.0] 23\n", "(45.0, 58.0] 23\n", "(90.0, 97.0] 23\n", "Name: count, dtype: int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(cats).value_counts()" ] }, { "cell_type": "markdown", "id": "13fa613d", "metadata": {}, "source": [ "Wollen wir gewährleisten, dass jede Altersgruppe tatsächlich genau zehn Jahre umfasst, können wir dies mit [pandas.Categorical](https://pandas.pydata.org/docs/reference/api/pandas.Categorical.html) direkt angeben:" ] }, { "cell_type": "code", "execution_count": 11, "id": "fa011ab7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['0 - 9', '10 - 19', '100 - 109', '20 - 29', '30 - 39', '40 - 49',\n", " '50 - 59', '60 - 69', '70 - 79', '80 - 89', '90 - 99'],\n", " dtype='object')" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "age_groups = [\"{0} - {1}\".format(i, i + 9) for i in range(0, 109, 10)]\n", "cats = pd.Categorical(age_groups)\n", "\n", "cats.categories" ] }, { "cell_type": "markdown", "id": "4c620f5a", "metadata": {}, "source": [ "Für die Gruppierung wird nun [pandas.cut](https://pandas.pydata.org/docs/reference/api/pandas.cut.html) verwendet:" ] }, { "cell_type": "code", "execution_count": 12, "id": "00c6b05b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AgeAge group
07770 - 79
13330 - 39
25050 - 59
38380 - 89
44940 - 49
.........
2454140 - 49
2469590 - 99
2479790 - 99
2487470 - 79
2492320 - 29
\n", "

250 rows × 2 columns

\n", "
" ], "text/plain": [ " Age Age group\n", "0 77 70 - 79\n", "1 33 30 - 39\n", "2 50 50 - 59\n", "3 83 80 - 89\n", "4 49 40 - 49\n", ".. ... ...\n", "245 41 40 - 49\n", "246 95 90 - 99\n", "247 97 90 - 99\n", "248 74 70 - 79\n", "249 23 20 - 29\n", "\n", "[250 rows x 2 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"Age group\"] = pd.cut(df.Age, range(0, 111, 10), right=False, labels=cats)\n", "\n", "df" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.13 Kernel", "language": "python", "name": "python313" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.0" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }