{ "cells": [ { "cell_type": "markdown", "id": "35cc9e67", "metadata": {}, "source": [ "# Subdividing and categorising data\n", "\n", "Continuous data is often divided into bins or otherwise grouped for analysis." ] }, { "cell_type": "markdown", "id": "625f4a8b", "metadata": {}, "source": [ "Suppose you have data on a group of people in a study that you want to divide into discrete age groups. For this, we generate a DataFrame with 250 random integer ages. Since the upper bound of `np.random.randint` is exclusive, `randint(0, 99, 250)` yields values from `0` to `98`:" ] }, { "cell_type": "code", "execution_count": 1, "id": "fe17e156", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\"><th></th><th>Age</th></tr>\n", " </thead>\n", " <tbody>\n", " <tr><th>0</th><td>36</td></tr>\n", " <tr><th>1</th><td>20</td></tr>\n", " <tr><th>2</th><td>54</td></tr>\n", " <tr><th>3</th><td>63</td></tr>\n", " <tr><th>4</th><td>60</td></tr>\n", " <tr><th>...</th><td>...</td></tr>\n", " <tr><th>245</th><td>35</td></tr>\n", " <tr><th>246</th><td>93</td></tr>\n", " <tr><th>247</th><td>97</td></tr>\n", " <tr><th>248</th><td>84</td></tr>\n", " <tr><th>249</th><td>9</td></tr>\n", " </tbody>\n", "</table>\n", "<p>250 rows × 1 columns</p>\n", "</div>" ], "text/plain": [ " Age\n", "0 36\n", "1 20\n", "2 54\n", "3 63\n", "4 60\n", ".. ...\n", "245 35\n", "246 93\n", "247 97\n", "248 84\n", "249 9\n", "\n", "[250 rows x 1 columns]" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "\n", "# 250 random ages; the upper bound of randint is exclusive, so values run from 0 to 98\n", "ages = np.random.randint(0, 99, 250)\n", "df = pd.DataFrame({\"Age\": ages})\n", "\n", "df" ] }, { "cell_type": "markdown", "id": "8717607a", "metadata": {}, "source": [ "pandas then offers a simple way to divide the values into ten equal-width bins with [pandas.cut](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html). To display the bin edges as whole years, we additionally set `precision=0`:" ] }, { "cell_type": "code", "execution_count": 2, "id": "2c89e759", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[(29.0, 39.0], (20.0, 29.0], (49.0, 59.0], (59.0, 69.0], (59.0, 69.0], ..., (29.0, 39.0], (88.0, 98.0], (88.0, 98.0], (78.0, 88.0], (-0.1, 10.0]]\n", "Length: 250\n", "Categories (10, interval[float64, right]): [(-0.1, 10.0] < (10.0, 20.0] < (20.0, 29.0] < (29.0, 39.0] ... 
(59.0, 69.0] < (69.0, 78.0] < (78.0, 88.0] < (88.0, 98.0]]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cats = pd.cut(ages, 10, precision=0)\n", "\n", "cats" ] }, { "cell_type": "markdown", "id": "40135b6f", "metadata": {}, "source": [ "With [pandas.Categorical.categories](https://pandas.pydata.org/docs/reference/api/pandas.Categorical.categories.html) you can display the categories:" ] }, { "cell_type": "code", "execution_count": 3, "id": "48c787f3", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "IntervalIndex([(-0.1, 10.0], (10.0, 20.0], (20.0, 29.0], (29.0, 39.0],\n", " (39.0, 49.0], (49.0, 59.0], (59.0, 69.0], (69.0, 78.0],\n", " (78.0, 88.0], (88.0, 98.0]],\n", " dtype='interval[float64, right]')" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cats.categories" ] }, { "cell_type": "markdown", "id": "d5987f98", "metadata": {}, "source": [ "… or even just a single category:" ] }, { "cell_type": "code", "execution_count": 4, "id": "59da3a5e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Interval(-0.1, 10.0, closed='right')" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cats.categories[0]" ] }, { "cell_type": "markdown", "id": "08fd3777", "metadata": {}, "source": [ "With [pandas.Categorical.codes](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Categorical.codes.html) you get an array that holds, for each value, the integer position of its category in `categories`:" ] }, { "cell_type": "code", "execution_count": 5, "id": "0df59a92", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([3, 2, 5, 6, 6, 4, 6, 9, 5, 9, 9, 1, 2, 8, 7, 3, 9, 4, 4, 7, 1, 4,\n", " 0, 8, 6, 6, 0, 2, 1, 4, 9, 0, 6, 5, 1, 4, 8, 0, 3, 1, 0, 9, 4, 2,\n", " 5, 8, 3, 8, 3, 2, 3, 9, 8, 2, 2, 8, 5, 0, 8, 9, 0, 8, 1, 5, 8, 9,\n", " 3, 6, 4, 8, 2, 4, 3, 9, 5, 9, 8, 1, 9, 7, 4, 1, 0, 9, 2, 0, 0, 9,\n", " 0, 5, 6, 8, 2, 9, 1, 
6, 8, 6, 0, 8, 2, 5, 5, 9, 5, 4, 1, 7, 0, 3,\n", " 6, 8, 0, 7, 6, 2, 0, 3, 4, 6, 5, 9, 6, 2, 0, 4, 3, 7, 7, 0, 7, 1,\n", " 9, 9, 3, 0, 9, 8, 9, 7, 1, 7, 6, 3, 2, 8, 6, 2, 9, 9, 3, 7, 6, 7,\n", " 3, 3, 0, 9, 1, 5, 3, 6, 4, 6, 2, 6, 4, 9, 2, 7, 1, 7, 6, 4, 1, 5,\n", " 2, 1, 5, 4, 9, 4, 7, 0, 3, 8, 7, 6, 7, 6, 7, 7, 2, 2, 7, 3, 0, 9,\n", " 3, 7, 6, 3, 6, 9, 1, 2, 3, 7, 7, 7, 8, 5, 6, 0, 6, 1, 1, 6, 0, 5,\n", " 2, 5, 1, 9, 1, 0, 6, 9, 4, 5, 9, 6, 1, 8, 5, 6, 9, 6, 8, 7, 9, 1,\n", " 2, 4, 0, 3, 9, 9, 8, 0], dtype=int8)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cats.codes" ] }, { "cell_type": "markdown", "id": "347712ac", "metadata": {}, "source": [ "With `value_counts` we can now see how the values are distributed across the individual bins:" ] }, { "cell_type": "code", "execution_count": 6, "id": "048cc13b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(88.0, 98.0] 35\n", "(59.0, 69.0] 32\n", "(-0.1, 10.0] 26\n", "(69.0, 78.0] 25\n", "(10.0, 20.0] 23\n", "(20.0, 29.0] 23\n", "(29.0, 39.0] 23\n", "(78.0, 88.0] 23\n", "(39.0, 49.0] 20\n", "(49.0, 59.0] 20\n", "Name: count, dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(cats).value_counts()" ] }, { "cell_type": "markdown", "id": "a4f58496", "metadata": {}, "source": [ "It is striking that the age ranges do not all span the same number of years: the two ranges `(20.0, 29.0]` and `(69.0, 78.0]` cover only 9 years. 
This is because the ages only range from `0` to `98`: each of the ten equal-width bins spans 9.8 years, and `precision=0` rounds the displayed edges to whole numbers:" ] }, { "cell_type": "code", "execution_count": 7, "id": "062036d8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Age 0\n", "dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.min()" ] }, { "cell_type": "code", "execution_count": 8, "id": "9579e708", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Age 98\n", "dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.max()" ] }, { "cell_type": "markdown", "id": "d6fc3deb", "metadata": {}, "source": [ "With [pandas.qcut](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html), on the other hand, the data is divided into quantile-based bins that each contain approximately the same number of values:" ] }, { "cell_type": "code", "execution_count": 9, "id": "fb4d41b1", "metadata": {}, "outputs": [], "source": [ "cats = pd.qcut(ages, 10, precision=0)" ] }, { "cell_type": "code", "execution_count": 10, "id": "8653757b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(33.0, 41.0] 27\n", "(54.0, 63.0] 27\n", "(-1.0, 9.0] 26\n", "(9.0, 20.0] 26\n", "(82.0, 91.0] 26\n", "(71.0, 82.0] 25\n", "(91.0, 98.0] 24\n", "(20.0, 33.0] 23\n", "(41.0, 54.0] 23\n", "(63.0, 71.0] 23\n", "Name: count, dtype: int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(cats).value_counts()" ] }, { "cell_type": "markdown", "id": "d1f75798", "metadata": {}, "source": [ "If we want to ensure that each age group actually spans exactly ten years, we can define the group labels ourselves with [pandas.Categorical](https://pandas.pydata.org/docs/reference/api/pandas.Categorical.html):" ] }, { "cell_type": "code", "execution_count": 11, "id": "3be29378", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['0 - 9', '10 - 19', '20 - 29', '30 - 39', '40 - 49', '50 - 59',\n", " '60 - 69', '70 - 79', 
'80 - 89', '90 - 99'],\n", " dtype='object')" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# One label per decade: \"0 - 9\", \"10 - 19\", ..., \"90 - 99\"\n", "age_groups = [f\"{i} - {i + 9}\" for i in range(0, 99, 10)]\n", "cats = pd.Categorical(age_groups)\n", "\n", "cats.categories" ] }, { "cell_type": "markdown", "id": "b281dfd0", "metadata": {}, "source": [ "For grouping we can now use [pandas.cut](https://pandas.pydata.org/docs/reference/api/pandas.cut.html) with explicit bin edges. Note that the number of labels must be one less than the number of edges. With `right=False`, each bin includes its left edge and excludes its right edge, so that an age of `20` falls into `20 - 29`:" ] }, { "cell_type": "code", "execution_count": 12, "id": "f5180bc8", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\"><th></th><th>Age</th><th>Age group</th></tr>\n", " </thead>\n", " <tbody>\n", " <tr><th>0</th><td>36</td><td>30 - 39</td></tr>\n", " <tr><th>1</th><td>20</td><td>20 - 29</td></tr>\n", " <tr><th>2</th><td>54</td><td>50 - 59</td></tr>\n", " <tr><th>3</th><td>63</td><td>60 - 69</td></tr>\n", " <tr><th>4</th><td>60</td><td>60 - 69</td></tr>\n", " <tr><th>...</th><td>...</td><td>...</td></tr>\n", " <tr><th>245</th><td>35</td><td>30 - 39</td></tr>\n", " <tr><th>246</th><td>93</td><td>90 - 99</td></tr>\n", " <tr><th>247</th><td>97</td><td>90 - 99</td></tr>\n", " <tr><th>248</th><td>84</td><td>80 - 89</td></tr>\n", " <tr><th>249</th><td>9</td><td>0 - 9</td></tr>\n", " </tbody>\n", "</table>\n", "<p>250 rows × 2 columns</p>\n", "</div>
" ], "text/plain": [ " Age Age group\n", "0 36 30 - 39\n", "1 20 20 - 29\n", "2 54 50 - 59\n", "3 63 60 - 69\n", "4 60 60 - 69\n", ".. ... ...\n", "245 35 30 - 39\n", "246 93 90 - 99\n", "247 97 90 - 99\n", "248 84 80 - 89\n", "249 9 0 - 9\n", "\n", "[250 rows x 2 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"Age group\"] = pd.cut(df.Age, range(0, 101, 10), right=False, labels=cats)\n", "\n", "df" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.13 Kernel", "language": "python", "name": "python313" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.0" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }