{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "35cc9e67",
   "metadata": {},
   "source": [
    "# Subdividing and categorising data\n",
    "\n",
    "Continuous data is often divided into domains or otherwise grouped for analysis."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "625f4a8b",
   "metadata": {},
   "source": [
    "Suppose you have data on a group of people in a study that you want to divide into discrete age groups. For this, we generate a dataframe with 250 entries between `0` and `99`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "fe17e156",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Age</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>36</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>54</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>63</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>60</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>245</th>\n",
       "      <td>35</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>246</th>\n",
       "      <td>93</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>247</th>\n",
       "      <td>97</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>248</th>\n",
       "      <td>84</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>249</th>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>250 rows × 1 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     Age\n",
       "0     36\n",
       "1     20\n",
       "2     54\n",
       "3     63\n",
       "4     60\n",
       "..   ...\n",
       "245   35\n",
       "246   93\n",
       "247   97\n",
       "248   84\n",
       "249    9\n",
       "\n",
       "[250 rows x 1 columns]"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "\n",
    "ages = np.random.randint(0, 99, 250)\n",
    "df = pd.DataFrame({\"Age\": ages})\n",
    "\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8717607a",
   "metadata": {},
   "source": [
    "Afterwards, pandas offers us a simple way to divide the results into ten ranges with [pandas.cut](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html). To get only whole years, we additionally set `precision=0`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "2c89e759",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(29.0, 39.0], (20.0, 29.0], (49.0, 59.0], (59.0, 69.0], (59.0, 69.0], ..., (29.0, 39.0], (88.0, 98.0], (88.0, 98.0], (78.0, 88.0], (-0.1, 10.0]]\n",
       "Length: 250\n",
       "Categories (10, interval[float64, right]): [(-0.1, 10.0] < (10.0, 20.0] < (20.0, 29.0] < (29.0, 39.0] ... (59.0, 69.0] < (69.0, 78.0] < (78.0, 88.0] < (88.0, 98.0]]"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cats = pd.cut(ages, 10, precision=0)\n",
    "\n",
    "cats"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "40135b6f",
   "metadata": {},
   "source": [
    "With [pandas.Categorical.categories](https://pandas.pydata.org/docs/reference/api/pandas.Categorical.categories.html) you can display the categories:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "48c787f3",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "IntervalIndex([(-0.1, 10.0], (10.0, 20.0], (20.0, 29.0], (29.0, 39.0],\n",
       "               (39.0, 49.0], (49.0, 59.0], (59.0, 69.0], (69.0, 78.0],\n",
       "               (78.0, 88.0], (88.0, 98.0]],\n",
       "              dtype='interval[float64, right]')"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cats.categories"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d5987f98",
   "metadata": {},
   "source": [
    "… or even just a single category:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "59da3a5e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Interval(-0.1, 10.0, closed='right')"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cats.categories[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "08fd3777",
   "metadata": {},
   "source": [
    "With [pandas.Categorical.codes](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Categorical.codes.html) you can display an array where for each value the corresponding category is shown:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "0df59a92",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([3, 2, 5, 6, 6, 4, 6, 9, 5, 9, 9, 1, 2, 8, 7, 3, 9, 4, 4, 7, 1, 4,\n",
       "       0, 8, 6, 6, 0, 2, 1, 4, 9, 0, 6, 5, 1, 4, 8, 0, 3, 1, 0, 9, 4, 2,\n",
       "       5, 8, 3, 8, 3, 2, 3, 9, 8, 2, 2, 8, 5, 0, 8, 9, 0, 8, 1, 5, 8, 9,\n",
       "       3, 6, 4, 8, 2, 4, 3, 9, 5, 9, 8, 1, 9, 7, 4, 1, 0, 9, 2, 0, 0, 9,\n",
       "       0, 5, 6, 8, 2, 9, 1, 6, 8, 6, 0, 8, 2, 5, 5, 9, 5, 4, 1, 7, 0, 3,\n",
       "       6, 8, 0, 7, 6, 2, 0, 3, 4, 6, 5, 9, 6, 2, 0, 4, 3, 7, 7, 0, 7, 1,\n",
       "       9, 9, 3, 0, 9, 8, 9, 7, 1, 7, 6, 3, 2, 8, 6, 2, 9, 9, 3, 7, 6, 7,\n",
       "       3, 3, 0, 9, 1, 5, 3, 6, 4, 6, 2, 6, 4, 9, 2, 7, 1, 7, 6, 4, 1, 5,\n",
       "       2, 1, 5, 4, 9, 4, 7, 0, 3, 8, 7, 6, 7, 6, 7, 7, 2, 2, 7, 3, 0, 9,\n",
       "       3, 7, 6, 3, 6, 9, 1, 2, 3, 7, 7, 7, 8, 5, 6, 0, 6, 1, 1, 6, 0, 5,\n",
       "       2, 5, 1, 9, 1, 0, 6, 9, 4, 5, 9, 6, 1, 8, 5, 6, 9, 6, 8, 7, 9, 1,\n",
       "       2, 4, 0, 3, 9, 9, 8, 0], dtype=int8)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cats.codes"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "347712ac",
   "metadata": {},
   "source": [
    "With `value_counts` we can now look at how the number is distributed among the individual areas:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "048cc13b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(88.0, 98.0]    35\n",
       "(59.0, 69.0]    32\n",
       "(-0.1, 10.0]    26\n",
       "(69.0, 78.0]    25\n",
       "(10.0, 20.0]    23\n",
       "(20.0, 29.0]    23\n",
       "(29.0, 39.0]    23\n",
       "(78.0, 88.0]    23\n",
       "(39.0, 49.0]    20\n",
       "(49.0, 59.0]    20\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.Series(cats).value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a4f58496",
   "metadata": {},
   "source": [
    "It is striking that the age ranges do not contain an equal number of years, but with `20.0, 29.0` and `69.0, 78.0` two ranges contain only 9 years. This is due to the fact that the age range only extends from `0` to `98`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "062036d8",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Age    0\n",
       "dtype: int64"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.min()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "9579e708",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Age    98\n",
       "dtype: int64"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.max()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d6fc3deb",
   "metadata": {},
   "source": [
    "With [pandas.qcut](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html), on the other hand, the set is divided into areas that are approximately the same size:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "fb4d41b1",
   "metadata": {},
   "outputs": [],
   "source": [
    "cats = pd.qcut(ages, 10, precision=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "8653757b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(33.0, 41.0]    27\n",
       "(54.0, 63.0]    27\n",
       "(-1.0, 9.0]     26\n",
       "(9.0, 20.0]     26\n",
       "(82.0, 91.0]    26\n",
       "(71.0, 82.0]    25\n",
       "(91.0, 98.0]    24\n",
       "(20.0, 33.0]    23\n",
       "(41.0, 54.0]    23\n",
       "(63.0, 71.0]    23\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.Series(cats).value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d1f75798",
   "metadata": {},
   "source": [
    "If we want to ensure that each age group actually includes exactly ten years, we can specify this directly with [pandas.Categorical](https://pandas.pydata.org/docs/reference/api/pandas.Categorical.html):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "3be29378",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['0 - 9', '10 - 19', '20 - 29', '30 - 39', '40 - 49', '50 - 59',\n",
       "       '60 - 69', '70 - 79', '80 - 89', '90 - 99'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "age_groups = [\"{0} - {1}\".format(i, i + 9) for i in range(0, 99, 10)]\n",
    "cats = pd.Categorical(age_groups)\n",
    "\n",
    "cats.categories"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b281dfd0",
   "metadata": {},
   "source": [
    "For grouping we can now use [pandas.cut](https://pandas.pydata.org/docs/reference/api/pandas.cut.html). However, the number of labels must be one less than the number of edges:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "f5180bc8",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Age</th>\n",
       "      <th>Age group</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>36</td>\n",
       "      <td>30 - 39</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>20</td>\n",
       "      <td>20 - 29</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>54</td>\n",
       "      <td>50 - 59</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>63</td>\n",
       "      <td>60 - 69</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>60</td>\n",
       "      <td>60 - 69</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>245</th>\n",
       "      <td>35</td>\n",
       "      <td>30 - 39</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>246</th>\n",
       "      <td>93</td>\n",
       "      <td>90 - 99</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>247</th>\n",
       "      <td>97</td>\n",
       "      <td>90 - 99</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>248</th>\n",
       "      <td>84</td>\n",
       "      <td>80 - 89</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>249</th>\n",
       "      <td>9</td>\n",
       "      <td>0 - 9</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>250 rows × 2 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     Age Age group\n",
       "0     36   30 - 39\n",
       "1     20   20 - 29\n",
       "2     54   50 - 59\n",
       "3     63   60 - 69\n",
       "4     60   60 - 69\n",
       "..   ...       ...\n",
       "245   35   30 - 39\n",
       "246   93   90 - 99\n",
       "247   97   90 - 99\n",
       "248   84   80 - 89\n",
       "249    9     0 - 9\n",
       "\n",
       "[250 rows x 2 columns]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[\"Age group\"] = pd.cut(df.Age, range(0, 101, 10), right=False, labels=cats)\n",
    "\n",
    "df"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.13 Kernel",
   "language": "python",
   "name": "python313"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.0"
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {},
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}