# Subdividing and categorising data#

Continuous data is often divided into domains or otherwise grouped for analysis.

Suppose you have data on a group of people in a study that you want to divide into discrete age groups. For this, we generate a dataframe with 250 entries between 0 and 99:

[1]:

import numpy as np
import pandas as pd

ages = np.random.randint(0, 99, 250)
df = pd.DataFrame({"Age": ages})

df

[1]:

Age
0 22
1 82
2 6
3 3
4 28
... ...
245 15
246 86
247 91
248 55
249 15

250 rows × 1 columns

Afterwards, pandas offers us a simple way to divide the results into ten ranges with pandas.cut. To get only whole years, we additionally set precision=0:

[2]:

cats = pd.cut(ages, 10, precision=0)

cats

[2]:

[(20.0, 29.0], (78.0, 88.0], (-0.1, 10.0], (-0.1, 10.0], (20.0, 29.0], ..., (10.0, 20.0], (78.0, 88.0], (88.0, 98.0], (49.0, 59.0], (10.0, 20.0]]
Length: 250
Categories (10, interval[float64, right]): [(-0.1, 10.0] < (10.0, 20.0] < (20.0, 29.0] < (29.0, 39.0] ... (59.0, 69.0] < (69.0, 78.0] < (78.0, 88.0] < (88.0, 98.0]]


With pandas.Categorical.categories you can display the categories:

[3]:

cats.categories

[3]:

IntervalIndex([(-0.1, 10.0], (10.0, 20.0], (20.0, 29.0], (29.0, 39.0], (39.0, 49.0], (49.0, 59.0], (59.0, 69.0], (69.0, 78.0], (78.0, 88.0], (88.0, 98.0]], dtype='interval[float64, right]')


… or even just a single category:

[4]:

cats.categories[0]

[4]:

Interval(-0.1, 10.0, closed='right')


With pandas.Categorical.codes you can display an array where for each value the corresponding category is shown:

[5]:

cats.codes

[5]:

array([2, 8, 0, 0, 2, 6, 3, 9, 2, 1, 7, 0, 5, 1, 3, 6, 6, 7, 1, 9, 1, 6,
3, 4, 3, 2, 6, 8, 5, 0, 5, 4, 0, 8, 5, 8, 3, 8, 7, 8, 6, 1, 1, 2,
3, 4, 7, 1, 5, 9, 4, 2, 8, 2, 9, 6, 0, 9, 0, 9, 5, 0, 1, 5, 6, 5,
3, 9, 0, 4, 2, 8, 9, 6, 5, 4, 4, 5, 6, 1, 7, 4, 1, 7, 0, 0, 1, 3,
3, 7, 5, 1, 9, 3, 0, 1, 7, 5, 9, 5, 3, 9, 3, 6, 7, 6, 9, 9, 6, 0,
1, 1, 3, 2, 9, 6, 0, 2, 9, 3, 8, 3, 1, 2, 7, 2, 6, 7, 9, 6, 1, 5,
3, 3, 1, 4, 6, 9, 8, 4, 0, 4, 8, 7, 5, 5, 4, 5, 1, 5, 2, 8, 2, 6,
0, 1, 8, 6, 7, 1, 3, 3, 3, 1, 3, 0, 6, 3, 9, 5, 9, 4, 3, 3, 0, 9,
7, 8, 2, 4, 1, 5, 7, 8, 6, 1, 3, 1, 4, 8, 3, 0, 0, 2, 2, 8, 9, 3,
4, 8, 4, 0, 1, 4, 9, 2, 5, 1, 1, 5, 0, 4, 7, 1, 9, 1, 7, 8, 5, 4,
1, 7, 0, 4, 5, 0, 1, 6, 8, 0, 8, 2, 6, 0, 7, 7, 0, 2, 3, 3, 2, 0,
4, 0, 5, 1, 8, 9, 5, 1], dtype=int8)


With value_counts we can now look at how the number is distributed among the individual areas:

[6]:

pd.value_counts(cats)

[6]:

(10.0, 20.0]    34
(-0.1, 10.0]    29
(29.0, 39.0]    29
(49.0, 59.0]    26
(88.0, 98.0]    24
(59.0, 69.0]    23
(39.0, 49.0]    22
(78.0, 88.0]    22
(20.0, 29.0]    21
(69.0, 78.0]    20
Name: count, dtype: int64


It is striking that the age ranges do not contain an equal number of years, but with 20.0, 29.0 and 69.0, 78.0 two ranges contain only 9 years. This is due to the fact that the age range only extends from 0 to 98:

[7]:

df.min()

[7]:

Age    0
dtype: int64

[8]:

df.max()

[8]:

Age    98
dtype: int64


With pandas.qcut, on the other hand, the set is divided into areas that are approximately the same size:

[9]:

cats = pd.qcut(ages, 10, precision=0)

[10]:

pd.value_counts(cats)

[10]:

(24.0, 36.0]    28
(9.0, 15.0]     26
(53.0, 65.0]    26
(65.0, 76.0]    26
(-1.0, 9.0]     25
(15.0, 24.0]    24
(36.0, 44.0]    24
(76.0, 88.0]    24
(88.0, 98.0]    24
(44.0, 53.0]    23
Name: count, dtype: int64


If we want to ensure that each age group actually includes exactly ten years, we can specify this directly with pandas.Categorical:

[11]:

age_groups = ["{0} - {1}".format(i, i + 9) for i in range(0, 99, 10)]
cats = pd.Categorical(age_groups)

cats.categories

[11]:

Index(['0 - 9', '10 - 19', '20 - 29', '30 - 39', '40 - 49', '50 - 59',
'60 - 69', '70 - 79', '80 - 89', '90 - 99'],
dtype='object')


For grouping we can now use pandas.cut. However, the number of labels must be one less than the number of edges:

[12]:

df["Age group"] = pd.cut(df.Age, range(0, 101, 10), right=False, labels=cats)

df

[12]:

Age Age group
0 22 20 - 29
1 82 80 - 89
2 6 0 - 9
3 3 0 - 9
4 28 20 - 29
... ... ...
245 15 10 - 19
246 86 80 - 89
247 91 90 - 99
248 55 50 - 59
249 15 10 - 19

250 rows × 2 columns