Managing missing data with pandas#

Missing data often occurs in data analyses. pandas simplifies working with missing data as much as possible. For example, all descriptive statistics of pandas objects exclude missing data by default. For numerical data, pandas uses the floating-point value NaN (Not a Number) to represent missing data.
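A quick sketch of this behaviour on a small, made-up Series (the values are chosen purely for illustration):

```python
import numpy as np
import pandas as pd

# NaN acts as the missing-data sentinel for numerical data:
s = pd.Series([1.0, np.nan, 3.0, np.nan])

# Descriptive statistics exclude missing data by default:
print(s.mean())   # (1.0 + 3.0) / 2 = 2.0
print(s.count())  # 2 non-NA values
```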

Methods for handling NA objects:

dropna: filters axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate
fillna: fills in missing data with a value or with an interpolation method such as ffill or bfill
isna: returns boolean values indicating which values are missing/NA
notna: negation of isna, returns True for non-NA values and False for NA values
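A minimal sketch of these four methods on a toy Series (the values are only illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

print(s.isna().tolist())       # [False, True, False]
print(s.notna().tolist())      # [True, False, True]
print(s.dropna().tolist())     # [1.0, 3.0]
print(s.fillna(0.0).tolist())  # [1.0, 0.0, 3.0]
```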

This notebook introduces some ways to manage missing data using pandas DataFrames. For more information, see the pandas documentation: Working with missing data and Missing data cookbook.

[1]:
import pandas as pd
[2]:
df = pd.read_csv(
    "https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example_with_nulls.csv"
)

1. Check the data#

pandas adopted a convention from the R programming language and refers to missing data as NA, which stands for not available. In statistical applications, NA data can be either data that does not exist or data that exists but has not been observed (for example due to problems in data collection). When cleaning data for analysis, it is often important to analyse the missing data itself to identify problems in data collection or potential biases in the data due to the missing data. First we display the first 20 data records:

[3]:
df.head(20)
[3]:
timestamp username temperature heartrate build latest note
0 2017-01-01T12:00:23 michaelsmith 12.0 67 4e6a7805-8faa-2768-6ef6-eb3198b483ac 0.0 interval
1 2017-01-01T12:01:09 kharrison 6.0 78 7256b7b0-e502-f576-62ec-ed73533c9c84 0.0 wake
2 2017-01-01T12:01:34 smithadam 5.0 89 9226c94b-bb4b-a6c8-8e02-cb42b53e9c90 0.0 NaN
3 2017-01-01T12:02:09 eddierodriguez 28.0 76 NaN 0.0 update
4 2017-01-01T12:02:36 kenneth94 29.0 62 122f1c6a-403c-2221-6ed1-b5caa08f11e0 NaN NaN
5 2017-01-01T12:03:04 bryanttodd 13.0 86 0897dbe5-9c5b-71ca-73a1-7586959ca198 0.0 interval
6 2017-01-01T12:03:51 andrea98 17.0 81 1c07ab9b-5f66-137d-a74f-921a41001f4e 1.0 NaN
7 2017-01-01T12:04:35 scott28 16.0 76 7a60219f-6621-e548-180e-ca69624f9824 NaN interval
8 2017-01-01T12:05:05 hillpamela 5.0 82 a8b87754-a162-da28-2527-4bce4b3d4191 1.0 NaN
9 2017-01-01T12:05:41 moorejeffrey 25.0 63 585f1a3c-0679-0ffe-9132-508933c70343 0.0 wake
10 2017-01-01T12:06:21 njohnson NaN 63 e09b6001-125d-51cf-9c3f-9cb686c19d02 NaN NaN
11 2017-01-01T12:06:53 gsutton 29.0 80 607c9f6e-2bdf-a606-6d16-3004c6958436 1.0 update
12 2017-01-01T12:07:41 jessica48 22.0 83 03e1a07b-3e14-412c-3a69-6b45bc79f81c NaN update
13 2017-01-01T12:08:08 hornjohn 16.0 73 NaN 0.0 interval
14 2017-01-01T12:08:35 gramirez 24.0 73 NaN 0.0 wake
15 2017-01-01T12:09:05 schmidtsamuel NaN 78 b9890c1e-79d5-8979-63ae-6c08a4cd476a 0.0 NaN
16 2017-01-01T12:09:48 derrick47 NaN 63 b60bd7de-4057-8a85-f806-e6eec1350338 NaN interval
17 2017-01-01T12:10:23 beckercharles 12.0 61 b1dacc73-c8b7-1d7d-ee02-578da781a71e 0.0 test
18 2017-01-01T12:10:57 ipittman 11.0 69 1aef7db8-9a3e-7dc9-d7a5-781ec0efd200 NaN user
19 2017-01-01T12:11:34 sabrina65 22.0 82 8075d058-7dae-e2ec-d47e-58ec6d26899b 1.0 NaN

Then we look at what data type the columns are:

[4]:
df.dtypes
[4]:
timestamp       object
username        object
temperature    float64
heartrate        int64
build           object
latest         float64
note            object
dtype: object

We can also display the values and their frequency, for example for the column note:

[5]:
df.note.value_counts()
[5]:
note
wake        16496
user        16416
interval    16274
sleep       16226
update      16213
test        16068
Name: count, dtype: int64

2. Remove all null values (including the n/a indicator)#

2.1 …with pandas.read_csv#

pandas.read_csv usually already filters out many values that it recognises as NA or NaN. Further values can be specified with na_values.

[6]:
df = pd.read_csv(
    "https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example_with_nulls.csv",
    na_values=["n/a"],
)

2.2 …with pandas.DataFrame.dropna#

Missing values can be deleted with pandas.DataFrame.dropna.

To analyse the extent of the deletions, we display the shape of the DataFrame before and after the deletion with pandas.DataFrame.shape:

[7]:
df.shape
[7]:
(146397, 7)
[8]:
df2 = df.dropna()

df2.shape
[8]:
(46116, 7)

So we would lose more than 2/3 of the records with pandas.DataFrame.dropna.

In the next experiment, we want to analyse whether whole rows or columns contain no data. Here, how='all' removes only rows or columns that contain no values at all; axis=1 specifies that empty columns, not rows, should be removed.

[9]:
df.dropna(how="all", axis=1).shape
[9]:
(146397, 7)

This, too, does not bring us any further.
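The effect of how='all' is easier to see on a small, made-up DataFrame with one entirely empty column and one entirely empty row:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],  # partially missing
    "b": [np.nan] * 3,        # entirely missing
})

# axis=1 with how="all" drops the completely empty column "b" ...
print(list(toy.dropna(how="all", axis=1).columns))  # ['a']
# ... while the default axis=0 drops the completely empty middle row:
print(toy.dropna(how="all").shape)  # (2, 2)
```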

2.3 Find all columns where all data is present#

[10]:
complete_columns = list(df.dropna(axis=1).columns)
[11]:
complete_columns
[11]:
['timestamp', 'username', 'heartrate']

2.4 Find all columns where the most data is available#

[12]:
list(df.dropna(thresh=int(df.shape[0] * 0.9), axis=1).columns)
[12]:
['timestamp', 'username', 'heartrate']

thresh specifies the minimum number of non-NA values required to keep a label, in our case 90% of the rows; with axis=1, any column that falls below this threshold is dropped.
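The same thresh logic, sketched on a toy DataFrame with four rows (names and values are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "full":   [1.0, 2.0, 3.0, 4.0],           # 4 non-NA values
    "sparse": [1.0, np.nan, np.nan, np.nan],  # 1 non-NA value
})

# Keep only columns with at least int(4 * 0.9) = 3 non-NA values:
print(list(toy.dropna(thresh=int(toy.shape[0] * 0.9), axis=1).columns))  # ['full']
```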

2.5 Find all columns where data is missing#

With pandas.DataFrame.isnull we can find missing values, and with pandas.DataFrame.any we check whether any element along a column is True, in this case whether the column contains any missing values at all.

[13]:
incomplete_columns = list(df.columns[df.isnull().any()])
[14]:
incomplete_columns
[14]:
['temperature', 'build', 'latest', 'note']

Now we can count and output the number of missing values per column as num_missing:

[15]:
for col in incomplete_columns:
    num_missing = df[df[col].isnull()].shape[0]
    print(f"number missing for column {col}: {num_missing}")
number missing for column temperature: 32357
number missing for column build: 32350
number missing for column latest: 32298
number missing for column note: 48704

We can also output these values as a percentage:

[16]:
for col in incomplete_columns:
    percent_missing = df[df[col].isnull()].shape[0] / df.shape[0]
    print(f"percent missing for column {col}: {percent_missing}")
percent missing for column temperature: 0.22102228870810195
percent missing for column build: 0.22097447352063226
percent missing for column latest: 0.22061927498514314
percent missing for column note: 0.332684412931959
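The same fractions can be computed in one vectorised step without an explicit loop, since True counts as 1 when averaged; a sketch on a toy frame rather than the IoT data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, np.nan]})

# Column-wise mean of the boolean mask = fraction of missing values:
print(toy.isnull().mean())  # a: 0.5, b: 1.0
```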

2.6 Replace missing data#

To be able to check our changes in the latest column, we use pandas.Series.value_counts. The method returns a series containing the counts of unique values:

[17]:
df.latest.value_counts()
[17]:
latest
0.0    75735
1.0    38364
Name: count, dtype: int64

Now we replace the missing values in the column latest with 0 with DataFrame.fillna:

[18]:
df.latest = df.latest.fillna(0)
[19]:
df.latest.value_counts()
[19]:
latest
0.0    108033
1.0     38364
Name: count, dtype: int64
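Note that value_counts skips NA values by default; with dropna=False the missing values themselves are counted as well, which is a handy cross-check before and after filling. A sketch on toy data:

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0, 1.0, np.nan, 0.0])

# dropna=False adds a row counting the NaN values themselves:
print(s.value_counts(dropna=False))
```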

2.7 Replace missing data using backfill#

To make the records follow each other in their chronological order, we first set timestamp as the index with set_index:

[20]:
df = df.set_index("timestamp")
[21]:
df.head(20)
[21]:
username temperature heartrate build latest note
timestamp
2017-01-01T12:00:23 michaelsmith 12.0 67 4e6a7805-8faa-2768-6ef6-eb3198b483ac 0.0 interval
2017-01-01T12:01:09 kharrison 6.0 78 7256b7b0-e502-f576-62ec-ed73533c9c84 0.0 wake
2017-01-01T12:01:34 smithadam 5.0 89 9226c94b-bb4b-a6c8-8e02-cb42b53e9c90 0.0 NaN
2017-01-01T12:02:09 eddierodriguez 28.0 76 NaN 0.0 update
2017-01-01T12:02:36 kenneth94 29.0 62 122f1c6a-403c-2221-6ed1-b5caa08f11e0 0.0 NaN
2017-01-01T12:03:04 bryanttodd 13.0 86 0897dbe5-9c5b-71ca-73a1-7586959ca198 0.0 interval
2017-01-01T12:03:51 andrea98 17.0 81 1c07ab9b-5f66-137d-a74f-921a41001f4e 1.0 NaN
2017-01-01T12:04:35 scott28 16.0 76 7a60219f-6621-e548-180e-ca69624f9824 0.0 interval
2017-01-01T12:05:05 hillpamela 5.0 82 a8b87754-a162-da28-2527-4bce4b3d4191 1.0 NaN
2017-01-01T12:05:41 moorejeffrey 25.0 63 585f1a3c-0679-0ffe-9132-508933c70343 0.0 wake
2017-01-01T12:06:21 njohnson NaN 63 e09b6001-125d-51cf-9c3f-9cb686c19d02 0.0 NaN
2017-01-01T12:06:53 gsutton 29.0 80 607c9f6e-2bdf-a606-6d16-3004c6958436 1.0 update
2017-01-01T12:07:41 jessica48 22.0 83 03e1a07b-3e14-412c-3a69-6b45bc79f81c 0.0 update
2017-01-01T12:08:08 hornjohn 16.0 73 NaN 0.0 interval
2017-01-01T12:08:35 gramirez 24.0 73 NaN 0.0 wake
2017-01-01T12:09:05 schmidtsamuel NaN 78 b9890c1e-79d5-8979-63ae-6c08a4cd476a 0.0 NaN
2017-01-01T12:09:48 derrick47 NaN 63 b60bd7de-4057-8a85-f806-e6eec1350338 0.0 interval
2017-01-01T12:10:23 beckercharles 12.0 61 b1dacc73-c8b7-1d7d-ee02-578da781a71e 0.0 test
2017-01-01T12:10:57 ipittman 11.0 69 1aef7db8-9a3e-7dc9-d7a5-781ec0efd200 0.0 user
2017-01-01T12:11:34 sabrina65 22.0 82 8075d058-7dae-e2ec-d47e-58ec6d26899b 1.0 NaN

We then use pandas.DataFrame.groupby to group the records by username and then fill the missing data backwards with pandas.core.groupby.DataFrameGroupBy.bfill (the method="backfill" argument of fillna is deprecated in newer pandas versions). limit defines the maximum number of consecutive NaN values to fill:

[22]:
df.temperature = df.groupby("username").temperature.bfill(limit=3)
[23]:
for col in incomplete_columns:
    num_missing = df[df[col].isnull() == True].shape[0]
    print(f"number missing for column {col}: {num_missing}")
number missing for column temperature: 22633
number missing for column build: 32350
number missing for column latest: 0
number missing for column note: 48704

Arguments of the function fillna:

value: scalar value or dict-like object used to fill in missing values
method: interpolation method such as ffill or bfill; deprecated in newer pandas versions in favour of DataFrame.ffill and DataFrame.bfill
axis: axis to fill along; default axis=0
inplace: modifies the calling object without creating a copy
limit: for forward and backward filling, the maximum number of consecutive periods to fill
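As a sketch of the value argument in its dict-like form, with hypothetical column names: each key selects a column and supplies a per-column fill value.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 2.0]})

# value can be a dict mapping column names to per-column fill values:
filled = toy.fillna({"a": 0.0, "b": -1.0})
print(filled["a"].tolist())  # [1.0, 0.0]
print(filled["b"].tolist())  # [-1.0, 2.0]
```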