Managing missing data with pandas#
Missing data often occurs in data analyses. pandas simplifies working with missing data as much as possible. For example, all descriptive statistics of pandas objects exclude missing data by default. pandas uses the floating point value NaN
(Not a Number) for numerical data to represent missing data.
Methods for handling NA
objects:
Argument |
Description |
---|---|
|
filters axis labels based on whether values for each label have missing data, with different thresholds for the amount of missing data to tolerate |
|
fills missing data with a value or with an interpolation method such as |
|
returns boolean values indicating which values are missing/ |
|
negates |
This notebook introduces some ways to manage missing data using Pandas DataFrames. For more information, see the Pandas documentation: Working with missing data and Missing data cookbook.
[1]:
import pandas as pd
[2]:
df = pd.read_csv(
"https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example_with_nulls.csv"
)
1. Check the data#
In pandas, a convention borrowed from the R programming language was adopted and missing data was referred to as NA
, which stands for not available. In statistical applications, NA
data can be either data that does not exist or data that exists but has not been observed (for example due to problems in data collection). When cleaning data for analysis, it is often important to analyse the missing data itself to identify problems in data collection or potential biases in the data due to
the missing data. First we display the first 20 data records:
[3]:
df.head(20)
[3]:
timestamp | username | temperature | heartrate | build | latest | note | |
---|---|---|---|---|---|---|---|
0 | 2017-01-01T12:00:23 | michaelsmith | 12.0 | 67 | 4e6a7805-8faa-2768-6ef6-eb3198b483ac | 0.0 | interval |
1 | 2017-01-01T12:01:09 | kharrison | 6.0 | 78 | 7256b7b0-e502-f576-62ec-ed73533c9c84 | 0.0 | wake |
2 | 2017-01-01T12:01:34 | smithadam | 5.0 | 89 | 9226c94b-bb4b-a6c8-8e02-cb42b53e9c90 | 0.0 | NaN |
3 | 2017-01-01T12:02:09 | eddierodriguez | 28.0 | 76 | NaN | 0.0 | update |
4 | 2017-01-01T12:02:36 | kenneth94 | 29.0 | 62 | 122f1c6a-403c-2221-6ed1-b5caa08f11e0 | NaN | NaN |
5 | 2017-01-01T12:03:04 | bryanttodd | 13.0 | 86 | 0897dbe5-9c5b-71ca-73a1-7586959ca198 | 0.0 | interval |
6 | 2017-01-01T12:03:51 | andrea98 | 17.0 | 81 | 1c07ab9b-5f66-137d-a74f-921a41001f4e | 1.0 | NaN |
7 | 2017-01-01T12:04:35 | scott28 | 16.0 | 76 | 7a60219f-6621-e548-180e-ca69624f9824 | NaN | interval |
8 | 2017-01-01T12:05:05 | hillpamela | 5.0 | 82 | a8b87754-a162-da28-2527-4bce4b3d4191 | 1.0 | NaN |
9 | 2017-01-01T12:05:41 | moorejeffrey | 25.0 | 63 | 585f1a3c-0679-0ffe-9132-508933c70343 | 0.0 | wake |
10 | 2017-01-01T12:06:21 | njohnson | NaN | 63 | e09b6001-125d-51cf-9c3f-9cb686c19d02 | NaN | NaN |
11 | 2017-01-01T12:06:53 | gsutton | 29.0 | 80 | 607c9f6e-2bdf-a606-6d16-3004c6958436 | 1.0 | update |
12 | 2017-01-01T12:07:41 | jessica48 | 22.0 | 83 | 03e1a07b-3e14-412c-3a69-6b45bc79f81c | NaN | update |
13 | 2017-01-01T12:08:08 | hornjohn | 16.0 | 73 | NaN | 0.0 | interval |
14 | 2017-01-01T12:08:35 | gramirez | 24.0 | 73 | NaN | 0.0 | wake |
15 | 2017-01-01T12:09:05 | schmidtsamuel | NaN | 78 | b9890c1e-79d5-8979-63ae-6c08a4cd476a | 0.0 | NaN |
16 | 2017-01-01T12:09:48 | derrick47 | NaN | 63 | b60bd7de-4057-8a85-f806-e6eec1350338 | NaN | interval |
17 | 2017-01-01T12:10:23 | beckercharles | 12.0 | 61 | b1dacc73-c8b7-1d7d-ee02-578da781a71e | 0.0 | test |
18 | 2017-01-01T12:10:57 | ipittman | 11.0 | 69 | 1aef7db8-9a3e-7dc9-d7a5-781ec0efd200 | NaN | user |
19 | 2017-01-01T12:11:34 | sabrina65 | 22.0 | 82 | 8075d058-7dae-e2ec-d47e-58ec6d26899b | 1.0 | NaN |
Then we look at what data type the columns are:
[4]:
df.dtypes
[4]:
timestamp object
username object
temperature float64
heartrate int64
build object
latest float64
note object
dtype: object
We can also display the values and their frequency, for example for the column note
:
[5]:
df.note.value_counts()
[5]:
note
wake 16496
user 16416
interval 16274
sleep 16226
update 16213
test 16068
Name: count, dtype: int64
2. Remove all null values (including the indication n/a
)#
2.1 …with pandas.read_csv
#
pandas.read_csv usually already filters out many values that it recognises as NA
or NaN
. Further values can be specified with na_values
.
[6]:
df = pd.read_csv(
"https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example_with_nulls.csv",
na_values=["n/a"],
)
2.2 …with pandas.DataFrame.dropna
#
Missing values can be deleted with pandas.DataFrame.dropna.
To analyse the extent of the deletions, we display the extent of the DataFrame before and after the deletion with pandas.DataFrame.shape:
[7]:
df.shape
[7]:
(146397, 7)
[8]:
df2 = df.dropna()
df2.shape
[8]:
(46116, 7)
So we would lose more than 2/3 of the records with pandas.DataFrame.dropna
.
In the next experiment, we want to analyse whether whole rows or columns contain no data. Here, how='all'
removes rows or columns that do not contain values; axis=1
says that empty rows should be removed.
[9]:
df.dropna(how="all", axis=1).shape
[9]:
(146397, 7)
This, too, does not bring us any further.
2.3 Find all columns where all data is present#
[10]:
complete_columns = list(df.columns)
[11]:
complete_columns
[11]:
['timestamp',
'username',
'temperature',
'heartrate',
'build',
'latest',
'note']
2.4 Find all columns where the most data is available#
[12]:
list(df.dropna(thresh=int(df.shape[0] * 0.9), axis=1).columns)
[12]:
['timestamp', 'username', 'heartrate']
thresh
requires a certain number of NA values, in our case 90% before axis=1
lashes a column.
2.5 Find all columns where data is missing#
With pandas.DataFrame.isnull we can find missing values and with pandas.DataFrame.any we find out if an element is valid, usually via a column.
[13]:
incomplete_columns = list(df.columns[df.isnull().any()])
[14]:
incomplete_columns
[14]:
['temperature', 'build', 'latest', 'note']
With num_missing
we can now output the number of missing values per column:
[15]:
for col in incomplete_columns:
num_missing = df[df[col].isnull() == True].shape[0]
print(f"number missing for column {col}: {num_missing}")
number missing for column temperature: 32357
number missing for column build: 32350
number missing for column latest: 32298
number missing for column note: 48704
We can also output these values as a percentage:
[16]:
for col in incomplete_columns:
percent_missing = df[df[col].isnull() == True].shape[0] / df.shape[0]
print(f"percent missing for column {col}: {percent_missing}")
percent missing for column temperature: 0.22102228870810195
percent missing for column build: 0.22097447352063226
percent missing for column latest: 0.22061927498514314
percent missing for column note: 0.332684412931959
2.6 Replace missing data#
To be able to check our changes in the latest
column, we use pandas.Series.value_counts. The method returns a series containing the number of unique values:
[17]:
df.latest.value_counts()
[17]:
latest
0.0 75735
1.0 38364
Name: count, dtype: int64
Now we replace the missing values in the column latest
with 0
with DataFrame.fillna:
[18]:
df.latest = df.latest.fillna(0)
[19]:
df.latest.value_counts()
[19]:
latest
0.0 108033
1.0 38364
Name: count, dtype: int64
2.7 Replace missing data using backfill
#
To make the records follow each other in their chronological order, we first set the index for timestamp
with set_index:
[20]:
df = df.set_index("timestamp")
[21]:
df.head(20)
[21]:
username | temperature | heartrate | build | latest | note | |
---|---|---|---|---|---|---|
timestamp | ||||||
2017-01-01T12:00:23 | michaelsmith | 12.0 | 67 | 4e6a7805-8faa-2768-6ef6-eb3198b483ac | 0.0 | interval |
2017-01-01T12:01:09 | kharrison | 6.0 | 78 | 7256b7b0-e502-f576-62ec-ed73533c9c84 | 0.0 | wake |
2017-01-01T12:01:34 | smithadam | 5.0 | 89 | 9226c94b-bb4b-a6c8-8e02-cb42b53e9c90 | 0.0 | NaN |
2017-01-01T12:02:09 | eddierodriguez | 28.0 | 76 | NaN | 0.0 | update |
2017-01-01T12:02:36 | kenneth94 | 29.0 | 62 | 122f1c6a-403c-2221-6ed1-b5caa08f11e0 | 0.0 | NaN |
2017-01-01T12:03:04 | bryanttodd | 13.0 | 86 | 0897dbe5-9c5b-71ca-73a1-7586959ca198 | 0.0 | interval |
2017-01-01T12:03:51 | andrea98 | 17.0 | 81 | 1c07ab9b-5f66-137d-a74f-921a41001f4e | 1.0 | NaN |
2017-01-01T12:04:35 | scott28 | 16.0 | 76 | 7a60219f-6621-e548-180e-ca69624f9824 | 0.0 | interval |
2017-01-01T12:05:05 | hillpamela | 5.0 | 82 | a8b87754-a162-da28-2527-4bce4b3d4191 | 1.0 | NaN |
2017-01-01T12:05:41 | moorejeffrey | 25.0 | 63 | 585f1a3c-0679-0ffe-9132-508933c70343 | 0.0 | wake |
2017-01-01T12:06:21 | njohnson | NaN | 63 | e09b6001-125d-51cf-9c3f-9cb686c19d02 | 0.0 | NaN |
2017-01-01T12:06:53 | gsutton | 29.0 | 80 | 607c9f6e-2bdf-a606-6d16-3004c6958436 | 1.0 | update |
2017-01-01T12:07:41 | jessica48 | 22.0 | 83 | 03e1a07b-3e14-412c-3a69-6b45bc79f81c | 0.0 | update |
2017-01-01T12:08:08 | hornjohn | 16.0 | 73 | NaN | 0.0 | interval |
2017-01-01T12:08:35 | gramirez | 24.0 | 73 | NaN | 0.0 | wake |
2017-01-01T12:09:05 | schmidtsamuel | NaN | 78 | b9890c1e-79d5-8979-63ae-6c08a4cd476a | 0.0 | NaN |
2017-01-01T12:09:48 | derrick47 | NaN | 63 | b60bd7de-4057-8a85-f806-e6eec1350338 | 0.0 | interval |
2017-01-01T12:10:23 | beckercharles | 12.0 | 61 | b1dacc73-c8b7-1d7d-ee02-578da781a71e | 0.0 | test |
2017-01-01T12:10:57 | ipittman | 11.0 | 69 | 1aef7db8-9a3e-7dc9-d7a5-781ec0efd200 | 0.0 | user |
2017-01-01T12:11:34 | sabrina65 | 22.0 | 82 | 8075d058-7dae-e2ec-d47e-58ec6d26899b | 1.0 | NaN |
We then use pandas.DataFrame.groupby to group the records by username
and then fill the missing data with the backfill
method of pandas.core.groupby.DataFrameGroupBy.fillna. limit
defines the maximum number of consecutive NaN
values:
[22]:
df.temperature = df.groupby("username").temperature.fillna(
method="backfill", limit=3
)
[23]:
for col in incomplete_columns:
num_missing = df[df[col].isnull() == True].shape[0]
print(f"number missing for column {col}: {num_missing}")
number missing for column temperature: 22633
number missing for column build: 32350
number missing for column latest: 0
number missing for column note: 48704
Arguments of the function fillna
:
Argument |
Description |
---|---|
|
Scalar value or dict-like object used to fill in missing values. |
Method |
interpolation; by default |
|
Axis to be filled; default |
|
changes the calling object without creating a copy |
|
for padding in forward and backward direction, maximum number of consecutive periods to pad |