Managing missing data with pandas¶

Missing data often occurs in data analysis. pandas simplifies working with missing data as much as possible. For example, all descriptive statistics of pandas objects exclude missing data by default. pandas uses the floating point value NaN (Not a Number) to represent missing data for numerical data.

pandas has adopted a convention borrowed from the R programming language and refers to missing data as NA, which stands for not available. In statistical applications, NA data can be either data that does not exist or data that exists but has not been observed (for example, due to problems with data collection). Python’s None object is also treated as NA in non-numeric arrays.

Methods for handling NA objects:

Argument	Description
`dropna`	filters axis labels based on whether values for individual labels have missing data, applying different thresholds for the amount of missing data to be tolerated.
`fillna`	fills missing data with a value or with an interpolation method such as `ffill` or `bfill`.
`isna`	returns Boolean values indicating which values are missing/`NA`.
`notna`	negates `isna` and returns `True` for non-`NA` values and `False` for `NA` values.

This notebook presents some ways to manage missing data with pandas DataFrames. For more information, see the Pandas documentation: Working with missing data and Missing data cookbook.

See also:

Dora

Badfish

[1]:

import pandas as pd

[2]:

df = pd.read_csv(
    "https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example_with_nulls.csv",
)

1. Checking the data¶

When cleaning data for analysis, it is often important to analyse the missing data itself in order to identify problems with data collection or potential biases in the data due to the missing data. First, let’s display the first 20 data records:

[3]:

df.head(20)

[3]:

	timestamp	username	temperature	heartrate	build	latest	note
0	2017-01-01T12:00:23	michaelsmith	12.0	67	4e6a7805-8faa-2768-6ef6-eb3198b483ac	0.0	interval
1	2017-01-01T12:01:09	kharrison	6.0	78	7256b7b0-e502-f576-62ec-ed73533c9c84	0.0	wake
2	2017-01-01T12:01:34	smithadam	5.0	89	9226c94b-bb4b-a6c8-8e02-cb42b53e9c90	0.0	NaN
3	2017-01-01T12:02:09	eddierodriguez	28.0	76	NaN	0.0	update
4	2017-01-01T12:02:36	kenneth94	29.0	62	122f1c6a-403c-2221-6ed1-b5caa08f11e0	NaN	NaN
5	2017-01-01T12:03:04	bryanttodd	13.0	86	0897dbe5-9c5b-71ca-73a1-7586959ca198	0.0	interval
6	2017-01-01T12:03:51	andrea98	17.0	81	1c07ab9b-5f66-137d-a74f-921a41001f4e	1.0	NaN
7	2017-01-01T12:04:35	scott28	16.0	76	7a60219f-6621-e548-180e-ca69624f9824	NaN	interval
8	2017-01-01T12:05:05	hillpamela	5.0	82	a8b87754-a162-da28-2527-4bce4b3d4191	1.0	NaN
9	2017-01-01T12:05:41	moorejeffrey	25.0	63	585f1a3c-0679-0ffe-9132-508933c70343	0.0	wake
10	2017-01-01T12:06:21	njohnson	NaN	63	e09b6001-125d-51cf-9c3f-9cb686c19d02	NaN	NaN
11	2017-01-01T12:06:53	gsutton	29.0	80	607c9f6e-2bdf-a606-6d16-3004c6958436	1.0	update
12	2017-01-01T12:07:41	jessica48	22.0	83	03e1a07b-3e14-412c-3a69-6b45bc79f81c	NaN	update
13	2017-01-01T12:08:08	hornjohn	16.0	73	NaN	0.0	interval
14	2017-01-01T12:08:35	gramirez	24.0	73	NaN	0.0	wake
15	2017-01-01T12:09:05	schmidtsamuel	NaN	78	b9890c1e-79d5-8979-63ae-6c08a4cd476a	0.0	NaN
16	2017-01-01T12:09:48	derrick47	NaN	63	b60bd7de-4057-8a85-f806-e6eec1350338	NaN	interval
17	2017-01-01T12:10:23	beckercharles	12.0	61	b1dacc73-c8b7-1d7d-ee02-578da781a71e	0.0	test
18	2017-01-01T12:10:57	ipittman	11.0	69	1aef7db8-9a3e-7dc9-d7a5-781ec0efd200	NaN	user
19	2017-01-01T12:11:34	sabrina65	22.0	82	8075d058-7dae-e2ec-d47e-58ec6d26899b	1.0	NaN

Then let’s take a look at what data type the columns are:

[4]:

df.dtypes

[4]:

timestamp       object
username        object
temperature    float64
heartrate        int64
build           object
latest         float64
note            object
dtype: object

With pandas.api.types, we can also automatically check whether the data types meet our expectations:

[5]:

import pandas.api.types as ptypes


expected_datetimes = ["timestamp"]
expected_objects = ["username", "build", "note"]
expected_floats = ["temperature", "latest"]
expected_ints = ["heartrate"]

if not all(
    ptypes.is_datetime64_any_dtype(df[col]) for col in expected_datetimes
):
    msg = f"Expected datetimes in {expected_datetimes}."
    raise ValueError(msg)
if not all(ptypes.is_object_dtype(df[col]) for col in expected_objects):
    msg = f"Expected objects in {expected_objects}."
    raise ValueError(msg)
if not all(ptypes.is_float_dtype(df[col]) for col in expected_floats):
    msg = f"Expected floats in {expected_floats}"
    raise ValueError(msg)
if not all(ptypes.is_int64_dtype(df[col]) for col in expected_ints):
    msg = f"Expected ints in {expected_ints}"
    raise ValueError(msg)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[5], line 13
      9 if not all(
     10     ptypes.is_datetime64_any_dtype(df[col]) for col in expected_datetimes
     11 ):
     12     msg = f"Expected datetimes in {expected_datetimes}."
---> 13     raise ValueError(msg)
     14 if not all(ptypes.is_object_dtype(df[col]) for col in expected_objects):
     15     msg = f"Expected objects in {expected_objects}."

ValueError: Expected datetimes in ['timestamp'].

Now we can convert the timestamp column to the appropriate type, see also Converting dtype:

[6]:

df["timestamp"] = df["timestamp"].astype("datetime64[ns]")

We can also display the values and their frequency, for example for the column note:

[7]:

df.note.value_counts()

[7]:

note
wake        16496
user        16416
interval    16274
sleep       16226
update      16213
test        16068
Name: count, dtype: int64

2. Removing all zero values (including `n/a`)¶

2.1 … with `pandas.read_csv`¶

pandas.read_csv normally filters out many values that it recognises as NA or NaN. Additional values can be specified with na_values.

[8]:

df = pd.read_csv(
    "https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example_with_nulls.csv",
    na_values=["n/a"],
)

2.2 … with `pandas.DataFrame.dropna`¶

Missing values can be deleted using pandas.DataFrame.dropna.

To analyse the scope of the deletions, we display the scope of the DataFrame before and after deletion using pandas.DataFrame.shape:

[9]:

df.shape

[9]:

(146397, 7)

[10]:

df.dropna().shape

[10]:

(46116, 7)

Using pandas.DataFrame.dropna would therefore result in the loss of more than two-thirds of the data records.

In the next attempt, we want to analyse whether entire rows or columns contain no data. Using how="all", rows or columns that contain no values are removed; axis=1 specifies that empty rows should be removed.

[11]:

df.dropna(how="all", axis=1).shape

[11]:

(146397, 7)

This does not get us any further either.

2.3 Find all columns containing the most data¶

[12]:

list(df.dropna(thresh=int(df.shape[0] * 0.9), axis=1).columns)

[12]:

['timestamp', 'username', 'heartrate']

thresh requires a certain number of NA values, in our case 90%, before axis=1 hides a column.

2.4 Find all columns with missing data¶

With pandas.DataFrame.isna, we can find missing values, and with pandas.DataFrame.any, we can find out whether an element is valid, usually across a column.

[13]:

incomplete_columns = list(df.columns[df.isna().any()])

[14]:

incomplete_columns

[14]:

['temperature', 'build', 'latest', 'note']

With num_missing, we can now display the number of missing values per column:

[15]:

for col in incomplete_columns:
    num_missing = df[df[col].isna()].shape[0]
    print(f"number missing for column {col}: {num_missing}")

number missing for column temperature: 32357
number missing for column build: 32350
number missing for column latest: 32298
number missing for column note: 48704

We can also display these values as percentages:

[16]:

for col in incomplete_columns:
    percent_missing = df[df[col].isna()].shape[0] / df.shape[0]
    print(f"percent missing for column {col}: {percent_missing}")

percent missing for column temperature: 0.22102228870810195
percent missing for column build: 0.22097447352063226
percent missing for column latest: 0.22061927498514314
percent missing for column note: 0.332684412931959

2.5 Replacing missing data¶

To verify our changes in the latest column, we use pandas.Series.value_counts. The method returns a series containing the number of unique values:

[17]:

df.temperature.value_counts()

[17]:

temperature
29.0    4688
26.0    4674
16.0    4656
28.0    4648
10.0    4632
13.0    4629
7.0     4624
27.0    4621
21.0    4585
9.0     4576
23.0    4571
5.0     4568
6.0     4563
19.0    4561
18.0    4557
17.0    4556
11.0    4529
15.0    4525
8.0     4486
12.0    4484
20.0    4473
25.0    4469
14.0    4464
22.0    4455
24.0    4446
Name: count, dtype: int64

Now we replace the missing values in the temperature column with the mean value rounded to one decimal place using DataFrame.fillna:

[18]:

temp_mean = round(df.temperature.mean(), 1)
fill_mean = df.temperature.fillna(temp_mean)
fill_mean.value_counts()

[18]:

temperature
17.0    36913
29.0     4688
26.0     4674
16.0     4656
28.0     4648
10.0     4632
13.0     4629
7.0      4624
27.0     4621
21.0     4585
9.0      4576
23.0     4571
5.0      4568
6.0      4563
19.0     4561
18.0     4557
11.0     4529
15.0     4525
8.0      4486
12.0     4484
20.0     4473
25.0     4469
14.0     4464
22.0     4455
24.0     4446
Name: count, dtype: int64

2.6 Replacing missing data with `backfill`¶

To ensure that the data records follow each other in chronological order, we first set the index for timestamp with set_index:

[19]:

df = df.set_index("timestamp")

[20]:

df.head(20)

[20]:

	username	temperature	heartrate	build	latest	note
timestamp
2017-01-01T12:00:23	michaelsmith	12.0	67	4e6a7805-8faa-2768-6ef6-eb3198b483ac	0.0	interval
2017-01-01T12:01:09	kharrison	6.0	78	7256b7b0-e502-f576-62ec-ed73533c9c84	0.0	wake
2017-01-01T12:01:34	smithadam	5.0	89	9226c94b-bb4b-a6c8-8e02-cb42b53e9c90	0.0	NaN
2017-01-01T12:02:09	eddierodriguez	28.0	76	NaN	0.0	update
2017-01-01T12:02:36	kenneth94	29.0	62	122f1c6a-403c-2221-6ed1-b5caa08f11e0	NaN	NaN
2017-01-01T12:03:04	bryanttodd	13.0	86	0897dbe5-9c5b-71ca-73a1-7586959ca198	0.0	interval
2017-01-01T12:03:51	andrea98	17.0	81	1c07ab9b-5f66-137d-a74f-921a41001f4e	1.0	NaN
2017-01-01T12:04:35	scott28	16.0	76	7a60219f-6621-e548-180e-ca69624f9824	NaN	interval
2017-01-01T12:05:05	hillpamela	5.0	82	a8b87754-a162-da28-2527-4bce4b3d4191	1.0	NaN
2017-01-01T12:05:41	moorejeffrey	25.0	63	585f1a3c-0679-0ffe-9132-508933c70343	0.0	wake
2017-01-01T12:06:21	njohnson	NaN	63	e09b6001-125d-51cf-9c3f-9cb686c19d02	NaN	NaN
2017-01-01T12:06:53	gsutton	29.0	80	607c9f6e-2bdf-a606-6d16-3004c6958436	1.0	update
2017-01-01T12:07:41	jessica48	22.0	83	03e1a07b-3e14-412c-3a69-6b45bc79f81c	NaN	update
2017-01-01T12:08:08	hornjohn	16.0	73	NaN	0.0	interval
2017-01-01T12:08:35	gramirez	24.0	73	NaN	0.0	wake
2017-01-01T12:09:05	schmidtsamuel	NaN	78	b9890c1e-79d5-8979-63ae-6c08a4cd476a	0.0	NaN
2017-01-01T12:09:48	derrick47	NaN	63	b60bd7de-4057-8a85-f806-e6eec1350338	NaN	interval
2017-01-01T12:10:23	beckercharles	12.0	61	b1dacc73-c8b7-1d7d-ee02-578da781a71e	0.0	test
2017-01-01T12:10:57	ipittman	11.0	69	1aef7db8-9a3e-7dc9-d7a5-781ec0efd200	NaN	user
2017-01-01T12:11:34	sabrina65	22.0	82	8075d058-7dae-e2ec-d47e-58ec6d26899b	1.0	NaN

We then use pandas.DataFrame.groupby to group the records by username and then fill in the missing data using the backfill method from pandas.api.typing.DataFrameGroupBy.bfill. limit defines the maximum number of consecutive NaN values:

[21]:

df.temperature = df.groupby("username").temperature.bfill(limit=3)

[22]:

for col in incomplete_columns:
    num_missing = df[df[col].isna()].shape[0]
    print(f"number missing for column {col}: {num_missing}")

number missing for column temperature: 22633
number missing for column build: 32350
number missing for column latest: 32298
number missing for column note: 48704

Arguments of the fillna function:

Argument	Description
`value`	Scalar value or dict-like object used to fill in missing values
Methode	Interpolation; default is `ffill` when the function is called without any other arguments
`axis`	Axis to fill; default is `axis=0`
`inplace`	Modifies the calling object without creating a copy
`limit`	For filling in forward and backward directions, maximum number of consecutive periods to fill

Managing missing data with pandas¶

1. Checking the data¶

2. Removing all zero values (including n/a)¶

2.1 … with pandas.read_csv¶

2.2 … with pandas.DataFrame.dropna¶