Detecting and filtering outliers

Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data:

[1]:

import numpy as np
import pandas as pd


rng = np.random.default_rng()
df = pd.DataFrame(rng.normal(size=(1000, 4)))

df.describe()

[1]:

	0	1	2	3
count	1000.000000	1000.000000	1000.000000	1000.000000
mean	0.023623	0.029803	-0.028434	0.005751
std	1.012689	0.999117	0.980014	0.969842
min	-3.240759	-2.757613	-3.372727	-2.697726
25%	-0.633097	-0.650075	-0.685945	-0.684345
50%	0.001117	0.022371	-0.009798	0.002211
75%	0.705048	0.710901	0.657235	0.654270
max	3.262954	3.302385	2.956046	2.634073

Suppose you want to find values in one of the columns whose absolute value is greater than 3:

[2]:

col = df[1]

col[col.abs() > 3]

[2]:

220    3.011150
640    3.201065
674    3.302385
Name: 1, dtype: float64
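To make the intermediate step visible, the following minimal sketch builds the Boolean mask separately before indexing with it. The sample values and the names `s`, `mask`, `outliers` and `n_outliers` are assumptions chosen for illustration; they are not part of the data above.

```python
import pandas as pd

# Hypothetical sample values, chosen only to illustrate the mask
s = pd.Series([0.5, -3.4, 2.9, 3.1, -0.2])

# Boolean Series: True where the absolute value exceeds 3
mask = s.abs() > 3

# Indexing with the mask keeps only the entries where it is True
outliers = s[mask]

# True counts as 1, so the sum of the mask is the number of outliers
n_outliers = int(mask.sum())
```

Keeping the mask in its own variable can be convenient when the same condition is needed both for selecting and for counting.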

To select all rows in which at least one column holds a value greater than 3 or less than -3, apply pandas.DataFrame.any to the Boolean DataFrame. With axis=1, any checks row by row whether at least one value is True:

[3]:

df[(df.abs() > 3).any(axis=1)]

[3]:

	0	1	2	3
0	3.073224	0.392376	0.464029	-1.086741
50	0.456746	0.313551	-3.372727	0.232789
78	3.262954	-1.511093	-0.243049	-0.424410
220	0.657494	3.011150	-0.733968	0.549828
504	-3.240759	0.202480	-0.181495	0.088678
640	0.913886	3.201065	-0.896181	1.048140
674	-0.283886	3.302385	-0.541091	0.524652
833	3.010898	-0.341878	-0.409523	0.264089
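The same row mask can also be inverted to filter the outlier rows out instead of selecting them. The sketch below is one possible way to do this; the seed and the names `data`, `row_has_outlier`, `per_column` and `filtered` are assumptions made for the example, so the numbers will differ from the output above.

```python
import numpy as np
import pandas as pd

# Seeded generator so that the sketch is reproducible (the seed is arbitrary)
rng = np.random.default_rng(42)
data = pd.DataFrame(rng.normal(size=(1000, 4)))

# True for each row that contains at least one value with |value| > 3
row_has_outlier = (data.abs() > 3).any(axis=1)

# Number of outliers in each column
per_column = (data.abs() > 3).sum()

# ~ inverts the mask, keeping only the rows without any outlier
filtered = data[~row_has_outlier]
```

Dropping whole rows discards the unremarkable values in those rows as well, so whether this is appropriate depends on the analysis.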

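On this basis, values can be capped at the threshold. np.sign(df) returns 1 for positive values, -1 for negative values and 0 for zeros, so np.sign(df) * 3 yields the bound with the same sign as each value. To limit values to the full interval from -3 to 3, use the mask df.abs() > 3 instead. The cell below uses df > 3, which selects only values above 3, so only the upper tail is capped and the minima in the summary remain below -3: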

[4]:

df[df > 3] = np.sign(df) * 3

df.describe()

[4]:

	0	1	2	3
count	1000.000000	1000.000000	1000.000000	1000.000000
mean	0.023276	0.029288	-0.028434	0.005751
std	1.011630	0.997518	0.980014	0.969842
min	-3.240759	-2.757613	-3.372727	-2.697726
25%	-0.633097	-0.650075	-0.685945	-0.684345
50%	0.001117	0.022371	-0.009798	0.002211
75%	0.705048	0.710901	0.657235	0.654270
max	3.000000	3.000000	2.956046	2.634073
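The assignment above uses the mask `df > 3`, so only values above 3 are replaced, and the minima of columns 0 and 2 stay below -3. A minimal sketch of capping both tails follows, once with the mask `abs() > 3` and once with pandas.DataFrame.clip. The seed and the names `data`, `capped`, `outside` and `clipped` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Seeded generator so that the sketch is reproducible (the seed is arbitrary)
rng = np.random.default_rng(42)
data = pd.DataFrame(rng.normal(size=(1000, 4)))

# Variant 1: mask on the absolute value, so both tails are selected
capped = data.copy()
outside = capped.abs() > 3
capped[outside] = np.sign(capped) * 3  # -3 for negative, 3 for positive values

# Variant 2: clip limits all values to the interval [-3, 3] in one call
clipped = data.clip(lower=-3, upper=3)
```

Both variants should produce the same result here. clip returns a new DataFrame and leaves the original untouched, whereas the masked assignment modifies the DataFrame in place, which is why the sketch works on a copy.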