Detecting and filtering outliers¶
Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data:
[1]:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(1000, 4))
df.describe()
[1]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
mean | -0.034508 | 0.011824 | -0.024031 | -0.048423 |
std | 1.023096 | 1.069939 | 1.037148 | 0.972926 |
min | -2.998919 | -2.939683 | -3.980539 | -3.180228 |
25% | -0.735324 | -0.739318 | -0.690162 | -0.699223 |
50% | -0.020213 | 0.009185 | -0.041272 | -0.046438 |
75% | 0.661472 | 0.728629 | 0.675814 | 0.588834 |
max | 3.187850 | 3.693235 | 3.950033 | 3.089895 |
Suppose you want to find values in one of the columns whose absolute value is greater than 3:
[2]:
col = df[1]
col[col.abs() > 3]
[2]:
435 3.693235
Name: 1, dtype: float64
To select all rows where value is greater than 3
or less than -3
in one of the columns, you can apply pandas.DataFrame.any to a Boolean DataFrame, using any(axis=1)
to check if a value is in a row:
[3]:
df[(df.abs() > 3).any(axis=1)]
[3]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
103 | -0.477368 | -0.100079 | -1.466754 | -3.180228 |
188 | 1.962728 | -0.072791 | 3.950033 | -0.012231 |
210 | 1.498744 | -0.057742 | 3.412662 | 0.586651 |
245 | 3.016760 | 1.527263 | 1.790951 | -0.015122 |
282 | 1.006073 | -0.480924 | 0.259646 | 3.089895 |
385 | 3.187850 | -1.069850 | -0.641928 | 1.733524 |
435 | -0.303929 | 3.693235 | -0.590390 | 0.052511 |
606 | -0.220844 | -0.479557 | -3.012150 | -1.476384 |
613 | 0.715983 | 0.134178 | -3.835888 | -1.358231 |
666 | -0.351409 | 1.919364 | -3.014478 | -0.340513 |
743 | 0.227552 | -0.831102 | -0.905155 | -3.046226 |
824 | 0.109159 | 0.501608 | -3.980539 | -0.783160 |
829 | 3.075201 | 1.517391 | 1.191999 | -0.690774 |
882 | -0.445649 | 0.455558 | -3.241675 | 2.569407 |
On this basis, the values can be limited to an interval between -3 and 3. For this we use the instruction np.sign(df)
, which generates values 1 and -1, depending on whether the values in df
are positive or negative:
[4]:
df[df.abs() > 3] = np.sign(df) * 3
df.describe()
[4]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
mean | -0.034787 | 0.011131 | -0.023309 | -0.048286 |
std | 1.022245 | 1.067774 | 1.025773 | 0.971934 |
min | -2.998919 | -2.939683 | -3.000000 | -3.000000 |
25% | -0.735324 | -0.739318 | -0.690162 | -0.699223 |
50% | -0.020213 | 0.009185 | -0.041272 | -0.046438 |
75% | 0.661472 | 0.728629 | 0.675814 | 0.588834 |
max | 3.000000 | 3.000000 | 3.000000 | 3.000000 |