Detecting and filtering outliers

Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data:

[1]:
import numpy as np
import pandas as pd


df = pd.DataFrame(np.random.randn(1000, 4))

df.describe()
[1]:
0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean -0.034508 0.011824 -0.024031 -0.048423
std 1.023096 1.069939 1.037148 0.972926
min -2.998919 -2.939683 -3.980539 -3.180228
25% -0.735324 -0.739318 -0.690162 -0.699223
50% -0.020213 0.009185 -0.041272 -0.046438
75% 0.661472 0.728629 0.675814 0.588834
max 3.187850 3.693235 3.950033 3.089895

Suppose you want to find values in one of the columns whose absolute value is greater than 3:

[2]:
col = df[1]

col[col.abs() > 3]
[2]:
435    3.693235
Name: 1, dtype: float64

To select all rows where value is greater than 3 or less than -3 in one of the columns, you can apply pandas.DataFrame.any to a Boolean DataFrame, using any(axis=1) to check if a value is in a row:

[3]:
df[(df.abs() > 3).any(axis=1)]
[3]:
0 1 2 3
103 -0.477368 -0.100079 -1.466754 -3.180228
188 1.962728 -0.072791 3.950033 -0.012231
210 1.498744 -0.057742 3.412662 0.586651
245 3.016760 1.527263 1.790951 -0.015122
282 1.006073 -0.480924 0.259646 3.089895
385 3.187850 -1.069850 -0.641928 1.733524
435 -0.303929 3.693235 -0.590390 0.052511
606 -0.220844 -0.479557 -3.012150 -1.476384
613 0.715983 0.134178 -3.835888 -1.358231
666 -0.351409 1.919364 -3.014478 -0.340513
743 0.227552 -0.831102 -0.905155 -3.046226
824 0.109159 0.501608 -3.980539 -0.783160
829 3.075201 1.517391 1.191999 -0.690774
882 -0.445649 0.455558 -3.241675 2.569407

On this basis, the values can be limited to an interval between -3 and 3. For this we use the instruction np.sign(df), which generates values 1 and -1, depending on whether the values in df are positive or negative:

[4]:
df[df.abs() > 3] = np.sign(df) * 3

df.describe()
[4]:
0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean -0.034787 0.011131 -0.023309 -0.048286
std 1.022245 1.067774 1.025773 0.971934
min -2.998919 -2.939683 -3.000000 -3.000000
25% -0.735324 -0.739318 -0.690162 -0.699223
50% -0.020213 0.009185 -0.041272 -0.046438
75% 0.661472 0.728629 0.675814 0.588834
max 3.000000 3.000000 3.000000 3.000000