Detecting and filtering outliers

Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data:

[1]:
import numpy as np
import pandas as pd


df = pd.DataFrame(np.random.randn(1000, 4))

df.describe()
[1]:
0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean 0.022943 0.034915 0.018331 0.018969
std 1.046883 0.987310 0.972618 1.009193
min -3.870700 -2.833648 -3.466318 -3.491794
25% -0.664047 -0.621664 -0.603689 -0.671775
50% 0.012635 0.026107 0.028248 0.017063
75% 0.736662 0.697112 0.636774 0.743254
max 3.700035 3.006204 2.751574 3.405041

Suppose you want to find values in one of the columns whose absolute value is greater than 3:

[2]:
col = df[1]

col[col.abs() > 3]
[2]:
365    3.006204
Name: 1, dtype: float64

To select all rows where value is greater than 3 or less than -3 in one of the columns, you can apply pandas.DataFrame.any to a Boolean DataFrame, using any(axis=1) to check if a value is in a row:

[3]:
df[(df.abs() > 3).any(axis=1)]
[3]:
0 1 2 3
67 -0.065879 1.783196 0.554033 -3.000936
123 0.246540 -0.588655 1.366174 -3.491794
209 -3.615275 -1.539901 -1.109978 1.557272
326 -3.543526 -0.123145 -1.166289 -0.793547
357 1.168551 -0.951635 -0.892777 3.405041
361 3.116807 -0.184181 0.694654 -1.116010
365 -0.274058 3.006204 0.638351 -0.117403
384 -3.006891 0.871370 -0.888511 -0.498219
388 1.104036 0.127207 1.306627 3.164983
504 -0.344477 1.190462 -3.466318 -1.577547
711 3.700035 0.449643 -0.130976 -0.231090
841 -3.870700 0.165213 -0.401433 1.267149
956 3.188822 -0.048598 0.921613 -0.281664
957 -0.326832 -0.324983 0.384806 -3.062165

On this basis, the values can be limited to an interval between -3 and 3. For this we use the instruction np.sign(df), which generates values 1 and -1, depending on whether the values in df are positive or negative:

[4]:
df[df.abs() > 3] = np.sign(df) * 3

df.describe()
[4]:
0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean 0.023974 0.034909 0.018797 0.018954
std 1.037146 0.987292 0.971056 1.005623
min -3.000000 -2.833648 -3.000000 -3.000000
25% -0.664047 -0.621664 -0.603689 -0.671775
50% 0.012635 0.026107 0.028248 0.017063
75% 0.736662 0.697112 0.636774 0.743254
max 3.000000 3.000000 2.751574 3.000000