Detecting and filtering outliers¶
Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data:
[1]:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(1000, 4))
df.describe()
[1]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
mean | 0.022943 | 0.034915 | 0.018331 | 0.018969 |
std | 1.046883 | 0.987310 | 0.972618 | 1.009193 |
min | -3.870700 | -2.833648 | -3.466318 | -3.491794 |
25% | -0.664047 | -0.621664 | -0.603689 | -0.671775 |
50% | 0.012635 | 0.026107 | 0.028248 | 0.017063 |
75% | 0.736662 | 0.697112 | 0.636774 | 0.743254 |
max | 3.700035 | 3.006204 | 2.751574 | 3.405041 |
Suppose you want to find values in one of the columns whose absolute value is greater than 3:
[2]:
col = df[1]
col[col.abs() > 3]
[2]:
365 3.006204
Name: 1, dtype: float64
To select all rows where value is greater than 3
or less than -3
in one of the columns, you can apply pandas.DataFrame.any to a Boolean DataFrame, using any(axis=1)
to check if a value is in a row:
[3]:
df[(df.abs() > 3).any(axis=1)]
[3]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
67 | -0.065879 | 1.783196 | 0.554033 | -3.000936 |
123 | 0.246540 | -0.588655 | 1.366174 | -3.491794 |
209 | -3.615275 | -1.539901 | -1.109978 | 1.557272 |
326 | -3.543526 | -0.123145 | -1.166289 | -0.793547 |
357 | 1.168551 | -0.951635 | -0.892777 | 3.405041 |
361 | 3.116807 | -0.184181 | 0.694654 | -1.116010 |
365 | -0.274058 | 3.006204 | 0.638351 | -0.117403 |
384 | -3.006891 | 0.871370 | -0.888511 | -0.498219 |
388 | 1.104036 | 0.127207 | 1.306627 | 3.164983 |
504 | -0.344477 | 1.190462 | -3.466318 | -1.577547 |
711 | 3.700035 | 0.449643 | -0.130976 | -0.231090 |
841 | -3.870700 | 0.165213 | -0.401433 | 1.267149 |
956 | 3.188822 | -0.048598 | 0.921613 | -0.281664 |
957 | -0.326832 | -0.324983 | 0.384806 | -3.062165 |
On this basis, the values can be limited to an interval between -3 and 3. For this we use the instruction np.sign(df)
, which generates values 1 and -1, depending on whether the values in df
are positive or negative:
[4]:
df[df.abs() > 3] = np.sign(df) * 3
df.describe()
[4]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
mean | 0.023974 | 0.034909 | 0.018797 | 0.018954 |
std | 1.037146 | 0.987292 | 0.971056 | 1.005623 |
min | -3.000000 | -2.833648 | -3.000000 | -3.000000 |
25% | -0.664047 | -0.621664 | -0.603689 | -0.671775 |
50% | 0.012635 | 0.026107 | 0.028248 | 0.017063 |
75% | 0.736662 | 0.697112 | 0.636774 | 0.743254 |
max | 3.000000 | 3.000000 | 2.751574 | 3.000000 |