Detecting and filtering outliers

Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some uniformly distributed data in the interval [0, 1):

[1]:
import numpy as np
import pandas as pd


rng = np.random.default_rng()
df = pd.DataFrame(rng.random(size=(1000, 4)))

df.describe()
[1]:
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean      0.498271     0.530054     0.493507     0.517449
std       0.290744     0.289093     0.282705     0.285510
min       0.000161     0.000342     0.001022     0.000311
25%       0.251230     0.286150     0.259627     0.262910
50%       0.494820     0.537394     0.493997     0.532155
75%       0.738350     0.789380     0.743976     0.760824
max       0.999730     0.996145     0.999650     0.997659

Suppose you want to find the values in one of the columns whose absolute value is greater than 0.9:

[2]:
col = df[1]

col[col.abs() > 0.9]
[2]:
6      0.951639
12     0.923172
23     0.977943
30     0.908939
31     0.948341
         ...
969    0.952592
971    0.994060
975    0.984719
986    0.959456
990    0.973786
Name: 1, Length: 120, dtype: float64
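
Before filtering, it can help to know how many values exceed the threshold in each column. The following is a minimal sketch, not part of the original notebook: the seed 42 and the names frame, threshold and exceeds are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the seed is arbitrary and only makes the sketch repeatable
rng = np.random.default_rng(42)
frame = pd.DataFrame(rng.random(size=(1000, 4)))

threshold = 0.9

# Element-wise Boolean DataFrame: True where the absolute value exceeds the threshold
exceeds = frame.abs() > threshold

# Summing Booleans counts the True values per column
counts = exceeds.sum()

# The mean of Booleans is the share of values above the threshold per column
share = exceeds.mean()
```

Since True counts as 1 and False as 0, sum and mean on the Boolean DataFrame give the number and the share of flagged values without any explicit loop.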

To select all rows in which at least one column holds a value greater than 0.9 or less than -0.9, build a Boolean DataFrame and apply pandas.DataFrame.any to it; any(axis=1) returns True for every row that contains at least one True value:

[3]:
df[(df.abs() > 0.9).any(axis=1)]
[3]:
            0         1         2         3
0    0.971756  0.416897  0.475384  0.141742
1    0.958448  0.859031  0.040706  0.539540
4    0.939127  0.789107  0.368062  0.288842
6    0.650532  0.951639  0.217202  0.590328
12   0.327709  0.923172  0.251240  0.633213
...       ...       ...       ...       ...
988  0.914319  0.337331  0.829953  0.577455
990  0.612209  0.973786  0.170164  0.348713
993  0.976626  0.293988  0.065464  0.727823
997  0.920140  0.065826  0.719757  0.199114
999  0.488105  0.101673  0.924032  0.029535

349 rows × 4 columns
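
The row selection combines three steps that are easier to follow on a tiny example. This is a sketch under stated assumptions: the DataFrame small and its column names a and b are hypothetical and do not come from the data above.

```python
import pandas as pd

# Hypothetical three-row frame so every intermediate result can be read at a glance
small = pd.DataFrame({"a": [0.1, 0.95, 0.3], "b": [-0.97, 0.2, 0.4]})

# Step 1: element-wise comparison yields a Boolean DataFrame of the same shape
mask = small.abs() > 0.9

# Step 2: any(axis=1) reduces across the columns, giving one Boolean per row
row_has_outlier = mask.any(axis=1)

# Step 3: Boolean indexing keeps only the rows flagged in step 2
outlier_rows = small[row_has_outlier]
```

Rows 0 and 1 are kept because each contains one value whose absolute value exceeds 0.9; row 2 is dropped. Using all(axis=1) instead would keep only rows in which every column exceeds the threshold.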

On this basis, the values can be limited to the interval between -0.9 and 0.9. For this we use np.sign(df), which returns 1 for positive values and -1 for negative values, so that multiplying by 0.9 yields the cap with the correct sign:

[4]:
df[df.abs() > 0.9] = np.sign(df) * 0.9

df.describe()
[4]:
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean      0.492438     0.524041     0.489124     0.512363
std       0.281780     0.280524     0.275768     0.277965
min       0.000161     0.000342     0.001022     0.000311
25%       0.251230     0.286150     0.259627     0.262910
50%       0.494820     0.537394     0.493997     0.532155
75%       0.738350     0.789380     0.743976     0.760824
max       0.900000     0.900000     0.900000     0.900000
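
Because rng.random only produces values in [0, 1), np.sign always returns 1 for the data above and the lower bound is never reached. The following sketch shows the same capping on data with both tails. It rests on assumptions that are not part of the original: standard normal data, a threshold of 3, an arbitrary seed of 0, and the hypothetical names data, capped_sign and capped_clip. It also compares the result with pandas.DataFrame.clip, which may be a more direct way to express the same limit.

```python
import numpy as np
import pandas as pd

# Hypothetical data with positive and negative values; the seed is arbitrary
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.standard_normal(size=(1000, 4)))

# Variant 1: cap with np.sign, so values below -3 become -3 and values above 3 become 3
capped_sign = data.copy()
capped_sign[capped_sign.abs() > 3] = np.sign(capped_sign) * 3

# Variant 2: DataFrame.clip limits all values to the closed interval [-3, 3]
capped_clip = data.clip(lower=-3, upper=3)

# Both variants should agree element by element
same_result = bool((capped_sign == capped_clip).all().all())
```

Working on a copy leaves data unchanged, which makes it possible to compare the capped values with the originals afterwards.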