Detecting and filtering outliers
Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with uniformly distributed random values in the half-open interval [0, 1):
[1]:
import numpy as np
import pandas as pd
rng = np.random.default_rng()
df = pd.DataFrame(rng.random(size=(1000, 4)))
df.describe()
[1]:
| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
| mean | 0.498271 | 0.530054 | 0.493507 | 0.517449 |
| std | 0.290744 | 0.289093 | 0.282705 | 0.285510 |
| min | 0.000161 | 0.000342 | 0.001022 | 0.000311 |
| 25% | 0.251230 | 0.286150 | 0.259627 | 0.262910 |
| 50% | 0.494820 | 0.537394 | 0.493997 | 0.532155 |
| 75% | 0.738350 | 0.789380 | 0.743976 | 0.760824 |
| max | 0.999730 | 0.996145 | 0.999650 | 0.997659 |
Suppose you want to find the values in one of the columns whose absolute value is greater than 0.9:
[2]:
col = df[1]
col[col.abs() > 0.9]
[2]:
6 0.951639
12 0.923172
23 0.977943
30 0.908939
31 0.948341
...
969 0.952592
971 0.994060
975 0.984719
986 0.959456
990 0.973786
Name: 1, Length: 120, dtype: float64
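The selection above works in two steps: the comparison produces a Boolean Series, and indexing with that Series keeps only the rows where it is True. The following is a minimal sketch of these two steps, assuming a small hand-picked Series (the values and the name `demo` are hypothetical, not taken from the DataFrame above) so that the result can be checked by eye:

```python
import pandas as pd

# Hypothetical toy values, chosen so the result is easy to verify by hand.
col = pd.Series([0.10, 0.95, -0.97, 0.50, 0.91], name="demo")

# Step 1: the comparison yields a Boolean Series aligned with col.
mask = col.abs() > 0.9

# Step 2: indexing with the mask keeps only the entries where it is True.
selected = col[mask]

print(mask.tolist())
print(selected.index.tolist())
```

Because the comparison uses `abs()`, the negative value -0.97 is selected as well, which would matter for data that is not restricted to [0, 1).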
To select all rows in which at least one column has an absolute value greater than 0.9, build a Boolean DataFrame and apply pandas.DataFrame.any to it; any(axis=1) returns True for every row that contains at least one True value:
[3]:
df[(df.abs() > 0.9).any(axis=1)]
[3]:
| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0.971756 | 0.416897 | 0.475384 | 0.141742 |
| 1 | 0.958448 | 0.859031 | 0.040706 | 0.539540 |
| 4 | 0.939127 | 0.789107 | 0.368062 | 0.288842 |
| 6 | 0.650532 | 0.951639 | 0.217202 | 0.590328 |
| 12 | 0.327709 | 0.923172 | 0.251240 | 0.633213 |
| ... | ... | ... | ... | ... |
| 988 | 0.914319 | 0.337331 | 0.829953 | 0.577455 |
| 990 | 0.612209 | 0.973786 | 0.170164 | 0.348713 |
| 993 | 0.976626 | 0.293988 | 0.065464 | 0.727823 |
| 997 | 0.920140 | 0.065826 | 0.719757 | 0.199114 |
| 999 | 0.488105 | 0.101673 | 0.924032 | 0.029535 |
349 rows × 4 columns
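To make the row-wise reduction easier to follow, here is a small sketch on a three-row DataFrame. The frame `toy` and its column names are assumptions made for illustration only; the calls themselves (`abs`, `any(axis=1)`, Boolean indexing) are the same ones used above:

```python
import pandas as pd

# Hypothetical three-row frame; only rows 1 and 2 contain a value beyond 0.9.
toy = pd.DataFrame(
    {"a": [0.2, 0.95, 0.1], "b": [0.3, 0.4, -0.99], "c": [0.5, 0.6, 0.7]}
)

# Element-wise comparison: a Boolean DataFrame with the same shape as toy.
flags = toy.abs() > 0.9

# Reduce along the columns: True if any value in the row is flagged.
row_has_outlier = flags.any(axis=1)

# Keep only the flagged rows.
outlier_rows = toy[row_has_outlier]

# Summing the Boolean frame counts the flagged values per column.
per_column = flags.sum()

print(row_has_outlier.tolist())
print(outlier_rows.index.tolist())
```

Reducing with `axis=1` answers the question per row; without the argument, `any()` reduces per column instead.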
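On this basis, the values can be capped so that they lie in the interval from -0.9 to 0.9. The expression np.sign(df) returns 1, -1 or 0, depending on whether a value in df is positive, negative or zero, so multiplying it by 0.9 gives the cap with the matching sign. Since all values here are non-negative, only the upper bound takes effect: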
[4]:
# Cap values beyond the threshold at 0.9 while keeping their sign
df[df.abs() > 0.9] = np.sign(df) * 0.9
df.describe()
[4]:
| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
| mean | 0.492438 | 0.524041 | 0.489124 | 0.512363 |
| std | 0.281780 | 0.280524 | 0.275768 | 0.277965 |
| min | 0.000161 | 0.000342 | 0.001022 | 0.000311 |
| 25% | 0.251230 | 0.286150 | 0.259627 | 0.262910 |
| 50% | 0.494820 | 0.537394 | 0.493997 | 0.532155 |
| 75% | 0.738350 | 0.789380 | 0.743976 | 0.760824 |
| max | 0.900000 | 0.900000 | 0.900000 | 0.900000 |
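The sign-based capping can also be expressed with pandas.DataFrame.clip, which limits values to a lower and an upper bound. The sketch below compares both approaches on a hypothetical frame `toy` that, unlike the data above, also contains negative values; it is meant as an illustration of an alternative, not as the method used in this section:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with values beyond both -0.9 and 0.9.
toy = pd.DataFrame({"a": [0.2, 0.95, -0.99], "b": [-0.5, 0.91, 0.9]})

# Sign-based capping: replace flagged values with 0.9 times their sign.
capped_sign = toy.copy()
capped_sign[capped_sign.abs() > 0.9] = np.sign(capped_sign) * 0.9

# clip limits every value to the interval [-0.9, 0.9] in one call
# and returns a new DataFrame, leaving toy unchanged.
capped_clip = toy.clip(lower=-0.9, upper=0.9)

print(capped_sign)
print(capped_clip)
```

Both variants leave values inside the interval untouched. `clip` returns a new object, whereas the Boolean assignment modifies the DataFrame in place, which is why the sketch works on a copy.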