{
"cells": [
{
"cell_type": "markdown",
"id": "455a3208",
"metadata": {},
"source": [
"# Detecting and filtering outliers\n",
"\n",
"Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "35bb569f",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
" \n",
" \n",
" | count | \n",
" 1000.000000 | \n",
" 1000.000000 | \n",
" 1000.000000 | \n",
" 1000.000000 | \n",
"
\n",
" \n",
" | mean | \n",
" -0.034508 | \n",
" 0.011824 | \n",
" -0.024031 | \n",
" -0.048423 | \n",
"
\n",
" \n",
" | std | \n",
" 1.023096 | \n",
" 1.069939 | \n",
" 1.037148 | \n",
" 0.972926 | \n",
"
\n",
" \n",
" | min | \n",
" -2.998919 | \n",
" -2.939683 | \n",
" -3.980539 | \n",
" -3.180228 | \n",
"
\n",
" \n",
" | 25% | \n",
" -0.735324 | \n",
" -0.739318 | \n",
" -0.690162 | \n",
" -0.699223 | \n",
"
\n",
" \n",
" | 50% | \n",
" -0.020213 | \n",
" 0.009185 | \n",
" -0.041272 | \n",
" -0.046438 | \n",
"
\n",
" \n",
" | 75% | \n",
" 0.661472 | \n",
" 0.728629 | \n",
" 0.675814 | \n",
" 0.588834 | \n",
"
\n",
" \n",
" | max | \n",
" 3.187850 | \n",
" 3.693235 | \n",
" 3.950033 | \n",
" 3.089895 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2 3\n",
"count 1000.000000 1000.000000 1000.000000 1000.000000\n",
"mean -0.034508 0.011824 -0.024031 -0.048423\n",
"std 1.023096 1.069939 1.037148 0.972926\n",
"min -2.998919 -2.939683 -3.980539 -3.180228\n",
"25% -0.735324 -0.739318 -0.690162 -0.699223\n",
"50% -0.020213 0.009185 -0.041272 -0.046438\n",
"75% 0.661472 0.728629 0.675814 0.588834\n",
"max 3.187850 3.693235 3.950033 3.089895"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"\n",
"df = pd.DataFrame(np.random.randn(1000, 4))\n",
"\n",
"df.describe()"
]
},
{
"cell_type": "markdown",
"id": "9c159cc3",
"metadata": {},
"source": [
"Suppose you want to find values in one of the columns whose absolute value is greater than 3:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "89dfec83",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"435 3.693235\n",
"Name: 1, dtype: float64"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"col = df[1]\n",
"\n",
"col[col.abs() > 3]"
]
},
{
"cell_type": "markdown",
"id": "0a411bdb",
"metadata": {},
"source": [
"To select all rows where value is greater than `3` or less than `-3` in one of the columns, you can apply [pandas.DataFrame.any](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.any.html) to a Boolean DataFrame, using `any(axis=1)` to check if a value is in a row:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "ca08a1c8",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
" \n",
" \n",
" | 103 | \n",
" -0.477368 | \n",
" -0.100079 | \n",
" -1.466754 | \n",
" -3.180228 | \n",
"
\n",
" \n",
" | 188 | \n",
" 1.962728 | \n",
" -0.072791 | \n",
" 3.950033 | \n",
" -0.012231 | \n",
"
\n",
" \n",
" | 210 | \n",
" 1.498744 | \n",
" -0.057742 | \n",
" 3.412662 | \n",
" 0.586651 | \n",
"
\n",
" \n",
" | 245 | \n",
" 3.016760 | \n",
" 1.527263 | \n",
" 1.790951 | \n",
" -0.015122 | \n",
"
\n",
" \n",
" | 282 | \n",
" 1.006073 | \n",
" -0.480924 | \n",
" 0.259646 | \n",
" 3.089895 | \n",
"
\n",
" \n",
" | 385 | \n",
" 3.187850 | \n",
" -1.069850 | \n",
" -0.641928 | \n",
" 1.733524 | \n",
"
\n",
" \n",
" | 435 | \n",
" -0.303929 | \n",
" 3.693235 | \n",
" -0.590390 | \n",
" 0.052511 | \n",
"
\n",
" \n",
" | 606 | \n",
" -0.220844 | \n",
" -0.479557 | \n",
" -3.012150 | \n",
" -1.476384 | \n",
"
\n",
" \n",
" | 613 | \n",
" 0.715983 | \n",
" 0.134178 | \n",
" -3.835888 | \n",
" -1.358231 | \n",
"
\n",
" \n",
" | 666 | \n",
" -0.351409 | \n",
" 1.919364 | \n",
" -3.014478 | \n",
" -0.340513 | \n",
"
\n",
" \n",
" | 743 | \n",
" 0.227552 | \n",
" -0.831102 | \n",
" -0.905155 | \n",
" -3.046226 | \n",
"
\n",
" \n",
" | 824 | \n",
" 0.109159 | \n",
" 0.501608 | \n",
" -3.980539 | \n",
" -0.783160 | \n",
"
\n",
" \n",
" | 829 | \n",
" 3.075201 | \n",
" 1.517391 | \n",
" 1.191999 | \n",
" -0.690774 | \n",
"
\n",
" \n",
" | 882 | \n",
" -0.445649 | \n",
" 0.455558 | \n",
" -3.241675 | \n",
" 2.569407 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2 3\n",
"103 -0.477368 -0.100079 -1.466754 -3.180228\n",
"188 1.962728 -0.072791 3.950033 -0.012231\n",
"210 1.498744 -0.057742 3.412662 0.586651\n",
"245 3.016760 1.527263 1.790951 -0.015122\n",
"282 1.006073 -0.480924 0.259646 3.089895\n",
"385 3.187850 -1.069850 -0.641928 1.733524\n",
"435 -0.303929 3.693235 -0.590390 0.052511\n",
"606 -0.220844 -0.479557 -3.012150 -1.476384\n",
"613 0.715983 0.134178 -3.835888 -1.358231\n",
"666 -0.351409 1.919364 -3.014478 -0.340513\n",
"743 0.227552 -0.831102 -0.905155 -3.046226\n",
"824 0.109159 0.501608 -3.980539 -0.783160\n",
"829 3.075201 1.517391 1.191999 -0.690774\n",
"882 -0.445649 0.455558 -3.241675 2.569407"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[(df.abs() > 3).any(axis=1)]"
]
},
{
"cell_type": "markdown",
"id": "74382233",
"metadata": {},
"source": [
"On this basis, the values can be limited to an interval between -3 and 3. For this we use the instruction `np.sign(df)`, which generates values 1 and -1, depending on whether the values in `df` are positive or negative:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "6817f226",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
"
\n",
" \n",
" \n",
" \n",
" | count | \n",
" 1000.000000 | \n",
" 1000.000000 | \n",
" 1000.000000 | \n",
" 1000.000000 | \n",
"
\n",
" \n",
" | mean | \n",
" -0.034787 | \n",
" 0.011131 | \n",
" -0.023309 | \n",
" -0.048286 | \n",
"
\n",
" \n",
" | std | \n",
" 1.022245 | \n",
" 1.067774 | \n",
" 1.025773 | \n",
" 0.971934 | \n",
"
\n",
" \n",
" | min | \n",
" -2.998919 | \n",
" -2.939683 | \n",
" -3.000000 | \n",
" -3.000000 | \n",
"
\n",
" \n",
" | 25% | \n",
" -0.735324 | \n",
" -0.739318 | \n",
" -0.690162 | \n",
" -0.699223 | \n",
"
\n",
" \n",
" | 50% | \n",
" -0.020213 | \n",
" 0.009185 | \n",
" -0.041272 | \n",
" -0.046438 | \n",
"
\n",
" \n",
" | 75% | \n",
" 0.661472 | \n",
" 0.728629 | \n",
" 0.675814 | \n",
" 0.588834 | \n",
"
\n",
" \n",
" | max | \n",
" 3.000000 | \n",
" 3.000000 | \n",
" 3.000000 | \n",
" 3.000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2 3\n",
"count 1000.000000 1000.000000 1000.000000 1000.000000\n",
"mean -0.034787 0.011131 -0.023309 -0.048286\n",
"std 1.022245 1.067774 1.025773 0.971934\n",
"min -2.998919 -2.939683 -3.000000 -3.000000\n",
"25% -0.735324 -0.739318 -0.690162 -0.699223\n",
"50% -0.020213 0.009185 -0.041272 -0.046438\n",
"75% 0.661472 0.728629 0.675814 0.588834\n",
"max 3.000000 3.000000 3.000000 3.000000"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[df.abs() > 3] = np.sign(df) * 3\n",
"\n",
"df.describe()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.13 Kernel",
"language": "python",
"name": "python313"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.0"
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {},
"version_major": 2,
"version_minor": 0
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}