{ "cells": [ { "cell_type": "markdown", "id": "455a3208", "metadata": {}, "source": [ "# Detecting and filtering outliers\n", "\n", "Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data:" ] }, { "cell_type": "code", "execution_count": 1, "id": "35bb569f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>0</th><th>1</th><th>2</th><th>3</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>count</th><td>1000.000000</td><td>1000.000000</td><td>1000.000000</td><td>1000.000000</td></tr>\n",
"    <tr><th>mean</th><td>-0.034508</td><td>0.011824</td><td>-0.024031</td><td>-0.048423</td></tr>\n",
"    <tr><th>std</th><td>1.023096</td><td>1.069939</td><td>1.037148</td><td>0.972926</td></tr>\n",
"    <tr><th>min</th><td>-2.998919</td><td>-2.939683</td><td>-3.980539</td><td>-3.180228</td></tr>\n",
"    <tr><th>25%</th><td>-0.735324</td><td>-0.739318</td><td>-0.690162</td><td>-0.699223</td></tr>\n",
"    <tr><th>50%</th><td>-0.020213</td><td>0.009185</td><td>-0.041272</td><td>-0.046438</td></tr>\n",
"    <tr><th>75%</th><td>0.661472</td><td>0.728629</td><td>0.675814</td><td>0.588834</td></tr>\n",
"    <tr><th>max</th><td>3.187850</td><td>3.693235</td><td>3.950033</td><td>3.089895</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ "                 0            1            2            3\n", "count  1000.000000  1000.000000  1000.000000  1000.000000\n", "mean     -0.034508     0.011824    -0.024031    -0.048423\n", "std       1.023096     1.069939     1.037148     0.972926\n", "min      -2.998919    -2.939683    -3.980539    -3.180228\n", "25%      -0.735324    -0.739318    -0.690162    -0.699223\n", "50%      -0.020213     0.009185    -0.041272    -0.046438\n", "75%       0.661472     0.728629     0.675814     0.588834\n", "max       3.187850     3.693235     3.950033     3.089895" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "\n", "df = pd.DataFrame(np.random.randn(1000, 4))\n", "\n", "df.describe()" ] }, { "cell_type": "markdown", "id": "9c159cc3", "metadata": {}, "source": [ "Suppose you want to find values in one of the columns whose absolute value is greater than 3:" ] }, { "cell_type": "code", "execution_count": 2, "id": "89dfec83", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "435    3.693235\n", "Name: 1, dtype: float64" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "col = df[1]\n", "\n", "col[col.abs() > 3]" ] }, { "cell_type": "markdown", "id": "0a411bdb", "metadata": {}, "source": [ "To select all rows in which at least one column holds a value greater than `3` or less than `-3`, apply [pandas.DataFrame.any](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.any.html) to a Boolean DataFrame; `any(axis=1)` returns `True` for every row that contains at least one `True` value:" ] }, { "cell_type": "code", "execution_count": 3, "id": "ca08a1c8", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>0</th><th>1</th><th>2</th><th>3</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>103</th><td>-0.477368</td><td>-0.100079</td><td>-1.466754</td><td>-3.180228</td></tr>\n",
"    <tr><th>188</th><td>1.962728</td><td>-0.072791</td><td>3.950033</td><td>-0.012231</td></tr>\n",
"    <tr><th>210</th><td>1.498744</td><td>-0.057742</td><td>3.412662</td><td>0.586651</td></tr>\n",
"    <tr><th>245</th><td>3.016760</td><td>1.527263</td><td>1.790951</td><td>-0.015122</td></tr>\n",
"    <tr><th>282</th><td>1.006073</td><td>-0.480924</td><td>0.259646</td><td>3.089895</td></tr>\n",
"    <tr><th>385</th><td>3.187850</td><td>-1.069850</td><td>-0.641928</td><td>1.733524</td></tr>\n",
"    <tr><th>435</th><td>-0.303929</td><td>3.693235</td><td>-0.590390</td><td>0.052511</td></tr>\n",
"    <tr><th>606</th><td>-0.220844</td><td>-0.479557</td><td>-3.012150</td><td>-1.476384</td></tr>\n",
"    <tr><th>613</th><td>0.715983</td><td>0.134178</td><td>-3.835888</td><td>-1.358231</td></tr>\n",
"    <tr><th>666</th><td>-0.351409</td><td>1.919364</td><td>-3.014478</td><td>-0.340513</td></tr>\n",
"    <tr><th>743</th><td>0.227552</td><td>-0.831102</td><td>-0.905155</td><td>-3.046226</td></tr>\n",
"    <tr><th>824</th><td>0.109159</td><td>0.501608</td><td>-3.980539</td><td>-0.783160</td></tr>\n",
"    <tr><th>829</th><td>3.075201</td><td>1.517391</td><td>1.191999</td><td>-0.690774</td></tr>\n",
"    <tr><th>882</th><td>-0.445649</td><td>0.455558</td><td>-3.241675</td><td>2.569407</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ "             0          1          2          3\n", "103  -0.477368  -0.100079  -1.466754  -3.180228\n", "188   1.962728  -0.072791   3.950033  -0.012231\n", "210   1.498744  -0.057742   3.412662   0.586651\n", "245   3.016760   1.527263   1.790951  -0.015122\n", "282   1.006073  -0.480924   0.259646   3.089895\n", "385   3.187850  -1.069850  -0.641928   1.733524\n", "435  -0.303929   3.693235  -0.590390   0.052511\n", "606  -0.220844  -0.479557  -3.012150  -1.476384\n", "613   0.715983   0.134178  -3.835888  -1.358231\n", "666  -0.351409   1.919364  -3.014478  -0.340513\n", "743   0.227552  -0.831102  -0.905155  -3.046226\n", "824   0.109159   0.501608  -3.980539  -0.783160\n", "829   3.075201   1.517391   1.191999  -0.690774\n", "882  -0.445649   0.455558  -3.241675   2.569407" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[(df.abs() > 3).any(axis=1)]" ] }, { "cell_type": "markdown", "id": "74382233", "metadata": {}, "source": [ "In the same way, the values can be capped to the interval from -3 to 3. For this we use `np.sign(df)`, which returns 1 for positive values in `df`, -1 for negative values and 0 for zeros; multiplying the result by 3 gives the cap with the matching sign:" ] }, { "cell_type": "code", "execution_count": 4, "id": "6817f226", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>0</th><th>1</th><th>2</th><th>3</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>count</th><td>1000.000000</td><td>1000.000000</td><td>1000.000000</td><td>1000.000000</td></tr>\n",
"    <tr><th>mean</th><td>-0.034787</td><td>0.011131</td><td>-0.023309</td><td>-0.048286</td></tr>\n",
"    <tr><th>std</th><td>1.022245</td><td>1.067774</td><td>1.025773</td><td>0.971934</td></tr>\n",
"    <tr><th>min</th><td>-2.998919</td><td>-2.939683</td><td>-3.000000</td><td>-3.000000</td></tr>\n",
"    <tr><th>25%</th><td>-0.735324</td><td>-0.739318</td><td>-0.690162</td><td>-0.699223</td></tr>\n",
"    <tr><th>50%</th><td>-0.020213</td><td>0.009185</td><td>-0.041272</td><td>-0.046438</td></tr>\n",
"    <tr><th>75%</th><td>0.661472</td><td>0.728629</td><td>0.675814</td><td>0.588834</td></tr>\n",
"    <tr><th>max</th><td>3.000000</td><td>3.000000</td><td>3.000000</td><td>3.000000</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ "                 0            1            2            3\n", "count  1000.000000  1000.000000  1000.000000  1000.000000\n", "mean     -0.034787     0.011131    -0.023309    -0.048286\n", "std       1.022245     1.067774     1.025773     0.971934\n", "min      -2.998919    -2.939683    -3.000000    -3.000000\n", "25%      -0.735324    -0.739318    -0.690162    -0.699223\n", "50%      -0.020213     0.009185    -0.041272    -0.046438\n", "75%       0.661472     0.728629     0.675814     0.588834\n", "max       3.000000     3.000000     3.000000     3.000000" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Replace every value outside [-3, 3] with -3 or 3, keeping its sign\n", "df[df.abs() > 3] = np.sign(df) * 3\n", "\n", "df.describe()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.13 Kernel", "language": "python", "name": "python313" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.0" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }