{ "cells": [ { "cell_type": "markdown", "id": "5b70dbe9", "metadata": {}, "source": [ "# Arithmetic\n", "\n", "An important function of pandas is the arithmetic behaviour for objects with different indices. When adding objects, if the index pairs are not equal, the corresponding index in the result will be the union of the index pairs. For users with database experience, this is comparable to an automatic [outer join](https://en.wikipedia.org/wiki/Join_(SQL)#Outer_join) on the index labels. Let’s look at an example:" ] }, { "cell_type": "code", "execution_count": 1, "id": "93b775f5", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "\n", "rng = np.random.default_rng()\n", "s1 = pd.Series(rng.normal(size=5))\n", "s2 = pd.Series(rng.normal(size=7))" ] }, { "cell_type": "markdown", "id": "b05a13ac", "metadata": {}, "source": [ "If you add these values, you get:" ] }, { "cell_type": "code", "execution_count": 2, "id": "81c60161", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 -0.262674\n", "1 -1.419442\n", "2 -0.567750\n", "3 -0.094165\n", "4 0.129198\n", "5 NaN\n", "6 NaN\n", "dtype: float64" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1 + s2" ] }, { "cell_type": "markdown", "id": "d0b380ee", "metadata": {}, "source": [ "The internal data matching leads to missing values at the points of the labels that do not overlap. Missing values are then passed on in further arithmetic calculations." ] }, { "cell_type": "markdown", "id": "bd4ef5fa", "metadata": {}, "source": [ "For DataFrames, alignment is performed for both rows and columns:" ] }, { "cell_type": "code", "execution_count": 3, "id": "e16d0e94", "metadata": {}, "outputs": [], "source": [ "df1 = pd.DataFrame(rng.normal(size=(5,3)))\n", "df2 = pd.DataFrame(rng.normal(size=(7,2)))" ] }, { "cell_type": "markdown", "id": "13ce8463", "metadata": {}, "source": [ "When the two DataFrames are added together, the result is a DataFrame whose index and columns are the unions of those in each of the DataFrames above:" ] }, { "cell_type": "code", "execution_count": 4, "id": "932c7e62", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-0.3011222.131597NaN
10.5324491.944502NaN
2-1.639601-1.208516NaN
32.3913130.068572NaN
4-0.048483-0.308281NaN
5NaNNaNNaN
6NaNNaNNaN
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -0.301122 2.131597 NaN\n", "1 0.532449 1.944502 NaN\n", "2 -1.639601 -1.208516 NaN\n", "3 2.391313 0.068572 NaN\n", "4 -0.048483 -0.308281 NaN\n", "5 NaN NaN NaN\n", "6 NaN NaN NaN" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 + df2" ] }, { "cell_type": "markdown", "id": "5974e184", "metadata": {}, "source": [ "Since column 2 does not appear in both DataFrame objects, its values appear as missing in the result. The same applies to the rows whose labels do not appear in both objects." ] }, { "cell_type": "markdown", "id": "1cd9fe13", "metadata": {}, "source": [ "## Arithmetic methods with fill values\n", "\n", "In arithmetic operations between differently indexed objects, a special value (e.g. `0`) can be useful if an axis label is found in one object but not in the other. The `add` method can pass the `fill_value` argument:" ] }, { "cell_type": "code", "execution_count": 5, "id": "35f89dfd", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-0.3011222.1315971.234249
10.5324491.9445022.292989
2-1.639601-1.208516-0.081290
32.3913130.0685721.391286
4-0.048483-0.3082811.368950
5-1.8110570.069581NaN
6-0.139361-1.018290NaN
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -0.301122 2.131597 1.234249\n", "1 0.532449 1.944502 2.292989\n", "2 -1.639601 -1.208516 -0.081290\n", "3 2.391313 0.068572 1.391286\n", "4 -0.048483 -0.308281 1.368950\n", "5 -1.811057 0.069581 NaN\n", "6 -0.139361 -1.018290 NaN" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df12 = df1.add(df2, fill_value=0)\n", "\n", "df12" ] }, { "cell_type": "markdown", "id": "686dc151", "metadata": {}, "source": [ "In the following example, we set the two remaining NaN values to `0`:" ] }, { "cell_type": "code", "execution_count": 6, "id": "7dd4bbe1", "metadata": {}, "outputs": [], "source": [ "df12.iloc[[5, 6], [2]] = 0" ] }, { "cell_type": "code", "execution_count": 7, "id": "a7f447ae", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-0.3011222.1315971.234249
10.5324491.9445022.292989
2-1.639601-1.208516-0.081290
32.3913130.0685721.391286
4-0.048483-0.3082811.368950
5-1.8110570.0695810.000000
6-0.139361-1.0182900.000000
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -0.301122 2.131597 1.234249\n", "1 0.532449 1.944502 2.292989\n", "2 -1.639601 -1.208516 -0.081290\n", "3 2.391313 0.068572 1.391286\n", "4 -0.048483 -0.308281 1.368950\n", "5 -1.811057 0.069581 0.000000\n", "6 -0.139361 -1.018290 0.000000" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df12" ] }, { "cell_type": "markdown", "id": "1dbbcce0", "metadata": {}, "source": [ "## Arithmetic methods\n", "\n", "Method | Description\n", ":----- | :----------\n", "`add`, `radd` | methods for addition (`+`)\n", "`sub`, `rsub` | methods for subtraction (`-`)\n", "`div`, `rdiv` | methods for division (`/`)\n", "`floordiv`, `rfloordiv` | methods for floor division (`//`)\n", "`mul`, `rmul` | methods for multiplication (`*`)\n", "`pow`, `rpow` | methods for exponentiation (`**`)\n", "\n", "`r` (English: _reverse_) reverses the method." ] }, { "cell_type": "markdown", "id": "2e634696", "metadata": {}, "source": [ "## Operations between DataFrame and Series\n", "\n", "As with NumPy arrays of different dimensions, the arithmetic between DataFrame and Series is also defined." ] }, { "cell_type": "code", "execution_count": 8, "id": "7920f4b7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
01234
0-0.8370401.5779361.491790NaNNaN
1-0.0034681.3908412.550530NaNNaN
2-2.175518-1.7621770.176251NaNNaN
31.855396-0.4850891.648827NaNNaN
4-0.584401-0.8619431.626491NaNNaN
5-2.346974-0.4840800.257541NaNNaN
6-0.675279-1.5719510.257541NaNNaN
\n", "
" ], "text/plain": [ " 0 1 2 3 4\n", "0 -0.837040 1.577936 1.491790 NaN NaN\n", "1 -0.003468 1.390841 2.550530 NaN NaN\n", "2 -2.175518 -1.762177 0.176251 NaN NaN\n", "3 1.855396 -0.485089 1.648827 NaN NaN\n", "4 -0.584401 -0.861943 1.626491 NaN NaN\n", "5 -2.346974 -0.484080 0.257541 NaN NaN\n", "6 -0.675279 -1.571951 0.257541 NaN NaN" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1 + df12" ] }, { "cell_type": "markdown", "id": "36005166", "metadata": {}, "source": [ "If we add `s1` with `df12`, the addition is done once for each line. This is called _broadcasting_. By default, the arithmetic between the DataFrame and the series corresponds to the index of the series in the columns of the DataFrame, with the rows being broadcast down." ] }, { "cell_type": "markdown", "id": "4d57486f", "metadata": {}, "source": [ "If an index value is found neither in the columns of the DataFrame nor in the index of the series, the objects are re-indexed to form the union:" ] }, { "cell_type": "markdown", "id": "0f0f79eb", "metadata": {}, "source": [ "If instead you want to transfer the columns and match the rows, you must use one of the arithmetic methods, for example:" ] }, { "cell_type": "code", "execution_count": 9, "id": "cebe5864", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-0.0278792.4048411.507493
1-0.3333311.0787211.427209
2-2.464891-2.033807-0.906581
34.0323361.7095953.032309
40.7275390.4677412.144972
5-1.5125400.3680970.298516
60.637496-0.2414330.776857
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -0.027879 2.404841 1.507493\n", "1 -0.333331 1.078721 1.427209\n", "2 -2.464891 -2.033807 -0.906581\n", "3 4.032336 1.709595 3.032309\n", "4 0.727539 0.467741 2.144972\n", "5 -1.512540 0.368097 0.298516\n", "6 0.637496 -0.241433 0.776857" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df12.add(s2, axis=\"index\")" ] }, { "cell_type": "markdown", "id": "21b63ef1", "metadata": {}, "source": [ "The axis number you pass is the axis to be aligned to. In this case, the row index of the DataFrame (`axis='index'` or `axis=0`) is to be adjusted and transmitted." ] }, { "cell_type": "markdown", "id": "9ffae4f8", "metadata": {}, "source": [ "## Function application and mapping\n", "\n", "`numpy.ufunc` (element-wise array methods) also work with pandas objects:" ] }, { "cell_type": "code", "execution_count": 10, "id": "0796f179", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
00.3011222.1315971.234249
10.5324491.9445022.292989
21.6396011.2085160.081290
32.3913130.0685721.391286
40.0484830.3082811.368950
51.8110570.0695810.000000
60.1393611.0182900.000000
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 0.301122 2.131597 1.234249\n", "1 0.532449 1.944502 2.292989\n", "2 1.639601 1.208516 0.081290\n", "3 2.391313 0.068572 1.391286\n", "4 0.048483 0.308281 1.368950\n", "5 1.811057 0.069581 0.000000\n", "6 0.139361 1.018290 0.000000" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.abs(df12)" ] }, { "cell_type": "markdown", "id": "e00dcebb", "metadata": {}, "source": [ "Another common operation is to apply a function to one-dimensional arrays on each column or row. The [pandas.DataFrame.apply](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html) method does just that:" ] }, { "cell_type": "code", "execution_count": 11, "id": "97d094c1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-0.3011222.1315971.234249
10.5324491.9445022.292989
2-1.639601-1.208516-0.081290
32.3913130.0685721.391286
4-0.048483-0.3082811.368950
5-1.8110570.0695810.000000
6-0.139361-1.0182900.000000
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -0.301122 2.131597 1.234249\n", "1 0.532449 1.944502 2.292989\n", "2 -1.639601 -1.208516 -0.081290\n", "3 2.391313 0.068572 1.391286\n", "4 -0.048483 -0.308281 1.368950\n", "5 -1.811057 0.069581 0.000000\n", "6 -0.139361 -1.018290 0.000000" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df12" ] }, { "cell_type": "code", "execution_count": 12, "id": "0f11be9c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 4.202370\n", "1 3.340113\n", "2 2.374279\n", "dtype: float64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f = lambda x: x.max() - x.min()\n", "\n", "df12.apply(f)" ] }, { "cell_type": "markdown", "id": "6421662b", "metadata": {}, "source": [ "Here the function `f`, which calculates the difference between the maximum and minimum of a row, is called once for each column of the frame. The result is a row with the columns of the frame as index." ] }, { "cell_type": "markdown", "id": "ecd6ad81", "metadata": {}, "source": [ "If you pass `axis='columns'` to `apply`, the function will be called once per line instead:" ] }, { "cell_type": "code", "execution_count": 13, "id": "b54c6f89", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 2.432720\n", "1 1.760540\n", "2 1.558311\n", "3 2.322741\n", "4 1.677231\n", "5 1.880637\n", "6 1.018290\n", "dtype: float64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df12.apply(f, axis=\"columns\")" ] }, { "cell_type": "markdown", "id": "677910f9", "metadata": {}, "source": [ "Many of the most common array statistics (such as `sum` and `mean`) are DataFrame methods, so the use of `apply` is not necessary." ] }, { "cell_type": "markdown", "id": "ff49ea56", "metadata": {}, "source": [ "The function passed to apply does not have to return a single value; it can also return a series with multiple values:" ] }, { "cell_type": "code", "execution_count": 14, "id": "4abc03d5", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
min-1.811057-1.208516-0.081290
max2.3913132.1315972.292989
\n", "
" ], "text/plain": [ " 0 1 2\n", "min -1.811057 -1.208516 -0.081290\n", "max 2.391313 2.131597 2.292989" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def f(x):\n", " return pd.Series([x.min(), x.max()], index=[\"min\", \"max\"])\n", "\n", "df12.apply(f)" ] }, { "cell_type": "markdown", "id": "06dc9053", "metadata": {}, "source": [ "You can also use element-wise Python functions. Suppose you want to round each floating point value in `df12` to two decimal places, you can do this with [pandas.DataFrame.map](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.map.html):" ] }, { "cell_type": "code", "execution_count": 15, "id": "3b470cc7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0-0.302.131.23
10.531.942.29
2-1.64-1.21-0.08
32.390.071.39
4-0.05-0.311.37
5-1.810.070.00
6-0.14-1.020.00
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 -0.30 2.13 1.23\n", "1 0.53 1.94 2.29\n", "2 -1.64 -1.21 -0.08\n", "3 2.39 0.07 1.39\n", "4 -0.05 -0.31 1.37\n", "5 -1.81 0.07 0.00\n", "6 -0.14 -1.02 0.00" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f = lambda x: round(x, 2)\n", "\n", "df12.map(f)" ] }, { "cell_type": "markdown", "id": "787b95cd", "metadata": {}, "source": [ "The `map` method can also be applied to Series:" ] }, { "cell_type": "code", "execution_count": 16, "id": "3aaaeef4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1.23\n", "1 2.29\n", "2 -0.08\n", "3 1.39\n", "4 1.37\n", "5 0.00\n", "6 0.00\n", "Name: 2, dtype: float64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df12[2].map(f)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.13 Kernel", "language": "python", "name": "python313" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.0" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }