Arithmetic¶
An important function of pandas is the arithmetic behaviour for objects with different indices. When adding objects, if the index pairs are not equal, the corresponding index in the result will be the union of the index pairs. For users with database experience, this is comparable to an automatic outer join on the index labels. Let’s look at an example:
[1]:
import numpy as np
import pandas as pd
rng = np.random.default_rng()
s1 = pd.Series(rng.normal(size=5))
s2 = pd.Series(rng.normal(size=7))
If you add these values, you get:
[2]:
s1 + s2
[2]:
0 -0.262674
1 -1.419442
2 -0.567750
3 -0.094165
4 0.129198
5 NaN
6 NaN
dtype: float64
The internal data matching leads to missing values at the points of the labels that do not overlap. Missing values are then passed on in further arithmetic calculations.
For DataFrames, alignment is performed for both rows and columns:
[3]:
df1 = pd.DataFrame(rng.normal(size=(5,3)))
df2 = pd.DataFrame(rng.normal(size=(7,2)))
When the two DataFrames are added together, the result is a DataFrame whose index and columns are the unions of those in each of the DataFrames above:
[4]:
df1 + df2
[4]:
0 | 1 | 2 | |
---|---|---|---|
0 | -0.301122 | 2.131597 | NaN |
1 | 0.532449 | 1.944502 | NaN |
2 | -1.639601 | -1.208516 | NaN |
3 | 2.391313 | 0.068572 | NaN |
4 | -0.048483 | -0.308281 | NaN |
5 | NaN | NaN | NaN |
6 | NaN | NaN | NaN |
Since column 2 does not appear in both DataFrame objects, its values appear as missing in the result. The same applies to the rows whose labels do not appear in both objects.
Arithmetic methods with fill values¶
In arithmetic operations between differently indexed objects, a special value (e.g. 0
) can be useful if an axis label is found in one object but not in the other. The add
method can pass the fill_value
argument:
[5]:
df12 = df1.add(df2, fill_value=0)
df12
[5]:
0 | 1 | 2 | |
---|---|---|---|
0 | -0.301122 | 2.131597 | 1.234249 |
1 | 0.532449 | 1.944502 | 2.292989 |
2 | -1.639601 | -1.208516 | -0.081290 |
3 | 2.391313 | 0.068572 | 1.391286 |
4 | -0.048483 | -0.308281 | 1.368950 |
5 | -1.811057 | 0.069581 | NaN |
6 | -0.139361 | -1.018290 | NaN |
In the following example, we set the two remaining NaN values to 0
:
[6]:
df12.iloc[[5, 6], [2]] = 0
[7]:
df12
[7]:
0 | 1 | 2 | |
---|---|---|---|
0 | -0.301122 | 2.131597 | 1.234249 |
1 | 0.532449 | 1.944502 | 2.292989 |
2 | -1.639601 | -1.208516 | -0.081290 |
3 | 2.391313 | 0.068572 | 1.391286 |
4 | -0.048483 | -0.308281 | 1.368950 |
5 | -1.811057 | 0.069581 | 0.000000 |
6 | -0.139361 | -1.018290 | 0.000000 |
Arithmetic methods¶
Method |
Description |
---|---|
|
methods for addition ( |
|
methods for subtraction ( |
|
methods for division ( |
|
methods for floor division ( |
|
methods for multiplication ( |
|
methods for exponentiation ( |
r
(English: reverse) reverses the method.
Operations between DataFrame and Series¶
As with NumPy arrays of different dimensions, the arithmetic between DataFrame and Series is also defined.
[8]:
s1 + df12
[8]:
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | -0.837040 | 1.577936 | 1.491790 | NaN | NaN |
1 | -0.003468 | 1.390841 | 2.550530 | NaN | NaN |
2 | -2.175518 | -1.762177 | 0.176251 | NaN | NaN |
3 | 1.855396 | -0.485089 | 1.648827 | NaN | NaN |
4 | -0.584401 | -0.861943 | 1.626491 | NaN | NaN |
5 | -2.346974 | -0.484080 | 0.257541 | NaN | NaN |
6 | -0.675279 | -1.571951 | 0.257541 | NaN | NaN |
If we add s1
with df12
, the addition is done once for each line. This is called broadcasting. By default, the arithmetic between the DataFrame and the series corresponds to the index of the series in the columns of the DataFrame, with the rows being broadcast down.
If an index value is found neither in the columns of the DataFrame nor in the index of the series, the objects are re-indexed to form the union:
If instead you want to transfer the columns and match the rows, you must use one of the arithmetic methods, for example:
[9]:
df12.add(s2, axis="index")
[9]:
0 | 1 | 2 | |
---|---|---|---|
0 | -0.027879 | 2.404841 | 1.507493 |
1 | -0.333331 | 1.078721 | 1.427209 |
2 | -2.464891 | -2.033807 | -0.906581 |
3 | 4.032336 | 1.709595 | 3.032309 |
4 | 0.727539 | 0.467741 | 2.144972 |
5 | -1.512540 | 0.368097 | 0.298516 |
6 | 0.637496 | -0.241433 | 0.776857 |
The axis number you pass is the axis to be aligned to. In this case, the row index of the DataFrame (axis='index'
or axis=0
) is to be adjusted and transmitted.
Function application and mapping¶
numpy.ufunc
(element-wise array methods) also work with pandas objects:
[10]:
np.abs(df12)
[10]:
0 | 1 | 2 | |
---|---|---|---|
0 | 0.301122 | 2.131597 | 1.234249 |
1 | 0.532449 | 1.944502 | 2.292989 |
2 | 1.639601 | 1.208516 | 0.081290 |
3 | 2.391313 | 0.068572 | 1.391286 |
4 | 0.048483 | 0.308281 | 1.368950 |
5 | 1.811057 | 0.069581 | 0.000000 |
6 | 0.139361 | 1.018290 | 0.000000 |
Another common operation is to apply a function to one-dimensional arrays on each column or row. The pandas.DataFrame.apply method does just that:
[11]:
df12
[11]:
0 | 1 | 2 | |
---|---|---|---|
0 | -0.301122 | 2.131597 | 1.234249 |
1 | 0.532449 | 1.944502 | 2.292989 |
2 | -1.639601 | -1.208516 | -0.081290 |
3 | 2.391313 | 0.068572 | 1.391286 |
4 | -0.048483 | -0.308281 | 1.368950 |
5 | -1.811057 | 0.069581 | 0.000000 |
6 | -0.139361 | -1.018290 | 0.000000 |
[12]:
f = lambda x: x.max() - x.min()
df12.apply(f)
[12]:
0 4.202370
1 3.340113
2 2.374279
dtype: float64
Here the function f
, which calculates the difference between the maximum and minimum of a row, is called once for each column of the frame. The result is a row with the columns of the frame as index.
If you pass axis='columns'
to apply
, the function will be called once per line instead:
[13]:
df12.apply(f, axis="columns")
[13]:
0 2.432720
1 1.760540
2 1.558311
3 2.322741
4 1.677231
5 1.880637
6 1.018290
dtype: float64
Many of the most common array statistics (such as sum
and mean
) are DataFrame methods, so the use of apply
is not necessary.
The function passed to apply does not have to return a single value; it can also return a series with multiple values:
[14]:
def f(x):
return pd.Series([x.min(), x.max()], index=["min", "max"])
df12.apply(f)
[14]:
0 | 1 | 2 | |
---|---|---|---|
min | -1.811057 | -1.208516 | -0.081290 |
max | 2.391313 | 2.131597 | 2.292989 |
You can also use element-wise Python functions. Suppose you want to round each floating point value in df12
to two decimal places, you can do this with pandas.DataFrame.map:
[15]:
f = lambda x: round(x, 2)
df12.map(f)
[15]:
0 | 1 | 2 | |
---|---|---|---|
0 | -0.30 | 2.13 | 1.23 |
1 | 0.53 | 1.94 | 2.29 |
2 | -1.64 | -1.21 | -0.08 |
3 | 2.39 | 0.07 | 1.39 |
4 | -0.05 | -0.31 | 1.37 |
5 | -1.81 | 0.07 | 0.00 |
6 | -0.14 | -1.02 | 0.00 |
The map
method can also be applied to Series:
[16]:
df12[2].map(f)
[16]:
0 1.23
1 2.29
2 -0.08
3 1.39
4 1.37
5 0.00
6 0.00
Name: 2, dtype: float64