Combining and merging data sets#
Data contained in pandas objects can be combined in several ways:
pandas.merge joins rows in DataFrames based on one or more keys. This function is familiar from SQL or other relational databases, as it implements database join operations.
pandas.concat concatenates or stacks objects along an axis.
The instance methods pandas.DataFrame.combine_first or pandas.Series.combine_first allow overlapping data to be joined.
With pandas.merge_asof you can perform time series based window joins between DataFrame objects.
Database-like DataFrame joins#
Merge or join operations combine data sets by linking rows with one or more keys. These operations are especially important in relational, SQL-based databases. The merge function in pandas is the main entry point for applying these algorithms to your data.
[1]:
import pandas as pd
[2]:
encoding = pd.DataFrame(
{
"Unicode": [
"U+0000", "U+0 001", "U+0002", "U+0003", "U+0004", "U+0005",
],
"Decimal": [0, 1, 2, 3, 4, 5],
"Octal": ["000", "001", "002", "003", "004", "005"],
"Key": ["NUL", "Ctrl-A", "Ctrl-B", "Ctrl-C", "Ctrl-D", "Ctrl-E"],
}
)
update = pd.DataFrame(
{
"Unicode": [
"U+0003", "U+0004", "U+0005", "U+0006", "U+0007", "U+0008", "U+0009",
],
"Decimal": [3, 4, 5, 6, 7, 8, 9],
"Octal": ["003", "004", "005", "006", "007", "008", "009"],
"Key": [
"Ctrl-C", "Ctrl-D", "Ctrl-E", "Ctrl-F", "Ctrl-G", "Ctrl-H", "Ctrl-I",
],
}
)
encoding, update
[2]:
( Unicode Decimal Octal Key
0 U+0000 0 000 NUL
1 U+0001 1 001 Ctrl-A
2 U+0002 2 002 Ctrl-B
3 U+0003 3 003 Ctrl-C
4 U+0004 4 004 Ctrl-D
5 U+0005 5 005 Ctrl-E,
Unicode Decimal Octal Key
0 U+0003 3 003 Ctrl-C
1 U+0004 4 004 Ctrl-D
2 U+0005 5 005 Ctrl-E
3 U+0006 6 006 Ctrl-F
4 U+0007 7 007 Ctrl-G
5 U+0008 8 008 Ctrl-H
6 U+0009 9 009 Ctrl-I)
When we call merge
with these objects, we get:
[3]:
pd.merge(encoding, update)
[3]:
Unicode | Decimal | Octal | Key | |
---|---|---|---|---|
0 | U+0003 | 3 | 003 | Ctrl-C |
1 | U+0004 | 4 | 004 | Ctrl-D |
2 | U+0005 | 5 | 005 | Ctrl-E |
By default, merge
performs a so-called inner join; the keys in the result are the intersection or common set in both tables.
Note:
I did not specify which column to merge over. If this information is not specified, merge will use the overlapping column names as keys. However, it is good practice to specify this explicitly:
[4]:
pd.merge(encoding, update, on="Unicode")
[4]:
Unicode | Decimal_x | Octal_x | Key_x | Decimal_y | Octal_y | Key_y | |
---|---|---|---|---|---|---|---|
0 | U+0003 | 3 | 003 | Ctrl-C | 3 | 003 | Ctrl-C |
1 | U+0004 | 4 | 004 | Ctrl-D | 4 | 004 | Ctrl-D |
2 | U+0005 | 5 | 005 | Ctrl-E | 5 | 005 | Ctrl-E |
If the column names are different in each object, you can specify them separately. In the following example update2
gets the key U+
and not Unicode
:
[5]:
update2 = pd.DataFrame(
{
"U+": [
"U+0003", "U+0004", "U+0005", "U+0006", "U+0007", "U+0008", "U+0009",
],
"Decimal": [3, 4, 5, 6, 7, 8, 9],
"Octal": ["003", "004", "005", "006", "007", "008", "009"],
"Key": [
"Ctrl-C", "Ctrl-D", "Ctrl-E", "Ctrl-F", "Ctrl-G", "Ctrl-H", "Ctrl-I",
],
}
)
pd.merge(encoding, update2, left_on="Unicode", right_on="U+")
[5]:
Unicode | Decimal_x | Octal_x | Key_x | U+ | Decimal_y | Octal_y | Key_y | |
---|---|---|---|---|---|---|---|---|
0 | U+0003 | 3 | 003 | Ctrl-C | U+0003 | 3 | 003 | Ctrl-C |
1 | U+0004 | 4 | 004 | Ctrl-D | U+0004 | 4 | 004 | Ctrl-D |
2 | U+0005 | 5 | 005 | Ctrl-E | U+0005 | 5 | 005 | Ctrl-E |
However, you can use merge
not only to perform an inner join, with which the keys in the result are the intersection or common set in both tables. Other possible options are:
Option |
Behaviour |
---|---|
|
uses only the key combinations observed in both tables |
|
uses all key combinations found in the left table |
|
uses all key combinations found in the right table |
|
uses all key combinations observed in both tables together |
[5]:
pd.merge(encoding, update, on="Unicode", how="left")
[5]:
Unicode | Decimal_x | Octal_x | Key_x | Decimal_y | Octal_y | Key_y | |
---|---|---|---|---|---|---|---|
0 | U+0000 | 0 | 000 | NUL | NaN | NaN | NaN |
1 | U+0001 | 1 | 001 | Ctrl-A | NaN | NaN | NaN |
2 | U+0002 | 2 | 002 | Ctrl-B | NaN | NaN | NaN |
3 | U+0003 | 3 | 003 | Ctrl-C | 3.0 | 003 | Ctrl-C |
4 | U+0004 | 4 | 004 | Ctrl-D | 4.0 | 004 | Ctrl-D |
5 | U+0005 | 5 | 005 | Ctrl-E | 5.0 | 005 | Ctrl-E |
[7]:
pd.merge(encoding, update, on="Unicode", how="outer")
[7]:
Unicode | Decimal_x | Octal_x | Key_x | Decimal_y | Octal_y | Key_y | |
---|---|---|---|---|---|---|---|
0 | U+0000 | 0.0 | 000 | NUL | NaN | NaN | NaN |
1 | U+0001 | 1.0 | 001 | Ctrl-A | NaN | NaN | NaN |
2 | U+0002 | 2.0 | 002 | Ctrl-B | NaN | NaN | NaN |
3 | U+0003 | 3.0 | 003 | Ctrl-C | 3.0 | 003 | Ctrl-C |
4 | U+0004 | 4.0 | 004 | Ctrl-D | 4.0 | 004 | Ctrl-D |
5 | U+0005 | 5.0 | 005 | Ctrl-E | 5.0 | 005 | Ctrl-E |
6 | U+0006 | NaN | NaN | NaN | 6.0 | 006 | Ctrl-F |
7 | U+0007 | NaN | NaN | NaN | 7.0 | 007 | Ctrl-G |
8 | U+0008 | NaN | NaN | NaN | 8.0 | 008 | Ctrl-H |
9 | U+0009 | NaN | NaN | NaN | 9.0 | 009 | Ctrl-I |
The join method only affects the unique key values that appear in the result.
To join multiple keys, you can pass a list of column names:
[6]:
pd.merge(encoding, update, on=["Unicode", "Decimal", "Octal", "Key"], how="outer")
[6]:
Unicode | Decimal | Octal | Key | |
---|---|---|---|---|
0 | U+0000 | 0 | 000 | NUL |
1 | U+0001 | 1 | 001 | Ctrl-A |
2 | U+0002 | 2 | 002 | Ctrl-B |
3 | U+0003 | 3 | 003 | Ctrl-C |
4 | U+0004 | 4 | 004 | Ctrl-D |
5 | U+0005 | 5 | 005 | Ctrl-E |
6 | U+0006 | 6 | 006 | Ctrl-F |
7 | U+0007 | 7 | 007 | Ctrl-G |
8 | U+0008 | 8 | 008 | Ctrl-H |
9 | U+0009 | 9 | 009 | Ctrl-I |