Combining and merging data sets#

Data contained in pandas objects can be combined in several ways:

Database-like DataFrame joins#

Merge or join operations combine data sets by linking rows with one or more keys. These operations are especially important in relational, SQL-based databases. The merge function in pandas is the main entry point for applying these algorithms to your data.

[1]:
import pandas as pd
[2]:
encoding = pd.DataFrame(
    {
        "Unicode": [
            "U+0000", "U+0 001", "U+0002", "U+0003", "U+0004", "U+0005",
        ],
        "Decimal": [0, 1, 2, 3, 4, 5],
        "Octal": ["000", "001", "002", "003", "004", "005"],
        "Key": ["NUL", "Ctrl-A", "Ctrl-B", "Ctrl-C", "Ctrl-D", "Ctrl-E"],
    }
)

update = pd.DataFrame(
    {
        "Unicode": [
            "U+0003", "U+0004", "U+0005", "U+0006", "U+0007", "U+0008", "U+0009",
        ],
        "Decimal": [3, 4, 5, 6, 7, 8, 9],
        "Octal": ["003", "004", "005", "006", "007", "008", "009"],
        "Key": [
            "Ctrl-C", "Ctrl-D", "Ctrl-E", "Ctrl-F", "Ctrl-G", "Ctrl-H", "Ctrl-I",
        ],
    }
)

encoding, update
[2]:
(  Unicode  Decimal Octal     Key
 0  U+0000        0   000     NUL
 1  U+0001        1   001  Ctrl-A
 2  U+0002        2   002  Ctrl-B
 3  U+0003        3   003  Ctrl-C
 4  U+0004        4   004  Ctrl-D
 5  U+0005        5   005  Ctrl-E,
   Unicode  Decimal Octal     Key
 0  U+0003        3   003  Ctrl-C
 1  U+0004        4   004  Ctrl-D
 2  U+0005        5   005  Ctrl-E
 3  U+0006        6   006  Ctrl-F
 4  U+0007        7   007  Ctrl-G
 5  U+0008        8   008  Ctrl-H
 6  U+0009        9   009  Ctrl-I)

When we call merge with these objects, we get:

[3]:
pd.merge(encoding, update)
[3]:
Unicode Decimal Octal Key
0 U+0003 3 003 Ctrl-C
1 U+0004 4 004 Ctrl-D
2 U+0005 5 005 Ctrl-E

By default, merge performs a so-called inner join; the keys in the result are the intersection or common set in both tables.

Note:

I did not specify which column to merge over. If this information is not specified, merge will use the overlapping column names as keys. However, it is good practice to specify this explicitly:

[4]:
pd.merge(encoding, update, on="Unicode")
[4]:
Unicode Decimal_x Octal_x Key_x Decimal_y Octal_y Key_y
0 U+0003 3 003 Ctrl-C 3 003 Ctrl-C
1 U+0004 4 004 Ctrl-D 4 004 Ctrl-D
2 U+0005 5 005 Ctrl-E 5 005 Ctrl-E

If the column names are different in each object, you can specify them separately. In the following example update2 gets the key U+ and not Unicode:

[5]:
update2 = pd.DataFrame(
    {
        "U+": [
            "U+0003", "U+0004", "U+0005", "U+0006", "U+0007", "U+0008", "U+0009",
        ],
        "Decimal": [3, 4, 5, 6, 7, 8, 9],
        "Octal": ["003", "004", "005", "006", "007", "008", "009"],
        "Key": [
            "Ctrl-C", "Ctrl-D", "Ctrl-E", "Ctrl-F", "Ctrl-G", "Ctrl-H", "Ctrl-I",
        ],
    }
)

pd.merge(encoding, update2, left_on="Unicode", right_on="U+")
[5]:
Unicode Decimal_x Octal_x Key_x U+ Decimal_y Octal_y Key_y
0 U+0003 3 003 Ctrl-C U+0003 3 003 Ctrl-C
1 U+0004 4 004 Ctrl-D U+0004 4 004 Ctrl-D
2 U+0005 5 005 Ctrl-E U+0005 5 005 Ctrl-E

However, you can use merge not only to perform an inner join, with which the keys in the result are the intersection or common set in both tables. Other possible options are:

Option

Behaviour

how='inner'

uses only the key combinations observed in both tables

how='left'

uses all key combinations found in the left table

how='right'

uses all key combinations found in the right table

how='outer'

uses all key combinations observed in both tables together

[5]:
pd.merge(encoding, update, on="Unicode", how="left")
[5]:
Unicode Decimal_x Octal_x Key_x Decimal_y Octal_y Key_y
0 U+0000 0 000 NUL NaN NaN NaN
1 U+0001 1 001 Ctrl-A NaN NaN NaN
2 U+0002 2 002 Ctrl-B NaN NaN NaN
3 U+0003 3 003 Ctrl-C 3.0 003 Ctrl-C
4 U+0004 4 004 Ctrl-D 4.0 004 Ctrl-D
5 U+0005 5 005 Ctrl-E 5.0 005 Ctrl-E
[7]:
pd.merge(encoding, update, on="Unicode", how="outer")
[7]:
Unicode Decimal_x Octal_x Key_x Decimal_y Octal_y Key_y
0 U+0000 0.0 000 NUL NaN NaN NaN
1 U+0001 1.0 001 Ctrl-A NaN NaN NaN
2 U+0002 2.0 002 Ctrl-B NaN NaN NaN
3 U+0003 3.0 003 Ctrl-C 3.0 003 Ctrl-C
4 U+0004 4.0 004 Ctrl-D 4.0 004 Ctrl-D
5 U+0005 5.0 005 Ctrl-E 5.0 005 Ctrl-E
6 U+0006 NaN NaN NaN 6.0 006 Ctrl-F
7 U+0007 NaN NaN NaN 7.0 007 Ctrl-G
8 U+0008 NaN NaN NaN 8.0 008 Ctrl-H
9 U+0009 NaN NaN NaN 9.0 009 Ctrl-I

The join method only affects the unique key values that appear in the result.

To join multiple keys, you can pass a list of column names:

[6]:
pd.merge(encoding, update, on=["Unicode", "Decimal", "Octal", "Key"], how="outer")
[6]:
Unicode Decimal Octal Key
0 U+0000 0 000 NUL
1 U+0001 1 001 Ctrl-A
2 U+0002 2 002 Ctrl-B
3 U+0003 3 003 Ctrl-C
4 U+0004 4 004 Ctrl-D
5 U+0005 5 005 Ctrl-E
6 U+0006 6 006 Ctrl-F
7 U+0007 7 007 Ctrl-G
8 U+0008 8 008 Ctrl-H
9 U+0009 9 009 Ctrl-I