Combining and merging data sets¶

Data contained in pandas objects can be combined in several ways:

pandas.merge joins rows in DataFrames based on one or more keys. This function is familiar from SQL or other relational databases, as it implements database join operations.
pandas.concat concatenates or stacks objects along an axis.
The instance methods pandas.DataFrame.combine_first or pandas.Series.combine_first allow overlapping data to be joined.
With pandas.merge_asof you can perform a merge by key distance.

Database-like DataFrame joins¶

Merge or join operations combine data sets by linking rows with one or more keys. These operations are especially important in relational, SQL-based databases. The merge function in pandas is the main entry point for applying these algorithms to your data.

[1]:

import pandas as pd

[2]:

encoding = pd.DataFrame(
    {
        "Unicode": [
            "U+0000",
            "U+0001",
            "U+0002",
            "U+0003",
            "U+0004",
            "U+0005",
        ],
        "Decimal": [0, 1, 2, 3, 4, 5],
        "Octal": ["000", "001", "002", "003", "004", "005"],
        "Key": ["NUL", "Ctrl-A", "Ctrl-B", "Ctrl-C", "Ctrl-D", "Ctrl-E"],
    },
)

update = pd.DataFrame(
    {
        "Unicode": [
            "U+0003",
            "U+0004",
            "U+0005",
            "U+0006",
            "U+0007",
            "U+0008",
            "U+0009",
        ],
        "Decimal": [3, 4, 5, 6, 7, 8, 9],
        "Octal": ["003", "004", "005", "006", "007", "008", "009"],
        "Key": [
            "Ctrl-C",
            "Ctrl-D",
            "Ctrl-E",
            "Ctrl-F",
            "Ctrl-G",
            "Ctrl-H",
            "Ctrl-I",
        ],
    },
)

[3]:

encoding

[3]:

	Unicode	Decimal	Octal	Key
0	U+0000	0	000	NUL
1	U+0001	1	001	Ctrl-A
2	U+0002	2	002	Ctrl-B
3	U+0003	3	003	Ctrl-C
4	U+0004	4	004	Ctrl-D
5	U+0005	5	005	Ctrl-E

[4]:

update

[4]:

	Unicode	Decimal	Octal	Key
0	U+0003	3	003	Ctrl-C
1	U+0004	4	004	Ctrl-D
2	U+0005	5	005	Ctrl-E
3	U+0006	6	006	Ctrl-F
4	U+0007	7	007	Ctrl-G
5	U+0008	8	008	Ctrl-H
6	U+0009	9	009	Ctrl-I

When we call merge with these objects, we get:

[5]:

encoding.merge(update)

[5]:

	Unicode	Decimal	Octal	Key
0	U+0003	3	003	Ctrl-C
1	U+0004	4	004	Ctrl-D
2	U+0005	5	005	Ctrl-E

By default, merge performs a so-called inner join; the keys in the result are the intersection or common set in both tables.

Note:

I did not specify which column to merge over. If this information is not specified, merge will use the overlapping column names as keys. However, it is good practice to specify this explicitly:

[6]:

encoding.merge(update, on="Unicode")

[6]:

	Unicode	Decimal_x	Octal_x	Key_x	Decimal_y	Octal_y	Key_y
0	U+0003	3	003	Ctrl-C	3	003	Ctrl-C
1	U+0004	4	004	Ctrl-D	4	004	Ctrl-D
2	U+0005	5	005	Ctrl-E	5	005	Ctrl-E

If the column names are different in each object, you can specify them separately. In the following example update2 gets the key U+ and not Unicode:

[7]:

update2 = pd.DataFrame(
    {
        "U+": [
            "U+0003",
            "U+0004",
            "U+0005",
            "U+0006",
            "U+0007",
            "U+0008",
            "U+0009",
        ],
        "Decimal": [3, 4, 5, 6, 7, 8, 9],
        "Octal": ["003", "004", "005", "006", "007", "008", "009"],
        "Key": [
            "Ctrl-C",
            "Ctrl-D",
            "Ctrl-E",
            "Ctrl-F",
            "Ctrl-G",
            "Ctrl-H",
            "Ctrl-I",
        ],
    },
)

encoding.merge(update2, left_on="Unicode", right_on="U+")

[7]:

	Unicode	Decimal_x	Octal_x	Key_x	U+	Decimal_y	Octal_y	Key_y
0	U+0003	3	003	Ctrl-C	U+0003	3	003	Ctrl-C
1	U+0004	4	004	Ctrl-D	U+0004	4	004	Ctrl-D
2	U+0005	5	005	Ctrl-E	U+0005	5	005	Ctrl-E

However, you can use .merge not only to perform an inner join, with which the keys in the result are the intersection or common set in both tables. Other possible options are:

Option	Behaviour
`how='inner'`	uses only the key combinations observed in both tables
`how='left'`	uses all key combinations found in the left table
`how='right'`	uses all key combinations found in the right table
`how='outer'`	uses all key combinations observed in both tables together

[8]:

encoding.merge(update, on="Unicode", how="left")

[8]:

	Unicode	Decimal_x	Octal_x	Key_x	Decimal_y	Octal_y	Key_y
0	U+0000	0	000	NUL	NaN	NaN	NaN
1	U+0001	1	001	Ctrl-A	NaN	NaN	NaN
2	U+0002	2	002	Ctrl-B	NaN	NaN	NaN
3	U+0003	3	003	Ctrl-C	3.0	003	Ctrl-C
4	U+0004	4	004	Ctrl-D	4.0	004	Ctrl-D
5	U+0005	5	005	Ctrl-E	5.0	005	Ctrl-E

[9]:

encoding.merge(update, on="Unicode", how="outer")

[9]:

	Unicode	Decimal_x	Octal_x	Key_x	Decimal_y	Octal_y	Key_y
0	U+0000	0.0	000	NUL	NaN	NaN	NaN
1	U+0001	1.0	001	Ctrl-A	NaN	NaN	NaN
2	U+0002	2.0	002	Ctrl-B	NaN	NaN	NaN
3	U+0003	3.0	003	Ctrl-C	3.0	003	Ctrl-C
4	U+0004	4.0	004	Ctrl-D	4.0	004	Ctrl-D
5	U+0005	5.0	005	Ctrl-E	5.0	005	Ctrl-E
6	U+0006	NaN	NaN	NaN	6.0	006	Ctrl-F
7	U+0007	NaN	NaN	NaN	7.0	007	Ctrl-G
8	U+0008	NaN	NaN	NaN	8.0	008	Ctrl-H
9	U+0009	NaN	NaN	NaN	9.0	009	Ctrl-I

The join method only affects the unique key values that appear in the result.

To join multiple keys, you can pass a list of column names:

[10]:

encoding.merge(update, on=["Unicode", "Decimal", "Octal", "Key"], how="outer")

[10]:

	Unicode	Decimal	Octal	Key
0	U+0000	0	000	NUL
1	U+0001	1	001	Ctrl-A
2	U+0002	2	002	Ctrl-B
3	U+0003	3	003	Ctrl-C
4	U+0004	4	004	Ctrl-D
5	U+0005	5	005	Ctrl-E
6	U+0006	6	006	Ctrl-F
7	U+0007	7	007	Ctrl-G
8	U+0008	8	008	Ctrl-H
9	U+0009	9	009	Ctrl-I