{
"cells": [
{
"cell_type": "markdown",
"id": "2e977f26",
"metadata": {},
"source": [
"# Combining and merging data sets\n",
"\n",
"Data contained in pandas objects can be combined in several ways:\n",
"\n",
"* [pandas.merge](https://pandas.pydata.org/docs/reference/api/pandas.merge.html) joins rows in DataFrames based on one or more keys. This function is familiar from SQL or other relational databases, as it implements database join operations.\n",
"* [pandas.concat](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) concatenates or *stacks* objects along an axis.\n",
"* The instance methods [pandas.DataFrame.combine_first](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.combine_first.html) or [pandas.Series.combine_first](https://pandas.pydata.org/docs/reference/api/pandas.Series.combine_first.html) allow overlapping data to be joined.\n",
"* With [pandas.merge_asof](https://pandas.pydata.org/docs/reference/api/pandas.merge_asof.html) you can perform time series based window joins between DataFrame objects."
]
},
{
"cell_type": "markdown",
"id": "8c707380",
"metadata": {},
"source": [
"## Database-like DataFrame joins\n",
"\n",
"Merge or join operations combine data sets by linking rows with one or more keys. These operations are especially important in relational, SQL-based databases. The merge function in pandas is the main entry point for applying these algorithms to your data."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "0b80668e",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "eb975aa7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"( Unicode Decimal Octal Key\n",
" 0 U+0000 0 000 NUL\n",
" 1 U+0 001 1 001 Ctrl-A\n",
" 2 U+0002 2 002 Ctrl-B\n",
" 3 U+0003 3 003 Ctrl-C\n",
" 4 U+0004 4 004 Ctrl-D\n",
" 5 U+0005 5 005 Ctrl-E,\n",
" Unicode Decimal Octal Key\n",
" 0 U+0003 3 003 Ctrl-C\n",
" 1 U+0004 4 004 Ctrl-D\n",
" 2 U+0005 5 005 Ctrl-E\n",
" 3 U+0006 6 006 Ctrl-F\n",
" 4 U+0007 7 007 Ctrl-G\n",
" 5 U+0008 8 008 Ctrl-H\n",
" 6 U+0009 9 009 Ctrl-I)"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"encoding = pd.DataFrame(\n",
" {\n",
" \"Unicode\": [\n",
" \"U+0000\", \"U+0 001\", \"U+0002\", \"U+0003\", \"U+0004\", \"U+0005\",\n",
" ],\n",
" \"Decimal\": [0, 1, 2, 3, 4, 5],\n",
" \"Octal\": [\"000\", \"001\", \"002\", \"003\", \"004\", \"005\"],\n",
" \"Key\": [\"NUL\", \"Ctrl-A\", \"Ctrl-B\", \"Ctrl-C\", \"Ctrl-D\", \"Ctrl-E\"],\n",
" }\n",
")\n",
"\n",
"update = pd.DataFrame(\n",
" {\n",
" \"Unicode\": [\n",
" \"U+0003\", \"U+0004\", \"U+0005\", \"U+0006\", \"U+0007\", \"U+0008\", \"U+0009\",\n",
" ],\n",
" \"Decimal\": [3, 4, 5, 6, 7, 8, 9],\n",
" \"Octal\": [\"003\", \"004\", \"005\", \"006\", \"007\", \"008\", \"009\"],\n",
" \"Key\": [\n",
" \"Ctrl-C\", \"Ctrl-D\", \"Ctrl-E\", \"Ctrl-F\", \"Ctrl-G\", \"Ctrl-H\", \"Ctrl-I\",\n",
" ],\n",
" }\n",
")\n",
"\n",
"encoding, update"
]
},
{
"cell_type": "markdown",
"id": "ce2f4cab",
"metadata": {},
"source": [
"When we call `merge` with these objects, we get:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "3c3df196",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Unicode | \n",
" Decimal | \n",
" Octal | \n",
" Key | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" U+0003 | \n",
" 3 | \n",
" 003 | \n",
" Ctrl-C | \n",
"
\n",
" \n",
" | 1 | \n",
" U+0004 | \n",
" 4 | \n",
" 004 | \n",
" Ctrl-D | \n",
"
\n",
" \n",
" | 2 | \n",
" U+0005 | \n",
" 5 | \n",
" 005 | \n",
" Ctrl-E | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Unicode Decimal Octal Key\n",
"0 U+0003 3 003 Ctrl-C\n",
"1 U+0004 4 004 Ctrl-D\n",
"2 U+0005 5 005 Ctrl-E"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(encoding, update)"
]
},
{
"cell_type": "markdown",
"id": "cd0a1598",
"metadata": {},
"source": [
"By default, `merge` performs a so-called *inner join*; the keys in the result are the intersection or common set in both tables."
]
},
{
"cell_type": "markdown",
"id": "7e4b3cfe",
"metadata": {},
"source": [
"\n",
"\n",
"**Note:**\n",
"\n",
"I did not specify which column to merge over. If this information is not specified, merge will use the overlapping column names as keys. However, it is good practice to specify this explicitly:\n",
"
"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "e3ac98a4",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Unicode | \n",
" Decimal_x | \n",
" Octal_x | \n",
" Key_x | \n",
" Decimal_y | \n",
" Octal_y | \n",
" Key_y | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" U+0003 | \n",
" 3 | \n",
" 003 | \n",
" Ctrl-C | \n",
" 3 | \n",
" 003 | \n",
" Ctrl-C | \n",
"
\n",
" \n",
" | 1 | \n",
" U+0004 | \n",
" 4 | \n",
" 004 | \n",
" Ctrl-D | \n",
" 4 | \n",
" 004 | \n",
" Ctrl-D | \n",
"
\n",
" \n",
" | 2 | \n",
" U+0005 | \n",
" 5 | \n",
" 005 | \n",
" Ctrl-E | \n",
" 5 | \n",
" 005 | \n",
" Ctrl-E | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Unicode Decimal_x Octal_x Key_x Decimal_y Octal_y Key_y\n",
"0 U+0003 3 003 Ctrl-C 3 003 Ctrl-C\n",
"1 U+0004 4 004 Ctrl-D 4 004 Ctrl-D\n",
"2 U+0005 5 005 Ctrl-E 5 005 Ctrl-E"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(encoding, update, on=\"Unicode\")"
]
},
{
"cell_type": "markdown",
"id": "f1e243d4",
"metadata": {},
"source": [
"If the column names are different in each object, you can specify them separately. In the following example `update2` gets the key `U+` and not `Unicode`:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "9d310a26",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Unicode | \n",
" Decimal_x | \n",
" Octal_x | \n",
" Key_x | \n",
" U+ | \n",
" Decimal_y | \n",
" Octal_y | \n",
" Key_y | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" U+0003 | \n",
" 3 | \n",
" 003 | \n",
" Ctrl-C | \n",
" U+0003 | \n",
" 3 | \n",
" 003 | \n",
" Ctrl-C | \n",
"
\n",
" \n",
" | 1 | \n",
" U+0004 | \n",
" 4 | \n",
" 004 | \n",
" Ctrl-D | \n",
" U+0004 | \n",
" 4 | \n",
" 004 | \n",
" Ctrl-D | \n",
"
\n",
" \n",
" | 2 | \n",
" U+0005 | \n",
" 5 | \n",
" 005 | \n",
" Ctrl-E | \n",
" U+0005 | \n",
" 5 | \n",
" 005 | \n",
" Ctrl-E | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Unicode Decimal_x Octal_x Key_x U+ Decimal_y Octal_y Key_y\n",
"0 U+0003 3 003 Ctrl-C U+0003 3 003 Ctrl-C\n",
"1 U+0004 4 004 Ctrl-D U+0004 4 004 Ctrl-D\n",
"2 U+0005 5 005 Ctrl-E U+0005 5 005 Ctrl-E"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"update2 = pd.DataFrame(\n",
" {\n",
" \"U+\": [\n",
" \"U+0003\", \"U+0004\", \"U+0005\", \"U+0006\", \"U+0007\", \"U+0008\", \"U+0009\",\n",
" ],\n",
" \"Decimal\": [3, 4, 5, 6, 7, 8, 9],\n",
" \"Octal\": [\"003\", \"004\", \"005\", \"006\", \"007\", \"008\", \"009\"],\n",
" \"Key\": [\n",
" \"Ctrl-C\", \"Ctrl-D\", \"Ctrl-E\", \"Ctrl-F\", \"Ctrl-G\", \"Ctrl-H\", \"Ctrl-I\",\n",
" ],\n",
" }\n",
")\n",
"\n",
"pd.merge(encoding, update2, left_on=\"Unicode\", right_on=\"U+\")"
]
},
{
"cell_type": "markdown",
"id": "3978abc4",
"metadata": {},
"source": [
"However, you can use `merge` not only to perform an inner join, with which the keys in the result are the intersection or common set in both tables. Other possible options are:\n",
"\n",
"Option | Behaviour\n",
":----- | :--------\n",
"`how='inner'` | uses only the key combinations observed in both tables\n",
"`how='left'` | uses all key combinations found in the left table\n",
"`how='right'` | uses all key combinations found in the right table\n",
"`how='outer'` | uses all key combinations observed in both tables together"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "1c33c487",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Unicode | \n",
" Decimal_x | \n",
" Octal_x | \n",
" Key_x | \n",
" Decimal_y | \n",
" Octal_y | \n",
" Key_y | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" U+0000 | \n",
" 0 | \n",
" 000 | \n",
" NUL | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" | 1 | \n",
" U+0 001 | \n",
" 1 | \n",
" 001 | \n",
" Ctrl-A | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" | 2 | \n",
" U+0002 | \n",
" 2 | \n",
" 002 | \n",
" Ctrl-B | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" | 3 | \n",
" U+0003 | \n",
" 3 | \n",
" 003 | \n",
" Ctrl-C | \n",
" 3.0 | \n",
" 003 | \n",
" Ctrl-C | \n",
"
\n",
" \n",
" | 4 | \n",
" U+0004 | \n",
" 4 | \n",
" 004 | \n",
" Ctrl-D | \n",
" 4.0 | \n",
" 004 | \n",
" Ctrl-D | \n",
"
\n",
" \n",
" | 5 | \n",
" U+0005 | \n",
" 5 | \n",
" 005 | \n",
" Ctrl-E | \n",
" 5.0 | \n",
" 005 | \n",
" Ctrl-E | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Unicode Decimal_x Octal_x Key_x Decimal_y Octal_y Key_y\n",
"0 U+0000 0 000 NUL NaN NaN NaN\n",
"1 U+0 001 1 001 Ctrl-A NaN NaN NaN\n",
"2 U+0002 2 002 Ctrl-B NaN NaN NaN\n",
"3 U+0003 3 003 Ctrl-C 3.0 003 Ctrl-C\n",
"4 U+0004 4 004 Ctrl-D 4.0 004 Ctrl-D\n",
"5 U+0005 5 005 Ctrl-E 5.0 005 Ctrl-E"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(encoding, update, on=\"Unicode\", how=\"left\")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "5a959622",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Unicode | \n",
" Decimal_x | \n",
" Octal_x | \n",
" Key_x | \n",
" Decimal_y | \n",
" Octal_y | \n",
" Key_y | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" U+0 001 | \n",
" 1.0 | \n",
" 001 | \n",
" Ctrl-A | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" | 1 | \n",
" U+0000 | \n",
" 0.0 | \n",
" 000 | \n",
" NUL | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" | 2 | \n",
" U+0002 | \n",
" 2.0 | \n",
" 002 | \n",
" Ctrl-B | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" | 3 | \n",
" U+0003 | \n",
" 3.0 | \n",
" 003 | \n",
" Ctrl-C | \n",
" 3.0 | \n",
" 003 | \n",
" Ctrl-C | \n",
"
\n",
" \n",
" | 4 | \n",
" U+0004 | \n",
" 4.0 | \n",
" 004 | \n",
" Ctrl-D | \n",
" 4.0 | \n",
" 004 | \n",
" Ctrl-D | \n",
"
\n",
" \n",
" | 5 | \n",
" U+0005 | \n",
" 5.0 | \n",
" 005 | \n",
" Ctrl-E | \n",
" 5.0 | \n",
" 005 | \n",
" Ctrl-E | \n",
"
\n",
" \n",
" | 6 | \n",
" U+0006 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 6.0 | \n",
" 006 | \n",
" Ctrl-F | \n",
"
\n",
" \n",
" | 7 | \n",
" U+0007 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 7.0 | \n",
" 007 | \n",
" Ctrl-G | \n",
"
\n",
" \n",
" | 8 | \n",
" U+0008 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 8.0 | \n",
" 008 | \n",
" Ctrl-H | \n",
"
\n",
" \n",
" | 9 | \n",
" U+0009 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 9.0 | \n",
" 009 | \n",
" Ctrl-I | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Unicode Decimal_x Octal_x Key_x Decimal_y Octal_y Key_y\n",
"0 U+0 001 1.0 001 Ctrl-A NaN NaN NaN\n",
"1 U+0000 0.0 000 NUL NaN NaN NaN\n",
"2 U+0002 2.0 002 Ctrl-B NaN NaN NaN\n",
"3 U+0003 3.0 003 Ctrl-C 3.0 003 Ctrl-C\n",
"4 U+0004 4.0 004 Ctrl-D 4.0 004 Ctrl-D\n",
"5 U+0005 5.0 005 Ctrl-E 5.0 005 Ctrl-E\n",
"6 U+0006 NaN NaN NaN 6.0 006 Ctrl-F\n",
"7 U+0007 NaN NaN NaN 7.0 007 Ctrl-G\n",
"8 U+0008 NaN NaN NaN 8.0 008 Ctrl-H\n",
"9 U+0009 NaN NaN NaN 9.0 009 Ctrl-I"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(encoding, update, on=\"Unicode\", how=\"outer\")"
]
},
{
"cell_type": "markdown",
"id": "fd6c3638",
"metadata": {},
"source": [
"The join method only affects the unique key values that appear in the result."
]
},
{
"cell_type": "markdown",
"id": "843d8e88",
"metadata": {},
"source": [
"To join multiple keys, you can pass a list of column names:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "5891c418",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Unicode | \n",
" Decimal | \n",
" Octal | \n",
" Key | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" U+0 001 | \n",
" 1 | \n",
" 001 | \n",
" Ctrl-A | \n",
"
\n",
" \n",
" | 1 | \n",
" U+0000 | \n",
" 0 | \n",
" 000 | \n",
" NUL | \n",
"
\n",
" \n",
" | 2 | \n",
" U+0002 | \n",
" 2 | \n",
" 002 | \n",
" Ctrl-B | \n",
"
\n",
" \n",
" | 3 | \n",
" U+0003 | \n",
" 3 | \n",
" 003 | \n",
" Ctrl-C | \n",
"
\n",
" \n",
" | 4 | \n",
" U+0004 | \n",
" 4 | \n",
" 004 | \n",
" Ctrl-D | \n",
"
\n",
" \n",
" | 5 | \n",
" U+0005 | \n",
" 5 | \n",
" 005 | \n",
" Ctrl-E | \n",
"
\n",
" \n",
" | 6 | \n",
" U+0006 | \n",
" 6 | \n",
" 006 | \n",
" Ctrl-F | \n",
"
\n",
" \n",
" | 7 | \n",
" U+0007 | \n",
" 7 | \n",
" 007 | \n",
" Ctrl-G | \n",
"
\n",
" \n",
" | 8 | \n",
" U+0008 | \n",
" 8 | \n",
" 008 | \n",
" Ctrl-H | \n",
"
\n",
" \n",
" | 9 | \n",
" U+0009 | \n",
" 9 | \n",
" 009 | \n",
" Ctrl-I | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Unicode Decimal Octal Key\n",
"0 U+0 001 1 001 Ctrl-A\n",
"1 U+0000 0 000 NUL\n",
"2 U+0002 2 002 Ctrl-B\n",
"3 U+0003 3 003 Ctrl-C\n",
"4 U+0004 4 004 Ctrl-D\n",
"5 U+0005 5 005 Ctrl-E\n",
"6 U+0006 6 006 Ctrl-F\n",
"7 U+0007 7 007 Ctrl-G\n",
"8 U+0008 8 008 Ctrl-H\n",
"9 U+0009 9 009 Ctrl-I"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(encoding, update, on=[\"Unicode\", \"Decimal\", \"Octal\", \"Key\"], how=\"outer\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.13 Kernel",
"language": "python",
"name": "python313"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.0"
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {},
"version_major": 2,
"version_minor": 0
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}