Pivot tables and crosstabs¶
A pivot table is a data summary tool often found in spreadsheet and other data analysis software. It summarises a table of data by one or more keys and arranges the data in a rectangle, with some of the group keys along the rows and some along the columns. Pivot tables in Python with pandas are made possible by the groupby function in combination with reshaping operations using hierarchical
indexing. DataFrame
has a pivot_table method, and there is also a top-level function pandas.pivot_table. pivot_table
not only provides a convenient interface to groupby
, but can also add partial sums (margins
).
Suppose we wanted to compute a table of group averages (the default aggregation type of pivot_table
) ordered by title and language in the rows:
[1]:
import numpy as np
import pandas as pd
[2]:
df = pd.DataFrame(
{
"Title": [
"Jupyter Tutorial",
"Jupyter Tutorial",
"PyViz Tutorial",
"PyViz Tutorial",
"Python Basics",
"Python Basics",
],
"Language": ["de", "en", "de", "en", "de", "en"],
"2021-12": [30134, 6073, 4873, np.nan, 427, 95],
"2022-01": [33295, 7716, 3930, np.nan, 276, 226],
"2022-02": [19651, 6547, 2573, np.nan, 525, 157],
}
)
df
[2]:
Title | Language | 2021-12 | 2022-01 | 2022-02 | |
---|---|---|---|---|---|
0 | Jupyter Tutorial | de | 30134.0 | 33295.0 | 19651.0 |
1 | Jupyter Tutorial | en | 6073.0 | 7716.0 | 6547.0 |
2 | PyViz Tutorial | de | 4873.0 | 3930.0 | 2573.0 |
3 | PyViz Tutorial | en | NaN | NaN | NaN |
4 | Python Basics | de | 427.0 | 276.0 | 525.0 |
5 | Python Basics | en | 95.0 | 226.0 | 157.0 |
[3]:
df.pivot_table(index=["Title", "Language"])
[3]:
2021-12 | 2022-01 | 2022-02 | ||
---|---|---|---|---|
Title | Language | |||
Jupyter Tutorial | de | 30134.0 | 33295.0 | 19651.0 |
en | 6073.0 | 7716.0 | 6547.0 | |
PyViz Tutorial | de | 4873.0 | 3930.0 | 2573.0 |
Python Basics | de | 427.0 | 276.0 | 525.0 |
en | 95.0 | 226.0 | 157.0 |
This could also have been done directly with groupby
.
Now let’s say we want to get the mean of hits of all languages per title for each individual month. For this I will enter Title
in the table columns and the months in the rows:
[4]:
df.pivot_table(columns=["Title", "Language"])
[4]:
Title | Jupyter Tutorial | PyViz Tutorial | Python Basics | ||
---|---|---|---|---|---|
Language | de | en | de | de | en |
2021-12 | 30134.0 | 6073.0 | 4873.0 | 427.0 | 95.0 |
2022-01 | 33295.0 | 7716.0 | 3930.0 | 276.0 | 226.0 |
2022-02 | 19651.0 | 6547.0 | 2573.0 | 525.0 | 157.0 |
To use an aggregation function other than mean
, pass it to the keyword argument aggfunc
. With sum
, for example, you get the sum:
[5]:
df.pivot_table(columns=["Title", "Language"], aggfunc="sum", margins=True)
[5]:
Title | Jupyter Tutorial | PyViz Tutorial | Python Basics | ||||||
---|---|---|---|---|---|---|---|---|---|
Language | de | en | All | de | en | All | de | en | All |
2021-12 | 30134.0 | 6073.0 | 36207.0 | 4873.0 | 0.0 | 4873.0 | 427.0 | 95.0 | 522.0 |
2022-01 | 33295.0 | 7716.0 | 41011.0 | 3930.0 | 0.0 | 3930.0 | 276.0 | 226.0 | 502.0 |
2022-02 | 19651.0 | 6547.0 | 26198.0 | 2573.0 | 0.0 | 2573.0 | 525.0 | 157.0 | 682.0 |
pivot_table
options:
Function name |
Description |
---|---|
|
column name(s) to aggregate; by default, all numeric columns are aggregated |
|
column names or other group keys to be grouped in the rows of the resulting pivot table |
|
column names or other group keys to be grouped in the columns of the resulting pivot table |
|
aggregation function or list of functions (by default |
|
replaces missing values in the result table |
|
if |
|
inserts row/column subtotals and grand totals (default: |
|
name used for row/column labels if |
|
For categorical group keys, if |
Crosstabs¶
A crosstab is a special case of a pivot table that calculates the frequency of groups. For example, in the context of an analysis of this data, we might want to determine which title was published in which language, so we could use pivot_table
for this, but the function pandas.crosstab is more convenient.
[6]:
pd.crosstab(df.Title, df.Language)
[6]:
Language | de | en |
---|---|---|
Title | ||
Jupyter Tutorial | 1 | 1 |
PyViz Tutorial | 1 | 1 |
Python Basics | 1 | 1 |
The first two arguments for crosstab
can each be either an array or a series or a list of arrays.
With margins=True
we can also calculate the sums of the columns and rows as well as the total sum:
[7]:
pd.crosstab(df.Title, df.Language, margins=True)
[7]:
Language | de | en | All |
---|---|---|---|
Title | |||
Jupyter Tutorial | 1 | 1 | 2 |
PyViz Tutorial | 1 | 1 | 2 |
Python Basics | 1 | 1 | 2 |
All | 3 | 3 | 6 |