Pivot tables and crosstabs

A pivot table is a data summary tool often found in spreadsheet and other data analysis software. It summarises a table of data by one or more keys and arranges the data in a rectangle, with some of the group keys along the rows and some along the columns. Pivot tables in Python with pandas are made possible by the groupby function in combination with reshaping operations using hierarchical indexing. DataFrame has a pivot_table method, and there is also a top-level function pandas.pivot_table. pivot_table not only provides a convenient interface to groupby, but can also add partial sums (margins).

Suppose we wanted to compute a table of group averages (the default aggregation type of pivot_table), with title and language as the group keys in the rows:

[1]:

import numpy as np
import pandas as pd

[2]:

df = pd.DataFrame(
    {
        "Title": [
            "Jupyter Tutorial",
            "Jupyter Tutorial",
            "PyViz Tutorial",
            "PyViz Tutorial",
            "Python Basics",
            "Python Basics",
        ],
        "Language": ["de", "en", "de", "en", "de", "en"],
        "2021-12": [30134, 6073, 4873, np.nan, 427, 95],
        "2022-01": [33295, 7716, 3930, np.nan, 276, 226],
        "2022-02": [19651, 6547, 2573, np.nan, 525, 157],
    },
)

df

[2]:

	Title	Language	2021-12	2022-01	2022-02
0	Jupyter Tutorial	de	30134.0	33295.0	19651.0
1	Jupyter Tutorial	en	6073.0	7716.0	6547.0
2	PyViz Tutorial	de	4873.0	3930.0	2573.0
3	PyViz Tutorial	en	NaN	NaN	NaN
4	Python Basics	de	427.0	276.0	525.0
5	Python Basics	en	95.0	226.0	157.0

[3]:

df.pivot_table(index=["Title", "Language"])

[3]:

		2021-12	2022-01	2022-02
Title	Language
Jupyter Tutorial	de	30134.0	33295.0	19651.0
Jupyter Tutorial	en	6073.0	7716.0	6547.0
PyViz Tutorial	de	4873.0	3930.0	2573.0
Python Basics	de	427.0	276.0	525.0
Python Basics	en	95.0	226.0	157.0

This could also have been done directly with groupby.
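As a rough sketch of that equivalence, the following rebuilds the example data and computes the same table both ways. One difference worth noting: with its default dropna=True, pivot_table leaves out the PyViz Tutorial/en row, whose values are all NaN, whereas groupby keeps it. The dropna(how="all") call is an addition made here to align the two results, and the names via_pivot and via_groupby are purely illustrative.

```python
import numpy as np
import pandas as pd

# Same example data as above
df = pd.DataFrame(
    {
        "Title": ["Jupyter Tutorial"] * 2
        + ["PyViz Tutorial"] * 2
        + ["Python Basics"] * 2,
        "Language": ["de", "en", "de", "en", "de", "en"],
        "2021-12": [30134, 6073, 4873, np.nan, 427, 95],
        "2022-01": [33295, 7716, 3930, np.nan, 276, 226],
        "2022-02": [19651, 6547, 2573, np.nan, 525, 157],
    }
)

# Pivot table with the default aggregation (mean)
via_pivot = df.pivot_table(index=["Title", "Language"])

# groupby keeps the all-NaN group, so it is dropped explicitly
# to match the pivot table shown above
via_groupby = df.groupby(["Title", "Language"]).mean().dropna(how="all")

print(via_groupby)
```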

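Now let’s say we want the months in the rows instead. To do this, we pass Title and Language as the column keys; pivot_table then places the remaining numeric columns, here the months, in the rows. Because each combination of title and language occurs only once, the mean is simply the original value: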

[4]:

df.pivot_table(columns=["Title", "Language"])

[4]:

Title	Jupyter Tutorial		PyViz Tutorial	Python Basics
Language	de	en	de	de	en
2021-12	30134.0	6073.0	4873.0	427.0	95.0
2022-01	33295.0	7716.0	3930.0	276.0	226.0
2022-02	19651.0	6547.0	2573.0	525.0	157.0
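If the goal is the mean over all languages per title for each month, one possible sketch passes only Title as the column key. The month columns are listed explicitly in values, on the assumption that the string column Language should not be handed to the mean aggregation (depending on the pandas version, it might otherwise be dropped silently or cause an error). The names months and per_title are illustrative.

```python
import numpy as np
import pandas as pd

# Same example data as above
df = pd.DataFrame(
    {
        "Title": ["Jupyter Tutorial"] * 2
        + ["PyViz Tutorial"] * 2
        + ["Python Basics"] * 2,
        "Language": ["de", "en", "de", "en", "de", "en"],
        "2021-12": [30134, 6073, 4873, np.nan, 427, 95],
        "2022-01": [33295, 7716, 3930, np.nan, 276, 226],
        "2022-02": [19651, 6547, 2573, np.nan, 525, 157],
    }
)

# Restrict the aggregation to the month columns
months = ["2021-12", "2022-01", "2022-02"]

# Months end up in the rows, titles in the columns;
# NaN values are skipped when the mean is computed
per_title = df.pivot_table(values=months, columns="Title")

print(per_title)
```

For Jupyter Tutorial in 2021-12, for instance, this gives (30134 + 6073) / 2 = 18103.5, while PyViz Tutorial keeps its single de value because the en entry is missing.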

To use an aggregation function other than mean, pass it to the keyword argument aggfunc. With "sum", for example, you get the totals; margins=True additionally adds the subtotals labelled All:

[5]:

df.pivot_table(columns=["Title", "Language"], aggfunc="sum", margins=True)

[5]:

Title	Jupyter Tutorial			PyViz Tutorial			Python Basics
Language	de	en	All	de	en	All	de	en	All
2021-12	30134.0	6073.0	36207.0	4873.0	0.0	4873.0	427.0	95.0	522.0
2022-01	33295.0	7716.0	41011.0	3930.0	0.0	3930.0	276.0	226.0	502.0
2022-02	19651.0	6547.0	26198.0	2573.0	0.0	2573.0	525.0	157.0	682.0

pivot_table options:

Argument	Description
`values`	column name(s) to aggregate; by default, all numeric columns are aggregated
`index`	column names or other group keys to be grouped in the rows of the resulting pivot table
`columns`	column names or other group keys to be grouped in the columns of the resulting pivot table
`aggfunc`	aggregation function or list of functions (by default `mean`); can be any function valid in a `groupby` context
`fill_value`	replaces missing values in the result table
`dropna`	if `True`, columns whose entries are all `NA` are ignored
`margins`	inserts row/column subtotals and grand totals (default: `False`)
`margins_name`	name used for the row/column labels if `margins=True` is passed (default: `All`)
`observed`	for categorical group keys, if `True`, only the observed category values are displayed in the keys and not all categories
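To illustrate how some of these options combine, here is a minimal sketch that uses values, index, columns, aggfunc and fill_value together. The choice of the 2022-01 column and of 0 as the fill value is arbitrary, and the name table is illustrative.

```python
import numpy as np
import pandas as pd

# Same example data as above
df = pd.DataFrame(
    {
        "Title": ["Jupyter Tutorial"] * 2
        + ["PyViz Tutorial"] * 2
        + ["Python Basics"] * 2,
        "Language": ["de", "en", "de", "en", "de", "en"],
        "2021-12": [30134, 6073, 4873, np.nan, 427, 95],
        "2022-01": [33295, 7716, 3930, np.nan, 276, 226],
        "2022-02": [19651, 6547, 2573, np.nan, 525, 157],
    }
)

table = df.pivot_table(
    values="2022-01",    # aggregate a single month only
    index="Title",       # titles in the rows
    columns="Language",  # languages in the columns
    aggfunc="mean",      # the default, spelled out for clarity
    fill_value=0,        # replace missing cells in the result
)

print(table)
```

The cell for PyViz Tutorial in en, which has no data, is filled with 0 instead of NaN.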

Crosstabs

A crosstab is a special case of a pivot table that calculates group frequencies. For example, when analysing this data, we might want to determine which title was published in which language. We could use pivot_table for this, but the function pandas.crosstab is more convenient.

[6]:

pd.crosstab(df.Title, df.Language)

[6]:

Language	de	en
Title
Jupyter Tutorial	1	1
PyViz Tutorial	1	1
Python Basics	1	1
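For comparison, the same frequency table can be sketched without crosstab by counting the rows per group and then moving Language into the columns. This is only one of several possible formulations; the names by_groupby and by_crosstab are illustrative.

```python
import numpy as np
import pandas as pd

# Same example data as above
df = pd.DataFrame(
    {
        "Title": ["Jupyter Tutorial"] * 2
        + ["PyViz Tutorial"] * 2
        + ["Python Basics"] * 2,
        "Language": ["de", "en", "de", "en", "de", "en"],
        "2021-12": [30134, 6073, 4873, np.nan, 427, 95],
        "2022-01": [33295, 7716, 3930, np.nan, 276, 226],
        "2022-02": [19651, 6547, 2573, np.nan, 525, 157],
    }
)

# Count the rows per (Title, Language) pair, then pivot Language into the columns
by_groupby = df.groupby(["Title", "Language"]).size().unstack(fill_value=0)

# The crosstab call from above, for comparison
by_crosstab = pd.crosstab(df["Title"], df["Language"])

print(by_groupby)
```

Both approaches count rows rather than values, which is why PyViz Tutorial shows 1 for en even though that row contains only NaN.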

The first two arguments for crosstab can each be an array, a Series, or a list of arrays.
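A minimal sketch of the list form: here the row keys are given as a list of two Series, which results in a hierarchical row index. The Status series is a hypothetical helper introduced only for this illustration; it marks whether a row has a value for 2021-12.

```python
import numpy as np
import pandas as pd

# Same example data as above
df = pd.DataFrame(
    {
        "Title": ["Jupyter Tutorial"] * 2
        + ["PyViz Tutorial"] * 2
        + ["Python Basics"] * 2,
        "Language": ["de", "en", "de", "en", "de", "en"],
        "2021-12": [30134, 6073, 4873, np.nan, 427, 95],
        "2022-01": [33295, 7716, 3930, np.nan, 276, 226],
        "2022-02": [19651, 6547, 2573, np.nan, 525, 157],
    }
)

# Hypothetical helper: does the row have a value for 2021-12?
status = (
    df["2021-12"]
    .notna()
    .map({True: "has data", False: "missing"})
    .rename("Status")
)

# A list of Series as the first argument yields a MultiIndex in the rows
counts = pd.crosstab([df["Title"], df["Language"]], status)

print(counts)
```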

With margins=True we can also calculate the sums of the columns and rows as well as the total sum:

[7]:

pd.crosstab(df.Title, df.Language, margins=True)

[7]:

Language	de	en	All
Title
Jupyter Tutorial	1	1	2
PyViz Tutorial	1	1	2
Python Basics	1	1	2
All	3	3	6