Pivot tables and crosstabs

A pivot table is a data summary tool often found in spreadsheet and other data analysis software. It summarises a table of data by one or more keys and arranges the data in a rectangle, with some of the group keys along the rows and some along the columns. Pivot tables in Python with pandas are made possible by the groupby function in combination with reshaping operations using hierarchical indexing. DataFrame has a pivot_table method, and there is also a top-level function pandas.pivot_table. pivot_table not only provides a convenient interface to groupby, but can also add partial sums (margins).

Suppose we wanted to compute a table of group averages (the default aggregation type of pivot_table) ordered by title and language in the rows:

[1]:
import numpy as np
import pandas as pd
[2]:
df = pd.DataFrame(
    {
        "Title": [
            "Jupyter Tutorial",
            "Jupyter Tutorial",
            "PyViz Tutorial",
            "PyViz Tutorial",
            "Python Basics",
            "Python Basics",
        ],
        "Language": ["de", "en", "de", "en", "de", "en"],
        "2021-12": [30134, 6073, 4873, np.nan, 427, 95],
        "2022-01": [33295, 7716, 3930, np.nan, 276, 226],
        "2022-02": [19651, 6547, 2573, np.nan, 525, 157],
    }
)

df
[2]:
Title Language 2021-12 2022-01 2022-02
0 Jupyter Tutorial de 30134.0 33295.0 19651.0
1 Jupyter Tutorial en 6073.0 7716.0 6547.0
2 PyViz Tutorial de 4873.0 3930.0 2573.0
3 PyViz Tutorial en NaN NaN NaN
4 Python Basics de 427.0 276.0 525.0
5 Python Basics en 95.0 226.0 157.0
[3]:
df.pivot_table(index=["Title", "Language"])
[3]:
2021-12 2022-01 2022-02
Title Language
Jupyter Tutorial de 30134.0 33295.0 19651.0
en 6073.0 7716.0 6547.0
PyViz Tutorial de 4873.0 3930.0 2573.0
Python Basics de 427.0 276.0 525.0
en 95.0 226.0 157.0

This could also have been done directly with groupby.

Now let’s say we want to get the mean of hits of all languages per title for each individual month. For this I will enter Title in the table columns and the months in the rows:

[4]:
df.pivot_table(columns=["Title", "Language"])
[4]:
Title Jupyter Tutorial PyViz Tutorial Python Basics
Language de en de de en
2021-12 30134.0 6073.0 4873.0 427.0 95.0
2022-01 33295.0 7716.0 3930.0 276.0 226.0
2022-02 19651.0 6547.0 2573.0 525.0 157.0

To use an aggregation function other than mean, pass it to the keyword argument aggfunc. With sum, for example, you get the sum:

[5]:
df.pivot_table(columns=["Title", "Language"], aggfunc="sum", margins=True)
[5]:
Title Jupyter Tutorial PyViz Tutorial Python Basics
Language de en All de en All de en All
2021-12 30134.0 6073.0 36207.0 4873.0 0.0 4873.0 427.0 95.0 522.0
2022-01 33295.0 7716.0 41011.0 3930.0 0.0 3930.0 276.0 226.0 502.0
2022-02 19651.0 6547.0 26198.0 2573.0 0.0 2573.0 525.0 157.0 682.0

pivot_table options:

Function name

Description

values

column name(s) to aggregate; by default, all numeric columns are aggregated

index

column names or other group keys to be grouped in the rows of the resulting pivot table

columns

column names or other group keys to be grouped in the columns of the resulting pivot table

aggfunc

aggregation function or list of functions (by default mean); can be any function valid in a groupby context

fill_value

replaces missing values in the result table

dropna

if True, columns whose entries are all NA are ignored

margins

inserts row/column subtotals and grand totals (default: False)

margins_name

name used for row/column labels if margins=True is passed, default is All.

observed

For categorical group keys, if True, only the observed category values are displayed in the keys and not all categories

Crosstabs

A crosstab is a special case of a pivot table that calculates the frequency of groups. For example, in the context of an analysis of this data, we might want to determine which title was published in which language, so we could use pivot_table for this, but the function pandas.crosstab is more convenient.

[6]:
pd.crosstab(df.Title, df.Language)
[6]:
Language de en
Title
Jupyter Tutorial 1 1
PyViz Tutorial 1 1
Python Basics 1 1

The first two arguments for crosstab can each be either an array or a series or a list of arrays.

With margins=True we can also calculate the sums of the columns and rows as well as the total sum:

[7]:
pd.crosstab(df.Title, df.Language, margins=True)
[7]:
Language de en All
Title
Jupyter Tutorial 1 1 2
PyViz Tutorial 1 1 2
Python Basics 1 1 2
All 3 3 6