Descriptive statistics

pandas objects are equipped with a number of common mathematical and statistical methods. Most of them fall into the category of reductions or summary statistics: methods that extract a single value (such as the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame. Unlike the equivalent methods on NumPy arrays, they have built-in handling of missing data.
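A minimal sketch of this difference, using made-up values with a single gap:

```python
import numpy as np
import pandas as pd

values = [1.0, np.nan, 3.0]

# In plain NumPy, a single NaN propagates into the result
numpy_sum = np.array(values).sum()  # NaN

# pandas skips missing values by default
pandas_sum = pd.Series(values).sum()  # 4.0
```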

[1]:
import numpy as np
import pandas as pd


rng = np.random.default_rng()
df = pd.DataFrame(
    rng.normal(size=(7, 3)), index=pd.date_range("2022-02-02", periods=7)
)
new_index = pd.date_range("2022-02-03", periods=7)
df2 = df.reindex(new_index)

df2
[1]:
0 1 2
2022-02-03 0.686507 1.870769 -0.699365
2022-02-04 -1.462243 0.833043 0.423066
2022-02-05 0.227436 -1.146793 -0.495678
2022-02-06 0.404523 0.517117 -1.475375
2022-02-07 2.022298 -0.263188 -0.478148
2022-02-08 -0.056213 0.913033 -0.723379
2022-02-09 NaN NaN NaN

Calling the pandas.DataFrame.sum method returns a Series containing the column sums:

[2]:
df2.sum()
[2]:
0    1.822307
1    2.723981
2   -3.448879
dtype: float64

Passing axis='columns' or axis=1 instead sums across the columns, producing one value per row:

[3]:
df2.sum(axis="columns")
[3]:
2022-02-03    1.857911
2022-02-04   -0.206135
2022-02-05   -1.415035
2022-02-06   -0.553735
2022-02-07    1.280962
2022-02-08    0.133441
2022-02-09    0.000000
Freq: D, dtype: float64

If an entire row or column contains all NA values, the sum is 0. This can be disabled with the skipna option:

[4]:
df2.sum(axis="columns", skipna=False)
[4]:
2022-02-03    1.857911
2022-02-04   -0.206135
2022-02-05   -1.415035
2022-02-06   -0.553735
2022-02-07    1.280962
2022-02-08    0.133441
2022-02-09         NaN
Freq: D, dtype: float64

Some aggregations, such as mean, require at least one non-NaN value to produce a valid result:

[5]:
df2.mean(axis="columns")
[5]:
2022-02-03    0.619304
2022-02-04   -0.068712
2022-02-05   -0.471678
2022-02-06   -0.184578
2022-02-07    0.426987
2022-02-08    0.044480
2022-02-09         NaN
Freq: D, dtype: float64

Options for reduction methods

Method   Description
axis     the axis to reduce over: 0 for the rows of the DataFrame and 1 for the columns
skipna   exclude missing values; True by default
level    reduce grouped by level if the axis is hierarchically indexed (MultiIndex); removed in pandas 2.0 in favour of groupby
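The effect of the skipna option can be sketched on a tiny, made-up DataFrame with one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration; column "a" has one gap
frame = pd.DataFrame({"a": [1.0, np.nan], "b": [3.0, 4.0]})

# By default missing values are skipped
col_sums = frame.sum()  # a -> 1.0, b -> 7.0

# With skipna=False a single NaN makes that column's result NaN
col_sums_strict = frame.sum(skipna=False)  # a -> NaN, b -> 7.0
```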

Some methods, such as idxmin and idxmax, provide indirect statistics such as the index value at which the minimum or maximum value is reached:

[6]:
df2.idxmax()
[6]:
0   2022-02-07
1   2022-02-03
2   2022-02-04
dtype: datetime64[ns]
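The contrast between label-based lookups (idxmin/idxmax) and position-based ones (argmin/argmax) can be sketched on a small Series constructed for illustration:

```python
import pandas as pd

s = pd.Series([2.0, 9.0, 4.0], index=["x", "y", "z"])

label = s.idxmax()     # index label of the maximum: "y"
position = s.argmax()  # integer position of the maximum: 1
```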

Other methods are accumulations:

[7]:
df2.cumsum()
[7]:
0 1 2
2022-02-03 0.686507 1.870769 -0.699365
2022-02-04 -0.775736 2.703812 -0.276300
2022-02-05 -0.548300 1.557019 -0.771977
2022-02-06 -0.143777 2.074136 -2.247352
2022-02-07 1.878520 1.810948 -2.725500
2022-02-08 1.822307 2.723981 -3.448879
2022-02-09 NaN NaN NaN

A third kind of method is neither a reduction nor an accumulation. describe is one such example, producing several summary statistics in one go:

[8]:
df2.describe()
[8]:
0 1 2
count 6.000000 6.000000 6.000000
mean 0.303718 0.453997 -0.574813
std 1.128201 1.043312 0.609912
min -1.462243 -1.146793 -1.475375
25% 0.014699 -0.068112 -0.717375
50% 0.315979 0.675080 -0.597521
75% 0.616011 0.893035 -0.482530
max 2.022298 1.870769 0.423066

For non-numeric data, describe generates alternative summary statistics:

[9]:
data = {
    "Code": ["U+0000", "U+0001", "U+0002", "U+0003", "U+0004", "U+0005"],
    "Octal": ["001", "002", "003", "004", "004", "005"],
}
df3 = pd.DataFrame(data)

df3.describe()
[9]:
Code Octal
count 6 6
unique 6 5
top U+0000 004
freq 1 2
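To summarise numeric and non-numeric columns in a single call, describe accepts an include argument; a sketch with a made-up mixed-type frame:

```python
import pandas as pd

# Hypothetical frame with one numeric and one string column
mixed = pd.DataFrame({"n": [1, 2, 2], "s": ["a", "b", "b"]})

# include="all" combines numeric statistics (mean, std, ...) with
# categorical ones (unique, top, freq); inapplicable cells are NaN
summary = mixed.describe(include="all")
```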

Descriptive and summary statistics:

Method          Description
count           number of non-NA values
describe        set of summary statistics for a Series or for each DataFrame column
min, max        minimum and maximum values
argmin, argmax  integer positions at which the minimum or maximum value is reached
idxmin, idxmax  index labels at which the minimum or maximum value is reached
quantile        sample quantile in the range from 0 to 1
sum             sum of the values
mean            arithmetic mean of the values
median          median (50% quantile) of the values
mad             mean absolute deviation from the mean (removed in pandas 2.0)
prod            product of all values
var             sample variance of the values
std             sample standard deviation of the values
skew            sample skewness (third moment) of the values
kurt            sample kurtosis (fourth moment) of the values
cumsum          cumulative sum of the values
cummin, cummax  cumulative minimum and maximum of the values respectively
cumprod         cumulative product of the values
diff            first arithmetic difference (useful for time series)
pct_change      percentage change between values
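Two of the time-series-oriented methods from the table, diff and pct_change, sketched on an illustrative price series:

```python
import pandas as pd

prices = pd.Series([100.0, 110.0, 99.0])

# First arithmetic difference: NaN, 10.0, -11.0
first_diff = prices.diff()

# Percentage change between consecutive values: NaN, 0.10, -0.10
returns = prices.pct_change()
```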

ydata-profiling

ydata-profiling generates profile reports from a pandas DataFrame. The pandas df.describe() function is handy, but a bit basic for exploratory data analysis. ydata-profiling extends pandas DataFrame with df.profile_report(), which automatically generates a standardised report for understanding the data.

Installation

$ pipenv install "ydata-profiling[notebook, unicode, pyspark]"
…
✔ Success!
Updated Pipfile.lock (cbc5f7)!
Installing dependencies from Pipfile.lock (cbc5f7)...
  🐍   ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 80/80  00:02:26
…
$ pipenv run jupyter nbextension enable --py widgetsnbextension
Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: OK

Example

[10]:
from ydata_profiling import ProfileReport


profile = ProfileReport(df2, title="pandas Profiling Report")

profile.to_widgets()

Configuration for large datasets

By default, ydata-profiling summarises the dataset to provide the most insights for data analysis. If the computation time of profiling becomes a bottleneck, ydata-profiling offers several alternatives to overcome it. For the following examples, we first read a larger data set into pandas:

[11]:
titanic = pd.read_csv(
    "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
)

1. Minimal mode

ydata-profiling contains a minimal configuration file config_minimal.yaml, in which the most expensive calculations are turned off by default. This is the recommended starting point for larger data sets.

[12]:
profile = ProfileReport(
    titanic, title="Minimal pandas Profiling Report", minimal=True
)

profile.to_widgets()

Further details on settings and configuration can be found in Available settings.

2. Sample

An alternative option for very large data sets is to use only a part of them for the profiling report:

[13]:
sample = titanic.sample(frac=0.05)

profile = ProfileReport(sample, title="Sample pandas Profiling Report")

profile.to_widgets()

3. Deactivate expensive calculations

To reduce the computational effort on large datasets while still getting some interesting information, expensive calculations can be restricted to certain columns:

[14]:
profile = ProfileReport()
profile.config.interactions.targets = ["Sex", "Age"]
profile.df = titanic

profile.to_widgets()

The setting interactions.targets can be changed via configuration files as well as via environment variables; see Interactions for details.

4. Concurrency

Currently, work is being done on a scalable Spark backend for ydata-profiling; see Spark Profiling Development.