Descriptive statistics

pandas objects come equipped with a number of common mathematical and statistical methods. Most of them fall into the category of reductions or summary statistics: methods that extract a single value (such as the sum or mean) from a series, or a series of values from the rows or columns of a DataFrame. Unlike the equivalent methods on NumPy arrays, they have built-in handling of missing data.
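A quick comparison with made-up values illustrates the difference in missing-data handling: NumPy propagates NaN through the sum, while pandas skips it by default:

```python
import numpy as np
import pandas as pd

values = np.array([1.0, np.nan, 2.0])

# NumPy propagates the missing value: the result is nan
print(values.sum())  # nan

# pandas skips NaN by default (skipna=True)
print(pd.Series(values).sum())  # 3.0
```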

[1]:
import numpy as np
import pandas as pd


rng = np.random.default_rng()
df = pd.DataFrame(
    rng.normal(size=(7, 3)), index=pd.date_range("2022-02-02", periods=7)
)
new_index = pd.date_range("2022-02-03", periods=7)
df2 = df.reindex(new_index)

df2
[1]:
0 1 2
2022-02-03 0.233008 -0.443973 1.358824
2022-02-04 -1.483024 -1.729070 -1.904267
2022-02-05 0.277089 -1.332157 -0.190171
2022-02-06 -0.753274 -0.960732 -1.524737
2022-02-07 -0.100957 0.423734 0.085448
2022-02-08 1.150540 0.811464 -0.406824
2022-02-09 NaN NaN NaN

Calling the pandas.DataFrame.sum method returns a series containing column totals:

[2]:
df2.sum()
[2]:
0   -0.676619
1   -3.230734
2   -2.581727
dtype: float64

Passing axis='columns' or axis=1 instead sums across the columns, returning one total per row:

[3]:
df2.sum(axis="columns")
[3]:
2022-02-03    1.147859
2022-02-04   -5.116361
2022-02-05   -1.245239
2022-02-06   -3.238743
2022-02-07    0.408226
2022-02-08    1.555179
2022-02-09    0.000000
Freq: D, dtype: float64

If an entire row or column contains only NA values, the sum is 0. This behaviour can be disabled with the skipna option:

[4]:
df2.sum(axis="columns", skipna=False)
[4]:
2022-02-03    1.147859
2022-02-04   -5.116361
2022-02-05   -1.245239
2022-02-06   -3.238743
2022-02-07    0.408226
2022-02-08    1.555179
2022-02-09         NaN
Freq: D, dtype: float64

Some aggregations, such as mean, require at least one non-NaN value to produce a meaningful result:

[5]:
df2.mean(axis="columns")
[5]:
2022-02-03    0.382620
2022-02-04   -1.705454
2022-02-05   -0.415080
2022-02-06   -1.079581
2022-02-07    0.136075
2022-02-08    0.518393
2022-02-09         NaN
Freq: D, dtype: float64

Options for reduction methods

Method  Description
axis    the axis to reduce over: 0 reduces down the rows (one value per column), 1 across the columns (one value per row)
skipna  exclude missing values; True by default
level   reduce grouped by level if the axis is hierarchically indexed (MultiIndex); removed in pandas 2.0 in favour of groupby
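Since reductions grouped by level are nowadays expressed via groupby, here is a minimal sketch with made-up data for summing within each outer level of a MultiIndex:

```python
import pandas as pd

midx = pd.MultiIndex.from_product([["a", "b"], [1, 2]])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=midx)

# Sum within each outer level of the hierarchical index
level_sums = s.groupby(level=0).sum()
print(level_sums)  # a → 3.0, b → 7.0
```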

Some methods, such as idxmin and idxmax, provide indirect statistics such as the index value at which the minimum or maximum value is reached:

[6]:
df2.idxmax()
[6]:
0   2022-02-08
1   2022-02-08
2   2022-02-03
dtype: datetime64[ns]

Other methods are accumulations:

[7]:
df2.cumsum()
[7]:
0 1 2
2022-02-03 0.233008 -0.443973 1.358824
2022-02-04 -1.250016 -2.173043 -0.545444
2022-02-05 -0.972928 -3.505199 -0.735614
2022-02-06 -1.726201 -4.465931 -2.260351
2022-02-07 -1.827158 -4.042197 -2.174903
2022-02-08 -0.676619 -3.230734 -2.581727
2022-02-09 NaN NaN NaN

A third type of method is neither a reduction nor an accumulation. describe is one such example, producing several summary statistics in one go:

[8]:
df2.describe()
[8]:
0 1 2
count 6.000000 6.000000 6.000000
mean -0.112770 -0.538456 -0.430288
std 0.911645 0.998284 1.174355
min -1.483024 -1.729070 -1.904267
25% -0.590194 -1.239301 -1.245259
50% 0.066026 -0.702352 -0.298498
75% 0.266069 0.206808 0.016544
max 1.150540 0.811464 1.358824

For non-numeric data, describe generates alternative summary statistics:

[9]:
data = {
    "Code": ["U+0000", "U+0001", "U+0002", "U+0003", "U+0004", "U+0005"],
    "Octal": ["001", "002", "003", "004", "004", "005"],
}
df3 = pd.DataFrame(data)

df3.describe()
[9]:
Code Octal
count 6 6
unique 6 5
top U+0000 004
freq 1 2

Descriptive and summary statistics:

Method          Description
count           number of non-NA values
describe        set of summary statistics for a series or for each DataFrame column
min, max        minimum and maximum values
argmin, argmax  integer index positions at which the minimum or maximum value was reached
idxmin, idxmax  index labels at which the minimum or maximum value was reached
quantile        sample quantile in the range from 0 to 1
sum             sum of the values
mean            arithmetic mean of the values
median          median (50% quantile) of the values
mad             mean absolute deviation from the mean (removed in pandas 2.0)
prod            product of all values
var             sample variance of the values
std             sample standard deviation of the values
skew            sample skewness (third moment) of the values
kurt            sample kurtosis (fourth moment) of the values
cumsum          cumulative sum of the values
cummin, cummax  cumulative minimum and maximum of the values respectively
cumprod         cumulative product of the values
diff            first arithmetic difference (useful for time series)
pct_change      percentage change between consecutive values
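A few of the methods from the table, applied to a small made-up series:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0, 8.0])

# the 50% quantile equals the median
print(s.quantile(0.5))  # 3.0

# first arithmetic difference; the first element has no predecessor
print(s.diff().tolist())  # [nan, 1.0, 2.0, 4.0]

# percentage change between consecutive values
print(s.pct_change().tolist())  # [nan, 1.0, 1.0, 1.0]
```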

ydata-profiling

ydata-profiling generates profile reports from a pandas DataFrame. The pandas df.describe() function is handy, but a bit basic for exploratory data analysis. ydata-profiling extends pandas DataFrame with df.profile_report(), which automatically generates a standardised report for understanding the data.

Installation

$ uv add standard-imghdr legacy-cgi "ydata-profiling[notebook, unicode]"
Resolved 251 packages in 2.53s
Prepared 1 package in 106ms
Installed 24 packages in 155ms
 + annotated-types==0.7.0
 + dacite==1.9.2
 + htmlmin==0.1.12
 + imagehash==4.3.1
 + legacy-cgi==2.6.3
 + llvmlite==0.44.0
 + multimethod==1.12
 + networkx==3.5
 + numba==0.61.0
 + patsy==1.0.1
 + phik==0.12.4
 + puremagic==1.29
 + pydantic==2.11.7
 + pydantic-core==2.33.2
 + pywavelets==1.8.0
 + seaborn==0.13.2
 + standard-imghdr==3.13.0
 + statsmodels==0.14.4
 + tangled-up-in-unicode==0.2.0
 + typeguard==4.4.2
 + typing-inspection==0.4.1
 + visions==0.8.1
 + wordcloud==1.9.4
 + ydata-profiling==4.16.1
$ uv run jupyter notebook

In Python 3.13, the imghdr and cgi modules were removed from the standard library; see also PEP 594. As a workaround, the standard-imghdr and legacy-cgi packages on the Python Package Index provide drop-in replacements.

Example

[10]:
from ydata_profiling import ProfileReport


profile = ProfileReport(df2, title="pandas Profiling Report")

profile.to_notebook_iframe()

100%|██████████████████████████████████████████| 3/3 [00:00<00:00, 71493.82it/s]

Configuration for large datasets

By default, ydata-profiling summarises the dataset to provide the most insights for data analysis. If the computation time of profiling becomes a bottleneck, ydata-profiling offers several alternatives to overcome it. For the following examples, we first read a larger dataset into pandas:

[11]:
titanic = pd.read_csv(
    "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
)

1. Minimal mode

ydata-profiling contains a minimal configuration file, config_minimal.yaml, in which the most expensive computations are turned off. This is the recommended starting point for larger datasets.

[12]:
profile = ProfileReport(
    titanic, title="Minimal pandas Profiling Report", minimal=True
)

profile.to_notebook_iframe()

  0%|                                                    | 0/12 [00:00<?, ?it/s]
100%|██████████████████████████████████████████| 12/12 [00:00<00:00, 108.77it/s]

Further details on settings and configuration can be found in Available settings.

2. Sample

An alternative for very large datasets is to profile only a sample of the data:

[13]:
sample = titanic.sample(frac=0.05)

profile = ProfileReport(sample, title="Sample pandas Profiling Report")

profile.to_notebook_iframe()

100%|██████████████████████████████████████████| 12/12 [00:00<00:00, 454.29it/s]

3. Deactivate expensive calculations

To reduce the computational effort on large datasets while still obtaining some interesting information, expensive calculations such as interactions can be restricted to certain columns:

[14]:
profile = ProfileReport()
profile.config.interactions.targets = ["Sex", "Age"]
profile.df = titanic

profile.to_notebook_iframe()

100%|███████████████████████████████████████| 12/12 [00:00<00:00, 260785.74it/s]