Descriptive statistics

pandas objects are equipped with a number of common mathematical and statistical methods. Most of them fall into the category of reductions or summary statistics: methods that extract a single value (such as the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame. Unlike the equivalent methods on NumPy arrays, they have built-in handling of missing data.
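A minimal sketch of this difference, using made-up values with a single gap:

```python
import numpy as np
import pandas as pd

values = [1.0, np.nan, 3.0]

# In plain NumPy, a single NaN propagates into the result
numpy_sum = np.array(values).sum()  # NaN

# pandas skips missing values by default
pandas_sum = pd.Series(values).sum()  # 4.0
```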

[1]:
import numpy as np
import pandas as pd


rng = np.random.default_rng()
df = pd.DataFrame(
    rng.normal(size=(7, 3)), index=pd.date_range("2022-02-02", periods=7)
)
new_index = pd.date_range("2022-02-03", periods=7)
df2 = df.reindex(new_index)

df2
[1]:
0 1 2
2022-02-03 0.686507 1.870769 -0.699365
2022-02-04 -1.462243 0.833043 0.423066
2022-02-05 0.227436 -1.146793 -0.495678
2022-02-06 0.404523 0.517117 -1.475375
2022-02-07 2.022298 -0.263188 -0.478148
2022-02-08 -0.056213 0.913033 -0.723379
2022-02-09 NaN NaN NaN

Calling the pandas.DataFrame.sum method returns a Series containing the column sums:

[2]:
df2.sum()
[2]:
0    1.822307
1    2.723981
2   -3.448879
dtype: float64

Passing axis='columns' or axis=1 instead sums across the columns, producing one value per row:

[3]:
df2.sum(axis="columns")
[3]:
2022-02-03    1.857911
2022-02-04   -0.206135
2022-02-05   -1.415035
2022-02-06   -0.553735
2022-02-07    1.280962
2022-02-08    0.133441
2022-02-09    0.000000
Freq: D, dtype: float64

If an entire row or column contains all NA values, the sum is 0. This can be disabled with the skipna option:

[4]:
df2.sum(axis="columns", skipna=False)
[4]:
2022-02-03    1.857911
2022-02-04   -0.206135
2022-02-05   -1.415035
2022-02-06   -0.553735
2022-02-07    1.280962
2022-02-08    0.133441
2022-02-09         NaN
Freq: D, dtype: float64

Some aggregations, such as mean, require at least one non-NaN value to produce a valid result:

[5]:
df2.mean(axis="columns")
[5]:
2022-02-03    0.619304
2022-02-04   -0.068712
2022-02-05   -0.471678
2022-02-06   -0.184578
2022-02-07    0.426987
2022-02-08    0.044480
2022-02-09         NaN
Freq: D, dtype: float64

Options for reduction methods

Method   Description
axis     the axis to reduce over: 0 for the rows of the DataFrame and 1 for the columns
skipna   exclude missing values; True by default
level    reduce grouped by level if the axis is hierarchically indexed (MultiIndex); removed in pandas 2.0 in favour of groupby
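The effect of the skipna option can be sketched on a tiny, made-up DataFrame with one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration; column "a" has one gap
frame = pd.DataFrame({"a": [1.0, np.nan], "b": [3.0, 4.0]})

# By default missing values are skipped
col_sums = frame.sum()  # a -> 1.0, b -> 7.0

# With skipna=False a single NaN makes that column's result NaN
col_sums_strict = frame.sum(skipna=False)  # a -> NaN, b -> 7.0
```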

Some methods, such as idxmin and idxmax, provide indirect statistics such as the index value at which the minimum or maximum value is reached:

[6]:
df2.idxmax()
[6]:
0   2022-02-07
1   2022-02-03
2   2022-02-04
dtype: datetime64[ns]
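The contrast between label-based lookups (idxmin/idxmax) and position-based ones (argmin/argmax) can be sketched on a small Series constructed for illustration:

```python
import pandas as pd

s = pd.Series([2.0, 9.0, 4.0], index=["x", "y", "z"])

label = s.idxmax()     # index label of the maximum: "y"
position = s.argmax()  # integer position of the maximum: 1
```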

Other methods are accumulations:

[7]:
df2.cumsum()
[7]:
0 1 2
2022-02-03 0.686507 1.870769 -0.699365
2022-02-04 -0.775736 2.703812 -0.276300
2022-02-05 -0.548300 1.557019 -0.771977
2022-02-06 -0.143777 2.074136 -2.247352
2022-02-07 1.878520 1.810948 -2.725500
2022-02-08 1.822307 2.723981 -3.448879
2022-02-09 NaN NaN NaN

A third kind of method is neither a reduction nor an accumulation. describe is one such example, producing several summary statistics in one go:

[8]:
df2.describe()
[8]:
0 1 2
count 6.000000 6.000000 6.000000
mean 0.303718 0.453997 -0.574813
std 1.128201 1.043312 0.609912
min -1.462243 -1.146793 -1.475375
25% 0.014699 -0.068112 -0.717375
50% 0.315979 0.675080 -0.597521
75% 0.616011 0.893035 -0.482530
max 2.022298 1.870769 0.423066

For non-numeric data, describe generates alternative summary statistics:

[9]:
data = {
    "Code": ["U+0000", "U+0001", "U+0002", "U+0003", "U+0004", "U+0005"],
    "Octal": ["001", "002", "003", "004", "004", "005"],
}
df3 = pd.DataFrame(data)

df3.describe()
[9]:
Code Octal
count 6 6
unique 6 5
top U+0000 004
freq 1 2
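To summarise numeric and non-numeric columns in a single call, describe accepts an include argument; a sketch with a made-up mixed-type frame:

```python
import pandas as pd

# Hypothetical frame with one numeric and one string column
mixed = pd.DataFrame({"n": [1, 2, 2], "s": ["a", "b", "b"]})

# include="all" combines numeric statistics (mean, std, ...) with
# categorical ones (unique, top, freq); inapplicable cells are NaN
summary = mixed.describe(include="all")
```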

Descriptive and summary statistics:

Method          Description
count           number of non-NA values
describe        set of summary statistics for a Series or for each DataFrame column
min, max        minimum and maximum values
argmin, argmax  integer positions at which the minimum or maximum value is reached
idxmin, idxmax  index labels at which the minimum or maximum value is reached
quantile        sample quantile in the range from 0 to 1
sum             sum of the values
mean            arithmetic mean of the values
median          median (50% quantile) of the values
mad             mean absolute deviation from the mean (removed in pandas 2.0)
prod            product of all values
var             sample variance of the values
std             sample standard deviation of the values
skew            sample skewness (third moment) of the values
kurt            sample kurtosis (fourth moment) of the values
cumsum          cumulative sum of the values
cummin, cummax  cumulative minimum and maximum of the values respectively
cumprod         cumulative product of the values
diff            first arithmetic difference (useful for time series)
pct_change      percentage change between values
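Two of the time-series-oriented methods from the table, diff and pct_change, sketched on an illustrative price series:

```python
import pandas as pd

prices = pd.Series([100.0, 110.0, 99.0])

# First arithmetic difference: NaN, 10.0, -11.0
first_diff = prices.diff()

# Percentage change between consecutive values: NaN, 0.10, -0.10
returns = prices.pct_change()
```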

ydata-profiling

ydata-profiling generates profile reports from a pandas DataFrame. The pandas df.describe() function is handy, but a bit basic for exploratory data analysis. ydata-profiling extends pandas DataFrame with df.profile_report(), which automatically generates a standardised report for understanding the data.

Installation

$ pipenv install "ydata-profiling[notebook, unicode, pyspark]"
…
✔ Success!
Updated Pipfile.lock (cbc5f7)!
Installing dependencies from Pipfile.lock (cbc5f7)...
  🐍   ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 80/80  00:02:26
…
$ pipenv run jupyter nbextension enable --py widgetsnbextension
Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: OK

Example

[10]:
from ydata_profiling import ProfileReport


profile = ProfileReport(df2, title="pandas Profiling Report")

profile.to_widgets()

Configuration for large datasets

By default, ydata-profiling summarises the dataset to provide the most insights for data analysis. If the computation time of profiling becomes a bottleneck, ydata-profiling offers several alternatives to overcome it. For the following examples, we first read a larger data set into pandas:

[11]:
titanic = pd.read_csv(
    "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
)

1. Minimal mode

ydata-profiling contains a minimal configuration file config_minimal.yaml, in which the most expensive calculations are turned off by default. This is the recommended starting point for larger data sets.

[12]:
profile = ProfileReport(
    titanic, title="Minimal pandas Profiling Report", minimal=True
)

profile.to_widgets()

Further details on settings and configuration can be found in Available settings.

2. Sample

An alternative option for very large data sets is to use only a part of them for the profiling report:

[13]:
sample = titanic.sample(frac=0.05)

profile = ProfileReport(sample, title="Sample pandas Profiling Report")

profile.to_widgets()

3. Deactivate expensive calculations

To reduce the computational effort on large datasets while still getting some interesting information, expensive calculations can be restricted to certain columns:

[14]:
profile = ProfileReport()
profile.config.interactions.targets = ["Sex", "Age"]
profile.df = titanic

profile.to_widgets()

The setting interactions.targets can be changed via configuration files as well as via environment variables; see Interactions for details.

4. Concurrency

Currently, work is being done on a scalable Spark backend for ydata-profiling; see Spark Profiling Development.