Descriptive statistics

pandas objects are equipped with a number of common mathematical and statistical methods. Most of them fall into the category of reductions or summary statistics: methods that extract a single value (such as the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame. Unlike the corresponding NumPy array methods, they handle missing data.
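As a minimal sketch of this difference in missing-data handling, the following compares a plain NumPy sum with its pandas counterpart. The values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Illustrative values with one missing entry
values = np.array([1.0, 2.0, np.nan])

# The plain NumPy sum propagates NaN
numpy_sum = values.sum()

# np.nansum is NumPy's explicit NaN-aware variant
numpy_nansum = np.nansum(values)

# pandas skips missing values by default (skipna=True)
pandas_sum = pd.Series(values).sum()

print(numpy_sum, numpy_nansum, pandas_sum)
```

With NumPy, the NaN-aware behaviour has to be requested explicitly, whereas in pandas it is the default.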

[1]:
import numpy as np
import pandas as pd


rng = np.random.default_rng()
df = pd.DataFrame(
    rng.normal(size=(7, 3)),
    index=pd.date_range("2022-02-02", periods=7),
)
new_index = pd.date_range("2022-02-03", periods=7)
df2 = df.reindex(new_index)

df2
[1]:
                   0         1         2
2022-02-03 -1.118009 -0.596567  1.675063
2022-02-04  0.622485  0.460062 -1.011463
2022-02-05 -0.951856  0.188381  2.315514
2022-02-06  0.916177  0.178374  0.207760
2022-02-07  0.713193 -0.246814  1.469239
2022-02-08 -0.847348 -0.328571 -1.491281
2022-02-09       NaN       NaN       NaN

Calling the pandas.DataFrame.sum method returns a series containing column totals:

[2]:
df2.sum()
[2]:
0   -0.665358
1   -0.345136
2    3.164833
dtype: float64

Passing axis='columns' or axis=1 instead sums across the columns, giving one total per row:

[3]:
df2.sum(axis="columns")
[3]:
2022-02-03   -0.039513
2022-02-04    0.071084
2022-02-05    1.552039
2022-02-06    1.302311
2022-02-07    1.935618
2022-02-08   -2.667200
2022-02-09    0.000000
Freq: D, dtype: float64
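Because the random values above make the two directions hard to verify by eye, here is a small sketch with a hand-made frame. The names small, a, b, x and y are purely illustrative:

```python
import pandas as pd

# Tiny frame whose totals are easy to check by hand
small = pd.DataFrame({"a": [1, 2], "b": [10, 20]}, index=["x", "y"])

# Default (axis=0 or axis="index"): one total per column
column_totals = small.sum()

# axis=1 or axis="columns": one total per row
row_totals = small.sum(axis="columns")

print(column_totals)
print(row_totals)
```

The result of the default call is indexed by the column labels; with axis="columns" it is indexed by the row labels.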

If an entire row or column consists of NA values, the sum is 0. With skipna=False, NA values are no longer skipped, so every row or column that contains at least one NA value yields NaN:

[4]:
df2.sum(axis="columns", skipna=False)
[4]:
2022-02-03   -0.039513
2022-02-04    0.071084
2022-02-05    1.552039
2022-02-06    1.302311
2022-02-07    1.935618
2022-02-08   -2.667200
2022-02-09         NaN
Freq: D, dtype: float64
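If NaN is wanted only when there is nothing at all to add up, the min_count parameter of sum may be a gentler alternative to skipna=False. The following is a sketch with made-up values, not part of the original example:

```python
import numpy as np
import pandas as pd

all_missing = pd.Series([np.nan, np.nan])
partly_missing = pd.Series([1.0, np.nan])

# Default: an all-NA sum is 0.0
default_sum = all_missing.sum()

# min_count=1 requires at least one valid value, otherwise the result is NaN
guarded_sum = all_missing.sum(min_count=1)

# Partly missing data is still summed, because one valid value is present
partial_sum = partly_missing.sum(min_count=1)

print(default_sum, guarded_sum, partial_sum)
```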

Some aggregations, such as mean, require at least one non-NaN value to produce a meaningful result; otherwise they return NaN:

[5]:
df2.mean(axis="columns")
[5]:
2022-02-03   -0.013171
2022-02-04    0.023695
2022-02-05    0.517346
2022-02-06    0.434104
2022-02-07    0.645206
2022-02-08   -0.889067
2022-02-09         NaN
Freq: D, dtype: float64

Options for reduction methods

Option   Description
axis     the axis of values to reduce: 0 for the rows of the DataFrame and 1 for the columns
skipna   exclude missing values; True by default
level    reduce grouped by level if the axis is hierarchically indexed (MultiIndex); deprecated in pandas 1.3 and removed in pandas 2.0, where groupby(level=...) is used instead
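In pandas 2.0 and later, reductions no longer accept a level argument; grouping by the index level first achieves the same result. A minimal sketch, with the level names outer and inner chosen only for illustration:

```python
import pandas as pd

# Hierarchical index with two levels (names are illustrative)
index = pd.MultiIndex.from_product(
    [["a", "b"], [1, 2]], names=["outer", "inner"]
)
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

# Reduce per value of the outer level
per_outer = s.groupby(level="outer").sum()

print(per_outer)
```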

Some methods, such as idxmin and idxmax, provide indirect statistics such as the index value at which the minimum or maximum value is reached:

[6]:
df2.idxmax()
[6]:
0   2022-02-06
1   2022-02-04
2   2022-02-05
dtype: datetime64[ns]
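The difference between index labels and integer positions can be sketched with a small hand-made Series and DataFrame; all names and values here are illustrative:

```python
import pandas as pd

s = pd.Series([3.0, 7.0, 5.0], index=["a", "b", "c"])

# idxmax returns the index label of the maximum
label_of_max = s.idxmax()

# argmax (Series only) returns its integer position
position_of_max = s.argmax()

frame = pd.DataFrame({"p": [1, 9], "q": [4, 2]}, index=["x", "y"])

# With axis="columns", idxmax returns the column label of each row's maximum
per_row = frame.idxmax(axis="columns")

print(label_of_max, position_of_max)
print(per_row)
```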

Other methods are accumulations:

[7]:
df2.cumsum()
[7]:
                   0         1         2
2022-02-03 -1.118009 -0.596567  1.675063
2022-02-04 -0.495524 -0.136505  0.663600
2022-02-05 -1.447380  0.051876  2.979114
2022-02-06 -0.531203  0.230250  3.186875
2022-02-07  0.181990 -0.016564  4.656114
2022-02-08 -0.665358 -0.345136  3.164833
2022-02-09       NaN       NaN       NaN
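Accumulations also honour skipna. In the output above the NaN values sit in the last row, so the effect is not visible; the following sketch with made-up values places a missing value in the middle:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0])

# Default (skipna=True): the NaN stays in place, the running total continues
skipping = s.cumsum()

# skipna=False: everything from the first NaN onwards becomes NaN
propagating = s.cumsum(skipna=False)

print(skipping)
print(propagating)
```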

Other methods are neither reductions nor accumulations. describe is one such example; it produces several summary statistics in one go:

[8]:
df2.describe()
[8]:
              0         1         2
count  6.000000  6.000000  6.000000
mean  -0.110893 -0.057523  0.527472
std    0.952439  0.395949  1.545761
min   -1.118009 -0.596567 -1.491281
25%   -0.925729 -0.308132 -0.706657
50%   -0.112431 -0.034220  0.838500
75%    0.690516  0.185880  1.623607
max    0.916177  0.460062  2.315514
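The quantiles reported by describe can be adjusted with the percentiles parameter. A minimal sketch, using a seeded generator and the illustrative column names a and b so that it can be re-run:

```python
import numpy as np
import pandas as pd

# Seeded so that the example is reproducible; the data itself is arbitrary
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.normal(size=(100, 2)), columns=["a", "b"])

# Request the 10% and 90% quantiles instead of the default quartiles
summary = frame.describe(percentiles=[0.1, 0.9])

print(summary)
```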

For non-numeric data, describe generates alternative summary statistics:

[9]:
data = {
    "Code": ["U+0000", "U+0001", "U+0002", "U+0003", "U+0004", "U+0005"],
    "Octal": ["001", "002", "003", "004", "004", "005"],
}
df3 = pd.DataFrame(data)

df3.describe()
[9]:
          Code Octal
count        6     6
unique       6     5
top     U+0000   004
freq         1     2
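For a DataFrame that mixes numeric and non-numeric columns, describe only reports the numeric columns by default. Passing include="all" combines both kinds of statistics, as in this sketch with a made-up frame:

```python
import pandas as pd

mixed = pd.DataFrame(
    {"value": [1.0, 2.0, 3.0], "label": ["a", "b", "a"]}
)

# Default: only the numeric column is summarised
numeric_only = mixed.describe()

# include="all": numeric and non-numeric statistics side by side;
# statistics that do not apply to a column are filled with NaN
everything = mixed.describe(include="all")

print(numeric_only)
print(everything)
```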

Descriptive and summary statistics:

Method           Description
count            number of non-NA values
describe         calculation of a set of summary statistics for a Series or each DataFrame column
min, max         calculation of minimum and maximum values
argmin, argmax   calculation of the integer positions at which the minimum or maximum value is reached (Series only)
idxmin, idxmax   calculation of the index labels at which the minimum or maximum value is reached
quantile         calculation of the sample quantile in the range from 0 to 1
sum              sum of the values
mean             arithmetic mean of the values
median           median (50% quantile) of the values
mad              mean absolute deviation from the mean value (deprecated in pandas 1.5 and removed in pandas 2.0)
prod             product of all values
var              sample variance of the values
std              sample standard deviation of the values
skew             sample skewness (third moment) of the values
kurt             sample kurtosis (fourth moment) of the values
cumsum           cumulative sum of the values
cummin, cummax   cumulative minimum and maximum of the values respectively
cumprod          cumulative product of the values
diff             calculation of the first arithmetic difference (useful for time series)
pct_change       calculation of the percentage changes
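Two of the entries above can be sketched with a short, made-up series: diff and pct_change for step-to-step changes, and a manual mean absolute deviation, since the mad method is not available in pandas 2.0 and later:

```python
import pandas as pd

s = pd.Series([100.0, 110.0, 99.0])

# First differences: NaN, 10.0, -11.0
differences = s.diff()

# Relative changes with respect to the previous value: NaN, 0.1, -0.1
changes = s.pct_change()

# Mean absolute deviation from the mean, computed by hand
mean_abs_dev = (s - s.mean()).abs().mean()

print(differences)
print(changes)
print(mean_abs_dev)
```

The first element of diff and pct_change is always NaN, because there is no previous value to compare against.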

ydata-profiling

ydata-profiling generates profile reports from a pandas DataFrame. The pandas df.describe() function is handy, but a bit basic for exploratory data analysis. ydata-profiling extends pandas DataFrame with df.profile_report(), which automatically generates a standardised report for understanding the data.

Installation

$ uv add standard-imghdr legacy-cgi "ydata-profiling[notebook, unicode]"
Resolved 251 packages in 2.53s
Prepared 1 package in 106ms
Installed 24 packages in 155ms
 + annotated-types==0.7.0
 + dacite==1.9.2
 + htmlmin==0.1.12
 + imagehash==4.3.1
 + legacy-cgi==2.6.3
 + llvmlite==0.44.0
 + multimethod==1.12
 + networkx==3.5
 + numba==0.61.0
 + patsy==1.0.1
 + phik==0.12.4
 + puremagic==1.29
 + pydantic==2.11.7
 + pydantic-core==2.33.2
 + pywavelets==1.8.0
 + seaborn==0.13.2
 + standard-imghdr==3.13.0
 + statsmodels==0.14.4
 + tangled-up-in-unicode==0.2.0
 + typeguard==4.4.2
 + typing-inspection==0.4.1
 + visions==0.8.1
 + wordcloud==1.9.4
 + ydata-profiling==4.16.1
$ uv run jupyter notebook

The imghdr and cgi modules were removed in Python 3.13; see also PEP 594. As a workaround for packages that still depend on them, standard-imghdr and legacy-cgi are available on the Python Package Index.

Example

[10]:
from ydata_profiling import ProfileReport


profile = ProfileReport(df2, title="pandas Profiling Report")

profile.to_notebook_iframe()

100%|██████████| 3/3 [00:00<00:00, 68015.74it/s]
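Outside a notebook, the same report can be written to a standalone HTML file with to_file. The following is a sketch that assumes ydata-profiling is installed; the helper name save_profile_report and the default file name report.html are hypothetical:

```python
import pandas as pd


def save_profile_report(frame: pd.DataFrame, path: str = "report.html") -> None:
    """Write a profile report for ``frame`` to an HTML file (hypothetical helper)."""
    # Imported lazily so that this module still loads without ydata-profiling
    from ydata_profiling import ProfileReport

    profile = ProfileReport(frame, title="pandas Profiling Report")
    profile.to_file(path)
```

Calling save_profile_report(df2) would then create report.html in the current working directory.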

Configuration for large datasets

By default, ydata-profiling summarises the dataset to provide the most insights for data analysis. If the computation time of profiling becomes a bottleneck, ydata-profiling offers several alternatives to overcome it. For the following examples, we first read a larger data set into pandas:

[11]:
titanic = pd.read_csv(
    "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv",
)

1. Minimal mode

ydata-profiling contains a minimal configuration file config_minimal.yaml, in which the most expensive calculations are turned off by default. This is the recommended starting point for larger data sets.

[12]:
profile = ProfileReport(
    titanic,
    title="Minimal pandas Profiling Report",
    minimal=True,
)

profile.to_notebook_iframe()

  0%|          | 0/12 [00:00<?, ?it/s]
100%|██████████| 12/12 [00:00<00:00, 109.46it/s]

Further details on settings and configuration can be found in Available settings.

2. Sample

An alternative option for very large data sets is to use only a part of them for the profiling report:

[13]:
sample = titanic.sample(frac=0.05)

profile = ProfileReport(sample, title="Sample pandas Profiling Report")

profile.to_notebook_iframe()

100%|██████████| 12/12 [00:00<00:00, 705.30it/s]
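Since sample draws rows at random, the report differs from run to run. If a reproducible report is preferred, a fixed random_state can be passed. The sketch below uses a synthetic frame as a stand-in for a large dataset; the seed 42 and the column names are arbitrary:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a large dataset
large = pd.DataFrame(
    {"id": np.arange(1000), "value": np.arange(1000) * 0.5}
)

# The same seed yields the same 5% sample on every call
first = large.sample(frac=0.05, random_state=42)
second = large.sample(frac=0.05, random_state=42)

print(len(first), first.equals(second))
```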

3. Deactivate expensive calculations

To reduce the computational effort for large datasets while still obtaining some interesting information, some calculations can be restricted to certain columns:

[14]:
profile = ProfileReport()
profile.config.interactions.targets = ["Sex", "Age"]
profile.df = titanic

profile.to_notebook_iframe()

100%|██████████| 12/12 [00:00<00:00, 267721.53it/s]