Descriptive statistics¶
pandas objects are equipped with a number of common mathematical and statistical methods. Most of them fall into the category of reductions or summary statistics: methods that extract a single value (such as the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame. Unlike the equivalent methods on NumPy arrays, they also handle missing data.
[1]:
import numpy as np
import pandas as pd
rng = np.random.default_rng()
df = pd.DataFrame(
    rng.normal(size=(7, 3)), index=pd.date_range("2022-02-02", periods=7)
)
new_index = pd.date_range("2022-02-03", periods=7)
df2 = df.reindex(new_index)
df2
[1]:
0 | 1 | 2 | |
---|---|---|---|
2022-02-03 | 0.233008 | -0.443973 | 1.358824 |
2022-02-04 | -1.483024 | -1.729070 | -1.904267 |
2022-02-05 | 0.277089 | -1.332157 | -0.190171 |
2022-02-06 | -0.753274 | -0.960732 | -1.524737 |
2022-02-07 | -0.100957 | 0.423734 | 0.085448 |
2022-02-08 | 1.150540 | 0.811464 | -0.406824 |
2022-02-09 | NaN | NaN | NaN |
Calling the pandas.DataFrame.sum method returns a Series containing the column totals:
[2]:
df2.sum()
[2]:
0 -0.676619
1 -3.230734
2 -2.581727
dtype: float64
Passing axis='columns' or axis=1 instead sums across the columns, giving one total per row:
[3]:
df2.sum(axis="columns")
[3]:
2022-02-03 1.147859
2022-02-04 -5.116361
2022-02-05 -1.245239
2022-02-06 -3.238743
2022-02-07 0.408226
2022-02-08 1.555179
2022-02-09 0.000000
Freq: D, dtype: float64
If an entire row or column consists of NA values only, the sum is 0. This can be disabled with the skipna option:
[4]:
df2.sum(axis="columns", skipna=False)
[4]:
2022-02-03 1.147859
2022-02-04 -5.116361
2022-02-05 -1.245239
2022-02-06 -3.238743
2022-02-07 0.408226
2022-02-08 1.555179
2022-02-09 NaN
Freq: D, dtype: float64
Some aggregations, such as mean, require at least one non-NaN value to produce a meaningful result:
[5]:
df2.mean(axis="columns")
[5]:
2022-02-03 0.382620
2022-02-04 -1.705454
2022-02-05 -0.415080
2022-02-06 -1.079581
2022-02-07 0.136075
2022-02-08 0.518393
2022-02-09 NaN
Freq: D, dtype: float64
Options for reduction methods¶
Method | Description |
---|---|
axis | the axis of values to reduce: index (0) for the rows of a DataFrame, columns (1) for its columns |
skipna | exclude missing values; True by default |
level | reduce grouped by level if the axis is hierarchically indexed (MultiIndex) |
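The level option requires a hierarchically indexed axis. In recent pandas versions, level-wise reduction is expressed as a groupby over an index level rather than as an argument to the reduction method itself; a minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical hierarchically indexed Series (MultiIndex) for illustration.
idx = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=["key", "num"])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# Reduce per level "key"; current pandas writes this as a groupby over the
# index level instead of s.sum(level="key").
print(s.groupby(level="key").sum())
```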
Some methods, such as idxmin and idxmax, provide indirect statistics, such as the index value at which the minimum or maximum value is reached:
[6]:
df2.idxmax()
[6]:
0 2022-02-08
1 2022-02-08
2 2022-02-03
dtype: datetime64[ns]
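idxmin works analogously and returns the row label of each column's minimum; a small sketch with hypothetical data:

```python
import pandas as pd

# Made-up frame; the index labels are arbitrary.
df = pd.DataFrame(
    {"a": [3.0, 1.0, 2.0], "b": [0.5, 2.5, 1.5]}, index=["x", "y", "z"]
)
print(df.idxmin())  # label of the minimum per column
```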
Other methods are accumulations:
[7]:
df2.cumsum()
[7]:
0 | 1 | 2 | |
---|---|---|---|
2022-02-03 | 0.233008 | -0.443973 | 1.358824 |
2022-02-04 | -1.250016 | -2.173043 | -0.545444 |
2022-02-05 | -0.972928 | -3.505199 | -0.735614 |
2022-02-06 | -1.726201 | -4.465931 | -2.260351 |
2022-02-07 | -1.827158 | -4.042197 | -2.174903 |
2022-02-08 | -0.676619 | -3.230734 | -2.581727 |
2022-02-09 | NaN | NaN | NaN |
Some methods are neither reductions nor accumulations. describe is one such example; it produces several summary statistics in one go:
[8]:
df2.describe()
[8]:
0 | 1 | 2 | |
---|---|---|---|
count | 6.000000 | 6.000000 | 6.000000 |
mean | -0.112770 | -0.538456 | -0.430288 |
std | 0.911645 | 0.998284 | 1.174355 |
min | -1.483024 | -1.729070 | -1.904267 |
25% | -0.590194 | -1.239301 | -1.245259 |
50% | 0.066026 | -0.702352 | -0.298498 |
75% | 0.266069 | 0.206808 | 0.016544 |
max | 1.150540 | 0.811464 | 1.358824 |
For non-numeric data, describe
generates alternative summary statistics:
[9]:
data = {
    "Code": ["U+0000", "U+0001", "U+0002", "U+0003", "U+0004", "U+0005"],
    "Octal": ["001", "002", "003", "004", "004", "005"],
}
df3 = pd.DataFrame(data)
df3.describe()
[9]:
Code | Octal | |
---|---|---|
count | 6 | 6 |
unique | 6 | 5 |
top | U+0000 | 004 |
freq | 1 | 2 |
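The unique, top and freq rows correspond to what Series.nunique and Series.value_counts report individually; a quick sketch on the same octal values:

```python
import pandas as pd

octal = pd.Series(["001", "002", "003", "004", "004", "005"])
counts = octal.value_counts()  # sorted by frequency, most frequent first

print(octal.nunique())  # "unique": number of distinct values
print(counts.index[0])  # "top": the most frequent value
print(counts.iloc[0])   # "freq": how often it occurs
```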
Descriptive and summary statistics:
Method | Description |
---|---|
count | number of non-NA values |
describe | calculation of a set of summary statistics for a series or each DataFrame column |
min, max | calculation of minimum and maximum values |
argmin, argmax | calculation of the index points (integers) at which the minimum or maximum value was reached |
idxmin, idxmax | calculation of the index labels at which the minimum or maximum values were reached |
quantile | calculation of the sample quantile in the range from 0 to 1 |
sum | sum of the values |
mean | arithmetic mean of the values |
median | arithmetic median (50% quantile) of the values |
mad | mean absolute deviation from the mean value |
prod | product of all values |
var | sample variance of the values |
std | sample standard deviation of the values |
skew | sample skewness (third moment) of the values |
kurt | sample kurtosis (fourth moment) of the values |
cumsum | cumulative sum of the values |
cummin, cummax | cumulated minimum and maximum of the values respectively |
cumprod | cumulated product of the values |
diff | calculation of the first arithmetic difference (useful for time series) |
pct_change | calculation of the percentage changes |
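Of these methods, diff and pct_change are particularly common for time series; a minimal sketch on a made-up Series:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0, 8.0])

# diff: difference to the previous element; the first entry has no
# predecessor and becomes NaN.
print(s.diff())

# pct_change: relative change to the previous element; doubling each
# step yields a constant change of 1.0 (i.e. +100%).
print(s.pct_change())
```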
ydata-profiling¶
ydata-profiling generates profile reports from a pandas DataFrame. The pandas df.describe()
function is handy, but a bit basic for exploratory data analysis. ydata-profiling extends pandas DataFrame with df.profile_report()
, which automatically generates a standardised report for understanding the data.
Installation¶
$ uv add standard-imghdr legacy-cgi "ydata-profiling[notebook, unicode]"
Resolved 251 packages in 2.53s
Prepared 1 package in 106ms
Installed 24 packages in 155ms
+ annotated-types==0.7.0
+ dacite==1.9.2
+ htmlmin==0.1.12
+ imagehash==4.3.1
+ legacy-cgi==2.6.3
+ llvmlite==0.44.0
+ multimethod==1.12
+ networkx==3.5
+ numba==0.61.0
+ patsy==1.0.1
+ phik==0.12.4
+ puremagic==1.29
+ pydantic==2.11.7
+ pydantic-core==2.33.2
+ pywavelets==1.8.0
+ seaborn==0.13.2
+ standard-imghdr==3.13.0
+ statsmodels==0.14.4
+ tangled-up-in-unicode==0.2.0
+ typeguard==4.4.2
+ typing-inspection==0.4.1
+ visions==0.8.1
+ wordcloud==1.9.4
+ ydata-profiling==4.16.1
$ uv run jupyter notebook
In Python 3.13, the imghdr and cgi modules were removed (see also PEP 594). As a workaround, the standard-imghdr and legacy-cgi packages on the Python Package Index provide these legacy modules.
Example¶
[10]:
from ydata_profiling import ProfileReport
profile = ProfileReport(df2, title="pandas Profiling Report")
profile.to_notebook_iframe()
Improve your data and profiling with ydata-sdk, featuring data quality scoring, redundancy detection, outlier identification, text validation, and synthetic data generation.
100%|██████████████████████████████████████████| 3/3 [00:00<00:00, 71493.82it/s]
Configuration for large datasets¶
By default, ydata-profiling summarises the dataset to provide the most insights for data analysis. If the profiling computation time becomes a bottleneck, ydata-profiling offers several alternatives to overcome it. For the following examples, we first read a larger dataset into pandas:
[11]:
titanic = pd.read_csv(
    "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
)
1. minimal mode¶
ydata-profiling contains a minimal configuration file, config_minimal.yaml, in which the most expensive calculations are turned off by default. This is the recommended starting point for larger datasets.
[12]:
profile = ProfileReport(
    titanic, title="Minimal pandas Profiling Report", minimal=True
)
profile.to_notebook_iframe()
100%|██████████████████████████████████████████| 12/12 [00:00<00:00, 108.77it/s]
Further details on settings and configuration can be found in Available settings.
2. Sample¶
An alternative for very large datasets is to build the profiling report from only a sample of the data:
[13]:
sample = titanic.sample(frac=0.05)
profile = ProfileReport(sample, title="Sample pandas Profiling Report")
profile.to_notebook_iframe()
100%|██████████████████████████████████████████| 12/12 [00:00<00:00, 454.29it/s]
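The rows drawn with frac change on every call; passing random_state (not used in the cell above, added here as an assumption for illustration) makes the sample reproducible. A small sketch on made-up data:

```python
import pandas as pd

# Hypothetical stand-in for a large dataset.
df = pd.DataFrame({"x": range(100)})

# frac=0.05 keeps 5% of the rows; random_state pins the selection so
# repeated runs profile the same subset.
sample = df.sample(frac=0.05, random_state=42)
print(len(sample))
```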
3. Deactivate expensive calculations¶
To reduce the computational effort on large datasets while still obtaining some interesting information, expensive calculations can be restricted to certain columns:
[14]:
profile = ProfileReport()
profile.config.interactions.targets = ["Sex", "Age"]
profile.df = titanic
profile.to_notebook_iframe()
100%|███████████████████████████████████████| 12/12 [00:00<00:00, 260785.74it/s]