Descriptive statistics
pandas objects are equipped with a number of common mathematical and statistical methods. Most of them are reductions or summary statistics: methods that extract a single value (such as the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame. Unlike the corresponding methods of NumPy arrays, they handle missing data.
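As a minimal sketch of this difference (the sample values are made up for illustration), the plain NumPy sum propagates NaN, whereas the pandas reduction skips it by default:

```python
import numpy as np
import pandas as pd

values = [1.0, np.nan, 3.0]  # illustrative data only

numpy_total = np.array(values).sum()    # NaN propagates through the sum
pandas_total = pd.Series(values).sum()  # NaN is skipped by default

print(numpy_total)
print(pandas_total)
```

NumPy offers np.nansum for the same purpose, but it has to be chosen explicitly.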
[1]:
import numpy as np
import pandas as pd
rng = np.random.default_rng()
df = pd.DataFrame(
    rng.normal(size=(7, 3)),
    index=pd.date_range("2022-02-02", periods=7),
)
new_index = pd.date_range("2022-02-03", periods=7)
df2 = df.reindex(new_index)
df2
[1]:
| | 0 | 1 | 2 |
|---|---|---|---|
| 2022-02-03 | -1.118009 | -0.596567 | 1.675063 |
| 2022-02-04 | 0.622485 | 0.460062 | -1.011463 |
| 2022-02-05 | -0.951856 | 0.188381 | 2.315514 |
| 2022-02-06 | 0.916177 | 0.178374 | 0.207760 |
| 2022-02-07 | 0.713193 | -0.246814 | 1.469239 |
| 2022-02-08 | -0.847348 | -0.328571 | -1.491281 |
| 2022-02-09 | NaN | NaN | NaN |
Calling the pandas.DataFrame.sum method returns a series containing column totals:
[2]:
df2.sum()
[2]:
0 -0.665358
1 -0.345136
2 3.164833
dtype: float64
Passing axis='columns' or axis=1 instead sums across the columns, returning one total per row:
[3]:
df2.sum(axis="columns")
[3]:
2022-02-03 -0.039513
2022-02-04 0.071084
2022-02-05 1.552039
2022-02-06 1.302311
2022-02-07 1.935618
2022-02-08 -2.667200
2022-02-09 0.000000
Freq: D, dtype: float64
If a row or column consists entirely of NA values, the sum is 0. Passing skipna=False turns off the skipping of missing values, so that any NA value in a row or column makes its sum NaN:
[4]:
df2.sum(axis="columns", skipna=False)
[4]:
2022-02-03 -0.039513
2022-02-04 0.071084
2022-02-05 1.552039
2022-02-06 1.302311
2022-02-07 1.935618
2022-02-08 -2.667200
2022-02-09 NaN
Freq: D, dtype: float64
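A related option is the min_count parameter of sum, which sets how many non-NA values are required before a result is returned. The following is a minimal sketch on a made-up frame (the names frame, default_sum, strict_sum and no_skip are illustrative); it contrasts the default, min_count=1 and skipna=False:

```python
import numpy as np
import pandas as pd

# Illustrative frame: row 1 is partly missing, row 2 is entirely missing
frame = pd.DataFrame({"a": [1.0, 3.0, np.nan], "b": [2.0, np.nan, np.nan]})

default_sum = frame.sum(axis="columns")              # all-NA row sums to 0.0
strict_sum = frame.sum(axis="columns", min_count=1)  # all-NA row becomes NaN
no_skip = frame.sum(axis="columns", skipna=False)    # any NA makes the row NaN

print(pd.DataFrame({"default": default_sum, "min_count=1": strict_sum, "skipna=False": no_skip}))
```

With min_count=1, partly filled rows are still summed, and only rows without any valid value yield NaN.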
Some aggregations, such as mean, require at least one non-NaN value to yield a meaningful result; otherwise they return NaN:
[5]:
df2.mean(axis="columns")
[5]:
2022-02-03 -0.013171
2022-02-04 0.023695
2022-02-05 0.517346
2022-02-06 0.434104
2022-02-07 0.645206
2022-02-08 -0.889067
2022-02-09 NaN
Freq: D, dtype: float64
Options for reduction methods
| Option | Description |
|---|---|
| axis | the axis of values to reduce: 0 for the rows of a DataFrame, 1 for the columns |
| skipna | exclude missing values; by default True |
| level | reduce grouped by level if the axis is hierarchically indexed (MultiIndex) |
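In pandas 2.0 and later, the reduction methods no longer accept a level argument; a reduction grouped by an index level can instead be written with groupby. The following is a minimal sketch under that assumption, using a made-up Series with a two-level index (the names scores, key and num are illustrative):

```python
import pandas as pd

# Illustrative data with a hierarchical (MultiIndex) index
index = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("b", 1)],
    names=["key", "num"],
)
scores = pd.Series([1.0, 2.0, 4.0], index=index)

# Sum the values separately for each label of the outer level "key"
per_key = scores.groupby(level="key").sum()

print(per_key)
```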
Some methods, such as idxmin and idxmax, provide indirect statistics such as the index value at which the minimum or maximum value is reached:
[6]:
df2.idxmax()
[6]:
0 2022-02-06
1 2022-02-04
2 2022-02-05
dtype: datetime64[ns]
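Because idxmax returns an index label, the result can be passed directly to loc to look up the matching value. A minimal sketch with made-up values (temps and the dates are illustrative):

```python
import pandas as pd

# Illustrative series with a datetime index
temps = pd.Series(
    [3.5, 7.1, 5.0],
    index=pd.to_datetime(["2022-02-03", "2022-02-04", "2022-02-05"]),
)

warmest_day = temps.idxmax()            # index label of the maximum
warmest_value = temps.loc[warmest_day]  # value stored at that label

print(warmest_day, warmest_value)
```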
Other methods are accumulations:
[7]:
df2.cumsum()
[7]:
| | 0 | 1 | 2 |
|---|---|---|---|
| 2022-02-03 | -1.118009 | -0.596567 | 1.675063 |
| 2022-02-04 | -0.495524 | -0.136505 | 0.663600 |
| 2022-02-05 | -1.447380 | 0.051876 | 2.979114 |
| 2022-02-06 | -0.531203 | 0.230250 | 3.186875 |
| 2022-02-07 | 0.181990 | -0.016564 | 4.656114 |
| 2022-02-08 | -0.665358 | -0.345136 | 3.164833 |
| 2022-02-09 | NaN | NaN | NaN |
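In the output above, the missing values happen to be in the last row. The following minimal sketch, on a made-up Series, shows what happens when a NaN occurs in the middle: by default it stays NaN in the result while the running total continues, whereas skipna=False turns every later entry into NaN:

```python
import numpy as np
import pandas as pd

steps = pd.Series([1.0, np.nan, 2.0, 3.0])  # illustrative data only

running = steps.cumsum()             # NaN stays in place, the total carries on
strict = steps.cumsum(skipna=False)  # everything from the first NaN on is NaN

print(pd.DataFrame({"running": running, "strict": strict}))
```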
A third kind of method is neither a reduction nor an accumulation. describe is one example; it produces several summary statistics in one go:
[8]:
df2.describe()
[8]:
| | 0 | 1 | 2 |
|---|---|---|---|
| count | 6.000000 | 6.000000 | 6.000000 |
| mean | -0.110893 | -0.057523 | 0.527472 |
| std | 0.952439 | 0.395949 | 1.545761 |
| min | -1.118009 | -0.596567 | -1.491281 |
| 25% | -0.925729 | -0.308132 | -0.706657 |
| 50% | -0.112431 | -0.034220 | 0.838500 |
| 75% | 0.690516 | 0.185880 | 1.623607 |
| max | 0.916177 | 0.460062 | 2.315514 |
For non-numeric data, describe generates alternative summary statistics:
[9]:
data = {
    "Code": ["U+0000", "U+0001", "U+0002", "U+0003", "U+0004", "U+0005"],
    "Octal": ["001", "002", "003", "004", "004", "005"],
}
df3 = pd.DataFrame(data)
df3.describe()
[9]:
| | Code | Octal |
|---|---|---|
| count | 6 | 6 |
| unique | 6 | 5 |
| top | U+0000 | 004 |
| freq | 1 | 2 |
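describe also accepts options that control its output. As a minimal sketch on a made-up frame (mixed, value and label are illustrative names), percentiles selects the quantiles to report and include="all" summarises numeric and non-numeric columns together:

```python
import pandas as pd

# Illustrative frame with one numeric and one text column
mixed = pd.DataFrame(
    {"value": [1.0, 2.0, 3.0, 4.0], "label": ["a", "b", "b", "c"]}
)

# Report the 10% and 90% quantiles of the numeric column
numeric_summary = mixed.describe(percentiles=[0.1, 0.9])

# Summarise every column; statistics that do not apply are filled with NaN
full_summary = mixed.describe(include="all")

print(numeric_summary)
print(full_summary)
```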
Descriptive and summary statistics:
| Method | Description |
|---|---|
| count | number of non-NA values |
| describe | calculation of a set of summary statistics for a Series or each DataFrame column |
| min, max | calculation of minimum and maximum values |
| argmin, argmax | calculation of the index positions (integers) at which the minimum or maximum value was reached |
| idxmin, idxmax | calculation of the index labels at which the minimum or maximum values were reached |
| quantile | calculation of the sample quantile in the range from 0 to 1 |
| sum | sum of the values |
| mean | arithmetic mean of the values |
| median | arithmetic median (50% quantile) of the values |
| mad | mean absolute deviation from the mean value (removed in pandas 2.0) |
| prod | product of all values |
| var | sample variance of the values |
| std | sample standard deviation of the values |
| skew | sample skewness (third moment) of the values |
| kurt | sample kurtosis (fourth moment) of the values |
| cumsum | cumulative sum of the values |
| cummin, cummax | cumulative minimum and maximum of the values respectively |
| cumprod | cumulative product of the values |
| diff | calculation of the first arithmetic difference (useful for time series) |
| pct_change | calculation of the percentage changes |
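A few of these methods are sketched below on a made-up Series (prices is an illustrative name). quantile is a reduction, while diff and pct_change return a Series of the same length whose first entry is NaN because it has no predecessor:

```python
import pandas as pd

prices = pd.Series([10.0, 12.0, 9.0, 18.0])  # illustrative data only

median = prices.quantile(0.5)          # 50% quantile, same as prices.median()
differences = prices.diff()            # change compared with the previous value
relative_change = prices.pct_change()  # change as a fraction of the previous value

print(median)
print(differences)
print(relative_change)
```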
ydata-profiling
ydata-profiling generates profile reports from a pandas DataFrame. The pandas df.describe() function is handy, but a bit basic for exploratory data analysis. ydata-profiling extends pandas DataFrame with df.profile_report(), which automatically generates a standardised report for understanding the data.
Installation
$ uv add standard-imghdr legacy-cgi "ydata-profiling[notebook, unicode]"
Resolved 251 packages in 2.53s
Prepared 1 package in 106ms
Installed 24 packages in 155ms
+ annotated-types==0.7.0
+ dacite==1.9.2
+ htmlmin==0.1.12
+ imagehash==4.3.1
+ legacy-cgi==2.6.3
+ llvmlite==0.44.0
+ multimethod==1.12
+ networkx==3.5
+ numba==0.61.0
+ patsy==1.0.1
+ phik==0.12.4
+ puremagic==1.29
+ pydantic==2.11.7
+ pydantic-core==2.33.2
+ pywavelets==1.8.0
+ seaborn==0.13.2
+ standard-imghdr==3.13.0
+ statsmodels==0.14.4
+ tangled-up-in-unicode==0.2.0
+ typeguard==4.4.2
+ typing-inspection==0.4.1
+ visions==0.8.1
+ wordcloud==1.9.4
+ ydata-profiling==4.16.1
$ uv run jupyter notebook
The imghdr and cgi modules were removed in Python 3.13 (see also PEP 594). As a workaround for packages that still depend on them, standard-imghdr and legacy-cgi are available from the Python Package Index.
Example
[10]:
from ydata_profiling import ProfileReport
profile = ProfileReport(df2, title="pandas Profiling Report")
profile.to_notebook_iframe()
100%|██████████████████████████████████████████| 3/3 [00:00<00:00, 68015.74it/s]
Configuration for large datasets
By default, ydata-profiling summarises the dataset to provide the most insights for data analysis. If the computation time of profiling becomes a bottleneck, ydata-profiling offers several alternatives to overcome it. For the following examples, we first read a larger dataset into pandas:
[11]:
titanic = pd.read_csv(
    "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv",
)
1. Minimal mode
ydata-profiling contains a minimal configuration file config_minimal.yaml, in which the most expensive calculations are turned off by default. This is the recommended starting point for larger data sets.
[12]:
profile = ProfileReport(
    titanic,
    title="Minimal pandas Profiling Report",
    minimal=True,
)
profile.to_notebook_iframe()
0%| | 0/12 [00:00<?, ?it/s]
100%|██████████████████████████████████████████| 12/12 [00:00<00:00, 109.46it/s]
Further details on settings and configuration can be found in Available settings.
2. Sample
An alternative option for very large data sets is to use only a part of them for the profiling report:
[13]:
sample = titanic.sample(frac=0.05)
profile = ProfileReport(sample, title="Sample pandas Profiling Report")
profile.to_notebook_iframe()
100%|██████████████████████████████████████████| 12/12 [00:00<00:00, 705.30it/s]
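Since sample draws rows at random, the report changes from run to run. If a reproducible subset is wanted, the random_state parameter of sample can be fixed. A minimal sketch, using a made-up frame as a stand-in for a large dataset (big and the seed 42 are illustrative choices):

```python
import pandas as pd

# Stand-in for a large dataset; the example above uses the Titanic CSV instead
big = pd.DataFrame({"x": range(1000)})

# The same seed yields the same 5% subset on every call
first = big.sample(frac=0.05, random_state=42)
second = big.sample(frac=0.05, random_state=42)

print(len(first), first.equals(second))
```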
3. Deactivate expensive calculations
To reduce the computational effort for large datasets while still obtaining some interesting information, some calculations can be restricted to certain columns:
[14]:
profile = ProfileReport()
profile.config.interactions.targets = ["Sex", "Age"]
profile.df = titanic
profile.to_notebook_iframe()
100%|███████████████████████████████████████| 12/12 [00:00<00:00, 267721.53it/s]