Descriptive statistics¶
pandas objects come with a number of common mathematical and statistical methods. Most of them fall into the category of reductions or summary statistics: methods that extract a single value (such as the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame. Unlike the comparable methods of NumPy arrays, they also handle missing data.
[1]:
import numpy as np
import pandas as pd
rng = np.random.default_rng()
df = pd.DataFrame(
    rng.normal(size=(7, 3)), index=pd.date_range("2022-02-02", periods=7)
)
new_index = pd.date_range("2022-02-03", periods=7)
df2 = df.reindex(new_index)
df2
[1]:
0 | 1 | 2 | |
---|---|---|---|
2022-02-03 | 0.686507 | 1.870769 | -0.699365 |
2022-02-04 | -1.462243 | 0.833043 | 0.423066 |
2022-02-05 | 0.227436 | -1.146793 | -0.495678 |
2022-02-06 | 0.404523 | 0.517117 | -1.475375 |
2022-02-07 | 2.022298 | -0.263188 | -0.478148 |
2022-02-08 | -0.056213 | 0.913033 | -0.723379 |
2022-02-09 | NaN | NaN | NaN |
Calling the pandas.DataFrame.sum method returns a Series containing the column totals:
[2]:
df2.sum()
[2]:
0 1.822307
1 2.723981
2 -3.448879
dtype: float64
Passing axis='columns' or axis=1 instead sums across the columns, i.e. per row:
[3]:
df2.sum(axis="columns")
[3]:
2022-02-03 1.857911
2022-02-04 -0.206135
2022-02-05 -1.415035
2022-02-06 -0.553735
2022-02-07 1.280962
2022-02-08 0.133441
2022-02-09 0.000000
Freq: D, dtype: float64
If an entire row or column contains only NA values, the sum is 0. This behaviour can be disabled with the skipna option:
[4]:
df2.sum(axis="columns", skipna=False)
[4]:
2022-02-03 1.857911
2022-02-04 -0.206135
2022-02-05 -1.415035
2022-02-06 -0.553735
2022-02-07 1.280962
2022-02-08 0.133441
2022-02-09 NaN
Freq: D, dtype: float64
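If summing an all-NA row or column to 0 is not desired, sum (and prod) also accept a min_count parameter: if fewer than min_count non-NA values are present, the result is NaN. A small sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, np.nan])

# Default: an all-NA series sums to 0
print(s.sum())             # 0.0

# Require at least one valid value, otherwise return NaN
print(s.sum(min_count=1))  # nan
```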
Some aggregations, such as mean, require at least one non-NaN value to produce a meaningful result:
[5]:
df2.mean(axis="columns")
[5]:
2022-02-03 0.619304
2022-02-04 -0.068712
2022-02-05 -0.471678
2022-02-06 -0.184578
2022-02-07 0.426987
2022-02-08 0.044480
2022-02-09 NaN
Freq: D, dtype: float64
Options for reduction methods¶
Method | Description
---|---
axis | the axis of values to reduce: 0 for the DataFrame's rows, 1 for its columns
skipna | exclude missing values; True by default
level | reduce grouped by level if the axis is hierarchically indexed (MultiIndex)
Some methods, such as idxmin
and idxmax
, provide indirect statistics such as the index value at which the minimum or maximum value is reached:
[6]:
df2.idxmax()
[6]:
0 2022-02-07
1 2022-02-03
2 2022-02-04
dtype: datetime64[ns]
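Closely related to idxmin and idxmax are Series.argmin and Series.argmax, which return the integer position instead of the index label. A small sketch of the difference:

```python
import pandas as pd

s = pd.Series([3, 1, 4], index=["a", "b", "c"])

print(s.idxmax())  # 'c' – index label of the maximum
print(s.argmax())  # 2   – integer position of the maximum
```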
Other methods are accumulations:
[7]:
df2.cumsum()
[7]:
0 | 1 | 2 | |
---|---|---|---|
2022-02-03 | 0.686507 | 1.870769 | -0.699365 |
2022-02-04 | -0.775736 | 2.703812 | -0.276300 |
2022-02-05 | -0.548300 | 1.557019 | -0.771977 |
2022-02-06 | -0.143777 | 2.074136 | -2.247352 |
2022-02-07 | 1.878520 | 1.810948 | -2.725500 |
2022-02-08 | 1.822307 | 2.723981 | -3.448879 |
2022-02-09 | NaN | NaN | NaN |
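The cumulative methods also skip missing values by default, but leave them in place in the result; with skipna=False, NaN propagates from the first missing value onwards. A small sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0, 3.0])

# Default: NaN stays at its position, the running sum continues after it
print(s.cumsum())

# skipna=False: everything from the first NaN onwards becomes NaN
print(s.cumsum(skipna=False))
```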
Another type of method is neither a reduction nor an accumulation. describe is one such example, producing several summary statistics in one go:
[8]:
df2.describe()
[8]:
0 | 1 | 2 | |
---|---|---|---|
count | 6.000000 | 6.000000 | 6.000000 |
mean | 0.303718 | 0.453997 | -0.574813 |
std | 1.128201 | 1.043312 | 0.609912 |
min | -1.462243 | -1.146793 | -1.475375 |
25% | 0.014699 | -0.068112 | -0.717375 |
50% | 0.315979 | 0.675080 | -0.597521 |
75% | 0.616011 | 0.893035 | -0.482530 |
max | 2.022298 | 1.870769 | 0.423066 |
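The quantiles reported by describe can be customised with its percentiles parameter (the median is always included). A small sketch:

```python
import pandas as pd

s = pd.Series(range(1, 11))

# Report the 10% and 90% quantiles instead of the default 25%/50%/75%
print(s.describe(percentiles=[0.1, 0.9]))
```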
For non-numeric data, describe
generates alternative summary statistics:
[9]:
data = {
    "Code": ["U+0000", "U+0001", "U+0002", "U+0003", "U+0004", "U+0005"],
    "Octal": ["001", "002", "003", "004", "004", "005"],
}
df3 = pd.DataFrame(data)
df3.describe()
[9]:
Code | Octal | |
---|---|---|
count | 6 | 6 |
unique | 6 | 5 |
top | U+0000 | 004 |
freq | 1 | 2 |
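For a DataFrame with mixed dtypes, describe by default only considers the numeric columns; with include="all", all columns appear in the result, with NaN for the statistics that do not apply. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "label": ["a", "a", "b"]})

print(df.describe())               # only the numeric column "x"
print(df.describe(include="all"))  # both columns
```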
Descriptive and summary statistics:
Method | Description
---|---
count | number of non-NA values
describe | calculation of a set of summary statistics for a Series or each DataFrame column
min, max | calculation of minimum and maximum values
argmin, argmax | calculation of the index points (integers) at which the minimum or maximum value was reached
idxmin, idxmax | calculation of the index labels at which the minimum or maximum values were reached
quantile | calculation of the sample quantile in the range from 0 to 1
sum | sum of the values
mean | arithmetic mean of the values
median | arithmetic median (50% quantile) of the values
mad | mean absolute deviation from the mean value
prod | product of all values
var | sample variance of the values
std | sample standard deviation of the values
skew | sample skewness (third moment) of the values
kurt | sample kurtosis (fourth moment) of the values
cumsum | cumulative sum of the values
cummin, cummax | cumulative minimum and maximum of the values respectively
cumprod | cumulative product of the values
diff | calculation of the first arithmetic difference (useful for time series)
pct_change | calculation of the percentage changes
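diff and pct_change are particularly useful for time series such as price data. A small sketch:

```python
import pandas as pd

prices = pd.Series([100.0, 102.0, 99.0, 103.0])

print(prices.diff())        # first differences; the first value is NaN
print(prices.pct_change())  # relative change compared to the previous value
```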
ydata-profiling
¶
ydata-profiling generates profile reports from a pandas DataFrame. The pandas df.describe() method is handy, but a bit basic for exploratory data analysis. ydata-profiling extends the pandas DataFrame with df.profile_report(), which automatically generates a standardised report for understanding the data.
Installation¶
$ pipenv install "ydata-profiling[notebook, unicode, pyspark]"
…
✔ Success!
Updated Pipfile.lock (cbc5f7)!
Installing dependencies from Pipfile.lock (cbc5f7)...
🐍 ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 80/80 — 00:02:26
…
$ pipenv run jupyter nbextension enable --py widgetsnbextension
Enabling notebook extension jupyter-js-widgets/extension...
- Validating: OK
Example¶
[10]:
from ydata_profiling import ProfileReport
profile = ProfileReport(df2, title="pandas Profiling Report")
profile.to_widgets()
Configuration for large datasets¶
By default, ydata-profiling summarises the dataset in a way that provides the most insight for data analysis. If the computation time of the profiling becomes a bottleneck, ydata-profiling offers several ways to reduce it. For the following examples, we first read a larger dataset into pandas:
[11]:
titanic = pd.read_csv(
    "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
)
1. Minimal mode¶
ydata-profiling contains a minimal configuration file, config_minimal.yaml, in which the most expensive calculations are turned off by default. This is the recommended starting point for larger datasets.
[12]:
profile = ProfileReport(
    titanic, title="Minimal pandas Profiling Report", minimal=True
)
profile.to_widgets()
Further details on settings and configuration can be found in Available settings.
2. Sample¶
An alternative option for very large data sets is to use only a part of them for the profiling report:
[13]:
sample = titanic.sample(frac=0.05)
profile = ProfileReport(sample, title="Sample pandas Profiling Report")
profile.to_widgets()
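To make such a sample reproducible, the random_state parameter of sample can be fixed; with the same seed, the same rows are drawn every time. A small sketch with a synthetic frame standing in for the Titanic data:

```python
import pandas as pd

df = pd.DataFrame({"a": range(100)})

# With a fixed random_state, repeated calls draw identical rows
s1 = df.sample(frac=0.05, random_state=42)
s2 = df.sample(frac=0.05, random_state=42)
print(s1.equals(s2))  # True
```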
3. Deactivate expensive calculations¶
To reduce the computational effort on large datasets while still obtaining some interesting information, certain calculations can be restricted to selected columns:
[14]:
profile = ProfileReport()
profile.config.interactions.targets = ["Sex", "Age"]
profile.df = titanic
profile.to_widgets()
The interactions.targets setting can be changed via configuration files as well as via environment variables; see Interactions for details.
4. Concurrency¶
Work is currently underway on a scalable Spark backend for ydata-profiling; see Spark profiling support.