# Descriptive statistics¶

pandas objects are equipped with a number of common mathematical and statistical methods. Most of them fall into the category of reductions or summary statistics, methods that extract a single value (such as the sum or mean) from a series or set of values from the rows or columns of a DataFrame. Compared to similar methods found in NumPy arrays, they also handle missing data.

```[1]:
```
```import numpy as np
import pandas as pd

rng = np.random.default_rng()
df = pd.DataFrame(
rng.normal(size=(7, 3)), index=pd.date_range("2022-02-02", periods=7)
)
new_index = pd.date_range("2022-02-03", periods=7)
df2 = df.reindex(new_index)

df2
```
```[1]:
```
0 1 2
2022-02-03 0.686507 1.870769 -0.699365
2022-02-04 -1.462243 0.833043 0.423066
2022-02-05 0.227436 -1.146793 -0.495678
2022-02-06 0.404523 0.517117 -1.475375
2022-02-07 2.022298 -0.263188 -0.478148
2022-02-08 -0.056213 0.913033 -0.723379
2022-02-09 NaN NaN NaN

Calling the `pandas.DataFrame.sum` method returns a Series containing the column sums:

```[2]:
```
```df2.sum()
```
```[2]:
```
```0    1.822307
1    2.723981
2   -3.448879
dtype: float64
```

Passing `axis='columns'` or `axis=1` instead sums across the columns, producing one value per row:

```[3]:
```
```df2.sum(axis="columns")
```
```[3]:
```
```2022-02-03    1.857911
2022-02-04   -0.206135
2022-02-05   -1.415035
2022-02-06   -0.553735
2022-02-07    1.280962
2022-02-08    0.133441
2022-02-09    0.000000
Freq: D, dtype: float64
```

If an entire row or column contains only NA values, the sum is `0`. This behaviour can be disabled with the `skipna` option:

```[4]:
```
```df2.sum(axis="columns", skipna=False)
```
```[4]:
```
```2022-02-03    1.857911
2022-02-04   -0.206135
2022-02-05   -1.415035
2022-02-06   -0.553735
2022-02-07    1.280962
2022-02-08    0.133441
2022-02-09         NaN
Freq: D, dtype: float64
```

Some aggregations, such as `mean`, require at least one non-`NaN` value to produce a meaningful result:

```[5]:
```
```df2.mean(axis="columns")
```
```[5]:
```
```2022-02-03    0.619304
2022-02-04   -0.068712
2022-02-05   -0.471678
2022-02-06   -0.184578
2022-02-07    0.426987
2022-02-08    0.044480
2022-02-09         NaN
Freq: D, dtype: float64
```
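Because NA values are skipped, `mean` divides the sum of the non-missing values by their count, not by the length of the axis. A minimal sketch with a made-up Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan])

# sum() skips the NaN (giving 3.0) and count() counts only non-NA values (2),
# so the mean is 3.0 / 2 = 1.5 rather than 3.0 / 3
m = s.mean()
```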

## Options for reduction methods¶

| Option | Description |
|:-------|:------------|
| `axis` | the axis to reduce over: `0` for the rows of the DataFrame (per-column results) and `1` for the columns (per-row results) |
| `skipna` | exclude missing values; `True` by default |
| `level` | reduce grouped by level if the axis is hierarchically indexed (MultiIndex) |
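How `axis` and `skipna` combine can be sketched with a small, made-up DataFrame containing one missing value:

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({"a": [1.0, np.nan], "b": [3.0, 4.0]})

col_sums = small.sum()                 # axis=0 (default): one sum per column
row_sums = small.sum(axis="columns")   # one sum per row
strict = small.sum(skipna=False)       # propagate NaN instead of skipping it
```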

Some methods, such as `idxmin` and `idxmax`, provide indirect statistics such as the index value at which the minimum or maximum value is reached:

```[6]:
```
```df2.idxmax()
```
```[6]:
```
```0   2022-02-07
1   2022-02-03
2   2022-02-04
dtype: datetime64[ns]
```

Other methods are accumulations:

```[7]:
```
```df2.cumsum()
```
```[7]:
```
0 1 2
2022-02-03 0.686507 1.870769 -0.699365
2022-02-04 -0.775736 2.703812 -0.276300
2022-02-05 -0.548300 1.557019 -0.771977
2022-02-06 -0.143777 2.074136 -2.247352
2022-02-07 1.878520 1.810948 -2.725500
2022-02-08 1.822307 2.723981 -3.448879
2022-02-09 NaN NaN NaN

Another type of method is neither a reduction nor an accumulation. `describe` is one such example, producing several summary statistics in one go:

```[8]:
```
```df2.describe()
```
```[8]:
```
0 1 2
count 6.000000 6.000000 6.000000
mean 0.303718 0.453997 -0.574813
std 1.128201 1.043312 0.609912
min -1.462243 -1.146793 -1.475375
25% 0.014699 -0.068112 -0.717375
50% 0.315979 0.675080 -0.597521
75% 0.616011 0.893035 -0.482530
max 2.022298 1.870769 0.423066
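The quantiles that `describe` reports can be adjusted with its `percentiles` parameter; the median is always included:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(6, 3)))

# Report the 10% and 90% quantiles instead of the default quartiles
summary = df.describe(percentiles=[0.1, 0.9])
```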

For non-numeric data, `describe` generates alternative summary statistics:

```[9]:
```
```data = {
"Code": ["U+0000", "U+0001", "U+0002", "U+0003", "U+0004", "U+0005"],
"Octal": ["001", "002", "003", "004", "004", "005"],
}
df3 = pd.DataFrame(data)

df3.describe()
```
```[9]:
```
Code Octal
count 6 6
unique 6 5
top U+0000 004
freq 1 2

Descriptive and summary statistics:

| Method | Description |
|:-------|:------------|
| `count` | number of non-NA values |
| `describe` | set of summary statistics for a Series or for each DataFrame column |
| `min`, `max` | minimum and maximum values |
| `argmin`, `argmax` | index positions (integers) at which the minimum or maximum value was reached |
| `idxmin`, `idxmax` | index labels at which the minimum or maximum value was reached |
| `quantile` | sample quantile in the range from 0 to 1 |
| `sum` | sum of the values |
| `mean` | arithmetic mean of the values |
| `median` | median (50% quantile) of the values |
| `mad` | mean absolute deviation from the mean value |
| `prod` | product of all values |
| `var` | sample variance of the values |
| `std` | sample standard deviation of the values |
| `skew` | sample skewness (third moment) of the values |
| `kurt` | sample kurtosis (fourth moment) of the values |
| `cumsum` | cumulative sum of the values |
| `cummin`, `cummax` | cumulative minimum and maximum of the values respectively |
| `cumprod` | cumulative product of the values |
| `diff` | first arithmetic difference (useful for time series) |
| `pct_change` | percentage change |
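A few of the methods from the table, sketched on a small, made-up Series of doubling values:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0, 8.0])

q50 = s.quantile(0.5)       # median: interpolates between 2.0 and 4.0
steps = s.diff()            # first differences; the first entry is NaN
growth = s.pct_change()     # relative change from the previous value
```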

## `ydata-profiling`¶

ydata-profiling generates profile reports from a pandas DataFrame. The pandas `df.describe()` method is handy, but a bit basic for exploratory data analysis. ydata-profiling extends the DataFrame with `df.profile_report()`, which automatically generates a standardised report for understanding the data.

### Installation¶

```$ pipenv install "ydata-profiling[notebook, unicode, pyspark]"
…
✔ Success!
Updated Pipfile.lock (cbc5f7)!
Installing dependencies from Pipfile.lock (cbc5f7)...
🐍   ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 80/80 — 00:02:26
…
$ pipenv run jupyter nbextension enable --py widgetsnbextension
Enabling notebook extension jupyter-js-widgets/extension...
- Validating: OK
```

### Example¶

```[10]:
```
```from ydata_profiling import ProfileReport

profile = ProfileReport(df2, title="pandas Profiling Report")

profile.to_widgets()
```

### Configuration for large datasets¶

By default, ydata-profiling summarises the dataset to provide the most insights for data analysis. If the computation time of profiling becomes a bottleneck, ydata-profiling offers several alternatives to overcome it. For the following examples, we first read a larger dataset into pandas:

```[11]:
```
```titanic = pd.read_csv(
"https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
)
```

#### 1. minimal mode¶

ydata-profiling contains a minimal configuration file, `config_minimal.yaml`, in which the most expensive calculations are turned off by default. This is the recommended starting point for larger datasets.

```[12]:
```
```profile = ProfileReport(
titanic, title="Minimal pandas Profiling Report", minimal=True
)

profile.to_widgets()
```

Further details on settings and configuration can be found in Available settings.

#### 2. Sample¶

An alternative option for very large data sets is to use only a part of them for the profiling report:

```[13]:
```
```sample = titanic.sample(frac=0.05)

profile = ProfileReport(sample, title="Sample pandas Profiling Report")

profile.to_widgets()
```

#### 3. Deactivate expensive calculations¶

To reduce the computational effort in large datasets, but still get some interesting information, some calculations can be filtered only for certain columns:

```[14]:
```
```profile = ProfileReport()
profile.config.interactions.targets = ["Sex", "Age"]
profile.df = titanic

profile.to_widgets()
```

The `interactions.targets` setting can be changed via configuration files as well as via environment variables; see Interactions for details.

#### 4. Concurrency¶

Work is currently underway on a scalable Spark backend for ydata-profiling; see Spark Profiling Development.