{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Normalisation and Preprocessing\n",
"\n",
"[sklearn.preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html) can be used in many ways to clean data:\n",
"\n",
"* Standardisation with [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html), [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html), [MaxAbsScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html) or [RobustScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html).\n",
"* Centring of kernel matrices with [KernelCenterer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KernelCenterer.html).\n",
"* Non-linear transformations with [QuantileTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html), [PowerTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html)\n",
"* Normalisation with [normalize](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html).\n",
"* Encoding of categorical features with [OrdinalEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html), [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).\n",
"* [Discretisation](https://en.wikipedia.org/wiki/Discretization_of_continuous_features) (also known as quantisation or binning) with [KBinsDiscretizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html).\n",
"* Binarisation of features with [Binarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html)\n",
"* Imputation of missing values with [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html), [IterativeImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html) or [KNNImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html) where the added values can be marked with [MissingIndicator](https://scikit-learn.org/stable/modules/generated/sklearn.impute.MissingIndicator.html).\n",
"\n",
"
\n",
"\n",
"**See also:**\n",
"\n",
"* [statsmodels](https://www.statsmodels.org/stable/index.html)\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example\n",
"\n",
"In the following example, we fill in mean values and do some scaling:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. Imports"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from datetime import datetime\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"from sklearn import preprocessing\n",
"from sklearn.impute import SimpleImputer"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"hvac = pd.read_csv(\n",
" \"https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/HVAC_with_nulls.csv\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. Check data quality\n",
"\n",
"Display data types with [pandas.DataFrame.dtypes](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dtypes.html):"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Date object\n",
"Time object\n",
"TargetTemp float64\n",
"ActualTemp int64\n",
"System int64\n",
"SystemAge float64\n",
"BuildingID int64\n",
"10 float64\n",
"dtype: object"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hvac.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Return dimensions of the DataFrame as a tuple with [pandas.DataFrame.shape](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html):"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(8000, 8)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hvac.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Return first *n* rows with [pandas.DataFrame.head](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html):"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Date | \n",
" Time | \n",
" TargetTemp | \n",
" ActualTemp | \n",
" System | \n",
" SystemAge | \n",
" BuildingID | \n",
" 10 | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 6/1/13 | \n",
" 0:00:01 | \n",
" 66.0 | \n",
" 58 | \n",
" 13 | \n",
" 20.0 | \n",
" 4 | \n",
" NaN | \n",
"
\n",
" \n",
" | 1 | \n",
" 6/2/13 | \n",
" 1:00:01 | \n",
" NaN | \n",
" 68 | \n",
" 3 | \n",
" 20.0 | \n",
" 17 | \n",
" NaN | \n",
"
\n",
" \n",
" | 2 | \n",
" 6/3/13 | \n",
" 2:00:01 | \n",
" 70.0 | \n",
" 73 | \n",
" 17 | \n",
" 20.0 | \n",
" 18 | \n",
" NaN | \n",
"
\n",
" \n",
" | 3 | \n",
" 6/4/13 | \n",
" 3:00:01 | \n",
" 67.0 | \n",
" 63 | \n",
" 2 | \n",
" NaN | \n",
" 15 | \n",
" NaN | \n",
"
\n",
" \n",
" | 4 | \n",
" 6/5/13 | \n",
" 4:00:01 | \n",
" 68.0 | \n",
" 74 | \n",
" 16 | \n",
" 9.0 | \n",
" 3 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Date Time TargetTemp ActualTemp System SystemAge BuildingID 10\n",
"0 6/1/13 0:00:01 66.0 58 13 20.0 4 NaN\n",
"1 6/2/13 1:00:01 NaN 68 3 20.0 17 NaN\n",
"2 6/3/13 2:00:01 70.0 73 17 20.0 18 NaN\n",
"3 6/4/13 3:00:01 67.0 63 2 NaN 15 NaN\n",
"4 6/5/13 4:00:01 68.0 74 16 9.0 3 NaN"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hvac.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. Attribute the mean value to missing values\n",
"\n",
"For this we use the `mean` strategy of [sklearn.impute.SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#simpleimputer):"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"imp = SimpleImputer(missing_values=np.nan, strategy=\"mean\")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"hvac_numeric = hvac[[\"TargetTemp\", \"SystemAge\"]]"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"imp = imp.fit(hvac_numeric.loc[:10])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For more information on `fit`, see the [Scikit Learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer.fit).\n",
"\n",
"[fit_transform](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer.fit_transform) then transforms the adapted data:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"transformed = imp.fit_transform(hvac_numeric)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[66. , 20. ],\n",
" [67.50773481, 20. ],\n",
" [70. , 20. ],\n",
" ...,\n",
" [67.50773481, 4. ],\n",
" [65. , 23. ],\n",
" [66. , 21. ]])"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"transformed"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"hvac[\"TargetTemp\"], hvac[\"SystemAge\"] = transformed[:, 0], transformed[:, 1]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we display the first rows with the changed data records:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Date | \n",
" Time | \n",
" TargetTemp | \n",
" ActualTemp | \n",
" System | \n",
" SystemAge | \n",
" BuildingID | \n",
" 10 | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 6/1/13 | \n",
" 0:00:01 | \n",
" 66.000000 | \n",
" 58 | \n",
" 13 | \n",
" 20.000000 | \n",
" 4 | \n",
" NaN | \n",
"
\n",
" \n",
" | 1 | \n",
" 6/2/13 | \n",
" 1:00:01 | \n",
" 67.507735 | \n",
" 68 | \n",
" 3 | \n",
" 20.000000 | \n",
" 17 | \n",
" NaN | \n",
"
\n",
" \n",
" | 2 | \n",
" 6/3/13 | \n",
" 2:00:01 | \n",
" 70.000000 | \n",
" 73 | \n",
" 17 | \n",
" 20.000000 | \n",
" 18 | \n",
" NaN | \n",
"
\n",
" \n",
" | 3 | \n",
" 6/4/13 | \n",
" 3:00:01 | \n",
" 67.000000 | \n",
" 63 | \n",
" 2 | \n",
" 15.386643 | \n",
" 15 | \n",
" NaN | \n",
"
\n",
" \n",
" | 4 | \n",
" 6/5/13 | \n",
" 4:00:01 | \n",
" 68.000000 | \n",
" 74 | \n",
" 16 | \n",
" 9.000000 | \n",
" 3 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Date Time TargetTemp ActualTemp System SystemAge BuildingID 10\n",
"0 6/1/13 0:00:01 66.000000 58 13 20.000000 4 NaN\n",
"1 6/2/13 1:00:01 67.507735 68 3 20.000000 17 NaN\n",
"2 6/3/13 2:00:01 70.000000 73 17 20.000000 18 NaN\n",
"3 6/4/13 3:00:01 67.000000 63 2 15.386643 15 NaN\n",
"4 6/5/13 4:00:01 68.000000 74 16 9.000000 3 NaN"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hvac.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4. Scale\n",
"\n",
"To standardise data sets that look like standard normally distributed data, we can use [sklearn.preprocessing.scale](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html). This can be used to determine the factors by which a value increases or decreases. We can use this to scale the current temperature."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"hvac[\"ScaledTemp\"] = preprocessing.scale(hvac[\"ActualTemp\"])"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 -1.293272\n",
"1 0.048732\n",
"2 0.719733\n",
"3 -0.622270\n",
"4 0.853934\n",
"Name: ScaledTemp, dtype: float64"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hvac[\"ScaledTemp\"].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[sklearn.preprocessing.MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) scales the terms so that they lie between a certain minimum and maximum value, often between zero and one. This has the advantage of making the scaling more robust to very small standard deviations of features."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"min_max_scaler = preprocessing.MinMaxScaler()"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"temp_minmax = min_max_scaler.fit_transform(hvac[[\"ActualTemp\"]])"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[0.12],\n",
" [0.52],\n",
" [0.72],\n",
" ...,\n",
" [0.56],\n",
" [0.32],\n",
" [0.44]])"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"temp_minmax"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we also add `temp_minmax` as a new column:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 0.12\n",
"1 0.52\n",
"2 0.72\n",
"3 0.32\n",
"4 0.76\n",
"Name: MinMaxScaledTemp, dtype: float64"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hvac[\"MinMaxScaledTemp\"] = temp_minmax[:,0]\n",
"hvac[\"MinMaxScaledTemp\"].head()"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Date | \n",
" Time | \n",
" TargetTemp | \n",
" ActualTemp | \n",
" System | \n",
" SystemAge | \n",
" BuildingID | \n",
" 10 | \n",
" ScaledTemp | \n",
" MinMaxScaledTemp | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 6/1/13 | \n",
" 0:00:01 | \n",
" 66.000000 | \n",
" 58 | \n",
" 13 | \n",
" 20.000000 | \n",
" 4 | \n",
" NaN | \n",
" -1.293272 | \n",
" 0.12 | \n",
"
\n",
" \n",
" | 1 | \n",
" 6/2/13 | \n",
" 1:00:01 | \n",
" 67.507735 | \n",
" 68 | \n",
" 3 | \n",
" 20.000000 | \n",
" 17 | \n",
" NaN | \n",
" 0.048732 | \n",
" 0.52 | \n",
"
\n",
" \n",
" | 2 | \n",
" 6/3/13 | \n",
" 2:00:01 | \n",
" 70.000000 | \n",
" 73 | \n",
" 17 | \n",
" 20.000000 | \n",
" 18 | \n",
" NaN | \n",
" 0.719733 | \n",
" 0.72 | \n",
"
\n",
" \n",
" | 3 | \n",
" 6/4/13 | \n",
" 3:00:01 | \n",
" 67.000000 | \n",
" 63 | \n",
" 2 | \n",
" 15.386643 | \n",
" 15 | \n",
" NaN | \n",
" -0.622270 | \n",
" 0.32 | \n",
"
\n",
" \n",
" | 4 | \n",
" 6/5/13 | \n",
" 4:00:01 | \n",
" 68.000000 | \n",
" 74 | \n",
" 16 | \n",
" 9.000000 | \n",
" 3 | \n",
" NaN | \n",
" 0.853934 | \n",
" 0.76 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Date Time TargetTemp ActualTemp System SystemAge BuildingID 10 \\\n",
"0 6/1/13 0:00:01 66.000000 58 13 20.000000 4 NaN \n",
"1 6/2/13 1:00:01 67.507735 68 3 20.000000 17 NaN \n",
"2 6/3/13 2:00:01 70.000000 73 17 20.000000 18 NaN \n",
"3 6/4/13 3:00:01 67.000000 63 2 15.386643 15 NaN \n",
"4 6/5/13 4:00:01 68.000000 74 16 9.000000 3 NaN \n",
"\n",
" ScaledTemp MinMaxScaledTemp \n",
"0 -1.293272 0.12 \n",
"1 0.048732 0.52 \n",
"2 0.719733 0.72 \n",
"3 -0.622270 0.32 \n",
"4 0.853934 0.76 "
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hvac.head()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.13 Kernel",
"language": "python",
"name": "python313"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.0"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autoclose": false,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {},
"version_major": 2,
"version_minor": 0
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}