Normalisation and Preprocessing

scikit-learn offers many ways to clean and prepare data, mainly in sklearn.preprocessing and sklearn.impute:

Standardisation with StandardScaler, MinMaxScaler, MaxAbsScaler or RobustScaler.
Centring of kernel matrices with KernelCenterer.
Non-linear transformations with QuantileTransformer or PowerTransformer.
Normalisation with normalize.
Encoding of categorical features with OrdinalEncoder or OneHotEncoder.
Discretisation (also known as quantisation or binning) with KBinsDiscretizer.
Binarisation of features with Binarizer.
Imputation of missing values with SimpleImputer, IterativeImputer or KNNImputer from sklearn.impute; MissingIndicator can mark which values were imputed.
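
As a rough illustration of three of the transformers listed above, the following sketch applies Binarizer, KBinsDiscretizer and OrdinalEncoder to small toy arrays. The values, the threshold and all variable names are assumptions made for this example and do not come from the HVAC data used below.

```python
import numpy as np
from sklearn.preprocessing import Binarizer, KBinsDiscretizer, OrdinalEncoder

# Hypothetical toy data, not taken from the HVAC data set
ages = np.array([[1.0], [5.0], [12.0], [25.0]])
systems = np.array([["boiler"], ["chiller"], ["boiler"], ["heat pump"]])

# Binarisation: values above the threshold become 1, all others 0
is_old = Binarizer(threshold=10.0).fit_transform(ages)

# Discretisation into two equal-width bins, returned as ordinal bin indices
discretizer = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="uniform")
age_bins = discretizer.fit_transform(ages)

# Ordinal encoding: categories are sorted and mapped to 0, 1, 2, ...
system_codes = OrdinalEncoder().fit_transform(systems)

print(is_old.ravel())
print(age_bins.ravel())
print(system_codes.ravel())
```

All three follow the same fit/transform interface as the imputer and scalers in the example below, so they can be swapped in with few changes.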

See also:

statsmodels

Example

In the following example, we fill in mean values and do some scaling:

1. Imports

[1]:

import numpy as np
import pandas as pd

from sklearn import preprocessing
from sklearn.impute import SimpleImputer

[2]:

hvac = pd.read_csv(
    "https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/HVAC_with_nulls.csv",
)

2. Check data quality

Display data types with pandas.DataFrame.dtypes:

[3]:

hvac.dtypes

[3]:

Date           object
Time           object
TargetTemp    float64
ActualTemp      int64
System          int64
SystemAge     float64
BuildingID      int64
10            float64
dtype: object

Return dimensions of the DataFrame as a tuple with pandas.DataFrame.shape:

[4]:

hvac.shape

[4]:

(8000, 8)

Return first n rows with pandas.DataFrame.head:

[5]:

hvac.head()

[5]:

	Date	Time	TargetTemp	ActualTemp	System	SystemAge	BuildingID	10
0	6/1/13	0:00:01	66.0	58	13	20.0	4	NaN
1	6/2/13	1:00:01	NaN	68	3	20.0	17	NaN
2	6/3/13	2:00:01	70.0	73	17	20.0	18	NaN
3	6/4/13	3:00:01	67.0	63	2	NaN	15	NaN
4	6/5/13	4:00:01	68.0	74	16	9.0	3	NaN
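
The NaN entries in the output above suggest counting the missing values per column before imputing anything. A minimal sketch, assuming a small hypothetical frame (`sample`) that only mimics a few of the HVAC columns so that it runs without downloading the CSV file:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame that mimics a few HVAC columns
sample = pd.DataFrame(
    {
        "TargetTemp": [66.0, np.nan, 70.0, 67.0],
        "ActualTemp": [58, 68, 73, 63],
        "SystemAge": [20.0, 20.0, 20.0, np.nan],
    }
)

# Number of missing values per column
missing_counts = sample.isna().sum()

# Share of missing values per column (mean of the boolean mask)
missing_share = sample.isna().mean()

print(missing_counts)
print(missing_share)
```

The same two calls should work unchanged on the full hvac DataFrame.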

3. Replace missing values with the mean

For this we use the mean strategy of sklearn.impute.SimpleImputer:

[6]:

imp = SimpleImputer(missing_values=np.nan, strategy="mean")

[7]:

hvac_numeric = hvac[["TargetTemp", "SystemAge"]]

[8]:

imp = imp.fit(hvac_numeric.loc[:10])

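Note that this call only learns the column means of the first eleven rows, because loc slices include the end label. For more information on fit, see the scikit-learn documentation.

fit_transform then refits the imputer on all rows and replaces the missing values in one step: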

[9]:

transformed = imp.fit_transform(hvac_numeric)

[10]:

transformed

[10]:

array([[66.        , 20.        ],
       [67.50773481, 20.        ],
       [70.        , 20.        ],
       ...,
       [67.50773481,  4.        ],
       [65.        , 23.        ],
       [66.        , 21.        ]])
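
The difference between fit and fit_transform can be sketched on a tiny example. The frame and its values are hypothetical and chosen only so that the two means differ; statistics_ is the attribute in which SimpleImputer stores the learned fill values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy column, not taken from the HVAC data set
frame = pd.DataFrame({"TargetTemp": [66.0, np.nan, 70.0, 71.0]})

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit only learns the fill value from the rows it is given
imputer.fit(frame.iloc[:3])           # mean of 66 and 70
subset_mean = imputer.statistics_[0]

# fit_transform discards that state, refits on all rows and then transforms
filled = imputer.fit_transform(frame)
full_mean = imputer.statistics_[0]    # mean of 66, 70 and 71

print(subset_mean, full_mean)
print(filled.ravel())
```

To apply the statistics learned by an earlier fit call, transform would be used instead of fit_transform.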

[11]:

hvac["TargetTemp"], hvac["SystemAge"] = transformed[:, 0], transformed[:, 1]

Now we display the first rows of the changed data:

[12]:

hvac.head()

[12]:

	Date	Time	TargetTemp	ActualTemp	System	SystemAge	BuildingID	10
0	6/1/13	0:00:01	66.000000	58	13	20.000000	4	NaN
1	6/2/13	1:00:01	67.507735	68	3	20.000000	17	NaN
2	6/3/13	2:00:01	70.000000	73	17	20.000000	18	NaN
3	6/4/13	3:00:01	67.000000	63	2	15.386643	15	NaN
4	6/5/13	4:00:01	68.000000	74	16	9.000000	3	NaN
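
After imputation, the table no longer shows which values were originally missing. One possible way to keep that information is the add_indicator option of SimpleImputer, which appends the output of a MissingIndicator to the imputed columns. The array below is a hypothetical stand-in for the TargetTemp and SystemAge columns.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical values for two columns (TargetTemp, SystemAge)
values = np.array(
    [
        [66.0, 20.0],
        [np.nan, 20.0],
        [70.0, np.nan],
    ]
)

# add_indicator=True appends one 0/1 column per feature that had missing values
imputer = SimpleImputer(strategy="mean", add_indicator=True)
result = imputer.fit_transform(values)

imputed = result[:, :2]      # the two imputed feature columns
was_missing = result[:, 2:]  # indicator columns, 1 where a value was missing

print(imputed)
print(was_missing)
```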

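4. Scale

sklearn.preprocessing.scale standardises a feature by subtracting its mean and dividing by its standard deviation, so that the result has zero mean and unit variance. Each scaled value then states how many standard deviations the original value lies above or below the mean. This works best for data that is roughly normally distributed. We can use it to scale the actual temperature: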

[13]:

hvac["ScaledTemp"] = preprocessing.scale(hvac["ActualTemp"])

[14]:

hvac["ScaledTemp"].head()

[14]:

0   -1.293272
1    0.048732
2    0.719733
3   -0.622270
4    0.853934
Name: ScaledTemp, dtype: float64
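
To make the computation explicit, the following sketch compares preprocessing.scale with the manual formula (x - mean) / std on five sample temperatures. It also shows StandardScaler, which performs the same standardisation but stores the learned mean and standard deviation so that they can be reused on new data. The new reading of 70.0 is a hypothetical value.

```python
import numpy as np
from sklearn import preprocessing

temps = np.array([58.0, 68.0, 73.0, 63.0, 74.0])

# scale uses the population standard deviation (ddof=0), as does np.std
scaled = preprocessing.scale(temps)
manual = (temps - temps.mean()) / temps.std()

# StandardScaler keeps mean_ and scale_ for later use on unseen values
scaler = preprocessing.StandardScaler().fit(temps.reshape(-1, 1))
new_reading = np.array([[70.0]])  # hypothetical new measurement
new_scaled = scaler.transform(new_reading)

print(scaled)
print(scaler.mean_, scaler.scale_)
print(new_scaled)
```

The transformer variant is usually preferable when the same scaling has to be applied to separate training and test data.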

sklearn.preprocessing.MinMaxScaler rescales each feature linearly so that its values lie between a given minimum and maximum, by default between zero and one. Compared with standardisation, this is more robust when a feature has a very small standard deviation.

[15]:

min_max_scaler = preprocessing.MinMaxScaler()

[16]:

temp_minmax = min_max_scaler.fit_transform(hvac[["ActualTemp"]])

[17]:

temp_minmax

[17]:

array([[0.12],
       [0.52],
       [0.72],
       ...,
       [0.56],
       [0.32],
       [0.44]])
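
With the default range of zero to one, the result corresponds to (x - min) / (max - min) per feature. A minimal sketch on hypothetical temperatures that checks this formula and reverses the scaling with inverse_transform:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical temperatures as a single feature column
temps = np.array([[55.0], [58.0], [68.0], [80.0]])

scaler = MinMaxScaler()  # default feature_range is (0, 1)
scaled = scaler.fit_transform(temps)

# The same result computed by hand
manual = (temps - temps.min()) / (temps.max() - temps.min())

# inverse_transform maps the scaled values back to the original units
restored = scaler.inverse_transform(scaled)

print(scaled.ravel())
print(restored.ravel())
```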

Now we also add temp_minmax as a new column:

[18]:

hvac["MinMaxScaledTemp"] = temp_minmax[:, 0]
hvac["MinMaxScaledTemp"].head()

[18]:

0    0.12
1    0.52
2    0.72
3    0.32
4    0.76
Name: MinMaxScaledTemp, dtype: float64

[19]:

hvac.head()

[19]:

	Date	Time	TargetTemp	ActualTemp	System	SystemAge	BuildingID	10	ScaledTemp	MinMaxScaledTemp
0	6/1/13	0:00:01	66.000000	58	13	20.000000	4	NaN	-1.293272	0.12
1	6/2/13	1:00:01	67.507735	68	3	20.000000	17	NaN	0.048732	0.52
2	6/3/13	2:00:01	70.000000	73	17	20.000000	18	NaN	0.719733	0.72
3	6/4/13	3:00:01	67.000000	63	2	15.386643	15	NaN	-0.622270	0.32
4	6/5/13	4:00:01	68.000000	74	16	9.000000	3	NaN	0.853934	0.76