Normalisation and Preprocessing
sklearn.preprocessing and, for imputation, sklearn.impute offer many ways to clean data:
Standardisation with StandardScaler, MinMaxScaler, MaxAbsScaler or RobustScaler.
Centring of kernel matrices with KernelCenterer.
Non-linear transformations with QuantileTransformer or PowerTransformer.
Normalisation with normalize.
Encoding of categorical features with OrdinalEncoder or OneHotEncoder.
Discretisation (also known as quantisation or binning) with KBinsDiscretizer.
Binarisation of features with Binarizer.
Imputation of missing values with SimpleImputer, IterativeImputer or KNNImputer from sklearn.impute, where the imputed values can be marked with MissingIndicator.
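To give a feel for a few of the transformers listed above, here is a minimal sketch on small, hypothetical arrays (the category names and the threshold of 65 are made up for illustration and are not taken from the HVAC data):

```python
import numpy as np
from sklearn.preprocessing import Binarizer, OneHotEncoder, OrdinalEncoder

# Hypothetical categorical feature: one column, three rows.
systems = np.array([["heat"], ["cool"], ["heat"]])

# OrdinalEncoder maps each category to an integer (categories are sorted).
ordinal = OrdinalEncoder().fit_transform(systems)

# OneHotEncoder returns a sparse matrix by default; convert it for display.
one_hot = OneHotEncoder().fit_transform(systems).toarray()

# Binarizer sets values above the threshold to 1 and all others to 0.
temps = np.array([[58.0], [68.0], [73.0]])
binary = Binarizer(threshold=65.0).fit_transform(temps)

print(ordinal.ravel())
print(one_hot)
print(binary.ravel())
```

All three follow the same fit/transform pattern that the example below uses for imputation and scaling.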
Example
In the following example, we fill in mean values and do some scaling:
1. Imports
[1]:
from datetime import datetime
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.impute import SimpleImputer
[2]:
hvac = pd.read_csv(
"https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/HVAC_with_nulls.csv"
)
2. Check data quality
Display data types with pandas.DataFrame.dtypes:
[3]:
hvac.dtypes
[3]:
Date object
Time object
TargetTemp float64
ActualTemp int64
System int64
SystemAge float64
BuildingID int64
10 float64
dtype: object
Return dimensions of the DataFrame as a tuple with pandas.DataFrame.shape:
[4]:
hvac.shape
[4]:
(8000, 8)
Return first n rows with pandas.DataFrame.head:
[5]:
hvac.head()
[5]:
|   | Date | Time | TargetTemp | ActualTemp | System | SystemAge | BuildingID | 10 |
|---|---|---|---|---|---|---|---|---|
| 0 | 6/1/13 | 0:00:01 | 66.0 | 58 | 13 | 20.0 | 4 | NaN |
| 1 | 6/2/13 | 1:00:01 | NaN | 68 | 3 | 20.0 | 17 | NaN |
| 2 | 6/3/13 | 2:00:01 | 70.0 | 73 | 17 | 20.0 | 18 | NaN |
| 3 | 6/4/13 | 3:00:01 | 67.0 | 63 | 2 | NaN | 15 | NaN |
| 4 | 6/5/13 | 4:00:01 | 68.0 | 74 | 16 | 9.0 | 3 | NaN |
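Beyond dtypes, shape and head, it can help to count the missing values per column before imputing anything. The following is a sketch only: it retypes four columns of the five displayed rows by hand into a hypothetical DataFrame called sample, so the counts apply to these five rows and not to the full file.

```python
import numpy as np
import pandas as pd

# Four columns of the five rows displayed above, retyped by hand for illustration.
sample = pd.DataFrame(
    {
        "TargetTemp": [66.0, np.nan, 70.0, 67.0, 68.0],
        "ActualTemp": [58, 68, 73, 63, 74],
        "SystemAge": [20.0, 20.0, 20.0, np.nan, 9.0],
        "10": [np.nan] * 5,
    }
)

# Number of missing values per column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Share of missing values per column, between 0 and 1.
missing_share = sample.isna().mean()
print(missing_share)
```

On the real data, the same two calls on hvac would show which columns need imputation and which, like a column that is entirely NaN, might better be dropped.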
3. Replace missing values with the mean
For this we use the mean strategy of sklearn.impute.SimpleImputer:
[6]:
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
[7]:
hvac_numeric = hvac[["TargetTemp", "SystemAge"]]
[8]:
imp = imp.fit(hvac_numeric.loc[:10])
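For more information on fit, see the scikit-learn documentation. Note that the call above only fits the imputer on the first eleven rows, because hvac_numeric.loc[:10] includes the label 10.
fit_transform then refits the imputer on all rows and fills in the missing values in one step: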
[9]:
transformed = imp.fit_transform(hvac_numeric)
[10]:
transformed
[10]:
array([[66. , 20. ],
[67.50773481, 20. ],
[70. , 20. ],
...,
[67.50773481, 4. ],
[65. , 23. ],
[66. , 21. ]])
[11]:
hvac["TargetTemp"], hvac["SystemAge"] = transformed[:, 0], transformed[:, 1]
Now we display the first rows with the changed data records:
[12]:
hvac.head()
[12]:
|   | Date | Time | TargetTemp | ActualTemp | System | SystemAge | BuildingID | 10 |
|---|---|---|---|---|---|---|---|---|
| 0 | 6/1/13 | 0:00:01 | 66.000000 | 58 | 13 | 20.000000 | 4 | NaN |
| 1 | 6/2/13 | 1:00:01 | 67.507735 | 68 | 3 | 20.000000 | 17 | NaN |
| 2 | 6/3/13 | 2:00:01 | 70.000000 | 73 | 17 | 20.000000 | 18 | NaN |
| 3 | 6/4/13 | 3:00:01 | 67.000000 | 63 | 2 | 15.386643 | 15 | NaN |
| 4 | 6/5/13 | 4:00:01 | 68.000000 | 74 | 16 | 9.000000 | 3 | NaN |
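The difference between fit, transform and fit_transform is easier to see on a tiny array. The sketch below uses hypothetical values (train and new are made-up names and numbers, not part of the HVAC data): fit learns the column means, and transform uses those means to fill gaps, also in data the imputer has not seen before.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training and new data with missing entries.
train = np.array([[66.0, 20.0], [np.nan, 20.0], [70.0, np.nan], [68.0, 8.0]])
new = np.array([[np.nan, np.nan], [65.0, 23.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit only learns the column means; it does not change any data.
imputer.fit(train)
print(imputer.statistics_)  # column means learned from train

# transform fills gaps with the means learned during fit.
filled_new = imputer.transform(new)
print(filled_new)
```

fit_transform is shorthand for calling fit and transform on the same data, which is why it overrides any earlier fit.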
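4. Scale
Many estimators work best when features look roughly like standard normally distributed data, that is, with zero mean and unit variance. sklearn.preprocessing.scale standardises a feature accordingly: it subtracts the mean and divides by the standard deviation, so each scaled value expresses how many standard deviations the original lies above or below the mean. We can use this to scale the actual temperature.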
[13]:
hvac["ScaledTemp"] = preprocessing.scale(hvac["ActualTemp"])
[14]:
hvac["ScaledTemp"].head()
[14]:
0 -1.293272
1 0.048732
2 0.719733
3 -0.622270
4 0.853934
Name: ScaledTemp, dtype: float64
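What scale computes can be checked by hand. The sketch below uses only the first five ActualTemp readings as a hypothetical sample, so its numbers differ from the output above, where mean and standard deviation come from the whole column:

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical sample: the first five temperature readings only.
temps = np.array([58.0, 68.0, 73.0, 63.0, 74.0])

# Standardise by hand: subtract the mean, divide by the population
# standard deviation (NumPy's default, ddof=0).
manual = (temps - temps.mean()) / temps.std()

scaled = preprocessing.scale(temps)

# Both approaches should agree up to floating-point error.
print(np.allclose(manual, scaled))

# After scaling, the mean is (numerically) zero and the standard deviation one.
print(scaled.mean(), scaled.std())
```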
sklearn.preprocessing.MinMaxScaler scales features so that they lie between a given minimum and maximum value, by default between zero and one. This has the advantage of making the scaling more robust to very small standard deviations of features.
[15]:
min_max_scaler = preprocessing.MinMaxScaler()
[16]:
temp_minmax = min_max_scaler.fit_transform(hvac[["ActualTemp"]])
[17]:
temp_minmax
[17]:
array([[0.12],
[0.52],
[0.72],
...,
[0.56],
[0.32],
[0.44]])
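The arithmetic behind these numbers is (x - min) / (max - min). The sketch below assumes a column minimum of 55 and a maximum of 80; these two values are not shown in the data above but are inferred from the scaled results 0.12, 0.52 and 0.72 for the readings 58, 68 and 73, so treat them as an assumption.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# First five ActualTemp values plus the assumed column minimum (55)
# and maximum (80), inferred from the outputs above.
temps = np.array([[55.0], [58.0], [68.0], [73.0], [63.0], [74.0], [80.0]])

# Min-max scaling by hand: (x - min) / (max - min).
manual = (temps - temps.min()) / (temps.max() - temps.min())

scaler = MinMaxScaler()  # default feature_range=(0, 1)
scaled = scaler.fit_transform(temps)

# Both approaches should agree up to floating-point error.
print(np.allclose(manual, scaled))
print(scaled.ravel())

# inverse_transform maps scaled values back to the original units.
restored = scaler.inverse_transform(scaled)
print(restored.ravel())
```

Because the fitted scaler stores the minimum and maximum, the same object can later scale new readings with transform or recover the original temperatures with inverse_transform.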
Now we also add temp_minmax as a new column:
[18]:
hvac["MinMaxScaledTemp"] = temp_minmax[:, 0]
hvac["MinMaxScaledTemp"].head()
[18]:
0 0.12
1 0.52
2 0.72
3 0.32
4 0.76
Name: MinMaxScaledTemp, dtype: float64
[19]:
hvac.head()
[19]:
|   | Date | Time | TargetTemp | ActualTemp | System | SystemAge | BuildingID | 10 | ScaledTemp | MinMaxScaledTemp |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6/1/13 | 0:00:01 | 66.000000 | 58 | 13 | 20.000000 | 4 | NaN | -1.293272 | 0.12 |
| 1 | 6/2/13 | 1:00:01 | 67.507735 | 68 | 3 | 20.000000 | 17 | NaN | 0.048732 | 0.52 |
| 2 | 6/3/13 | 2:00:01 | 70.000000 | 73 | 17 | 20.000000 | 18 | NaN | 0.719733 | 0.72 |
| 3 | 6/4/13 | 3:00:01 | 67.000000 | 63 | 2 | 15.386643 | 15 | NaN | -0.622270 | 0.32 |
| 4 | 6/5/13 | 4:00:01 | 68.000000 | 74 | 16 | 9.000000 | 3 | NaN | 0.853934 | 0.76 |