Normalisation and Preprocessing

The sklearn.preprocessing module provides many utilities for cleaning and transforming data.

Example

In the following example, we fill in missing values with the mean and then scale the data:

1. Imports

[1]:
from datetime import datetime

import numpy as np
import pandas as pd

from sklearn import preprocessing
from sklearn.impute import SimpleImputer
[2]:
hvac = pd.read_csv(
    "https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/HVAC_with_nulls.csv"
)

2. Check data quality

Display data types with pandas.DataFrame.dtypes:

[3]:
hvac.dtypes
[3]:
Date           object
Time           object
TargetTemp    float64
ActualTemp      int64
System          int64
SystemAge     float64
BuildingID      int64
10            float64
dtype: object

Return dimensions of the DataFrame as a tuple with pandas.DataFrame.shape:

[4]:
hvac.shape
[4]:
(8000, 8)

Return first n rows with pandas.DataFrame.head:

[5]:
hvac.head()
[5]:
     Date     Time  TargetTemp  ActualTemp  System  SystemAge  BuildingID  10
0  6/1/13  0:00:01        66.0          58      13       20.0           4 NaN
1  6/2/13  1:00:01         NaN          68       3       20.0          17 NaN
2  6/3/13  2:00:01        70.0          73      17       20.0          18 NaN
3  6/4/13  3:00:01        67.0          63       2        NaN          15 NaN
4  6/5/13  4:00:01        68.0          74      16        9.0           3 NaN
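Beyond dtypes, shape and head, it is often useful to count the missing values per column. The following sketch uses a small hypothetical DataFrame (not the HVAC data) to show pandas.DataFrame.isnull combined with sum:

```python
import numpy as np
import pandas as pd

# isnull() marks each cell as True/False; sum() counts the Trues per column
df = pd.DataFrame(
    {
        "TargetTemp": [66.0, np.nan, 70.0],
        "ActualTemp": [58, 68, 73],
    }
)
print(df.isnull().sum())  # TargetTemp has 1 missing value, ActualTemp none
```

Running the same call on hvac would show how many values each column is missing before imputation.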

3. Impute missing values with the mean

For this we use the mean strategy of sklearn.impute.SimpleImputer:

[6]:
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
[7]:
hvac_numeric = hvac[["TargetTemp", "SystemAge"]]
[8]:
imp = imp.fit(hvac_numeric.loc[:10])

fit learns the column means from the data passed to it — here only the first rows. For more information on fit, see the scikit-learn documentation.

fit_transform then fits the imputer on the full data and replaces the missing values in one step:

[9]:
transformed = imp.fit_transform(hvac_numeric)
[10]:
transformed
[10]:
array([[66.        , 20.        ],
       [67.50773481, 20.        ],
       [70.        , 20.        ],
       ...,
       [67.50773481,  4.        ],
       [65.        , 23.        ],
       [66.        , 21.        ]])
[11]:
hvac["TargetTemp"], hvac["SystemAge"] = transformed[:, 0], transformed[:, 1]

Now we display the first rows with the imputed values:

[12]:
hvac.head()
[12]:
     Date     Time  TargetTemp  ActualTemp  System  SystemAge  BuildingID  10
0  6/1/13  0:00:01   66.000000          58      13  20.000000           4 NaN
1  6/2/13  1:00:01   67.507735          68       3  20.000000          17 NaN
2  6/3/13  2:00:01   70.000000          73      17  20.000000          18 NaN
3  6/4/13  3:00:01   67.000000          63       2  15.386643          15 NaN
4  6/5/13  4:00:01   68.000000          74      16   9.000000           3 NaN
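The difference between fit and fit_transform can be illustrated on a small toy array (hypothetical values, not from the HVAC data): fit only learns the column means, and transform fills NaNs using those learned means — even in data the imputer has never seen.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# fit() learns the column means from the training data
train = np.array([[1.0, 10.0], [3.0, np.nan], [np.nan, 30.0]])
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
imp.fit(train)  # learned means: 2.0 and 20.0

# transform() fills NaNs with the means learned from `train`,
# not with statistics of the data being transformed
new = np.array([[np.nan, 100.0]])
print(imp.transform(new))  # the NaN is replaced by the learned mean 2.0
```

This is why fitting on only the first rows (as in cell [8] above) and then calling fit_transform on the full data gives different means: fit_transform re-fits on everything it receives.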

4. Scale

To standardise data so that it looks like standard normally distributed data — zero mean and unit variance — we can use sklearn.preprocessing.scale. Each resulting value expresses how many standard deviations the original value lies above or below the mean. We use this to scale the current temperature.

[13]:
hvac["ScaledTemp"] = preprocessing.scale(hvac["ActualTemp"])
[14]:
hvac["ScaledTemp"].head()
[14]:
0   -1.293272
1    0.048732
2    0.719733
3   -0.622270
4    0.853934
Name: ScaledTemp, dtype: float64
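That the result really has zero mean and unit variance can be checked on a small array (a sketch with hypothetical values, independent of the HVAC data):

```python
import numpy as np
from sklearn import preprocessing

# scale() subtracts the mean and divides by the standard deviation,
# so the result has mean 0 and standard deviation 1
values = np.array([58.0, 68.0, 73.0, 63.0, 74.0])
scaled = preprocessing.scale(values)
print(scaled.mean())  # ~0.0
print(scaled.std())   # ~1.0
```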

sklearn.preprocessing.MinMaxScaler scales the features so that they lie between a given minimum and maximum value, often between zero and one. This has the advantage of making the scaling more robust to very small standard deviations of features.

[15]:
min_max_scaler = preprocessing.MinMaxScaler()
[16]:
temp_minmax = min_max_scaler.fit_transform(hvac[["ActualTemp"]])
[17]:
temp_minmax
[17]:
array([[0.12],
       [0.52],
       [0.72],
       ...,
       [0.56],
       [0.32],
       [0.44]])
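The mapping behind these numbers is (x - min) / (max - min), which sends the column minimum to 0 and the maximum to 1. A small sketch with hypothetical values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# MinMaxScaler maps x to (x - min) / (max - min) per column
values = np.array([[50.0], [58.0], [75.0], [100.0]])
scaler = MinMaxScaler()
scaled = scaler.fit_transform(values)
print(scaled.ravel())  # minimum -> 0.0, maximum -> 1.0
```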

Now we also add temp_minmax as a new column:

[18]:
hvac["MinMaxScaledTemp"] = temp_minmax[:, 0]
hvac["MinMaxScaledTemp"].head()
[18]:
0    0.12
1    0.52
2    0.72
3    0.32
4    0.76
Name: MinMaxScaledTemp, dtype: float64
[19]:
hvac.head()
[19]:
     Date     Time  TargetTemp  ActualTemp  System  SystemAge  BuildingID  10  ScaledTemp  MinMaxScaledTemp
0  6/1/13  0:00:01   66.000000          58      13  20.000000           4 NaN   -1.293272              0.12
1  6/2/13  1:00:01   67.507735          68       3  20.000000          17 NaN    0.048732              0.52
2  6/3/13  2:00:01   70.000000          73      17  20.000000          18 NaN    0.719733              0.72
3  6/4/13  3:00:01   67.000000          63       2  15.386643          15 NaN   -0.622270              0.32
4  6/5/13  4:00:01   68.000000          74      16   9.000000           3 NaN    0.853934              0.76