Data validation with Voluptuous (schema definitions)#
In this notebook we use Voluptuous to define schemas for our data. We can then use schema checking at various points in our cleanup to ensure that we meet the criteria. Finally, we can use schema checking exceptions to flag, set aside or remove impure or invalid data.
See also
1. Imports#
[1]:
import logging
from datetime import datetime
import pandas as pd
from voluptuous import ALLOW_EXTRA, All, Range, Required, Schema
from voluptuous.error import Invalid, MultipleInvalid
Required
marks the node of a schema as required and optionally specifies a default value, see also voluptuous.schema_builder.Required.Range
limits the value to a range where eithermin
ormax
can be omitted; see also voluptuous.validators.Range.ALL
is used for cross-field validations: checks the basic structure of the data in a first pass and only in the second pass the cross-field validation is applied; see also voluptuous.validators.All.ALLOW_EXTRA
allows additional dictionary keys.MultipleInvalid
is based onInvalid
, see also voluptuous.error.MultipleInvalid.Invalid
marks data as invalid, see also voluptuous.error.Invalid.
2. Logger#
[2]:
logger = logging.getLogger(0)
logger.setLevel(logging.WARNING)
3. Read sample data#
[3]:
sales = pd.read_csv(
"https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/sales_data.csv"
)
4. Examine data#
[4]:
sales.head()
[4]:
Unnamed: 0 | timestamp | city | store_id | sale_number | sale_amount | associate | |
---|---|---|---|---|---|---|---|
0 | 0 | 2018-09-10 05:00:45 | Williamburgh | 6 | 1530 | 1167.0 | Gary Lee |
1 | 1 | 2018-09-12 10:01:27 | Ibarraberg | 1 | 2744 | 258.0 | Daniel Davis |
2 | 2 | 2018-09-13 12:01:48 | Sarachester | 2 | 1908 | 266.0 | Michael Roth |
3 | 3 | 2018-09-14 20:02:19 | Caldwellbury | 14 | 771 | -108.0 | Michaela Stewart |
4 | 4 | 2018-09-16 01:03:21 | Erikaland | 11 | 1571 | -372.0 | Mark Taylor |
[5]:
sales.shape
[5]:
(213, 7)
[6]:
sales.dtypes
[6]:
Unnamed: 0 int64
timestamp object
city object
store_id int64
sale_number int64
sale_amount float64
associate object
dtype: object
5. Define schema#
In the column sale_amount
all values should be between 2.5 and 1450.99:
[7]:
schema = Schema(
{
Required("sale_amount"): All(float, Range(min=2.50, max=1450.99)),
},
extra=ALLOW_EXTRA,
)
To be able to use the elements of one column as keys and the elements of another column as values, we simply make the desired column the index of the DataFrame and transpose it with the function .T()
; see also pandas.DataFrame.transpose.
[8]:
error_count = 0
for s_id, sale in sales.T.to_dict().items():
try:
schema(sale)
except MultipleInvalid as e:
logging.warning(
"issue with sale: %s (%s) - %s", s_id, sale["sale_amount"], e
)
error_count += 1
WARNING:root:issue with sale: 3 (-108.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 4 (-372.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 5 (-399.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 6 (-304.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 7 (-295.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 10 (-89.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 13 (-303.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 15 (-432.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 19 (-177.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 20 (-154.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 22 (-130.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 23 (1487.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 25 (-145.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 28 (1471.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 31 (-259.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 38 (-241.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 40 (-4.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 41 (1581.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 45 (1529.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 46 (-238.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 48 (-284.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 51 (-164.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 55 (-184.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 56 (-304.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 59 (1579.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 60 (-455.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 63 (1551.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 65 (-397.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 69 (-400.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 70 (1482.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 71 (-321.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 74 (-47.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 76 (-68.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 86 (1454.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 101 (-213.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 103 (-144.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 104 (-265.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 107 (-349.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 111 (-78.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 112 (-310.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 116 (1570.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 120 (1490.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 123 (-179.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 124 (-391.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 129 (1504.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 130 (-91.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 132 (-372.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 141 (1512.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 142 (-449.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 149 (1494.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 152 (-405.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 155 (1599.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 156 (1527.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 157 (-462.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 162 (-358.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 164 (-78.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 167 (-358.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 171 (-391.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 178 (-304.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 180 (-9.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 187 (1475.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 194 (-433.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 195 (-329.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 196 (-147.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 203 (-319.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 206 (-132.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 207 (-20.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 209 (1539.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 211 (-167.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
[9]:
error_count
[9]:
69
Currently, however, we do not yet know whether
we have a wrongly defined schema
possibly negative values are returned or incorrectly marked
higher values are combined purchases or special sales
6. Adding a custom validation#
[10]:
def ValidDate(fmt="%Y-%m-%d %H:%M:%S"):
return lambda v: datetime.strptime(v, fmt)
[11]:
schema = Schema(
{
Required("timestamp"): All(ValidDate()),
},
extra=ALLOW_EXTRA,
)
[12]:
error_count = 0
for s_id, sale in sales.T.to_dict().items():
try:
schema(sale)
except MultipleInvalid as e:
logging.warning(
"issue with sale: %s (%s) - %s", s_id, sale["timestamp"], e
)
error_count += 1
[13]:
error_count
[13]:
0
7. Valid date structures are not yet valid dates#
[14]:
def ValidDate(fmt="%Y-%m-%d %H:%M:%S"):
def validation_func(v):
try:
assert datetime.strptime(v, fmt) <= datetime.now()
except AssertionError:
raise Invalid("date is in the future! %s" % v)
return validation_func
[15]:
schema = Schema(
{
Required("timestamp"): All(ValidDate()),
},
extra=ALLOW_EXTRA,
)
[16]:
error_count = 0
for s_id, sale in sales.T.to_dict().items():
try:
schema(sale)
except MultipleInvalid as e:
logging.warning(
"issue with sale: %s (%s) - %s", s_id, sale["timestamp"], e
)
error_count += 1
[17]:
error_count
[17]:
0