Datenvalidierung mit Voluptuous (Schemadefinitionen)¶
In diesem Notebook verwenden wir Voluptuous, um Schemata für unsere Daten zu definieren. Wir können dann die Schemaprüfung an verschiedenen Stellen unserer Bereinigung verwenden, um sicherzustellen, dass wir die Kriterien erfüllen. Schließllich können wir Ausnahmen für die Schemaüberprüfung verwenden, um unreine oder ungültige Daten zu markieren, beiseite zu legen oder zu entfernen.
Siehe auch
1. Importe¶
[1]:
import logging
from datetime import datetime
import pandas as pd
from voluptuous import ALLOW_EXTRA, All, Range, Required, Schema
from voluptuous.error import Invalid, MultipleInvalid
Required
markiert den Knoten eines Schemas als erforderlich und gibt optional einen Standardwert an, siehe auch voluptuous.schema_builder.Required.Range
begrenzt den Wert auf einen Bereich, wobei entwedermin
odermax
weggelassen werden kann; siehe auch voluptuous.validators.Range.ALL
wird für feldübergreifende Validierungen verwendet: prüft die Grundstruktur der Daten in einem ersten Durchgang und erst im zweiten Durchgang wird die feldübergreifende Validierung angewendet; siehe auch voluptuous.validators.All.ALLOW_EXTRA
erlaubt zusätzliche WörterbuchschlüsselMultipleInvalid
basiert aufInvalid
, siehe auch voluptuous.error.MultipleInvalid.Invalid
kennzeichnet Daten als ungültig, siehe auch voluptuous.error.Invalid.
2. Logger¶
[2]:
logger = logging.getLogger(0)
logger.setLevel(logging.WARNING)
3. Beispieldaten lesen¶
[3]:
sales = pd.read_csv('https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/sales_data.csv')
4. Daten untersuchen¶
[4]:
sales.head()
[4]:
Unnamed: 0 | timestamp | city | store_id | sale_number | sale_amount | associate | |
---|---|---|---|---|---|---|---|
0 | 0 | 2018-09-10 05:00:45 | Williamburgh | 6 | 1530 | 1167.0 | Gary Lee |
1 | 1 | 2018-09-12 10:01:27 | Ibarraberg | 1 | 2744 | 258.0 | Daniel Davis |
2 | 2 | 2018-09-13 12:01:48 | Sarachester | 2 | 1908 | 266.0 | Michael Roth |
3 | 3 | 2018-09-14 20:02:19 | Caldwellbury | 14 | 771 | -108.0 | Michaela Stewart |
4 | 4 | 2018-09-16 01:03:21 | Erikaland | 11 | 1571 | -372.0 | Mark Taylor |
[5]:
sales.dtypes
[5]:
Unnamed: 0 int64
timestamp object
city object
store_id int64
sale_number int64
sale_amount float64
associate object
dtype: object
5. Schema definieren¶
In der Spalte sale_amount
sollen alle Werte zwischen 2,5 and 1450,99 sein.
[6]:
schema = Schema(
{
Required("sale_amount"): All(float, Range(min=2.50, max=1450.99)),
},
extra=ALLOW_EXTRA,
)
[7]:
error_count = 0
for s_id, sale in sales.T.to_dict().items():
try:
schema(sale)
except MultipleInvalid as e:
logging.warning(
"issue with sale: %s (%s) - %s", s_id, sale["sale_amount"], e
)
error_count += 1
WARNING:root:issue with sale: 3 (-108.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 4 (-372.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 5 (-399.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 6 (-304.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 7 (-295.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 10 (-89.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 13 (-303.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 15 (-432.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 19 (-177.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 20 (-154.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 22 (-130.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 23 (1487.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 25 (-145.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 28 (1471.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 31 (-259.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 38 (-241.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 40 (-4.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 41 (1581.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 45 (1529.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 46 (-238.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 48 (-284.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 51 (-164.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 55 (-184.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 56 (-304.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 59 (1579.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 60 (-455.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 63 (1551.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 65 (-397.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 69 (-400.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 70 (1482.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 71 (-321.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 74 (-47.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 76 (-68.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 86 (1454.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 101 (-213.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 103 (-144.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 104 (-265.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 107 (-349.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 111 (-78.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 112 (-310.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 116 (1570.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 120 (1490.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 123 (-179.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 124 (-391.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 129 (1504.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 130 (-91.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 132 (-372.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 141 (1512.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 142 (-449.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 149 (1494.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 152 (-405.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 155 (1599.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 156 (1527.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 157 (-462.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 162 (-358.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 164 (-78.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 167 (-358.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 171 (-391.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 178 (-304.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 180 (-9.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 187 (1475.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 194 (-433.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 195 (-329.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 196 (-147.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 203 (-319.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 206 (-132.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 207 (-20.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 209 (1539.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 211 (-167.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
Um die Elemente einer Spalte als Schlüssel und die Elemente einer anderen Spalte als Werte verwenden zu können, machen wir einfach die gewünschte Spalte zum Index des DataFrame und transponieren sie mit der Funktion .T()
; siehe auch pandas.DataFrame.transpose.
[8]:
error_count
[8]:
69
Aktuell wissen wir jedoch noch nicht, ob
wir ein falsch definiertes Schema haben
möglicherweise negative Werte zurückgegeben oder falsch markiert werden
höhere Werte kombinierte Einkäufe oder Sonderverkäufe sind
6. Hinzufügen einer benutzerdefinierten Validierung¶
[9]:
def ValidDate(fmt="%Y-%m-%d %H:%M:%S"):
return lambda v: datetime.strptime(v, fmt)
[10]:
schema = Schema(
{
Required("timestamp"): All(ValidDate()),
},
extra=ALLOW_EXTRA,
)
[11]:
error_count = 0
for s_id, sale in sales.T.to_dict().items():
try:
schema(sale)
except MultipleInvalid as e:
logging.warning(
"issue with sale: %s (%s) - %s", s_id, sale["timestamp"], e
)
error_count += 1
[12]:
error_count
[12]:
0
7. Gültige Datumsstrukturen sind noch keine gültigen Daten¶
[13]:
def ValidDate(fmt="%Y-%m-%d %H:%M:%S"):
def validation_func(v):
try:
assert datetime.strptime(v, fmt) <= datetime.now()
except AssertionError:
raise Invalid("date is in the future! %s" % v)
return validation_func
[14]:
schema = Schema(
{
Required("timestamp"): All(ValidDate()),
},
extra=ALLOW_EXTRA,
)
[15]:
error_count = 0
for s_id, sale in sales.T.to_dict().items():
try:
schema(sale)
except MultipleInvalid as e:
logging.warning(
"issue with sale: %s (%s) - %s", s_id, sale["timestamp"], e
)
error_count += 1
[16]:
error_count
[16]:
0