TDDA: Test-Driven Data Analysis#
TDDA uses file inputs (such as NumPy arrays or Pandas DataFrames) and a set of constraints that are stored as a JSON file.
Reference Tests supports the creation of reference tests based on either unittest or pytest.
Constraints is used to retrieve constraints from a (pandas) DataFrame, write them out as JSON and check whether records satisfy the constraints in the constraints file. It also supports tables in a variety of relational databases.
Rexpy is a tool for automatically deriving regular expressions from a column in a pandas DataFrame or from a (Python) list of examples.
1. Imports#
[1]:
import numpy as np
import pandas as pd
from tdda.constraints import discover_df, verify_df
[2]:
df = pd.read_csv(
"https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example.csv"
)
2. Check data#
With pandas.DataFrame.sample we display ten random data sets:
[3]:
df.sample(10)
[3]:
timestamp | username | temperature | heartrate | build | latest | note | |
---|---|---|---|---|---|---|---|
34835 | 2017-01-15T10:01:00 | jwilliams | 6 | 89 | 8aee9cf7-84c6-6935-22ff-9b034d9aa1f4 | 0 | update |
120063 | 2017-02-18T11:25:06 | ericliu | 28 | 77 | c1179197-abcc-ee64-2851-cb2ed21baa1a | 0 | wake |
130072 | 2017-02-22T11:12:22 | lee99 | 9 | 79 | 184d2848-9367-71cb-be72-6bbd57074857 | 0 | NaN |
79530 | 2017-02-02T07:23:17 | jamie46 | 5 | 82 | 1c56c556-2ba0-11fb-5a27-29289487b748 | 1 | wake |
53587 | 2017-01-22T22:02:38 | daniellemacias | 26 | 63 | acd9a855-077c-dda7-c73f-7621f3179f17 | 0 | sleep |
112089 | 2017-02-15T07:04:50 | carlosyoung | 23 | 63 | 71dfc6de-147e-00f1-da89-6e4489a33aba | 0 | user |
91218 | 2017-02-06T23:19:13 | gutierreznathan | 11 | 72 | 166e2a51-ae08-bd1f-3cee-3c65a0d5740b | 0 | NaN |
105807 | 2017-02-12T18:57:19 | gutierrezashley | 28 | 72 | 9b1984e4-a620-68f6-c639-2db7774fe27c | 0 | user |
51928 | 2017-01-22T05:58:03 | uschwartz | 29 | 81 | 54bf638e-68e1-9514-02df-acbc5417443a | 0 | user |
70436 | 2017-01-29T15:52:55 | crystalunderwood | 23 | 82 | 8f003e04-509d-e412-0979-0c9f9526f1e8 | 1 | NaN |
And with pandas.DataFrame.dtypes we display the data types for the individual columns:
[4]:
df.dtypes
[4]:
timestamp object
username object
temperature int64
heartrate int64
build object
latest int64
note object
dtype: object
3. Creating a constraints object#
With discover_constraints
a constraints object can be created.
[5]:
constraints = discover_df(df)
[6]:
constraints
[6]:
<tdda.constraints.base.DatasetConstraints at 0x161ac2c50>
[7]:
constraints.fields
[7]:
Fields([('timestamp', <tdda.constraints.base.FieldConstraints at 0x161ac1350>),
('username', <tdda.constraints.base.FieldConstraints at 0x161ac1650>),
('temperature',
<tdda.constraints.base.FieldConstraints at 0x136bc2790>),
('heartrate', <tdda.constraints.base.FieldConstraints at 0x161ac1f10>),
('build', <tdda.constraints.base.FieldConstraints at 0x161ac2350>),
('latest', <tdda.constraints.base.FieldConstraints at 0x161ac2990>),
('note', <tdda.constraints.base.FieldConstraints at 0x161ac2d50>)])
4. Writing the constraints into a file#
[8]:
with open("../../data/ignore-iot_constraints.tdda", "w") as f:
f.write(constraints.to_json())
If we take a closer look at the file, we can see that, for example, a string with 19 characters is expected for the timestamp
column and temperature
expects integers with values from 5-29.
[9]:
!cat ../../data/ignore-iot_constraints.tdda
{
"creation_metadata": {
"local_time": "2023-07-26 18:18:37",
"utc_time": "2023-07-26 16:16:37",
"creator": "TDDA 2.0.09",
"host": "fay.local",
"user": "veit",
"n_records": 146397,
"n_selected": 146397
},
"fields": {
"timestamp": {
"type": "string",
"min_length": 19,
"max_length": 19,
"max_nulls": 0,
"no_duplicates": true
},
"username": {
"type": "string",
"min_length": 3,
"max_length": 21,
"max_nulls": 0
},
"temperature": {
"type": "int",
"min": 5,
"max": 29,
"sign": "positive",
"max_nulls": 0
},
"heartrate": {
"type": "int",
"min": 60,
"max": 89,
"sign": "positive",
"max_nulls": 0
},
"build": {
"type": "string",
"min_length": 36,
"max_length": 36,
"max_nulls": 0,
"no_duplicates": true
},
"latest": {
"type": "int",
"min": 0,
"max": 1,
"sign": "non-negative",
"max_nulls": 0
},
"note": {
"type": "string",
"min_length": 4,
"max_length": 8,
"allowed_values": [
"interval",
"sleep",
"test",
"update",
"user",
"wake"
]
}
}
}
5. Checking data frames#
To do this, we first read in a new csv file with pandas and then have ten data records output as examples:
[10]:
new_df = pd.read_csv(
"https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example_with_nulls.csv"
)
new_df.sample(10)
[10]:
timestamp | username | temperature | heartrate | build | latest | note | |
---|---|---|---|---|---|---|---|
985 | 2017-01-01T21:31:32 | qmartinez | 9.0 | 85 | 76f24b19-9d9e-17b8-4d02-c6e6f00e9f13 | 1.0 | NaN |
102620 | 2017-02-11T12:30:15 | laurenwilliams | 19.0 | 85 | 5ea80d05-56b2-5632-8480-cf3ef40d34a4 | 0.0 | user |
59859 | 2017-01-25T10:06:55 | xwright | 22.0 | 66 | NaN | 0.0 | test |
110018 | 2017-02-14T11:13:49 | wibarra | NaN | 68 | 9d04e55f-c9e1-2ab9-666b-9de0f739431a | 0.0 | test |
1736 | 2017-01-02T04:47:11 | joshuaperez | NaN | 79 | 3d320343-34f9-bf79-ec39-aaafa061c39c | 1.0 | NaN |
85078 | 2017-02-04T12:37:57 | oaustin | 23.0 | 63 | ccf940cd-9b59-e444-1a68-5f7f7590d5db | 0.0 | NaN |
145979 | 2017-02-28T20:00:11 | hholder | 21.0 | 77 | NaN | 0.0 | NaN |
37789 | 2017-01-16T14:12:06 | kennethdavis | 8.0 | 62 | 9bc9bc15-bf2a-7098-bf85-73fa42e30df2 | 0.0 | user |
114849 | 2017-02-16T09:27:34 | bobby64 | 20.0 | 67 | 2c5dc517-d725-bc66-7d6d-25716025476a | 0.0 | sleep |
107275 | 2017-02-13T09:05:58 | okirby | 29.0 | 64 | NaN | 1.0 | interval |
We see several fields that are output as NaN
. Now, to analyse this systematically, we apply verify_df to our new DataFrame. Here, passes
returns the number of passed constraints, and failures
returns the number of failed constraints.
[11]:
v = verify_df(new_df, '../../data/ignore-iot_constraints.tdda')
[12]:
v
[12]:
<tdda.constraints.pd.constraints.PandasVerification at 0x1636e7450>
[13]:
v.passes
[13]:
30
[14]:
v.failures
[14]:
3
We can also display which constraints passed and failed in which columns:
[15]:
print(str(v))
FIELDS:
timestamp: 0 failures 5 passes type ✓ min_length ✓ max_length ✓ max_nulls ✓ no_duplicates ✓
username: 0 failures 4 passes type ✓ min_length ✓ max_length ✓ max_nulls ✓
temperature: 1 failure 4 passes type ✓ min ✓ max ✓ sign ✓ max_nulls ✗
heartrate: 0 failures 5 passes type ✓ min ✓ max ✓ sign ✓ max_nulls ✓
build: 1 failure 4 passes type ✓ min_length ✓ max_length ✓ max_nulls ✗ no_duplicates ✓
latest: 1 failure 4 passes type ✓ min ✓ max ✓ sign ✓ max_nulls ✗
note: 0 failures 4 passes type ✓ min_length ✓ max_length ✓ allowed_values ✓
SUMMARY:
Constraints passing: 30
Constraints failing: 3
Alternatively, we can also display these results in tabular form:
[16]:
v.to_frame()
[16]:
field | failures | passes | type | min | min_length | max | max_length | sign | max_nulls | no_duplicates | allowed_values | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | timestamp | 0 | 5 | True | NaN | True | NaN | True | NaN | True | True | NaN |
1 | username | 0 | 4 | True | NaN | True | NaN | True | NaN | True | NaN | NaN |
2 | temperature | 1 | 4 | True | True | NaN | True | NaN | True | False | NaN | NaN |
3 | heartrate | 0 | 5 | True | True | NaN | True | NaN | True | True | NaN | NaN |
4 | build | 1 | 4 | True | NaN | True | NaN | True | NaN | False | True | NaN |
5 | latest | 1 | 4 | True | True | NaN | True | NaN | True | False | NaN | NaN |
6 | note | 0 | 4 | True | NaN | True | NaN | True | NaN | NaN | NaN | True |