TDDA: Test-Driven Data Analysis¶
TDDA uses file inputs (such as NumPy arrays or Pandas DataFrames) and a set of constraints that are stored as a JSON file.
Reference Tests supports the creation of reference tests based on either unittest or pytest.
Constraints is used to retrieve constraints from a (pandas) DataFrame, write them out as JSON and check whether records satisfy the constraints in the constraints file. It also supports tables in a variety of relational databases.
Rexpy is a tool for automatically deriving regular expressions from a column in a pandas DataFrame or from a (Python) list of examples.
1. Imports¶
[1]:
from pathlib import Path
import pandas as pd
from tdda.constraints import detect_df, discover_df, verify_df
[2]:
df = pd.read_csv(
"https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example.csv",
)
2. Check data¶
With pandas.DataFrame.sample we display ten random data sets:
[3]:
df.sample(10)
[3]:
| timestamp | username | temperature | heartrate | build | latest | note | |
|---|---|---|---|---|---|---|---|
| 70075 | 2017-01-29T12:23:30 | mary09 | 28 | 86 | 84e04441-1d5e-2c42-5f69-88e6695dadd7 | 0 | sleep |
| 96161 | 2017-02-08T22:47:44 | rickycombs | 23 | 69 | 16b49999-ae32-7642-7561-a29cd4c4b4d1 | 0 | NaN |
| 97532 | 2017-02-09T11:58:38 | pamelawilson | 27 | 75 | dccf0c7d-2438-9a2b-ed38-970421a8f072 | 0 | user |
| 60507 | 2017-01-25T16:18:01 | iblankenship | 11 | 78 | 6348eb20-f0b4-7c08-2c86-c48d442eb2ad | 0 | sleep |
| 136239 | 2017-02-24T22:27:19 | nathan51 | 29 | 86 | 758ea74b-a8ea-17fd-0b82-4431c593bc3e | 0 | NaN |
| 49838 | 2017-01-21T10:02:26 | rachel04 | 18 | 78 | d3ccbe78-60b0-bf3f-c312-83652b2ad740 | 0 | sleep |
| 116026 | 2017-02-16T20:47:21 | kimberlytaylor | 18 | 84 | ddef2b34-b211-463e-9a74-6a4dfb327001 | 0 | sleep |
| 30452 | 2017-01-13T15:53:55 | adam57 | 8 | 81 | 30625faf-f5b3-44d3-129d-2ddd149c9f69 | 1 | wake |
| 143952 | 2017-02-28T00:40:19 | breynolds | 16 | 67 | 3f7785f3-3f63-df31-b6c0-97948c1194ee | 0 | interval |
| 138200 | 2017-02-25T17:25:19 | timothyrodriguez | 20 | 82 | 99ff17ba-f7dd-d3ad-1545-ae4392a4bd87 | 0 | user |
And with pandas.DataFrame.dtypes we display the data types for the individual columns:
[4]:
df.dtypes
[4]:
timestamp object
username object
temperature int64
heartrate int64
build object
latest int64
note object
dtype: object
3. Creating a constraints object¶
With discover_constraints a constraints object can be created.
[5]:
constraints = discover_df(df)
[6]:
constraints
[6]:
<tdda.constraints.base.DatasetConstraints at 0x116452270>
[7]:
constraints.fields
[7]:
Fields([('timestamp', <tdda.constraints.base.FieldConstraints at 0x116451a90>),
('username', <tdda.constraints.base.FieldConstraints at 0x116386350>),
('temperature',
<tdda.constraints.base.FieldConstraints at 0x116386c10>),
('heartrate', <tdda.constraints.base.FieldConstraints at 0x1377ac510>),
('build', <tdda.constraints.base.FieldConstraints at 0x1377ac8a0>),
('latest', <tdda.constraints.base.FieldConstraints at 0x1377b4a70>),
('note', <tdda.constraints.base.FieldConstraints at 0x11641d9d0>)])
4. Writing the constraints into a file¶
[8]:
with Path.open("../../data/iot_example.json", "w") as f:
f.write(constraints.to_json())
If we take a closer look at the file, we can see that, for example, a string with 19 characters is expected for the timestamp column and temperature expects integers with values from 5-29.
[9]:
!cat ../../data/iot_example.json
{
"creation_metadata": {
"local_time": "2026-05-20T13:25:07",
"utc_time": "2026-05-20T11:25:07",
"creator": "TDDA 2.2.05",
"host": "fay.local",
"user": "veit",
"n_records": 146397,
"n_selected": 146397
},
"fields": {
"timestamp": {
"type": "string",
"min_length": 19,
"max_length": 19,
"max_nulls": 0,
"no_duplicates": true
},
"username": {
"type": "string",
"min_length": 3,
"max_length": 21,
"max_nulls": 0
},
"temperature": {
"type": "int",
"min": 5,
"max": 29,
"sign": "positive",
"max_nulls": 0
},
"heartrate": {
"type": "int",
"min": 60,
"max": 89,
"sign": "positive",
"max_nulls": 0
},
"build": {
"type": "string",
"min_length": 36,
"max_length": 36,
"max_nulls": 0,
"no_duplicates": true
},
"latest": {
"type": "int",
"min": 0,
"max": 1,
"sign": "non-negative",
"max_nulls": 0
},
"note": {
"type": "string",
"min_length": 4,
"max_length": 8,
"allowed_values": [
"interval",
"sleep",
"test",
"update",
"user",
"wake"
]
}
}
}
5. Checking data frames¶
To do this, we first read in a new csv file with pandas and then have ten data records output as examples:
[10]:
new_df = pd.read_csv(
"https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example_with_nulls.csv",
)
new_df.sample(10)
[10]:
| timestamp | username | temperature | heartrate | build | latest | note | |
|---|---|---|---|---|---|---|---|
| 43886 | 2017-01-19T00:54:05 | qknight | 12.0 | 65 | 0ced430f-35af-3792-7c5b-0ad5a876f947 | 1.0 | user |
| 18751 | 2017-01-08T23:43:53 | hrojas | NaN | 88 | NaN | 1.0 | NaN |
| 19839 | 2017-01-09T10:08:46 | thernandez | NaN | 84 | 406e8b71-aa5e-883c-2cc6-5f320f81ec41 | 1.0 | wake |
| 11748 | 2017-01-06T04:41:15 | fdavis | 28.0 | 79 | 9f24681a-754a-f9a2-da4e-9411722675f0 | 0.0 | NaN |
| 30912 | 2017-01-13T20:20:59 | jordandustin | NaN | 86 | 64439e81-1560-73f2-0bea-099af9e979c7 | 1.0 | interval |
| 102427 | 2017-02-11T10:39:43 | wallsamantha | NaN | 61 | NaN | 0.0 | wake |
| 64788 | 2017-01-27T09:38:51 | smithshelly | NaN | 67 | 39ec9837-23b5-418e-ab70-580322b3cd10 | NaN | update |
| 81275 | 2017-02-03T00:09:03 | alisonhouse | 26.0 | 67 | 1e453b86-b292-81ae-4bdf-a25897ff1234 | NaN | test |
| 102743 | 2017-02-11T13:41:40 | beardjohn | 14.0 | 88 | e003517c-c5bd-c1c5-d326-b9ebd762d2ae | 1.0 | test |
| 104712 | 2017-02-12T08:33:52 | nescobar | 26.0 | 62 | b705897f-206e-9cfb-4ced-75d7179732e1 | 0.0 | wake |
We see several fields that are output as NaN. Now, to analyse this systematically, we apply verify_df to our new DataFrame. Here, passes returns the number of passed constraints, and failures returns the number of failed constraints.
[11]:
v = verify_df(new_df, "../../data/iot_example.json")
[12]:
v
[12]:
<tdda.constraints.pd.constraints.PandasVerification at 0x116453230>
[13]:
v.passes
[13]:
30
[14]:
v.failures
[14]:
3
We can also display which constraints passed and failed in which columns:
[15]:
print(str(v))
FIELDS:
timestamp: 0 failures 5 passes type ✓ min_length ✓ max_length ✓ max_nulls ✓ no_duplicates ✓
username: 0 failures 4 passes type ✓ min_length ✓ max_length ✓ max_nulls ✓
temperature: 1 failure 4 passes type ✓ min ✓ max ✓ sign ✓ max_nulls ✗
heartrate: 0 failures 5 passes type ✓ min ✓ max ✓ sign ✓ max_nulls ✓
build: 1 failure 4 passes type ✓ min_length ✓ max_length ✓ max_nulls ✗ no_duplicates ✓
latest: 1 failure 4 passes type ✓ min ✓ max ✓ sign ✓ max_nulls ✗
note: 0 failures 4 passes type ✓ min_length ✓ max_length ✓ allowed_values ✓
SUMMARY:
Constraints passing: 30
Constraints failing: 3
Alternatively, we can also display these results in tabular form:
[16]:
v.to_frame()
[16]:
| field | failures | passes | type | min | min_length | max | max_length | sign | max_nulls | no_duplicates | allowed_values | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | timestamp | 0 | 5 | True | NaN | True | NaN | True | NaN | True | True | NaN |
| 1 | username | 0 | 4 | True | NaN | True | NaN | True | NaN | True | NaN | NaN |
| 2 | temperature | 1 | 4 | True | True | NaN | True | NaN | True | False | NaN | NaN |
| 3 | heartrate | 0 | 5 | True | True | NaN | True | NaN | True | True | NaN | NaN |
| 4 | build | 1 | 4 | True | NaN | True | NaN | True | NaN | False | True | NaN |
| 5 | latest | 1 | 4 | True | True | NaN | True | NaN | True | False | NaN | NaN |
| 6 | note | 0 | 4 | True | NaN | True | NaN | True | NaN | NaN | NaN | True |
6. Finding the faulty rows¶
tdda.constraints.pd.constraints.detect_df() detects records in the pandas DataFrame that violate one of the constraints in the provided JSON file. We can then call the detected() function on the created PandasDetection object to output the rows that are faulty:
[17]:
d = detect_df(new_df, "iot_example.json")
d.detected()
[17]:
| n_failures | |
|---|---|
| Index | |
| 3 | 1 |
| 4 | 1 |
| 7 | 1 |
| 10 | 2 |
| 12 | 1 |
| ... | ... |
| 146385 | 1 |
| 146387 | 2 |
| 146391 | 2 |
| 146393 | 2 |
| 146394 | 1 |
77260 rows × 1 columns
We can display all incorrect data records by using only the part of the index from new_df that also appears in d.detected():
[18]:
d_index = d.detected().index
[19]:
new_df[new_df.index.isin(d_index)]
[19]:
| timestamp | username | temperature | heartrate | build | latest | note | |
|---|---|---|---|---|---|---|---|
| 3 | 2017-01-01T12:02:09 | eddierodriguez | 28.0 | 76 | NaN | 0.0 | update |
| 4 | 2017-01-01T12:02:36 | kenneth94 | 29.0 | 62 | 122f1c6a-403c-2221-6ed1-b5caa08f11e0 | NaN | NaN |
| 7 | 2017-01-01T12:04:35 | scott28 | 16.0 | 76 | 7a60219f-6621-e548-180e-ca69624f9824 | NaN | interval |
| 10 | 2017-01-01T12:06:21 | njohnson | NaN | 63 | e09b6001-125d-51cf-9c3f-9cb686c19d02 | NaN | NaN |
| 12 | 2017-01-01T12:07:41 | jessica48 | 22.0 | 83 | 03e1a07b-3e14-412c-3a69-6b45bc79f81c | NaN | update |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 146385 | 2017-02-28T23:53:59 | powelleric | 20.0 | 86 | 152eda10-676a-069c-b664-19443f2c8081 | NaN | test |
| 146387 | 2017-02-28T23:54:50 | jthompson | NaN | 66 | 8da10303-fe49-e313-8fda-0d5e79ded054 | NaN | update |
| 146391 | 2017-02-28T23:57:21 | aaronbecker | NaN | 87 | 7e52f4a8-345c-5ee0-e515-b8c392213062 | NaN | sleep |
| 146393 | 2017-02-28T23:58:43 | joelrusso | NaN | 89 | NaN | 0.0 | NaN |
| 146394 | 2017-02-28T23:59:23 | lellis | NaN | 84 | dac87426-e147-9c39-6e4c-790bb11f8fc9 | 0.0 | update |
77260 rows × 7 columns
[20]:
new_df[~new_df.index.isin(d_index)]
[20]:
| timestamp | username | temperature | heartrate | build | latest | note | |
|---|---|---|---|---|---|---|---|
| 0 | 2017-01-01T12:00:23 | michaelsmith | 12.0 | 67 | 4e6a7805-8faa-2768-6ef6-eb3198b483ac | 0.0 | interval |
| 1 | 2017-01-01T12:01:09 | kharrison | 6.0 | 78 | 7256b7b0-e502-f576-62ec-ed73533c9c84 | 0.0 | wake |
| 2 | 2017-01-01T12:01:34 | smithadam | 5.0 | 89 | 9226c94b-bb4b-a6c8-8e02-cb42b53e9c90 | 0.0 | NaN |
| 5 | 2017-01-01T12:03:04 | bryanttodd | 13.0 | 86 | 0897dbe5-9c5b-71ca-73a1-7586959ca198 | 0.0 | interval |
| 6 | 2017-01-01T12:03:51 | andrea98 | 17.0 | 81 | 1c07ab9b-5f66-137d-a74f-921a41001f4e | 1.0 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 146389 | 2017-02-28T23:56:05 | kathy63 | 5.0 | 88 | c2f76050-abd4-aee4-7bc0-3498325d0573 | 0.0 | NaN |
| 146390 | 2017-02-28T23:56:34 | cookallison | 16.0 | 84 | f0b0c1f9-900b-276c-bca9-ac4d4ec4e88e | 0.0 | user |
| 146392 | 2017-02-28T23:58:06 | mcontreras | 15.0 | 63 | 69e61a15-d2d0-47a7-1a27-e07b3eeeba10 | 0.0 | NaN |
| 146395 | 2017-02-28T23:59:48 | grayjasmin | 17.0 | 64 | 4911a589-3a15-4bbf-1de1-e5a69ab739da | 1.0 | update |
| 146396 | 2017-03-01T00:00:30 | jgreene | 23.0 | 70 | 4f95bbca-26a7-29e7-1f19-aaedf1a51741 | 0.0 | interval |
69137 rows × 7 columns