TDDA: Test-Driven Data Analysis

TDDA uses file inputs (such as NumPy arrays or Pandas DataFrames) and a set of constraints that are stored as a JSON file.

  • Reference Tests supports the creation of reference tests based on either unittest or pytest.

  • Constraints is used to retrieve constraints from a (pandas) DataFrame, write them out as JSON and check whether records satisfy the constraints in the constraints file. It also supports tables in a variety of relational databases.

  • Rexpy is a tool for automatically deriving regular expressions from a column in a pandas DataFrame or from a (Python) list of examples.

1. Imports

[1]:
from pathlib import Path

import pandas as pd

from tdda.constraints import detect_df, discover_df, verify_df
[2]:
df = pd.read_csv(
    "https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example.csv",
)

2. Check data

With pandas.DataFrame.sample we display ten random data sets:

[3]:
df.sample(10)
[3]:
timestamp username temperature heartrate build latest note
70075 2017-01-29T12:23:30 mary09 28 86 84e04441-1d5e-2c42-5f69-88e6695dadd7 0 sleep
96161 2017-02-08T22:47:44 rickycombs 23 69 16b49999-ae32-7642-7561-a29cd4c4b4d1 0 NaN
97532 2017-02-09T11:58:38 pamelawilson 27 75 dccf0c7d-2438-9a2b-ed38-970421a8f072 0 user
60507 2017-01-25T16:18:01 iblankenship 11 78 6348eb20-f0b4-7c08-2c86-c48d442eb2ad 0 sleep
136239 2017-02-24T22:27:19 nathan51 29 86 758ea74b-a8ea-17fd-0b82-4431c593bc3e 0 NaN
49838 2017-01-21T10:02:26 rachel04 18 78 d3ccbe78-60b0-bf3f-c312-83652b2ad740 0 sleep
116026 2017-02-16T20:47:21 kimberlytaylor 18 84 ddef2b34-b211-463e-9a74-6a4dfb327001 0 sleep
30452 2017-01-13T15:53:55 adam57 8 81 30625faf-f5b3-44d3-129d-2ddd149c9f69 1 wake
143952 2017-02-28T00:40:19 breynolds 16 67 3f7785f3-3f63-df31-b6c0-97948c1194ee 0 interval
138200 2017-02-25T17:25:19 timothyrodriguez 20 82 99ff17ba-f7dd-d3ad-1545-ae4392a4bd87 0 user

And with pandas.DataFrame.dtypes we display the data types for the individual columns:

[4]:
df.dtypes
[4]:
timestamp      object
username       object
temperature     int64
heartrate       int64
build          object
latest          int64
note           object
dtype: object

3. Creating a constraints object

With discover_constraints a constraints object can be created.

[5]:
constraints = discover_df(df)
[6]:
constraints
[6]:
<tdda.constraints.base.DatasetConstraints at 0x116452270>
[7]:
constraints.fields
[7]:
Fields([('timestamp', <tdda.constraints.base.FieldConstraints at 0x116451a90>),
        ('username', <tdda.constraints.base.FieldConstraints at 0x116386350>),
        ('temperature',
         <tdda.constraints.base.FieldConstraints at 0x116386c10>),
        ('heartrate', <tdda.constraints.base.FieldConstraints at 0x1377ac510>),
        ('build', <tdda.constraints.base.FieldConstraints at 0x1377ac8a0>),
        ('latest', <tdda.constraints.base.FieldConstraints at 0x1377b4a70>),
        ('note', <tdda.constraints.base.FieldConstraints at 0x11641d9d0>)])

4. Writing the constraints into a file

[8]:
with Path.open("../../data/iot_example.json", "w") as f:
    f.write(constraints.to_json())

If we take a closer look at the file, we can see that, for example, a string with 19 characters is expected for the timestamp column and temperature expects integers with values from 5-29.

[9]:
!cat ../../data/iot_example.json
{
    "creation_metadata": {
        "local_time": "2026-05-20T13:25:07",
        "utc_time": "2026-05-20T11:25:07",
        "creator": "TDDA 2.2.05",
        "host": "fay.local",
        "user": "veit",
        "n_records": 146397,
        "n_selected": 146397
    },
    "fields": {
        "timestamp": {
            "type": "string",
            "min_length": 19,
            "max_length": 19,
            "max_nulls": 0,
            "no_duplicates": true
        },
        "username": {
            "type": "string",
            "min_length": 3,
            "max_length": 21,
            "max_nulls": 0
        },
        "temperature": {
            "type": "int",
            "min": 5,
            "max": 29,
            "sign": "positive",
            "max_nulls": 0
        },
        "heartrate": {
            "type": "int",
            "min": 60,
            "max": 89,
            "sign": "positive",
            "max_nulls": 0
        },
        "build": {
            "type": "string",
            "min_length": 36,
            "max_length": 36,
            "max_nulls": 0,
            "no_duplicates": true
        },
        "latest": {
            "type": "int",
            "min": 0,
            "max": 1,
            "sign": "non-negative",
            "max_nulls": 0
        },
        "note": {
            "type": "string",
            "min_length": 4,
            "max_length": 8,
            "allowed_values": [
                "interval",
                "sleep",
                "test",
                "update",
                "user",
                "wake"
            ]
        }
    }
}

5. Checking data frames

To do this, we first read in a new csv file with pandas and then have ten data records output as examples:

[10]:
new_df = pd.read_csv(
    "https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example_with_nulls.csv",
)
new_df.sample(10)
[10]:
timestamp username temperature heartrate build latest note
43886 2017-01-19T00:54:05 qknight 12.0 65 0ced430f-35af-3792-7c5b-0ad5a876f947 1.0 user
18751 2017-01-08T23:43:53 hrojas NaN 88 NaN 1.0 NaN
19839 2017-01-09T10:08:46 thernandez NaN 84 406e8b71-aa5e-883c-2cc6-5f320f81ec41 1.0 wake
11748 2017-01-06T04:41:15 fdavis 28.0 79 9f24681a-754a-f9a2-da4e-9411722675f0 0.0 NaN
30912 2017-01-13T20:20:59 jordandustin NaN 86 64439e81-1560-73f2-0bea-099af9e979c7 1.0 interval
102427 2017-02-11T10:39:43 wallsamantha NaN 61 NaN 0.0 wake
64788 2017-01-27T09:38:51 smithshelly NaN 67 39ec9837-23b5-418e-ab70-580322b3cd10 NaN update
81275 2017-02-03T00:09:03 alisonhouse 26.0 67 1e453b86-b292-81ae-4bdf-a25897ff1234 NaN test
102743 2017-02-11T13:41:40 beardjohn 14.0 88 e003517c-c5bd-c1c5-d326-b9ebd762d2ae 1.0 test
104712 2017-02-12T08:33:52 nescobar 26.0 62 b705897f-206e-9cfb-4ced-75d7179732e1 0.0 wake

We see several fields that are output as NaN. Now, to analyse this systematically, we apply verify_df to our new DataFrame. Here, passes returns the number of passed constraints, and failures returns the number of failed constraints.

[11]:
v = verify_df(new_df, "../../data/iot_example.json")
[12]:
v
[12]:
<tdda.constraints.pd.constraints.PandasVerification at 0x116453230>
[13]:
v.passes
[13]:
30
[14]:
v.failures
[14]:
3

We can also display which constraints passed and failed in which columns:

[15]:
print(str(v))
FIELDS:

timestamp: 0 failures  5 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  no_duplicates ✓

username: 0 failures  4 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓

temperature: 1 failure  4 passes  type ✓  min ✓  max ✓  sign ✓  max_nulls ✗

heartrate: 0 failures  5 passes  type ✓  min ✓  max ✓  sign ✓  max_nulls ✓

build: 1 failure  4 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✗  no_duplicates ✓

latest: 1 failure  4 passes  type ✓  min ✓  max ✓  sign ✓  max_nulls ✗

note: 0 failures  4 passes  type ✓  min_length ✓  max_length ✓  allowed_values ✓

SUMMARY:

Constraints passing: 30
Constraints failing: 3

Alternatively, we can also display these results in tabular form:

[16]:
v.to_frame()
[16]:
field failures passes type min min_length max max_length sign max_nulls no_duplicates allowed_values
0 timestamp 0 5 True NaN True NaN True NaN True True NaN
1 username 0 4 True NaN True NaN True NaN True NaN NaN
2 temperature 1 4 True True NaN True NaN True False NaN NaN
3 heartrate 0 5 True True NaN True NaN True True NaN NaN
4 build 1 4 True NaN True NaN True NaN False True NaN
5 latest 1 4 True True NaN True NaN True False NaN NaN
6 note 0 4 True NaN True NaN True NaN NaN NaN True

6. Finding the faulty rows

tdda.constraints.pd.constraints.detect_df() detects records in the pandas DataFrame that violate one of the constraints in the provided JSON file. We can then call the detected() function on the created PandasDetection object to output the rows that are faulty:

[17]:
d = detect_df(new_df, "iot_example.json")

d.detected()
[17]:
n_failures
Index
3 1
4 1
7 1
10 2
12 1
... ...
146385 1
146387 2
146391 2
146393 2
146394 1

77260 rows × 1 columns

We can display all incorrect data records by using only the part of the index from new_df that also appears in d.detected():

[18]:
d_index = d.detected().index
[19]:
new_df[new_df.index.isin(d_index)]
[19]:
timestamp username temperature heartrate build latest note
3 2017-01-01T12:02:09 eddierodriguez 28.0 76 NaN 0.0 update
4 2017-01-01T12:02:36 kenneth94 29.0 62 122f1c6a-403c-2221-6ed1-b5caa08f11e0 NaN NaN
7 2017-01-01T12:04:35 scott28 16.0 76 7a60219f-6621-e548-180e-ca69624f9824 NaN interval
10 2017-01-01T12:06:21 njohnson NaN 63 e09b6001-125d-51cf-9c3f-9cb686c19d02 NaN NaN
12 2017-01-01T12:07:41 jessica48 22.0 83 03e1a07b-3e14-412c-3a69-6b45bc79f81c NaN update
... ... ... ... ... ... ... ...
146385 2017-02-28T23:53:59 powelleric 20.0 86 152eda10-676a-069c-b664-19443f2c8081 NaN test
146387 2017-02-28T23:54:50 jthompson NaN 66 8da10303-fe49-e313-8fda-0d5e79ded054 NaN update
146391 2017-02-28T23:57:21 aaronbecker NaN 87 7e52f4a8-345c-5ee0-e515-b8c392213062 NaN sleep
146393 2017-02-28T23:58:43 joelrusso NaN 89 NaN 0.0 NaN
146394 2017-02-28T23:59:23 lellis NaN 84 dac87426-e147-9c39-6e4c-790bb11f8fc9 0.0 update

77260 rows × 7 columns

[20]:
new_df[~new_df.index.isin(d_index)]
[20]:
timestamp username temperature heartrate build latest note
0 2017-01-01T12:00:23 michaelsmith 12.0 67 4e6a7805-8faa-2768-6ef6-eb3198b483ac 0.0 interval
1 2017-01-01T12:01:09 kharrison 6.0 78 7256b7b0-e502-f576-62ec-ed73533c9c84 0.0 wake
2 2017-01-01T12:01:34 smithadam 5.0 89 9226c94b-bb4b-a6c8-8e02-cb42b53e9c90 0.0 NaN
5 2017-01-01T12:03:04 bryanttodd 13.0 86 0897dbe5-9c5b-71ca-73a1-7586959ca198 0.0 interval
6 2017-01-01T12:03:51 andrea98 17.0 81 1c07ab9b-5f66-137d-a74f-921a41001f4e 1.0 NaN
... ... ... ... ... ... ... ...
146389 2017-02-28T23:56:05 kathy63 5.0 88 c2f76050-abd4-aee4-7bc0-3498325d0573 0.0 NaN
146390 2017-02-28T23:56:34 cookallison 16.0 84 f0b0c1f9-900b-276c-bca9-ac4d4ec4e88e 0.0 user
146392 2017-02-28T23:58:06 mcontreras 15.0 63 69e61a15-d2d0-47a7-1a27-e07b3eeeba10 0.0 NaN
146395 2017-02-28T23:59:48 grayjasmin 17.0 64 4911a589-3a15-4bbf-1de1-e5a69ab739da 1.0 update
146396 2017-03-01T00:00:30 jgreene 23.0 70 4f95bbca-26a7-29e7-1f19-aaedf1a51741 0.0 interval

69137 rows × 7 columns