TDDA: Test-Driven Data Analysis#

TDDA takes data inputs (such as NumPy arrays or pandas DataFrames) and a set of constraints stored as a JSON file.

  • Reference Tests supports the creation of reference tests based on either unittest or pytest.

  • Constraints is used to discover constraints from a (pandas) DataFrame, write them out as JSON, and check whether records satisfy the constraints in the constraints file. It also supports tables in a variety of relational databases.

  • Rexpy is a tool for automatically deriving regular expressions from a column in a pandas DataFrame or from a (Python) list of examples.
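
Rexpy is not used further in this notebook, but a minimal sketch shows the idea. The example strings below are made up for illustration; rexpy.extract returns a list of regular expressions that together match all of the given strings:

[ ]:
from tdda import rexpy

# A few made-up example strings from which regular expressions are derived
corpus = ["2017-01-15T10:01:00", "2017-02-18T11:25:06", "2017-02-22T11:12:22"]
rexpy.extract(corpus)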

1. Imports#

[1]:
import numpy as np
import pandas as pd

from tdda.constraints import discover_df, verify_df
[2]:
df = pd.read_csv(
    "https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example.csv"
)

2. Check data#

With pandas.DataFrame.sample we display ten random records:

[3]:
df.sample(10)
[3]:
                  timestamp          username  temperature  heartrate                                 build  latest    note
 34835  2017-01-15T10:01:00         jwilliams            6         89  8aee9cf7-84c6-6935-22ff-9b034d9aa1f4       0  update
120063  2017-02-18T11:25:06           ericliu           28         77  c1179197-abcc-ee64-2851-cb2ed21baa1a       0    wake
130072  2017-02-22T11:12:22             lee99            9         79  184d2848-9367-71cb-be72-6bbd57074857       0     NaN
 79530  2017-02-02T07:23:17           jamie46            5         82  1c56c556-2ba0-11fb-5a27-29289487b748       1    wake
 53587  2017-01-22T22:02:38    daniellemacias           26         63  acd9a855-077c-dda7-c73f-7621f3179f17       0   sleep
112089  2017-02-15T07:04:50       carlosyoung           23         63  71dfc6de-147e-00f1-da89-6e4489a33aba       0    user
 91218  2017-02-06T23:19:13   gutierreznathan           11         72  166e2a51-ae08-bd1f-3cee-3c65a0d5740b       0     NaN
105807  2017-02-12T18:57:19   gutierrezashley           28         72  9b1984e4-a620-68f6-c639-2db7774fe27c       0    user
 51928  2017-01-22T05:58:03         uschwartz           29         81  54bf638e-68e1-9514-02df-acbc5417443a       0    user
 70436  2017-01-29T15:52:55  crystalunderwood           23         82  8f003e04-509d-e412-0979-0c9f9526f1e8       1     NaN

And with pandas.DataFrame.dtypes we display the data types for the individual columns:

[4]:
df.dtypes
[4]:
timestamp      object
username       object
temperature     int64
heartrate       int64
build          object
latest          int64
note           object
dtype: object

3. Creating a constraints object#

With discover_df a constraints object can be created.

[5]:
constraints = discover_df(df)
[6]:
constraints
[6]:
<tdda.constraints.base.DatasetConstraints at 0x161ac2c50>
[7]:
constraints.fields
[7]:
Fields([('timestamp', <tdda.constraints.base.FieldConstraints at 0x161ac1350>),
        ('username', <tdda.constraints.base.FieldConstraints at 0x161ac1650>),
        ('temperature',
         <tdda.constraints.base.FieldConstraints at 0x136bc2790>),
        ('heartrate', <tdda.constraints.base.FieldConstraints at 0x161ac1f10>),
        ('build', <tdda.constraints.base.FieldConstraints at 0x161ac2350>),
        ('latest', <tdda.constraints.base.FieldConstraints at 0x161ac2990>),
        ('note', <tdda.constraints.base.FieldConstraints at 0x161ac2d50>)])

4. Writing the constraints into a file#

[8]:
with open("../../data/ignore-iot_constraints.tdda", "w") as f:
    f.write(constraints.to_json())

If we take a closer look at the file, we can see, for example, that a string of exactly 19 characters is expected in the timestamp column and that temperature expects integers with values from 5 to 29.

[9]:
!cat ../../data/ignore-iot_constraints.tdda
{
    "creation_metadata": {
        "local_time": "2023-07-26 18:18:37",
        "utc_time": "2023-07-26 16:16:37",
        "creator": "TDDA 2.0.09",
        "host": "fay.local",
        "user": "veit",
        "n_records": 146397,
        "n_selected": 146397
    },
    "fields": {
        "timestamp": {
            "type": "string",
            "min_length": 19,
            "max_length": 19,
            "max_nulls": 0,
            "no_duplicates": true
        },
        "username": {
            "type": "string",
            "min_length": 3,
            "max_length": 21,
            "max_nulls": 0
        },
        "temperature": {
            "type": "int",
            "min": 5,
            "max": 29,
            "sign": "positive",
            "max_nulls": 0
        },
        "heartrate": {
            "type": "int",
            "min": 60,
            "max": 89,
            "sign": "positive",
            "max_nulls": 0
        },
        "build": {
            "type": "string",
            "min_length": 36,
            "max_length": 36,
            "max_nulls": 0,
            "no_duplicates": true
        },
        "latest": {
            "type": "int",
            "min": 0,
            "max": 1,
            "sign": "non-negative",
            "max_nulls": 0
        },
        "note": {
            "type": "string",
            "min_length": 4,
            "max_length": 8,
            "allowed_values": [
                "interval",
                "sleep",
                "test",
                "update",
                "user",
                "wake"
            ]
        }
    }
}
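
Since the constraints file is plain JSON, it can also be loaded and inspected programmatically instead of with cat, for example with Python's json module (a minimal sketch using the file path from above):

[ ]:
import json

with open("../../data/ignore-iot_constraints.tdda") as f:
    iot_constraints = json.load(f)

# Show only the constraints discovered for the temperature column
iot_constraints["fields"]["temperature"]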

5. Checking data frames#

To do this, we first read a new CSV file with pandas and then display ten random records as examples:

[10]:
new_df = pd.read_csv(
    "https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example_with_nulls.csv"
)
new_df.sample(10)
[10]:
                  timestamp        username  temperature  heartrate                                 build  latest      note
   985  2017-01-01T21:31:32       qmartinez          9.0         85  76f24b19-9d9e-17b8-4d02-c6e6f00e9f13     1.0       NaN
102620  2017-02-11T12:30:15  laurenwilliams         19.0         85  5ea80d05-56b2-5632-8480-cf3ef40d34a4     0.0      user
 59859  2017-01-25T10:06:55         xwright         22.0         66                                   NaN     0.0      test
110018  2017-02-14T11:13:49         wibarra          NaN         68  9d04e55f-c9e1-2ab9-666b-9de0f739431a     0.0      test
  1736  2017-01-02T04:47:11     joshuaperez          NaN         79  3d320343-34f9-bf79-ec39-aaafa061c39c     1.0       NaN
 85078  2017-02-04T12:37:57         oaustin         23.0         63  ccf940cd-9b59-e444-1a68-5f7f7590d5db     0.0       NaN
145979  2017-02-28T20:00:11         hholder         21.0         77                                   NaN     0.0       NaN
 37789  2017-01-16T14:12:06    kennethdavis          8.0         62  9bc9bc15-bf2a-7098-bf85-73fa42e30df2     0.0      user
114849  2017-02-16T09:27:34         bobby64         20.0         67  2c5dc517-d725-bc66-7d6d-25716025476a     0.0     sleep
107275  2017-02-13T09:05:58          okirby         29.0         64                                   NaN     1.0  interval

We see several fields with NaN values. To analyse this systematically, we apply verify_df to our new DataFrame. passes returns the number of passed constraints, and failures the number of failed constraints.

[11]:
v = verify_df(new_df, "../../data/ignore-iot_constraints.tdda")
[12]:
v
[12]:
<tdda.constraints.pd.constraints.PandasVerification at 0x1636e7450>
[13]:
v.passes
[13]:
30
[14]:
v.failures
[14]:
3

We can also display which constraints passed and failed in which columns:

[15]:
print(str(v))
FIELDS:

timestamp: 0 failures  5 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  no_duplicates ✓

username: 0 failures  4 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓

temperature: 1 failure  4 passes  type ✓  min ✓  max ✓  sign ✓  max_nulls ✗

heartrate: 0 failures  5 passes  type ✓  min ✓  max ✓  sign ✓  max_nulls ✓

build: 1 failure  4 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✗  no_duplicates ✓

latest: 1 failure  4 passes  type ✓  min ✓  max ✓  sign ✓  max_nulls ✗

note: 0 failures  4 passes  type ✓  min_length ✓  max_length ✓  allowed_values ✓

SUMMARY:

Constraints passing: 30
Constraints failing: 3

Alternatively, we can also display these results in tabular form:

[16]:
v.to_frame()
[16]:
         field  failures  passes  type   min  min_length   max  max_length  sign  max_nulls  no_duplicates  allowed_values
0    timestamp         0       5  True   NaN        True   NaN        True   NaN       True           True             NaN
1     username         0       4  True   NaN        True   NaN        True   NaN       True            NaN             NaN
2  temperature         1       4  True  True         NaN  True         NaN  True      False            NaN             NaN
3    heartrate         0       5  True  True         NaN  True         NaN  True       True            NaN             NaN
4        build         1       4  True   NaN        True   NaN        True   NaN      False           True             NaN
5       latest         1       4  True  True         NaN  True         NaN  True      False            NaN             NaN
6         note         0       4  True   NaN        True   NaN        True   NaN        NaN            NaN            True
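
Since to_frame() returns a regular pandas DataFrame, the verification result can also be evaluated programmatically, for example to list only the columns that violate at least one constraint. A minimal sketch:

[ ]:
vdf = v.to_frame()

# Restrict the overview to fields with at least one failed constraint
vdf[vdf["failures"] > 0][["field", "failures", "passes"]]

# In an automated pipeline or test one could additionally require that no
# constraints fail at all, for example:
# assert v.failures == 0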