TDDA: Test-Driven Data Analysis#

TDDA takes data inputs (such as NumPy arrays or pandas DataFrames) and a set of constraints stored as a JSON file.

  • Reference Tests supports the creation of reference tests based on either unittest or pytest.

  • Constraints is used to discover constraints from a (pandas) DataFrame, write them out as JSON, and check whether records satisfy the constraints in the constraints file. It also supports tables in a variety of relational databases.

  • Rexpy is a tool for automatically deriving regular expressions from a column in a pandas DataFrame or from a (Python) list of examples.
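
Rexpy is not used further in this notebook, but a minimal sketch shows the idea. The example strings below are made up for illustration; rexpy.extract returns a list of regular expressions that together match all of the given strings:

[ ]:
from tdda import rexpy

# A few made-up example strings from which regular expressions are derived
corpus = ["2017-01-15T10:01:00", "2017-02-18T11:25:06", "2017-02-22T11:12:22"]
rexpy.extract(corpus)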

1. Imports#

[1]:
import numpy as np
import pandas as pd

from tdda.constraints import discover_df, verify_df
[2]:
df = pd.read_csv(
    "https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example.csv"
)

2. Check data#

With pandas.DataFrame.sample we display ten random records:

[3]:
df.sample(10)
[3]:
                  timestamp          username  temperature  heartrate                                 build  latest    note
 34835  2017-01-15T10:01:00         jwilliams            6         89  8aee9cf7-84c6-6935-22ff-9b034d9aa1f4       0  update
120063  2017-02-18T11:25:06           ericliu           28         77  c1179197-abcc-ee64-2851-cb2ed21baa1a       0    wake
130072  2017-02-22T11:12:22             lee99            9         79  184d2848-9367-71cb-be72-6bbd57074857       0     NaN
 79530  2017-02-02T07:23:17           jamie46            5         82  1c56c556-2ba0-11fb-5a27-29289487b748       1    wake
 53587  2017-01-22T22:02:38    daniellemacias           26         63  acd9a855-077c-dda7-c73f-7621f3179f17       0   sleep
112089  2017-02-15T07:04:50       carlosyoung           23         63  71dfc6de-147e-00f1-da89-6e4489a33aba       0    user
 91218  2017-02-06T23:19:13   gutierreznathan           11         72  166e2a51-ae08-bd1f-3cee-3c65a0d5740b       0     NaN
105807  2017-02-12T18:57:19   gutierrezashley           28         72  9b1984e4-a620-68f6-c639-2db7774fe27c       0    user
 51928  2017-01-22T05:58:03         uschwartz           29         81  54bf638e-68e1-9514-02df-acbc5417443a       0    user
 70436  2017-01-29T15:52:55  crystalunderwood           23         82  8f003e04-509d-e412-0979-0c9f9526f1e8       1     NaN

And with pandas.DataFrame.dtypes we display the data types for the individual columns:

[4]:
df.dtypes
[4]:
timestamp      object
username       object
temperature     int64
heartrate       int64
build          object
latest          int64
note           object
dtype: object

3. Creating a constraints object#

With discover_df a constraints object can be created.

[5]:
constraints = discover_df(df)
[6]:
constraints
[6]:
<tdda.constraints.base.DatasetConstraints at 0x161ac2c50>
[7]:
constraints.fields
[7]:
Fields([('timestamp', <tdda.constraints.base.FieldConstraints at 0x161ac1350>),
        ('username', <tdda.constraints.base.FieldConstraints at 0x161ac1650>),
        ('temperature',
         <tdda.constraints.base.FieldConstraints at 0x136bc2790>),
        ('heartrate', <tdda.constraints.base.FieldConstraints at 0x161ac1f10>),
        ('build', <tdda.constraints.base.FieldConstraints at 0x161ac2350>),
        ('latest', <tdda.constraints.base.FieldConstraints at 0x161ac2990>),
        ('note', <tdda.constraints.base.FieldConstraints at 0x161ac2d50>)])

4. Writing the constraints into a file#

[8]:
with open("../../data/ignore-iot_constraints.tdda", "w") as f:
    f.write(constraints.to_json())

If we take a closer look at the file, we can see, for example, that a string of exactly 19 characters is expected in the timestamp column and that temperature expects integers with values from 5 to 29.

[9]:
!cat ../../data/ignore-iot_constraints.tdda
{
    "creation_metadata": {
        "local_time": "2023-07-26 18:18:37",
        "utc_time": "2023-07-26 16:16:37",
        "creator": "TDDA 2.0.09",
        "host": "fay.local",
        "user": "veit",
        "n_records": 146397,
        "n_selected": 146397
    },
    "fields": {
        "timestamp": {
            "type": "string",
            "min_length": 19,
            "max_length": 19,
            "max_nulls": 0,
            "no_duplicates": true
        },
        "username": {
            "type": "string",
            "min_length": 3,
            "max_length": 21,
            "max_nulls": 0
        },
        "temperature": {
            "type": "int",
            "min": 5,
            "max": 29,
            "sign": "positive",
            "max_nulls": 0
        },
        "heartrate": {
            "type": "int",
            "min": 60,
            "max": 89,
            "sign": "positive",
            "max_nulls": 0
        },
        "build": {
            "type": "string",
            "min_length": 36,
            "max_length": 36,
            "max_nulls": 0,
            "no_duplicates": true
        },
        "latest": {
            "type": "int",
            "min": 0,
            "max": 1,
            "sign": "non-negative",
            "max_nulls": 0
        },
        "note": {
            "type": "string",
            "min_length": 4,
            "max_length": 8,
            "allowed_values": [
                "interval",
                "sleep",
                "test",
                "update",
                "user",
                "wake"
            ]
        }
    }
}
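
Since the constraints file is plain JSON, it can also be loaded and inspected programmatically instead of with cat, for example with Python's json module (a minimal sketch using the file path from above):

[ ]:
import json

with open("../../data/ignore-iot_constraints.tdda") as f:
    iot_constraints = json.load(f)

# Show only the constraints discovered for the temperature column
iot_constraints["fields"]["temperature"]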

5. Checking data frames#

To do this, we first read a new CSV file with pandas and then display ten random records as examples:

[10]:
new_df = pd.read_csv(
    "https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example_with_nulls.csv"
)
new_df.sample(10)
[10]:
                  timestamp        username  temperature  heartrate                                 build  latest      note
   985  2017-01-01T21:31:32       qmartinez          9.0         85  76f24b19-9d9e-17b8-4d02-c6e6f00e9f13     1.0       NaN
102620  2017-02-11T12:30:15  laurenwilliams         19.0         85  5ea80d05-56b2-5632-8480-cf3ef40d34a4     0.0      user
 59859  2017-01-25T10:06:55         xwright         22.0         66                                   NaN     0.0      test
110018  2017-02-14T11:13:49         wibarra          NaN         68  9d04e55f-c9e1-2ab9-666b-9de0f739431a     0.0      test
  1736  2017-01-02T04:47:11     joshuaperez          NaN         79  3d320343-34f9-bf79-ec39-aaafa061c39c     1.0       NaN
 85078  2017-02-04T12:37:57         oaustin         23.0         63  ccf940cd-9b59-e444-1a68-5f7f7590d5db     0.0       NaN
145979  2017-02-28T20:00:11         hholder         21.0         77                                   NaN     0.0       NaN
 37789  2017-01-16T14:12:06    kennethdavis          8.0         62  9bc9bc15-bf2a-7098-bf85-73fa42e30df2     0.0      user
114849  2017-02-16T09:27:34         bobby64         20.0         67  2c5dc517-d725-bc66-7d6d-25716025476a     0.0     sleep
107275  2017-02-13T09:05:58          okirby         29.0         64                                   NaN     1.0  interval

We see several fields with NaN values. To analyse this systematically, we apply verify_df to our new DataFrame. passes returns the number of passed constraints, and failures the number of failed constraints.

[11]:
v = verify_df(new_df, "../../data/ignore-iot_constraints.tdda")
[12]:
v
[12]:
<tdda.constraints.pd.constraints.PandasVerification at 0x1636e7450>
[13]:
v.passes
[13]:
30
[14]:
v.failures
[14]:
3

We can also display which constraints passed and failed in which columns:

[15]:
print(str(v))
FIELDS:

timestamp: 0 failures  5 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  no_duplicates ✓

username: 0 failures  4 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓

temperature: 1 failure  4 passes  type ✓  min ✓  max ✓  sign ✓  max_nulls ✗

heartrate: 0 failures  5 passes  type ✓  min ✓  max ✓  sign ✓  max_nulls ✓

build: 1 failure  4 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✗  no_duplicates ✓

latest: 1 failure  4 passes  type ✓  min ✓  max ✓  sign ✓  max_nulls ✗

note: 0 failures  4 passes  type ✓  min_length ✓  max_length ✓  allowed_values ✓

SUMMARY:

Constraints passing: 30
Constraints failing: 3

Alternatively, we can also display these results in tabular form:

[16]:
v.to_frame()
[16]:
         field  failures  passes  type   min  min_length   max  max_length  sign  max_nulls  no_duplicates  allowed_values
0    timestamp         0       5  True   NaN        True   NaN        True   NaN       True           True             NaN
1     username         0       4  True   NaN        True   NaN        True   NaN       True            NaN             NaN
2  temperature         1       4  True  True         NaN  True         NaN  True      False            NaN             NaN
3    heartrate         0       5  True  True         NaN  True         NaN  True       True            NaN             NaN
4        build         1       4  True   NaN        True   NaN        True   NaN      False           True             NaN
5       latest         1       4  True  True         NaN  True         NaN  True      False            NaN             NaN
6         note         0       4  True   NaN        True   NaN        True   NaN        NaN            NaN            True
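
Since to_frame() returns a regular pandas DataFrame, the verification result can also be evaluated programmatically, for example to list only the columns that violate at least one constraint. A minimal sketch:

[ ]:
vdf = v.to_frame()

# Restrict the overview to fields with at least one failed constraint
vdf[vdf["failures"] > 0][["field", "failures", "passes"]]

# In an automated pipeline or test one could additionally require that no
# constraints fail at all, for example:
# assert v.failures == 0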