{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# TDDA: Test-Driven Data Analysis\n", "\n", "[TDDA](https://github.com/tdda/tdda) verwendet Dateieingaben (wie NumPy-Arrays oder Pandas DataFrames) und eine Reihe von Einschränkungen (engl.: _constraints_), die als JSON-Datei gespeichert werden.\n", "\n", "* [Reference Test](https://tdda.readthedocs.io/en/latest/referencetest.html) unterstützt die Erstellung von Referenztests, die entweder auf `unittest` oder `pytest` basieren.\n", "* [Constraints](https://tdda.readthedocs.io/en/v1.0.30/constraints.html) wird verwendet, um Constraints aus einem (Pandas)-DataFrame zu ermitteln, sie als JSON auszuschreiben und zu überprüfen, ob Datensätze die Constraints in der Constraints-Datei erfüllen. Es unterstützt auch Tabellen in einer Vielzahl von relationalen Datenbanken.\n", "* [Rexpy](https://tdda.readthedocs.io/en/v1.0.30/rexpy.html) ist ein Werkzeug zur automatischen Ableitung von regulären Ausdrücken aus einer Spalte in einem Pandas DataFrame oder aus einer (Python)-Liste von Beispielen." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Importe" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "from tdda.constraints import discover_df, verify_df" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(\"https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example.csv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Daten überprüfen\n", "\n", "Mit [pandas.DataFrame.sample](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html) lassen wir uns die ersten zehn Datensätze anzeigen:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
timestampusernametemperatureheartratebuildlatestnote
1232212017-02-19T17:27:43wilsonpatricia20602452cba2-623a-9aaf-4620-b1a00a7070880NaN
806842017-02-02T18:32:06vmiller2779877d2246-f54d-2d6b-8991-5698199def391user
1088102017-02-13T23:50:44sandraobrien9732c185e6b-ad8a-bbc3-5b8a-a7f0cff3b3b51user
473302017-01-20T10:00:25rcline1863c8b03e53-db4a-bbb7-1d06-efb20eca17ec0NaN
258422017-01-11T19:32:20colton9920721d99cb10-0189-3bc2-d7e3-4fee382387ef0sleep
314132017-01-14T01:06:46zjimenez1189eb2bfbd3-6a61-cfae-afcc-6879319aebad1user
1153472017-02-16T14:11:28alvin9423879924687d-959e-99cf-6510-712904df25830wake
1381312017-02-25T16:45:28chelsea05766f995acb5-fff8-3ff7-0581-152e81988b811user
976932017-02-09T13:31:43thomasknight1866a9180bc3-90a3-88bf-36e5-a67549147b281user
339042017-01-15T01:02:00paulwall86656b09285-495a-23ef-31f7-489fad16096f1sleep
\n", "
" ], "text/plain": [ " timestamp username temperature heartrate \\\n", "123221 2017-02-19T17:27:43 wilsonpatricia 20 60 \n", "80684 2017-02-02T18:32:06 vmiller 27 79 \n", "108810 2017-02-13T23:50:44 sandraobrien 9 73 \n", "47330 2017-01-20T10:00:25 rcline 18 63 \n", "25842 2017-01-11T19:32:20 colton99 20 72 \n", "31413 2017-01-14T01:06:46 zjimenez 11 89 \n", "115347 2017-02-16T14:11:28 alvin94 23 87 \n", "138131 2017-02-25T16:45:28 chelsea05 7 66 \n", "97693 2017-02-09T13:31:43 thomasknight 18 66 \n", "33904 2017-01-15T01:02:00 paulwall 8 66 \n", "\n", " build latest note \n", "123221 2452cba2-623a-9aaf-4620-b1a00a707088 0 NaN \n", "80684 877d2246-f54d-2d6b-8991-5698199def39 1 user \n", "108810 2c185e6b-ad8a-bbc3-5b8a-a7f0cff3b3b5 1 user \n", "47330 c8b03e53-db4a-bbb7-1d06-efb20eca17ec 0 NaN \n", "25842 1d99cb10-0189-3bc2-d7e3-4fee382387ef 0 sleep \n", "31413 eb2bfbd3-6a61-cfae-afcc-6879319aebad 1 user \n", "115347 9924687d-959e-99cf-6510-712904df2583 0 wake \n", "138131 f995acb5-fff8-3ff7-0581-152e81988b81 1 user \n", "97693 a9180bc3-90a3-88bf-36e5-a67549147b28 1 user \n", "33904 56b09285-495a-23ef-31f7-489fad16096f 1 sleep " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sample(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Und mit [pandas.DataFrame.dtypes](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dtypes.html) lassen wir uns die Datentypen für die einzelnen Spalten anzeigen:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "timestamp object\n", "username object\n", "temperature int64\n", "heartrate int64\n", "build object\n", "latest int64\n", "note object\n", "dtype: object" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Erstellen eines _constraints_-Objekt\n", "\n", "Mit `discover_constraints` kann ein Vonstraints-Objekt erzeugt werden." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "constraints = discover_df(df)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "constraints" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Fields([('timestamp', ),\n", " ('username', ),\n", " ('temperature',\n", " ),\n", " ('heartrate', ),\n", " ('build', ),\n", " ('latest', ),\n", " ('note', )])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "constraints.fields" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Schreiben der _Constraints_ in eine Datei" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "with open(\"ignore-iot_constraints.tdda\", \"w\") as f:\n", " f.write(constraints.to_json())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wenn wir uns die Datei genauer betrachten können wir erkennen, dass z.B. für die `timestamp`-Spalte eine Zeichenkette mit 19 Zeichen erwartet wird und `temperature` Integer mit Werten von 5–29 erwartet." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\r\n", " \"creation_metadata\": {\r\n", " \"local_time\": \"2023-08-19 13:13:48\",\r\n", " \"utc_time\": \"2023-08-19 11:11:48\",\r\n", " \"creator\": \"TDDA 2.0.09\",\r\n", " \"host\": \"fay.fritz.box\",\r\n", " \"user\": \"veit\",\r\n", " \"n_records\": 146397,\r\n", " \"n_selected\": 146397\r\n", " },\r\n", " \"fields\": {\r\n", " \"timestamp\": {\r\n", " \"type\": \"string\",\r\n", " \"min_length\": 19,\r\n", " \"max_length\": 19,\r\n", " \"max_nulls\": 0,\r\n", " \"no_duplicates\": true\r\n", " },\r\n", " \"username\": {\r\n", " \"type\": \"string\",\r\n", " \"min_length\": 3,\r\n", " \"max_length\": 21,\r\n", " \"max_nulls\": 0\r\n", " },\r\n", " \"temperature\": {\r\n", " \"type\": \"int\",\r\n", " \"min\": 5,\r\n", " \"max\": 29,\r\n", " \"sign\": \"positive\",\r\n", " \"max_nulls\": 0\r\n", " },\r\n", " \"heartrate\": {\r\n", " \"type\": \"int\",\r\n", " \"min\": 60,\r\n", " \"max\": 89,\r\n", " \"sign\": \"positive\",\r\n", " \"max_nulls\": 0\r\n", " },\r\n", " \"build\": {\r\n", " \"type\": \"string\",\r\n", " \"min_length\": 36,\r\n", " \"max_length\": 36,\r\n", " \"max_nulls\": 0,\r\n", " \"no_duplicates\": true\r\n", " },\r\n", " \"latest\": {\r\n", " \"type\": \"int\",\r\n", " \"min\": 0,\r\n", " \"max\": 1,\r\n", " \"sign\": \"non-negative\",\r\n", " \"max_nulls\": 0\r\n", " },\r\n", " \"note\": {\r\n", " \"type\": \"string\",\r\n", " \"min_length\": 4,\r\n", " \"max_length\": 8,\r\n", " \"allowed_values\": [\r\n", " \"interval\",\r\n", " \"sleep\",\r\n", " \"test\",\r\n", " \"update\",\r\n", " \"user\",\r\n", " \"wake\"\r\n", " ]\r\n", " }\r\n", " }\r\n", "}\r\n" ] } ], "source": [ "cat ignore-iot_constraints.tdda" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Überprüfen von Dataframes\n", "\n", "Hierfür lesen wir zunächst eine neue csv-Datei mit Pandas ein und lassen uns dann zehn Datensätze exemplarisch ausgeben:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
timestampusernametemperatureheartratebuildlatestnote
1260442017-02-20T20:33:42karenrichardsNaN79284fab65-9fcc-18e4-8838-6e89ac938f77NaNNaN
548442017-01-23T10:00:53karen37NaN766ea2310d-b136-dfae-3e4a-730cb01a68811.0wake
154842017-01-07T16:36:43carterjill24.0613b524f97-4a6a-156e-e182-760818cc5c6b0.0interval
37092017-01-02T23:50:12ebenton18.082f23b7b48-0ad1-18b5-c5a8-033d66d470071.0NaN
1319782017-02-23T05:37:46cameron67NaN872806ee39-a668-d0f7-44b9-9255188d51df1.0interval
613022017-01-25T23:57:04rebecca8819.071NaNNaNinterval
1304932017-02-22T15:15:56cgriffin29.081NaN1.0NaN
1216852017-02-19T02:53:16johnberg16.068e3b81408-7c4f-78b5-c373-2ee086e6dbbfNaNNaN
871502017-02-05T08:17:46kelly7128.06899842995-cc89-638c-9067-a7d92e4500970.0interval
1387232017-02-25T22:22:49ayoung22.075dca3e5f6-c05a-c490-8384-4f7cfda4e19a0.0NaN
\n", "
" ], "text/plain": [ " timestamp username temperature heartrate \\\n", "126044 2017-02-20T20:33:42 karenrichards NaN 79 \n", "54844 2017-01-23T10:00:53 karen37 NaN 76 \n", "15484 2017-01-07T16:36:43 carterjill 24.0 61 \n", "3709 2017-01-02T23:50:12 ebenton 18.0 82 \n", "131978 2017-02-23T05:37:46 cameron67 NaN 87 \n", "61302 2017-01-25T23:57:04 rebecca88 19.0 71 \n", "130493 2017-02-22T15:15:56 cgriffin 29.0 81 \n", "121685 2017-02-19T02:53:16 johnberg 16.0 68 \n", "87150 2017-02-05T08:17:46 kelly71 28.0 68 \n", "138723 2017-02-25T22:22:49 ayoung 22.0 75 \n", "\n", " build latest note \n", "126044 284fab65-9fcc-18e4-8838-6e89ac938f77 NaN NaN \n", "54844 6ea2310d-b136-dfae-3e4a-730cb01a6881 1.0 wake \n", "15484 3b524f97-4a6a-156e-e182-760818cc5c6b 0.0 interval \n", "3709 f23b7b48-0ad1-18b5-c5a8-033d66d47007 1.0 NaN \n", "131978 2806ee39-a668-d0f7-44b9-9255188d51df 1.0 interval \n", "61302 NaN NaN interval \n", "130493 NaN 1.0 NaN \n", "121685 e3b81408-7c4f-78b5-c373-2ee086e6dbbf NaN NaN \n", "87150 99842995-cc89-638c-9067-a7d92e450097 0.0 interval \n", "138723 dca3e5f6-c05a-c490-8384-4f7cfda4e19a 0.0 NaN " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_df = pd.read_csv(\"https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example_with_nulls.csv\")\n", "\n", "new_df.sample(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wir sehen mehrere Felder, die als `NaN` ausgegeben werden. Um dies nun systematisch zu analysieren, wenden wir [verify_df](https://tdda.readthedocs.io/en/v1.0.31/constraints.html#tdda.constraints.verify_df) auf unseren neuen DataFrame an. Dabei gibt `passes` gibt die Anzahl der bestandenen, `failures` die Anzahl der fehlgeschlagenen Constraints zurück." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "v = verify_df(new_df, \"ignore-iot_constraints.tdda\")" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "v" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "30" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "v.passes" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "v.failures" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wir können uns auch anzeigen lassen, in welchen Spalten welche Constraints bestanden und fehlgeschlagen sind: " ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "FIELDS:\n", "\n", "timestamp: 0 failures 5 passes type ✓ min_length ✓ max_length ✓ max_nulls ✓ no_duplicates ✓\n", "\n", "username: 0 failures 4 passes type ✓ min_length ✓ max_length ✓ max_nulls ✓\n", "\n", "temperature: 1 failure 4 passes type ✓ min ✓ max ✓ sign ✓ max_nulls ✗\n", "\n", "heartrate: 0 failures 5 passes type ✓ min ✓ max ✓ sign ✓ max_nulls ✓\n", "\n", "build: 1 failure 4 passes type ✓ min_length ✓ max_length ✓ max_nulls ✗ no_duplicates ✓\n", "\n", "latest: 1 failure 4 passes type ✓ min ✓ max ✓ sign ✓ max_nulls ✗\n", "\n", "note: 0 failures 4 passes type ✓ min_length ✓ max_length ✓ allowed_values ✓\n", "\n", "SUMMARY:\n", "\n", "Constraints passing: 30\n", "Constraints failing: 3\n" ] } ], "source": [ "print(str(v))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternativ können wir uns diese Ergebnisse auch tabellarisch anzeigen lassen:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fieldfailurespassestypeminmin_lengthmaxmax_lengthsignmax_nullsno_duplicatesallowed_values
0timestamp05TrueNaNTrueNaNTrueNaNTrueTrueNaN
1username04TrueNaNTrueNaNTrueNaNTrueNaNNaN
2temperature14TrueTrueNaNTrueNaNTrueFalseNaNNaN
3heartrate05TrueTrueNaNTrueNaNTrueTrueNaNNaN
4build14TrueNaNTrueNaNTrueNaNFalseTrueNaN
5latest14TrueTrueNaNTrueNaNTrueFalseNaNNaN
6note04TrueNaNTrueNaNTrueNaNNaNNaNTrue
\n", "
" ], "text/plain": [ " field failures passes type min min_length max max_length \\\n", "0 timestamp 0 5 True NaN True NaN True \n", "1 username 0 4 True NaN True NaN True \n", "2 temperature 1 4 True True NaN True NaN \n", "3 heartrate 0 5 True True NaN True NaN \n", "4 build 1 4 True NaN True NaN True \n", "5 latest 1 4 True True NaN True NaN \n", "6 note 0 4 True NaN True NaN True \n", "\n", " sign max_nulls no_duplicates allowed_values \n", "0 NaN True True NaN \n", "1 NaN True NaN NaN \n", "2 True False NaN NaN \n", "3 True True NaN NaN \n", "4 NaN False True NaN \n", "5 True False NaN NaN \n", "6 NaN NaN NaN True " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "v.to_frame()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.11 Kernel", "language": "python", "name": "python311" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }