{ "cells": [ { "cell_type": "markdown", "id": "8ab93a7d", "metadata": {}, "source": [ "# pandas DataFrame Validation with Bulwark\n", "\n", "[Bulwark](https://bulwark.readthedocs.io/en/stable/index.html) is a package for property-based testing of pandas dataframes. The project was heavily influenced by the no longer supported [Engarde](https://github.com/engarde-dev/engarde) library." ] }, { "cell_type": "markdown", "id": "bf371eec", "metadata": {}, "source": [ "## 1. Installation\n", "\n", "``` console\n", "$ uv add bulwark\n", "Installing bulwark…\n", "Adding bulwark to Pipfile's [packages]…\n", "✔ Installation Succeeded\n", "Locking [dev-packages] dependencies…\n", "✔ Success!\n", "Updated Pipfile.lock (0d075a)!\n", "```" ] }, { "cell_type": "markdown", "id": "1938d487", "metadata": {}, "source": [ "## 2. Use" ] }, { "cell_type": "markdown", "id": "7447fbb2", "metadata": {}, "source": [ "### 2.1 Checks\n", "\n", "With the [bulwark.checks](https://bulwark.readthedocs.io/en/v0.4.2/bulwark.html#module-bulwark.checks) module you can check many common assumptions, e.g.\n", "\n", "* `has_columns` checks whether certain columns exist in such-and-such a way and in the correct order\n", "* `has_dtypes` checks the data types of columns\n", "* `has_no_infs` checks if there are no [numpy.inf](https://numpy.org/doc/stable/reference/constants.html#numpy.inf) in the DataFrame\n", "* `has_no_nans` checks if there are no [numpy.nan](https://numpy.org/doc/stable/reference/constants.html#numpy.nan) in the DataFrame\n", "* `has_set_within_vals` checks if the values specified in a dict are a subset of the associated column\n", "* `has_unique_index` checks if the index is unique\n", "* `is_monotonic` checks whether values of a column are ascending or descending\n", "* `one_to_many` checks whether there is an n:1 relationship between two columns\n", "\n", "The checks are then very simple, e.g. the check whether there are no `numpy.nan` in the column `pipe` with\n", "\n", "```python\n", "import bulwark.checks as ck\n", "\n", "df.pipe(ck.has_no_nans())\n", "```" ] }, { "cell_type": "markdown", "id": "589eb14f", "metadata": {}, "source": [ "### 2.2 Decorators\n", "\n", "For each check, bulwark.creates [decorators](https://bulwark.readthedocs.io/en/v0.4.2/bulwark.html#module-bulwark.decorators), e.g. `@dc.IsShape((-1, 10))` or `@dc.IsMonotonic(strict=True)`." ] }, { "cell_type": "markdown", "id": "442385ad", "metadata": {}, "source": [ "### `CustomCheck`\n", "\n", "You can also create your own custom functions, for example:" ] }, { "cell_type": "code", "execution_count": 1, "id": "648e3b87", "metadata": {}, "outputs": [ { "ename": "AssertionError", "evalue": "len_longer_than is not true.", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mAssertionError\u001b[0m Traceback (most recent call last)", "Cell \u001b[0;32mIn[1], line 21\u001b[0m\n\u001b[1;32m 18\u001b[0m df \u001b[38;5;241m=\u001b[39m pd\u001b[38;5;241m.\u001b[39mDataFrame({\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124ma\u001b[39m\u001b[38;5;124m\"\u001b[39m: [\u001b[38;5;241m1\u001b[39m, \u001b[38;5;241m2\u001b[39m, \u001b[38;5;241m3\u001b[39m], \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mb\u001b[39m\u001b[38;5;124m\"\u001b[39m: [\u001b[38;5;241m4\u001b[39m, \u001b[38;5;241m5\u001b[39m, \u001b[38;5;241m6\u001b[39m]})\n\u001b[1;32m 19\u001b[0m df2 \u001b[38;5;241m=\u001b[39m pd\u001b[38;5;241m.\u001b[39mDataFrame({\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124ma\u001b[39m\u001b[38;5;124m\"\u001b[39m: [\u001b[38;5;241m1\u001b[39m, np\u001b[38;5;241m.\u001b[39mnan, \u001b[38;5;241m3\u001b[39m, \u001b[38;5;241m4\u001b[39m], \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mb\u001b[39m\u001b[38;5;124m\"\u001b[39m: [\u001b[38;5;241m4\u001b[39m, \u001b[38;5;241m5\u001b[39m, \u001b[38;5;241m6\u001b[39m, \u001b[38;5;241m7\u001b[39m]})\n\u001b[0;32m---> 21\u001b[0m \u001b[43mappend_a_df\u001b[49m\u001b[43m(\u001b[49m\u001b[43mdf\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdf2\u001b[49m\u001b[43m)\u001b[49m\n", "File \u001b[0;32m~/sandbox/py313/.venv/lib/python3.13/site-packages/bulwark/decorators.py:81\u001b[0m, in \u001b[0;36mCustomCheck.__call__..decorated\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m 78\u001b[0m df \u001b[38;5;241m=\u001b[39m f(\u001b[38;5;241m*\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs)\n\u001b[1;32m 79\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39menabled:\n\u001b[1;32m 80\u001b[0m \u001b[38;5;66;03m# differs from BaseDecorator\u001b[39;00m\n\u001b[0;32m---> 81\u001b[0m \u001b[43mck\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mcustom_check\u001b[49m\u001b[43m(\u001b[49m\u001b[43mdf\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mcheck_func\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mcheck_func_params\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 82\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m df\n", "File \u001b[0;32m~/sandbox/py313/.venv/lib/python3.13/site-packages/bulwark/checks.py:588\u001b[0m, in \u001b[0;36mcustom_check\u001b[0;34m(df, check_func, *args, **kwargs)\u001b[0m\n\u001b[1;32m 576\u001b[0m \u001b[38;5;250m\u001b[39m\u001b[38;5;124;03m\"\"\"Assert that `check(df, *args, **kwargs)` is true.\u001b[39;00m\n\u001b[1;32m 577\u001b[0m \n\u001b[1;32m 578\u001b[0m \u001b[38;5;124;03mArgs:\u001b[39;00m\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 585\u001b[0m \n\u001b[1;32m 586\u001b[0m \u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[1;32m 587\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m--> 588\u001b[0m \u001b[43mcheck_func\u001b[49m\u001b[43m(\u001b[49m\u001b[43mdf\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 589\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mAssertionError\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m e:\n\u001b[1;32m 590\u001b[0m msg \u001b[38;5;241m=\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;132;01m{}\u001b[39;00m\u001b[38;5;124m is not true.\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;241m.\u001b[39mformat(check_func\u001b[38;5;241m.\u001b[39m\u001b[38;5;18m__name__\u001b[39m)\n", "Cell \u001b[0;32mIn[1], line 9\u001b[0m, in \u001b[0;36mlen_longer_than\u001b[0;34m(df, l)\u001b[0m\n\u001b[1;32m 7\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mlen_longer_than\u001b[39m(df, l):\n\u001b[1;32m 8\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mlen\u001b[39m(df) \u001b[38;5;241m<\u001b[39m\u001b[38;5;241m=\u001b[39m l:\n\u001b[0;32m----> 9\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mAssertionError\u001b[39;00m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mdf is not as long as expected.\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 10\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m df\n", "\u001b[0;31mAssertionError\u001b[0m: len_longer_than is not true." ] } ], "source": [ "import bulwark.checks as ck\n", "import bulwark.decorators as dc\n", "import numpy as np\n", "import pandas as pd\n", "\n", "\n", "def len_longer_than(df, l):\n", " if len(df) <= l:\n", " raise AssertionError(\"df is not as long as expected.\")\n", " return df\n", "\n", "\n", "@dc.CustomCheck(len_longer_than, 10)\n", "def append_a_df(df, df2):\n", " return pd.concat([df, df2], ignore_index=True)\n", "\n", "\n", "df = pd.DataFrame({\"a\": [1, 2, 3], \"b\": [4, 5, 6]})\n", "df2 = pd.DataFrame({\"a\": [1, np.nan, 3, 4], \"b\": [4, 5, 6, 7]})\n", "\n", "append_a_df(df, df2)" ] }, { "cell_type": "markdown", "id": "aaa36dd0", "metadata": {}, "source": [ "### `MultiCheck`\n", "\n", "With `MultiCheck` you can run several tests at the same time and see all the errors at once, for example:" ] }, { "cell_type": "code", "execution_count": 2, "id": "1ed48a04", "metadata": {}, "outputs": [ { "ename": "AssertionError", "evalue": "(4, 'a')", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mAssertionError\u001b[0m Traceback (most recent call last)", "Cell \u001b[0;32mIn[2], line 15\u001b[0m\n\u001b[1;32m 12\u001b[0m df \u001b[38;5;241m=\u001b[39m pd\u001b[38;5;241m.\u001b[39mDataFrame({\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124ma\u001b[39m\u001b[38;5;124m\"\u001b[39m: [\u001b[38;5;241m1\u001b[39m, \u001b[38;5;241m2\u001b[39m, \u001b[38;5;241m3\u001b[39m], \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mb\u001b[39m\u001b[38;5;124m\"\u001b[39m: [\u001b[38;5;241m4\u001b[39m, \u001b[38;5;241m5\u001b[39m, \u001b[38;5;241m6\u001b[39m]})\n\u001b[1;32m 13\u001b[0m df2 \u001b[38;5;241m=\u001b[39m pd\u001b[38;5;241m.\u001b[39mDataFrame({\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124ma\u001b[39m\u001b[38;5;124m\"\u001b[39m: [\u001b[38;5;241m1\u001b[39m, np\u001b[38;5;241m.\u001b[39mnan, \u001b[38;5;241m3\u001b[39m, \u001b[38;5;241m4\u001b[39m], \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mb\u001b[39m\u001b[38;5;124m\"\u001b[39m: [\u001b[38;5;241m4\u001b[39m, \u001b[38;5;241m5\u001b[39m, \u001b[38;5;241m6\u001b[39m, \u001b[38;5;241m7\u001b[39m]})\n\u001b[0;32m---> 15\u001b[0m \u001b[43mappend_a_df\u001b[49m\u001b[43m(\u001b[49m\u001b[43mdf\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdf2\u001b[49m\u001b[43m)\u001b[49m\n", "File \u001b[0;32m~/sandbox/py313/.venv/lib/python3.13/site-packages/bulwark/decorators.py:24\u001b[0m, in \u001b[0;36mBaseDecorator.__call__..decorated\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m 22\u001b[0m df \u001b[38;5;241m=\u001b[39m f(\u001b[38;5;241m*\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs)\n\u001b[1;32m 23\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39menabled:\n\u001b[0;32m---> 24\u001b[0m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mcheck_func\u001b[49m\u001b[43m(\u001b[49m\u001b[43mdf\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mcheck_func_params\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 25\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m df\n", "File \u001b[0;32m~/sandbox/py313/.venv/lib/python3.13/site-packages/bulwark/checks.py:570\u001b[0m, in \u001b[0;36mmulti_check\u001b[0;34m(df, checks, warn)\u001b[0m\n\u001b[1;32m 568\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m df\n\u001b[1;32m 569\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m error_msgs:\n\u001b[0;32m--> 570\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mAssertionError\u001b[39;00m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;130;01m\\n\u001b[39;00m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;241m.\u001b[39mjoin(\u001b[38;5;28mstr\u001b[39m(i) \u001b[38;5;28;01mfor\u001b[39;00m i \u001b[38;5;129;01min\u001b[39;00m error_msgs))\n\u001b[1;32m 572\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m df\n", "\u001b[0;31mAssertionError\u001b[0m: (4, 'a')" ] } ], "source": [ "@dc.MultiCheck(\n", " checks={\n", " ck.has_no_nans: {\"columns\": None},\n", " len_longer_than: {\"l\": 6}\n", " },\n", " warn=False,\n", ")\n", "def append_a_df(df, df2):\n", " return pd.concat([df, df2], ignore_index=True)\n", "\n", "\n", "df = pd.DataFrame({\"a\": [1, 2, 3], \"b\": [4, 5, 6]})\n", "df2 = pd.DataFrame({\"a\": [1, np.nan, 3, 4], \"b\": [4, 5, 6, 7]})\n", "\n", "append_a_df(df, df2)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.13 Kernel", "language": "python", "name": "python313" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.0" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }