{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data validation with Voluptuous (schema definitions)\n", "\n", "In this notebook we use [Voluptuous](https://github.com/alecthomas/voluptuous) to define schemas for our data. We can then use schema checking at various points in our cleanup to ensure that we meet the criteria. Finally, we can use schema checking exceptions to flag, set aside or remove impure or invalid data.\n", "\n", "
\n", "\n", "**See also:**\n", "\n", "* [Validr](https://github.com/guyskk/validr)\n", "* [marshmallow](https://marshmallow.readthedocs.io/en/latest/)\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Imports" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import logging\n", "\n", "from datetime import datetime\n", "\n", "import pandas as pd\n", "\n", "from voluptuous import ALLOW_EXTRA, All, Range, Required, Schema\n", "from voluptuous.error import Invalid, MultipleInvalid" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `Required` marks the node of a schema as required and optionally specifies a default value, see also [voluptuous.schema_builder.Required](http://alecthomas.github.io/voluptuous/docs/_build/html/voluptuous.html?highlight=required#voluptuous.schema_builder.Required).\n", "* `Range` limits the value to a range where either `min` or `max` can be omitted; see also [voluptuous.validators.Range](http://alecthomas.github.io/voluptuous/docs/_build/html/voluptuous.html?highlight=required#voluptuous.validators.Range).\n", "* `ALL` is used for cross-field validations: checks the basic structure of the data in a first pass and only in the second pass the cross-field validation is applied; see also [voluptuous.validators.All](http://alecthomas.github.io/voluptuous/docs/_build/html/voluptuous.html?highlight=required#voluptuous.validators.All).\n", "* `ALLOW_EXTRA` allows additional dictionary keys.\n", "* `MultipleInvalid` is based on `Invalid`, see also [voluptuous.error.MultipleInvalid](http://alecthomas.github.io/voluptuous/docs/_build/html/voluptuous.html?highlight=required#voluptuous.error.MultipleInvalid).\n", "* `Invalid` marks data as invalid, see also [voluptuous.error.Invalid](http://alecthomas.github.io/voluptuous/docs/_build/html/voluptuous.html?highlight=required#voluptuous.error.Invalid)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Logger" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "logger = logging.getLogger(0)\n", "logger.setLevel(logging.WARNING)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Read sample data" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "sales = pd.read_csv(\n", " \"https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/sales_data.csv\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Examine data" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0timestampcitystore_idsale_numbersale_amountassociate
002018-09-10 05:00:45Williamburgh615301167.0Gary Lee
112018-09-12 10:01:27Ibarraberg12744258.0Daniel Davis
222018-09-13 12:01:48Sarachester21908266.0Michael Roth
332018-09-14 20:02:19Caldwellbury14771-108.0Michaela Stewart
442018-09-16 01:03:21Erikaland111571-372.0Mark Taylor
\n", "
" ], "text/plain": [ " Unnamed: 0 timestamp city store_id sale_number \\\n", "0 0 2018-09-10 05:00:45 Williamburgh 6 1530 \n", "1 1 2018-09-12 10:01:27 Ibarraberg 1 2744 \n", "2 2 2018-09-13 12:01:48 Sarachester 2 1908 \n", "3 3 2018-09-14 20:02:19 Caldwellbury 14 771 \n", "4 4 2018-09-16 01:03:21 Erikaland 11 1571 \n", "\n", " sale_amount associate \n", "0 1167.0 Gary Lee \n", "1 258.0 Daniel Davis \n", "2 266.0 Michael Roth \n", "3 -108.0 Michaela Stewart \n", "4 -372.0 Mark Taylor " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sales.head()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(213, 7)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sales.shape" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Unnamed: 0 int64\n", "timestamp object\n", "city object\n", "store_id int64\n", "sale_number int64\n", "sale_amount float64\n", "associate object\n", "dtype: object" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sales.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Define schema\n", "\n", "In the column `sale_amount` all values should be between 2.5 and 1450.99:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "schema = Schema(\n", " {\n", " Required(\"sale_amount\"): All(float, Range(min=2.50, max=1450.99)),\n", " },\n", " extra=ALLOW_EXTRA,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To be able to use the elements of one column as keys and the elements of another column as values, we simply make the desired column the index of the DataFrame and transpose it with the function `.T()`; see also [pandas.DataFrame.transpose](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transpose.html)." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:root:issue with sale: 3 (-108.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 4 (-372.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 5 (-399.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 6 (-304.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 7 (-295.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 10 (-89.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 13 (-303.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 15 (-432.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 19 (-177.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 20 (-154.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 22 (-130.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 23 (1487.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 25 (-145.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 28 (1471.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 31 (-259.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 38 (-241.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 40 (-4.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 41 (1581.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 45 (1529.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 46 (-238.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 48 (-284.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 51 (-164.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 55 (-184.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 56 (-304.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 59 (1579.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 60 (-455.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 63 (1551.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 65 (-397.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 69 (-400.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 70 (1482.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 71 (-321.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 74 (-47.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 76 (-68.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 86 (1454.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 101 (-213.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 103 (-144.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 104 (-265.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 107 (-349.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 111 (-78.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 112 (-310.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 116 (1570.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 120 (1490.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 123 (-179.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 124 (-391.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 129 (1504.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 130 (-91.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 132 (-372.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 141 (1512.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 142 (-449.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 149 (1494.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 152 (-405.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 155 (1599.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 156 (1527.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 157 (-462.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 162 (-358.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 164 (-78.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 167 (-358.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 171 (-391.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 178 (-304.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 180 (-9.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 187 (1475.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 194 (-433.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 195 (-329.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 196 (-147.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 203 (-319.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 206 (-132.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 207 (-20.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 209 (1539.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']\n", "WARNING:root:issue with sale: 211 (-167.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n" ] } ], "source": [ "error_count = 0\n", "for s_id, sale in sales.T.to_dict().items():\n", " try:\n", " schema(sale)\n", " except MultipleInvalid as e:\n", " logging.warning(\n", " \"issue with sale: %s (%s) - %s\", s_id, sale[\"sale_amount\"], e\n", " )\n", " error_count += 1" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "69" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "error_count" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Currently, however, we do not yet know whether\n", "\n", "* we have a wrongly defined schema\n", "* possibly negative values are returned or incorrectly marked\n", "* higher values are combined purchases or special sales" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Adding a custom validation" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "def ValidDate(fmt=\"%Y-%m-%d %H:%M:%S\"):\n", " return lambda v: datetime.strptime(v, fmt)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "schema = Schema(\n", " {\n", " Required(\"timestamp\"): All(ValidDate()),\n", " },\n", " extra=ALLOW_EXTRA,\n", ")" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "error_count = 0\n", "for s_id, sale in sales.T.to_dict().items():\n", " try:\n", " schema(sale)\n", " except MultipleInvalid as e:\n", " logging.warning(\n", " \"issue with sale: %s (%s) - %s\", s_id, sale[\"timestamp\"], e\n", " )\n", " error_count += 1" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "error_count" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. Valid date structures are not yet valid dates" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "def ValidDate(fmt=\"%Y-%m-%d %H:%M:%S\"):\n", " def validation_func(v):\n", " try:\n", " assert datetime.strptime(v, fmt) <= datetime.now()\n", " except AssertionError:\n", " raise Invalid(\"date is in the future! %s\" % v)\n", "\n", " return validation_func" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "schema = Schema(\n", " {\n", " Required(\"timestamp\"): All(ValidDate()),\n", " },\n", " extra=ALLOW_EXTRA,\n", ")" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "error_count = 0\n", "for s_id, sale in sales.T.to_dict().items():\n", " try:\n", " schema(sale)\n", " except MultipleInvalid as e:\n", " logging.warning(\n", " \"issue with sale: %s (%s) - %s\", s_id, sale[\"timestamp\"], e\n", " )\n", " error_count += 1" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "error_count" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.13 Kernel", "language": "python", "name": "python313" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.0" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }