pandas DataFrame Validation with Bulwark

Bulwark is a package for property-based testing of pandas DataFrames. The project was heavily influenced by the Engarde library, which is no longer supported.

1. Installation

$ pipenv install bulwark
Installing bulwark…
Adding bulwark to Pipfile's [packages]…
✔ Installation Succeeded
Locking [dev-packages] dependencies…
✔ Success!
Updated Pipfile.lock (0d075a)!

2. Use

2.1 Checks

With the bulwark.checks module you can check many common assumptions, e.g.

  • has_columns checks whether certain columns exist, optionally requiring exactly those columns and in the correct order

  • has_dtypes checks the data types of columns

  • has_no_infs checks if there are no numpy.inf in the DataFrame

  • has_no_nans checks if there are no numpy.nan in the DataFrame

  • has_set_within_vals checks if the values specified in a dict are a subset of the values found in the associated column

  • has_unique_index checks if the index is unique

  • is_monotonic checks whether values of a column are ascending or descending

  • one_to_many checks whether there is an n:1 relationship between two columns

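Applying a check is then very simple. For example, to check that a DataFrame df contains no numpy.nan, pass the check function to DataFrame.pipe:

import bulwark.checks as ck

# Pass the function itself, without calling it; pipe supplies df as the first argument.
df.pipe(ck.has_no_nans)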
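To illustrate why checks fit into DataFrame.pipe, here is a minimal sketch of the pattern in plain pandas. It is not Bulwark's implementation: the functions assert_no_nans and assert_unique_index are hypothetical stand-ins, written only to show that a check either raises an AssertionError or returns the DataFrame unchanged, which is what allows chaining.

```python
import numpy as np
import pandas as pd


def assert_no_nans(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical check: raise if any cell is NaN, else return df unchanged."""
    if df.isna().any().any():
        raise AssertionError("DataFrame contains NaN values.")
    return df


def assert_unique_index(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical check: raise if the index has duplicate labels."""
    if not df.index.is_unique:
        raise AssertionError("Index is not unique.")
    return df


clean = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
dirty = pd.DataFrame({"a": [1, np.nan, 3], "b": [4, 5, 6]})

# Each check returns the frame it received, so checks chain inside a pipeline.
checked = clean.pipe(assert_no_nans).pipe(assert_unique_index)

# A failing check interrupts the pipeline with an AssertionError.
try:
    dirty.pipe(assert_no_nans)
    nan_detected = False
except AssertionError:
    nan_detected = True
```

Because a passing check hands back the same object, such checks can be placed between ordinary transformation steps without changing the result of the pipeline.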

2.2 Decorators

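For each check, Bulwark also provides a decorator in the bulwark.decorators module (imported below as dc), e.g. @dc.IsShape((-1, 10)) or @dc.IsMonotonic(strict=True). A decorator runs its check on the DataFrame returned by the decorated function.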
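The decorator mechanism can be sketched in plain Python. The following is an assumption-laden illustration, not Bulwark's code: check_result and assert_shape are hypothetical names, and treating -1 as "any size" is assumed here to mirror the (-1, 10) example above. The wrapper calls the function, runs the check on the returned DataFrame if enabled, and then returns the DataFrame.

```python
import functools

import pandas as pd


def check_result(check_func, enabled=True, **check_params):
    """Hypothetical decorator factory: run check_func on the function's result."""

    def decorator(func):
        @functools.wraps(func)
        def decorated(*args, **kwargs):
            df = func(*args, **kwargs)
            if enabled:
                # The check raises AssertionError on failure.
                check_func(df, **check_params)
            return df

        return decorated

    return decorator


def assert_shape(df, shape):
    """Hypothetical check: compare df.shape to shape; -1 matches any size."""
    for actual, expected in zip(df.shape, shape):
        if expected != -1 and actual != expected:
            raise AssertionError(f"Expected shape {shape}, got {df.shape}.")
    return df


@check_result(assert_shape, shape=(-1, 2))
def load_frame():
    return pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})


@check_result(assert_shape, shape=(-1, 10))
def load_wrong_frame():
    return pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})


frame = load_frame()  # passes: any number of rows, exactly 2 columns

try:
    load_wrong_frame()
    shape_error = None
except AssertionError as exc:
    shape_error = str(exc)
```

The enabled flag in this sketch makes it possible to switch a check off, for example in production, without removing the decorator.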

CustomCheck

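You can also write your own check functions and attach them with CustomCheck. In the following example the combined DataFrame has only 7 rows, so the check for more than 10 rows fails: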

[1]:
import bulwark.checks as ck
import bulwark.decorators as dc
import numpy as np
import pandas as pd


def len_longer_than(df, l):
    if len(df) <= l:
        raise AssertionError("df is not as long as expected.")
    return df


@dc.CustomCheck(len_longer_than, 10)
def append_a_df(df, df2):
    return pd.concat([df, df2], ignore_index=True)


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df2 = pd.DataFrame({"a": [1, np.nan, 3, 4], "b": [4, 5, 6, 7]})

append_a_df(df, df2)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[1], line 21
     18 df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
     19 df2 = pd.DataFrame({"a": [1, np.nan, 3, 4], "b": [4, 5, 6, 7]})
---> 21 append_a_df(df, df2)

File ~/.local/share/virtualenvs/python-311-6zxVKbDJ/lib/python3.11/site-packages/bulwark/decorators.py:81, in CustomCheck.__call__.<locals>.decorated(*args, **kwargs)
     78 df = f(*args, **kwargs)
     79 if self.enabled:
     80     # differs from BaseDecorator
---> 81     ck.custom_check(df, self.check_func, **self.check_func_params)
     82 return df

File ~/.local/share/virtualenvs/python-311-6zxVKbDJ/lib/python3.11/site-packages/bulwark/checks.py:588, in custom_check(df, check_func, *args, **kwargs)
    576 """Assert that `check(df, *args, **kwargs)` is true.
    577
    578 Args:
   (...)
    585
    586 """
    587 try:
--> 588     check_func(df, *args, **kwargs)
    589 except AssertionError as e:
    590     msg = "{} is not true.".format(check_func.__name__)

Cell In[1], line 9, in len_longer_than(df, l)
      7 def len_longer_than(df, l):
      8     if len(df) <= l:
----> 9         raise AssertionError("df is not as long as expected.")
     10     return df

AssertionError: len_longer_than is not true.

MultiCheck

With MultiCheck you can run several checks in one go and see all the errors at once, for example:

[2]:
@dc.MultiCheck(
    checks={
        ck.has_no_nans: {"columns": None},
        len_longer_than: {"l": 6}
    },
    warn=False,
)
def append_a_df(df, df2):
    return pd.concat([df, df2], ignore_index=True)


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df2 = pd.DataFrame({"a": [1, np.nan, 3, 4], "b": [4, 5, 6, 7]})

append_a_df(df, df2)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[2], line 15
     12 df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
     13 df2 = pd.DataFrame({"a": [1, np.nan, 3, 4], "b": [4, 5, 6, 7]})
---> 15 append_a_df(df, df2)

File ~/.local/share/virtualenvs/python-311-6zxVKbDJ/lib/python3.11/site-packages/bulwark/decorators.py:24, in BaseDecorator.__call__.<locals>.decorated(*args, **kwargs)
     22 df = f(*args, **kwargs)
     23 if self.enabled:
---> 24     self.check_func(df, **self.check_func_params)
     25 return df

File ~/.local/share/virtualenvs/python-311-6zxVKbDJ/lib/python3.11/site-packages/bulwark/checks.py:570, in multi_check(df, checks, warn)
    568     return df
    569 elif error_msgs:
--> 570     raise AssertionError("\n".join(str(i) for i in error_msgs))
    572 return df

AssertionError: (4, 'a')
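In the run above only has_no_nans fails: the message (4, 'a') appears to point at the NaN in row 4 of column a, while len_longer_than passes because the combined DataFrame has 7 rows. The idea of collecting every failure before raising can be sketched in plain pandas as follows. This is not Bulwark's implementation; run_all_checks, assert_no_nans and assert_longer_than are hypothetical names, and the threshold of 10 is chosen so that both checks fail.

```python
import numpy as np
import pandas as pd


def assert_no_nans(df):
    """Hypothetical check: raise if any cell is NaN."""
    if df.isna().any().any():
        raise AssertionError("DataFrame contains NaN values.")
    return df


def assert_longer_than(df, l):
    """Hypothetical check: raise unless df has more than l rows."""
    if len(df) <= l:
        raise AssertionError("df is not as long as expected.")
    return df


def run_all_checks(df, checks):
    """Run every check, collect all failure messages, then raise once."""
    error_msgs = []
    for check_func, params in checks.items():
        try:
            check_func(df, **params)
        except AssertionError as exc:
            error_msgs.append(str(exc))
    if error_msgs:
        raise AssertionError("\n".join(error_msgs))
    return df


df = pd.concat(
    [
        pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}),
        pd.DataFrame({"a": [1, np.nan, 3, 4], "b": [4, 5, 6, 7]}),
    ],
    ignore_index=True,
)

try:
    run_all_checks(df, {assert_no_nans: {}, assert_longer_than: {"l": 10}})
    combined_message = None
except AssertionError as exc:
    combined_message = str(exc)
```

Here combined_message holds one line per failed check, so a single run reports every violated assumption instead of stopping at the first.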