{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# String comparisons\n", "\n", "In this notebook we use the popular library for string comparisons [fuzzywuzzy](https://github.com/seatgeek/fuzzywuzzy). It is based on the built-in Python library [difflib](https://docs.python.org/3/library/difflib.html). For more information on the various methods available and their differences, see the blog post [FuzzyWuzzy: Fuzzy String Matching in Python](https://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/).\n", "\n", "
\n", "\n", "**See also:**\n", "\n", "* [textacy](https://github.com/chartbeat-labs/textacy)\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Installation\n", "\n", "With [Spack](../productive/envs/spack/index.rst) you can provide `fuzzywuzzy` and the optional `python-levenshtein` library in your kernel:\n", "\n", "``` bash\n", "$ spack env activate python-311\n", "$ spack install py-fuzzywuzzy+speedup\n", "```\n", "\n", "Alternatively, you can install the two libraries with other package managers, for example\n", "\n", "``` bash\n", "$ uv add \"fuzzywuzzy[speedup]\"\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Import" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from fuzzywuzzy import fuzz, process" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Example" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "berlin = [\n", " \"Berlin, Germany\",\n", " \"Berlin, Deutschland\",\n", " \"Berlin\",\n", " \"Berlin, DE\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## String similarity\n", "\n", "The similarity of the first two strings `'Berlin, Germany'` and `'Berlin, Deutschland'` seems low:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "65" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fuzz.ratio(berlin[0], berlin[1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Partial string similarity\n", "\n", "Inconsistent partial strings are a common problem. To get around this, fuzzywuzzy uses a heuristic called _best partial_." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "60" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fuzz.partial_ratio(berlin[0], berlin[1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Token sorting\n", "\n", "In token sorting, the string in question is given a token, the tokens are sorted alphabetically and then reassembled into a string, for example:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "48" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fuzz.ratio(berlin[1], berlin[2])" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "100" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fuzz.token_set_ratio(berlin[1], berlin[2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Further information" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\u001b[0;31mSignature:\u001b[0m \u001b[0mfuzz\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mratio\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0ms2\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mDocstring:\u001b[0m \n", "\u001b[0;31mFile:\u001b[0m ~/sandbox/py313/.venv/lib/python3.13/site-packages/fuzzywuzzy/fuzz.py\n", "\u001b[0;31mType:\u001b[0m function" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fuzz.ratio?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extract from a list" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "choices = [\n", " \"Germany\",\n", " \"Deutschland\",\n", " \"France\",\n", " \"United Kingdom\",\n", " \"Great Britain\",\n", " \"United States\",\n", "]" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('Deutschland', 90), ('Germany', 45)]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "process.extract(\"DE\", choices, limit=2)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('United Kingdom', 51),\n", " ('United States', 41),\n", " ('Germany', 39),\n", " ('Great Britain', 35),\n", " ('Deutschland', 31)]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "process.extract(\"Vereinigtes Königreich\", choices)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('France', 62)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "process.extractOne(\"frankreich\", choices)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('United States', 86)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "process.extractOne(\"U.S.\", choices)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Known ports\n", "\n", "FuzzyWuzzy is also ported to other languages! Here are some known ports:\n", "\n", "* Java: [xpresso](https://github.com/WantedTechnologies/xpresso)\n", "* Java: [xdrop fuzzywuzzy](https://github.com/xdrop/fuzzywuzzy)\n", "* Rust: [fuzzyrusty](https://github.com/logannc/fuzzywuzzy-rs)\n", "* JavaScript: [fuzzball.js](https://github.com/nol13/fuzzball.js)\n", "* C++: [tmplt fuzzywuzzy](https://github.com/Tmplt/fuzzywuzzy)\n", "* C#: [FuzzySharp](https://github.com/BoomTownRoi/BoomTown.FuzzySharp)\n", "* Go: [go-fuzzywuzzy](https://github.com/paul-mannino/go-fuzzywuzzy)\n", "* Pascal: [FuzzyWuzzy.pas](https://github.com/DavidMoraisFerreira/FuzzyWuzzy.pas)\n", "* Kotlin: [FuzzyWuzzy-Kotlin](https://github.com/willowtreeapps/fuzzywuzzy-kotlin)\n", "* R: [fuzzywuzzyR](https://github.com/mlampros/fuzzywuzzyR)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.13 Kernel", "language": "python", "name": "python313" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.0" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }