String comparisons

In this notebook we use fuzzywuzzy, a popular library for fuzzy string comparison. It builds on difflib from the Python standard library. For more information on the available methods and how they differ, see the blog post FuzzyWuzzy: Fuzzy String Matching in Python.

See also:

textacy

1. Installation

With Spack you can provide fuzzywuzzy and the optional python-levenshtein library in your kernel:

$ spack env activate python-311
$ spack install py-fuzzywuzzy+speedup

Alternatively, you can install the two libraries with another package manager, for example with uv:

$ uv add "fuzzywuzzy[speedup]"

2. Import

[1]:

from fuzzywuzzy import fuzz, process

3. Example

[2]:

berlin = ["Berlin, Germany", "Berlin, Deutschland", "Berlin", "Berlin, DE"]

String similarity

[3]:

fuzz.ratio(berlin[0], berlin[1])

[3]:

[4]:

fuzz.ratio(berlin[0], berlin[2])

[4]:

[5]:

fuzz.ratio(berlin[0], berlin[3])

[5]:
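
Since fuzzywuzzy builds on difflib, the idea behind a simple ratio can be sketched with the standard library alone. The helper name `simple_ratio` is an assumption made for this illustration and is not part of fuzzywuzzy; the scores it prints may differ from those of `fuzz.ratio`, particularly when python-levenshtein is installed.

```python
from difflib import SequenceMatcher


def simple_ratio(s1: str, s2: str) -> int:
    """Return a similarity score between 0 and 100 (hypothetical helper)."""
    if not s1 or not s2:
        # Treat a comparison with an empty string as no similarity at all.
        return 0
    matcher = SequenceMatcher(None, s1, s2)
    # SequenceMatcher.ratio() is 2 * M / T, where M is the number of matching
    # characters and T is the total length of both strings.
    return int(round(100 * matcher.ratio()))


berlin = ["Berlin, Germany", "Berlin, Deutschland", "Berlin", "Berlin, DE"]

# Compare the first spelling with each of the others.
for other in berlin[1:]:
    print(f"{other!r}: {simple_ratio(berlin[0], other)}")
```

The score drops as the strings share fewer characters relative to their combined length, which is why a short string such as "Berlin" scores low against "Berlin, Germany" even though it is fully contained in it.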

Partial string similarity

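Strings that contain the same name with inconsistent extra parts are a common problem. To get around this, fuzzywuzzy uses a heuristic called best partial: when the two strings differ noticeably in length, the shorter one is compared with the best-matching substring of the same length in the longer one.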

[6]:

fuzz.partial_ratio(berlin[0], berlin[1])

[6]:

[7]:

fuzz.partial_ratio(berlin[0], berlin[2])

[7]:
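
The best partial heuristic can be illustrated with a brute-force sliding window over the longer string. This is a simplified sketch using only difflib; the name `best_partial_ratio` is hypothetical, and fuzzywuzzy itself selects its candidate substrings differently, so the scores are not guaranteed to match `fuzz.partial_ratio`.

```python
from difflib import SequenceMatcher


def best_partial_ratio(s1: str, s2: str) -> int:
    """Score the shorter string against every equally long slice of the longer one."""
    shorter, longer = sorted((s1, s2), key=len)
    if not shorter:
        return 0
    window = len(shorter)
    best = 0.0
    # Slide a window of len(shorter) characters across the longer string
    # and keep the highest similarity found.
    for start in range(len(longer) - window + 1):
        candidate = longer[start:start + window]
        score = SequenceMatcher(None, shorter, candidate).ratio()
        best = max(best, score)
    return int(round(100 * best))


print(best_partial_ratio("Berlin, Germany", "Berlin"))
print(best_partial_ratio("Berlin, Germany", "Berlin, Deutschland"))
```

Because "Berlin" appears verbatim at the start of "Berlin, Germany", the first comparison reaches the maximum score, whereas a plain ratio penalises the difference in length.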

Token sorting

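In token sorting, each string is split into tokens, the tokens are sorted alphabetically and then joined back into a single string before the comparison. This is what fuzz.token_sort_ratio does; the cells below call the related fuzz.token_set_ratio, which compares the sets of tokens and is therefore more tolerant of additional or repeated words: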

[8]:

fuzz.token_set_ratio(berlin[0], berlin[1])

[8]:

[9]:

fuzz.token_set_ratio(berlin[0], berlin[2])

[9]:
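
The two token-based approaches can be sketched with the standard library as follows. All function names here are assumptions made for this illustration, and the normalisation (lower-casing, keeping only word characters) is a simplification of what fuzzywuzzy does, so the scores may differ from `fuzz.token_sort_ratio` and `fuzz.token_set_ratio`.

```python
import re
from difflib import SequenceMatcher


def normalize_tokens(text: str) -> list[str]:
    """Lower-case the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text.lower())


def sorted_token_string(text: str) -> str:
    """Sort the tokens alphabetically and join them back into one string."""
    return " ".join(sorted(normalize_tokens(text)))


def _score(a: str, b: str) -> int:
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def token_sort_score(s1: str, s2: str) -> int:
    """Compare the strings after sorting their tokens, so word order is ignored."""
    a, b = sorted_token_string(s1), sorted_token_string(s2)
    if not a or not b:
        return 0
    return _score(a, b)


def token_set_score(s1: str, s2: str) -> int:
    """Compare the shared tokens with each string's shared-plus-remaining tokens."""
    tokens1, tokens2 = set(normalize_tokens(s1)), set(normalize_tokens(s2))
    if not tokens1 or not tokens2:
        return 0
    common = " ".join(sorted(tokens1 & tokens2))
    combined1 = (common + " " + " ".join(sorted(tokens1 - tokens2))).strip()
    combined2 = (common + " " + " ".join(sorted(tokens2 - tokens1))).strip()
    pairs = [(common, combined1), (common, combined2), (combined1, combined2)]
    # The best of the three pairwise comparisons is the final score.
    return max(_score(a, b) for a, b in pairs)


print(sorted_token_string("Germany, Berlin"))
print(token_sort_score("Berlin, Germany", "Germany Berlin"))
print(token_set_score("Berlin, Germany", "Berlin"))
```

Sorting makes the comparison independent of word order, while the set-based variant additionally rewards strings whose tokens are a subset of the other string's tokens.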

Further information

[10]:

fuzz.ratio?

Signature: fuzz.ratio(s1, s2)
Docstring: <no docstring>
File:      ~/cusy/trn/jupyter-tutorial/uvenvs/py313/.venv/lib/python3.13/site-packages/fuzzywuzzy/fuzz.py
Type:      function

Extract from a list

[11]:

choices = [
    "Germany",
    "Deutschland",
    "France",
    "United Kingdom",
    "Great Britain",
    "United States",
]

[12]:

process.extract("DE", choices, limit=2)

[12]:

[('Deutschland', 90), ('Germany', 45)]

[13]:

process.extract("Vereinigtes Königreich", choices)

[13]:

[('United Kingdom', 51),
 ('United States', 41),
 ('Germany', 39),
 ('Great Britain', 35),
 ('Deutschland', 31)]

[14]:

process.extractOne("frankreich", choices)

[14]:

('France', 62)

[15]:

process.extractOne("U.S.", choices)

[15]:

('United States', 86)
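
Conceptually, extracting from a list means scoring every choice against the query and returning the best-scoring entries. The following minimal sketch shows that idea with difflib as a stand-in scorer; `score` and `extract_best` are hypothetical names. The default scorer of `process.extract` combines several ratios, so the numbers produced here will not match the outputs above.

```python
from difflib import SequenceMatcher


def score(query: str, choice: str) -> int:
    """Case-insensitive similarity between 0 and 100 (stand-in scorer)."""
    matcher = SequenceMatcher(None, query.lower(), choice.lower())
    return int(round(100 * matcher.ratio()))


def extract_best(query: str, choices: list[str], limit: int = 5) -> list[tuple[str, int]]:
    """Return up to `limit` (choice, score) pairs, best match first."""
    scored = [(choice, score(query, choice)) for choice in choices]
    # Sort by score in descending order; ties keep their original order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


choices = [
    "Germany",
    "Deutschland",
    "France",
    "United Kingdom",
    "Great Britain",
    "United States",
]

print(extract_best("frankreich", choices, limit=2))
print(extract_best("france", choices, limit=1))
```

Keeping the scorer as a separate function mirrors the way `process.extract` accepts a `scorer` argument, so a different comparison method can be swapped in without touching the ranking logic.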

Known ports

FuzzyWuzzy is also ported to other languages! Here are some known ports: