String comparisons#

In this notebook we use fuzzywuzzy, a popular library for string comparisons. It builds on difflib from the Python standard library. For more information on the various methods available and their differences, see the blog post FuzzyWuzzy: Fuzzy String Matching in Python.

1. Installation#

With Spack you can provide fuzzywuzzy and the optional python-levenshtein library in your kernel:

$ spack env activate python-311
$ spack install py-fuzzywuzzy+speedup

Alternatively, you can install the two libraries with other package managers, for example:

$ pipenv install fuzzywuzzy[speedup]
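
To check whether the optional speedup is actually available in the current kernel, a small test like the following may help. It is only a sketch and assumes that python-Levenshtein is importable under the module name Levenshtein; without it, fuzzywuzzy falls back to the slower pure-Python difflib matcher and emits a warning on import.

```python
import importlib.util

# find_spec returns None if the module cannot be found in this environment
has_speedup = importlib.util.find_spec("Levenshtein") is not None
print("python-Levenshtein available:", has_speedup)
```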

2. Import#

from fuzzywuzzy import fuzz, process

3. Example#

berlin = [
    "Berlin, Germany",
    "Berlin, Deutschland",
    "Berlin, DE"]

String similarity#

The similarity of the first two strings 'Berlin, Germany' and 'Berlin, Deutschland' seems low:

fuzz.ratio(berlin[0], berlin[1])
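
To see where such a score comes from, the following sketch computes a comparable value with difflib alone. The name simple_ratio is a hypothetical helper, not part of fuzzywuzzy, and the sketch assumes the pure-Python backend; with python-Levenshtein installed the scores can differ slightly.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Return a 0-100 similarity score based on difflib (hypothetical helper)."""
    # SequenceMatcher.ratio() is 2*M/T, where M is the number of matched
    # characters and T is the combined length of both strings.
    return round(100 * SequenceMatcher(None, a, b).ratio())


berlin = ["Berlin, Germany", "Berlin, Deutschland", "Berlin, DE"]
print(simple_ratio(berlin[0], berlin[1]))
```

Because the score is normalised by the combined length of both strings, characters that appear in only one of them pull it down, even when the strings share a long common prefix such as 'Berlin, '.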

Partial string similarity#

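Strings that overlap only in part, for example when one string contains the other plus additional text, are a common problem. To get around this, fuzzywuzzy uses a heuristic called best partial: the shorter string is compared with substrings of the longer one, and the best score is kept.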

fuzz.partial_ratio(berlin[0], berlin[1])
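
The idea behind best partial can be illustrated with a simplified sketch that slides the shorter string over the longer one. This is not fuzzywuzzy's actual implementation, which chooses candidate alignments more selectively; simple_partial_ratio is a hypothetical name, and its scores may differ from those of fuzz.partial_ratio.

```python
from difflib import SequenceMatcher


def simple_partial_ratio(a: str, b: str) -> int:
    """Best 0-100 score of the shorter string against equally long windows
    of the longer string (simplified, hypothetical helper)."""
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return 0  # nothing to compare
    window = len(shorter)
    best = 0.0
    # Try every window of the longer string that has the shorter string's length
    for start in range(len(longer) - window + 1):
        candidate = longer[start:start + window]
        best = max(best, SequenceMatcher(None, shorter, candidate).ratio())
    return round(100 * best)


print(simple_partial_ratio("Berlin", "Berlin, Germany"))
```

A string that is fully contained in the other reaches the maximum score, which is the case this heuristic is designed for.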

Token sorting#

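In token sorting, each string is split into tokens, the tokens are sorted alphabetically and then reassembled into a string before the comparison. The example below uses the related token_set_ratio, which additionally takes into account the tokens that both strings have in common: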

fuzz.ratio(berlin[1], berlin[2])
fuzz.token_set_ratio(berlin[1], berlin[2])
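
The token sorting step itself can be sketched with the standard library. The helpers sort_tokens and simple_token_sort_ratio are hypothetical names used for illustration only; the preprocessing (lower-casing and replacing non-word characters with spaces) is an approximation of what fuzzywuzzy does, so scores may not match fuzz.token_sort_ratio exactly.

```python
import re
from difflib import SequenceMatcher


def sort_tokens(s: str) -> str:
    """Lower-case, split into tokens, sort them and rejoin (hypothetical helper)."""
    tokens = re.sub(r"\W+", " ", s.lower()).split()
    return " ".join(sorted(tokens))


def simple_token_sort_ratio(a: str, b: str) -> int:
    """0-100 difflib score of the two strings after token sorting."""
    return round(100 * SequenceMatcher(None, sort_tokens(a), sort_tokens(b)).ratio())


print(sort_tokens("Deutschland, Berlin"))
print(simple_token_sort_ratio("Berlin, Deutschland", "Deutschland Berlin"))
```

After sorting, word order and punctuation no longer influence the score, so strings that contain the same words in a different order compare as identical.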

Extract from a list#

choices = [
    "Germany",
    "Deutschland",
    "France",
    "United Kingdom",
    "Great Britain",
    "United States",
]

process.extract("DE", choices, limit=2)

[('Deutschland', 90), ('Germany', 45)]

process.extract("Vereinigtes Königreich", choices)

[('United Kingdom', 51),
 ('United States', 41),
 ('Germany', 39),
 ('Great Britain', 35),
 ('Deutschland', 31)]

process.extractOne("frankreich", choices)

('France', 62)

process.extractOne("U.S.", choices)

('United States', 86)

Known ports#

FuzzyWuzzy is also ported to other languages! Here are some known ports: