Deduplicating data#

In this notebook, we deduplicate data using the Dedupe library, which uses a flat neural network to learn from a small amount of labelled training data.

See also

  • csvdedupe offers a command line interface for Dedupe.

In addition, the same developers have created parserator, which you can use to extract text features and train your own text extraction.

1. Load sample data#

[1]:
import pandas as pd
[2]:
customers = pd.read_csv(
    "https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/customer_data_duped.csv",
    encoding="utf-8",
)

2. Deduplicate with pandas#

2.1 Show data#

[3]:
customers
[3]:
name job company street_address city state email user_name
0 Patricia Schaefer Programmer, systems Estrada-Best 398 Paul Drive Christianview Delaware lambdavid@gmail.com ndavidson
1 Olivie Dubois Ingénieur recherche et développement en agroal... Moreno rue Lucas Benard Saint Anastasie-les-Bains AR berthelotjacqueline@mahe.fr manonallain
2 Mary Davies-Kirk Public affairs consultant Baker Ltd Flat 3\nPugh mews Stanleyfurt ZA middletonconor@hotmail.com colemanmichael
3 Miroslawa Eckbauer Dispensing optician Ladeck GmbH Mijo-Lübs-Straße 12 Neubrandenburg Berlin sophia01@yahoo.de romanjunitz
4 Richard Bauer Accountant, chartered certified Hoffman-Rocha 6541 Rodriguez Wall Carlosmouth Texas tross@jensen-ware.org adam78
... ... ... ... ... ... ... ... ...
2075 Maurice Stey Systems developer Linke Margraf GmbH & Co. OHG Laila-Scheibe-Allee 2/0 Luckenwalde Hamburg gutknechtevelyn@niemeier.com dkreusel
2076 Linda Alexander Commrcil horiculuri Webb, Ballald and Vasquel 5594 Persn Ciff Mooneybury Maryland ahleythoa@ail.co kennethrchn
2077 Diane Bailly Pharmacien Voisin 527, rue Dijoux Duval-les-Bains CH aruiz@reynaud.fr dorothee41
2078 Jorge Riba Cerdán Hotel manager Amador-Diego Rambla de Adriana Barceló 854 Puerta 3 Huesca Asturias manuelamosquera@yahoo.com eugenia17
2079 Ryan Thompson Brewing technologist Smith-Sullivan 136 Rodriguez Point Bradfordborough North Dakota lcruz@gmail.com cnewton

2080 rows × 8 columns

2.2 Show data types#

For this we use pandas.DataFrame.dtypes:

[4]:
customers.dtypes
[4]:
name              object
job               object
company           object
street_address    object
city              object
state             object
email             object
user_name         object
dtype: object

2.3 Determine missing values#

pandas.isnull shows, for an array-like object, whether values are missing:

  • NaN in numeric arrays

  • None or NaN in object arrays

  • NaT in datetimelike
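
The three cases can be illustrated on small arrays (a minimal sketch, not part of the original notebook):

```python
import numpy as np
import pandas as pd

# NaN in a numeric array
print(pd.isnull(np.array([1.0, np.nan])).tolist())  # [False, True]

# None or NaN in an object array
print(pd.isnull(pd.Series(["a", None, np.nan], dtype=object)).tolist())  # [False, True, True]

# NaT in a datetimelike array
print(pd.isnull(pd.to_datetime(["2021-01-01", None])).tolist())  # [False, True]
```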

[5]:
for col in customers.columns:
    print(col, customers[col].isnull().sum())
name 0
job 0
company 0
street_address 0
city 0
state 0
email 0
user_name 0

2.4 Determine duplicate records#

[6]:
customers.duplicated()
[6]:
0       False
1       False
2       False
3       False
4       False
        ...
2075    False
2076    False
2077    False
2078    False
2079    False
Length: 2080, dtype: bool

customers.duplicated() alone does not tell us whether any records are duplicated. In the following, we output all records for which True is returned:

[7]:
customers[customers.duplicated()]
[7]:
name job company street_address city state email user_name

Apparently there are no duplicated records.

2.5 Delete duplicated data#

Deleting duplicated records with drop_duplicates should therefore not change anything and leave the number of records at 2080:

[8]:
customers.drop_duplicates()
[8]:
name job company street_address city state email user_name
0 Patricia Schaefer Programmer, systems Estrada-Best 398 Paul Drive Christianview Delaware lambdavid@gmail.com ndavidson
1 Olivie Dubois Ingénieur recherche et développement en agroal... Moreno rue Lucas Benard Saint Anastasie-les-Bains AR berthelotjacqueline@mahe.fr manonallain
2 Mary Davies-Kirk Public affairs consultant Baker Ltd Flat 3\nPugh mews Stanleyfurt ZA middletonconor@hotmail.com colemanmichael
3 Miroslawa Eckbauer Dispensing optician Ladeck GmbH Mijo-Lübs-Straße 12 Neubrandenburg Berlin sophia01@yahoo.de romanjunitz
4 Richard Bauer Accountant, chartered certified Hoffman-Rocha 6541 Rodriguez Wall Carlosmouth Texas tross@jensen-ware.org adam78
... ... ... ... ... ... ... ... ...
2075 Maurice Stey Systems developer Linke Margraf GmbH & Co. OHG Laila-Scheibe-Allee 2/0 Luckenwalde Hamburg gutknechtevelyn@niemeier.com dkreusel
2076 Linda Alexander Commrcil horiculuri Webb, Ballald and Vasquel 5594 Persn Ciff Mooneybury Maryland ahleythoa@ail.co kennethrchn
2077 Diane Bailly Pharmacien Voisin 527, rue Dijoux Duval-les-Bains CH aruiz@reynaud.fr dorothee41
2078 Jorge Riba Cerdán Hotel manager Amador-Diego Rambla de Adriana Barceló 854 Puerta 3 Huesca Asturias manuelamosquera@yahoo.com eugenia17
2079 Ryan Thompson Brewing technologist Smith-Sullivan 136 Rodriguez Point Bradfordborough North Dakota lcruz@gmail.com cnewton

2080 rows × 8 columns

Now we want to delete only those records whose user_name is identical:

[9]:
customers.drop_duplicates(["user_name"])
[9]:
name job company street_address city state email user_name
0 Patricia Schaefer Programmer, systems Estrada-Best 398 Paul Drive Christianview Delaware lambdavid@gmail.com ndavidson
1 Olivie Dubois Ingénieur recherche et développement en agroal... Moreno rue Lucas Benard Saint Anastasie-les-Bains AR berthelotjacqueline@mahe.fr manonallain
2 Mary Davies-Kirk Public affairs consultant Baker Ltd Flat 3\nPugh mews Stanleyfurt ZA middletonconor@hotmail.com colemanmichael
3 Miroslawa Eckbauer Dispensing optician Ladeck GmbH Mijo-Lübs-Straße 12 Neubrandenburg Berlin sophia01@yahoo.de romanjunitz
4 Richard Bauer Accountant, chartered certified Hoffman-Rocha 6541 Rodriguez Wall Carlosmouth Texas tross@jensen-ware.org adam78
... ... ... ... ... ... ... ... ...
2074 Rhonda James Recruitment consultant Turner, Bradley and Scott 28382 Stokes Expressway Port Gabrielaport New Hampshire zroberts@hotmail.com heathscott
2076 Linda Alexander Commrcil horiculuri Webb, Ballald and Vasquel 5594 Persn Ciff Mooneybury Maryland ahleythoa@ail.co kennethrchn
2077 Diane Bailly Pharmacien Voisin 527, rue Dijoux Duval-les-Bains CH aruiz@reynaud.fr dorothee41
2078 Jorge Riba Cerdán Hotel manager Amador-Diego Rambla de Adriana Barceló 854 Puerta 3 Huesca Asturias manuelamosquera@yahoo.com eugenia17
2079 Ryan Thompson Brewing technologist Smith-Sullivan 136 Rodriguez Point Bradfordborough North Dakota lcruz@gmail.com cnewton

2029 rows × 8 columns

This deleted 51 records.
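
The subset and keep parameters control which copy survives. A toy example with made-up rows (not taken from the customer data):

```python
import pandas as pd

df = pd.DataFrame({
    "user_name": ["ndavidson", "ndavidson", "adam78"],
    "email": ["a@example.com", "b@example.com", "c@example.com"],
})

# keep="first" (the default) retains the first row per user_name
first = df.drop_duplicates(subset=["user_name"], keep="first")
print(first["email"].tolist())  # ['a@example.com', 'c@example.com']

# keep="last" retains the last row instead
last = df.drop_duplicates(subset=["user_name"], keep="last")
print(last["email"].tolist())  # ['b@example.com', 'c@example.com']
```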

3. dedupe#

Alternatively, we can detect the duplicated data with the Dedupe library, which uses a flat neural network to learn from a small amount of training data.

3.1 Configure Dedupe#

Now we define the fields to pay attention to during deduplication and create a new deduper object:

[10]:
import os

import dedupe


customers = pd.read_csv(
    "https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/customer_data_duped.csv",
    encoding="utf-8",
)
[11]:
variables = [
    {"field": "name", "type": "String"},
    {"field": "job", "type": "String"},
    {"field": "company", "type": "String"},
    {"field": "street_address", "type": "String"},
    {"field": "city", "type": "String"},
    {"field": "state", "type": "String", "has_missing": True},
    {"field": "email", "type": "String", "has_missing": True},
    {"field": "user_name", "type": "String"},
]

deduper = dedupe.Dedupe(variables)

If the value of a field is missing, it should be represented as a None object. With "has_missing": True, however, an additional field is created to indicate whether the value was present, and the missing value is given a null.
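
One way to make sure missing values really arrive as None objects is to convert NaN before handing the records to Dedupe (a sketch with made-up data, not part of the original notebook):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"state": ["Texas", np.nan], "email": [np.nan, "a@b.com"]})

# replace every NaN with None so Dedupe sees proper missing values
records = {
    idx: {col: (None if pd.isnull(val) else val) for col, val in row.items()}
    for idx, row in df.to_dict(orient="index").items()
}
print(records[0]["email"], records[1]["state"])  # None None
```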

[12]:
deduper
[12]:
<dedupe.api.Dedupe at 0x7fd414e1a3a0>
[13]:
customers.shape
[13]:
(2080, 8)

4. Create training data#

[14]:
deduper.prepare_training(customers.T.to_dict())

prepare_training initialises active learning with our data and, optionally, with existing training data.

T mirrors the DataFrame across its diagonal, writing rows as columns and vice versa; it is shorthand for pandas.DataFrame.transpose.
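
On a tiny frame you can see that .T.to_dict() yields the same {row_id: {column: value}} mapping as to_dict(orient="index") (a sketch with made-up data):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ada", "Bob"], "city": ["London", "Paris"]})

# transposing first makes the row labels the outer dictionary keys
assert df.T.to_dict() == df.to_dict(orient="index")
print(df.T.to_dict()[0])  # {'name': 'Ada', 'city': 'London'}
```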

5. Active learning#

Use dedupe.console_label to train your Dedupe instance. When Dedupe finds a record pair, it asks you whether the two records refer to the same entity. Label them with the y, n and u keys, and press f when you are finished.

[15]:
dedupe.console_label(deduper)
name : Frédérique Lejeune-Daniel
job : Technicien chimiste
company : Schmitt
street_address : chemin Denise Ferrand
city : Saint CharlotteVille
state : IE
email : jchretien@costa.com
user_name : joseph60

name : Frédérique Lejeune-Daniel
job : Tecce cse
company : Sctmitt
street_address : chemin Denise Ferrand
city : Saint ChalotteVille
state : IE
email : jchretien@costacom
user_name : joseph60

0/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
y
name : Jose Carlos Pérez Arias
job : Engineer, maintenance (IT)
company : Marquez PLC
street_address : Pasadizo Ángel Sureda 715 Piso 3
city : La Rioja
state : Córdoba
email : cifuentesraquel@peralta.com
user_name : gonzalo63

name : Jose Carlos Pérez Arias
job : Egieer, maiteace (IT)
company : Marquez PLC
street_address : Psdizo Ángel Sured 715 Piso
city : La Rioja
state : Córdob
email : ifuenteraque@perata.om
user_name : gonzalo6

1/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious
y
name : Julio Agustín Amaya
job : Tax adviser
company : Piñol, Belmonte and Codina
street_address : Callejón de Gregorio Bustamante 28 Piso 7
city : Las Palmas
state : Salamanca
email : usolana@jáuregui-pedraza.com
user_name : gloriaolmo

name : Julio Agustín Amaya
job : Tax aviser
company : Piñolk Belmonke and Codina
street_address : Calleón de Gregorio Bustamante 28 Piso 7
city : La Pala
state : Salamanca
email : usolana@jáuregui-pedraza.om
user_name : gloriaolmo

2/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious
y
name : Monique Marty
job : Maoqiie
company : Arnfud
street_address : 70, rue de Carre
city : CheallierBour
state : EC
email : frederiquerichard@cohen.com
user_name : marquesseastie

name : Monique Marty
job : Maroquinier
company : Arnaud
street_address : 70, rue de Carre
city : ChevallierBourg
state : EC
email : frederiquerichard@cohen.com
user_name : marquessebastien

3/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious
y
name : Susan Aubry
job : Direeur d'gee bire
company : Payet George2 S2A2S2
street_address : , rue Inè Valentn
city : Nicolas
state : FI
email : milletedith@sf.f
user_name : tthierry

name : Susan Aubry
job : Directeur d'agence bancaire
company : Payet Georges S.A.S.
street_address : 67, rue Inès Valentin
city : Nicolas
state : FI
email : milletedith@sfr.fr
user_name : tthierry

4/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious
f
Finished labeling

The last pair of training records compared makes it clear that our drop_duplicates example above did not remove this duplicate: marquesseastie and marquessebastien were treated as different values.
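
Why an exact comparison misses such a pair can be shown with a quick string-similarity check using the standard library's difflib (just an illustration, not part of the Dedupe workflow):

```python
from difflib import SequenceMatcher

# the two user names that drop_duplicates treated as distinct records
ratio = SequenceMatcher(None, "marquesseastie", "marquessebastien").ratio()
print(round(ratio, 2))  # 0.93 - very similar, but not an exact match
```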

Dedupe.train adds the record pairs you marked to the training data and updates the matching model.

With index_predicates=True, deduplication also takes into account predicates based on the indexing of the data.

When you are done, save the learned model with Dedupe.write_settings.

[16]:
settings_file = "csv_example_learned_settings"
if os.path.exists(settings_file):
    print("reading from", settings_file)
    with open(settings_file, "rb") as f:
        deduper = dedupe.StaticDedupe(f)
else:
    deduper.train(index_predicates=True)
    with open(settings_file, "wb") as sf:
        deduper.write_settings(sf)

dedupe.Dedupe.partition identifies records that all refer to the same entity and returns them as tuples containing a sequence of record IDs and a sequence of confidence values. For more details on the confidence score, see dedupe.Dedupe.cluster.

[17]:
dupes = deduper.partition(customers.T.to_dict())
[18]:
dupes
[18]:
[((84, 1600), (1.0, 1.0)),
 ((136, 1360), (1.0, 1.0)),
 ((670, 1170), (1.0, 1.0)),
 ((856, 1781), (1.0, 1.0)),
 ((902, 942), (1.0, 1.0)),
 ((1395, 1560), (1.0, 1.0)),
 ((1594, 1706), (1.0, 1.0)),
 ((0,), (1.0,)),
 ((1,), (1.0,)),
 ...]

We can also output only individual entries:

[19]:
dupes[1]
[19]:
((136, 1360), (1.0, 1.0))

We can then display these with pandas.DataFrame.iloc:

[20]:
customers.iloc[[136, 1360]]
[20]:
name job company street_address city state email user_name
136 Frédérique Lejeune-Daniel Technicien chimiste Schmitt chemin Denise Ferrand Saint CharlotteVille IE jchretien@costa.com joseph60
1360 Frédérique Lejeune-Daniel Tecce cse Sctmitt chemin Denise Ferrand Saint ChalotteVille IE jchretien@costacom joseph60
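
To attach the cluster membership back to the records, the partition output can be unrolled into a mapping from record ID to cluster ID and confidence (a sketch using two entries from the sample output shown above):

```python
# two entries from the partition output above
dupes = [((84, 1600), (1.0, 1.0)), ((136, 1360), (1.0, 1.0))]

membership = {}
for cluster_id, (record_ids, scores) in enumerate(dupes):
    for record_id, score in zip(record_ids, scores):
        membership[record_id] = (cluster_id, score)

print(membership[136])   # (1, 1.0)
print(membership[1360])  # (1, 1.0)
```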