Deduplicating data#

In this notebook, we deduplicate data using the Dedupe library, which uses a flat neural network to learn from a small amount of labelled training data.

See also

  • csvdedupe offers a command line interface for Dedupe.

In addition, the same developers have created parserator, which you can use to extract text features and train your own text extraction.

1. Load sample data#

[1]:
import pandas as pd
[2]:
customers = pd.read_csv(
    "https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/customer_data_duped.csv",
    encoding="utf-8",
)

2. Deduplicate with pandas#

2.1 Show data#

[3]:
customers
[3]:
name job company street_address city state email user_name
0 Patricia Schaefer Programmer, systems Estrada-Best 398 Paul Drive Christianview Delaware lambdavid@gmail.com ndavidson
1 Olivie Dubois Ingénieur recherche et développement en agroal... Moreno rue Lucas Benard Saint Anastasie-les-Bains AR berthelotjacqueline@mahe.fr manonallain
2 Mary Davies-Kirk Public affairs consultant Baker Ltd Flat 3\nPugh mews Stanleyfurt ZA middletonconor@hotmail.com colemanmichael
3 Miroslawa Eckbauer Dispensing optician Ladeck GmbH Mijo-Lübs-Straße 12 Neubrandenburg Berlin sophia01@yahoo.de romanjunitz
4 Richard Bauer Accountant, chartered certified Hoffman-Rocha 6541 Rodriguez Wall Carlosmouth Texas tross@jensen-ware.org adam78
... ... ... ... ... ... ... ... ...
2075 Maurice Stey Systems developer Linke Margraf GmbH & Co. OHG Laila-Scheibe-Allee 2/0 Luckenwalde Hamburg gutknechtevelyn@niemeier.com dkreusel
2076 Linda Alexander Commrcil horiculuri Webb, Ballald and Vasquel 5594 Persn Ciff Mooneybury Maryland ahleythoa@ail.co kennethrchn
2077 Diane Bailly Pharmacien Voisin 527, rue Dijoux Duval-les-Bains CH aruiz@reynaud.fr dorothee41
2078 Jorge Riba Cerdán Hotel manager Amador-Diego Rambla de Adriana Barceló 854 Puerta 3 Huesca Asturias manuelamosquera@yahoo.com eugenia17
2079 Ryan Thompson Brewing technologist Smith-Sullivan 136 Rodriguez Point Bradfordborough North Dakota lcruz@gmail.com cnewton

2080 rows × 8 columns

2.2 Show data types#

For this we use pandas.DataFrame.dtypes:

[4]:
customers.dtypes
[4]:
name              object
job               object
company           object
street_address    object
city              object
state             object
email             object
user_name         object
dtype: object

2.3 Determine missing values#

pandas.isnull shows, for an array-like object, whether values are missing:

  • NaN in numeric arrays

  • None or NaN in object arrays

  • NaT in datetimelike
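
The three cases can be illustrated on small arrays (a minimal sketch, not part of the original notebook):

```python
import numpy as np
import pandas as pd

# NaN in a numeric array
print(pd.isnull(np.array([1.0, np.nan])).tolist())  # [False, True]

# None or NaN in an object array
print(pd.isnull(pd.Series(["a", None, np.nan], dtype=object)).tolist())  # [False, True, True]

# NaT in a datetimelike array
print(pd.isnull(pd.to_datetime(["2021-01-01", None])).tolist())  # [False, True]
```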

[5]:
for col in customers.columns:
    print(col, customers[col].isnull().sum())
name 0
job 0
company 0
street_address 0
city 0
state 0
email 0
user_name 0

2.4 Determine duplicate records#

[6]:
customers.duplicated()
[6]:
0       False
1       False
2       False
3       False
4       False
        ...
2075    False
2076    False
2077    False
2078    False
2079    False
Length: 2080, dtype: bool

customers.duplicated() alone does not tell us whether any records are duplicated. In the following, we output all records for which True is returned:

[7]:
customers[customers.duplicated()]
[7]:
name job company street_address city state email user_name

Apparently there are no duplicated records.

2.5 Delete duplicated data#

Deleting duplicated records with drop_duplicates should therefore not change anything and leave the number of records at 2080:

[8]:
customers.drop_duplicates()
[8]:
name job company street_address city state email user_name
0 Patricia Schaefer Programmer, systems Estrada-Best 398 Paul Drive Christianview Delaware lambdavid@gmail.com ndavidson
1 Olivie Dubois Ingénieur recherche et développement en agroal... Moreno rue Lucas Benard Saint Anastasie-les-Bains AR berthelotjacqueline@mahe.fr manonallain
2 Mary Davies-Kirk Public affairs consultant Baker Ltd Flat 3\nPugh mews Stanleyfurt ZA middletonconor@hotmail.com colemanmichael
3 Miroslawa Eckbauer Dispensing optician Ladeck GmbH Mijo-Lübs-Straße 12 Neubrandenburg Berlin sophia01@yahoo.de romanjunitz
4 Richard Bauer Accountant, chartered certified Hoffman-Rocha 6541 Rodriguez Wall Carlosmouth Texas tross@jensen-ware.org adam78
... ... ... ... ... ... ... ... ...
2075 Maurice Stey Systems developer Linke Margraf GmbH & Co. OHG Laila-Scheibe-Allee 2/0 Luckenwalde Hamburg gutknechtevelyn@niemeier.com dkreusel
2076 Linda Alexander Commrcil horiculuri Webb, Ballald and Vasquel 5594 Persn Ciff Mooneybury Maryland ahleythoa@ail.co kennethrchn
2077 Diane Bailly Pharmacien Voisin 527, rue Dijoux Duval-les-Bains CH aruiz@reynaud.fr dorothee41
2078 Jorge Riba Cerdán Hotel manager Amador-Diego Rambla de Adriana Barceló 854 Puerta 3 Huesca Asturias manuelamosquera@yahoo.com eugenia17
2079 Ryan Thompson Brewing technologist Smith-Sullivan 136 Rodriguez Point Bradfordborough North Dakota lcruz@gmail.com cnewton

2080 rows × 8 columns

Now we want to delete only those records whose user_name is identical:

[9]:
customers.drop_duplicates(["user_name"])
[9]:
name job company street_address city state email user_name
0 Patricia Schaefer Programmer, systems Estrada-Best 398 Paul Drive Christianview Delaware lambdavid@gmail.com ndavidson
1 Olivie Dubois Ingénieur recherche et développement en agroal... Moreno rue Lucas Benard Saint Anastasie-les-Bains AR berthelotjacqueline@mahe.fr manonallain
2 Mary Davies-Kirk Public affairs consultant Baker Ltd Flat 3\nPugh mews Stanleyfurt ZA middletonconor@hotmail.com colemanmichael
3 Miroslawa Eckbauer Dispensing optician Ladeck GmbH Mijo-Lübs-Straße 12 Neubrandenburg Berlin sophia01@yahoo.de romanjunitz
4 Richard Bauer Accountant, chartered certified Hoffman-Rocha 6541 Rodriguez Wall Carlosmouth Texas tross@jensen-ware.org adam78
... ... ... ... ... ... ... ... ...
2074 Rhonda James Recruitment consultant Turner, Bradley and Scott 28382 Stokes Expressway Port Gabrielaport New Hampshire zroberts@hotmail.com heathscott
2076 Linda Alexander Commrcil horiculuri Webb, Ballald and Vasquel 5594 Persn Ciff Mooneybury Maryland ahleythoa@ail.co kennethrchn
2077 Diane Bailly Pharmacien Voisin 527, rue Dijoux Duval-les-Bains CH aruiz@reynaud.fr dorothee41
2078 Jorge Riba Cerdán Hotel manager Amador-Diego Rambla de Adriana Barceló 854 Puerta 3 Huesca Asturias manuelamosquera@yahoo.com eugenia17
2079 Ryan Thompson Brewing technologist Smith-Sullivan 136 Rodriguez Point Bradfordborough North Dakota lcruz@gmail.com cnewton

2029 rows × 8 columns

This deleted 51 records.
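
The subset and keep parameters control which copy survives. A toy example with made-up rows (not taken from the customer data):

```python
import pandas as pd

df = pd.DataFrame({
    "user_name": ["ndavidson", "ndavidson", "adam78"],
    "email": ["a@example.com", "b@example.com", "c@example.com"],
})

# keep="first" (the default) retains the first row per user_name
first = df.drop_duplicates(subset=["user_name"], keep="first")
print(first["email"].tolist())  # ['a@example.com', 'c@example.com']

# keep="last" retains the last row instead
last = df.drop_duplicates(subset=["user_name"], keep="last")
print(last["email"].tolist())  # ['b@example.com', 'c@example.com']
```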

3. dedupe#

Alternatively, we can detect the duplicated data with the Dedupe library, which uses a flat neural network to learn from a small amount of training data.

3.1 Configure Dedupe#

Now we define the fields to pay attention to during deduplication and create a new deduper object:

[10]:
import os

import dedupe


customers = pd.read_csv(
    "https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/customer_data_duped.csv",
    encoding="utf-8",
)
[11]:
variables = [
    {"field": "name", "type": "String"},
    {"field": "job", "type": "String"},
    {"field": "company", "type": "String"},
    {"field": "street_address", "type": "String"},
    {"field": "city", "type": "String"},
    {"field": "state", "type": "String", "has_missing": True},
    {"field": "email", "type": "String", "has_missing": True},
    {"field": "user_name", "type": "String"},
]

deduper = dedupe.Dedupe(variables)

If the value of a field is missing, it should be represented as a None object. With "has_missing": True, however, an additional field is created to indicate whether the value was present, and the missing value is given a null.
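
One way to make sure missing values really arrive as None objects is to convert NaN before handing the records to Dedupe (a sketch with made-up data, not part of the original notebook):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"state": ["Texas", np.nan], "email": [np.nan, "a@b.com"]})

# replace every NaN with None so Dedupe sees proper missing values
records = {
    idx: {col: (None if pd.isnull(val) else val) for col, val in row.items()}
    for idx, row in df.to_dict(orient="index").items()
}
print(records[0]["email"], records[1]["state"])  # None None
```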

[12]:
deduper
[12]:
<dedupe.api.Dedupe at 0x7fd414e1a3a0>
[13]:
customers.shape
[13]:
(2080, 8)

4. Create training data#

[14]:
deduper.prepare_training(customers.T.to_dict())

prepare_training initialises active learning with our data and, optionally, with existing training data.

T mirrors the DataFrame across its diagonal, writing rows as columns and vice versa; it is shorthand for pandas.DataFrame.transpose.
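
On a tiny frame you can see that .T.to_dict() yields the same {row_id: {column: value}} mapping as to_dict(orient="index") (a sketch with made-up data):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ada", "Bob"], "city": ["London", "Paris"]})

# transposing first makes the row labels the outer dictionary keys
assert df.T.to_dict() == df.to_dict(orient="index")
print(df.T.to_dict()[0])  # {'name': 'Ada', 'city': 'London'}
```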

5. Active learning#

Use dedupe.console_label to train your Dedupe instance. When Dedupe finds a record pair, it asks you whether the two records refer to the same entity. Label them with the y, n and u keys, and press f when you are finished.

[15]:
dedupe.console_label(deduper)
name : Frédérique Lejeune-Daniel
job : Technicien chimiste
company : Schmitt
street_address : chemin Denise Ferrand
city : Saint CharlotteVille
state : IE
email : jchretien@costa.com
user_name : joseph60

name : Frédérique Lejeune-Daniel
job : Tecce cse
company : Sctmitt
street_address : chemin Denise Ferrand
city : Saint ChalotteVille
state : IE
email : jchretien@costacom
user_name : joseph60

0/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
y
name : Jose Carlos Pérez Arias
job : Engineer, maintenance (IT)
company : Marquez PLC
street_address : Pasadizo Ángel Sureda 715 Piso 3
city : La Rioja
state : Córdoba
email : cifuentesraquel@peralta.com
user_name : gonzalo63

name : Jose Carlos Pérez Arias
job : Egieer, maiteace (IT)
company : Marquez PLC
street_address : Psdizo Ángel Sured 715 Piso
city : La Rioja
state : Córdob
email : ifuenteraque@perata.om
user_name : gonzalo6

1/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious
y
name : Julio Agustín Amaya
job : Tax adviser
company : Piñol, Belmonte and Codina
street_address : Callejón de Gregorio Bustamante 28 Piso 7
city : Las Palmas
state : Salamanca
email : usolana@jáuregui-pedraza.com
user_name : gloriaolmo

name : Julio Agustín Amaya
job : Tax aviser
company : Piñolk Belmonke and Codina
street_address : Calleón de Gregorio Bustamante 28 Piso 7
city : La Pala
state : Salamanca
email : usolana@jáuregui-pedraza.om
user_name : gloriaolmo

2/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious
y
name : Monique Marty
job : Maoqiie
company : Arnfud
street_address : 70, rue de Carre
city : CheallierBour
state : EC
email : frederiquerichard@cohen.com
user_name : marquesseastie

name : Monique Marty
job : Maroquinier
company : Arnaud
street_address : 70, rue de Carre
city : ChevallierBourg
state : EC
email : frederiquerichard@cohen.com
user_name : marquessebastien

3/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious
y
name : Susan Aubry
job : Direeur d'gee bire
company : Payet George2 S2A2S2
street_address : , rue Inè Valentn
city : Nicolas
state : FI
email : milletedith@sf.f
user_name : tthierry

name : Susan Aubry
job : Directeur d'agence bancaire
company : Payet Georges S.A.S.
street_address : 67, rue Inès Valentin
city : Nicolas
state : FI
email : milletedith@sfr.fr
user_name : tthierry

4/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious
f
Finished labeling

The last pair of training records compared makes it clear that our drop_duplicates example above did not remove this duplicate: marquesseastie and marquessebastien were treated as different values.
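
Why an exact comparison misses such a pair can be shown with a quick string-similarity check using the standard library's difflib (just an illustration, not part of the Dedupe workflow):

```python
from difflib import SequenceMatcher

# the two user names that drop_duplicates treated as distinct records
ratio = SequenceMatcher(None, "marquesseastie", "marquessebastien").ratio()
print(round(ratio, 2))  # 0.93 - very similar, but not an exact match
```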

Dedupe.train adds the record pairs you marked to the training data and updates the matching model.

With index_predicates=True, deduplication also takes into account predicates based on the indexing of the data.

When you are done, save the learned model with Dedupe.write_settings.

[16]:
settings_file = "csv_example_learned_settings"
if os.path.exists(settings_file):
    print("reading from", settings_file)
    with open(settings_file, "rb") as f:
        deduper = dedupe.StaticDedupe(f)
else:
    deduper.train(index_predicates=True)
    with open(settings_file, "wb") as sf:
        deduper.write_settings(sf)

dedupe.Dedupe.partition identifies records that all refer to the same entity and returns them as tuples containing a sequence of record IDs and a sequence of confidence values. For more details on the confidence score, see dedupe.Dedupe.cluster.

[17]:
dupes = deduper.partition(customers.T.to_dict())
[18]:
dupes
[18]:
[((84, 1600), (1.0, 1.0)),
 ((136, 1360), (1.0, 1.0)),
 ((670, 1170), (1.0, 1.0)),
 ((856, 1781), (1.0, 1.0)),
 ((902, 942), (1.0, 1.0)),
 ((1395, 1560), (1.0, 1.0)),
 ((1594, 1706), (1.0, 1.0)),
 ((0,), (1.0,)),
 ((1,), (1.0,)),
 ...]

We can also output only individual entries:

[19]:
dupes[1]
[19]:
((136, 1360), (1.0, 1.0))

We can then display these with pandas.DataFrame.iloc:

[20]:
customers.iloc[[136, 1360]]
[20]:
name job company street_address city state email user_name
136 Frédérique Lejeune-Daniel Technicien chimiste Schmitt chemin Denise Ferrand Saint CharlotteVille IE jchretien@costa.com joseph60
1360 Frédérique Lejeune-Daniel Tecce cse Sctmitt chemin Denise Ferrand Saint ChalotteVille IE jchretien@costacom joseph60
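
To attach the cluster membership back to the records, the partition output can be unrolled into a mapping from record ID to cluster ID and confidence (a sketch using two entries from the sample output shown above):

```python
# two entries from the partition output above
dupes = [((84, 1600), (1.0, 1.0)), ((136, 1360), (1.0, 1.0))]

membership = {}
for cluster_id, (record_ids, scores) in enumerate(dupes):
    for record_id, score in zip(record_ids, scores):
        membership[record_id] = (cluster_id, score)

print(membership[136])   # (1, 1.0)
print(membership[1360])  # (1, 1.0)
```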