Comparison of Machine Learning Algorithms in a Human-Computer Hybrid Record Linkage System

Ramezani Foukolayi, Mahin

dc.contributor.advisor	Da Silva, Dilma
dc.contributor.advisor	Kum, Hye Chung
dc.creator	Ramezani Foukolayi, Mahin
dc.date.accessioned	2021-05-19T14:06:39Z
dc.date.available	2021-05-19T14:06:39Z
dc.date.created	2021-05
dc.date.issued	2021-04-28
dc.date.submitted	May 2021
dc.identifier.uri	https://hdl.handle.net/1969.1/193211
dc.description.abstract	Record linkage, often called entity resolution or de-duplication, refers to identifying records that refer to the same entity across one or more databases. As the amount of data generated grows at an exponential rate, it becomes increasingly important to integrate data from several sources to perform richer analysis. In this paper, we present an open-source, comprehensive, end-to-end hybrid record linkage framework that combines automatic linkage with manual review. Using this framework, we train several models based on different machine learning algorithms, such as random forests (RF), linear SVM, radial SVM, and dense neural networks (DNN), and compare the effectiveness and efficiency of these models for record linkage in different settings. We evaluate model performance based on recall, F1-score (quality of linkages), and the number of uncertain pairs, that is, the pairs that need manual review. We also apply the trained models to a new dataset to assess how well they transfer to a new setting. The RF, linear SVM, and radial SVM models transfer much better than the DNN. Finally, we study the effect of the name2vec (n2v) feature, a letter embedding of names, on model performance. Using n2v results in a smaller manual review set with a slightly lower F1-score. Overall, the SVM models performed best in all experiments.	en
dc.format.mimetype	application/pdf
dc.language.iso	en
dc.subject	Record Linkage	en
dc.subject	deduplication	en
dc.subject	entity resolution	en
dc.subject	machine learning	en
dc.subject	Benchmarking	en
dc.subject	patient matching	en
dc.title	Comparison of Machine Learning Algorithms in a Human-Computer Hybrid Record Linkage System	en
dc.type	Thesis	en
thesis.degree.department	Computer Science and Engineering	en
thesis.degree.discipline	Computer Science	en
thesis.degree.grantor	Texas A&M University	en
thesis.degree.name	Master of Science	en
thesis.degree.level	Masters	en
dc.contributor.committeeMember	Fossett, Mark
dc.type.material	text	en
dc.date.updated	2021-05-19T14:06:40Z
local.etdauthor.orcid	0000-0002-2659-6151

Files in this item

Name: RAMEZANIFOUKOLAYI-THESIS-2021.pdf
Size: 1.446Mb
Format: PDF

This item appears in the following Collection(s)

Electronic Theses, Dissertations, and Records of Study (2002– )
Texas A&M University Theses, Dissertations, and Records of Study (2002– )
