
dc.contributor.advisor: Da Silva, Dilma
dc.contributor.advisor: Kum, Hye Chung
dc.creator: Ramezani Foukolayi, Mahin
dc.date.accessioned: 2021-05-19T14:06:39Z
dc.date.available: 2021-05-19T14:06:39Z
dc.date.created: 2021-05
dc.date.issued: 2021-04-28
dc.date.submitted: May 2021
dc.identifier.uri: https://hdl.handle.net/1969.1/193211
dc.description.abstract: Record linkage, often called entity resolution or de-duplication, refers to identifying records that represent the same entity across one or more databases. As the volume of generated data grows exponentially, it becomes increasingly important to integrate data from several sources to support richer analysis. In this paper, we present a comprehensive, open-source, end-to-end hybrid record linkage framework that combines automatic and manual review. Using this framework, we train several models based on different machine learning algorithms, namely random forests (RF), linear SVM, radial SVM, and dense neural networks (DNN), and compare the effectiveness and efficiency of these models for record linkage in different settings. We evaluate model performance based on recall, F1-score (quality of linkages), and the number of uncertain pairs, that is, the number of pairs that need manual review. We also apply the trained models to a new dataset to assess how well they transfer to a new setting. The RF, linear SVM, and radial SVM models transfer much better than the DNN. Finally, we study the effect of the name2vec (n2v) feature, an embedding of the letters in names, on model performance. Using n2v results in a smaller manual review set at the cost of a slightly lower F1-score. Overall, the SVM models performed best in all experiments. [en]
dc.format.mimetype: application/pdf
dc.language.iso: en
dc.subject: Record Linkage [en]
dc.subject: deduplication [en]
dc.subject: entity resolution [en]
dc.subject: machine learning [en]
dc.subject: Benchmarking [en]
dc.subject: patient matching [en]
dc.title: Comparison of Machine Learning Algorithms in a Human-Computer Hybrid Record Linkage System [en]
dc.type: Thesis [en]
thesis.degree.department: Computer Science and Engineering [en]
thesis.degree.discipline: Computer Science [en]
thesis.degree.grantor: Texas A&M University [en]
thesis.degree.name: Master of Science [en]
thesis.degree.level: Masters [en]
dc.contributor.committeeMember: Fossett, Mark
dc.type.material: text [en]
dc.date.updated: 2021-05-19T14:06:40Z
local.etdauthor.orcid: 0000-0002-2659-6151

