Comparison of Machine Learning Algorithms in a Human-Computer Hybrid Record Linkage System
dc.contributor.advisor | Da Silva, Dilma | |
dc.contributor.advisor | Kum, Hye Chung | |
dc.creator | Ramezani Foukolayi, Mahin | |
dc.date.accessioned | 2021-05-19T14:06:39Z | |
dc.date.available | 2021-05-19T14:06:39Z | |
dc.date.created | 2021-05 | |
dc.date.issued | 2021-04-28 | |
dc.date.submitted | May 2021 | |
dc.identifier.uri | https://hdl.handle.net/1969.1/193211 | |
dc.description.abstract | Record linkage, often called entity resolution or de-duplication, refers to identifying records that represent the same entity within or across databases. As the amount of data generated grows at an exponential rate, it becomes increasingly important to integrate data from several sources to perform richer analysis. In this paper, we present an open-source, comprehensive, end-to-end hybrid record linkage framework that combines automatic linkage with a manual review process. Using this framework, we train several models based on different machine learning algorithms, namely random forests (RF), linear SVM, radial SVM, and dense neural networks (DNN), and compare the effectiveness and efficiency of these models for record linkage in different settings. We evaluate model performance based on recall, F1-score (quality of linkages), and the number of uncertain pairs, which is the number of pairs that need manual review. We also apply our trained models to a new dataset to assess how well they transfer to a new setting. The RF, linear SVM, and radial SVM models transfer much better than the DNN. Finally, we study the effect of the name2vec (n2v) feature, a letter embedding of names, on model performance. Using n2v results in a smaller manual review set with a slightly lower F1-score. Overall, the SVM models performed best in all experiments. | en |
dc.format.mimetype | application/pdf | |
dc.language.iso | en | |
dc.subject | Record Linkage | en |
dc.subject | deduplication | en |
dc.subject | entity resolution | en |
dc.subject | machine learning | en |
dc.subject | Benchmarking | en |
dc.subject | patient matching | en |
dc.title | Comparison of Machine Learning Algorithms in a Human-Computer Hybrid Record Linkage System | en |
dc.type | Thesis | en |
thesis.degree.department | Computer Science and Engineering | en |
thesis.degree.discipline | Computer Science | en |
thesis.degree.grantor | Texas A&M University | en |
thesis.degree.name | Master of Science | en |
thesis.degree.level | Masters | en |
dc.contributor.committeeMember | Fossett, Mark | |
dc.type.material | text | en |
dc.date.updated | 2021-05-19T14:06:40Z | |
local.etdauthor.orcid | 0000-0002-2659-6151 | |
This item appears in the following Collection(s)
Electronic Theses, Dissertations, and Records of Study (2002– )
Texas A&M University Theses, Dissertations, and Records of Study (2002– )