Comparison of Machine Learning Algorithms in a Human-Computer Hybrid Record Linkage System
Abstract
Record linkage, often called entity resolution or de-duplication, refers to identifying records that refer to the same entity across one or more databases. As the amount of data that is generated grows at an exponential rate, it becomes increasingly important to be able to integrate data from several sources to perform richer analysis. In this paper, we present an open-source, comprehensive, end-to-end hybrid record linkage framework that combines automatic and manual review processes. Using this framework, we train several models based on different machine learning algorithms, such as random forests (RF), linear SVM, radial SVM, and dense neural networks (DNN), and compare the effectiveness and efficiency of these models for record linkage in different settings. We evaluate model performance based on recall, F1-score (quality of linkages), and the number of uncertain pairs, that is, the number of pairs that need manual review. We also apply our trained models to a new dataset to test how well they transfer to a new setting. The RF, linear SVM, and radial SVM models transfer much better than the DNN. Finally, we study the effect of the name2vec (n2v) feature, a letter embedding for names, on model performance. Using n2v results in a smaller manual review set with a slightly lower F1-score. Overall, the SVM models performed best in all experiments.
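The evaluation described in the abstract (recall, F1-score, and the number of uncertain pairs) can be illustrated with a minimal sketch. This is not the thesis's implementation: the two score thresholds, the function names `triage_pairs` and `evaluate`, and the choice to compute recall and F1 only over automatically decided pairs are all assumptions made for illustration.

```python
from typing import Dict, List, Optional


def triage_pairs(scores: List[float], lower: float = 0.3,
                 upper: float = 0.7) -> List[Optional[int]]:
    """Split candidate pairs by match score.

    Hypothetical rule (an assumption, not taken from the thesis):
    score >= upper -> 1 (match), score <= lower -> 0 (non-match),
    anything in between -> None (uncertain, sent to manual review).
    """
    labels: List[Optional[int]] = []
    for s in scores:
        if s >= upper:
            labels.append(1)
        elif s <= lower:
            labels.append(0)
        else:
            labels.append(None)
    return labels


def evaluate(labels: List[Optional[int]], truth: List[int]) -> Dict[str, float]:
    """Recall and F1 over automatically decided pairs, plus the uncertain count.

    Assumption: uncertain pairs are excluded from recall and F1.
    """
    tp = fp = fn = uncertain = 0
    for pred, actual in zip(labels, truth):
        if pred is None:
            uncertain += 1
        elif pred == 1 and actual == 1:
            tp += 1
        elif pred == 1 and actual == 0:
            fp += 1
        elif pred == 0 and actual == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"recall": recall, "f1": f1, "uncertain": uncertain}


# Made-up scores and ground truth, for illustration only.
example_scores = [0.95, 0.80, 0.50, 0.10, 0.75, 0.20, 0.45]
example_truth = [1, 1, 1, 0, 0, 1, 0]
example_result = evaluate(triage_pairs(example_scores), example_truth)
```

In this sketch, widening the gap between `lower` and `upper` sends more pairs to manual review and leaves fewer automatic decisions to score, which mirrors the trade-off between linkage quality and manual review set size that the abstract reports.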
Subject
Record Linkage
deduplication
entity resolution
machine learning
Benchmarking
patient matching
Citation
Ramezani Foukolayi, Mahin (2021). Comparison of Machine Learning Algorithms in a Human-Computer Hybrid Record Linkage System. Master's thesis, Texas A&M University. Available electronically from https://hdl.handle.net/1969.1/193211.