Show simple item record

dc.creator: Sivakumar, Hariharan
dc.date.accessioned: 2022-08-09T16:32:59Z
dc.date.available: 2022-08-09T16:32:59Z
dc.date.created: 2022-05
dc.date.submitted: May 2022
dc.identifier.uri: https://hdl.handle.net/1969.1/196521
dc.description.abstract: Record linkage identifies and links records that refer to the same entity across one or more databases when no unique identifier is available. As data volumes grow daily, machine learning has become an effective way to integrate heterogeneous data from multiple sources into more comprehensive datasets. Because it is challenging to build a high-quality labeled dataset for training good models, this research investigates which machine learning models work best under given conditions when models trained in one setting are applied to a new setting. In this paper, we compare the performance of three machine learning models (random forests, linear SVM, and radial SVM) from an open-source hybrid record linkage system that were trained in a different setting, using different heterogeneity rates (0% to 60%). The RL heterogeneity generator introduces name errors, date errors, missing data errors, and record-level heterogeneities into the data. The models were trained on a subset of hospital record data containing nearly 10,000 pairs. We test how robust these models are on a new voter registration dataset. Model performance was evaluated using F1 score, recall, and the percentage of pairs that needed manual review. The radial and linear SVM models transfer better to a new setting than the random forest model across all heterogeneity rates. The linear SVM model outperformed the radial SVM by 4% on average in terms of the percentage of pairs that needed manual review. However, the radial SVM achieved significantly better recall than the linear SVM (80% to 48%, compared with 59% to 29%) for heterogeneity rates from 0% to 60%. Overall, the radial SVM performed best in our experiments.
dc.format.mimetype: application/pdf
dc.subject: Record Linkage
dc.subject: Machine Learning
dc.subject: Benchmarking
dc.title: Benchmarking the Performance of Machine Learning Algorithms for Record Linkage at Different Heterogeneity Rates in a New Setting
dc.type: Thesis
thesis.degree.department: Computer Science & Engineering
thesis.degree.discipline: Computer Engineering, Computer Science Track
thesis.degree.grantor: Undergraduate Research Scholars Program
thesis.degree.name: B.S.
thesis.degree.level: Undergraduate
dc.contributor.committeeMember: Kum, Hye-Chung
dc.type.material: text
dc.date.updated: 2022-08-09T16:32:59Z
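
The abstract in this record reports three evaluation measures: F1 score, recall, and the percentage of pairs that needed manual review. The record does not say how the thesis computes them, so the following is only a minimal sketch under stated assumptions. It assumes each candidate pair has a match score in [0, 1], that pairs scoring between two thresholds are sent to manual review, and that a true match left for review counts as missed by the automatic step. The names `evaluate`, `LinkageMetrics`, `low`, and `high`, and the threshold values, are hypothetical and do not come from the thesis or its record linkage system.

```python
from dataclasses import dataclass


@dataclass
class LinkageMetrics:
    recall: float       # share of true matches linked automatically
    f1: float           # harmonic mean of precision and recall
    review_pct: float   # percentage of pairs left for manual review


def evaluate(scores, labels, low=0.3, high=0.7):
    """Compute metrics for a two-threshold decision rule (an assumption).

    score >= high        -> linked automatically
    low <= score < high  -> sent to manual review
    score < low          -> rejected automatically
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not scores:
        raise ValueError("at least one pair is required")

    tp = fp = fn = review = 0
    for score, is_match in zip(scores, labels):
        auto_link = score >= high
        if low <= score < high:
            review += 1
        if auto_link and is_match:
            tp += 1
        elif auto_link and not is_match:
            fp += 1
        elif is_match:
            # True match that was not linked automatically
            # (either rejected or left for review).
            fn += 1

    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    review_pct = 100.0 * review / len(scores)
    return LinkageMetrics(recall=recall, f1=f1, review_pct=review_pct)


# Toy data for illustration only; not results from the thesis.
scores = [0.95, 0.80, 0.55, 0.40, 0.20, 0.10, 0.75, 0.05]
labels = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = evaluate(scores, labels)
print(metrics)
```

Under this rule, widening the band between `low` and `high` sends more pairs to review, which is one way the trade-off between recall and manual review effort described in the abstract could arise.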

