Benchmarking the Effectiveness and Efficiency of Machine Learning Algorithms for Record Linkage
dc.contributor.advisor | Kum, Hye-Chung | |
dc.contributor.advisor | da Silva, Dilma | |
dc.creator | Ilangovan, Gurudev | |
dc.date.accessioned | 2019-11-25T20:47:42Z | |
dc.date.available | 2021-08-01T07:32:19Z | |
dc.date.created | 2019-08 | |
dc.date.issued | 2019-06-19 | |
dc.date.submitted | August 2019 | |
dc.identifier.uri | https://hdl.handle.net/1969.1/186390 | |
dc.description.abstract | Record linkage, the identification of the same entities across several databases in the absence of a unique identifier, is a crucial step in data integration. In this research, we study the effectiveness and efficiency of different machine learning algorithms (SVM, random forest, and neural networks) for linking databases in a controlled experiment. We control for the percentage of heterogeneity in the data and the size of the training dataset. We evaluate the algorithms on (1) the quality of linkages, measured by metrics such as F1 score under a one-threshold model, and (2) the size of the uncertain region that needs manual review under a two-threshold model. We find that random forests performed very well on both traditional metrics such as F1 score (99.2% to 95.9%) and manual review set size (7.1% to 21%) for error rates from 0% to 60%. Although the algorithms (random forests, SVMs, and neural networks) fared similarly in terms of F1 score, random forests outperformed the next best model by 28% on average in terms of the percentage of pairs that need manual review. | en |
dc.format.mimetype | application/pdf | |
dc.language.iso | en | |
dc.subject | Record Linkage | en |
dc.subject | Machine Learning | en |
dc.title | Benchmarking the Effectiveness and Efficiency of Machine Learning Algorithms for Record Linkage | en |
dc.type | Thesis | en |
thesis.degree.department | Computer Science and Engineering | en |
thesis.degree.discipline | Computer Science | en |
thesis.degree.grantor | Texas A&M University | en |
thesis.degree.name | Master of Science | en |
thesis.degree.level | Masters | en |
dc.contributor.committeeMember | Fossett, Mark | |
dc.type.material | text | en |
dc.date.updated | 2019-11-25T20:47:43Z | |
local.embargo.terms | 2021-08-01 | |
local.etdauthor.orcid | 0000-0003-3973-1620 |
This item appears in the following Collection(s)
Electronic Theses, Dissertations, and Records of Study (2002– )
Texas A&M University Theses, Dissertations, and Records of Study (2002– )