Bayesian Logistic Regression with Jaro-Winkler String Comparator Scores Provides Sizable Improvement in Probabilistic Record Matching

Jann, Dominic 1983-

dc.contributor.advisor	Sheather, Simon J
dc.contributor.advisor	Speed, Michael
dc.creator	Jann, Dominic 1983-
dc.date.accessioned	2013-03-14T16:12:10Z
dc.date.available	2014-12-12T07:18:54Z
dc.date.created	2012-12
dc.date.issued	2012-08-17
dc.date.submitted	December 2012
dc.identifier.uri	https://hdl.handle.net/1969.1/148078
dc.description.abstract	Record matching is a fundamental and ubiquitous part of today's society. Anything from typing in a password in order to access your email to connecting existing health records in California with new health records in New York requires matching records together. In general, there are two types of record matching algorithms: deterministic, a more rules-based approach, and probabilistic, a model-based approach. Both types have their advantages and disadvantages. If the amount of data is relatively small, deterministic algorithms yield very high success rates. However, the number of common mistakes, and of the rules needed to handle them, becomes astronomically large as the sizes of the datasets increase. This makes updating and maintaining the matching algorithm a highly labor-intensive process. On the other hand, probabilistic record matching implements a mathematical model that can take into account keying mistakes, does not require as much maintenance and overhead, and provides a probability that two particular entities should be linked. At the same time, because it is a model, assumptions need to be met, fit has to be assessed, and predictions can be incorrect. Regardless of the type of algorithm, nearly all utilize a 0/1 field-matching structure, including the Fellegi-Sunter algorithm from 1969. That is to say, either the fields match entirely or they do not match at all. As a result, typographical errors can get lost and false negatives can result. My research shows that using Jaro-Winkler string comparator scores as predictors in a Bayesian logistic regression model, in lieu of a restrictive binary structure, yields marginal improvement over current methodologies.	en
dc.format.mimetype	application/pdf
dc.subject	Bayesian Methodology	en
dc.subject	Logistic Regression	en
dc.subject	Record matching	en
dc.title	Bayesian Logistic Regression with Jaro-Winkler String Comparator Scores Provides Sizable Improvement in Probabilistic Record Matching	en
dc.type	Thesis	en
thesis.degree.department	Statistics	en
thesis.degree.discipline	Statistics	en
thesis.degree.grantor	Texas A&M University	en
thesis.degree.name	Doctor of Philosophy	en
thesis.degree.level	Doctoral	en
dc.contributor.committeeMember	Hart, Jeffrey D
dc.contributor.committeeMember	Murguia, Edward
dc.contributor.committeeMember	Wang, Suojin
dc.contributor.committeeMember	Jung, Jin-Whan
dc.type.material	text	en
dc.date.updated	2013-03-14T16:12:11Z
local.embargo.terms	2014-12-01

Files in this item

Name: Jann, Dominic.pdf
Size: 1.441Mb
Format: PDF

This item appears in the following Collection(s)

Electronic Theses, Dissertations, and Records of Study (2002– )
Texas A&M University Theses, Dissertations, and Records of Study (2002– )
