
dc.contributor.advisor: Braga-Neto, Ulisses M
dc.creator: Xie, Shuilian
dc.date.accessioned: 2022-02-23T18:09:41Z
dc.date.available: 2023-05-01T06:36:42Z
dc.date.created: 2021-05
dc.date.issued: 2021-05-18
dc.date.submitted: May 2021
dc.identifier.uri: https://hdl.handle.net/1969.1/195737
dc.description.abstract: The standard assumption in classification is that the training data are independent and identically distributed. Indeed, this assumption is so pervasive that it is often applied without mention. In this dissertation, we propose novel methods that address violations of this standard assumption corresponding to 1) restricted sampling and 2) a nonstationary environment.

The first part of this dissertation concerns the bias of classification precision estimation under restricted sampling. Precision and recall have become very popular classification accuracy metrics in the statistical learning literature, under the standard i.i.d. sampling assumption. However, in many cases of interest, such as observational case-control studies for biomarker discovery in cancer research, the training data are sampled separately from the case and control populations, violating the standard sampling assumption, under which the data are sampled randomly from the mixture of the populations. We present an analysis of the bias in the estimation of the precision of classifiers designed on separately sampled data. The analysis consists of both theoretical and numerical results, which show that classifier precision estimates can display strong bias under separate sampling, with the bias magnitude depending on the difference between the true case prevalence in the population and the sample prevalence in the data. We show that this bias is systematic, in the sense that it cannot be reduced by increasing the sample size. If information about the true case prevalence is available from public health records, we propose the use of a modified precision estimator based on the known prevalence; this estimator displays smaller bias, which can in fact be reduced to zero as the sample size increases, under regularity conditions on the classification algorithm (a brief illustrative sketch of this prevalence correction follows the record below). The accuracy of the theoretical analysis and the performance of the precision estimators under separate sampling are confirmed by numerical experiments using synthetic and real data from published observational case-control studies. The results with real data confirm that, on separately sampled data, the usual precision estimator produces larger, i.e., more optimistic, estimates than the estimator using the true prevalence value.

The second part of this dissertation proposes a state-space model approach to classification of nonstationary data. In many applications, the data are collected at different time points. If the time between consecutive acquisition points is large enough, the distribution of the data is likely to shift due to natural physical processes, and the standard i.i.d. sampling assumption is violated. This is known in the statistical learning literature as the “population drift” problem. Most attempts to address nonstationarity are ad hoc and carry no guarantee of optimality. In this dissertation, we propose a state-space methodology whereby the data are assumed to evolve linearly or nonlinearly under Gaussian observation noise, and we apply adaptive filtering methods to estimate the distributional parameters, leading to nonstationary linear and quadratic discriminant analysis (NSLDA and NSQDA) classification rules (a brief sketch of this idea also follows the record below). Parameter estimation in the linear state-space model is accomplished by a combination of Kalman smoothing and maximum-likelihood estimation via expectation maximization, while particle filtering methods are proposed for the nonlinear state-space model. We also address the case where the time labels of some data are unknown, a situation that often arises in practice, by proposing a hybrid Gaussian mixture model (GMM)-Kalman smoother approach. The accuracy of the proposed nonstationary discriminant analysis rules, as well as their robustness against noise, missing data, and unbalanced training data, is demonstrated in numerical experiments in which we compare them to “naive” LDA, QDA, and nonlinear SVM classification rules.
dc.format.mimetype: application/pdf
dc.language.iso: en
dc.subject: pattern recognition
dc.subject: separate sampling
dc.subject: precision and recall
dc.subject: nonstationarity
dc.subject: Kalman filter
dc.subject: adaptive filter
dc.subject: linear state space model
dc.subject: particle filter
dc.title: PATTERN RECOGNITION FOR RESTRICTED AND NONSTATIONARY DATA
dc.type: Thesis
thesis.degree.department: Electrical and Computer Engineering
thesis.degree.discipline: Electrical Engineering
thesis.degree.grantor: Texas A&M University
thesis.degree.name: Doctor of Philosophy
thesis.degree.level: Doctoral
dc.contributor.committeeMember: Narayanan, Krishna R
dc.contributor.committeeMember: Dabney, Alan
dc.contributor.committeeMember: Dougherty, Edward R
dc.type.material: text
dc.date.updated: 2022-02-23T18:09:41Z
local.embargo.terms: 2023-05-01
local.etdauthor.orcid: 0000-0002-6991-0315
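
The abstract above notes that the usual precision estimator is biased under separate sampling and that a modified estimator based on the known case prevalence has smaller bias. The following is a minimal illustrative sketch of that general idea in Python; it is not the dissertation's exact estimator, and all numbers are hypothetical. It assembles precision (positive predictive value) from sensitivity, specificity, and an externally known prevalence, since sensitivity and specificity do not depend on the case fraction in the sample.

    def corrected_precision(sensitivity, specificity, prevalence):
        """Precision (positive predictive value) re-weighted by a known prevalence.

        Under separate sampling, the case fraction in the data need not match the
        true population prevalence, so the naive estimate TP / (TP + FP) can be
        strongly biased; sensitivity and specificity do not depend on that fraction.
        """
        true_pos = prevalence * sensitivity
        false_pos = (1.0 - prevalence) * (1.0 - specificity)
        return true_pos / (true_pos + false_pos)

    # Hypothetical numbers: a classifier with 90% sensitivity and 95% specificity
    # looks precise on a balanced case-control sample, but far less so at a
    # rare-disease prevalence of 1%.
    print(round(corrected_precision(0.90, 0.95, prevalence=0.50), 3))  # 0.947
    print(round(corrected_precision(0.90, 0.95, prevalence=0.01), 3))  # 0.154

The gap between the two printed values illustrates why a precision estimate computed at the sample's case fraction can be far more optimistic than the population-level precision.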
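
The abstract also describes tracking drifting distributional parameters with a state-space model and Kalman smoothing, and plugging the tracked parameters into discriminant-analysis rules (NSLDA/NSQDA). Below is a minimal sketch of that general idea, not the dissertation's method: a scalar class mean is assumed to follow a Gaussian random walk with assumed known noise variances, is tracked by a Kalman filter and a Rauch-Tung-Striebel smoother, and the smoothed means feed an equal-prior, equal-variance LDA decision. All parameter values and the synthetic data are hypothetical.

    import numpy as np

    def kalman_smooth_means(y, q=0.05, r=0.5):
        """RTS smoother for a scalar random-walk state observed in Gaussian noise.

        y[t] is the observed sample mean of one class at time t; q and r are
        assumed process and observation noise variances.
        """
        n = len(y)
        m_f = np.zeros(n)                      # filtered means
        p_f = np.zeros(n)                      # filtered variances
        m, p = y[0], 1.0                       # rough initialization
        for t in range(n):
            p_pred = p + q                     # predict step (random walk)
            k = p_pred / (p_pred + r)          # Kalman gain
            m = m + k * (y[t] - m)             # update with observation y[t]
            p = (1.0 - k) * p_pred
            m_f[t], p_f[t] = m, p
        m_s = m_f.copy()                       # backward (smoothing) pass
        for t in range(n - 2, -1, -1):
            g = p_f[t] / (p_f[t] + q)
            m_s[t] = m_f[t] + g * (m_s[t + 1] - m_f[t])
        return m_s

    def nslda_predict(x, mu0, mu1):
        """Equal-prior, equal-variance LDA decision using smoothed class means."""
        return int(abs(x - mu1) < abs(x - mu0))

    # Tiny synthetic example: two class means drifting in opposite directions.
    rng = np.random.default_rng(0)
    t = np.arange(20)
    mu0_obs = 0.0 + 0.05 * t + rng.normal(0.0, 0.3, 20)   # class-0 sample means
    mu1_obs = 2.0 - 0.05 * t + rng.normal(0.0, 0.3, 20)   # class-1 sample means
    mu0_s = kalman_smooth_means(mu0_obs)
    mu1_s = kalman_smooth_means(mu1_obs)
    print(nslda_predict(1.8, mu0_s[-1], mu1_s[-1]))        # classify a point at the latest time

Using the smoothed means at the most recent time point, rather than means pooled over all time points, is what lets the rule track the drift; per the abstract, the dissertation's actual approach additionally estimates the state-space parameters themselves, via expectation maximization for the linear model and particle filtering for the nonlinear model, rather than assuming them known.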

