PATTERN RECOGNITION FOR RESTRICTED AND NONSTATIONARY DATA
Abstract
The standard assumption in classification is that the training data are independent and identically distributed (i.i.d.). Indeed, this assumption is so pervasive that it is often applied without mention. In this dissertation, we propose novel methods that address two violations of this assumption: 1) restricted sampling and 2) a nonstationary environment. The first part of this dissertation concerns the bias of classification precision estimation under restricted sampling. Precision and recall have become very popular classification accuracy metrics in the statistical learning literature, under the standard i.i.d. sampling assumption. However, in many cases of interest, such as observational case-control studies for biomarker discovery in cancer research, the training data are sampled separately from the case and control populations. This violates the standard sampling assumption, under which the data are sampled randomly from the mixture of the populations. We present an analysis of the bias in the estimation of the precision of classifiers designed on separately sampled data. The analysis consists of both theoretical and numerical results, which show that classifier precision estimates can display strong bias under separate sampling, with the bias magnitude depending on the difference between the true case prevalence in the population and the sample prevalence in the data. We show that this bias is systematic in the sense that it cannot be reduced by increasing the sample size. When the true case prevalence is known, for example from public health records, we propose a modified precision estimator based on the known prevalence. This estimator displays smaller bias, which in fact vanishes as the sample size increases, under regularity conditions on the classification algorithm.
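The dependence of precision on prevalence can be illustrated with a minimal sketch. It assumes the standard Bayes-rule plug-in form, precision = c·TPR / (c·TPR + (1 − c)·FPR), where c is the case prevalence; the exact estimator proposed in the dissertation may differ. The function names, the confusion counts, and the 5% prevalence are hypothetical values chosen for illustration only.

```python
def naive_precision(tp, fp):
    """Usual precision estimate: TP / (TP + FP), computed directly from counts."""
    return tp / (tp + fp)


def precision_from_rates(tpr, fpr, prevalence):
    """Plug-in precision from TPR, FPR and an assumed case prevalence (Bayes' rule)."""
    numerator = prevalence * tpr
    return numerator / (numerator + (1.0 - prevalence) * fpr)


# Hypothetical separately sampled study: 100 cases and 100 controls,
# so the sample prevalence is 0.5 regardless of the population prevalence.
n_case, n_control = 100, 100
tp, fp = 80, 10  # cases called positive, controls called positive

tpr = tp / n_case      # estimated true positive rate (recall)
fpr = fp / n_control   # estimated false positive rate

naive = naive_precision(tp, fp)
# Assumed true population prevalence of 5% (illustrative only).
adjusted = precision_from_rates(tpr, fpr, prevalence=0.05)

print(f"naive precision:    {naive:.3f}")
print(f"adjusted precision: {adjusted:.3f}")
```

In this sketch the naive estimate coincides with the plug-in formula evaluated at the sample prevalence (0.5), which is why it does not change with sample size when the sampling ratio is fixed by design.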
The accuracy of the theoretical analysis and the performance of the precision estimators under separate sampling are confirmed by numerical experiments using synthetic data and real data from published observational case-control studies. The results with real data confirm that, with separately sampled data, the usual precision estimator produces larger, i.e. more optimistic, estimates than the estimator using the true prevalence value. The second part of this dissertation proposes a state-space model approach to classification of nonstationary data. In many applications, the data are collected at different time points. If the time between consecutive acquisition points is large enough, the distribution of the data is likely to shift due to natural physical processes, and the standard i.i.d. sampling assumption is violated. This is known in the statistical learning literature as the “population drift” problem. Most attempts to address nonstationarity are ad hoc and carry no guarantee of optimality. In this dissertation, we propose a state-space methodology, whereby the data are assumed to evolve linearly or nonlinearly under Gaussian observation noise, and apply adaptive filtering methods to estimate the distributional parameters, leading to nonstationary linear and quadratic discriminant analysis (NSLDA and NSQDA) classification rules. Parameter estimation in the linear state-space model is accomplished by a combination of Kalman smoothing and maximum-likelihood estimation by expectation maximization, while particle filtering methods are proposed for the nonlinear state-space model. We also address the case where the time labels of some data are unknown, a situation that often arises in practice, by proposing a hybrid Gaussian mixture modeling (GMM)-Kalman smoother approach.
The accuracy of the proposed nonstationary discriminant analysis rule, as well as its robustness against noise, missing data, and unbalanced training data, are demonstrated in numerical experiments, where we compare it to “naive” LDA, QDA, and nonlinear SVM classification rules.
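The idea of tracking drifting class parameters with a state-space model can be sketched in one dimension. This is a simplified illustration, not the NSLDA or NSQDA rule itself: it assumes a scalar random-walk state with known noise variances and uses only forward Kalman filtering, whereas the dissertation estimates parameters by Kalman smoothing combined with expectation maximization. All names, the synthetic drift, and the variance values are hypothetical.

```python
def kalman_filter_1d(observations, q, r, m0=0.0, p0=1.0):
    """Filter a scalar random walk x_t = x_{t-1} + w_t observed as y_t = x_t + v_t.

    q and r are the (assumed known) variances of w_t and v_t.
    An observation of None is treated as missing: only the prediction step runs.
    Returns the list of filtered state means.
    """
    m, p = m0, p0
    means = []
    for y in observations:
        p = p + q                      # predict: uncertainty grows by q
        if y is not None:
            k = p / (p + r)            # Kalman gain
            m = m + k * (y - m)        # update the mean with the innovation
            p = (1.0 - k) * p          # update the variance
        means.append(m)
    return means


# Synthetic, noise-free drift (for reproducibility): both class means rise over time.
T = 50
class0 = [0.1 * t for t in range(T)]
class1 = [2.0 + 0.1 * t for t in range(T)]

# "Naive" estimate: pool all time points as if they were i.i.d.
pooled_mid = (sum(class0) / T + sum(class1) / T) / 2.0

# Filtered estimate: use the most recent filtered mean of each class.
f0 = kalman_filter_1d(class0, q=0.1, r=1.0)[-1]
f1 = kalman_filter_1d(class1, q=0.1, r=1.0)[-1]
filtered_mid = (f0 + f1) / 2.0

# With equal variances and equal priors, 1-D LDA thresholds at the midpoint of the means.
true_mid = (class0[-1] + class1[-1]) / 2.0
print(f"true midpoint at final time: {true_mid:.2f}")
print(f"pooled midpoint:             {pooled_mid:.2f}")
print(f"filtered midpoint:           {filtered_mid:.2f}")
```

The pooled threshold lags far behind the current class means, while the filtered threshold follows the drift more closely, which is the motivation for replacing pooled estimates with state estimates in the discriminant.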
Subject
pattern recognition
separate sampling
precision and recall
nonstationarity
Kalman filter
adaptive filter
linear state space model
particle filter
Citation
Xie, Shuilian (2021). PATTERN RECOGNITION FOR RESTRICTED AND NONSTATIONARY DATA. Doctoral dissertation, Texas A&M University. Available electronically from https://hdl.handle.net/1969.1/195737.