dc.description.abstract | We propose novel methods to tackle two problems: a misspecified model with measurement
error and high-dimensional binary classification, both of which have a crucial impact on
applications in public health.
The first problem arises in epidemiological practice. Epidemiologists often categorize a
continuous risk predictor since categorization is thought to be more robust and interpretable,
even when the true risk model is not a categorical one. Thus, their goal is to fit the categorical
model and interpret the categorical parameters. We address the question: with measurement
error and categorization, how can we do what epidemiologists want, namely to estimate the
parameters of the categorical model that would have been estimated if the true predictor had been
observed? We develop a general methodology for such an analysis, and illustrate it in linear
and logistic regression. Simulation studies are presented, and the methodology is applied to
a nutrition data set. Discussion of alternative approaches is also included.
For the second problem, we consider high-dimensional classification between
two groups with unequal covariance matrices. Rather than estimating the full quadratic
discriminant rule, we propose to perform simultaneous variable selection and linear dimension
reduction on the original data, with the subsequent application of quadratic discriminant analysis
on the reduced space. In contrast to quadratic discriminant analysis, the proposed framework
does not require estimation of precision matrices and scales linearly with the number of
measurements, making it especially attractive for use on high-dimensional datasets. We
support the methodology with theoretical guarantees on variable selection consistency and
an empirical comparison with competing approaches. We apply the method to gene expression
data of breast cancer patients and confirm the crucial importance of the ESR1 gene in
differentiating estrogen receptor status.
Further, we provide software support for the proposed methodology. We develop two
R packages, CCP and DAP, and present two vignettes as long-format illustrations of their
usage. | en |