dc.description.abstract | Small samples are commonplace in genomic/proteomic classification, the result
being inadequate classifier design and poor error estimation. A promising approach
to alleviate the problem is the use of prior knowledge. At the same time, it is
known that a large amount of information is encoded in biological
signaling pathways. This dissertation is concerned with the problem of classifier
design by utilizing both the available prior knowledge and training data. Specifically,
this dissertation utilizes the concrete notion of regularization in signal processing
and statistics to combine prior knowledge with different data-based or data-ignorant
criteria.
In the first part, we address optimal discrete classification where prior knowledge
is restricted to an uncertainty class of feature distributions absent a prior distribution
on the uncertainty class, a problem that arises directly for biological classification using
pathway information: labeling future observations obtained in the steady state by
utilizing both the available prior knowledge and the training data. An optimization-based
paradigm for utilizing prior knowledge is proposed to design better performing
classifiers when sample sizes are limited. We derive approximate expressions for the
first and second moments of the true error rate of the proposed classifier under the
assumption of two widely used models for the uncertainty classes: ε-contamination
and p-point classes. We examine the proposed paradigm on networks containing
NF-κB pathways, where it shows significant improvement compared to data-driven
methods.
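For reference, the ε-contamination uncertainty class mentioned above has the standard robust-statistics form below, written here with generic symbols (F₀ the nominal feature distribution, G an arbitrary distribution, ε the contamination level); the dissertation's exact notation may differ:

```latex
% epsilon-contamination class around a nominal distribution F_0:
% every member is a mixture of F_0 with an arbitrary contaminant G
\mathcal{F}_{\varepsilon}
  \;=\;
  \bigl\{\, F : F = (1-\varepsilon)\,F_0 + \varepsilon\,G,\;\; G \in \mathcal{P} \,\bigr\},
  \qquad 0 \le \varepsilon < 1 .
```

In the robust-statistics literature, p-point classes are instead typically defined by constraining the probabilities that F assigns to the cells of a fixed finite partition of the sample space.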
In the second part of this dissertation, we focus on Bayesian classification. Although
the problem of designing the optimal Bayesian classifier , assuming some known prior distributions, has been fully addressed, a critical issue still remains: how to incorporate biological knowledge into the prior distribution. For genomic/proteomic, the most common kind of knowledge is in the form of signaling pathways. Thus, it behooves us to nd methods of transforming pathway knowledge into knowledge of the feature-label distribution governing the classi cation problem. In order to incorporate the available prior knowledge, the interactions in the pathways are first quantifi ed from a Bayesian perspective. Then, we address the problem of prior probability construction by proposing a series of optimization paradigms that utilize the incomplete prior information contained in pathways (both topological and regulatory). The optimization paradigms are derived for both Gaussian case with Normal-inverse-Wishart prior and discrete classi cation with Dirichlet prior.
Simulation results, using both synthetic and real pathways, show that the proposed
paradigms yield improved classifiers that outperform traditional classifiers
which use only training data. | en |