
dc.contributor.advisor: Braga-Neto, Ulisses [en_US]
dc.creator: Vu, Thang [en_US]
dc.date.accessioned: 2012-07-16T15:56:31Z [en_US]
dc.date.accessioned: 2012-07-16T20:22:00Z
dc.date.available: 2012-07-16T15:56:31Z [en_US]
dc.date.available: 2012-07-16T20:22:00Z
dc.date.created: 2011-05 [en_US]
dc.date.issued: 2012-07-16 [en_US]
dc.date.submitted: May 2011 [en_US]
dc.identifier.uri: http://hdl.handle.net/1969.1/ETD-TAMU-2011-05-9114 [en_US]
dc.description.abstract: The small-sample size issue is a prevalent problem in Genomics and Proteomics today. Bootstrap, a resampling method which aims at increasing the efficiency of data usage, is considered to be an effort to overcome the problem of limited sample size. This dissertation studies the application of bootstrap to two problems of supervised learning with small sample data: estimation of the misclassification error of Gaussian discriminant analysis, and the bagging ensemble classification method. Estimating the misclassification error of discriminant analysis is a classical problem in pattern recognition and has many important applications in biomedical research. Bootstrap error estimation has been shown empirically to be one of the best estimation methods in terms of root mean squared error. In the first part of this work, we conduct a detailed analytical study of bootstrap error estimation for the Linear Discriminant Analysis (LDA) classification rule under Gaussian populations. We derive the exact formulas of the first and the second moment of the zero bootstrap and the convex bootstrap estimators, as well as their cross moments with the resubstitution estimator and the true error. Based on these results, we obtain the exact formulas of the bias, the variance, and the root mean squared error of the deviation from the true error of these bootstrap estimators. This includes the moments of the popular .632 bootstrap estimator. Moreover, we obtain the optimal weight for unbiased and minimum-RMS convex bootstrap estimators. In the univariate case, all the expressions involve Gaussian distributions, whereas in the multivariate case, the results are written in terms of bivariate doubly non-central F distributions. In the second part of this work, we conduct an extensive empirical investigation of bagging, which is an application of bootstrap to ensemble classification.
We investigate the performance of bagging in the classification of small-sample gene-expression data and protein-abundance mass spectrometry data, as well as the accuracy of small-sample error estimation with this ensemble classification rule. We observed that, under t-test and RELIEF filter-based feature selection, bagging generally does a good job of improving the performance of unstable, overfitting classifiers, such as CART decision trees and neural networks, but that improvement was not sufficient to beat the performance of single stable, non-overfitting classifiers, such as diagonal and plain linear discriminant analysis, or 3-nearest neighbors. Furthermore, the ensemble method did not improve the performance of these stable classifiers significantly. We give an explicit definition of the out-of-bag estimator that is intended to remove estimator bias, by formulating carefully how the error count is normalized, and investigate the performance of error estimation for bagging of common classification rules, including LDA, 3NN, and CART, applied to both synthetic and real patient data, corresponding to the use of common error estimators such as resubstitution, leave-one-out, cross-validation, basic bootstrap, bootstrap 632, bootstrap 632 plus, bolstering, and semi-bolstering, in addition to the out-of-bag estimator. The results from the numerical experiments indicated that the performance of the out-of-bag estimator is very similar to that of leave-one-out; in particular, the out-of-bag estimator is slightly pessimistically biased. The performance of the other estimators is consistent with their performance with the corresponding single classifiers, as reported in other studies. The results of this work are expected to provide helpful guidance to practitioners who are interested in applying the bootstrap in supervised learning applications. [en_US]
dc.format.mimetype: application/pdf [en_US]
dc.language.iso: en_US [en_US]
dc.subject: Bootstrap [en_US]
dc.subject: Error Estimation [en_US]
dc.subject: Classification [en_US]
dc.subject: LDA [en_US]
dc.subject: Bagging [en_US]
dc.subject: Out-of-Bag Estimation [en_US]
dc.subject: Ensemble Methods [en_US]
dc.subject: Genomics [en_US]
dc.subject: Proteomics [en_US]
dc.title: The Bootstrap in Supervised Learning and its Applications in Genomics/Proteomics [en_US]
dc.type: Thesis [en]
thesis.degree.department: Electrical and Computer Engineering [en_US]
thesis.degree.discipline: Electrical Engineering [en_US]
thesis.degree.grantor: Texas A&M University [en_US]
thesis.degree.name: Doctor of Philosophy [en_US]
thesis.degree.level: Doctoral [en_US]
dc.contributor.committeeMember: Dougherty, Edward [en_US]
dc.contributor.committeeMember: Datta, Aniruddha [en_US]
dc.contributor.committeeMember: Dabney, Alan [en_US]
dc.type.genre: thesis [en_US]
dc.type.material: text [en_US]
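
Illustrative note (not part of the deposited record): the abstract refers to the resubstitution, zero bootstrap, and convex bootstrap error estimators, with the .632 bootstrap as the convex combination that uses weight 0.632. The Python sketch below shows how these quantities are commonly computed for a univariate two-class problem. It is a minimal sketch under stated assumptions, not the dissertation's code: the function names are hypothetical, the classifier is a nearest-mean rule (which coincides with univariate LDA only under equal priors and a common variance), the zero bootstrap error is pooled over all held-out points, and resamples that miss a class are skipped.

```python
import random


def nearest_mean_rule(train):
    """Fit a nearest-mean classifier on (x, label) pairs with labels 0 and 1.

    Assumption: equal priors and a common variance, so the decision
    threshold is the midpoint of the two class means.
    """
    xs0 = [x for x, y in train if y == 0]
    xs1 = [x for x, y in train if y == 1]
    m0 = sum(xs0) / len(xs0)
    m1 = sum(xs1) / len(xs1)
    return lambda x: 0 if abs(x - m0) <= abs(x - m1) else 1


def error_rate(classify, data):
    """Fraction of points in data that the classifier mislabels."""
    return sum(classify(x) != y for x, y in data) / len(data)


def bootstrap_estimates(data, n_boot=200, weight=0.632, seed=0):
    """Return (resubstitution, zero bootstrap, convex bootstrap) estimates."""
    rng = random.Random(seed)
    n = len(data)

    # Resubstitution: train and test on the same full sample.
    resub = error_rate(nearest_mean_rule(data), data)

    errors = 0
    tested = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        chosen = set(idx)
        train = [data[i] for i in idx]
        held_out = [data[i] for i in range(n) if i not in chosen]
        # Assumption: skip resamples that miss a class or leave nothing out.
        if len({y for _, y in train}) < 2 or not held_out:
            continue
        classify = nearest_mean_rule(train)
        errors += sum(classify(x) != y for x, y in held_out)
        tested += len(held_out)

    if tested == 0:
        raise ValueError("no usable bootstrap resamples")

    # Zero bootstrap: error counted only on points left out of each resample.
    zero = errors / tested
    # Convex bootstrap: weighted combination of the two estimators.
    convex = (1 - weight) * resub + weight * zero
    return resub, zero, convex


if __name__ == "__main__":
    rng = random.Random(1)
    sample = [(rng.gauss(0.0, 1.0), 0) for _ in range(10)]
    sample += [(rng.gauss(1.5, 1.0), 1) for _ in range(10)]
    print(bootstrap_estimates(sample))
```

With `weight=0.632` the last returned value is the .632 bootstrap estimate. Passing a different `weight` gives other members of the convex family, which is the quantity the abstract describes optimizing for zero bias or minimum RMS.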

