Statistical Methods for the Analysis of Mass Spectrometry-based Proteomics Data

Wang, Xuan

dc.contributor.advisor	Dabney, Alan
dc.creator	Wang, Xuan
dc.date.accessioned	2012-07-16T15:57:41Z
dc.date.accessioned	2012-07-16T20:27:38Z
dc.date.available	2014-09-16T07:28:20Z
dc.date.created	2012-05
dc.date.issued	2012-07-16
dc.date.submitted	May 2012
dc.identifier.uri	https://hdl.handle.net/1969.1/ETD-TAMU-2012-05-10777
dc.description.abstract	Proteomics plays an important role in the systems-level understanding of biological function. Mass spectrometry (MS)-based proteomics has become the tool of choice for identifying and quantifying the proteome of an organism. In the most widely used bottom-up approach to MS-based high-throughput quantitative proteomics, complex mixtures of proteins are first subjected to enzymatic cleavage; the resulting peptide products are separated based on chemical or physical properties and then analyzed using a mass spectrometer. The three fundamental challenges in the analysis of bottom-up MS-based proteomics data are: (i) identifying the proteins that are present in a sample; (ii) aligning different samples on elution (retention) time, mass, peak area (intensity), etc.; and (iii) quantifying the abundance levels of the identified proteins after alignment. Each of these challenges requires knowledge of the biological and technological context that gives rise to the observed data, as well as the application of sound statistical principles for estimation and inference. In this dissertation, we present a set of statistical methods for protein identification, alignment, and quantification in bottom-up proteomics. We describe a fully Bayesian hierarchical modeling approach to peptide and protein identification on the basis of MS/MS fragmentation patterns in a unified framework. Our major contribution is to allow for dependence among the list of top candidate peptide-spectrum matches (PSMs), which we accomplish with a Bayesian multiple-component mixture model incorporating decoy search results and joint estimation of the accuracy of a list of peptide identifications for each MS/MS fragmentation spectrum. We also propose an objective criterion for the evaluation of the False Discovery Rate (FDR) associated with a list of identifications at the peptide level, which results in more accurate FDR estimates than existing methods like PeptideProphet.
Several alignment algorithms have been developed using different warping functions. However, all existing alignment approaches lack a useful metric for scoring an alignment between two data sets, and hence provide no quantitative measure of how good an alignment is. Our alignment approach uses "anchor points" to align all the individual scans in the target sample and provides a framework to quantify the alignment, that is, to assign a p-value to a set of aligned LC-MS runs to assess the correctness of the alignment. After alignment using our algorithm, the p-values from Wilcoxon signed-rank tests on elution (retention) time, m/z, and peak area become non-significant. Quantitative mass spectrometry-based proteomics involves statistical inference on protein abundance, based on the intensities of each protein's associated spectral peaks. However, typical mass spectrometry-based proteomics data sets have substantial proportions of missing observations, due at least in part to censoring of low intensities. This complicates intensity-based differential expression analysis. We outline a statistical method for protein differential expression based on a simple Binomial likelihood. By modeling peak intensities as binary, in terms of "presence/absence", we enable the selection of proteins not typically amenable to quantitative analysis, e.g., "one-state" proteins that are present in one condition but absent in another. In addition, we present an analysis protocol that combines quantitative and presence/absence analysis of a given data set in a principled way, resulting in a single list of selected proteins with a single associated FDR.	en
dc.format.mimetype	application/pdf
dc.language.iso	en_US
dc.subject	Proteomics	en
dc.subject	Mass spectrometry	en
dc.subject	Bottom-up	en
dc.subject	Identification	en
dc.subject	Alignment	en
dc.subject	Quantitation	en
dc.subject	LC-MS	en
dc.subject	PSM	en
dc.subject	Bayesian Hierarchical Model	en
dc.subject	FDR	en
dc.subject	Anchor Points	en
dc.subject	Missingness	en
dc.title	Statistical Methods for the Analysis of Mass Spectrometry-based Proteomics Data	en
dc.type	Thesis	en
thesis.degree.department	Statistics	en
thesis.degree.discipline	Statistics	en
thesis.degree.grantor	Texas A&M University	en
thesis.degree.name	Doctor of Philosophy	en
thesis.degree.level	Doctoral	en
dc.contributor.committeeMember	Longnecker, Michael
dc.contributor.committeeMember	Sang, Huiyan
dc.contributor.committeeMember	Sturino, Joseph
dc.type.genre	thesis	en
dc.type.material	text	en
local.embargo.terms	2014-07-16

Files in this item

Name: WANG-DISSERTATION.pdf
Size: 3.956Mb
Format: PDF

This item appears in the following Collection(s)

Electronic Theses, Dissertations, and Records of Study (2002– )
Texas A&M University Theses, Dissertations, and Records of Study (2002– )
