
dc.contributor.advisor: Zhang, Xianyang
dc.creator: Zhou, Huijuan
dc.date.accessioned: 2022-02-23T18:11:20Z
dc.date.available: 2023-05-01T06:36:32Z
dc.date.created: 2021-05
dc.date.issued: 2021-04-22
dc.date.submitted: May 2021
dc.identifier.uri: https://hdl.handle.net/1969.1/195760
dc.description.abstract: In many scientific studies and real-life settings, one of the essential questions is “Are there any signals in the datasets?” For example, in genetics, scientists are interested in genes that are differentially expressed with respect to certain diseases. In fund investment, we want to identify funds administered by skilled rather than merely “lucky” managers. These and many other applications can be formulated as multiple hypothesis testing problems, in which we consider a set of statistical inferences simultaneously. Motivated by these real-world applications, many research problems arise in multiple testing and remain to be addressed. This dissertation contains two projects addressing different challenges arising from bioinformatics.

The first project considers a covariate-adaptive family-wise error rate (FWER)-controlling procedure for genome-wide association studies. With the increasing availability of functional genomics data, it is possible to increase detection power by leveraging genomic functional annotations in genome-wide association studies. Previous efforts to accommodate covariates in multiple testing focus on false discovery rate (FDR) control, while covariate-adaptive FWER-controlling procedures remain underdeveloped. Here we propose a novel covariate-adaptive FWER-controlling procedure that incorporates external covariates which are potentially informative of either the statistical power or the prior null probability. An efficient algorithm is developed to implement the proposed method. We prove its asymptotic validity and obtain the rate of convergence through a perturbation-type argument. Our numerical studies show that the new procedure is more powerful than competing methods and maintains robustness across different settings. We apply the proposed approach to the UK Biobank data and analyze 27 traits with 9 million single-nucleotide polymorphisms tested for associations. Seventy-five genomic annotations are used as covariates. Our approach detects more genome-wide significant loci than other methods in 21 out of the 27 traits.

One fundamental statistical task in microbiome data analysis is differential abundance analysis, which aims to identify microbial taxa whose abundance covaries with a variable of interest. Although the main interest is in the change in absolute abundance, i.e., the number of microbial cells per unit area/volume at an ecological site such as the human gut, the data from a sequencing experiment reflect only the relative abundances of the taxa in a sample. Thus, microbiome data are compositional in nature. Analysis of such compositional data is challenging because a change in the absolute abundance of one taxon leads to changes in the relative abundances of other taxa, making false positive control difficult. In the second project, we present a simple yet robust and highly scalable approach to tackle the compositional effects in differential abundance analysis. The method requires only the application of established statistical tools. It fits linear regression models on the centered log-ratio transformed data, identifies a bias term due to the transformation and compositional effect, and corrects the bias using the mode of the regression coefficients. Owing to its algorithmic simplicity, our method is 100-1000 times faster than the state-of-the-art method ANCOM-BC. Under mild assumptions, we prove its asymptotic FDR control property, making it the first differential abundance method that enjoys a theoretical FDR control guarantee. The proposed method is very flexible and can be extended to mixed-effect models for the analysis of correlated microbiome data. Using comprehensive simulations and real data applications, we demonstrate that our method has the best overall performance in terms of FDR control and power among the competitors. We implemented the proposed method in the R package LinDA (https://github.com/zhouhj1994/LinDA). [en]
dc.format.mimetype: application/pdf
dc.language.iso: en
dc.subject: Multiple testing [en]
dc.subject: External covariates [en]
dc.subject: Family-wise error rate [en]
dc.subject: Differential abundance analysis [en]
dc.subject: False discovery rate [en]
dc.title: Structure Adaptive Multiple Testing with Applications to Bioinformatics [en]
dc.type: Thesis [en]
thesis.degree.department: Statistics [en]
thesis.degree.discipline: Statistics [en]
thesis.degree.grantor: Texas A&M University [en]
thesis.degree.name: Doctor of Philosophy [en]
thesis.degree.level: Doctoral [en]
dc.contributor.committeeMember: Carroll, Raymond
dc.contributor.committeeMember: Wong, Raymond
dc.contributor.committeeMember: Zhang, Ke
dc.type.material: text [en]
dc.date.updated: 2022-02-23T18:11:21Z
local.embargo.terms: 2023-05-01
local.etdauthor.orcid: 0000-0002-3696-6232
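
Illustrative note on the second project in dc.description.abstract: the abstract describes the method as three steps (regress centered log-ratio transformed data on the variable of interest, recognize that every coefficient carries a shared bias from the transformation and compositional effect, and correct that bias with the mode of the coefficients). The Python sketch below is one possible reading of those steps and is not the LinDA package's code. All function names, the pseudocount of 0.5, the kernel-based mode estimate, and the simulated data are assumptions made for illustration; the sketch produces no p-values and does no FDR control.

```python
import math
import random
import statistics


def clr(counts, pseudo=0.5):
    """Centered log-ratio transform of one sample (pseudocount is an assumption)."""
    logs = [math.log(c + pseudo) for c in counts]
    center = sum(logs) / len(logs)
    return [v - center for v in logs]


def ols_slope(x, y):
    """Slope of the simple linear regression of y on x, with an intercept."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx


def estimate_mode(values):
    """Crude mode estimate: the observed value with the highest Gaussian
    kernel density. The rule-of-thumb bandwidth is an illustrative choice."""
    bandwidth = 1.06 * statistics.stdev(values) * len(values) ** (-0.2)
    if bandwidth == 0:
        return values[0]

    def density(v):
        return sum(math.exp(-0.5 * ((v - u) / bandwidth) ** 2) for u in values)

    return max(values, key=density)


def bias_corrected_slopes(counts, covariate):
    """counts: one list of taxa counts per sample; covariate: one value per sample.
    Returns (corrected slopes, estimated bias)."""
    transformed = [clr(sample) for sample in counts]
    n_taxa = len(counts[0])
    slopes = [
        ols_slope(covariate, [row[j] for row in transformed])
        for j in range(n_taxa)
    ]
    # If most taxa do not change, most raw slopes sit at the shared bias,
    # so the mode of the slopes serves as the bias estimate.
    bias = estimate_mode(slopes)
    return [s - bias for s in slopes], bias


def simulate_counts(n_samples=60, n_taxa=40, n_diff=5, effect=2.0,
                    depth=100_000, seed=0):
    """Hypothetical toy data: the first n_diff taxa change in absolute
    abundance (log scale) with a binary covariate; only proportions are observed."""
    rng = random.Random(seed)
    covariate = [i % 2 for i in range(n_samples)]
    base = [rng.uniform(3.0, 6.0) for _ in range(n_taxa)]
    counts = []
    for u in covariate:
        absolute = [
            math.exp(base[j] + (effect * u if j < n_diff else 0.0)
                     + rng.gauss(0.0, 0.3))
            for j in range(n_taxa)
        ]
        total = sum(absolute)
        counts.append([round(depth * a / total) for a in absolute])
    return counts, covariate


counts, covariate = simulate_counts()
corrected, bias = bias_corrected_slopes(counts, covariate)
```

In this toy setup the unchanged taxa all receive a negative raw slope of roughly minus the average effect (5 x 2.0 / 40 = 0.25), purely because five other taxa increased. Subtracting the mode moves those slopes back toward zero while the five changed taxa keep slopes near the simulated effect.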

