Data Mining for Identifying Key Genes in Biological Processes Using Gene Expression Data
Abstract
A large volume of gene expression data is being generated for studying mechanisms of
various biological processes. These valuable data enable computational analyses that speed
up biological understanding. However, it remains a challenge to analyze the data
efficiently to mine new knowledge. These data were generated for different purposes, and their
heterogeneity makes it difficult to integrate the datasets consistently, which slows both the reuse of
these data and biological discovery. To facilitate the reuse of
these data, we engaged biology experts to manually collect RNA-Seq gene expression
datasets for perturbed splicing factors and RNA-binding proteins, resulting in two online
databases, SFMetaDB and RBPMetaDB. These two databases hold comprehensive RNA-Seq gene
expression data for mouse splicing factors and RNA-binding proteins, and they can be used to
identify key genes or regulators in biological processes or human diseases. Besides showing the
importance of the two databases, these two projects also demonstrated an efficient way to collect data.
In my dissertation, we also engaged biology collaborators to comprehensively collect regulatory genes
in cold-induced thermogenesis that are supported by in vivo experiments, with key genes deposited in
CITGeneDB. This database is the first to offer a comprehensive list of regulators in cold-induced
thermogenesis in a higher regulatory hierarchy. In addition to building data resources, my dissertation
also analyzed RNA-Seq gene expression data to gain biological insights. To study the
mechanism of the human skin disease psoriasis, we analyzed public mouse and human psoriasis
datasets and compared them to splicing-factor-perturbed datasets in SFMetaDB, resulting in candidate
genes for psoriasis. Our computational predictions provide candidate factors to follow up on when studying
fundamental processes underlying psoriasis. In addition, we introduced a data processing paradigm
to identify key genes in biological processes via systematic collection of gene expression datasets,
primary analysis of data, and evaluation of consistent signals. Our paradigm was applied to two
applications, epidermal development and cold-induced thermogenesis, and revealed many key
genes in both. By collaborating with wet labs, we experimentally validated a novel
gene, suprabasin (SBSN), in epidermal development. These findings enable a better understanding
of the mechanisms underlying epidermal development and cold-induced thermogenesis, and also
demonstrate the effectiveness of our paradigm, which combines data collection and integrated
analysis. My dissertation has mainly investigated a biological data processing paradigm, consisting
of systematic data collection, data analysis, and hypothesis generation. Through extensive work, we
demonstrated the effectiveness of this novel biological data processing approach, and this approach
can be readily generalized to other biological processes or human diseases.
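The "evaluation of consistent signals" step of the paradigm can be illustrated with a minimal sketch. The abstract does not state how consistency is scored, so the vote-counting rule below, the function name `consistent_genes`, the `min_fraction` threshold, and the toy gene labels are all assumptions made for illustration and not the method used in the dissertation. Each dataset is assumed to have already been reduced by primary analysis to a mapping from gene to direction of change (+1 for up, -1 for down).

```python
from collections import Counter


def consistent_genes(datasets, min_fraction=0.75):
    """Return genes whose direction of change agrees in at least
    `min_fraction` of all datasets (a hypothetical vote-counting rule)."""
    n = len(datasets)
    up = Counter()
    down = Counter()
    # Tally, per gene, how many datasets call it up or down.
    for calls in datasets:
        for gene, direction in calls.items():
            if direction > 0:
                up[gene] += 1
            elif direction < 0:
                down[gene] += 1
    result = {}
    for gene in set(up) | set(down):
        # A gene missing from a dataset counts as "no call" there.
        if up[gene] / n >= min_fraction:
            result[gene] = "up"
        elif down[gene] / n >= min_fraction:
            result[gene] = "down"
    return result


# Toy input with made-up calls; these are not results from the dissertation.
toy = [
    {"GeneA": 1, "GeneB": -1, "GeneC": 1},
    {"GeneA": 1, "GeneB": -1, "GeneC": -1},
    {"GeneA": 1, "GeneB": -1},
    {"GeneA": 1, "GeneB": 1, "GeneC": 1},
]
hits = consistent_genes(toy)
print(hits)
```

In this toy input, GeneA and GeneB pass the threshold while GeneC, whose direction conflicts across datasets, is dropped. Genes retained this way would then serve as candidates for hypothesis generation and experimental follow-up.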
Citation
Li, Jin (2018). Data Mining for Identifying Key Genes in Biological Processes Using Gene Expression Data. Doctoral dissertation, Texas A&M University. Available electronically from https://hdl.handle.net/1969.1/174429.