Data Mining for Identifying Key Genes in Biological Processes Using Gene Expression Data

Li, Jin

Date

2018-10-03

Abstract

A large volume of gene expression data is being generated to study the mechanisms of various biological processes. These valuable data enable computational analyses that accelerate biological understanding. However, analyzing the data efficiently to mine new knowledge remains a challenge. The datasets were generated for different purposes, and their heterogeneity makes them difficult to integrate consistently, which slows both their reuse and the pace of biological discovery. To facilitate reuse of these data, we engaged biology experts to manually collect RNA-Seq gene expression datasets for perturbed splicing factors and RNA-binding proteins, resulting in two online databases, SFMetaDB and RBPMetaDB. These two databases hold comprehensive RNA-Seq gene expression data for mouse splicing factors and RNA-binding proteins, and they can be used to identify key genes or regulators in biological processes or human diseases. Besides showing the importance of the two databases, these projects also demonstrated an efficient way to collect data. In my dissertation, we also engaged biology collaborators to collect a comprehensive set of regulatory genes in cold-induced thermogenesis supported by in vivo experiments, with the key genes deposited in CITGeneDB. This database is the first to offer a comprehensive list of regulators of cold-induced thermogenesis in a higher regulatory hierarchy. In addition to building data resources, my dissertation also analyzed RNA-Seq gene expression data to gain biological insights. To study the mechanism of the human skin disease psoriasis, we analyzed public mouse and human psoriasis datasets and compared them to the splicing factor perturbation datasets in SFMetaDB, yielding candidate genes for psoriasis. Our computational predictions provide candidate factors to follow up on in studying the fundamental processes underlying psoriasis.
In addition, we introduced a data processing paradigm to identify key genes in biological processes via systematic collection of gene expression datasets, primary analysis of the data, and evaluation of consistent signals. We applied the paradigm to two applications, epidermal development and cold-induced thermogenesis, and it revealed many key genes in both. By collaborating with wet labs, we experimentally validated a novel gene, suprabasin (SBSN), in epidermal development. These findings enable a better understanding of the mechanisms underlying epidermal development and cold-induced thermogenesis, and they demonstrate the effectiveness of a paradigm that combines data collection with integrated analysis. My dissertation has mainly investigated a biological data processing paradigm consisting of systematic data collection, data analysis, and hypothesis generation. Through extensive work, we demonstrated the effectiveness of this novel approach, which can be readily generalized to other biological processes or human diseases.

Citation

Li, Jin (2018). Data Mining for Identifying Key Genes in Biological Processes Using Gene Expression Data. Doctoral dissertation, Texas A&M University. Available electronically from https://hdl.handle.net/1969.1/174429.