Emerging Topics in Genome Sequencing and Analysis
Abstract
This dissertation studies the emerging topics in genome sequencing and analysis with DNA and RNA. The optimal hybrid sequencing and assembly for accurate genome reconstruction and efficient detection approaches for novel ncRNAs in genomes are discussed.
The next-generation sequencing is a significant topic that provides whole genetic information for the further biological research. Recent advances in high-throughput genome sequencing technologies have enabled the systematic study of various genomes by making whole genome sequencing affordable. To date, many hybrid genome assembly algorithms have been developed that can take reads from multiple read sources to reconstruct the original genome. An important aspect of hybrid sequencing and assembly is that the feasibility conditions for genome reconstruction can be satisfied by different combinations of the available read sources, opening up the possibility of optimally combining the sources to minimize the sequencing cost while ensuring accurate genome reconstruction. In this study, we derive the conditions for whole genome reconstruction from multiple read sources at a given confidence level and also introduce the optimal strategy for combining reads from different sources to minimize the overall sequencing cost. We show that the optimal read set, which simultaneously satisfies the feasibility conditions for genome reconstruction and minimizes the sequencing cost, can be effectively predicted through constrained discrete optimization.
The availability of genome-wide sequences for a variety of species provides a large database for the further RNA analysis with computational methods. Recent studies have shown that noncoding RNAs (ncRNAs) are known to play crucial roles in various biological processes, and some ncRNAs are related to the genome stability and a variety of inherited diseases. The discovery of novel ncRNAs is hence an important topic, and there is a pressing need for accurate computational detection approaches that can be used to efficiently detect novel ncRNAs in genomes. One important issue is RNA structure alignment for comparative genome analysis, as RNA secondary structures are better conserved than the RNA sequences. Simultaneous RNA alignment and folding algorithms aim to accurately align RNAs by predicting the consensus structure and alignment at the same time, but the computational complexity of the optimal dynamic programming algorithm for simultaneous alignment and folding is extremely high. In this work, we proposed an innovative method, TOPAS, for RNA structural alignment that can efficiently align RNAs through topological networks. Although many ncRNAs are known to have a well conserved secondary structure, which provides useful clues for computational prediction, the prediction of ncRNAs is still challenging, since it has been shown that a structure-based approach alone may not be sufficient for detecting ncRNAs in a single sequence. In this study, we first develop a new approach by utilizing the n-gram model to classify the sequences and extract effective features to capture sequence homology. Based on this approach, we propose an advanced method, piRNAdetect, for reliable computational prediction of piRNAs in genome sequences. Utilizing the n-gram model can enhance the detection of ncRNAs that have sparse folding structures with many unpaired bases. By incorporating the n-gram model with the generalized ensemble defect, which assesses structure conservation and conformation to the consensus structure, we further propose RNAdetect, a novel computational method for accurate detection of ncRNAs through comparative genome analysis. Extensive performance evaluation based on the Rfam database and bacterial genomes demonstrates that our approaches can accurately and reliably detect novel ncRNAs, outperforming the current advanced methods.
Subject
DNA assemblywhole genome reconstruction
RNA structural alignment
topological network
piRNA prediction
n-gram model
support vector machine
ncRNA prediction
comparative genome analysis
generalized ensemble defect
Citation
Chen, Chun-Chi (2017). Emerging Topics in Genome Sequencing and Analysis. Doctoral dissertation, Texas A & M University. Available electronically from https : / /hdl .handle .net /1969 .1 /173085.