Information Theory, Graph Theory and Bayesian Statistics based improved and robust methods in Genome Assembly
The bioinformatics skills required for genome sequencing are often a significant hurdle for researchers working in computational biology. This dissertation highlights the significance of genome assembly as a research area, focuses on its need to remain accurate, details the characteristics of the raw data, examines some key metrics, emphasizes some tools and outlines the whole pipeline for next-generation sequencing. Currently, a major effort is being put towards the assembly of the genomes of all living organisms. Given the importance of comparative genome assembly, this dissertation explores the principle of Minimum Description Length (MDL) and its two variants, the Two-Part MDL and Sophisticated MDL, for identifying the optimal reference sequence for genome assembly. Thereafter, a Modular Approach to Reference Assisted Genome Assembly Pipeline, referred to as MARAGAP, is developed. MARAGAP uses the MDL principle to determine the optimal reference sequence for the assembly. The optimal reference sequence is used as a template to infer inversions, insertions, deletions and Single Nucleotide Polymorphisms (SNPs) in the target genome. MARAGAP uses an algorithmic approach to detect and correct inversions and deletions, a de Bruijn graph based approach to infer insertions, an affine-match affine-gap local alignment tool to estimate the locations of insertions and a Bayesian estimation framework for detecting SNPs (called BECA). BECA capitalizes on the 'alignment-layout-consensus' paradigm and Quality (Q-) values to detect and correct SNPs by evaluating a number of probabilistic measures. However, BECA performs this process only once. Its framework is therefore extended with Gibbs sampling so that BECA can be applied iteratively. After each assembly, the reference sequence is updated and the probabilistic score for each base call is renewed.
The revised reference sequence and probabilities are then used to identify the alignments and consensus sequence, thereby yielding an algorithm referred to as Gibbs-BECA. Gibbs-BECA improves performance both by rectifying more SNPs and by remaining robust even in the presence of a poor reference sequence. Lastly, another major effort in this dissertation was the development of two cohesive software platforms that combine many different genome assembly pipelines in two distinct environments, referred to as Baari and Genobuntu, respectively. Baari and Genobuntu support pre-assembly tools, genome assemblers and post-assembly tools. Additionally, a library of tools developed by the authors for Next Generation Sequencing (NGS) data, together with commonly used biological software, is provided in these platforms. Baari and Genobuntu are free, easily distributable and facilitate building laboratories and software workstations both for personal use and for a college or university laboratory. Baari is a customized Ubuntu OS packed with the tools mentioned above, whereas Genobuntu is a software package containing the same tools for users who already have Ubuntu OS installed on their systems.
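To make the role of Q-values in Bayesian base calling more concrete, the following is a minimal sketch and not BECA itself. It assumes Phred-scaled qualities (error probability 10^(-Q/10)), independent reads, errors spread evenly over the three other bases, and a strictly positive prior over bases. The function names `phred_to_error`, `base_posterior` and `call_base` are hypothetical and chosen only for illustration.

```python
import math

BASES = "ACGT"

def phred_to_error(q):
    """Convert a Phred quality value Q to an error probability 10^(-Q/10)."""
    # Cap at 0.75 (a uniformly random call) so that Q = 0 never yields log(0).
    return min(10 ** (-q / 10), 0.75)

def base_posterior(pileup, prior=None):
    """Posterior over the true base at one aligned position.

    pileup: list of (observed_base, phred_q) pairs from reads covering the position.
    prior:  dict mapping base -> positive prior probability (uniform if None).
    Assumption: reads are independent and an erroneous call is equally
    likely to be any of the three other bases.
    """
    if prior is None:
        prior = {b: 0.25 for b in BASES}
    # Accumulate in log space to avoid underflow at high coverage.
    log_post = {b: math.log(prior[b]) for b in BASES}
    for observed, q in pileup:
        e = phred_to_error(q)
        for b in BASES:
            log_post[b] += math.log(1 - e if observed == b else e / 3)
    # Normalize with the log-sum-exp shift.
    shift = max(log_post.values())
    unnorm = {b: math.exp(v - shift) for b, v in log_post.items()}
    total = sum(unnorm.values())
    return {b: v / total for b, v in unnorm.items()}

def call_base(pileup, prior=None):
    """Return (most probable base, its posterior probability)."""
    post = base_posterior(pileup, prior)
    best = max(post, key=post.get)
    return best, post[best]

if __name__ == "__main__":
    # Two confident 'A' calls outweigh one low-quality 'C' call.
    print(call_base([("A", 30), ("A", 30), ("C", 10)]))
```

In an iterative scheme of the kind the abstract describes, the posterior from one round could in principle be supplied as the `prior` for the next; that wiring is an assumption of this sketch rather than a description of Gibbs-BECA.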
Minimum Description Length
Reference assisted assembly
Wajid, Bilal (2015). Information Theory, Graph Theory and Bayesian Statistics based improved and robust methods in Genome Assembly. Doctoral dissertation, Texas A & M University. Available electronically from