Signal Processing and Machine Learning Techniques for Analyzing Metagenomic Data

Date

2017-04-21

Abstract

Recent advances in high-throughput sequencing technologies have opened a new era of genomic studies called metagenomics. Metagenomics has rapidly become the standard approach for characterizing the composition and functional capacity of microbial communities by directly studying the genetic content recovered from environmental samples, without prior culturing. Although these advances enable researchers to sequence bacterial populations on a reasonable budget, analyzing the resulting massive metagenomic datasets presents significant challenges. This dissertation presents novel computational tools, based on signal processing and machine learning theory, to enable the investigation of biological systems. Two important research problems are addressed.

The first problem concerns the identification of potential metagenomic biomarkers, which play a critical role in understanding the biological process under study and in developing possible therapies. Because the true biomarkers are unknown and no standard assessment methodology exists, evaluating the quality of detected markers is challenging. Therefore, we begin by developing an evaluation protocol that mimics knowledge of the true markers, providing a common ground for comparing competing algorithms. Next, we propose a new framework for biomarker discovery based on a low rank-sparse (LRS) decomposition. The instability of a biomarker detection algorithm renders the identified markers questionable and hinders the translation of these findings into clinical applications. To mitigate this problem, we propose the Regularized Low Rank-Sparse Decomposition (RegLRSD) algorithm. RegLRSD adapts the LRS model to incorporate the fact that irrelevant features are expected to have abundance profiles that do not vary significantly between samples belonging to different phenotypes. Integrating this prior knowledge helps guide the recovery process toward more accurate and consistent biological results.

The second research problem concerns the development of a computational framework to enable the translation of the identified markers into clinical applications. Identifying potential biomarkers is the first step in understanding the relation between a disease and the associated shift in microbial composition. From a practical perspective, however, the microbial alteration needs to be quantified as a single numerical value that helps clinicians measure disease activity and its response to therapy.

Keywords

Metagenomic
