Comparative Analysis of Error Correction in High-throughput Sequences for the Human Gut Microbiome

Purwosumarto, Nathan

Abstract

With the development of high-throughput sequencing tools over the last few decades, the ability to sequence genomic data at large scale and relatively low cost has transformed the field of bioinformatics. Next-generation sequencing tools, such as the Illumina suite of bridge amplification sequencing technologies, can generate billions of base pairs of reads per experiment. However, one drawback of these tools is that they produce considerably more errors than earlier sequencing methods. While per-base error rates may seem quite low on paper, they are compounded by the large number of bases sequenced. Since these errors can confound analysis and downstream results within bioinformatics pipelines, many tools have been developed to mitigate the issue. The traditional method is to use clustering and denoising techniques, but a variety of software tools also aim to reduce error through correction by alternative methods, such as k-mer analysis. This project examines the traditional method of error correction for high-throughput sequencing, clustering and denoising, and asks whether non-standard error correction models can be added to the traditional pipeline to obtain better results. As the field of high-throughput sequencing is very large, the focus is placed on error correction in bacterial taxonomic classification. Specifically, taxonomic classification for the human gut microbiome is studied, using the 16S rRNA gene as the target sequence due to its ubiquity and importance in bacterial taxonomic classification. This gene is highly conserved among most prokaryotes and serves a fundamental role in protein synthesis across bacterial species. Differences within its sequence allow the taxonomic composition of bacterial communities to be analyzed, here in the context of the species residing within the human gut microbiome.
Existing sequences with known taxonomic composition for the human gut microbiome will be processed with different error correction methods as part of an in silico pipeline built on the bioinformatics platform QIIME2. This project builds on previous research in the field, studying its methodologies and their differences to address the problem of errors arising during sequencing. The human gut microbiome was chosen because recent studies have increasingly linked the diversity of the gut microbiome with a variety of overall health conditions. A comparative approach will be taken to identify the differences between error correction and traditional taxonomic classification methods, in order to determine whether improved taxonomic classification can be obtained by applying error correction to sequences from the human gut microbiome, focusing on the differences that error correction software can make at the genus and species level.
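
As an illustration of the k-mer analysis mentioned above, the following is a minimal Python sketch of the general idea behind k-mer based error detection: k-mers that occur only rarely across a set of reads are more likely to contain a sequencing error. It is not the pipeline or software used in this project. The function names (`kmer_counts`, `rare_kmers`), the toy reads, and the values of `k` and `min_count` are all hypothetical and chosen only for demonstration.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every overlapping k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def rare_kmers(reads, k, min_count):
    """Return k-mers seen fewer than min_count times (possible errors)."""
    counts = kmer_counts(reads, k)
    return {kmer for kmer, n in counts.items() if n < min_count}


# Toy data (hypothetical): three identical reads and one read that
# differs by a single base at position 4 (A -> T).
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "ACGTTCGT"]

# Assumed parameters for illustration only.
suspect = rare_kmers(reads, k=4, min_count=2)
print(sorted(suspect))
```

In this toy case every flagged k-mer overlaps the single substituted base, which is the signal a k-mer based corrector would act on. Real tools add considerably more machinery, such as quality scores and coverage-aware thresholds.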

URI

https://hdl.handle.net/1969.1/199647

Subject

Bioinformatics
Taxonomic Classification
Gut Microbiome
Error Correction

Collections

Undergraduate Research Scholars Capstone (2006–present)

Citation

Purwosumarto, Nathan (2023). Comparative Analysis of Error Correction in High-throughput Sequences for the Human Gut Microbiome. Undergraduate Research Scholars Program. Available electronically from https://hdl.handle.net/1969.1/199647.