Analyzing TnSeq Data to Predict Insertion Counts in M. tuberculosis
Abstract
TnSeq is a genetic method used to evaluate the essentiality of genes in bacteria such as Mycobacterium tuberculosis. It combines random insertions of the Himar1 transposon with high-throughput sequencing to determine which genes are most essential. The Himar1 transposon inserts only at TA dinucleotide sites in the genome, and it was thought that the surrounding sequence did not affect its insertion preferences. However, recent studies have shown that the sequence surrounding a TA site does affect how likely Himar1 is to insert there. Our goal was to determine whether a model could be created that predicts the insertion count of a TA site in M. tuberculosis given its surrounding nucleotide sequence. To do this, machine learning algorithms, including artificial neural networks and naïve Bayes classifiers, were tuned and tested to make the most accurate predictions. The input and output encodings were also adjusted, and supplemental information was added to increase the accuracy of the predictions. In the end, by considering the relative difference between the mean insertion count of each TA site and the expected counts of surrounding TA sites, in addition to the surrounding sequence itself, we were able to use simple linear regression to create a model that has predictive power. We achieved an R^2 value of 0.28, and the scatter plot of predicted versus actual insertion counts showed a linear trend. Our model used the novel approach of considering the context of the surrounding TA sites to generate a more accurate prediction. The model can help scientists better interpret the results of TnSeq experiments. This bioinformatic analysis can help us learn more about bacterial evolution and could help us find essential genes to target when developing drugs to treat tuberculosis.
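To make the inputs described above more concrete, the following is a minimal Python sketch of how TA sites might be located, how the nucleotides flanking each site might be one-hot encoded as regression features, and how an R^2 value is computed. It is an illustration only, not the thesis's actual pipeline: the function names, the flank width, the one-hot encoding, and the toy sequence are all assumptions, since the abstract does not state which encoding or window size was used.

```python
from typing import List

NUCLEOTIDES = "ACGT"  # assumed column order for the one-hot encoding


def find_ta_sites(genome: str) -> List[int]:
    """Return the 0-based start position of every TA dinucleotide."""
    return [i for i in range(len(genome) - 1) if genome[i:i + 2] == "TA"]


def flanking_window(genome: str, pos: int, flank: int) -> str:
    """Return `flank` bases on each side of the TA at `pos`, excluding the TA.

    Returns an empty string when the window would run past either end.
    """
    start, end = pos - flank, pos + 2 + flank
    if start < 0 or end > len(genome):
        return ""
    return genome[start:pos] + genome[pos + 2:end]


def one_hot(seq: str) -> List[int]:
    """Encode each base as four 0/1 indicators (A, C, G, T)."""
    vec: List[int] = []
    for base in seq:
        vec.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return vec


def r_squared(actual: List[float], predicted: List[float]) -> float:
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


if __name__ == "__main__":
    toy_genome = "GGCTAACGTTAGC"  # hypothetical sequence, not real data
    for site in find_ta_sites(toy_genome):
        window = flanking_window(toy_genome, site, flank=2)
        print(site, window, one_hot(window))
```

Feature vectors of this kind, together with a neighbor-context term such as the one the abstract describes, could then be passed to any linear regression routine, with `r_squared` comparing predicted and observed insertion counts.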
Subject
TnSeq
Data Science
Machine Learning
Genetics
Mycobacterium tuberculosis
tuberculosis
Computer Science
Data Analysis
high throughput sequencing
gene analysis
Citation
Brown, Adlie Jacob (2021). Analyzing TnSeq Data to Predict Insertion Counts in M. tuberculosis. Undergraduate Research Scholars Program. Available electronically from https://hdl.handle.net/1969.1/194380.