Analyzing TnSeq Data to Predict Insertion Counts in M. tuberculosis

Brown, Adlie Jacob

dc.creator	Brown, Adlie Jacob
dc.date.accessioned	2021-07-24T00:29:14Z
dc.date.available	2021-07-24T00:29:14Z
dc.date.created	2021-05
dc.date.submitted	May 2021
dc.identifier.uri	https://hdl.handle.net/1969.1/194380
dc.description.abstract	TnSeq is a genetic method used to evaluate the essentiality of genes in bacteria such as Mycobacterium tuberculosis. It combines random insertions by the Himar1 transposon with high-throughput sequencing to identify essential genes. The Himar1 transposon inserts only at TA dinucleotide sites in the genome, and it was previously thought that the surrounding sequence did not affect its insertion preferences. However, recent studies have shown that the sequence surrounding a TA site does affect how likely Himar1 is to insert there. Our goal was to determine whether a model could be created that predicts the insertion count of a TA site in the M. tuberculosis genome given its surrounding nucleotide sequence. To do this, machine learning algorithms, including artificial neural networks and naïve Bayes classifiers, were tuned and tested to make the most accurate predictions. The input and output encodings were also adjusted, and supplemental information was added to increase the accuracy of the predictions. In the end, by considering the relative difference between the mean insertion counts of each TA site and the expected counts of surrounding TA sites, in addition to the surrounding sequence itself, we were able to use simple linear regression to create a model that has predictive power. We achieved an R^2 value of 0.28, and the scatter plot of the predicted and actual insertion counts showed a linear trend. Our model used the novel approach of considering the context of the surrounding TA sites to generate a more accurate prediction. The model can help scientists better interpret the results of TnSeq experiments. This bioinformatic analysis can help us learn more about bacterial evolution and could help us find essential genes to target when developing drugs to treat tuberculosis.	en
dc.format.mimetype	application/pdf
dc.subject	TnSeq	en
dc.subject	Data Science	en
dc.subject	Machine Learning	en
dc.subject	Genetics	en
dc.subject	Mycobacterium tuberculosis	en
dc.subject	tuberculosis	en
dc.subject	Computer Science	en
dc.subject	Data Analysis	en
dc.subject	high throughput sequencing	en
dc.subject	gene analysis	en
dc.title	Analyzing TnSeq Data to Predict Insertion Counts in M. tuberculosis	en
dc.type	Thesis	en
thesis.degree.department	Computer Science and Engineering	en
thesis.degree.discipline	Computer Science	en
thesis.degree.grantor	Undergraduate Research Scholars Program	en
thesis.degree.name	B.S.	en
thesis.degree.level	Undergraduate	en
dc.contributor.committeeMember	Ioerger, Thomas R
dc.type.material	text	en
dc.date.updated	2021-07-24T00:29:14Z
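
Illustrative sketch (not taken from the thesis): the abstract describes fitting a simple linear regression on the sequence surrounding each TA site plus information about neighboring TA sites, and reporting R^2. The Python below shows one way such a pipeline could look. Everything in it is an assumption made for illustration: the window length, the one-hot encoding, the single "neighbor" feature, the function names, and the randomly generated toy data. It does not reproduce the thesis's data, features, or results.

```python
import numpy as np

NUCLEOTIDES = "ACGT"

def one_hot(window: str) -> np.ndarray:
    """Encode a nucleotide window as a flat 0/1 vector, 4 slots per position."""
    vec = np.zeros(len(window) * 4)
    for i, base in enumerate(window):
        vec[i * 4 + NUCLEOTIDES.index(base)] = 1.0
    return vec

def add_intercept(X: np.ndarray) -> np.ndarray:
    """Append a column of ones so the fitted model has an intercept term."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

def fit_linear(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Ordinary least squares; lstsq tolerates redundant one-hot columns."""
    coef, *_ = np.linalg.lstsq(add_intercept(X), y, rcond=None)
    return coef

def predict(X: np.ndarray, coef: np.ndarray) -> np.ndarray:
    return add_intercept(X) @ coef

def r_squared(y: np.ndarray, y_hat: np.ndarray) -> float:
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# --- Toy data only: random sequences and random numbers, not TnSeq counts ---
rng = np.random.default_rng(0)
n_sites, flank = 200, 4  # hypothetical sizes chosen for the example
bases = list(NUCLEOTIDES)
windows = [
    "".join(rng.choice(bases, flank)) + "TA" + "".join(rng.choice(bases, flank))
    for _ in range(n_sites)
]

# Hypothetical stand-in for "context of surrounding TA sites"
neighbor_feature = rng.normal(size=n_sites)

X = np.column_stack([np.array([one_hot(w) for w in windows]), neighbor_feature])
y = 0.5 * neighbor_feature + rng.normal(size=n_sites)  # synthetic target

coef = fit_linear(X, y)
r2 = r_squared(y, predict(X, coef))
print(f"in-sample R^2 on toy data: {r2:.2f}")
```

The R^2 printed here reflects only the synthetic data and is in-sample; a real evaluation would use measured insertion counts and held-out TA sites.
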

Files in this item

Name: BROWN-FINALTHESIS-2021.pdf
Size: 739.4Kb
Format: PDF

This item appears in the following Collection(s)

Undergraduate Research Scholars Capstone (2006–present)
