dc.creator | Brown, Adlie Jacob | |
dc.date.accessioned | 2021-07-24T00:29:14Z | |
dc.date.available | 2021-07-24T00:29:14Z | |
dc.date.created | 2021-05 | |
dc.date.submitted | May 2021 | |
dc.identifier.uri | https://hdl.handle.net/1969.1/194380 | |
dc.description.abstract | TnSeq is a genetic method used to evaluate the essentiality of genes in bacteria such as Mycobacterium tuberculosis. It combines random insertions of the Himar1 transposon with high-throughput sequencing to identify the most essential genes. The Himar1 transposon inserts only at TA dinucleotide sites in the genome, and it was thought that the surrounding sequence did not affect its insertion preferences. However, recent studies have shown that the sequence surrounding a TA site does affect how likely Himar1 is to insert there. Our goal was to determine whether a model could be created that predicts the insertion count of a TA site in the M. tuberculosis genome given its surrounding nucleotide sequence. To do this, machine learning algorithms, including artificial neural networks and naïve Bayes classifiers, were tuned and tested to make the most accurate predictions. The input and output encodings were also adjusted, and supplemental information was added to increase the accuracy of the predictions. In the end, by considering the relative difference between the mean insertion count of each TA site and the expected counts of surrounding TA sites, in addition to the surrounding sequence itself, we were able to use simple linear regression to create a model that has predictive power. We achieved an R^2 value of 0.28, and the scatter plot of predicted versus actual insertion counts showed a linear trend. Our model used the novel approach of considering the context of the surrounding TA sites to generate a more accurate prediction. The model can help scientists better interpret the results of TnSeq experiments. This bioinformatic analysis can help us learn more about bacterial evolution and could help identify essential genes to target when developing drugs to treat tuberculosis. | en |
dc.format.mimetype | application/pdf | |
dc.subject | TnSeq | en |
dc.subject | Data Science | en |
dc.subject | Machine Learning | en |
dc.subject | Genetics | en |
dc.subject | Mycobacterium tuberculosis | en |
dc.subject | tuberculosis | en |
dc.subject | Computer Science | en |
dc.subject | Data Analysis | en |
dc.subject | high throughput sequencing | en |
dc.subject | gene analysis | en |
dc.title | Analyzing TnSeq Data to Predict Insertion Counts in M. tuberculosis | en |
dc.type | Thesis | en |
thesis.degree.department | Computer Science and Engineering | en |
thesis.degree.discipline | Computer Science | en |
thesis.degree.grantor | Undergraduate Research Scholars Program | en |
thesis.degree.name | B.S. | en |
thesis.degree.level | Undergraduate | en |
dc.contributor.committeeMember | Ioerger, Thomas R | |
dc.type.material | text | en |
dc.date.updated | 2021-07-24T00:29:14Z | |
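
The abstract mentions encoding the nucleotide sequence around each TA site and reporting an R^2 value for the resulting predictions. The sketch below illustrates one common way such steps might look; it is not taken from the thesis. The function names, the one-hot encoding, the window width, and the example sequence are all assumptions made for illustration.

```python
# Illustrative sketch only: names, encoding, and window layout are
# assumptions, not the thesis code.

NUCLEOTIDES = "ACGT"


def one_hot_window(window):
    """Encode a nucleotide window as a flat list of 0/1 features (4 per base)."""
    features = []
    for base in window.upper():
        if base not in NUCLEOTIDES:
            raise ValueError(f"unexpected base: {base!r}")
        # One indicator per possible nucleotide, in A, C, G, T order.
        features.extend(1.0 if base == n else 0.0 for n in NUCLEOTIDES)
    return features


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Assumes the actual values are not all identical (SS_tot > 0).
    """
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up window: four bases on each side of a central TA site.
example = one_hot_window("GCGTTAACGC")
```

Feature vectors of this kind could be passed to any linear regression routine, and `r_squared` then compares the predicted insertion counts against the observed ones.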