dc.creator: Brown, Adlie Jacob
dc.date.accessioned: 2021-07-24T00:29:14Z
dc.date.available: 2021-07-24T00:29:14Z
dc.date.created: 2021-05
dc.date.submitted: May 2021
dc.identifier.uri: https://hdl.handle.net/1969.1/194380
dc.description.abstract: TnSeq is a genetic method used to evaluate the essentiality of genes in bacteria such as Mycobacterium tuberculosis. It combines random insertions by the Himar1 transposon with high-throughput sequencing to identify the most essential genes. The Himar1 transposon inserts only at TA dinucleotide sites in the genome, and it was thought that the surrounding sequence did not affect its insertion preferences. However, recent studies have shown that the sequence surrounding a TA site does affect how likely Himar1 is to insert there. Our goal was to determine whether a model could be created that predicts the insertion count of a TA site in the M. tuberculosis genome given its surrounding nucleotide sequence. To do this, machine learning algorithms, including artificial neural networks and naïve Bayes classifiers, were tuned and tested to make the most accurate predictions. The input and output encodings were also adjusted, and supplemental information was added to increase the accuracy of the predictions. In the end, by considering the relative difference between the mean insertion count of each TA site and the expected counts of surrounding TA sites, in addition to the surrounding sequence itself, we were able to use simple linear regression to create a model that has predictive power. We achieved an R^2 value of 0.28, and the scatter plot of predicted versus actual insertion counts showed a linear trend. Our model used the novel approach of considering the context of the surrounding TA sites to generate a more accurate prediction. The model can help scientists better interpret the results of TnSeq experiments. This bioinformatic analysis can help us learn more about bacterial evolution and could help us find essential genes to target when developing drugs to treat tuberculosis. [en]
dc.format.mimetype: application/pdf
dc.subject: TnSeq [en]
dc.subject: Data Science [en]
dc.subject: Machine Learning [en]
dc.subject: Genetics [en]
dc.subject: Mycobacterium tuberculosis [en]
dc.subject: tuberculosis [en]
dc.subject: Computer Science [en]
dc.subject: Data Analysis [en]
dc.subject: high throughput sequencing [en]
dc.subject: gene analysis [en]
dc.title: Analyzing TnSeq Data to Predict Insertion Counts in M. tuberculosis [en]
dc.type: Thesis [en]
thesis.degree.department: Computer Science and Engineering [en]
thesis.degree.discipline: Computer Science [en]
thesis.degree.grantor: Undergraduate Research Scholars Program [en]
thesis.degree.name: B.S. [en]
thesis.degree.level: Undergraduate [en]
dc.contributor.committeeMember: Ioerger, Thomas R
dc.type.material: text [en]
dc.date.updated: 2021-07-24T00:29:14Z
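
The abstract describes predicting insertion counts from the nucleotides surrounding each TA site and reporting an R^2 value, but this record does not give the window size, the encoding, or the regression setup. The following is only a minimal sketch of what such an input encoding and score could look like. The function names, the one-hot scheme, and the default flank of 4 bases are all assumptions made for illustration, not the thesis's actual method.

```python
from typing import List

NUCLEOTIDES = "ACGT"  # fixed column order for the one-hot encoding (an assumption)


def find_ta_sites(genome: str) -> List[int]:
    """Return the 0-based start index of every TA dinucleotide in `genome`."""
    return [i for i in range(len(genome) - 1) if genome[i:i + 2] == "TA"]


def one_hot_window(genome: str, site: int, flank: int = 4) -> List[int]:
    """One-hot encode the `flank` bases on each side of the TA at `site`.

    The TA itself is skipped because it is identical at every site.
    The flank size is a hypothetical parameter, not a value from the thesis.
    """
    if site - flank < 0 or site + 2 + flank > len(genome):
        raise ValueError("window extends past the end of the sequence")
    context = genome[site - flank:site] + genome[site + 2:site + 2 + flank]
    # Four indicator values per base, in NUCLEOTIDES order.
    return [1 if base == n else 0 for base in context for n in NUCLEOTIDES]


def r_squared(actual: List[float], predicted: List[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("actual values have zero variance")
    return 1 - ss_res / ss_tot
```

Feature vectors built this way could be passed to any linear regression routine. The supplemental feature the abstract mentions, based on the expected counts of neighboring TA sites, would be appended as an extra column; how it was computed is not stated in this record.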

