Show simple item record

dc.contributor.advisorWilliams, Tiffani L.en_US
dc.creatorSul, Seung Jinen_US
dc.date.accessioned2011-02-22T22:24:15Zen_US
dc.date.accessioned2011-02-22T23:48:12Z
dc.date.available2011-02-22T22:24:15Zen_US
dc.date.available2011-02-22T23:48:12Z
dc.date.created2009-12en_US
dc.date.issued2011-02-22en_US
dc.date.submittedDecember 2009en_US
dc.identifier.urihttp://hdl.handle.net/1969.1/ETD-TAMU-2009-12-7591en_US
dc.description.abstractPhylogenetic analysis can produce easily tens of thousands of equally plausible evolutionary trees. Consensus trees and topological distance matrices are often used to summarize the evolutionary relationships among the trees of interest. However, current approaches are not designed to analyze very large tree collections. In this dissertation, we present two fast algorithms— HashCS and HashRF —for analyzing large collections of evolutionary trees based on a novel hash table data structure, which provides a convenient and fast approach to store and access the bipartition information collected from the tree collections. Our HashCS algorithm is a fast 𝑂(𝑛𝑡) technique for constructing consensus trees, where 𝑛 is the number of taxa and 𝑡 is the number of trees. By reprocessing the bipartition information in our hash table, HashCS constructs strict and majority consensus trees. In addition to a consensus algorithm, we design a fast topological distance algorithm called HashRF to compute the 𝑡×𝑡 Robinson-Foulds distance matrix, which requires 𝑂(𝑛𝑡^ 2) running time. A RF distance matrix provides plenty of data-mining opportunities to help researchers understand the evolutionary relationships contained in their collection of trees. We also introduce a series of extensions based on HashRF to provide researchers with more convenient set of tools for analyzing their trees. We provide extensive experimentation regarding the practical performance of our hash-based algorithms across a diverse collection of biological and artificial trees. Our results show that both algorithms easily outperform existing consensus and RF matrix implementations. For example, on our biological trees, HashCS and HashRF are 1.8 and 100 times faster than PAUP*, respectively. We show two real-world applications of our fast hashing algorithms: (i) comparing phylogenetic heuristic implementations, and (ii) clustering and visualizing trees. In our first application, we design novel methods to compare the PaupRat and Rec-I-DCM3, two popular phylogenetic heuristics that use the Maximum Parsimony criterion, and show that RF distances are more effective than parsimony scores at identifying heterogeneity within a collection of trees. In our second application, we empirically show how to determine the distinct clusters of trees within large tree collections. We use two different techniques to identify distinct tree groups. Both techniques show that partitioning the trees into distinct groups and summarizing each group separately is a better representation of the data. Additional benefits of our approach are better consensus trees as well as insightful information regarding the convergence behavior of phylogenetic heuristics. Our fast hash-based algorithms provide scientists with a very powerful tools for analyzing the relationships within their large phylogenetic tree collections in new and exciting ways. Our work has many opportunities for future work including detecting convergence and designing better heuristics. Furthermore, our hash tables have lots of potential future extensions. For example, we can also use our novel hashing structure to design algorithms for computing other distance metrics such as Nearest Neighbor Interchange (NNI), Subtree Pruning and Regrafting (SPR), and Tree Bisection and Reconnection (TBR) distances.en_US
dc.format.mimetypeapplication/pdfen_US
dc.language.isoen_USen_US
dc.subjectphylogenetic analysisen_US
dc.subjectevolutionary treeen_US
dc.subjecthash, consensus treeen_US
dc.subjectRobinson-Foulds distanceen_US
dc.titleFast Hash-Based Algorithms for Analyzing Large Collections of Evolutionary Treesen_US
dc.typeBooken
dc.typeThesisen
thesis.degree.departmentComputer Science and Engineeringen_US
thesis.degree.disciplineComputer Scienceen_US
thesis.degree.grantorTexas A&M Universityen_US
thesis.degree.nameDoctor of Philosophyen_US
thesis.degree.levelDoctoralen_US
dc.contributor.committeeMemberAmato, Nancy M.en_US
dc.contributor.committeeMemberGutierrez-Osuna, Ricardoen_US
dc.contributor.committeeMemberWoolley, James B.en_US
dc.type.genreElectronic Dissertationen_US
dc.type.materialtexten_US


Files in this item

Thumbnail

This item appears in the following Collection(s)

Show simple item record