Improved Algorithms for Discovery of Transcription Factor Binding Sites in DNA Sequences

Zhao, Xiaoyan

dc.contributor.advisor	Sze, Sing-Hoi
dc.creator	Zhao, Xiaoyan
dc.date.accessioned	2012-02-14T22:18:37Z
dc.date.accessioned	2012-02-16T16:14:40Z
dc.date.available	2012-02-14T22:18:37Z
dc.date.available	2012-02-16T16:14:40Z
dc.date.created	2010-12
dc.date.issued	2012-02-14
dc.date.submitted	December 2010
dc.identifier.uri	https://hdl.handle.net/1969.1/ETD-TAMU-2010-12-8834
dc.description.abstract	Understanding the mechanisms that regulate gene expression is a major challenge in biology. One of the most important tasks in this challenge is to identify transcription factor binding sites (TFBS) in DNA sequences. The common representation of these binding sites is called a “motif”, and the TFBS discovery problem is also referred to as the motif finding problem in computer science. Despite extensive efforts in the past decade, none of the existing algorithms performs very well. This dissertation focuses on this difficult problem and proposes three new methods (MotifEnumerator, PosMotif, and Enrich) that improve on existing approaches. An improved pattern-driven algorithm, MotifEnumerator, is first proposed to detect the optimal motif with reduced time complexity compared to traditional exact pattern-driven approaches. This strategy is further extended to allow arbitrary don’t-care positions within a motif without much decrease in the motif lengths that can be solved. The performance of this algorithm is comparable to that of the best existing motif finding algorithms on a large benchmark set of samples. A second algorithm with additional post-processing, PosMotif, uses a string representation that allows arbitrary ignored positions within the non-conserved portion of single motifs, and uses Markov chains to model the background distributions of motifs of a certain length while skipping these positions within each Markov chain. Two post-processing steps that take redundancy information into account are applied in this algorithm. PosMotif demonstrates improved performance compared to the five best existing motif finding algorithms on several large benchmark sets of samples. The third method, Enrich, is proposed to improve the performance of general motif finding algorithms by adding more sequences to the samples in existing benchmark datasets. Five well-known motif finding algorithms were run on the original datasets and the enriched datasets, and the performance comparisons show a substantial overall improvement on the enriched datasets.	en
dc.format.mimetype	application/pdf
dc.language.iso	en_US
dc.subject	Computational Biology	en
dc.subject	Motif finding	en
dc.subject	Transcription	en
dc.title	Improved Algorithms for Discovery of Transcription Factor Binding Sites in DNA Sequences	en
dc.type	Thesis	en
thesis.degree.department	Computer Science and Engineering	en
thesis.degree.discipline	Computer Science	en
thesis.degree.grantor	Texas A&M University	en
thesis.degree.name	Doctor of Philosophy	en
thesis.degree.level	Doctoral	en
dc.contributor.committeeMember	Chen, Jianer
dc.contributor.committeeMember	Sarin, Vivek
dc.contributor.committeeMember	Fan, Ruzong
dc.type.genre	thesis	en
dc.type.material	text	en

Files in this item

Name: ZHAO-DISSERTATION.pdf
Size: 1.050Mb
Format: PDF

This item appears in the following Collection(s)

Electronic Theses, Dissertations, and Records of Study (2002– )
Texas A&M University Theses, Dissertations, and Records of Study (2002– )