|dc.description.abstract||The research in this dissertation focuses on developing a novel methodology for ChIP-Seq
dataset analysis. Despite its advances, the standard ChIP-Seq data analysis pipeline, i.e.,
read mapping followed by peak calling, has the following shortcomings:
1. The majority of a ChIP-Seq dataset consists of background reads, so unnecessary
computational effort is spent on mapping reads that have no role in forming the true
peaks.
2. Unnecessary computational effort is spent on aligning control reads that do not map
to ChIP-enriched genomic regions.
3. Multi-mappable reads are often discarded during read mapping, which reduces the
power to identify peaks in repeat elements of the genome.
We present Map2Peak, a novel tool aimed at mitigating the aforementioned drawbacks.
Map2Peak takes unmapped ChIP-Seq and control reads as input and outputs
peaks twice as fast as the standard workflow. Map2Peak intertwines
partial read mapping and peak calling in a five-phase algorithm. It models the fragment
count information obtained during the early stages of ChIP read mapping (Phase 1) as
a 2-component Poisson mixture model, and then applies the expectation-maximization algorithm
to identify ChIP-enriched regions (Phase 2). The remaining ChIP reads and the majority
of control reads are then mapped, with exact matching only, to the much shorter pseudo-genome
composed of the ChIP-enriched regions (Phases 3 and 4). The mapping information is then
used to call peaks on the pseudo-genome (Phase 5). Our results show that the peaks called by
Map2Peak encompass most of the peaks called by the standard workflow (88%-96%) as well as
some novel motif-justifiable peaks that the standard workflow does not detect, and that
the majority (90%) of the background reads are discarded. Moreover, Map2Peak implicitly resolves
the alignment location of some multi-mappable reads, which results in increased
power to call peaks in repeat elements of the genome.
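As an illustration of the Phase 2 idea described above, the following is a minimal sketch of fitting a 2-component Poisson mixture to per-region fragment counts with expectation-maximization. It is not Map2Peak's implementation: the function names, the initialisation, the convergence tolerance, and the use of fixed per-region counts as input are all assumptions made for this example.

```python
import math


def poisson_logpmf(k, lam):
    """Log of the Poisson probability mass function."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def e_step(counts, lam_bg, lam_en, pi_en):
    """Return P(enriched | count) for each region and the log-likelihood."""
    resp, loglik = [], 0.0
    for k in counts:
        a = math.log(1.0 - pi_en) + poisson_logpmf(k, lam_bg)
        b = math.log(pi_en) + poisson_logpmf(k, lam_en)
        m = max(a, b)  # log-sum-exp for numerical stability
        log_norm = m + math.log(math.exp(a - m) + math.exp(b - m))
        resp.append(math.exp(b - log_norm))
        loglik += log_norm
    return resp, loglik


def fit_poisson_mixture(counts, n_iter=200, tol=1e-8):
    """EM for a background/enriched Poisson mixture (hypothetical helper)."""
    n = len(counts)
    mean = sum(counts) / n
    # Assumed initialisation: background below the mean, enriched above it.
    lam_bg, lam_en, pi_en = max(0.5 * mean, 1e-3), max(2.0 * mean, 1e-2), 0.5
    prev = -math.inf
    for _ in range(n_iter):
        resp, loglik = e_step(counts, lam_bg, lam_en, pi_en)
        # M-step: responsibility-weighted means and mixing proportion.
        w_en = sum(resp)
        w_bg = n - w_en
        s_en = sum(r * k for r, k in zip(resp, counts))
        s_bg = sum(counts) - s_en
        lam_en = max(s_en / max(w_en, 1e-12), 1e-9)
        lam_bg = max(s_bg / max(w_bg, 1e-12), 1e-9)
        pi_en = min(max(w_en / n, 1e-9), 1.0 - 1e-9)
        if abs(loglik - prev) < tol:
            break
        prev = loglik
    if lam_bg > lam_en:  # keep "enriched" as the higher-rate component
        lam_bg, lam_en, pi_en = lam_en, lam_bg, 1.0 - pi_en
    resp, _ = e_step(counts, lam_bg, lam_en, pi_en)
    return lam_bg, lam_en, pi_en, resp


if __name__ == "__main__":
    # Toy counts: mostly background regions plus a few enriched ones.
    toy = [0, 1, 2, 1, 0, 1, 2, 1, 0, 1] * 4 + [20, 22, 25, 19]
    lam_bg, lam_en, pi_en, resp = fit_poisson_mixture(toy)
    enriched = [i for i, r in enumerate(resp) if r > 0.5]
    print(lam_bg, lam_en, pi_en, enriched)
```

In this sketch, regions whose posterior probability of belonging to the higher-rate component exceeds 0.5 are treated as enriched; the cutoff is an arbitrary choice for the example.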
Map2Peak provides researchers with an ultrafast peak caller that uses the whole
ChIP-Seq dataset, without discarding multi-mappable reads, to identify peaks, and that
efficiently uses control datasets for peak calling. “Map2Peak” is available