dc.contributor.advisor: Loguinov, Dmitri
dc.creator: Mathiharan, Siddhartha Sankaran
dc.date.accessioned: 2012-02-14T22:19:18Z
dc.date.accessioned: 2012-02-16T16:19:20Z
dc.date.available: 2014-01-15T07:05:32Z
dc.date.created: 2011-12
dc.date.issued: 2012-02-14
dc.date.submitted: December 2011
dc.identifier.uri: https://hdl.handle.net/1969.1/ETD-TAMU-2011-12-10235
dc.description.abstract: Web crawlers encounter both finite and infinite elements during a crawl. Pages and hosts can be generated without limit using automated scripts and DNS wildcard entries. Ranking such resources is a challenge, because an entire web of pages and hosts could be created to manipulate the rank of a target resource. It is crucial to differentiate genuine content from spam in real time in order to allocate crawl budgets. In this study, algorithms for ranking hosts are designed that use finite resources: Pay Level Domains (PLDs) and IPv4 addresses. Heterogeneous graphs derived from the webgraph of IRLbot are used to achieve this. The first algorithm studied is PLD Supporters (PSUPP), the number of level-2 PLD supporters for each host on the host-host-PLD graph. This is further improved by True PLD Supporters (TSUPP), which uses true egalitarian level-2 PLD supporters on the host-IP-PLD graph together with DNS blacklists. It was found that support from content farms and stolen links could be eliminated by finding TSUPP. When TSUPP was applied to the host graph of IRLbot, there was less than 1% spam in the top 100,000 hosts. [en]
dc.format.mimetype: application/pdf
dc.language.iso: en_US
dc.subject: search engines [en]
dc.subject: web crawling [en]
dc.subject: spam [en]
dc.title: Identifying Search Engine Spam Using DNS [en]
dc.type: Thesis [en]
thesis.degree.department: Computer Science and Engineering [en]
thesis.degree.discipline: Computer Science [en]
thesis.degree.grantor: Texas A&M University [en]
thesis.degree.name: Master of Science [en]
thesis.degree.level: Masters [en]
dc.contributor.committeeMember: Caverlee, James
dc.contributor.committeeMember: Reddy, A. L. Narasimha
dc.type.genre: thesis [en]
dc.type.material: text [en]
local.embargo.terms: 2014-01-15

