Show simple item record

dc.contributor.advisor: Loguinov, Dmitri
dc.creator: Lee, Hsin-Tsang
dc.date.accessioned: 2008-10-10T20:55:45Z
dc.date.available: 2008-10-10T20:55:45Z
dc.date.created: 2008-05
dc.date.issued: 2008-10-10
dc.identifier.uri: https://hdl.handle.net/1969.1/85914
dc.description.abstract: This thesis shares our experience in designing web crawlers that scale to billions of pages and models their performance. We show that with the quadratically increasing complexity of verifying URL uniqueness, breadth-first search (BFS) crawl order, and fixed per-host rate-limiting, current crawling algorithms cannot effectively cope with the sheer volume of URLs generated in large crawls, highly-branching spam, legitimate multi-million-page blog sites, and infinite loops created by server-side scripts. We offer a set of techniques for dealing with these issues and test their performance in an implementation we call IRLbot. In our recent experiment that lasted 41 days, IRLbot running on a single server successfully crawled 6.3 billion valid HTML pages (7.6 billion connection requests) and sustained an average download rate of 319 Mb/s (1,789 pages/s). Unlike our prior experiments with algorithms proposed in related work, this version of IRLbot did not experience any bottlenecks and successfully handled content from over 117 million hosts, parsed out 394 billion links, and discovered a subset of the web graph with 41 billion unique nodes. [en]
dc.format.medium: electronic [en]
dc.language.iso: en_US
dc.publisher: Texas A&M University
dc.subject: Measurement [en]
dc.subject: Performance [en]
dc.title: IRLbot: design and performance analysis of a large-scale web crawler [en]
dc.type: Book [en]
dc.type: Thesis [en]
thesis.degree.department: Computer Science [en]
thesis.degree.discipline: Computer Science [en]
thesis.degree.grantor: Texas A&M University [en]
thesis.degree.name: Master of Science [en]
thesis.degree.level: Masters [en]
dc.contributor.committeeMember: Bettati, Riccardo
dc.contributor.committeeMember: Reddy, A. L. Narasimha
dc.type.genre: Electronic Thesis [en]
dc.type.material: text [en]
dc.format.digitalOrigin: born digital [en]

