
dc.contributor.advisor: Loguinov, Dmitri
dc.creator: Ahmed, Sarker Tanzir
dc.date.accessioned: 2016-09-16T14:20:09Z
dc.date.available: 2018-08-01T05:57:22Z
dc.date.created: 2016-08
dc.date.issued: 2016-08-05
dc.date.submitted: August 2016
dc.identifier.uri: https://hdl.handle.net/1969.1/157811
dc.description.abstract: This dissertation presents a modeling framework for the intermediate data generated by external-memory sorting algorithms (e.g., merge sort, bucket sort, hash sort, replacement selection) that are well known, yet lack accurate models of the volume of data they produce. The motivation comes from the IRLbot crawl experience in June 2007, where a collection of scalable, high-performance external sorting methods was used to handle such problems as URL uniqueness checking, real-time frontier ranking, budget allocation, and spam avoidance, all of which are monumental tasks, especially when limited to the resources of a single machine. We discuss this crawl experience in detail, use novel algorithms to collect data from the crawl image, and then advance to a broader problem: sorting arbitrarily large-scale data using limited resources and accurately capturing the required cost (e.g., time and disk usage). To solve these problems, we present an accurate model of uniqueness probability, i.e., the probability of encountering previously unseen data, and use it to analyze the amount of intermediate data generated by the above-mentioned sorting methods. We also demonstrate how the intermediate data volume and runtime vary with the input properties (e.g., frequency distribution), hardware configuration (e.g., main memory size, CPU and disk speed), and the choice of sorting method, and show that our proposed models accurately capture this variation. Furthermore, we propose a novel hash-based method for replacement selection sort, together with its model for duplicate data, a case where the existing literature is limited to random or mostly unique data. Note that classic replacement selection can increase the length of sorted runs and reduce their number, both of which directly benefit the merge step of external sorting. However, because its priority-queue-assisted sort operation is inherently slow, the application of replacement selection has been limited. Our hash-based design solves this problem by making the sort phase significantly faster than in existing methods, making replacement selection a preferred choice. The presented models also enable exact analysis of the hit rates of Least-Recently-Used (LRU) and Random Replacement caches, which are used as part of the algorithms presented here. These cache models are more accurate than those in the existing literature, which mostly assume an infinite stream of data, whereas our models remain accurate on finite streams (e.g., sampled web graphs, click streams) as well. In addition, we present accurate models for various crawl characteristics of random graphs, which can forecast a number of aspects of the crawl experience based on graph properties (e.g., degree distribution). All of these models are presented under a unified umbrella to analyze a set of large-scale information-processing algorithms that are streamlined for high performance and scalability.
dc.format.mimetype: application/pdf
dc.language.iso: en
dc.subject: MapReduce
dc.subject: BigData
dc.subject: Web crawling
dc.subject: External Sorting
dc.title: Analysis, Modeling, and Algorithms for Scalable Web Crawling
dc.type: Thesis
thesis.degree.department: Computer Science and Engineering
thesis.degree.discipline: Computer Science
thesis.degree.grantor: Texas A & M University
thesis.degree.name: Doctor of Philosophy
thesis.degree.level: Doctoral
dc.contributor.committeeMember: Bettati, Riccardo
dc.contributor.committeeMember: Caverlee, James
dc.contributor.committeeMember: Reddy, A. L. Narasimha
dc.type.material: text
dc.date.updated: 2016-09-16T14:20:17Z
local.embargo.terms: 2018-08-01
local.etdauthor.orcid: 0000-0002-8974-8773
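Note on the abstract: for readers unfamiliar with the classic replacement selection that the proposed hash-based design improves on, a minimal sketch follows. This is a hypothetical Python illustration, not the dissertation's implementation; the function name replacement_selection, the memory_size parameter, and the use of plain comparable values are our simplifications of the technique described in the abstract.

```python
import heapq
from itertools import islice

def replacement_selection(records, memory_size):
    """Split `records` into sorted runs using classic replacement selection:
    a min-heap holds `memory_size` records in memory, the smallest is
    repeatedly moved to the current run, and records too small to extend
    the current run are deferred to the next one. For randomly ordered
    input the expected run length is roughly twice the memory size, which
    reduces the number of runs the merge phase must process.
    Assumes records are comparable and never None (None is used as a sentinel)."""
    it = iter(records)
    heap = list(islice(it, memory_size))    # fill available "memory"
    heapq.heapify(heap)
    pending = []                            # records deferred to the next run
    run = []
    while heap:
        smallest = heapq.heappop(heap)
        run.append(smallest)                # emit the minimum to the current run
        nxt = next(it, None)
        if nxt is not None:
            if nxt >= smallest:
                heapq.heappush(heap, nxt)   # still fits in the current sorted run
            else:
                pending.append(nxt)         # smaller than the last output: defer
        if not heap:                        # current run is complete
            yield run
            run, heap, pending = [], pending, []
            heapq.heapify(heap)             # start the next run from deferred records
```

For example, with memory_size=3 and input [5, 1, 9, 2, 8, 3, 7, 4, 6], this produces the runs [1, 2, 5, 8, 9] and [3, 4, 6, 7], whereas a plain load-sort-write pass over the same amount of memory would produce three runs of three records each; fewer, longer runs are exactly the benefit to the merge step that the abstract mentions, and the dissertation's contribution is to obtain it without the slow priority-queue sort.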

