Show simple item record

dc.contributor.advisorLoguinov, Dmitri
dc.creatorSood, Sadhan
dc.date.accessioned2012-10-19T15:28:44Z
dc.date.accessioned2012-10-22T17:59:19Z
dc.date.available2012-10-19T15:28:44Z
dc.date.available2012-10-22T17:59:19Z
dc.date.created2011-08
dc.date.issued2012-10-19
dc.date.submittedAugust 2011
dc.identifier.urihttps://hdl.handle.net/1969.1/ETD-TAMU-2011-08-9813
dc.description.abstractFinding near-duplicate documents is an interesting problem, but existing methods are not suitable for large-scale datasets and memory-constrained systems. In this work, we developed approaches that find near-duplicates while improving query performance and using less memory. We evaluated our method on a dataset of 70M web documents. With the dataset in memory, our method reduced space by a factor of 5 and improved query time by a factor of 4 at a recall of 0.95 when finding all near-duplicates. At the same recall and the same reduction in space, query time improved by a factor of 4.5 when finding the first near-duplicate in an in-memory dataset. With the dataset stored on disk, performance improved by a factor of 7 when finding all near-duplicates and by a factor of 14 when finding the first near-duplicate.en
dc.format.mimetypeapplication/pdf
dc.language.isoen_US
dc.subjectHamming distanceen
dc.subjectnear-duplicateen
dc.subjectsimilarityen
dc.subjectsearchen
dc.subjectfingerprinten
dc.subjectweb crawlen
dc.subjectclusteringen
dc.subjectweb documenten
dc.titleProbabilistic Simhash Matchingen
dc.typeThesisen
thesis.degree.departmentComputer Science and Engineeringen
thesis.degree.disciplineComputer Scienceen
thesis.degree.grantorTexas A&M Universityen
thesis.degree.nameMaster of Scienceen
thesis.degree.levelMastersen
dc.contributor.committeeMemberCaverlee, James
dc.contributor.committeeMemberReddy, A.L. N.
dc.type.genrethesisen
dc.type.materialtexten


