Probabilistic Simhash Matching
Abstract
Finding near-duplicate documents is an interesting problem but the existing methods are not suitable for large scale datasets and memory constrained systems. In this work, we developed approaches that tackle the problem of finding near-duplicates while improving query performance and using less memory. We then carried out an evaluation of our method on a dataset of 70M web documents, and showed that our method works really well. The results indicated that our method could achieve a reduction in space by a factor of 5 while improving the query time by a factor of 4 with a recall of 0.95 for finding all near-duplicates when the dataset is in memory. With the same recall and same reduction in space, we could achieve an improvement in query-time by a factor of 4.5 while finding first the near-duplicate for an in memory dataset. When the dataset was stored on a disk, we could achieve an improvement in performance by 7 times for finding all near-duplicates and by 14 times when finding the first near-duplicate.
Subject
Hamming distancenear-duplicate
similarity
search
finger- print
web crawl
clustering
web document
Citation
Sood, Sadhan (2011). Probabilistic Simhash Matching. Master's thesis, Texas A&M University. Available electronically from https : / /hdl .handle .net /1969 .1 /ETD -TAMU -2011 -08 -9813.