Probabilistic Simhash Matching

Sood, Sadhan

View/ Open

SOOD-THESIS.pdf (361.6Kb)

Date

2012-10-19

Author

Sood, Sadhan

Metadata

Show full item record

Abstract

Finding near-duplicate documents is an interesting problem but the existing methods are not suitable for large scale datasets and memory constrained systems. In this work, we developed approaches that tackle the problem of finding near-duplicates while improving query performance and using less memory. We then carried out an evaluation of our method on a dataset of 70M web documents, and showed that our method works really well. The results indicated that our method could achieve a reduction in space by a factor of 5 while improving the query time by a factor of 4 with a recall of 0.95 for finding all near-duplicates when the dataset is in memory. With the same recall and same reduction in space, we could achieve an improvement in query-time by a factor of 4.5 while finding first the near-duplicate for an in memory dataset. When the dataset was stored on a disk, we could achieve an improvement in performance by 7 times for finding all near-duplicates and by 14 times when finding the first near-duplicate.

Citation

Sood, Sadhan (2011). Probabilistic Simhash Matching. Master's thesis, Texas A&M University. Available electronically from https : / /hdl .handle .net /1969 .1 /ETD -TAMU -2011 -08 -9813.