MinHash, Bloom Filters and LSH
Lets talk about some large scale algorithms widely used in the document world. MinHash: Minhash is a really cute algorithm to determine how does a document compare to another. One simple way to compute document similarity is to compute the jaccard similarity between them. The jaccard similarity between the documents is computed as follows: def jaccard_similarity(set1: set, set2: set): if len(set1) == 0 or len(set2) == 0: return 0.0 common = len(set1.intersection(set2)) if common == 0: return 0.0 union = len(set1.union(set2)) return common / union One problem with Jaccard similarity is that it can take a lot of time to compute especially for large scale documents. The idea behind minhashing is to first operate on the space of document shingles. Shingles are basically any combination of k consecutive words of the document. For example, for the sentence "hi i am jaya" the possible 3 shingles are "hi i am" and "i am jaya...