MinHash, Bloom Filters and LSH

Lets talk about some large scale algorithms widely used in the document world.

MinHash: Minhash is a really cute algorithm to determine how does a document compare to another. One simple way to compute document similarity is to compute the jaccard similarity between them. The jaccard similarity between the documents is computed as follows:
def jaccard_similarity(set1: set, set2: set):
    if len(set1) == 0 or len(set2) == 0:
        return 0.0
    common = len(set1.intersection(set2))
    if common == 0:
        return 0.0
    union = len(set1.union(set2))
    return common / union
One problem with Jaccard similarity is that it can take a lot of time to compute especially for large scale documents. The idea behind minhashing is to first operate on the space of document shingles. Shingles are basically any combination of k consecutive words of the document. For example, for the sentence "hi i am jaya" the possible 3 shingles are "hi i am" and "i am jaya". You can generate k-shingles as follows:
def shingles(words: list, k: int):
    ret = set()
    if k <= 0:
        return ret
    for i in range(len(words) - k + 1):
        sh = ' '.join(words[i: i + k])
        crc = shingle_crc(sh)
        ret.add(crc)
    return ret
Typically the shingles are created out of characters and not words.You want to pick k so that it is rare for a shingle to be present in a document. If it is too small, there might be a lot of spurious connections. Next, the goal is to generate hash of the shingles and select the minimum of the hash value. The hash function in the code below is a simple hash function of the form (a*x + b)%c where a and b are random values and c is a prime number.
MAX_ID = 0xffffffff
def shingle_crc(token: str):
    # Hash the shingle to a 32-bit integer.
    b = bytes(token, 'utf-8')
    return binascii.crc32(b) & MAX_ID

def hash_crc(crc: set, a: int, b: int, c: int):
    ret = MAX_ID
    for w in crc:
        h = (a * w + b) % c
        ret = min(ret, h)
    return ret
Instead of computing a single hash function, we can use multiple hash functions. This adds to the robustness of process. Finally, the similarity between two documents can be computed as the hash similarity from the multiple hash functions.
def multi_hash_crc(crc: set, a: list, b: list, c: int):
    ret = []
    for i in range(len(a)):
        ret.append(hash_crc(crc, a[i], b[i], c))
    return ret
  
def hash_sim(h1: list, h2: list):
    n = 0.0
    for i in range(len(h1)):
        n = n + (h1[i] == h2[i])
    return n / len(h1)
Bloom filters: Let's talk about Bloom filters next. Bloom filter is a probabilistic datastructure to determine if a element belongs to the set. It allows false positives to slip through but not false negatives. Bloom filters are very useful when the dataset is very large or while processing streams. The idea behind a bloom filter is to use a bitarray of size m to represent a hashtable. To insert an element, it is hashed to an index in the bitarray and the position is set as 1 if not already set. To check the membership of an element, it is hashed and if the hash position is set as 1, it means a yes. As you can imagine, multiple elements can be hashed to the same position which means that there can be false positives. However, not having a 1 definitely means that the element is not present and hence it is not possible to get false negatives. In order to reduce the number of false positives, one strategy is to use multiple hash functions and set as 1 all the positions for the k hash functions. To determine the membership of an element, hash using the k hash functions and only return a 1 when all the k values are set. Here is some code for the bloom filters:
class BloomFilter(object):
    def __init__(self):
        self.N = 10000
        self.num_hash = 10
        self.max_hashid = 2**32 - 1
        self.hash_seed = np.random.randint(low=0, high=self.max_hashid, size=self.num_hash)
        self.bf = bitarray(self.N)

    def _hash(self, item, seed):
        return mmh3.hash(str(item), seed) % self.N

    def add(self, item):
        for i in range(self.num_hash):
            ind = self._hash(item, self.hash_seed[i])
            self.bf[ind] = 1

    def __contains__(self, item: int):
        for i in range(self.num_hash):
            ind = self._hash(item, self.hash_seed[i])
            if self.bf[ind] == 0:
                return False
        return True
LSH: Finally lets talk about LSH. The idea behind locality sensitive hashing is to hash similar elements to the same hash bucket unlike general hashing techniques where the idea is to minimize the collisions.

References:
  1. Chris McCormick's blog on Minhash
  2. Wikipedia article on MinHash

Comments

Popular posts from this blog

Perceptron Algorithm

Logistic Regression