Word2Vec

Ok, first blog post after a loooooooong time and the first one after baby #2. I have been wanting to catch up on my reading for so long, and what better way than to write about it and explain it myself. Let's begin with a model that has perhaps been beaten to death in the past few years. Word2Vec is a popular algorithm for generating word embeddings. The original algorithm was proposed by Mikolov et al. (references 1 and 2).

There are two key pieces of the model:

The Skip-gram model: In this model, we are given a corpus of (word, context) pairs, where a context is, for example, a nearby word within some window size. The goal is to train a neural network to predict, given an input word, the probability of every word in the vocabulary (with the hope that the context words end up with much higher probabilities).
We thus need to find the parameters that maximize this probability:

$$\arg\max_{\theta} \prod_{w \in \text{corpus}} \; \prod_{c \in C(w)} p(c \mid w; \theta)$$

where $C(w)$ is the set of contexts for the word $w$ in the corpus.
The conditional probability is usually parameterized with a softmax, since predicting the word probability is essentially multi-class classification:

$$p(c \mid w; \theta) = \frac{e^{v_c \cdot v_w}}{\sum_{c' \in C} e^{v_{c'} \cdot v_w}}$$

where $v_w$ and $v_c$ are the vector representations of the word and the context respectively, and $C$ is the set of all possible contexts. The intuition is to set the parameters so that the dot product $v_c \cdot v_w$ is maximized, which means that the vector representations of a word and its contexts should be as close as possible.
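To make the skip-gram setup concrete, here is a minimal sketch (not from the original tutorial code) of how the (word, context) training pairs can be generated from a tokenized sentence; the sentence and window size are made-up toy values:

# Minimal sketch of skip-gram (word, context) pair generation.
# The sentence and window_size are toy values for illustration.
sentence = ["the", "quick", "brown", "fox", "jumps"]
window_size = 2

pairs = []
for i, word in enumerate(sentence):
    for j in range(max(0, i - window_size), min(len(sentence), i + window_size + 1)):
        if j != i:
            pairs.append((word, sentence[j]))

print(pairs[:4])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]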

Negative Sampling: The next important intuition is that of negative sampling. If we examine the above equation carefully, the computation of the softmax denominator is expensive because the set of possible contexts can be very large. Negative sampling is based on the idea of noise-contrastive estimation (NCE). NCE converts the multi-class classification problem into a binary classification problem: the training data consists of (word, context) pairs from the corpus as the positive class and random word pairs as the negative class. Thus $p(D=1 \mid w, c; \theta)$ represents the probability that the pair $(w,c)$ came from the training data, and $p(D=0 \mid w, c; \theta)$ the probability that it did not. The binary classification objective is modeled with a sigmoid:

$$p(D=1 \mid w, c; \theta) = \frac{1}{1 + e^{-v_c \cdot v_w}}$$
In word2vec, a small number of negative samples is drawn from a unigram distribution, so that frequent words are more likely to be picked as negative samples.
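As a rough illustration (not the author's code), here is how negative samples could be drawn from a smoothed unigram distribution; the vocabulary counts below are made up, and the 3/4 smoothing exponent is the one used in the original word2vec paper:

import numpy as np

# Rough sketch: draw negative samples from a smoothed unigram distribution.
# word_counts is a made-up toy vocabulary; raising counts to the 3/4 power
# keeps frequent words likely, but not overwhelmingly likely, picks.
word_counts = np.array([1000.0, 400.0, 50.0, 10.0, 5.0])

probs = word_counts ** 0.75
probs /= probs.sum()

num_negative = 3
negative_word_ids = np.random.choice(len(word_counts), size=num_negative, p=probs)
print(negative_word_ids)  # e.g. [0 1 0]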

Here is a code snippet for word2vec using TensorFlow, taken from here:
# Construct the variables for the NCE loss
nce_weights = tf.Variable(
        tf.truncated_normal([vocabulary_size, embedding_size],
                            stddev=1.0 / math.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

# tf.nn.nce_loss draws num_sampled negative classes per batch internally
# and returns the per-example NCE loss; we average it over the batch.
nce_loss = tf.reduce_mean(
        tf.nn.nce_loss(weights=nce_weights,
                       biases=nce_biases,
                       labels=train_context,
                       inputs=embed,
                       num_sampled=num_sampled,
                       num_classes=vocabulary_size))

# Plain SGD with a learning rate of 1.0, minimizing the NCE loss.
optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(nce_loss)
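The snippet above assumes that embed, train_context, and the various sizes are defined earlier in the tutorial. As a rough sketch (not necessarily identical to the tutorial's code, with made-up sizes and illustrative names like train_inputs and embeddings), the surrounding setup in the same TF 1.x style would look like:

import math
import tensorflow as tf

vocabulary_size = 10000   # toy sizes for illustration
embedding_size = 128
batch_size = 64
num_sampled = 16          # negative samples per batch

# A batch of input word ids and the ids of their context words.
train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
train_context = tf.placeholder(tf.int32, shape=[batch_size, 1])

# Embedding matrix; embed holds the looked-up vectors for the input words.
embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)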
Additional Resources:
word2vec Explained by Yoav Goldberg and Omer Levy
Chris McCormick's blog
word2vec code

PS1: #2 is such a charmer, kills me with the smile everyday :)
PS2: I am a guilty mom today as the #1 was not happy with me for rushing her in the morning, just want to hug her now :(
