Posts

Two Tower

The two tower model is a popular model for retrieval, mainly because it can be set up efficiently at large scale. At a very high level, the model consists of two towers - a query (user) tower and a candidate (item) tower. Here are some practical tips and tricks on training a two tower model:

Candidate Sampling: To solve a multi-class, multi-label problem with a large number of classes, one trick often applied is candidate sampling, where you do not need to compute F(x, y) for every class y for every training example x. Candidate sampling involves constructing a training task such that you only update a subset of the classes. Some examples of candidate sampling algorithms are Noise Contrastive Estimation (NCE), Negative Sampling, Sampled Logistic, etc. (see the sketch after this excerpt).

Popularity correction: A typical two tower setup involves …

References: Candidate Sampling Tutorial by TensorFlow
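To make candidate sampling concrete, here is a minimal numpy sketch of in-batch negatives, one commonly used sampling scheme (not necessarily the exact setup from the TensorFlow tutorial); user_emb and item_emb are hypothetical tower outputs for a batch of positive (user, item) pairs:

    import numpy as np

    def in_batch_softmax_loss(user_emb: np.ndarray, item_emb: np.ndarray) -> float:
        # Score every user against every item in the batch: the diagonal
        # holds the positive pairs, the off-diagonal entries act as
        # sampled negatives.
        logits = user_emb @ item_emb.T                        # shape (B, B)
        logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        B = user_emb.shape[0]
        # Cross-entropy with the diagonal as the target class.
        return float(-log_probs[np.arange(B), np.arange(B)].mean())

The point is that each batch only ever touches B classes instead of the full catalog, which is what makes the large-scale setup tractable.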

My learnings of Karpathy 1-3

Andrej Karpathy has put together one of the most awesome video blogs on NNs and I finished watching the first three of them. Wanted to put out some of my ramblings on the same. These are my notes so that I internalize them well and also so that I do not forget :) Have become a bit forgetful of late (blame it on the forties! :| ). Lecture 1: Backpropagation is the core of any modern NN. Derivative of a function: if you slightly bump up x by h, how does f(x) respond: (f(x + h) - f(x)) / h. Lecture 2: Bigram model. Lecture 3: Basic language model
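That derivative definition is easy to check numerically; here is a tiny sketch (the function name is mine, not Karpathy's):

    def numerical_derivative(f, x: float, h: float = 1e-5) -> float:
        # Bump x up slightly by h and see how f responds.
        return (f(x + h) - f(x)) / h

    # d/dx of 3x^2 at x = 2 is 12.
    print(numerical_derivative(lambda x: 3 * x**2, 2.0))  # ~12.00003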

MinHash, Bloom Filters and LSH

Let's talk about some large scale algorithms widely used in the document world. MinHash: MinHash is a really cute algorithm to determine how a document compares to another. One simple way to compute document similarity is to compute the Jaccard similarity between them:

    def jaccard_similarity(set1: set, set2: set) -> float:
        # Jaccard = |intersection| / |union|; defined as 0.0 for empty inputs.
        if len(set1) == 0 or len(set2) == 0:
            return 0.0
        common = len(set1.intersection(set2))
        if common == 0:
            return 0.0
        union = len(set1.union(set2))
        return common / union

One problem with Jaccard similarity is that it can take a lot of time to compute, especially for large document collections. The idea behind minhashing is to first operate on the space of document shingles. Shingles are basically any combination of k consecutive words of the document. For example, for the sentence "hi i am jaya" the possible 3-shingles are "hi i am" and "i am jaya" …
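To sketch where minhashing goes from there, assuming each shingle can be hashed to an integer and each set is non-empty: with k simulated hash functions, the fraction of matching signature slots between two documents approximates their Jaccard similarity. Function names here are mine:

    import random

    def minhash_signature(shingles: set, num_hashes: int = 128, seed: int = 42) -> list:
        prime = (1 << 61) - 1
        rng = random.Random(seed)
        # Simulate independent hash functions h(x) = (a * x + b) % prime.
        # Note: Python's built-in hash() is salted per process, so signatures
        # are only comparable within a single run.
        params = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(num_hashes)]
        return [min((a * hash(s) + b) % prime for s in shingles) for (a, b) in params]

    def estimated_jaccard(sig1: list, sig2: list) -> float:
        # The probability that two min-hash values match equals the Jaccard similarity.
        return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)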

Universal Sentence Encoder

Ok so I might be a bit late to join the NLP bus but am so glad I boarded! Let's start with the Universal Sentence Encoder. The Universal Sentence Encoder was released a couple of years ago by Google and is widely appreciated by the NLP community as a quick way to generate a sentence embedding before any further processing is done on it. One reason not to use a naive encoding scheme based on term frequency is that it ignores word ordering and can report high similarity even when the meanings of the sentences are not the same. An example mentioned in the blog below shows that the sentences "it is cool" and "is it cool" have a high similarity. The original paper mentions two ways to encode a natural language sentence: a) Transformer encoder - this consists of 6 stacked transformer layers (each has a self-attention module followed by a feed-forward network). The self-attention takes care of the nearby context to generate the word embeddings. b) Deep averaging network - …
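For reference, here is roughly how one would get embeddings from the pretrained model on TF Hub and compare the two sentences above (the /4 model is, as far as I recall, the deep-averaging-network variant):

    import numpy as np
    import tensorflow_hub as hub

    embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
    a, b = np.asarray(embed(["it is cool", "is it cool"]))  # each is a 512-dim vector
    # Cosine similarity between the two sentences.
    print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))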

Tips for working better

Being assertive is needed the most for the role, without coming across as bossy. Every word, every sentence you say counts. If you say nothing, that is counted as a negative. Too much talk, too little talk, too much assertion, too little assertion - all bad. Try to find some allies to work with. Establish with your team that you know your shit technically. Have an opinion on everything that pertains to the team - from stand-ups to team meetings to calls to code reviews, etc. It is important not to lose your point in a discussion. Lastly, think about how your boss would have navigated your questions and other questions and problems.

Recommendations at YouTube

Let's take a look at some of the practical papers published on recommendation algorithms at YouTube. Paper 1: Davidson et al., The YouTube Video Recommendation System. One of the oldest papers around the topic is The YouTube Recommendation System. The paper mentions that users come to YouTube either for direct navigation (to locate a single video they found elsewhere), for search and goal-directed browsing (to find specific videos around a topic), or just to be entertained by the content they find. The recommender is a top-N recommender rather than a predictor. Challenges:
- poor metadata
- very large corpus size
- mostly short form content (under 10 min in length)
- user interactions are relatively short and noisy
- videos have a short life cycle, going from upload to viral on the order of days, requiring constant freshness
The goal is to find recommendations that are reasonably recent and fresh as well as relevant and diverse to the user's taste. The input data can be divided into two parts …
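The paper's co-visitation idea is easy to sketch: the relatedness of two videos is r(vi, vj) = cij / f(vi, vj), where cij counts sessions in which both videos were watched and f normalizes for global popularity. A minimal version, using f(vi, vj) = ci * cj as the normalizer (one simple choice in that spirit; the paper discusses normalization more carefully):

    from collections import Counter
    from itertools import combinations

    def relatedness_scores(sessions: list) -> dict:
        # sessions is a list of watch sessions, each a list of video ids.
        video_counts = Counter(v for s in sessions for v in set(s))
        pair_counts = Counter(
            tuple(sorted(p)) for s in sessions for p in combinations(set(s), 2)
        )
        return {
            (a, b): c / (video_counts[a] * video_counts[b])
            for (a, b), c in pair_counts.items()
        }

Top-N recommendations for a video are then just its highest-scoring related videos.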

Deep Reinforcement Learning

Let's talk about one of the difficult areas in ML - Deep Reinforcement Learning. Two of the most popular approaches in the space are policy gradients and deep Q-networks. An agent interacts with an environment and receives rewards. Policy search is finding a good set of parameters in the policy space. One way to explore the policy space is via the policy gradient approach, which evaluates the gradients of the rewards w.r.t. the parameters and then moves in the direction that maximizes reward. The policies themselves can be defined via, say, neural networks. In the case of supervised ML, we already know the best action from the set of actions and the NN could be trained by minimizing the cross-entropy loss between the estimated and target distributions. However, in RL, as we focus on long term reward, the reward itself could be delayed or sparse. This is known as the classic credit assignment problem. This problem is generally solved by summing up all …
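The truncated thought above is presumably the standard fix: assign each action the discounted sum of the rewards that followed it. A minimal sketch:

    def discounted_returns(rewards: list, gamma: float = 0.99) -> list:
        # Walk backwards, accumulating gamma-discounted future reward.
        returns = [0.0] * len(rewards)
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            returns[t] = running
        return returns

    # Rewards of [0, 0, 1] with gamma = 0.9: earlier actions get partial credit.
    print(discounted_returns([0.0, 0.0, 1.0], gamma=0.9))  # [0.81, 0.9, 1.0]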