Recommendations at YouTube

Lets take a look at some of the practical papers published for recommendation algorithms at YouTube.

Paper 1: Davidson et al., The YouTube Video Recommendation System
One of the oldest papers around the topic is The Youtube Recommendation System. The paper mentions that users come to Youtube for either for direct navigation to locate the single video they found elsewhere or for search and goal directed browse to find specific videos around a topic or just to be entertained by the content they find. The recommender is a top-N recommender rather than a predictor. Challenges:
  1. poor metadata
  2. corpus size very large
  3. mostly short form (under 10 min length)
  4. user interactions are relatively short and noisy
  5. videos have a short life cycle going from upload to viral in the order of days requiring constant freshness
The goal is to find recommendations that are reasonably recent and fresh as well as relevant and diverse to the users taste. The input data can be divided into two parts - a) the raw video stream as well as the video metadata such as title, description, etc b) implicit or explicit feedback for the video. The data is noisy as metadata can be incomplete/incorrect or even absent. User data only captures a fraction of user's activity and only indirectly measures user's engagement and happiness.

Candidate Generation: The authors develop a related videos for a seed video by computing a relatedness score as follows:
r(vi, vj) = cij/ f(vi, vj) 
where cij is the co-visitation graph and we divide by some normalization constant e.g. the product of the global popularities of the two videos ci*cj.
After this, the candidates are generated by looking at the the past history of the user on the website as follows:
C1(S) = ∪ Ri  for vi in history
In order to expand the set of candidates a transitive set of the videos reachable from a distance n is considered.

Ranking: The signals are broadly categorized into 3 groups according to the 3 different stages of ranking: 1) video quality 2) user specificity and 3) diversification. Video quality measures the global popularity of the video. User specificity gives specific boost to videos that are specific to a user's taste and preferences. And finally, a linear combination of the three produces the ranking.

Paper 2: Covington et al., Deep Neural Networks for YouTube Recommendations
Another one of the widely cited paper is the paper by YouTube Deep Neural Networks for YouTube Recommendations. It is a very well written paper and I enjoyed reading it. The overall system consists of classical two stage information retrieval comprising of two stages - a) candidate generation and b) ranking model.
Candidate Generation : The goal of candidate generation is to prune the list of candidates to a smaller subset (say hundreds) which could be then used for ranking. One of the candidate generators proposed in the paper is a Deep Neural Network. The recommendation problem is treated as a multi class classification problem as:
p(watch = i | U, C) = e^(vi, u) / Σ e^(vi, u) 
In this paper it is assumed that the vi's which represent the embeddings of the video are learned via a word2vec style model. In order to handle the problem of extreme multi class classification (as there are millions of videos in youtube) the authors use importance sampling trick also used in NLP. The NN is a wide and deep network which learns the user's embedding along the way. Each user history is represented as an average of the video embeddings from the past watches. They also mentioned some feature engineering that they did along the way like scaling the double features in the range [0-1] which represents the cdf for that feature by fitting an interpolation using the quantiles.

Ranking: The ranking uses impression data to finally re-rank the videos obtained from the various candidate generators. The ranking also uses an architecture similar to the candidate generation however the output is a binary classifier or a weighted logistic regression. One thing that I observed was that the impression based positives were duration weighted by the personalized expected duration watch for the title and the negatives (impressions that did not yield to a play) were set as weight 1. The authors mention that the log odds learned as a result is E[T](1 + P) where P is the click probability and E[T] is the expected watch duration.

Paper 3: Chen et al., Top-K Off-Policy Correction for a REINFORCE Recommender System
The paper is about bringing in RL to the problem of recommendations. More specifically, the authors adapt the REINFORCE algorithm to the problem of neural candidate generation. For each user, consider the historical interactions with the system recording the actions taken by the recommender and the user feedback. Given the sequence of past interactions, the goal is to create a MDP with S, A, P, R, rho, gamma. The authors do an off-policy correction because the data is collected from production systems which could have come from disparate sources. Once you have a softmax as the last layer, it becomes a policy (because a policy is a probability over the action space) The authors mention using Boltzman exploration in softmax for making the policy more explore or exploit. Boltzman temperature coefficient is a very neat trick which proceeds as:
e^(w^tx * t) / ∑ e^(w^tx * t)
Here, if t is very large, all the scores become similar and we move towards exploration. On the other hand, if t is small, we move more towards exploitation.

Comments

Popular posts from this blog

MinHash, Bloom Filters and LSH

Perceptron Algorithm

Logistic Regression