Posts

Tips for working better.

The role demands assertiveness above all, without coming across as bossy. Every word and every sentence you say counts. If you say nothing, that counts as a negative. Too much talk, too little talk, too much assertion, too little assertion: all bad. Try to find some allies to work with. Establish with your team that you know your shit technically. Have an opinion on everything that pertains to the team, from stand-ups to team meetings to calls to code reviews. It is important not to lose your point in a discussion. Lastly, think about how your boss would have navigated your questions, as well as other questions and problems.

Recommendations at YouTube

Let's take a look at some of the practical papers published on recommendation algorithms at YouTube. Paper 1: Davidson et al., The YouTube Video Recommendation System. One of the oldest papers on the topic is The YouTube Video Recommendation System. The paper mentions that users come to YouTube either for direct navigation, to locate a single video they found elsewhere; for search and goal-directed browsing, to find specific videos around a topic; or just to be entertained by the content they find. The recommender is a top-N recommender rather than a predictor. Challenges: poor metadata; a very large corpus; mostly short-form videos (under 10 minutes long); user interactions that are relatively short and noisy; and videos with a short life cycle, going from upload to viral in the order of days, which requires constant freshness. The goal is to find recommendations that are reasonably recent and fresh as well as relevant and diverse to the user's taste. The input data can be divided into two p...

Deep Reinforcement Learning

Let's talk about one of the difficult areas in ML: Deep Reinforcement Learning. Two of the most popular approaches in the space are policy gradients and deep Q-networks. An agent interacts with the user within an environment and receives rewards. Policy search means finding a good set of parameters in the policy space. One way to explore the policy space is the policy gradient approach, which evaluates the gradients of the rewards w.r.t. the parameters and then moves in the direction that maximizes reward. The policy itself can be defined by, say, a neural network. In supervised ML, we already know the best action from the set of actions, and the NN can be trained by minimizing the cross-entropy loss between the estimated and target distributions. In RL, however, we focus on long-term reward, and the reward itself can be delayed or sparse. This is known as the classic credit assignment problem. This problem is generally solved by summing up all...
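As a rough illustration of the policy gradient idea above, here is a minimal sketch on a two-armed bandit, using a REINFORCE-style update on a softmax policy. The bandit setup, the function names (`softmax`, `train_bandit_policy`) and the hyperparameters are assumptions made for this example and do not come from the post. Rewards here are immediate, so the sketch does not touch the credit assignment problem.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit_policy(rewards, steps=2000, lr=0.1, seed=0):
    """REINFORCE-style update on a softmax policy (illustrative only).

    rewards[k] is the fixed reward of arm k (a hypothetical toy environment).
    """
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)  # policy parameters, one per action
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[action]
        # Gradient of log pi(action) w.r.t. theta[k] is 1[k == action] - probs[k].
        for k in range(len(theta)):
            grad_log_pi = (1.0 if k == action else 0.0) - probs[k]
            # Move the parameters in the direction that increases expected reward.
            theta[k] += lr * reward * grad_log_pi
    return softmax(theta)


final_probs = train_bandit_policy(rewards=[0.0, 1.0])
print(final_probs)
```

With these toy rewards, the policy should end up placing most of its probability on the second arm, since only that arm ever produces a non-zero update.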

Transfer learning & Multi-task learning

In transfer learning, you learn sequentially, i.e. learn from task A and transfer that knowledge to task B. In multi-task learning, by contrast, you learn from multiple tasks simultaneously. In transfer learning, you first train the NN on a big task. Then, for a smaller task, you retrain only the weights of the last layer (or the last 1-2 layers). You could also retrain all the parameters of the NN; in that case the first stage is called pre-training, because you are initializing the weights of the NN from a pre-trained model. Updating the weights of the pre-trained model is also called fine-tuning. A few ways in which fine-tuning is done: Truncate the last layer of the NN and replace it with a new layer that learns the new output. Use a smaller learning rate to train the network. Freeze the weights of the first few layers of the NN. When does transfer learning make sense? You have a lot of data from the task that you are originally learning from and small amount of data for the ...
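The fine-tuning recipe above (freeze the early layers, replace the last layer, train with a small learning rate) can be sketched in a few lines of plain Python. Everything in this sketch is hypothetical: the "pre-trained" weights, the tiny task B dataset, and the helper names are made up for illustration, and a real setup would use a deep learning framework rather than hand-written gradient steps.

```python
import math

# Hypothetical "pre-trained" first layer from task A (values invented for the example).
pretrained_hidden = [[0.5, -0.3], [0.8, 0.2], [-0.4, 0.9]]  # 3 hidden units, 2 inputs


def hidden_features(x, weights):
    """Frozen feature extractor: tanh of a linear map of the input."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]


def predict(x, hidden_weights, head):
    feats = hidden_features(x, hidden_weights)
    return sum(h * f for h, f in zip(head, feats))


def mse(data, hidden_weights, head):
    return sum((predict(x, hidden_weights, head) - y) ** 2 for x, y in data) / len(data)


def fine_tune_head(data, hidden_weights, lr=0.05, epochs=200):
    """Train only a freshly initialized last layer; hidden_weights are never updated."""
    head = [0.0] * len(hidden_weights)  # the replaced last layer
    for _ in range(epochs):
        for x, y in data:
            feats = hidden_features(x, hidden_weights)
            err = sum(h * f for h, f in zip(head, feats)) - y
            # Gradient step on the squared error, applied to the head only.
            head = [h - lr * 2 * err * f for h, f in zip(head, feats)]
    return head


# Small, made-up dataset for task B.
small_task_b = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 0.0), ([-1.0, 0.5], -1.0)]

frozen_copy = [row[:] for row in pretrained_hidden]
loss_before = mse(small_task_b, pretrained_hidden, [0.0] * len(pretrained_hidden))
new_head = fine_tune_head(small_task_b, pretrained_hidden)
loss_after = mse(small_task_b, pretrained_hidden, new_head)
print(loss_before, loss_after)
```

Only `new_head` changes during training; the first layer is reused as-is, which is why a small dataset can be enough for the new task.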

Wide and Deep learning for Recommender Systems

Let's discuss this widely cited Google paper on Wide and Deep learning. The paper mentions that an important challenge in recommender systems is to achieve both memorization and generalization. Memorization is learning the frequent co-occurrence of items and features, whereas generalization explores new feature combinations that have rarely occurred in the past. Logistic regression (LR) models have been widely used in Google settings and generally have sparse features with one-hot encoding. Memorization and generalization can be added to such models through cross-product transformations in the feature space, but these require a lot of manual feature engineering. On the other hand, there are embedding-based models that learn a low-dimensional embedding for each of the categorical features. One of the problems with embedding-based models is that they produce non-zero predictions even when the user-item matrix is high-rank, as is the case with niche users. To solve this problem the authors present a very neat idea - use both wide ...
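To make the cross-product transformation mentioned above concrete, here is a minimal sketch over one-hot (binary) features: the crossed feature is 1 only when every constituent feature is 1. The feature names and the `cross_product` helper are hypothetical and chosen only for illustration.

```python
def cross_product(example, feature_names):
    """Return 1 only if every named binary feature is 1 in the example, else 0."""
    return int(all(example.get(name, 0) == 1 for name in feature_names))


# A made-up one-hot encoded example.
example = {"language=en": 1, "installed_app=music": 1, "installed_app=news": 0}

both_active = cross_product(example, ["language=en", "installed_app=music"])
one_inactive = cross_product(example, ["language=en", "installed_app=news"])
print(both_active, one_inactive)  # prints: 1 0
```

Each such cross has to be chosen by hand, which is the manual feature engineering cost the excerpt refers to.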

Siamese Network

Let's first talk about the problem of one-shot learning. One-shot learning is learning from a single training example. This problem occurs, for example, in an organization where you want to recognize faces and might have only one picture of each employee's face. Using a convnet to output a multi-class label is not a great idea, as such a small training set is not enough to train a classifier and the approach doesn't scale to new employees joining. Instead, one way to handle this problem is to learn a similarity function between two images. One way to train a neural network to learn the similarity function is via a Siamese network. The Siamese network consists of two identical neural networks with the same parameters, so that it computes a distance function between the encodings of the two input images [Ref: DeepFace]. To define an objective function, one way is to use a triplet loss. In a triplet loss, there is an anchor image along with a positive example and a negative example (A, P, N). So what is requir...
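For reference, a commonly used margin-based form of the triplet loss over (A, P, N) encodings can be sketched as below. The excerpt is cut off before it states its own formulation, so the squared Euclidean distance, the `margin` value and the toy encodings here are assumptions, not necessarily what the post uses.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two encodings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet loss (assumed form): zero once the negative is
    farther from the anchor than the positive by at least `margin`."""
    return max(squared_distance(anchor, positive)
               - squared_distance(anchor, negative)
               + margin, 0.0)


# Toy 2-d encodings standing in for the outputs of the shared network.
anchor, positive, negative = [0.0, 0.0], [1.0, 0.0], [3.0, 4.0]

easy = triplet_loss(anchor, positive, negative)  # positive is much closer: no loss
hard = triplet_loss(anchor, negative, positive)  # roles swapped: large loss
print(easy, hard)
```

In a Siamese setup, all three encodings would come from the same network with shared parameters, and the loss would be minimized over many triplets.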

ResNet

Let's next discuss a very famous NN architecture called ResNet. One of the key questions asked in deep learning is "Would deeper networks result in higher accuracy?" Intuitively, this may make sense, but in practice it is observed that the training accuracy starts to degrade with deeper networks. This is surprising because it is not caused by overfitting: the degradation shows up in the training error, not just the test error. In fact, a deeper network constructed from a shallower one by just adding identity mappings should do no worse than its shallower counterpart, yet deeper networks trained in practice still show this degradation. The degradation problem suggests that solvers might have difficulty in approximating identity mappings with multiple nonlinear layers. One possible cause is the problem of vanishing/exploding gradients. ResNet addresses the problem by adding residual connections, also called shortcut connections. Adding these connections makes it easier to learn the identity mapping, as: a[l+2] = g(z[l+2] + a[l]), where a[l] is the ski...
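The formula a[l+2] = g(z[l+2] + a[l]) can be checked with a small sketch. Assuming g is ReLU (so a[l] is non-negative) and the weights and bias of layer l+2 are zero, z[l+2] is zero and the block simply returns a[l], which is why the identity mapping is easy to represent. The helper names and the toy weights below are made up for illustration.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def linear(weights, bias, v):
    """Compute W v + b for a list-of-rows weight matrix."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]


def residual_block(a_l, w1, b1, w2, b2):
    """Two-layer block with a skip connection: a[l+2] = g(z[l+2] + a[l]), g = ReLU."""
    a_l1 = relu(linear(w1, b1, a_l))          # a[l+1]
    z_l2 = linear(w2, b2, a_l1)               # z[l+2]
    return relu([z + a for z, a in zip(z_l2, a_l)])


a_l = [0.5, 2.0, 0.0]                          # non-negative, as after a ReLU
w1 = [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6], [0.7, -0.8, 0.9]]  # arbitrary toy weights
b1 = [0.0, 0.1, -0.1]
zero_w = [[0.0] * 3 for _ in range(3)]
zero_b = [0.0] * 3

# With layer l+2 zeroed out, the skip connection passes a[l] through unchanged.
identity_out = residual_block(a_l, w1, b1, zero_w, zero_b)
print(identity_out)  # prints: [0.5, 2.0, 0.0]
```

A plain (non-residual) block with the same zeroed layer would output all zeros instead, so it would have to learn non-trivial weights just to reproduce its input.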