Posts

Siamese Network

Let's first talk about the problem of one-shot learning. One-shot learning is learning from a single training example. This problem occurs, for example, in an organization where you want to recognize faces but might have only one image of each employee. Using a convnet to output a multi-class label is not a great idea, as a small training set is not enough to train a classifier and it doesn't scale when new employees join. Instead, one way to handle this problem is to learn a similarity function between two images. One way to train a neural network to learn the similarity function is via a Siamese network. The Siamese network consists of two identical neural networks with shared parameters, so that it computes a distance function between the encodings of the two input images [Ref: DeepFace]. To define an objective function, one way is to use a triplet loss. In a triplet loss, there is an anchor image along with a positive example and a negative example (A, P, N). So what is requir...
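To make the triplet loss above concrete, here is a minimal numpy sketch of the loss on precomputed embeddings (this is my own illustration, not the post's code; the function name and the margin value of 0.2 are assumptions):

import numpy as np

def triplet_loss(f_a, f_p, f_n, margin=0.2):
    # f_a, f_p, f_n are the network's embeddings of the anchor,
    # positive and negative images (1-D numpy arrays).
    # The loss pushes ||f(A)-f(P)||^2 to be at least `margin`
    # smaller than ||f(A)-f(N)||^2.
    pos_dist = np.sum((f_a - f_p) ** 2)
    neg_dist = np.sum((f_a - f_n) ** 2)
    return max(pos_dist - neg_dist + margin, 0.0)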

ResNet

Let's next discuss a very famous NN architecture called ResNet. One of the key questions asked in deep learning is "Would deeper networks result in higher accuracy?" Intuitively this may seem plausible, but in practice it is observed that training accuracy starts to degrade with deeper networks. This is surprising because it is not caused by overfitting: we see the degradation in training error, not just test error. In fact, even though a deeper network could be constructed from its shallower counterpart by just adding identity mappings (and so should do no worse), trained deeper networks still show a similar degradation in training error. The degradation problem suggests that solvers might have difficulty approximating identity mappings with multiple nonlinear layers. One possible cause is the problem of vanishing/exploding gradients. ResNet addresses the problem by adding residual connections, or shortcut connections. Adding these connections makes it easier to learn the identity mapping, since: a[l+2] = g(z[l+2] + a[l]), where a[l] is the ski...
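As a rough sketch of the residual computation a[l+2] = g(z[l+2] + a[l]) (my own notation; the weight names and ReLU choice are assumptions, and shapes must allow the addition, otherwise a projection shortcut is needed):

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(a_l, W1, b1, W2, b2):
    # Plain path: two linear + ReLU layers.
    z1 = W1 @ a_l + b1
    a1 = relu(z1)
    z2 = W2 @ a1 + b2
    # Shortcut: add the block's input before the final activation,
    # i.e. a[l+2] = g(z[l+2] + a[l]).
    return relu(z2 + a_l)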

Optimization Algorithms

Gradient descent is a way of minimizing an objective function J(θ) by updating the model's parameters in the direction opposite to the gradient of the objective function. Batch Gradient Descent: Batch gradient descent computes the gradient of the cost function over the entire training set for a single update, which makes it very slow and intractable for very large datasets. The parameters are updated as follows: θ = θ - η ∇J(θ). Batch gradient descent is guaranteed to converge to the global minimum for convex problems. Stochastic Gradient Descent: SGD performs a parameter update for each training example. It is much faster and can be used to learn online, but because of the single-point updates it can be very noisy and cause the objective function to oscillate. It can keep oscillating and is not guaranteed to converge; however, with a decreasing learning rate it is known to converge almost surely. Mini-batch gradient descent: This is in...
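A minimal numpy sketch of the batch vs. stochastic update, using least squares as a stand-in objective (the objective, learning rate, and epoch count are my own example choices):

import numpy as np

# Sketch on J(theta) = 1/(2N) ||X theta - y||^2.
def batch_gradient_descent(X, y, lr=0.01, epochs=100):
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (X @ theta - y) / len(y)    # gradient over the full set
        theta -= lr * grad                        # one update per epoch
    return theta

def sgd(X, y, lr=0.01, epochs=100):
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in np.random.permutation(len(y)):   # one update per example
            grad = (X[i] @ theta - y[i]) * X[i]
            theta -= lr * grad
    return theta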

Normalizing Neural Networks

Let's look at a few different strategies for normalizing neural networks. Batch Normalization: Batch normalization is a type of layer that can adaptively normalize the data. Before we go deep into that, let's first examine the phenomenon of internal covariate shift. The change in the distribution of the internal nodes of a deep network over the course of training is called Internal Covariate Shift. This is disadvantageous because the layers need to continuously adapt to the changing distribution. The Batch Normalization transform takes care of the problem as follows: for a layer, normalize each feature dimension as x'(k) = (x(k) - E[x(k)]) / √Var[x(k)]. Simply normalizing the input can change what the layer can represent. For example, normalizing the inputs of a sigmoid would constrain them to its linear regime. To understand why, note that about 95% of the values of a Gaussian distribution lie within the range μ ± 2σ. Hence we need to make sure that the transformation inse...
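A small numpy sketch of the forward normalization described above, with the learned scale and shift (γ, β) that let the layer recover the identity; the epsilon for numerical stability is my own addition:

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # x: mini-batch of activations, shape (batch, features).
    mean = x.mean(axis=0)                    # per-feature E[x(k)]
    var = x.var(axis=0)                      # per-feature Var[x(k)]
    x_hat = (x - mean) / np.sqrt(var + eps)  # normalize each dimension
    # Learned scale/shift so the layer can still represent the identity.
    return gamma * x_hat + beta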

RNNs, LSTMs, GRUs, ConvNets

Alright, for the next blog post let's jump quickly into some of the widely used DL models. RNNs: RNNs are a class of ANNs that process sequences. RNNs iterate through the input sequence and maintain a state of the sequence seen so far. Every input sequence given to an RNN is considered independent, and the state of the RNN is reset between different inputs. From the DL book, the per-timestep computation can be explained using the following code:
import numpy as np
W = np.random.random((output_features, input_features))
U = np.random.random((output_features, output_features))
b = np.random.random((output_features,))
output_t = np.tanh(np.dot(W, input_t) + np.dot(U, state_t) + b)
state_t = output_t  # the output at time t becomes the state for the next step
The tanh squashes the values to between -1 and 1 and acts as a kind of regularization. LSTMs: Long Short-Term Memory networks, as the name suggests, capture long-term dependencies that plain RNNs are incapable of capturing. RNNs, as we saw above, have a simple structure of repeating modules where each module is a single tanh layer. One of ...
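For completeness, here is a sketch of the full forward loop that the per-timestep code above implies, in the same numpy style; the shape values (timesteps, input_features, output_features) are example numbers of my own:

import numpy as np

timesteps, input_features, output_features = 10, 32, 64
inputs = np.random.random((timesteps, input_features))

W = np.random.random((output_features, input_features))
U = np.random.random((output_features, output_features))
b = np.random.random((output_features,))

state_t = np.zeros((output_features,))   # initial state, reset per sequence
outputs = []
for input_t in inputs:                   # iterate over the timesteps
    output_t = np.tanh(np.dot(W, input_t) + np.dot(U, state_t) + b)
    outputs.append(output_t)
    state_t = output_t                   # carry the state to the next step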

Word2Vec

Ok, first blog post after a loooooooong time and the first one after baby #2. I have been wanting to catch up on readings for so long, and what better way than to write it up and explain it myself. Let's begin with a model which has perhaps been beaten to death in the past few years. Word2Vec is a popular algorithm for generating word embeddings. The original algorithm by Mikolov et al. was proposed in the following references (1 and 2). There are two key pieces of the model: The Skip-gram model: In this model, we are given a corpus of words and their contexts (a context, for example, is a nearby word within a window size). The goal is to train a NN to predict the probability of every word in the vocabulary given the input word (hopefully the probability of the context words would be much higher). We thus need to find the parameters that maximize the probability: argmax_θ Π p(c|w; θ), where c ranges over the contexts observed for the word w in the corpus. The conditional probability is usually ...
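A minimal numpy sketch of the skip-gram conditional probability p(c|w; θ) with a plain softmax over the vocabulary (my own variable names and sizes; real implementations replace the full softmax with negative sampling or hierarchical softmax):

import numpy as np

vocab_size, dim = 10000, 100
W_in = np.random.randn(vocab_size, dim) * 0.01   # input (word) embeddings
W_out = np.random.randn(vocab_size, dim) * 0.01  # output (context) embeddings

def p_context_given_word(c_idx, w_idx):
    scores = W_out @ W_in[w_idx]                 # score of every vocab word
    scores -= scores.max()                       # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[c_idx]                          # p(c | w; theta)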

Randomized SVD

Randomized SVD is a really cute technique for computing the SVD of a large-scale matrix. It proceeds as follows: Let A be the large m × n matrix, and let Q be an m × k matrix with orthonormal columns (QᵀQ = I) whose range approximates the range of A. A can thus be rewritten as A ≈ QQᵀA. The first step is to compute the SVD of B = QᵀA. As B is only k × n, which is much smaller, the SVD of B can be computed using standard methods, giving B = SΣVᵀ. Plugging this back, we have A ≈ QSΣVᵀ, which becomes the SVD of A with U = QS. The challenge here is constructing the orthonormal matrix Q efficiently. This is done as follows: Compute a Gaussian random matrix Ω of size n × k. Compute Y = AΩ (m × k). Compute the QR decomposition of Y to get the orthonormal matrix Q. The MATLAB code is as follows:
O = randn(n, k);                 % Gaussian random test matrix Omega
Y = A*O;                         % sample the range of A
[Q, R] = qr(Y, 0);               % economy-size QR gives orthonormal Q
B = Q'*A;                        % small k x n matrix
[S, Sigma, V] = svd(B, 'econ');  % SVD of the small matrix
U = Q*S;                         % left singular vectors of A
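For reference, here is a numpy sketch of the same procedure (my own translation of the MATLAB above, not the post's code):

import numpy as np

def randomized_svd(A, k):
    m, n = A.shape
    Omega = np.random.randn(n, k)        # Gaussian random test matrix
    Y = A @ Omega                        # sample the range of A (m x k)
    Q, _ = np.linalg.qr(Y)               # orthonormal basis for range(Y)
    B = Q.T @ A                          # small k x n matrix
    S, sigma, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ S                            # lift back: A ≈ U diag(sigma) Vt
    return U, sigma, Vt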