Posts

Showing posts from April, 2019

Optimization Algorithms

Gradient descent is a way of minimizing an objective function J(θ) by updating the model's parameters in the direction opposite to the gradient of the objective function.

Batch Gradient Descent : Batch gradient descent computes the gradient of the cost function over the entire training set for every single update, which makes it very slow and intractable for very large datasets. The parameters are updated as follows: θ = θ - η ∇J(θ), where η is the learning rate. With a suitable learning rate, batch gradient descent is guaranteed to converge to the global minimum for convex problems.

Stochastic Gradient Descent : SGD performs a parameter update for each training example. Each update is much cheaper, so it is usually much faster and can be used to learn online, but because every update is based on a single point the updates can be very noisy and cause the objective function to oscillate. With a fixed learning rate it can keep oscillating indefinitely and is not guaranteed to converge. With a suitably decreasing learning rate, however, it is known to converge almost surely.

Mini-batch gradient descent : This is in...
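To make the difference between the two update schemes concrete, here is a minimal NumPy sketch. It is an illustration rather than code from the post: the least-squares objective, the synthetic noiseless data, the learning rates and the 1/(1 + k·step) decay schedule are all assumptions chosen to keep the example small.

```python
import numpy as np

def grad(theta, X, y):
    """Gradient of J(theta) = mean((X @ theta - y)**2) / 2 over the rows of X."""
    return X.T @ (X @ theta - y) / len(y)

def batch_gd(X, y, eta=0.1, epochs=500):
    """Batch gradient descent: one update per pass over the full training set."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        theta = theta - eta * grad(theta, X, y)   # theta = theta - eta * grad J(theta)
    return theta

def sgd(X, y, eta0=0.05, decay=0.01, epochs=50, seed=0):
    """SGD: one update per training example, with a decreasing learning rate."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    step = 0
    for _ in range(epochs):
        for i in rng.permutation(len(y)):          # visit examples in random order
            eta = eta0 / (1.0 + decay * step)      # assumed decay schedule
            theta = theta - eta * grad(theta, X[i:i + 1], y[i:i + 1])
            step += 1
    return theta

# Synthetic, noiseless linear data (hypothetical, for illustration only).
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(200, 2))
true_theta = np.array([2.0, -3.0])
y = X @ true_theta

theta_batch = batch_gd(X, y)
theta_sgd = sgd(X, y)
print("batch GD:", theta_batch)
print("SGD:     ", theta_sgd)
```

Both runs should end up close to `true_theta`; the batch version makes 500 expensive updates, while the SGD version makes 200 cheap updates per epoch.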

Normalizing Neural Networks

Let's look at various strategies for normalizing neural networks:

Batch Normalization : Batch normalization is a type of layer that can adaptively normalize the data. Before we go deep into that, let's first examine the phenomenon of internal covariate shift. The change in the distribution of the internal nodes of a deep network over the course of training is called Internal Covariate Shift. This is disadvantageous because the layers need to continuously adapt to the changing distribution. The Batch Normalization transform takes care of the problem as follows: for a layer, normalize each feature dimension k as: x'(k) = (x(k) - E[x(k)]) / √Var[x(k)]

Simply normalizing the input can change what the layer can represent. For example, normalizing the inputs of a sigmoid would constrain them to its linear regime. To understand why, note that 95% of the values of a Gaussian distribution lie within the range μ ± 2σ, so after normalization most inputs fall in roughly [-2, 2], where the sigmoid is close to linear. Hence we need to make sure that the transformation inse...
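The normalization step and the 95% observation can be checked numerically. The following is a small sketch, not the post's own code: it covers only the per-feature normalization described above, the `eps` term is an assumption added for numerical stability, and the batch is synthetic Gaussian data.

```python
import numpy as np

def normalize_features(x, eps=1e-5):
    """Normalize each feature dimension (column) over the batch dimension (rows).

    eps is an assumed small constant that guards against division by zero.
    """
    mean = x.mean(axis=0)                 # E[x(k)] for every feature k
    var = x.var(axis=0)                   # Var[x(k)] for every feature k
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(10000, 4))   # hypothetical activations

x_hat = normalize_features(batch)
frac_within_2 = np.mean(np.abs(x_hat) <= 2.0)

print("per-feature mean:", x_hat.mean(axis=0))
print("per-feature std: ", x_hat.std(axis=0))
print("fraction of values in [-2, 2]:", frac_within_2)
```

After normalization each feature has approximately zero mean and unit variance, and roughly 95% of the values land in [-2, 2], which is the region where a sigmoid behaves almost linearly.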

RNNs, LSTMs, GRUs, ConvNets

Alright, in this next blog post let's jump quickly into some of the widely used DL models.

RNNs : RNNs are a class of ANNs that process sequences. An RNN iterates through the input sequence and maintains a state summarizing the sequence so far. Every input sequence given to the RNN is considered independent, and the state of the RNN is reset between different inputs. From the DL book, a single timestep can be explained using the following code:

    import numpy as np

    # Weights: W acts on the current input, U acts on the previous state, b is the bias.
    W = np.random.random((output_features, input_features))
    U = np.random.random((output_features, output_features))
    b = np.random.random((output_features,))

    # Combine the current input with the previous state.
    output_t = np.tanh(np.dot(W, input_t) + np.dot(U, state_t) + b)
    # The output becomes the state for the next timestep.
    state_t = output_t

The tanh keeps the values between -1 and 1, which keeps the state bounded and loosely acts as a form of regularization.

LSTMs : Long Short Term Memory networks, as the name suggests, capture long-term dependencies that plain RNNs struggle to capture. RNNs, as we saw above, have a simple structure of repeating modules where each module is a single tanh layer. One of ...
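The single-step update from the RNN paragraph can be unrolled over a whole sequence. This is a minimal sketch under stated assumptions: the function name `rnn_forward`, the sizes and the random inputs are hypothetical, and the state is initialized to zeros at the start of each call to reflect the reset between independent sequences.

```python
import numpy as np

def rnn_forward(inputs, W, U, b):
    """Run a simple tanh RNN over one sequence of shape (timesteps, input_features)."""
    state_t = np.zeros(U.shape[0])        # state is reset for every new sequence
    outputs = []
    for input_t in inputs:                # iterate over timesteps
        output_t = np.tanh(W @ input_t + U @ state_t + b)
        outputs.append(output_t)
        state_t = output_t                # output becomes the next state
    return np.stack(outputs)              # shape (timesteps, output_features)

# Hypothetical sizes, chosen only for illustration.
timesteps, input_features, output_features = 100, 32, 64

rng = np.random.default_rng(0)
W = rng.random((output_features, input_features))
U = rng.random((output_features, output_features))
b = rng.random((output_features,))

sequence_a = rng.random((timesteps, input_features))
sequence_b = rng.random((timesteps, input_features))

outputs_a = rnn_forward(sequence_a, W, U, b)
outputs_b = rnn_forward(sequence_b, W, U, b)   # starts again from a zero state
print(outputs_a.shape, outputs_b.shape)
```

Because the state is created inside the function, processing `sequence_b` is unaffected by whatever state was reached at the end of `sequence_a`.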