RNNs, LSTMs, GRUs, ConvNets

Alright, in this next blog post let's jump quickly into some of the widely used DL models.

RNNs: RNNs are a class of ANNs that process sequences. An RNN iterates through the elements of the input sequence while maintaining a state that summarizes what it has seen so far. Each input sequence is treated as independent, so the state of the RNN is reset between different inputs. From the DL book, this can be explained using the following code:
import numpy as np

timesteps, input_features, output_features = 100, 32, 64
inputs = np.random.random((timesteps, input_features))   # one random "sequence"
state_t = np.zeros((output_features,))                   # initial state
W = np.random.random((output_features, input_features))
U = np.random.random((output_features, output_features))
b = np.random.random((output_features,))

for input_t in inputs:
    output_t = np.tanh(np.dot(W, input_t) + np.dot(U, state_t) + b)
    state_t = output_t   # the output becomes the state for the next timestep
The tanh squashes the values to between -1 and 1, which keeps the recurrent state bounded as it is fed back into the network at every timestep.

LSTMs: Long Short-Term Memory networks, as the name suggests, capture long-term dependencies that plain RNNs are incapable of capturing. RNNs, as we saw above, have a simple structure of repeating modules, where each module is a single tanh layer. One of the big problems with RNNs is vanishing gradients: the gradient shrinks as it backpropagates over time, so the earlier timesteps receive a tiny gradient update and hence barely learn. As a result, RNNs forget what they saw earlier in the sequence and effectively have only a short-term memory.
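
To make that concrete, here is a toy sketch of my own (not from the book): if every timestep multiplies the backpropagated gradient by roughly the same factor below 1, the signal reaching the early timesteps decays exponentially.

grad = 1.0
per_step_factor = 0.5      # assumed combined effect of the weight and tanh derivative per step
for t in range(20):        # backpropagate through 20 timesteps
    grad *= per_step_factor
print(grad)                # ~9.5e-07: the earliest timesteps get almost no learning signal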

LSTMs are very well explained in Colah's blog; here is the gist of what is mentioned there. The core ideas behind LSTMs are the cell state and the various gates. The cell state is the conveyor belt that transfers information along the whole sequence and acts as the memory of the network; because it carries information throughout the sequence processing, it reduces the effects of short-term memory. Information gets added to or removed from the cell state via gates, which use sigmoid activations. There are three different kinds of gates: the forget gate, the input gate and the output gate.

The forget gate decides what information should be forgotten. It takes the previous hidden state and the current input and passes them through a sigmoid; the output is between 0 and 1, where closer to 0 means forget and closer to 1 means remember. The input gate is used to update the cell state. The previous hidden state and the current input are passed through a sigmoid to decide which values will be updated; they are also passed through a tanh, and the tanh output is multiplied by the sigmoid output, so the sigmoid decides which information from the tanh output is kept. To compute the new cell state, the old cell state is multiplied by the forget vector (which can drop values from it), and then the output of the input gate is added. Finally, the output gate decides what the next hidden state should be. The pseudocode is as follows, taken from this excellent YouTube tutorial:
combine = concat(prev_ht, input)        # concatenate previous hidden state and current input
ft = forget_layer(combine)              # forget gate (sigmoid), values in (0, 1)
candidate = candidate_layer(combine)    # candidate values (tanh), values in (-1, 1)
it = input_layer(combine)               # input gate (sigmoid)
ct = prev_ct * ft + candidate * it      # new cell state: drop old info, add gated new info
ot = output_layer(combine)              # output gate (sigmoid)
ht = ot * tanh(ct)                      # new hidden state
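
As a minimal sketch, a single LSTM step in NumPy could look like the code below; the weight names (Wf, Wi, Wc, Wo), the shapes and the omitted biases are my own choices for illustration, not the tutorial's code.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, prev_ht, prev_ct, Wf, Wi, Wc, Wo):
    combine = np.concatenate([prev_ht, x_t])   # [h_{t-1}, x_t], biases omitted for brevity
    ft = sigmoid(Wf @ combine)                 # forget gate
    it = sigmoid(Wi @ combine)                 # input gate
    candidate = np.tanh(Wc @ combine)          # candidate cell values
    ct = prev_ct * ft + candidate * it         # new cell state
    ot = sigmoid(Wo @ combine)                 # output gate
    ht = ot * np.tanh(ct)                      # new hidden state
    return ht, ct

hidden, n_inputs = 4, 3
rng = np.random.default_rng(0)
Wf, Wi, Wc, Wo = (rng.standard_normal((hidden, hidden + n_inputs)) for _ in range(4))
ht, ct = lstm_step(rng.standard_normal(n_inputs), np.zeros(hidden), np.zeros(hidden), Wf, Wi, Wc, Wo)
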
GRUs: Gated recurrent units (GRUs) work on the same principle as LSTMs, except that they are somewhat cheaper to run but may not have the same representational power. GRUs get rid of the separate cell state and use the hidden state itself to transfer information. They have two gates, a reset gate and an update gate, where the update gate plays roughly the role of the LSTM's forget and input gates combined.
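
A minimal single-step GRU in NumPy could look like the sketch below (again my own illustration, with biases omitted and the weight names Wz, Wr, Wh assumed); note that there is no separate cell state, only the hidden state.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, prev_ht, Wz, Wr, Wh):
    combine = np.concatenate([prev_ht, x_t])
    zt = sigmoid(Wz @ combine)                                     # update gate
    rt = sigmoid(Wr @ combine)                                     # reset gate
    candidate = np.tanh(Wh @ np.concatenate([rt * prev_ht, x_t]))  # candidate hidden state
    return (1 - zt) * prev_ht + zt * candidate                     # new hidden state

hidden, n_inputs = 4, 3
rng = np.random.default_rng(1)
Wz, Wr, Wh = (rng.standard_normal((hidden, hidden + n_inputs)) for _ in range(3))
ht = gru_step(rng.standard_normal(n_inputs), np.zeros(hidden), Wz, Wr, Wh)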

ConvNets: Convolutional Neural Networks (ConvNets or CNNs) are a class of ANNs that have been shown to be very successful for image recognition tasks. The fundamental difference is that densely connected ANNs learn global patterns, whereas ConvNets learn local patterns. Once a ConvNet learns a pattern, it can recognize it anywhere in a new image. There are two important operations: convolution and max pooling. The convolution operation extracts small windows from the input feature map and multiplies each of them by a learned weight matrix called the convolution kernel. The max-pooling operation is similar, except that instead of multiplying by a learned weight matrix, the windows are transformed by a hard-coded max tensor operation.
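
Here is a rough single-channel NumPy sketch of those two operations; the shapes, the valid padding and the 2x2 pooling window are assumptions for illustration, not how a real framework implements them.

import numpy as np

def conv2d(image, kernel):
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]     # extract a small window
            out[i, j] = np.sum(window * kernel)    # multiply by the learned kernel and sum
    return out

def max_pool2d(feature_map, size=2):
    out_h, out_w = feature_map.shape[0] // size, feature_map.shape[1] // size
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * size:(i + 1) * size, j * size:(j + 1) * size]
            out[i, j] = window.max()               # hard-coded max instead of a learned kernel
    return out

image = np.random.random((6, 6))
kernel = np.random.random((3, 3))                  # in a real ConvNet this would be learned
print(max_pool2d(conv2d(image, kernel)).shape)     # (2, 2)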

Additional References:
1. THE book: Deep Learning with Python by Francois Chollet
2. A very nice YouTube video explaining LSTMs.
