Normalizing Neural Networks
Let's look at several strategies for normalizing neural networks:
Batch Normalization: Batch normalization is a type of layer that adaptively normalizes the data. Before going into the details, let's first examine the phenomenon of internal covariate shift. The change in the distribution of the internal nodes of a deep network over the course of training is called Internal Covariate Shift. This is disadvantageous because the layers need to continuously adapt to the changing distribution. The Batch Normalization transform addresses this problem as follows:
- For a layer, normalize each feature dimension as:
x'(k) = (x(k) - E[x(k)]) / √Var[x(k)]
- Simply normalizing the inputs of a layer can change what the layer can represent. For example, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity: about 95% of the values of a Gaussian distribution lie within μ ± 2σ, and the sigmoid is approximately linear in that region around zero. Hence we need to make sure that the transformation inserted into the network can represent the identity transform. To do this, two learnable parameters are introduced per dimension, which scale and shift the normalized value:
y(k) = γ(k)x'(k) + β(k)
Note that these parameters are learned along with the original model parameters. Further, the BN transform is differentiable, which is important because during training we need to backpropagate the gradient of the loss through this transformation as well.
- Normalizing with mini-batch statistics is neither desirable nor necessary at test time. Instead, population statistics are estimated during training: the expectation of the mini-batch means E[μB] and an unbiased estimate of the variance from the mini-batch variances. These fixed statistics are then used to normalize activations at inference (see the sketch below).
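To make the two modes concrete, here is a minimal NumPy sketch of the transform described above. The function name, the shapes, the eps value, and the use of an exponential moving average with a momentum hyperparameter are my own assumptions for illustration; the original paper describes estimating the population statistics by averaging over mini-batches, and frameworks differ in the details.

```python
import numpy as np

def batch_norm(x, gamma, beta, running_mean, running_var,
               training=True, momentum=0.1, eps=1e-5):
    """x: (batch, features). gamma, beta: learned per-feature scale and shift."""
    if training:
        mu = x.mean(axis=0)                 # E[x(k)] over the mini-batch
        var = x.var(axis=0)                 # Var[x(k)] over the mini-batch
        # Track population statistics for test time (assumed: exponential
        # moving average, with the unbiased variance estimate).
        running_mean = (1 - momentum) * running_mean + momentum * mu
        running_var = (1 - momentum) * running_var + momentum * x.var(axis=0, ddof=1)
    else:
        mu, var = running_mean, running_var  # fixed statistics at inference
    x_hat = (x - mu) / np.sqrt(var + eps)    # x'(k)
    y = gamma * x_hat + beta                 # y(k) = γ(k) x'(k) + β(k)
    return y, running_mean, running_var
```

The eps term only guards against division by zero; the essential point is that the same γ and β are applied whether the statistics come from the current mini-batch (training) or from the running estimates (inference).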
Layer Normalization: Layer normalization attempts to address some of the shortcomings of batch normalization. It tackles covariate shift by normalizing with statistics collected from all units within a layer for a single input example, instead of computing per-dimension statistics over a mini-batch. Because it does not depend on the batch, it results in more stable hidden-to-hidden dynamics, for example in recurrent networks.
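The contrast with batch normalization is just the axis over which the statistics are taken. A minimal sketch, with the gain/bias names and eps value assumed for illustration:

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    """x: (batch, features). gain, bias: learned per-feature parameters."""
    mu = x.mean(axis=1, keepdims=True)   # mean over the layer's units, per example
    var = x.var(axis=1, keepdims=True)   # variance over the layer's units, per example
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gain * x_hat + bias
```

Note that nothing here depends on the batch size, so the computation is identical for a batch of one example and at every time step of a recurrent network.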
Weight Normalization: Weight normalization normalizes the weights of a layer directly using the L2 norm. As an example, consider an input layer of dimension 10 and a hidden layer of dimension 50, so the weight matrix is of size 10 x 50. To normalize it, consider the 10 weights feeding into each of the 50 hidden units: take their L2 norm and divide those weights by it. In the full reparameterization a learnable per-unit scale g is also kept, so each weight vector is expressed as w = g · v / ‖v‖.
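The 10 → 50 example above can be written out directly. This is a minimal NumPy sketch; the random initialization and the choice of g = 1 for every unit are assumptions for illustration, since in practice g is learned along with the direction vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.normal(size=(10, 50))          # unnormalized weights: 10 inputs -> 50 hidden units
g = np.ones(50)                        # per-unit scale (learnable; all ones here as an assumption)

col_norms = np.linalg.norm(V, axis=0)  # L2 norm of the 10 weights feeding each unit
W = g * V / col_norms                  # w = g * v / ||v|| for each column

# Each column of W now has norm equal to its g
print(np.allclose(np.linalg.norm(W, axis=0), g))
```

Decoupling the norm g from the direction v/‖v‖ is what lets gradient descent adjust the scale and the direction of each weight vector independently.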