In this section we will discuss a very popular topic in modern deep learning: weight initialization. You may have wondered in the past why we divide randomly drawn weights by the square root of the dimensionality:

np.random.randn(D) / np.sqrt(D)

We will also look at some other topics on optimization, such as vanishing and exploding gradients and local vs. global minima.
When it comes to neural networks there is a premise that deeper networks are better. With a one hidden layer neural network, often called a shallow network, you need a lot of hidden units to make it more expressive. Researchers have found that if you add more hidden layers you can use fewer hidden units per layer and still achieve better performance.
However, there is a problem with deep networks. For a long time researchers believed that the S-shaped sigmoid was the best possible activation function. This may be because it has a very convenient derivative: the sigmoid's derivative is its output times 1 minus its output:

$$output*(1-output)$$

It is also smooth and monotonically increasing. Smoothness is nice because it means the function is differentiable everywhere, and differentiability is important because the learning method is gradient descent, and we can't do gradient descent if we can't take derivatives.

Finally, you may recall the sigmoid is the output of binary logistic regression (modeled after a neuron). So it is nice when you are building a neural network to have it actually be made out of neurons. The entire architecture is uniform.
The problem is this: We know that a neural network has the basic form of:
$$y = f(g(h(...(x)...)))$$

where $f$, $g$, and $h$ each represent a separate network layer. In other words, it is a composite function. By the chain rule of calculus, the derivative with respect to the weights at the first layer is calculated by multiplying together the derivatives at each layer that comes after it:

$$\frac{dy}{dw_1} = \frac{df}{dg}*\frac{dg}{dh}*...$$

So what is wrong with this? Think about what happens when you multiply a small number by itself again and again: it quickly approaches 0. For instance, try 0.25:
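$$0.25 * 0.25 * 0.25 * 0.25 * 0.25 * 0.25 * 0.25 * 0.25 = 0.25^8 \approx 1.53 \times 10^{-5}$$

We chose the number 0.25 for a very specific reason. If we look at the derivative of the sigmoid we notice 2 things:

- Its maximum value is 0.25, attained when the input is 0.
- It approaches 0 very quickly as the input moves away from 0 in either direction.

So every sigmoid layer multiplies the gradient by a factor of at most 0.25, and often by much less.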
This means that very deep neural networks built from sigmoid units are extremely hard to train using standard backpropagation: the gradient that reaches the early layers is close to 0.
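As a quick numerical check of the argument above, the following sketch (plain NumPy; the helper names `sigmoid` and `sigmoid_grad` are ours, chosen for illustration) evaluates the sigmoid derivative on a grid of inputs and shows how fast the best-case product of per-layer factors shrinks with depth. It only looks at the activation derivative and ignores the weights, so treat it as an illustration rather than a full gradient computation.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_grad(a):
    """Derivative of the sigmoid: output * (1 - output)."""
    s = sigmoid(a)
    return s * (1.0 - s)

# Evaluate the derivative on a grid of inputs from -10 to 10.
a = np.linspace(-10, 10, 2001)
grads = sigmoid_grad(a)
max_grad = grads.max()             # largest value on the grid
argmax_input = a[grads.argmax()]   # input where it occurs

# Best case: every layer contributes the largest possible factor.
depths = np.arange(1, 11)
best_case = max_grad ** depths

print("max derivative:", max_grad, "at input", argmax_input)
for d, p in zip(depths, best_case):
    print(f"{d:2d} layers -> factor at most {p:.2e}")
```

Even in this best case the factor drops by a quarter with every extra sigmoid layer; away from an input of 0 the per-layer factor is smaller still.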
A key development that addressed this was Geoff Hinton's "greedy layer-wise unsupervised pre-training". That is slightly advanced, so we will cover it in a future notebook.

Another option is to not use the sigmoid or the tanh and just use the ReLU instead. By using the ReLU we can train a deep network using standard backpropagation without any pretraining. Sometimes people call this end-to-end training.
Now that we are familiar with the vanishing gradient problem, what about the exploding gradient? What happens if we take a number greater than 1 and multiply it by itself again and again? Well that number is very quickly going to approach infinity. This clearly is also a problem, and one that shows up in recurrent neural nets.
What does this have to do with weight initialization? We have a situation where we want the weights $w$ to not be too big (the gradients will explode) and not be too small (the gradients will vanish). So, we need the weights to be just right, and for that to happen we need to initialize them to these just right values.
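To make the "too small, too big, just right" idea concrete, here is a minimal sketch. It assumes a stack of purely linear layers (no activation function, no bias), which is a simplification of a real network, and the function name `output_std`, the depth, and the layer width are all illustrative choices of ours. It pushes standard normal inputs through the stack and reports the spread of the final output for three different weight scales.

```python
import numpy as np

def output_std(scale, depth=20, D=100, n_samples=500, seed=0):
    """Std of the output after `depth` linear layers with N(0, scale^2) weights."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((n_samples, D))   # inputs with unit variance
    for _ in range(depth):
        W = rng.standard_normal((D, D)) * scale
        h = h @ W                             # linear layer, no activation
    return h.std()

D = 100
too_small = output_std(0.01, D=D)               # signal shrinks layer by layer
just_right = output_std(1.0 / np.sqrt(D), D=D)  # signal keeps roughly unit scale
too_large = output_std(1.0, D=D)                # signal grows layer by layer

print("scale 0.01      ->", too_small)
print("scale 1/sqrt(D) ->", just_right)
print("scale 1.0       ->", too_large)
```

With these settings each layer multiplies the variance by roughly `D * scale**2`, so only the middle choice keeps the signal at a usable magnitude through all 20 layers.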
One common way to initialize linear models (e.g. linear regression) is to initialize the weights to 0. Why won't this work with neural networks?
Consider a 1 hidden layer ANN with either the sigmoid or tanh activation, where $W$ holds the input-to-hidden weights, $Z$ is the hidden layer output, and $V$ holds the hidden-to-output weights. If we use the tanh, then $Z$ is going to be all zeros. This means the derivative with respect to $V$ is also going to be all zeros, so $V$ will never change. The same thing happens with $W$, because its gradient is multiplied by $V$, which is all zeros:

$$\frac{\partial J}{\partial V} = Z^T (Y - T)$$

$$\frac{\partial J}{\partial W} = X^T [(Y-T)V^T * (1-Z^2)]$$

In the case of the sigmoid, the hidden values $Z$ will all be 0.5, so the weights will change, but they are going to change in an undesirable way. In particular, there is going to be symmetry along one axis: every hidden unit receives exactly the same update.

$$\frac{\partial J}{\partial W} = X^T [(Y-T)V^T * Z * (1-Z)]$$
So, in other words, we are either going to get all 0s, or more generally we will get symmetry. This is because if all units calculate the same feature, it's like having only 1 unit in that layer. In other words, adding more units to your network won't make it more expressive, since it is as if it only has one hidden unit. So, initializing randomly allows us to break this symmetry and make use of all units in the network.
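The symmetry problem can be seen directly in a small experiment. The sketch below is an assumption-laden toy: the data, the layer sizes, the learning rate, and the omission of bias terms are all our own choices. It trains a 1 hidden layer sigmoid network from all-zero weights using the two sigmoid gradient expressions given above, then checks whether the hidden units ever become different from one another.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy data: 50 points, 2 features, binary targets (all hypothetical).
N, M = 50, 4
x1 = np.linspace(-1.0, 1.0, N)
X = np.column_stack([x1, x1 ** 2])           # shape (N, 2)
T = (x1 > 0.5).astype(float).reshape(N, 1)   # shape (N, 1)

# Constant (zero) initialization of both weight matrices.
W = np.zeros((X.shape[1], M))   # input-to-hidden
V = np.zeros((M, 1))            # hidden-to-output

lr = 0.1
for _ in range(100):
    Z = sigmoid(X @ W)          # hidden layer output, all 0.5 at the first step
    Y = sigmoid(Z @ V)          # network output
    grad_V = Z.T @ (Y - T)
    grad_W = X.T @ ((Y - T) @ V.T * Z * (1 - Z))
    V -= lr * grad_V
    W -= lr * grad_W

# Each column of W is one hidden unit's incoming weights.
columns_identical = np.allclose(W, W[:, [0]])
weights_moved = not np.allclose(W, 0.0)
print("W after training:\n", W)
print("all hidden units identical:", columns_identical)
print("weights moved away from zero:", weights_moved)
```

The weights do move, but every column of `W` stays the same as every other column, so the four hidden units behave like a single unit.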
So, now that we are convinced that we need to initialize the weights randomly, the next question is what distribution should they come from, and what are the parameters of this distribution?
Let's start with linear regression since that is the simplest case. Our model is of the form:
$$y = w_1x_1 + w_2x_2 +...$$

We already know that the variance of each of the $x$'s is 1, because we have normalized the training data to have 0 mean and unit variance:

$$var(x_i) = 1$$

We would also like the output of this model to have a variance of 1, since in a neural network this is going to feed into the subsequent layer.

$$\text{we want: } var(y) = 1$$

Because the $x$'s and the $w$'s are independent and identically distributed with mean 0 (the proof is in the math appendix; the zero mean is why the terms involving the mean disappear), the variance of $y$ is:

$$var(y) = var(w_1)var(x_1) + var(w_2)var(x_2) + ... $$

Now, we know that the variance of $x_i$ is 1, so we can plug that in:

$$var(y) = var(w_1) + var(w_2) + ... $$

Because we initialize all of the weights the same way, we just call this the variance of $w$, without a subscript. So for an $x$ vector of dimensionality $D$, the variance of $y$ is $D$ times the variance of $w$:

$$var(y) = D * var(w)$$

So, if we want the variance of $y$ to be 1, then we have to make the variance of $w$ equal to $\frac{1}{D}$:

$$\text{set: } var(w) = \frac{1}{D}$$

In code this can be achieved by sampling from the standard normal, and dividing by the square root of $D$:
np.random.randn(D) / np.sqrt(D)

One other question you may have is "doesn't the nonlinear activation function change the variance?" Yes it does! However, since this is just an approximation, it is fine. Generally speaking, the most important thing is to not initialize your weights to be constant. As long as your weights are random and small enough, you generally won't have a problem. Weights that are initialized too large will be a problem, since they will have a very steep gradient, and lead to NaNs.
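The variance argument can be checked empirically. This is a rough sketch under the same assumptions as the derivation (zero-mean, unit-variance inputs, independent weights, a purely linear model); the dimensionality, the number of trials, and the variable names are arbitrary choices of ours. It draws many independent $(x, w)$ pairs and compares the variance of $y$ with and without the division by the square root of $D$.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 100            # input dimensionality
n_trials = 10_000  # number of independent (x, w) pairs

X = rng.standard_normal((n_trials, D))                  # var(x_i) = 1
W_scaled = rng.standard_normal((n_trials, D)) / np.sqrt(D)
W_unscaled = rng.standard_normal((n_trials, D))

# y = w_1 x_1 + w_2 x_2 + ... for each trial
y_scaled = (X * W_scaled).sum(axis=1)
y_unscaled = (X * W_unscaled).sum(axis=1)

var_scaled = y_scaled.var()       # expected to be close to 1
var_unscaled = y_unscaled.var()   # expected to be close to D

print("var(y) with  w ~ N(0, 1/D):", var_scaled)
print("var(y) with  w ~ N(0, 1):  ", var_unscaled)
```

The sample variances will not match 1 and $D$ exactly, but they should land close to those values, in line with $var(y) = D * var(w)$.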
Now, in the grand scheme of things weight initialization should be further down on your list of priorities; things like learning rate, training algorithm, and architecture are probably more important. We just want to be sure that weights are random and small.
Let's quickly go over some conventions that we can stick to for this lecture so that nothing is ambiguous.
- For a neural network layer, we will call the input size M1 and the output size M2. Sometimes M1 will be referred to as fan-in, and M2 referred to as fan-out.
The first method you may see doesn't depend on the size of the weight matrix at all; instead we just set the standard deviation to 0.01.

W = np.random.randn(M1, M2) * 0.01

Despite there being a ton of literature out there on weight initialization, this method is still pretty common.
The second method you may see is setting the variance to 2 divided by (M1 + M2):

var = 2 / (M1 + M2)
W = np.random.randn(M1, M2) * np.sqrt(var)

Note that 1/var is just the average of fan-in and fan-out. Typically this would be used for the tanh activation function.
Another, simpler method we can use with tanh is just setting the variance to 1/M1:

var = 1 / M1
W = np.random.randn(M1, M2) * np.sqrt(var)

For the ReLU it is common to use a variance of 2/M1:
var = 2 / M1
W = np.random.randn(M1, M2) * np.sqrt(var)

This is known as He normal initialization, named after its author, Kaiming He.
So these are the major weight initialization methods:
- For tanh use: 1 / M1 or 2 / (M1 + M2)
- For ReLU use: 2 / M1
Note that the assumption here is that you are drawing from a normal distribution.
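The methods above differ only in the variance they use, so they can be collected into one small helper. This is a sketch, not a library API: the function name `init_weights` and the method labels are hypothetical names of ours, and it uses NumPy's `default_rng` generator rather than `np.random.randn`.

```python
import numpy as np

def init_weights(M1, M2, method, rng):
    """Draw an (M1, M2) weight matrix from a zero-mean normal distribution.

    M1 is the fan-in, M2 the fan-out. The method labels are illustrative.
    """
    if method == "fixed_std":       # standard deviation 0.01, ignores sizes
        var = 0.01 ** 2
    elif method == "fan_avg":       # tanh: 2 / (M1 + M2)
        var = 2.0 / (M1 + M2)
    elif method == "fan_in":        # tanh: 1 / M1
        var = 1.0 / M1
    elif method == "he_normal":     # ReLU: 2 / M1
        var = 2.0 / M1
    else:
        raise ValueError(f"unknown method: {method}")
    return rng.standard_normal((M1, M2)) * np.sqrt(var)

rng = np.random.default_rng(0)
M1, M2 = 400, 300
for method in ("fixed_std", "fan_avg", "fan_in", "he_normal"):
    W = init_weights(M1, M2, method, rng)
    print(f"{method:10s} shape={W.shape} sample std={W.std():.4f}")
```

Printing the sample standard deviation is a cheap sanity check that the matrix really has the intended scale before it goes into a network.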
One final note about bias terms: you will notice that we didn't mention them. These can either be initialized to 0 or in the same way as the other weights. It doesn't really matter; things will still work. We mainly care about breaking symmetry, and initializing the weight matrix randomly already accomplishes that.
In the past, people would frequently mention that when it came to neural network training you had to watch out for local minima. However, in modern deep learning researchers have updated their perspective on this topic.

First and foremost, we cannot see what the error surface of a neural network actually looks like, because we cannot visualize a 1 million dimensional function. This can be thought of similarly to the shift from Newtonian physics to quantum mechanics. Newtonian physics was nice because it involved macro objects that we could see and perform experiments on. In quantum mechanics we really can't see anything directly. We need to come up with clever designs which allow us to calculate things that act as a proxy for the things that we are really trying to understand. So, we have to invent ways to probe around and really understand what is going on.

With this idea in mind, let's go over some ways that we can reason about the properties of a 1 million dimensional neural network and do experiments to check its behavior.
Researchers have reasoned that at any point where the gradient is 0, we are much more likely to be at a saddle point than at a local minimum. Recall that a saddle point in 2 dimensions has a minimum along one axis, and a maximum along the other axis.

In reality, if you are doing gradient descent, you are very unlikely to be heading down in precisely the direction of the minimum. More likely, you will be moving along both axes at the same time, and hence you just slide off the saddle. So saddles are not really a problem.
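A toy example makes the "sliding off" behaviour visible. The sketch below assumes the textbook saddle $f(x, y) = x^2 - y^2$, which has a minimum along $x$ and a maximum along $y$ at the origin; the function, starting point, and learning rate are our own illustrative choices and say nothing about a real network's loss surface. Gradient descent starts almost exactly on the $x$ axis, with only a tiny offset in $y$.

```python
import numpy as np

def grad_f(p):
    """Gradient of f(x, y) = x**2 - y**2, which has a saddle at the origin."""
    x, y = p
    return np.array([2.0 * x, -2.0 * y])

# Start almost exactly on the axis that leads into the saddle.
p = np.array([1.0, 1e-6])
lr = 0.1
for step in range(100):
    p = p - lr * grad_f(p)

x_final, y_final = p
print("final x:", x_final)   # pulled toward 0 along the 'minimum' axis
print("final y:", y_final)   # pushed away from 0 along the 'maximum' axis
```

The tiny initial offset in $y$ grows at every step, so the iterate leaves the saddle instead of getting stuck there. This toy function is unbounded below, so the point keeps sliding; in a real loss surface it would continue downhill toward some lower region.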
So, when we are in millions of dimensions, why is it unlikely that we are at a real minimum? This is a probability problem. For each axis, given that the derivative is 0, we have two choices: we can either be at a min or at a max. Treating the two as equally likely and the axes as independent (a rough heuristic), the probability of being at a minimum along ALL 1 million axes is $0.5^{1{,}000{,}000}$, which is basically 0.
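To get a feel for how small that number is, here is a short calculation under the same heuristic assumption (each axis independently a min or a max with probability 0.5). The number underflows ordinary floating point, so the sketch works with its base-10 logarithm instead; the variable names are ours.

```python
import math

n_dims = 1_000_000

# Direct evaluation underflows: the result is too small for a 64-bit float.
p_direct = 0.5 ** n_dims

# Work in log space instead: log10(0.5^n) = n * log10(0.5).
log10_p = n_dims * math.log10(0.5)

print("direct evaluation:", p_direct)
print("log10 of the probability:", log10_p)
```

The logarithm shows the probability is on the order of 1 followed by about three hundred thousand zeros after the decimal point, which is why the direct evaluation collapses to zero.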