We will now take a look at one of the most effective methods for improving plain gradient descent: momentum. You can think of it as the change that gets you most of the benefit for very little extra effort in tuning your learning procedure.
A way to think of this is as follows: gradient descent without momentum requires a force, or push, each time we want the weights to move. In other words, each time we want to move, there has to be a gradient so that we can move in its direction. If we had momentum, our update could keep moving even without a gradient being present.
This can be thought of as pushing a box on ice vs. pushing a box on gravel. If we are pushing the box on gravel, the minute we stop applying force, the box also stops moving. This is analogous to gradient descent without momentum. However, if we were pushing the box on ice, we could give it a push and then let go, and it would continue moving for a period of time before stopping. This is analogous to gradient descent with momentum. Another way to phrase this is as follows:
With momentum included in our update, our weight vector will build up a velocity in any direction that has a consistent gradient.
Let's put this into math.
Our update for $\theta$ can be described as: $$\theta_t \leftarrow \theta_{t-1} - \eta g_{t-1}$$ This says that $\theta_t$ is equal to the previous value of $\theta$, minus the learning rate times the gradient $g_{t-1}$. From this we can see that if the gradient is 0, nothing happens to $\theta_t$: it just gets set to its old value and doesn't change.
Now let's say that we add in momentum. Note that the term momentum is used very loosely here, since it has nothing to do with actual physical momentum. What we do is create a new variable, $v$, which stands for the velocity. It is equal to $\mu$ (the momentum term) times its old velocity, minus the learning rate times the gradient. Notice that now, the gradient only directly influences the velocity, which in turn has an effect on the position (our weight vector), $\theta$.
$$v_t \leftarrow \mu v_{t-1} - \eta g_{t-1}$$This new term, $\mu v_{t-1}$, gives us the ability to "slide on ice" if you will. In other words, it allows us to continue to move in the same direction that we were going before. Now, we talked about how if a box is sliding on ice, it will still stop eventually. That means that we are going to want our updated $v$ to be a fraction of the prior $v$, and hence $\mu$ should be a fraction. Typical values of $\mu$ are 0.9, 0.95, 0.99, etc. This means that without any $g$, the equation will still eventually "slow down". Our update rule for $\theta_t$ then becomes:
$$\theta_t \leftarrow \theta_{t-1} + v_t$$Now, if we combine these two equations we can see that our total update rule is:
$$\theta_t \leftarrow \theta_{t-1} + \mu v_{t-1} -\eta g_{t-1} $$And we can see that if we set the momentum term, $\mu$, equal to zero, we end up with the same update rule we originally had for gradient descent.
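To make this concrete, below is a minimal sketch of the momentum update on a toy one-dimensional problem. Everything here is made up purely for illustration (the objective $J(\theta) = \theta^2$, the learning rate, and the iteration count); the full implementation on a real dataset comes later in this section.
# toy illustration: minimize J(theta) = theta^2, whose gradient is g = 2*theta
eta = 0.1    # learning rate
mu = 0.9     # momentum term
theta = 5.0  # starting position
v = 0.0      # initial velocity
for t in range(200):
    g = 2 * theta          # g_{t-1}: gradient at the previous position
    v = mu * v - eta * g   # v_t = mu * v_{t-1} - eta * g_{t-1}
    theta = theta + v      # theta_t = theta_{t-1} + v_t
print(theta)  # ends up close to the minimum at 0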
You may be wondering, what is the effect of using momentum? Well we can see below that by using momentum, the cost converges to its minimum value much faster. This significantly speeds up training!
From another perspective, we can think of a situation where we have unequal gradients in different directions. In the image below, we have a very large gradient that creates the valley (each side is very steep), and then in the other direction (the stream flowing down), it is a very small gradient.
For visualization purposes, let's assume we have 2 parameters to optimize: one along the steep walls of the valley and one along the shallow floor. The gradient in one direction is very steep, and in the other direction it is very shallow. The idea is that if you don't have momentum, you rely purely on the gradient, which points more in the steep direction than in the shallow one (this is just a property of the gradient: the negative gradient is the direction of steepest descent). Since this vector points mostly in the steep direction, we are going to zigzag back and forth across the valley. That is a very inefficient way of reaching the minimum.
Once we add momentum, however, things change. Because we move the same way along the shallow direction every time, those velocities accumulate: a portion of our old velocity is added to our new velocity, helping us along in that direction. The result is that we get there faster by taking bigger steps in the shallow direction of the gradient, as the sketch below illustrates.
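Below is a hypothetical 2-parameter example of this effect (the quadratic loss, learning rate, and iteration count are all made-up values): one coordinate of the loss is very steep and the other is very shallow, and we compare plain gradient descent with the momentum update.
import numpy as np
# hypothetical valley: J(w) = 50*w[0]^2 + 0.5*w[1]^2, steep along w[0], shallow along w[1]
def grad(w):
    return np.array([100.0 * w[0], 1.0 * w[1]])
eta, mu = 0.009, 0.9
w_gd = np.array([1.0, 10.0])  # plain gradient descent
w_mo = np.array([1.0, 10.0])  # gradient descent with momentum
v = np.zeros(2)
for t in range(100):
    w_gd = w_gd - eta * grad(w_gd)
    v = mu * v - eta * grad(w_mo)
    w_mo = w_mo + v
print(w_gd)  # the shallow coordinate w[1] has barely moved (still around 4)
print(w_mo)  # momentum has accumulated velocity along w[1] and is much closer to the minimum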
Nesterov momentum was introduced by Yurii Nesterov in 1983. It is described as:
"A method for unconstrained convex minimization problem with rate of convergence O(1/$k^2$)"
The core idea is this: when the current weight vector is at some position, let's say $w$, then looking at the original momentum update from earlier, we know that the momentum term alone (ignoring the gradient term) is about to nudge the parameter vector by $\mu v_{t-1}$. Therefore, if we are about to compute the gradient, we can treat the approximate future position, $w + \mu v_{t-1}$, as a "lookahead": a point in the vicinity of where we are going to end up. Hence, it makes sense to compute the gradient at $w + \mu v_{t-1}$ instead of at the old/stale position $w$.
The image above makes this clear: instead of evaluating the gradient at the current position of $w$ (red circle), we know that our momentum is about to carry us to the tip of the green arrow, so with Nesterov momentum we instead evaluate the gradient at this "looked-ahead" position. Also, keep in mind that in the image above the blue vector refers to our update to the velocity, $v_t$, and not to the update of $w_t$.
Okay, so we have a basic idea of Nesterov momentum now, but let's reiterate it from a few different perspectives to help it sink in. Instead of just using momentum to blindly keep going in the direction we were already going, let's instead peek ahead by taking a big jump in the direction of the previous velocity, and calculate the gradient from there. We can think of it like gambling: if you are going to gamble, it is better to take a big jump and then make a correction than to make a correction and then gamble.
So first we peek ahead, jumping in the direction of the previous velocity (accumulated gradient):
We then measure the gradient at that point and go downhill in the direction of the gradient. We use that gradient to update our velocity (accumulated gradient). In other words, we combine the big jump with the gradient to get the new accumulated gradient. So in a way, it's peeking ahead and then course-correcting based on where we would have ended up.
We then take that accumulated gradient (first green vector), multiply it by some momentum constant, $\mu$, and take the next big jump in the direction of that accumulated gradient. Again, at the place where we end up (head of second brown vector), we measure the gradient and go downhill (second red vector) to correct any errors we have made, and we get a new accumulated gradient (second green vector).
We can see that the blue vectors represent what we would do with standard momentum: we first measure the gradient at the current position (small blue vector), add it to the brown vector, and end up making a jump given by the big blue vector (first brown vector plus the small blue vector, i.e. the current gradient). The brown vector represents our peek-ahead step. Notice that it is in the same direction as the blue vector. The red vector is the gradient at the peek-ahead point. The green vector is simply the sum of the brown vector and the red vector.
So, with the visuals discussed, what do the equations look like? First, we are going to use $w$ to represent our weights instead of $\theta$. Also, the majority of this is concerned with how we update $v_t$; the last step covers $w_t$. Now, let's start with the vector that represents the previous value of our weights, $w_{t-1}$, and the previous velocity, $v_{t-1}$:
Now, we have this jump ahead, which we can call $w'_{t-1}$. We can think of it as just our previous weight position plus the momentum step: $$w'_{t-1} = w_{t-1} + \mu v_{t-1}$$ The jump points in the same direction as our previous velocity vector (because, remember, the first part of updating $v_t$ was the term $\mu v_{t-1}$), but it is slightly shorter since it is scaled by $\mu$:
In the image, the jump can be written starting from either $w_{t-1}$ or $v_{t-1}$; these are equivalent because both vectors have their head at the same position (the current weights). Also, note that as seen in the image above, the jump-ahead point is just the previous weight position, $w_{t-1}$, plus the previous velocity multiplied by the momentum term. Next, we calculate the gradient at this jump-ahead point, and then use it to update $v$: $$v_t \leftarrow \mu v_{t-1} - \eta \nabla J(w'_{t-1})$$
Which is equal to:
$$v_t \leftarrow \mu v_{t-1} - \eta \nabla J(w_{t-1} +\mu v_{t-1})$$And then the last step is to update $w_t$ using this new velocity, the same as for standard momentum: $$w_t \leftarrow w_{t-1} + v_t$$
The main difference to note is that in the standard method we take the gradient at the current position of $w$ and also make our momentum step from the current position of $w$, whereas in the Nesterov method we first take our momentum step from the current position of $w$, and then correct it by taking the gradient from that new position. For a great link that goes over this topic in more detail, check out the following: http://cs231n.github.io/neural-networks-3/
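As a minimal sketch of this "lookahead" form (again on a made-up 1-D quadratic with grad(w) = 2*w, not the actual network), the update loop might look like this:
def grad(w):
    return 2 * w  # gradient of the toy objective J(w) = w^2
eta, mu = 0.1, 0.9
w, v = 5.0, 0.0
for t in range(200):
    w_ahead = w + mu * v              # peek ahead to where momentum is about to carry us
    v = mu * v - eta * grad(w_ahead)  # evaluate the gradient at the lookahead point
    w = w + v                         # then update the weights as in standard momentum
print(w)  # converges towards the minimum at 0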
However, in practice this is not how Nesterov momentum is usually implemented. Instead, we reformulate the equations. Let's try to express everything only in terms of $w'$, the lookahead value of $w$, which is where we calculate the gradient. So, we can define $w'_t$ and $w'_{t-1}$ using the same definition: $$w'_t = w_t + \mu v_t$$ $$w'_{t-1} = w_{t-1} + \mu v_{t-1}$$
In other words, these are the lookahead values of $w$ at two consecutive steps. The second equation is illustrated below:
So, remember, the lookahead value can just be thought of as a step in the direction of the previous velocity, taken from the current weight vector position. Next, let's recall our updates for $v$ and $w$: $$v_t = \mu v_{t-1} - \eta \nabla J(w'_{t-1})$$ $$w_t = w_{t-1} + v_t$$
From here we can substitute $w'$ for $w$: $$w'_t - \mu v_t = w'_{t-1} - \mu v_{t-1} + v_t$$ $$w'_t = w'_{t-1} - \mu v_{t-1} + (1 + \mu) v_t$$
We can get rid of the $v_t$ term on the right by replacing it with its expression in terms of $v_{t-1}$. $$w'_t = w'_{t-1} - \mu v_{t-1} + (1+\mu) \Big[\mu v_{t-1} - \eta \nabla J (w'_{t-1})\Big]$$
This equation is the one you will most often see when people discuss Nesterov momentum.
So, to quickly recap, the equation for regular momentum is:
$$w_t = w_{t-1} + \mu v_{t-1} - \eta \nabla J (w_{t-1})$$And Nesterov Momentum is defined as:
$$w'_t = w'_{t-1} + \mu^2 v_{t-1} - (1+ \mu) \eta \nabla J (w'_{t-1})$$We can see that they each have a similar form: there is a previous $w$ term, a previous velocity term, and a previous gradient term. And if we take the Nesterov equation and rephrase the $w'_t$ update in terms of $v_t$, we can see that the equations have an even greater resemblance:
$$v_t = \mu v_{t-1} - \eta \nabla J (w'_{t-1})$$$$w'_t = w'_{t-1} + \mu v_t - \eta \nabla J (w'_{t-1})$$The question may arise: why is this useful? What is the point of all of this algebraic manipulation? Well, we have expressed the Nesterov momentum update entirely in terms of the lookahead $w'$s, so the actual non-lookahead $w$s are never needed. We can just treat $w$, whatever it may be, as if it were the lookahead value. In other words, let's just drop the prime symbols!
$$v_t = \mu v_{t-1} - \eta \nabla J (w_{t-1})$$$$w_t = w_{t-1} + \mu v_t - \eta \nabla J (w_{t-1})$$So we can see that similar to regular momentum, we just proceed in two steps.
The final updates for regular momentum are: $$v_t \leftarrow \mu v_{t-1} - \eta \nabla J(w_{t-1})$$ $$w_t \leftarrow w_{t-1} + v_t$$
And for Nesterov momentum: $$v_t = \mu v_{t-1} - \eta \nabla J (w_{t-1})$$ $$w_t = w_{t-1} + \mu v_t - \eta \nabla J (w_{t-1})$$
So the main difference can be clearly distinguished: the update rule for $w_t$ is slightly different. However, in practice the loss per iteration tends to be rather similar, so choosing between the two is not a huge deal.
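As a quick numerical sanity check (using the same toy quadratic gradient assumption as in the earlier sketches), we can verify that the lookahead form and the rewritten, primes-dropped form really do trace out the same sequence of lookahead weights:
import numpy as np
def grad(w):
    return 2 * w  # toy gradient, J(w) = w^2
eta, mu = 0.1, 0.9
# formulation 1: lookahead form, tracking w and v, and recording w' = w + mu*v
w, v = 5.0, 0.0
lookahead_trace = []
for t in range(10):
    v = mu * v - eta * grad(w + mu * v)
    w = w + v
    lookahead_trace.append(w + mu * v)
# formulation 2: rewritten form in terms of w' only (primes dropped); w'_0 = w_0 since v_0 = 0
wp, v = 5.0, 0.0
rewritten_trace = []
for t in range(10):
    g = grad(wp)
    v = mu * v - eta * g
    wp = wp + mu * v - eta * g
    rewritten_trace.append(wp)
print(np.allclose(lookahead_trace, rewritten_trace))  # True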
Let's now implement 3 different scenarios: batch gradient descent without momentum, with standard momentum, and with Nesterov momentum.
We can start with our imports:
import numpy as np
from sklearn.utils import shuffle
import matplotlib.pyplot as plt
from util import get_normalized_data, error_rate, cost, y2indicator
And now let's quickly define our forward and derivative functions. Note that I have commented out the sigmoid code; the same calculations can be performed with the sigmoid instead of the ReLU.
def forward(X, W1, b1, W2, b2):
# sigmoid
# Z = 1 / (1 + np.exp(-( X.dot(W1) + b1 )))
# relu
Z = X.dot(W1) + b1
Z[Z < 0] = 0
A = Z.dot(W2) + b2
expA = np.exp(A)
Y = expA / expA.sum(axis=1, keepdims=True)
return Y, Z
def derivative_w2(Z, T, Y):
return Z.T.dot(Y - T)
def derivative_b2(T, Y):
return (Y - T).sum(axis=0)
def derivative_w1(X, Z, T, Y, W2):
# return X.T.dot( ( ( Y-T ).dot(W2.T) * ( Z*(1 - Z) ) ) ) # for sigmoid
return X.T.dot( ( ( Y-T ).dot(W2.T) * (Z > 0) ) ) # for relu
def derivative_b1(Z, T, Y, W2):
# return (( Y-T ).dot(W2.T) * ( Z*(1 - Z) )).sum(axis=0) # for sigmoid
return (( Y-T ).dot(W2.T) * (Z > 0)).sum(axis=0) # for relu
For this example we have imported get_normalized_data, which means that we will be using the full 784-dimensional data set. Now let's define our setup code, where the goal is to prep our data for the 3 scenarios discussed above:
max_iter = 20 # 20 full passes (epochs) over all of the batches
print_period = 10
X, Y = get_normalized_data() # grab our normalized X and Y data
lr = 0.00004 # precomputed learning and reg rates
reg = 0.01
# create train and test set (data is already shuffled), as well as one hot matrix
Xtrain = X[:-1000,] # grab everything except the last 1000 samples for the training set
Ytrain = Y[:-1000]
Xtest = X[-1000:,] # the last 1000 samples become the test set
Ytest = Y[-1000:]
Ytrain_ind = y2indicator(Ytrain) # turn targets into one hot encoded matrices
Ytest_ind = y2indicator(Ytest)
N, D = Xtrain.shape
batch_sz = 500 # set batch size to 500
n_batches = N // batch_sz # get number of batches
M = 300 # number hidden units, we found this in prev demo
K = Ytrain_ind.shape[1] # number of output classes
# intialize weights to random small values
W1 = np.random.randn(D, M) / np.sqrt(D)
b1 = np.zeros(M)
W2 = np.random.randn(M, K) / np.sqrt(M)
b2 = np.zeros(K)
# and let's save initial weights so we can compare all methods!
W1_0 = W1.copy()
b1_0 = b1.copy()
W2_0 = W2.copy()
b2_0 = b2.copy()
# ------------------ test number 1, batch GD without momentum ------------------
losses_batch = []
errors_batch = []
for i in range(max_iter): # iterate through all batches
for j in range(n_batches): # iterate through each specific batch
# get batch size, make predictions
Xbatch = Xtrain[j*batch_sz:(j*batch_sz + batch_sz),]
Ybatch = Ytrain_ind[j*batch_sz:(j*batch_sz + batch_sz),]
pYbatch, Z = forward(Xbatch, W1, b1, W2, b2)
# perform updates
W2 -= lr * (derivative_w2(Z, Ybatch, pYbatch) + reg*W2)
b2 -= lr * (derivative_b2(Ybatch, pYbatch) + reg*b2)
W1 -= lr * (derivative_w1(Xbatch, Z, Ybatch, pYbatch, W2) + reg*W1)
b1 -= lr * (derivative_b1(Z, Ybatch, pYbatch, W2) + reg*b1)
# if print period:
if j % print_period == 0:
pY, _ = forward(Xtest, W1, b1, W2, b2)
l = cost(pY, Ytest_ind)
losses_batch.append(l)
print("Cost at iteration i=%d, j=%d: %.6f" % (i, j, l))
e = error_rate(pY, Ytest)
errors_batch.append(e)
print("Error rate:", e)
# print final error rate
pY, _ = forward(Xtest, W1, b1, W2, b2)
print("Final error rate:", error_rate(pY, Ytest))
print("--------------------------------------------------")
# --------------- test number 2, batch GD with regular momentum ------------------
W1 = W1_0.copy()
b1 = b1_0.copy()
W2 = W2_0.copy()
b2 = b2_0.copy()
losses_momentum = []
errors_momentum = []
mu = 0.9 # momentum parameter, think viscosity
dW2 = 0 # need to keep track of the previous weight changes (gradients)
db2 = 0 # think of these as velocity
dW1 = 0
db1 = 0
for i in range(max_iter): # iterate through all batches
for j in range(n_batches): # iterate through each specific batch
# get batch size, make predictions
Xbatch = Xtrain[j*batch_sz:(j*batch_sz + batch_sz),]
Ybatch = Ytrain_ind[j*batch_sz:(j*batch_sz + batch_sz),]
pYbatch, Z = forward(Xbatch, W1, b1, W2, b2)
# calculate the gradients
gW2 = derivative_w2(Z, Ybatch, pYbatch) + reg*W2
gb2 = derivative_b2(Ybatch, pYbatch) + reg*b2
gW1 = derivative_w1(Xbatch, Z, Ybatch, pYbatch, W2) + reg*W1
gb1 = derivative_b1(Z, Ybatch, pYbatch, W2) + reg*b1
# update our velocities - based on regular momentum equation
dW2 = mu*dW2 - lr*gW2
db2 = mu*db2 - lr*gb2
dW1 = mu*dW1 - lr*gW1
db1 = mu*db1 - lr*gb1
# update weights
W2 += dW2
b2 += db2
W1 += dW1
b1 += db1
# if print period:
if j % print_period == 0:
pY, _ = forward(Xtest, W1, b1, W2, b2)
l = cost(pY, Ytest_ind)
losses_momentum.append(l)
print("Cost at iteration i=%d, j=%d: %.6f" % (i, j, l))
e = error_rate(pY, Ytest)
errors_momentum.append(e)
print("Error rate:", e)
# print final error rate
pY, _ = forward(Xtest, W1, b1, W2, b2)
print("Final error rate:", error_rate(pY, Ytest))
print("--------------------------------------------------")
# --------------- test number 3, batch GD with Nesterov momentum ------------------
W1 = W1_0.copy()
b1 = b1_0.copy()
W2 = W2_0.copy()
b2 = b2_0.copy()
losses_nesterov = []
errors_nesterov = []
mu = 0.9 # momentum parameter, think viscosity
vW2 = 0 # need to keep track of the previous weight changes (gradients)
vb2 = 0 # think of these as velocity
vW1 = 0
vb1 = 0
for i in range(max_iter): # iterate through all batches
for j in range(n_batches): # iterate through each specific batch
# get batch size, make predictions
Xbatch = Xtrain[j*batch_sz:(j*batch_sz + batch_sz),]
Ybatch = Ytrain_ind[j*batch_sz:(j*batch_sz + batch_sz),]
pYbatch, Z = forward(Xbatch, W1, b1, W2, b2)
# calculate the gradients
gW2 = derivative_w2(Z, Ybatch, pYbatch) + reg*W2
gb2 = derivative_b2(Ybatch, pYbatch) + reg*b2
gW1 = derivative_w1(Xbatch, Z, Ybatch, pYbatch, W2) + reg*W1
gb1 = derivative_b1(Z, Ybatch, pYbatch, W2) + reg*b1
# update our velocities - based on Nesterov momentum equations
vW2 = mu*vW2 - lr*gW2
vb2 = mu*vb2 - lr*gb2
vW1 = mu*vW1 - lr*gW1
vb1 = mu*vb1 - lr*gb1
# update our weights - based on nesterov momentum equations
W2 += mu*vW2 - lr*gW2
b2 += mu*vb2 - lr*gb2
W1 += mu*vW1 - lr*gW1
b1 += mu*vb1 - lr*gb1
# if print period:
if j % print_period == 0:
pY, _ = forward(Xtest, W1, b1, W2, b2)
l = cost(pY, Ytest_ind)
losses_nesterov.append(l)
print("Cost at iteration i=%d, j=%d: %.6f" % (i, j, l))
e = error_rate(pY, Ytest)
errors_nesterov.append(e)
print("Error rate:", e)
# print final error rate
pY, _ = forward(Xtest, W1, b1, W2, b2)
print("Final error rate:", error_rate(pY, Ytest))
print("--------------------------------------------------")
%matplotlib notebook
plt.plot(losses_batch, label="batch")
plt.plot(losses_momentum, label="momentum")
plt.plot(losses_nesterov, label="nesterov")
plt.legend()
plt.show()
Momentum is generally looked at as the most impactful method when it comes to speeding up training. If you compare the loss per iteration between standard gradient descent, and gradient descent with momentum, the difference is huge!
Additionally, momentum is nice because you don't have to play with the hyperparameters very much. You can just set it to 0.9 and it is usually fine, resulting in huge performance gains. However, some adaptive learning rate techniques are very powerful, so let's take a look!
The first new approach we will look at is a variable learning rate, i.e. a learning rate that is a function of time, $\eta(t)$. This is sometimes referred to as "learning rate scheduling". The first schedule worth mentioning is step decay:
Periodically, say every 100 iterations, we reduce the learning rate by a constant factor. For example, if we repeatedly divide by 2, we halve it each time.
In the second method, exponential decay, the learning rate follows an exponential curve:
$$\eta(t) = A e^{-kt}$$A third option is $1/t$ decay, where the learning rate decays proportionally to $\frac{1}{t}$: $$\eta(t) = \frac{A}{kt + 1}$$ In this case the dropoff is slower than with exponential decay.
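Below is a minimal sketch of these three schedules as plain Python functions; the constants A and k, the drop factor, and the step interval are all made-up values for illustration.
import numpy as np
def step_decay(t, eta0=0.01, drop=0.5, every=100):
    # cut the learning rate in half every 100 iterations
    return eta0 * drop ** (t // every)
def exponential_decay(t, A=0.01, k=1e-3):
    # eta(t) = A * exp(-k*t)
    return A * np.exp(-k * t)
def inverse_decay(t, A=0.01, k=1e-3):
    # eta(t) = A / (k*t + 1), drops off more slowly than exponential decay
    return A / (k * t + 1)
for t in [0, 100, 1000, 10000]:
    print(t, step_decay(t), exponential_decay(t), inverse_decay(t))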
Each of these methods decreases the learning rate as time goes on (i.e. as the number of iterations increases). The question comes up: why would we want to do that? Generally speaking, when we initialize the weights of a neural network, they are going to be very far from the optimal weights, so it is good to start with a large learning rate and take bigger steps towards the goal. This is the motivation behind momentum as well: we want to pick up speed by accumulating past gradients, because we know that if we are very far away from our goal, those gradients should be large. However, as we get close to the goal, the gradient is going to shrink. In fact, by definition, the gradient must be 0 at the minimum of a function; that is how we solve for the minimum in calculus.
But why would we want to slow down as we get close to the minimum? Well, when you are close to the minimum and take too big of a step, you are going to overshoot it. What ends up happening is that you just bounce back and forth. In fact, if your learning rate is too large, you can bounce right out of the valley and your loss may increase! So, in order to reduce all of this bouncing around, we would like to take smaller steps.
One other thing to keep in mind is the following caveat: all of these methods introduce more hyperparameters to optimize, which adds more work to your plate as a data scientist.
The first adaptive learning technique we will go over is AdaGrad. The main idea behind it is this: we cannot expect the dependence of the cost on each of the parameters to be the same. In other words, in one direction the gradient may be very steep, but in another direction the gradient may be very flat. So, perhaps it may be beneficial to adapt the learning rate for each parameter individually, based on how much it has changed in the past.
So, in AdaGrad what we do is introduce a variable called the cache.
$$cache = cache + gradient^2$$$$w \leftarrow w - \eta \frac{\nabla J}{\sqrt{cache + \epsilon }}$$Each parameter of the neural network has its own cache, so for example if you have one weight matrix of size (3 x 4), you will also have a cache matrix of size (3 x 4). The same thing applies to the bias vectors.
The idea behind the cache is that it accumulates the squared gradients. Because we are squaring the gradients, the cache is always going to be positive. And because each parameter has its own cache, if one parameter has had a lot of large gradients in the past, then its cache will be very large, its effective learning rate will be very small, and it will change more slowly in the future. On the other hand, if a parameter has had a lot of small gradients in the past, then its cache will be small, so its effective learning rate will remain large, and it will have more opportunity to change in the future.
One small thing to keep in mind is that we usually add a small number $\epsilon$ to the denominator in order to prevent dividing by 0. This is generally set to $10^{-8}$ or $10^{-10}$.
One important thing to stress about AdaGrad is that everything we are doing is an element-wise operation. In other words, each scalar parameter and its learning rate is effectively updated independently of the others. So you can look at the rules that we presented as scalar updates that apply to all of the parameters, or you can think of one huge parameter vector that contains all of the neural network parameters, and that each of the operations is an element wise operation.
This next technique builds on the observation that AdaGrad decreases the learning rate too aggressively: the effective learning rate approaches 0 too quickly, when in fact there is still learning to be done. The method introduced to fix this is called RMSProp, and it was introduced by Geoff Hinton and his team. It works as follows:
The reason that AdaGrad decreases the learning rate too quickly is that the cache grows too fast. So, in order to make the cache grow more slowly, we actually decrease it each time we update it. This is done by taking a weighted average of the old cache and the new squared gradient. We call this weight the decay rate, and the two weights add up to 1.
$$cache = decay*cache +(1 - decay)*gradient^2$$Typical values for the decay are 0.99, 0.999, etc. The intuition for these choices will be discussed in later lectures. By doing this, we say that we are making the cache "leaky". The reason it is leaky is that if we had 0 gradient for a long time, the cache would eventually shrink back down to 0, because it is decreased by the decay rate on each round.
Note that there is some ambiguity in the RMSProp and AdaGrad algorithms, which is easiest to see in the context of RMSProp. The ambiguity arises from the initial value of the cache. You might automatically assume that 0 is a good value, but this actually has a very strange effect. Let's say: $$decay = 0.999$$ That means our first cache value is:
$$0.001g^2$$And hence our first update step is (ignoring epsilon): $$\frac{\eta g}{\sqrt{0.001g^2}}$$ which ends up being very large, because the denominator is very small. You would have to compensate by making your initial learning rate smaller than usual. One solution, however, is just to set the cache to 1 instead. By manipulating the RMSProp update, we can show that this gives us approximately what we would get if we were not doing RMSProp at all:
$$\nabla w = \eta \frac{g}{\sqrt{1 + 0.001 g^2}} \approx \eta g$$You may still be wondering which is the correct way to go. Well, in TensorFlow the cache is initialized to 1; in Keras it is initialized to 0.
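Here is a tiny numeric illustration of that initialization issue, using a made-up gradient of $g = 1$ and a made-up learning rate of 0.001, together with the decay value from above:
import numpy as np
g, decay, eta, eps = 1.0, 0.999, 0.001, 1e-10
cache0 = decay * 0.0 + (1 - decay) * g**2  # first cache update when the cache starts at 0
cache1 = decay * 1.0 + (1 - decay) * g**2  # first cache update when the cache starts at 1
print(eta * g / np.sqrt(cache0 + eps))  # ~0.0316, more than 30x larger than eta
print(eta * g / np.sqrt(cache1 + eps))  # ~0.001, essentially a plain gradient descent step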
AdaGrad
# at every batch
cache += gradient ** 2
param = param - learning_rate * gradient / sqrt(cache + epsilon)
RMSProp
# at every batch
cache = decay * cache + (1 - decay) * gradient **2
param = param - learning_rate * gradient / sqrt(cache + epsilon)
Note: epsilon is generally $10^{-8}$, $10^{-9}$, or $10^{-10}$, and the decay is generally 0.9, 0.99, or 0.999.
Here we are going to be comparing RMSProp against a constant learning rate. We can start by defining our standard multi-layer perceptron functions:
def forward(X, W1, b1, W2, b2):
Z = X.dot(W1) + b1
Z[Z < 0] = 0 # using ReLU function
A = Z.dot(W2) + b2
expA = np.exp(A)
Y = expA / expA.sum(axis=1, keepdims=True)
return Y, Z
def derivative_w2(Z, T, Y):
return Z.T.dot(Y - T)
def derivative_b2(T, Y):
return (Y - T).sum(axis=0)
def derivative_w1(X, Z, T, Y, W2):
return X.T.dot( ( (Y - T).dot(W2.T) * (Z > 0) ) ) # for relu
def derivative_b1(Z, T, Y, W2):
return ((Y - T).dot(W2.T) * (Z > 0)).sum(axis=0) # for relu
And now for our imports.
import numpy as np
from sklearn.utils import shuffle
import matplotlib.pyplot as plt
%matplotlib inline
from util import get_normalized_data, error_rate, cost, y2indicator
Now we can start our main function:
def main():
max_iter = 20
print_period = 10
X, Y = get_normalized_data()
lr = 0.00004
reg = 0.01
# get train/test data
Xtrain = X[:-1000,]
Ytrain = Y[:-1000]
Xtest = X[-1000:,]
Ytest = Y[-1000:]
Ytrain_ind = y2indicator(Ytrain)
Ytest_ind = y2indicator(Ytest)
N, D = Xtrain.shape
batch_sz = 500
n_batches = N // batch_sz
# 300 hidden units, 10 output classes, then initialize our weights
M = 300
K = 10
W1 = np.random.randn(D, M) / 28
b1 = np.zeros(M)
W2 = np.random.randn(M, K) / np.sqrt(M)
b2 = np.zeros(K)
# --------------------- 1. Constant Learning Rate -----------------------------
# cost = -16
LL_batch = []
CR_batch = []
for i in range(max_iter): # looping through total iterations
for j in range(n_batches): # looping through number of batches
Xbatch = Xtrain[j*batch_sz:(j*batch_sz + batch_sz),]
Ybatch = Ytrain_ind[j*batch_sz:(j*batch_sz + batch_sz),]
pYbatch, Z = forward(Xbatch, W1, b1, W2, b2)
# weight and bias updates
W2 -= lr*(derivative_w2(Z, Ybatch, pYbatch) + reg*W2)
b2 -= lr*(derivative_b2(Ybatch, pYbatch) + reg*b2)
W1 -= lr*(derivative_w1(Xbatch, Z, Ybatch, pYbatch, W2) + reg*W1)
b1 -= lr*(derivative_b1(Z, Ybatch, pYbatch, W2) + reg*b1)
# print/debugging step
if j % print_period == 0:
# calculate just for LL
pY, _ = forward(Xtest, W1, b1, W2, b2)
# print "pY:", pY
ll = cost(pY, Ytest_ind)
LL_batch.append(ll)
print("Cost at iteration i=%d, j=%d: %.6f" % (i, j, ll))
err = error_rate(pY, Ytest)
CR_batch.append(err)
print("Error rate:", err)
pY, _ = forward(Xtest, W1, b1, W2, b2)
print("Final error rate:", error_rate(pY, Ytest))
print('-------------------- END --------------------------')
# --------------------- 2. RMSProp -----------------------------
W1 = np.random.randn(D, M) / 28
b1 = np.zeros(M)
W2 = np.random.randn(M, K) / np.sqrt(M)
b2 = np.zeros(K)
LL_rms = []
CR_rms = []
lr0 = 0.001 # Initial Learning Rate. If you set this too high you'll get NaN!
cache_W2 = 1 # storing cache for each of our weights
cache_b2 = 1
cache_W1 = 1
cache_b1 = 1
decay_rate = 0.999 # our decay rate parameter and epsilon
eps = 1e-10
# calculate batches in same way
for i in range(max_iter):
for j in range(n_batches):
Xbatch = Xtrain[j*batch_sz:(j*batch_sz + batch_sz),]
Ybatch = Ytrain_ind[j*batch_sz:(j*batch_sz + batch_sz),]
pYbatch, Z = forward(Xbatch, W1, b1, W2, b2) # forward pass
# weight and bias updates, WITH RMSProp
gW2 = derivative_w2(Z, Ybatch, pYbatch) + reg*W2
cache_W2 = decay_rate*cache_W2 + (1 - decay_rate)*gW2*gW2 # elem by elem mult
W2 -= lr0 * gW2 / (np.sqrt(cache_W2) + eps)
gb2 = derivative_b2(Ybatch, pYbatch) + reg*b2
cache_b2 = decay_rate*cache_b2 + (1 - decay_rate)*gb2*gb2
b2 -= lr0 * gb2 / (np.sqrt(cache_b2) + eps)
gW1 = derivative_w1(Xbatch, Z, Ybatch, pYbatch, W2) + reg*W1
cache_W1 = decay_rate*cache_W1 + (1 - decay_rate)*gW1*gW1
W1 -= lr0 * gW1 / (np.sqrt(cache_W1) + eps)
gb1 = derivative_b1(Z, Ybatch, pYbatch, W2) + reg*b1
cache_b1 = decay_rate*cache_b1 + (1 - decay_rate)*gb1*gb1
b1 -= lr0 * gb1 / (np.sqrt(cache_b1) + eps)
if j % print_period == 0:
# calculate just for LL
pY, _ = forward(Xtest, W1, b1, W2, b2)
# print "pY:", pY
ll = cost(pY, Ytest_ind)
LL_rms.append(ll)
print("Cost at iteration i=%d, j=%d: %.6f" % (i, j, ll))
err = error_rate(pY, Ytest)
CR_rms.append(err)
print("Error rate:", err)
pY, _ = forward(Xtest, W1, b1, W2, b2)
print("Final error rate:", error_rate(pY, Ytest))
print('-------------------- END --------------------------')
plt.plot(LL_batch, label='const')
plt.plot(LL_rms, label='rms')
plt.legend()
plt.show()
main()
One of the newest and most common optimization techniques these days is known as the Adam optimizer (link to the original paper here: https://arxiv.org/pdf/1412.6980.pdf). It is going to be given its own section altogether, since as of 2017 it is often considered the go-to optimizer for those doing deep learning. Sometimes people describe Adam as being like RMSProp, but with momentum. This can be slightly confusing, considering that TensorFlow has RMSProp and allows you to add momentum to it, which is not Adam. In other words, you can have RMSProp with momentum, but that doesn't give you Adam; it just gives you RMSProp with momentum. Adam does, in a sense, add something like momentum to RMSProp, but in a very specific way. In order to fully understand Adam, we first need to look at part of the theory it depends on: exponentially smoothed averages.
We are going to take a minute to dig into something that may seem straightforward: how to calculate an average. The first thought we all have when asked to do this is: why not just add up all of the sample data points and divide by the number of data points, giving the sample mean:
$$\bar{X}_N = \frac{1}{N}\sum_{n=1}^NX_n$$But now let's suppose that you have a large amount of data, so much so that all of your X's cannot fit into memory at the same time. Is it still possible to calculate the sample mean? Yes it is! We can read in one data point at a time, and delete each data point after we've looked at it. It is shown below that the current sample mean can actually be expressed in terms of the previous sample mean and the current data point.
$$\bar{X}_N = \frac{1}{N}\sum_{n=1}^NX_n = \frac{1}{N}\Big((N-1)\bar{X}_{N-1} + X_N \Big) = (1 - \frac{1}{N})\bar{X}_{N-1}+\frac{1}{N}X_N$$We can then express this using simpler symbols. We can call $Y$ our output, and we can use $t$ to represent the current time step:
$$Y_t = (1 - \frac{1}{t})Y_{t-1} + \frac{1}{t}X_t$$Great, so we have solved our problem of how to calculate the sample mean when we can't fit all of the data into memory, but we can see that there is this $\frac{1}{t}$ term. This says that as $t$ grows larger, the current sample has less and less of an effect on the total mean. Clearly this makes sense, because as $t$ grows that means the total number of $X$'s we've seen has grown. We also decrease the influence of the previous $Y$ by $1 - \frac{1}{t}$. This means that each new $Y$ is just part of the old $Y$, plus part of the newest $X$. But in the end, it balances out to give us exactly the sample mean of $X$.
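A quick sketch to convince ourselves that the recursive formula really does recover the sample mean, processing one (made-up) data point at a time:
import numpy as np
X = np.random.randn(1000) + 3.0  # arbitrary data with mean roughly 3
Y = 0.0
for t, x in enumerate(X, start=1):
    Y = (1 - 1.0 / t) * Y + (1.0 / t) * x  # Y_t = (1 - 1/t)*Y_{t-1} + (1/t)*X_t
print(Y, X.mean())  # the two values agree up to floating point error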
For convenience we can call this $\frac{1}{t}$ term $\alpha_t$. What if we were to say that we did not want $\alpha$ to be $\frac{1}{t}$? What if we said that we wanted each data point to matter equally at the time that we see it, so that we can set alpha to be a constant? Of course, $\alpha$ needs to be less than 1 so that we don't end up negating the previous mean.
$$0 < \alpha_t = constant < 1 $$$$Y_t = (1 - \alpha)Y_{t-1} + \alpha X_t$$So what does this give us?
This gives us what is called the exponentially smoothed average! We can see why it is called exponential when we express it in terms of only $X$'s.
$$Y_t = (1 - \alpha)^tY_0 + \alpha \sum_{\tau = 0}^{t - 1}(1- \alpha)^\tau X(t- \tau)$$If the equation above is not clear, the expansion below should clear up where everything is coming from and why this is called exponential. Let's say we are looking at $Y_{100}$:
$$Y_{100} = (1-\alpha)^{100}Y_0 + \alpha * X_{100} + \alpha * (1 - \alpha)^1*X_{99} + \alpha * (1 - \alpha)^2 * X_{98}+ ...$$We can see the exponents accumulating on the $(1 - \alpha)$ terms! Now, does this still give us the mean, aka the expected value of $X$? Well, if we take the expected value of everything, we can see that we arrive at the expected value of $X$:
$$(1 - \alpha)E[Y_{t-1}] + \alpha E[X_t] = (1-\alpha)E(X) + \alpha E(X) = E(X)$$We do arrive at the expected value of $X$, so the math does check out! Of course, this assumes that the distribution of $X$ does not change over time. Note that if you come from a signal processing background, you may recognize this as a low-pass filter. Another way to think about it is that current values matter more and past values matter less, in an exponentially decreasing way. So, if $X$ is not stationary (meaning its distribution changes over time), then this is actually a better way to estimate the mean than weighting all data points equally over all time.
Now, how does this apply to the Adam Optimizer? Well, if we call the sample mean of $T$ samples, $M_T$, we write $M_T$ as: $$M_T = \frac{1}{T} \sum_{t=1}^TX_t = \frac{1}{T}\sum_{t=1}^{T-1}X_t+\frac{1}{T}X_T = (1 - \frac{1}{T})M_{T-1}+\frac{1}{T}X_T$$
Where $X_T$ is the new sample, and $M_{T-1}$ is the previous mean. This means that instead of needing to store all of the $X$'s, all we need to store is the current $X$ and the previous $M$, and then we calculate the current $M$.
Now, you may notice that there is this $\frac{1}{T}$ term in the equation. It is reasonable to ask: what if we set this term to be a constant? Then we get back precisely what looks like the cache update from RMSProp!
$$M_T = decay*M_{T-1} + (1 - decay)X_T$$At this point, it is a very reasonable thing to ask: "What is the cache of RMSProp really estimating?" Well, it is really estimating the mean of the square of the gradient!
And because we eventually take the square root of the cache, now you can see where RMSProp gets its name (RMS= "root mean square").
Also, it is worth remembering that the mean of a random variable can also be phrased as an expected value, aka: $mean(X) = E(X)$. In particular, when we take the expected value of the square of a random variable, we get the second moment: $$E(X^2) = 2nd \;moment\;of\;X$$
So, $E(g^2)$ is the second moment of the gradient, and we calculate it using this exponentially smoothed average. We are going to refer to it as $v$ from now on, by analogy with the variance: the second central moment of a random variable is its variance, and for a zero-mean variable that is exactly $E(g^2)$.
$$v \approx E(g^2)$$$$M_T = (1 - \frac{1}{T})M_{T-1}+\frac{1}{T}X_T$$$$v_t = decay * v_{t-1} + (1 - decay)*g^2$$Now, since there is a second moment, it is reasonable to ask: is there a first moment? Yes there is! It is just the expected value of $g$.
$$m_t = \mu m_{t-1} + (1 - \mu)g_t$$$$m \approx E(g)$$We will call this $m$ because the first moment of a random variable is the definition of the mean. Because the Adam Update makes use of the 1st and 2nd moments of $g$ using exponentially smoothed estimates, this also explains the name of the algorithm: Adaptive Moment Estimation.
Also, note that the update for $m$ looks a lot like momentum, hence why some people refer to Adam as RMSProp with momentum. Usually, though, momentum is implemented by adding the gradient to the momentum-scaled previous velocity rather than taking this weighted average; as a reminder, it usually looks like:
$$m_t = \mu m_{t-1} + g_t$$So, to get to our final destination, the Adam update, we are going to combine these two pieces: the first and second moments of $g$, or as we have been calling them, $m$ and $v$ (mean and variance). However, before we do that, there is one last important concept to discuss. The exponentially smoothed average acts very much like a smoothing function, ignoring much of the variation in a signal: it ignores rapid, very random spikes but follows the main trend (for example, an underlying sine wave in a noisy signal). In the field of signal processing this is known as a low-pass filter, because the high-frequency changes are filtered out.
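As a small illustration of this low-pass behaviour, here is a sketch that exponentially smooths a made-up noisy sine wave (the signal, noise level, and smoothing constant are arbitrary):
import numpy as np
import matplotlib.pyplot as plt
t = np.linspace(0, 4 * np.pi, 500)
X = np.sin(t) + 0.5 * np.random.randn(500)  # underlying sine wave plus random noise
alpha = 0.05
Y = np.zeros_like(X)
for i in range(1, len(X)):
    Y[i] = (1 - alpha) * Y[i - 1] + alpha * X[i]  # exponentially smoothed average
plt.plot(t, X, label='noisy signal')
plt.plot(t, Y, label='exponentially smoothed')
plt.legend()
plt.show()
Notice that the smoothed output starts at 0, which is exactly the initialization issue we turn to next.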
The problem with exponential smoothing is the same one we saw with RMSProp: what should the initial value be? The output is supposed to depend on the current value of the input plus the last value of the output. But since there is no last value of the output at $t = 0$, we typically just set it to 0. In doing so, we make the first values of the output small (here with decay = 0.99): $$Y(0)= 0$$ $$Y(1) = 0.99 * Y(0) + 0.01*X(1) = 0.01*X(1)$$
This means that initially your output will be biased towards 0!
Is there anything that we can do about the issue presented above? Well, there is a technique, bias correction, that makes up for this.
$$\hat{Y}(t) = \frac{Y(t)}{1 - decay^t}$$We divide by $1 - decay^t$. As you can see, when $t$ is small we are dividing by a very small number, which makes the initial values bigger. This is good because the initial values were too small.
Let's go through an example to make things very clear. Given a decay of 0.99, our original output would be:
$$Y(1) = 0.01*X(1)$$And $Y(2)$ would be:
$$Y(2) = 0.0099*X(1) + 0.01*X(2)$$But, if we correct it, then we have:
$$\hat{Y}(1) = \frac{0.01}{1 - 0.99^1}*X(1) = X(1)$$And our corrected $\hat{Y}(2)$:
$$\hat{Y}(2) = \frac{Y(2)}{0.0199} = 0.497*X(1) + 0.503*X(2)$$This also makes sense because it is about half of $X(1)$ plus about half of $X(2)$, with a bit more weight on the current value. We can also see that as $t$ approaches infinity, the bias correction term approaches 1, meaning that we quickly converge to the standard exponentially smoothed average. The important part is that our output now has the correct range of values when compared with the input, for all values of the input, including the initial ones.
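We can reproduce these numbers with a few lines of Python, tracking the coefficients on $X(1)$ and $X(2)$ rather than actual data values (this is just a check of the arithmetic above):
decay = 0.99
# uncorrected outputs, written as coefficients on X(1) and X(2)
y1 = {"X1": 1 - decay}                             # Y(1) = 0.01*X(1)
y2 = {"X1": decay * (1 - decay), "X2": 1 - decay}  # Y(2) = 0.0099*X(1) + 0.01*X(2)
# bias-corrected outputs: divide by (1 - decay**t)
y1_hat = {k: v / (1 - decay ** 1) for k, v in y1.items()}
y2_hat = {k: v / (1 - decay ** 2) for k, v in y2.items()}
print(y1_hat)  # {'X1': 1.0}
print(y2_hat)  # {'X1': 0.497..., 'X2': 0.502...}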
Now that we have discussed bias correction for exponentially smoothed averages, you might guess that we are going to apply it to the exponentially smoothed averages that we have been talking about, in particular $m$ and $v$. So, if we call $\hat{m}$ the bias corrected version of $m$, and $\hat{v}$ the bias corrected version of $v$, then we can formulate our Adam Update.
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$Note that there are two different decay rates for $m$ and $v$, which we call $\beta_1$ and $\beta_2$. Typical values are around 0.9 for $\beta_1$ and 0.999 for $\beta_2$. $\epsilon$ again takes values in the range of $10^{-8}$ to $10^{-10}$. Our final update for $w$ is:
$$w_{t+1} \leftarrow w_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t+\epsilon}}$$Now that was a ton of material, so let's recap! Adam is a modern adaptive learning rate technique that is the default go-to for many deep learning practitioners. It combines 2 proven techniques:
- RMSProp which is a cache mechanism (leaky cache)
- Momentum which speeds up training just by continuing to go in the same direction it was going before (keeping track of old gradients)
We saw that these two methods can be generalized, by observing that they estimate the first and second moments of the gradient. Adam also adds bias correction, which makes up for the fact that the exponentially smoothed average starts at zero, and hence the initial estimates are biased towards 0.
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$$$w_{t+1} \leftarrow w_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t+\epsilon}}$$At every batch (typical $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$):
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)g_t$$$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)g_t^2$$$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$$$w_{t+1} \leftarrow w_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t+\epsilon}}$$We are now going to take the time to compare the Adam Optimizer in code, vs RMSProp with Momentum.
Recall that the algorithm for RMSProp was:
RMSProp
# at every batch
cache = decay * cache + (1 - decay) * gradient **2
param = param - learning_rate * gradient / sqrt(cache + epsilon)
We can start with our imports.
import numpy as np
from sklearn.utils import shuffle
import matplotlib.pyplot as plt
from util import get_normalized_data, error_rate, cost, y2indicator
And now let's start our main method:
%matplotlib notebook
def main():
max_iter = 10
print_period = 10
X, Y = get_normalized_data()
reg = 0.01
Xtrain = X[:-1000,]
Ytrain = Y[:-1000]
Xtest = X[-1000:,]
Ytest = Y[-1000:]
Ytrain_ind = y2indicator(Ytrain)
Ytest_ind = y2indicator(Ytest)
N, D = Xtrain.shape
batch_sz = 500
n_batches = N // batch_sz
M = 300
K = 10
W1_0 = np.random.randn(D, M) / np.sqrt(D)
b1_0 = np.zeros(M)
W2_0 = np.random.randn(M, K) / np.sqrt(M)
b2_0 = np.zeros(K)
W1 = W1_0.copy()
b1 = b1_0.copy()
W2 = W2_0.copy()
b2 = b2_0.copy()
""" ----------------------------------- ADAM -----------------------------------"""
# 1st moment initialization (aka m_t at t = 0, = 0). Note, we could initialize these to
# an array of zeros, however it is not necessary considering numpy knows how to
# automatically broadcast a scalar when you add it to an array
mW1 = 0
mb1 = 0
mW2 = 0
mb2 = 0
# 2nd moment initialization
vW1 = 0
vb1 = 0
vW2 = 0
vb2 = 0
# define hyperparameters - will use the same parameters for ADAM and RMSProp
lr0 = 0.001
beta1 = 0.9 # we can think of this as momentum
beta2 = 0.999 # can think of this like RMSProp cache decay rate
eps = 1e-8
loss_adam = []
err_adam = []
t = 1
for i in range (max_iter):
for j in range(n_batches):
Xbatch = Xtrain[j*batch_sz:(j*batch_sz + batch_sz), ]
Ybatch = Ytrain_ind[j*batch_sz:(j*batch_sz + batch_sz), ]
pYbatch, Z = forward(Xbatch, W1, b1, W2, b2)
# ---- Updates ----
# 1st - calculate updated gradients for each array
gW2 = derivative_w2(Z, Ybatch, pYbatch) + reg*W2
gb2 = derivative_b2(Ybatch, pYbatch) + reg*b2
gW1 = derivative_w1(Xbatch, Z, Ybatch, pYbatch, W2) + reg*W1
gb1 = derivative_b1(Z, Ybatch, pYbatch, W2) + reg*b1
# Calculate updates for m, exponentially smoothed average of gradients, 1st moment
mW1 = beta1 * mW1 + (1 - beta1) * gW1
mb1 = beta1 * mb1 + (1 - beta1) * gb1
mW2 = beta1 * mW2 + (1 - beta1) * gW2
mb2 = beta1 * mb2 + (1 - beta1) * gb2
# Calculate updates for v, exponentially smoothed average of the square of the
# gradients, 2nd moment
vW1 = beta2 * vW1 + (1 - beta2) * gW1* gW1
vb1 = beta2 * vb1 + (1 - beta2) * gb1* gb1
vW2 = beta2 * vW2 + (1 - beta2) * gW2* gW2
vb2 = beta2 * vb2 + (1 - beta2) * gb2* gb2
# Bias correction step - by convention, first time index is 1
correction1 = 1 - beta1 ** t
hat_mW1 = mW1 / correction1
hat_mb1 = mb1 / correction1
hat_mW2 = mW2 / correction1
hat_mb2 = mb2 / correction1
correction2 = 1 - beta2 ** t
hat_vW1 = vW1 / correction2
hat_vb1 = vb1 / correction2
hat_vW2 = vW2 / correction2
hat_vb2 = vb2 / correction2
# Update t - the bias correction uses the 1-indexed step count, so it must advance once per batch
t += 1
# Finally, apply updates to the actual weights/params
W1 = W1 - lr0 * hat_mW1 / np.sqrt(hat_vW1 + eps)
b1 = b1 - lr0 * hat_mb1 / np.sqrt(hat_vb1 + eps)
W2 = W2 - lr0 * hat_mW2 / np.sqrt(hat_vW2 + eps)
b2 = b2 - lr0 * hat_mb2 / np.sqrt(hat_vb2 + eps)
# Keep track of costs and errors so we can plot later
if j % print_period == 0:
pY, _ = forward(Xtest, W1, b1, W2, b2)
l = cost(pY, Ytest_ind)
loss_adam.append(l)
print("Cost at iteration i=%d, j=%d: %.6f" % (i, j, l))
err = error_rate(pY, Ytest)
err_adam.append(err)
print("Error rate:", err)
pY, _ = forward(Xtest, W1, b1, W2, b2)
print("Final error rate:", error_rate(pY, Ytest))
print('-------------------------------- End of Adam---------------------------------')
""" ---------------------------- RMSProp with Momentum ---------------------------"""
W1 = W1_0.copy() # Same initial weights that were used in Adam
b1 = b1_0.copy()
W2 = W2_0.copy()
b2 = b2_0.copy()
loss_rms = []
err_rms = []
# Comparable hyperparameters so we can have a fair comparison
lr0 = 0.001 # Learning rate
mu = 0.9 # Momentum term, corresponds to beta1
decay_rate = 0.999 # Cache decay rate, corresponds to beta2
eps = 1e-8
# RMSProp Cache, initialized to 1 (same implementation as in TensorFlow)
# This needs to be done because RMSProp does not have bias correction
cache_W2 = 1
cache_b2 = 1
cache_W1 = 1
cache_b1 = 1
# Initialize Momentum/Velocity terms to 0, note no bias correction
dW1 = 0
db1 = 0
dW2 = 0
db2 = 0
for i in range(max_iter):
for j in range(n_batches):
Xbatch = Xtrain[j*batch_sz:(j*batch_sz + batch_sz),]
Ybatch = Ytrain_ind[j*batch_sz:(j*batch_sz + batch_sz),]
pYbatch, Z = forward(Xbatch, W1, b1, W2, b2)
# ---- Updates ----
# General form: do everything we did for RMSProp, and add momentum to it at the end.
# 1) Calculate the gradient
# 2) Calculate the cache update
# 3) Use a variation on momentum: instead of adding the full step to mu times the
#    previous velocity, we weight the step we would have taken (the RMSProp step,
#    lr0 * gradient / (sqrt(cache) + eps)) by (1 - mu). So the velocity becomes
#    mu * dW2 + (1 - mu) * (RMSProp step), and we then subtract it from the weights.
gW2 = derivative_w2(Z, Ybatch, pYbatch) + reg*W2
cache_W2 = decay_rate*cache_W2 + (1 - decay_rate)*gW2*gW2
dW2 = mu * dW2 + (1 - mu) * lr0 * gW2 / (np.sqrt(cache_W2) + eps)
W2 -= dW2
gb2 = derivative_b2(Ybatch, pYbatch) + reg*b2
cache_b2 = decay_rate*cache_b2 + (1 - decay_rate)*gb2*gb2
db2 = mu * db2 + (1 - mu) * lr0 * gb2 / (np.sqrt(cache_b2) + eps)
b2 -= db2
gW1 = derivative_w1(Xbatch, Z, Ybatch, pYbatch, W2) + reg*W1
cache_W1 = decay_rate*cache_W1 + (1 - decay_rate)*gW1*gW1
dW1 = mu * dW1 + (1 - mu) * lr0 * gW1 / (np.sqrt(cache_W1) + eps)
W1 -= dW1
gb1 = derivative_b1(Z, Ybatch, pYbatch, W2) + reg*b1
cache_b1 = decay_rate*cache_b1 + (1 - decay_rate)*gb1*gb1
db1 = mu * db1 + (1 - mu) * lr0 * gb1 / (np.sqrt(cache_b1) + eps)
b1 -= db1
if j % print_period == 0:
pY, _ = forward(Xtest, W1, b1, W2, b2)
l = cost(pY, Ytest_ind)
loss_rms.append(l)
print("Cost at iteration i=%d, j=%d: %.6f" % (i, j, l))
err = error_rate(pY, Ytest)
err_rms.append(err)
print("Error rate:", err)
pY, _ = forward(Xtest, W1, b1, W2, b2)
print("Final error rate:", error_rate(pY, Ytest))
plt.plot(loss_adam, label='adam')
plt.plot(loss_rms, label='rmsprop')
plt.legend()
plt.show()
main()
So we can see that RMSProp with momentum and Adam actually end up very close to one another. This makes sense because Adam is basically RMSProp with momentum added to it. The big difference is that Adam adds the bias correction step. We can see that this makes a big difference at the very beginning, but that this bias quickly goes away. You observe the same effect if you run just a low pass filter on a random signal. You can see that the output is initially biased towards zero, but that the bias goes away pretty quickly.