We will now take a look at one of the most effective methods for improving plain gradient descent: momentum. You can think of it as the change that gets you most of the benefit for very little extra effort in tuning your learning procedure.
A way to think of this is as follows: gradient descent without momentum requires a force, or push, each time we want the weights to move. In other words, each time we want to move, there has to be a gradient so that we can move in its direction. If we had momentum, our update could keep moving even without a gradient being present.
This can be thought of as pushing a box on ice vs. pushing a box on gravel. If we are pushing the box on gravel, the minute we stop applying force, the box also stops moving. This is analogous to gradient descent without momentum. However, if we were pushing the box on ice, we could give it a push and then let go, and it would continue moving for a period of time before stopping. This is analogous to gradient descent with momentum. Another way to phrase this is as follows:
With momentum included in our update, our weight vector will build up a velocity in any direction that has a consistent gradient.
Let's put this into math.
Our update for $\theta$ can be described as: $$\theta_t \leftarrow \theta_{t-1} - \eta g_{t-1}$$ This says that $\theta_t$ is equal to the previous value of $\theta$, minus the learning rate times the gradient $g_{t-1}$. From this we can see that if the gradient is 0, nothing happens to $\theta_t$: it just gets set to its old value and doesn't change.
Now let's say that we add in momentum. Note that the term momentum is used very loosely here, since it has nothing to do with actual physical momentum. What we do is create a new variable, $v$, which stands for the velocity. It is equal to $\mu$ (the momentum term) times its old velocity, minus the learning rate times the gradient. Notice that now, the gradient only directly influences the velocity, which in turn has an effect on the position (our weight vector), $\theta$.
$$v_t \leftarrow \mu v_{t-1} - \eta g_{t-1}$$This new term, $\mu v_{t-1}$, gives us the ability to "slide on ice" if you will. In other words, it allows us to continue to move in the same direction that we were going before. Now, we talked about how if a box is sliding on ice, it will still stop eventually. That means that we are going to want our updated $v$ to be a fraction of the prior $v$, and hence $\mu$ should be a fraction. Typical values of $\mu$ are 0.9, 0.95, 0.99, etc. This means that without any $g$, the equation will still eventually "slow down". Our update rule for $\theta_t$ then becomes:
$$\theta_t \leftarrow \theta_{t-1} + v_t$$Now, if we combine these two equations we can see that our total update rule is:
$$\theta_t \leftarrow \theta_{t-1} + \mu v_{t-1} -\eta g_{t-1} $$And we can see that if we set the momentum term, $\mu$, equal to zero, we end up with the same update rule we originally had for gradient descent.
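To make this concrete, below is a minimal sketch of the momentum update on a toy one-dimensional problem. Everything here is made up purely for illustration (the objective $J(\theta) = \theta^2$, the learning rate, and the iteration count); the full implementation on a real dataset comes later in this section.
# toy illustration: minimize J(theta) = theta^2, whose gradient is g = 2*theta
eta = 0.1    # learning rate
mu = 0.9     # momentum term
theta = 5.0  # starting position
v = 0.0      # initial velocity
for t in range(200):
    g = 2 * theta          # g_{t-1}: gradient at the previous position
    v = mu * v - eta * g   # v_t = mu * v_{t-1} - eta * g_{t-1}
    theta = theta + v      # theta_t = theta_{t-1} + v_t
print(theta)  # ends up close to the minimum at 0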
You may be wondering, what is the effect of using momentum? Well we can see below that by using momentum, the cost converges to its minimum value much faster. This significantly speeds up training!
From another perspective, we can think of a situation where we have unequal gradients in different directions. In the image below, we have a very large gradient that creates the valley (each side is very steep), and then in the other direction (the stream flowing down), it is a very small gradient.
For visualization purposes, let's assume we have 2 parameters to optimize: one along the steep walls of the valley and one along the shallow floor. The gradient in one direction is very steep, and in the other direction it is very shallow. The idea is that if you don't have momentum, you rely purely on the gradient, which points more in the steep direction than in the shallow one (this is just a property of the gradient: the negative gradient is the direction of steepest descent). Since this vector points mostly in the steep direction, we are going to zigzag back and forth across the valley. That is a very inefficient way of reaching the minimum.
Once we add momentum, however, things change. Because we move the same way along the shallow direction every time, those velocities accumulate: a portion of our old velocity is added to our new velocity, helping us along in that direction. The result is that we get there faster by taking bigger steps in the shallow direction of the gradient, as the sketch below illustrates.
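Below is a hypothetical 2-parameter example of this effect (the quadratic loss, learning rate, and iteration count are all made-up values): one coordinate of the loss is very steep and the other is very shallow, and we compare plain gradient descent with the momentum update.
import numpy as np
# hypothetical valley: J(w) = 50*w[0]^2 + 0.5*w[1]^2, steep along w[0], shallow along w[1]
def grad(w):
    return np.array([100.0 * w[0], 1.0 * w[1]])
eta, mu = 0.009, 0.9
w_gd = np.array([1.0, 10.0])  # plain gradient descent
w_mo = np.array([1.0, 10.0])  # gradient descent with momentum
v = np.zeros(2)
for t in range(100):
    w_gd = w_gd - eta * grad(w_gd)
    v = mu * v - eta * grad(w_mo)
    w_mo = w_mo + v
print(w_gd)  # the shallow coordinate w[1] has barely moved (still around 4)
print(w_mo)  # momentum has accumulated velocity along w[1] and is much closer to the minimum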
Nesterov momentum was introduced by Yurii Nesterov in 1983. It is described as:
"A method for unconstrained convex minimization problem with rate of convergence O(1/$k^2$)"
The core idea is this: when the current weight vector is at some position, let's say $w$, then looking at the original momentum update from earlier, we know that the momentum term alone (ignoring the gradient term) is about to nudge the parameter vector by $\mu v_{t-1}$. Therefore, if we are about to compute the gradient, we can treat the approximate future position, $w + \mu v_{t-1}$, as a "lookahead": a point in the vicinity of where we are going to end up. Hence, it makes sense to compute the gradient at $w + \mu v_{t-1}$ instead of at the old/stale position $w$.
The image above makes this clear: instead of evaluating the gradient at the current position of $w$ (red circle), we know that our momentum is about to carry us to the tip of the green arrow, so with Nesterov momentum we instead evaluate the gradient at this "looked-ahead" position. Also, keep in mind that in the image above the blue vector refers to our update to the velocity, $v_t$, and not to the update of $w_t$.
Okay, so we have a basic idea of Nesterov momentum now, but let's reiterate it from a few different perspectives to help it sink in. Instead of just using momentum to blindly keep going in the direction we were already going, let's instead peek ahead by taking a big jump in the direction of the previous velocity, and calculate the gradient from there. We can think of it like gambling: if you are going to gamble, it is better to take a big jump and then make a correction than to make a correction and then gamble.
So first we peek ahead, jumping in the direction of the previous velocity (accumulated gradient):
We then measure the gradient at that point and go downhill in the direction of the gradient. We use that gradient to update our velocity (accumulated gradient). In other words, we combine the big jump with the gradient to get the new accumulated gradient. So in a way, it's peeking ahead and then course-correcting based on where we would have ended up.
We then take that accumulated gradient (first green vector), multiply it by some momentum constant, $\mu$, and take the next big jump in the direction of that accumulated gradient. Again, at the place where we end up (head of second brown vector), we measure the gradient and go downhill (second red vector) to correct any errors we have made, and we get a new accumulated gradient (second green vector).
We can see that the blue vectors represent what we would do with standard momentum: we first measure the gradient at the current position (small blue vector), add it to the brown vector, and end up making a jump given by the big blue vector (first brown vector plus the small blue vector, i.e. the current gradient). The brown vector represents our peek-ahead step. Notice that it is in the same direction as the blue vector. The red vector is the gradient at the peek-ahead point. The green vector is simply the sum of the brown vector and the red vector.
So, with the visuals discussed, what do the equations look like? First, we are going to use $w$ to represent our weights instead of $\theta$. Also, the majority of this is concerned with how we update $v_t$; the last step covers $w_t$. Now, let's start with the vector that represents the previous value of our weights, $w_{t-1}$, and the previous velocity, $v_{t-1}$:
Now, we have this jump ahead, which we can call $w'_{t-1}$. We can think of it as just our previous weight position plus the momentum step: $$w'_{t-1} = w_{t-1} + \mu v_{t-1}$$ The jump points in the same direction as our previous velocity vector (because, remember, the first part of updating $v_t$ was the term $\mu v_{t-1}$), but it is slightly shorter since it is scaled by $\mu$:
In the image, the jump can be written starting from either $w_{t-1}$ or $v_{t-1}$; these are equivalent because both vectors have their head at the same position (the current weights). Also, note that as seen in the image above, the jump-ahead point is just the previous weight position, $w_{t-1}$, plus the previous velocity multiplied by the momentum term. Next, we calculate the gradient at this jump-ahead point, and then use it to update $v$: $$v_t \leftarrow \mu v_{t-1} - \eta \nabla J(w'_{t-1})$$
Which is equal to:
$$v_t \leftarrow \mu v_{t-1} - \eta \nabla J(w_{t-1} +\mu v_{t-1})$$And then the last step is to update $w_t$ using this new velocity, the same as for standard momentum: $$w_t \leftarrow w_{t-1} + v_t$$
The main difference to note is that in the standard method we take the gradient at the current position of $w$ and also make our momentum step from the current position of $w$, whereas in the Nesterov method we first take our momentum step from the current position of $w$, and then correct it by taking the gradient from that new position. For a great link that goes over this topic in more detail, check out the following: http://cs231n.github.io/neural-networks-3/
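As a minimal sketch of this "lookahead" form (again on a made-up 1-D quadratic with grad(w) = 2*w, not the actual network), the update loop might look like this:
def grad(w):
    return 2 * w  # gradient of the toy objective J(w) = w^2
eta, mu = 0.1, 0.9
w, v = 5.0, 0.0
for t in range(200):
    w_ahead = w + mu * v              # peek ahead to where momentum is about to carry us
    v = mu * v - eta * grad(w_ahead)  # evaluate the gradient at the lookahead point
    w = w + v                         # then update the weights as in standard momentum
print(w)  # converges towards the minimum at 0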
However, in practice this is not how Nesterov momentum is usually implemented. Instead, we reformulate the equations. Let's try to express everything only in terms of $w'$, the lookahead value of $w$, which is where we calculate the gradient. So, we can define $w'_t$ and $w'_{t-1}$ using the same definition: $$w'_t = w_t + \mu v_t$$ $$w'_{t-1} = w_{t-1} + \mu v_{t-1}$$
In other words, these are the lookahead values of $w$ at two consecutive steps. The second equation is illustrated below:
So, remember, the lookahead value can just be thought of as a step in the direction of the previous velocity, taken from the current weight vector position. Next, let's recall our updates for $v$ and $w$: $$v_t = \mu v_{t-1} - \eta \nabla J(w'_{t-1})$$ $$w_t = w_{t-1} + v_t$$
From here we can substitute $w'$ for $w$: $$w'_t - \mu v_t = w'_{t-1} - \mu v_{t-1} + v_t$$ $$w'_t = w'_{t-1} - \mu v_{t-1} + (1 + \mu) v_t$$
We can get rid of the $v_t$ term on the right by replacing it with its expression in terms of $v_{t-1}$. $$w'_t = w'_{t-1} - \mu v_{t-1} + (1+\mu) \Big[\mu v_{t-1} - \eta \nabla J (w'_{t-1})\Big]$$
This equation is the one you will most often see when people discuss Nesterov momentum.
So, to quickly recap, the equation for regular momentum is:
$$w_t = w_{t-1} + \mu v_{t-1} - \eta \nabla J (w_{t-1})$$And Nesterov Momentum is defined as:
$$w'_t = w'_{t-1} + \mu^2 v_{t-1} - (1+ \mu) \eta \nabla J (w'_{t-1})$$We can see that they each have a similar form: there is a previous $w$ term, a previous velocity term, and a previous gradient term. And if we take the Nesterov equation and rephrase the $w'_t$ update in terms of $v_t$, we can see that the equations have an even greater resemblance:
$$v_t = \mu v_{t-1} - \eta \nabla J (w'_{t-1})$$$$w'_t = w'_{t-1} + \mu v_t - \eta \nabla J (w'_{t-1})$$The question may arise: why is this useful? What is the point of all of this algebraic manipulation? Well, we have expressed the Nesterov momentum update entirely in terms of the lookahead $w'$s, so the actual non-lookahead $w$s are never needed. We can just treat $w$, whatever it may be, as if it were the lookahead value. In other words, let's just drop the prime symbols!
$$v_t = \mu v_{t-1} - \eta \nabla J (w_{t-1})$$$$w_t = w_{t-1} + \mu v_t - \eta \nabla J (w_{t-1})$$So we can see that similar to regular momentum, we just proceed in two steps.
The final updates for regular momentum are: $$v_t \leftarrow \mu v_{t-1} - \eta \nabla J(w_{t-1})$$ $$w_t \leftarrow w_{t-1} + v_t$$
And for Nesterov momentum: $$v_t = \mu v_{t-1} - \eta \nabla J (w_{t-1})$$ $$w_t = w_{t-1} + \mu v_t - \eta \nabla J (w_{t-1})$$
So the main difference can be clearly distinguished: the update rule for $w_t$ is slightly different. However, in practice the loss per iteration tends to be rather similar, so choosing between the two is not a huge deal.
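As a quick numerical sanity check (using the same toy quadratic gradient assumption as in the earlier sketches), we can verify that the lookahead form and the rewritten, primes-dropped form really do trace out the same sequence of lookahead weights:
import numpy as np
def grad(w):
    return 2 * w  # toy gradient, J(w) = w^2
eta, mu = 0.1, 0.9
# formulation 1: lookahead form, tracking w and v, and recording w' = w + mu*v
w, v = 5.0, 0.0
lookahead_trace = []
for t in range(10):
    v = mu * v - eta * grad(w + mu * v)
    w = w + v
    lookahead_trace.append(w + mu * v)
# formulation 2: rewritten form in terms of w' only (primes dropped); w'_0 = w_0 since v_0 = 0
wp, v = 5.0, 0.0
rewritten_trace = []
for t in range(10):
    g = grad(wp)
    v = mu * v - eta * g
    wp = wp + mu * v - eta * g
    rewritten_trace.append(wp)
print(np.allclose(lookahead_trace, rewritten_trace))  # True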
Let's now implement 3 different scenarios: batch gradient descent without momentum, with standard momentum, and with Nesterov momentum.
We can start with our imports:
import numpy as np
from sklearn.utils import shuffle
import matplotlib.pyplot as plt
from util import get_normalized_data, error_rate, cost, y2indicator
And now let's quickly define our forward and derivative functions. Note that I have commented out the sigmoid code; the same calculations can be performed with the sigmoid instead of the ReLU.
def forward(X, W1, b1, W2, b2):
# sigmoid
# Z = 1 / (1 + np.exp(-( X.dot(W1) + b1 )))
# relu
Z = X.dot(W1) + b1
Z[Z < 0] = 0
A = Z.dot(W2) + b2
expA = np.exp(A)
Y = expA / expA.sum(axis=1, keepdims=True)
return Y, Z
def derivative_w2(Z, T, Y):
return Z.T.dot(Y - T)
def derivative_b2(T, Y):
return (Y - T).sum(axis=0)
def derivative_w1(X, Z, T, Y, W2):
# return X.T.dot( ( ( Y-T ).dot(W2.T) * ( Z*(1 - Z) ) ) ) # for sigmoid
return X.T.dot( ( ( Y-T ).dot(W2.T) * (Z > 0) ) ) # for relu
def derivative_b1(Z, T, Y, W2):
# return (( Y-T ).dot(W2.T) * ( Z*(1 - Z) )).sum(axis=0) # for sigmoid
return (( Y-T ).dot(W2.T) * (Z > 0)).sum(axis=0) # for relu
For this example we have imported get_normalized_data, which means that we will be using the full 784-dimensional data set. Now let's define our setup code, where the goal is to prep our data for the 3 scenarios discussed above:
max_iter = 20 # 20 full passes (epochs) over all of the batches
print_period = 10
X, Y = get_normalized_data() # grab our normalized X and Y data
lr = 0.00004 # precomputed learning and reg rates
reg = 0.01
# create train and test set (data is already shuffled), as well as one hot matrix
Xtrain = X[:-1000,] # grab everything except the last 1000 samples for the training set
Ytrain = Y[:-1000]
Xtest = X[-1000:,] # the last 1000 samples become the test set
Ytest = Y[-1000:]
Ytrain_ind = y2indicator(Ytrain) # turn targets into one hot encoded matrices
Ytest_ind = y2indicator(Ytest)
N, D = Xtrain.shape
batch_sz = 500 # set batch size to 500
n_batches = N // batch_sz # get number of batches
M = 300 # number hidden units, we found this in prev demo
K = Ytrain_ind.shape[1] # number of output classes
# intialize weights to random small values
W1 = np.random.randn(D, M) / np.sqrt(D)
b1 = np.zeros(M)
W2 = np.random.randn(M, K) / np.sqrt(M)
b2 = np.zeros(K)
# and let's save initial weights so we can compare all methods!
W1_0 = W1.copy()
b1_0 = b1.copy()
W2_0 = W2.copy()
b2_0 = b2.copy()
# ------------------ test number 1, batch GD without momentum ------------------
losses_batch = []
errors_batch = []
for i in range(max_iter): # iterate through all batches
for j in range(n_batches): # iterate through each specific batch
# get batch size, make predictions
Xbatch = Xtrain[j*batch_sz:(j*batch_sz + batch_sz),]
Ybatch = Ytrain_ind[j*batch_sz:(j*batch_sz + batch_sz),]
pYbatch, Z = forward(Xbatch, W1, b1, W2, b2)
# perform updates
W2 -= lr * (derivative_w2(Z, Ybatch, pYbatch) + reg*W2)
b2 -= lr * (derivative_b2(Ybatch, pYbatch) + reg*b2)
W1 -= lr * (derivative_w1(Xbatch, Z, Ybatch, pYbatch, W2) + reg*W1)
b1 -= lr * (derivative_b1(Z, Ybatch, pYbatch, W2) + reg*b1)
# if print period:
if j % print_period == 0:
pY, _ = forward(Xtest, W1, b1, W2, b2)
l = cost(pY, Ytest_ind)
losses_batch.append(l)
print("Cost at iteration i=%d, j=%d: %.6f" % (i, j, l))
e = error_rate(pY, Ytest)
errors_batch.append(e)
print("Error rate:", e)
# print final error rate
pY, _ = forward(Xtest, W1, b1, W2, b2)
print("Final error rate:", error_rate(pY, Ytest))
print("--------------------------------------------------")
# --------------- test number 2, batch GD with regular momentum ------------------
W1 = W1_0.copy()
b1 = b1_0.copy()
W2 = W2_0.copy()
b2 = b2_0.copy()
losses_momentum = []
errors_momentum = []
mu = 0.9 # momentum parameter, think viscosity
dW2 = 0 # need to keep track of the previous weight changes (gradients)
db2 = 0 # think of these as velocity
dW1 = 0
db1 = 0
for i in range(max_iter): # iterate through all batches
for j in range(n_batches): # iterate through each specific batch
# get batch size, make predictions
Xbatch = Xtrain[j*batch_sz:(j*batch_sz + batch_sz),]
Ybatch = Ytrain_ind[j*batch_sz:(j*batch_sz + batch_sz),]
pYbatch, Z = forward(Xbatch, W1, b1, W2, b2)
# calculate the gradients
gW2 = derivative_w2(Z, Ybatch, pYbatch) + reg*W2
gb2 = derivative_b2(Ybatch, pYbatch) + reg*b2
gW1 = derivative_w1(Xbatch, Z, Ybatch, pYbatch, W2) + reg*W1
gb1 = derivative_b1(Z, Ybatch, pYbatch, W2) + reg*b1
# update our velocities - based on regular momentum equation
dW2 = mu*dW2 - lr*gW2
db2 = mu*db2 - lr*gb2
dW1 = mu*dW1 - lr*gW1
db1 = mu*db1 - lr*gb1
# update weights
W2 += dW2
b2 += db2
W1 += dW1
b1 += db1
# if print period:
if j % print_period == 0:
pY, _ = forward(Xtest, W1, b1, W2, b2)
l = cost(pY, Ytest_ind)
losses_momentum.append(l)
print("Cost at iteration i=%d, j=%d: %.6f" % (i, j, l))
e = error_rate(pY, Ytest)
errors_momentum.append(e)
print("Error rate:", e)
# print final error rate
pY, _ = forward(Xtest, W1, b1, W2, b2)
print("Final error rate:", error_rate(pY, Ytest))
print("--------------------------------------------------")
# --------------- test number 3, batch GD with Nesterov momentum ------------------
W1 = W1_0.copy()
b1 = b1_0.copy()
W2 = W2_0.copy()
b2 = b2_0.copy()
losses_nesterov = []
errors_nesterov = []
mu = 0.9 # momentum parameter, think viscosity
vW2 = 0 # need to keep track of the previous weight changes (gradients)
vb2 = 0 # think of these as velocity
vW1 = 0
vb1 = 0
for i in range(max_iter): # iterate through all batches
for j in range(n_batches): # iterate through each specific batch
# get batch size, make predictions
Xbatch = Xtrain[j*batch_sz:(j*batch_sz + batch_sz),]
Ybatch = Ytrain_ind[j*batch_sz:(j*batch_sz + batch_sz),]
pYbatch, Z = forward(Xbatch, W1, b1, W2, b2)
# calculate the gradients
gW2 = derivative_w2(Z, Ybatch, pYbatch) + reg*W2
gb2 = derivative_b2(Ybatch, pYbatch) + reg*b2
gW1 = derivative_w1(Xbatch, Z, Ybatch, pYbatch, W2) + reg*W1
gb1 = derivative_b1(Z, Ybatch, pYbatch, W2) + reg*b1
# update our velocities - based on Nesterov momentum equations
vW2 = mu*vW2 - lr*gW2
vb2 = mu*vb2 - lr*gb2
vW1 = mu*vW1 - lr*gW1
vb1 = mu*vb1 - lr*gb1
# update our weights - based on nesterov momentum equations
W2 += mu*vW2 - lr*gW2
b2 += mu*vb2 - lr*gb2
W1 += mu*vW1 - lr*gW1
b1 += mu*vb1 - lr*gb1
# if print period:
if j % print_period == 0:
pY, _ = forward(Xtest, W1, b1, W2, b2)
l = cost(pY, Ytest_ind)
losses_nesterov.append(l)
print("Cost at iteration i=%d, j=%d: %.6f" % (i, j, l))
e = error_rate(pY, Ytest)
errors_nesterov.append(e)
print("Error rate:", e)
# print final error rate
pY, _ = forward(Xtest, W1, b1, W2, b2)
print("Final error rate:", error_rate(pY, Ytest))
print("--------------------------------------------------")
%matplotlib notebook
plt.plot(losses_batch, label="batch")
plt.plot(losses_momentum, label="momentum")
plt.plot(losses_nesterov, label="nesterov")
plt.legend()
plt.show()
Momentum is generally looked at as the most impactful method when it comes to speeding up training. If you compare the loss per iteration between standard gradient descent, and gradient descent with momentum, the difference is huge!
Additionally, momentum is nice because you don't have to play with the hyperparameters very much. You can just set it to 0.9 and it is usually fine, resulting in huge performance gains. However, some adaptive learning rate techniques are very powerful, so let's take a look!
The first new approach we will look at is a variable learning rate, i.e. a learning rate that is a function of time, $\eta(t)$. This is sometimes referred to as "learning rate scheduling". The first schedule worth mentioning is step decay:
Periodically, say every 100 iterations, we reduce the learning rate by a constant factor. For example, if we repeatedly divide by 2, we halve it each time.
In the second method, exponential decay, the learning rate follows an exponential curve:
$$\eta(t) = A e^{-kt}$$A third option is $1/t$ decay, where the learning rate decays proportionally to $\frac{1}{t}$: $$\eta(t) = \frac{A}{kt + 1}$$ In this case the dropoff is slower than with exponential decay.
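Below is a minimal sketch of these three schedules as plain Python functions; the constants A and k, the drop factor, and the step interval are all made-up values for illustration.
import numpy as np
def step_decay(t, eta0=0.01, drop=0.5, every=100):
    # cut the learning rate in half every 100 iterations
    return eta0 * drop ** (t // every)
def exponential_decay(t, A=0.01, k=1e-3):
    # eta(t) = A * exp(-k*t)
    return A * np.exp(-k * t)
def inverse_decay(t, A=0.01, k=1e-3):
    # eta(t) = A / (k*t + 1), drops off more slowly than exponential decay
    return A / (k * t + 1)
for t in [0, 100, 1000, 10000]:
    print(t, step_decay(t), exponential_decay(t), inverse_decay(t))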
Each of these methods decreases the learning rate as time goes on (i.e. as the number of iterations increases). The question comes up: why would we want to do that? Generally speaking, when we initialize the weights of a neural network, they are going to be very far from the optimal weights, so it is good to start with a large learning rate and take bigger steps towards the goal. This is the motivation behind momentum as well: we want to pick up speed by accumulating past gradients, because we know that if we are very far away from our goal, those gradients should be large. However, as we get close to the goal, the gradient is going to shrink. In fact, by definition, the gradient must be 0 at the minimum of a function; that is how we solve for the minimum in calculus.
But why would we want to slow down as we get close to the minimum? Well, when you are close to the minimum and take too big of a step, you are going to overshoot it. What ends up happening is that you just bounce back and forth. In fact, if your learning rate is too large, you can bounce right out of the valley and your loss may increase! So, in order to reduce all of this bouncing around, we would like to take smaller steps.
One other thing to keep in mind is the following caveat: all of these methods introduce more hyperparameters to optimize, which adds more work to your plate as a data scientist.
The first adaptive learning technique we will go over is AdaGrad. The main idea behind it is this: we cannot expect the dependence of the cost on each of the parameters to be the same. In other words, in one direction the gradient may be very steep, but in another direction the gradient may be very flat. So, perhaps it may be beneficial to adapt the learning rate for each parameter individually, based on how much it has changed in the past.
So, in AdaGrad what we do is introduce a variable called the cache.
$$cache = cache + gradient^2$$$$w \leftarrow w - \eta \frac{\nabla J}{\sqrt{cache + \epsilon }}$$Each parameter of the neural network has its own cache, so for example if you have one weight matrix of size (3 x 4), you will also have a cache matrix of size (3 x 4). The same thing applies to the bias vectors.
The idea behind the cache is that it accumulates the squared gradients. Because we are squaring the gradients, the cache is always going to be positive. And because each parameter has its own cache, if one parameter has had a lot of large gradients in the past, then its cache will be very large, its effective learning rate will be very small, and it will change more slowly in the future. On the other hand, if a parameter has had a lot of small gradients in the past, then its cache will be small, so its effective learning rate will remain large, and it will have more opportunity to change in the future.
One small thing to keep in mind is that we usually add a small number $\epsilon$ to the denominator in order to prevent dividing by 0. This is generally set to $10^{-8}$ or $10^{-10}$.
One important thing to stress about AdaGrad is that everything we are doing is an element-wise operation. In other words, each scalar parameter and its learning rate is effectively updated independently of the others. So you can look at the rules that we presented as scalar updates that apply to all of the parameters, or you can think of one huge parameter vector that contains all of the neural network parameters, and that each of the operations is an element wise operation.
This next technique builds on the observation that AdaGrad decreases the learning rate too aggressively: the effective learning rate approaches 0 too quickly, when in fact there is still learning to be done. The method introduced to fix this is called RMSProp, and it was introduced by Geoff Hinton and his team. It works as follows:
The reason that AdaGrad decreases the learning rate too quickly is that the cache grows too fast. So, in order to make the cache grow more slowly, we actually decrease it each time we update it. This is done by taking a weighted average of the old cache and the new squared gradient. We call this weight the decay rate, and the two weights add up to 1.
$$cache = decay*cache +(1 - decay)*gradient^2$$Typical values for the decay are 0.99, 0.999, etc. The intuition for these choices will be discussed in later lectures. By doing this, we say that we are making the cache "leaky". The reason it is leaky is that if we had 0 gradient for a long time, the cache would eventually shrink back down to 0, because it is decreased by the decay rate on each round.
Note that there is some ambiguity in the RMSProp and AdaGrad algorithms, which is easiest to see in the context of RMSProp. The ambiguity arises from the initial value of the cache. You might automatically assume that 0 is a good value, but this actually has a very strange effect. Let's say: $$decay = 0.999$$ That means our first cache value is:
$$0.001g^2$$And hence our first update step is (ignoring epsilon): $$\frac{\eta g}{\sqrt{0.001g^2}}$$ which ends up being very large, because the denominator is very small. You would have to compensate by making your initial learning rate smaller than usual. One solution, however, is just to set the cache to 1 instead. By manipulating the RMSProp update, we can show that this gives us approximately what we would get if we were not doing RMSProp at all:
$$\nabla w = \eta \frac{g}{\sqrt{1 + 0.001 g^2}} \approx \eta g$$You may still be wondering which is the correct way to go. Well, in TensorFlow the cache is initialized to 1; in Keras it is initialized to 0.
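Here is a tiny numeric illustration of that initialization issue, using a made-up gradient of $g = 1$ and a made-up learning rate of 0.001, together with the decay value from above:
import numpy as np
g, decay, eta, eps = 1.0, 0.999, 0.001, 1e-10
cache0 = decay * 0.0 + (1 - decay) * g**2  # first cache update when the cache starts at 0
cache1 = decay * 1.0 + (1 - decay) * g**2  # first cache update when the cache starts at 1
print(eta * g / np.sqrt(cache0 + eps))  # ~0.0316, more than 30x larger than eta
print(eta * g / np.sqrt(cache1 + eps))  # ~0.001, essentially a plain gradient descent step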
AdaGrad
# at every batch
cache += gradient ** 2
param = param - learning_rate * gradient / sqrt(cache + epsilon)
RMSProp
# at every batch
cache = decay * cache + (1 - decay) * gradient **2
param = param - learning_rate * gradient / sqrt(cache + epsilon)
Note: epsilon is generally $10^{-8}$, $10^{-9}$, or $10^{-10}$, and the decay is generally 0.9, 0.99, or 0.999.
Here we are going to be comparing RMSProp against a constant learning rate. We can start by defining our standard multi-layer perceptron functions:
def forward(X, W1, b1, W2, b2):
Z = X.dot(W1) + b1
Z[Z < 0] = 0 # using ReLU function
A = Z.dot(W2) + b2
expA = np.exp(A)
Y = expA / expA.sum(axis=1, keepdims=True)
return Y, Z
def derivative_w2(Z, T, Y):
return Z.T.dot(Y - T)
def derivative_b2(T, Y):
return (Y - T).sum(axis=0)
def derivative_w1(X, Z, T, Y, W2):
return X.T.dot( ( (Y - T).dot(W2.T) * (Z > 0) ) ) # for relu
def derivative_b1(Z, T, Y, W2):
return ((Y - T).dot(W2.T) * (Z > 0)).sum(axis=0) # for relu
And now for our imports.
import numpy as np
from sklearn.utils import shuffle
import matplotlib.pyplot as plt
%matplotlib inline
from util import get_normalized_data, error_rate, cost, y2indicator
Now we can start our main function:
def main():
max_iter = 20
print_period = 10
X, Y = get_normalized_data()
lr = 0.00004
reg = 0.01
# get train/test data
Xtrain = X[:-1000,]
Ytrain = Y[:-1000]
Xtest = X[-1000:,]
Ytest = Y[-1000:]
Ytrain_ind = y2indicator(Ytrain)
Ytest_ind = y2indicator(Ytest)
N, D = Xtrain.shape
batch_sz = 500
n_batches = N // batch_sz
# 300 hidden units, 10 output classes, then initialize our weights
M = 300
K = 10
W1 = np.random.randn(D, M) / 28
b1 = np.zeros(M)
W2 = np.random.randn(M, K) / np.sqrt(M)
b2 = np.zeros(K)
# --------------------- 1. Constant Learning Rate -----------------------------
# cost = -16
LL_batch = []
CR_batch = []
for i in range(max_iter): # looping through total iterations
for j in range(n_batches): # looping through number of batches
Xbatch = Xtrain[j*batch_sz:(j*batch_sz + batch_sz),]
Ybatch = Ytrain_ind[j*batch_sz:(j*batch_sz + batch_sz),]
pYbatch, Z = forward(Xbatch, W1, b1, W2, b2)
# weight and bias updates
W2 -= lr*(derivative_w2(Z, Ybatch, pYbatch) + reg*W2)
b2 -= lr*(derivative_b2(Ybatch, pYbatch) + reg*b2)
W1 -= lr*(derivative_w1(Xbatch, Z, Ybatch, pYbatch, W2) + reg*W1)
b1 -= lr*(derivative_b1(Z, Ybatch, pYbatch, W2) + reg*b1)
# print/debugging step
if j % print_period == 0:
# calculate just for LL
pY, _ = forward(Xtest, W1, b1, W2, b2)
# print "pY:", pY
ll = cost(pY, Ytest_ind)
LL_batch.append(ll)
print("Cost at iteration i=%d, j=%d: %.6f" % (i, j, ll))
err = error_rate(pY, Ytest)
CR_batch.append(err)
print("Error rate:", err)
pY, _ = forward(Xtest, W1, b1, W2, b2)
print("Final error rate:", error_rate(pY, Ytest))
print('-------------------- END --------------------------')
# --------------------- 2. RMSProp -----------------------------
W1 = np.random.randn(D, M) / 28
b1 = np.zeros(M)
W2 = np.random.randn(M, K) / np.sqrt(M)
b2 = np.zeros(K)
LL_rms = []
CR_rms = []
lr0 = 0.001 # Initial Learning Rate. If you set this too high you'll get NaN!
cache_W2 = 1 # storing cache for each of our weights
cache_b2 = 1
cache_W1 = 1
cache_b1 = 1
decay_rate = 0.999 # our decay rate parameter and epsilon
eps = 1e-10
# calculate batches in same way
for i in range(max_iter):
for j in range(n_batches):
Xbatch = Xtrain[j*batch_sz:(j*batch_sz + batch_sz),]
Ybatch = Ytrain_ind[j*batch_sz:(j*batch_sz + batch_sz),]
pYbatch, Z = forward(Xbatch, W1, b1, W2, b2) # forward pass
# weight and bias updates, WITH RMSProp
gW2 = derivative_w2(Z, Ybatch, pYbatch) + reg*W2
cache_W2 = decay_rate*cache_W2 + (1 - decay_rate)*gW2*gW2 # elem by elem mult
W2 -= lr0 * gW2 / (np.sqrt(cache_W2) + eps)
gb2 = derivative_b2(Ybatch, pYbatch) + reg*b2
cache_b2 = decay_rate*cache_b2 + (1 - decay_rate)*gb2*gb2
b2 -= lr0 * gb2 / (np.sqrt(cache_b2) + eps)
gW1 = derivative_w1(Xbatch, Z, Ybatch, pYbatch, W2) + reg*W1
cache_W1 = decay_rate*cache_W1 + (1 - decay_rate)*gW1*gW1
W1 -= lr0 * gW1 / (np.sqrt(cache_W1) + eps)
gb1 = derivative_b1(Z, Ybatch, pYbatch, W2) + reg*b1
cache_b1 = decay_rate*cache_b1 + (1 - decay_rate)*gb1*gb1
b1 -= lr0 * gb1 / (np.sqrt(cache_b1) + eps)
if j % print_period == 0:
# calculate just for LL
pY, _ = forward(Xtest, W1, b1, W2, b2)
# print "pY:", pY
ll = cost(pY, Ytest_ind)
LL_rms.append(ll)
print("Cost at iteration i=%d, j=%d: %.6f" % (i, j, ll))
err = error_rate(pY, Ytest)
CR_rms.append(err)
print("Error rate:", err)
pY, _ = forward(Xtest, W1, b1, W2, b2)
print("Final error rate:", error_rate(pY, Ytest))
print('-------------------- END --------------------------')
plt.plot(LL_batch, label='const')
plt.plot(LL_rms, label='rms')
plt.legend()
plt.show()
main()
One of the newest and most common optimization techniques these days is known as the Adam optimizer (link to the original paper here: https://arxiv.org/pdf/1412.6980.pdf). It is going to be given its own section altogether, since as of 2017 it is often considered the go-to optimizer for those doing deep learning. Sometimes people describe Adam as being like RMSProp, but with momentum. This can be slightly confusing, considering that TensorFlow has RMSProp and allows you to add momentum to it, which is not Adam. In other words, you can have RMSProp with momentum, but that doesn't give you Adam; it just gives you RMSProp with momentum. Adam does, in a sense, add something like momentum to RMSProp, but in a very specific way. In order to fully understand Adam, we first need to look at part of the theory it depends on: exponentially smoothed averages.
We are going to take a minute to dig into something that may seem straightforward: how to calculate an average. The first thought we all have when asked to do this is: why not just add up all of the sample data points and divide by the number of data points, giving the sample mean:
$$\bar{X}_N = \frac{1}{N}\sum_{n=1}^NX_n$$But now let's suppose that you have a large amount of data, so much so that all of your X's cannot fit into memory at the same time. Is it still possible to calculate the sample mean? Yes it is! We can read in one data point at a time, and delete each data point after we've looked at it. It is shown below that the current sample mean can actually be expressed in terms of the previous sample mean and the current data point.
$$\bar{X}_N = \frac{1}{N}\sum_{n=1}^NX_n = \frac{1}{N}\Big((N-1)\bar{X}_{N-1} + X_N \Big) = (1 - \frac{1}{N})\bar{X}_{N-1}+\frac{1}{N}X_N$$We can then express this using simpler symbols. We can call $Y$ our output, and we can use $t$ to represent the current time step:
$$Y_t = (1 - \frac{1}{t})Y_{t-1} + \frac{1}{t}X_t$$Great, so we have solved our problem of how to calculate the sample mean when we can't fit all of the data into memory, but we can see that there is this $\frac{1}{t}$ term. This says that as $t$ grows larger, the current sample has less and less of an effect on the total mean. Clearly this makes sense, because as $t$ grows that means the total number of $X$'s we've seen has grown. We also decrease the influence of the previous $Y$ by $1 - \frac{1}{t}$. This means that each new $Y$ is just part of the old $Y$, plus part of the newest $X$. But in the end, it balances out to give us exactly the sample mean of $X$.
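A quick sketch to convince ourselves that the recursive formula really does recover the sample mean, processing one (made-up) data point at a time:
import numpy as np
X = np.random.randn(1000) + 3.0  # arbitrary data with mean roughly 3
Y = 0.0
for t, x in enumerate(X, start=1):
    Y = (1 - 1.0 / t) * Y + (1.0 / t) * x  # Y_t = (1 - 1/t)*Y_{t-1} + (1/t)*X_t
print(Y, X.mean())  # the two values agree up to floating point error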
For convenience we can call this $\frac{1}{t}$ term $\alpha_t$. What if we were to say that we did not want $\alpha$ to be $\frac{1}{t}$? What if we said that we wanted each data point to matter equally at the time that we see it, so that we can set alpha to be a constant? Of course, $\alpha$ needs to be less than 1 so that we don't end up negating the previous mean.
$$0 < \alpha_t = constant < 1 $$$$Y_t = (1 - \alpha)Y_{t-1} + \alpha X_t$$So what does this give us?
This gives us what is called the exponentially smoothed average! We can see why it is called exponential when we express it in terms of only $X$'s.
$$Y_t = (1 - \alpha)^tY_0 + \alpha \sum_{\tau = 0}^{t - 1}(1- \alpha)^\tau X(t- \tau)$$If the equation above is not clear, the expansion below should clear up where everything is coming from and why this is called exponential. Let's say we are looking at $Y_{100}$:
$$Y_{100} = (1-\alpha)^{100}Y_0 + \alpha * X_{100} + \alpha * (1 - \alpha)^1*X_{99} + \alpha * (1 - \alpha)^2 * X_{98}+ ...$$We can see the exponents accumulating on the $(1 - \alpha)$ terms! Now, does this still give us the mean, aka the expected value of $X$? Well, if we take the expected value of everything, we can see that we arrive at the expected value of $X$:
$$(1 - \alpha)E[Y_{t-1}] + \alpha E[X_t] = (1-\alpha)E(X) + \alpha E(X) = E(X)$$We do arrive at the expected value of $X$, so the math does check out! Of course, this assumes that the distribution of $X$ does not change over time. Note that if you come from a signal processing background, you may recognize this as a low-pass filter. Another way to think about it is that current values matter more and past values matter less, in an exponentially decreasing way. So, if $X$ is not stationary (meaning its distribution changes over time), then this is actually a better way to estimate the mean than weighting all data points equally over all time.
Now, how does this apply to the Adam Optimizer? Well, if we call the sample mean of $T$ samples, $M_T$, we write $M_T$ as: $$M_T = \frac{1}{T} \sum_{t=1}^TX_t = \frac{1}{T}\sum_{t=1}^{T-1}X_t+\frac{1}{T}X_T = (1 - \frac{1}{T})M_{T-1}+\frac{1}{T}X_T$$
Where $X_T$ is the new sample, and $M_{T-1}$ is the previous mean. This means that instead of needing to store all of the $X$'s, all we need to store is the current $X$ and the previous $M$, and then we calculate the current $M$.
Now, you may notice that there is this $\frac{1}{T}$ term in the equation. It is reasonable to ask: what if we set this term to be a constant? Then we get back precisely what looks like the cache update from RMSProp!
$$M_T = decay*M_{T-1} + (1 - decay)X_T$$At this point, it is a very reasonable thing to ask: "What is the cache of RMSProp really estimating?" Well, it is really estimating the mean of the square of the gradient!
And because we eventually take the square root of the cache, now you can see where RMSProp gets its name (RMS= "root mean square").
Also, it is worth remembering that the mean of a random variable can also be phrased as an expected value, aka: $mean(X) = E(X)$. In particular, when we take the expected value of the square of a random variable, we get the second moment: $$E(X^2) = 2nd \;moment\;of\;X$$
So, $E(g^2)$ is the second moment of the gradient, and we calculate it using this exponentially smoothed average. We are going to refer to it as $v$ from now on, by analogy with the variance: the second central moment of a random variable is its variance, and for a zero-mean variable that is exactly $E(g^2)$.
$$v \approx E(g^2)$$$$M_T = (1 - \frac{1}{T})M_{T-1}+\frac{1}{T}X_T$$$$v_t = decay * v_{t-1} + (1 - decay)*g^2$$Now, since there is a second moment, it is reasonable to ask: is there a first moment? Yes there is! It is just the expected value of $g$.
$$m_t = \mu m_{t-1} + (1 - \mu)g_t$$$$m \approx E(g)$$We will call this $m$ because the first moment of a random variable is the definition of the mean. Because the Adam Update makes use of the 1st and 2nd moments of $g$ using exponentially smoothed estimates, this also explains the name of the algorithm: Adaptive Moment Estimation.
Also, note that the update for $m$ looks a lot like momentum, hence why some people refer to Adam as RMSProp with momentum. Usually, though, momentum is implemented by adding the gradient to the momentum-scaled previous velocity rather than taking this weighted average; as a reminder, it usually looks like:
$$m_t = \mu m_{t-1} + g_t$$So, to get to our final destination, the Adam update, we are going to combine these two pieces: the first and second moments of $g$, or as we have been calling them, $m$ and $v$ (mean and variance). However, before we do that, there is one last important concept to discuss. The exponentially smoothed average acts very much like a smoothing function, ignoring much of the variation in a signal: it ignores rapid, very random spikes but follows the main trend (for example, an underlying sine wave in a noisy signal). In the field of signal processing this is known as a low-pass filter, because the high-frequency changes are filtered out.
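As a small illustration of this low-pass behaviour, here is a sketch that exponentially smooths a made-up noisy sine wave (the signal, noise level, and smoothing constant are arbitrary):
import numpy as np
import matplotlib.pyplot as plt
t = np.linspace(0, 4 * np.pi, 500)
X = np.sin(t) + 0.5 * np.random.randn(500)  # underlying sine wave plus random noise
alpha = 0.05
Y = np.zeros_like(X)
for i in range(1, len(X)):
    Y[i] = (1 - alpha) * Y[i - 1] + alpha * X[i]  # exponentially smoothed average
plt.plot(t, X, label='noisy signal')
plt.plot(t, Y, label='exponentially smoothed')
plt.legend()
plt.show()
Notice that the smoothed output starts at 0, which is exactly the initialization issue we turn to next.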
The problem with exponential smoothing is the same one we saw with RMSProp: what should the initial value be? The output is supposed to depend on the current value of the input plus the last value of the output. But since there is no last value of the output at $t = 0$, we typically just set it to 0. In doing so, we make the first values of the output small (here with decay = 0.99): $$Y(0)= 0$$ $$Y(1) = 0.99 * Y(0) + 0.01*X(1) = 0.01*X(1)$$
This means that initially your output will be biased towards 0!
Is there anything that we can do about the issue presented above? Well, there is a technique, bias correction, that makes up for this.
$$\hat{Y}(t) = \frac{Y(t)}{1 - decay^t}$$We divide by $1 - decay^t$. As you can see, when $t$ is small we are dividing by a very small number, which makes the initial values bigger. This is good because the initial values were too small.
Let's go through an example to make things very clear. Given a decay of 0.99, our original output would be:
$$Y(1) = 0.01*X(1)$$And $Y(2)$ would be:
$$Y(2) = 0.0099*X(1) + 0.01*X(2)$$But, if we correct it, then we have:
$$\hat{Y}(1) = \frac{0.01}{1 - 0.99^1}*X(1) = X(1)$$And our corrected $\hat{Y}(2)$:
$$\hat{Y}(2) = \frac{Y(2)}{0.0199} = 0.497*X(1) + 0.503*X(2)$$This also makes sense because it is about half of $X(1)$ plus about half of $X(2)$, with a bit more weight on the current value. We can also see that as $t$ approaches infinity, the bias correction term approaches 1, meaning that we quickly converge to the standard exponentially smoothed average. The important part is that our output now has the correct range of values when compared with the input, for all values of the input, including the initial ones.
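We can reproduce these numbers with a few lines of Python, tracking the coefficients on $X(1)$ and $X(2)$ rather than actual data values (this is just a check of the arithmetic above):
decay = 0.99
# uncorrected outputs, written as coefficients on X(1) and X(2)
y1 = {"X1": 1 - decay}                             # Y(1) = 0.01*X(1)
y2 = {"X1": decay * (1 - decay), "X2": 1 - decay}  # Y(2) = 0.0099*X(1) + 0.01*X(2)
# bias-corrected outputs: divide by (1 - decay**t)
y1_hat = {k: v / (1 - decay ** 1) for k, v in y1.items()}
y2_hat = {k: v / (1 - decay ** 2) for k, v in y2.items()}
print(y1_hat)  # {'X1': 1.0}
print(y2_hat)  # {'X1': 0.497..., 'X2': 0.502...}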
Now that we have discussed bias correction for exponentially smoothed averages, you might guess that we are going to apply it to the exponentially smoothed averages that we have been talking about, in particular $m$ and $v$. So, if we call $\hat{m}$ the bias corrected version of $m$, and $\hat{v}$ the bias corrected version of $v$, then we can formulate our Adam Update.
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$Note that there are two different decay rates for $m$ and $v$, which we call $\beta_1$ and $\beta_2$. Typical values are around 0.9 for $\beta_1$ and 0.999 for $\beta_2$. $\epsilon$ again takes values in the range of $10^{-8}$ to $10^{-10}$. Our final update for $w$ is:
$$w_{t+1} \leftarrow w_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t+\epsilon}}$$Now that was a ton of material, so let's recap! Adam is a modern adaptive learning rate technique that is the default go-to for many deep learning practitioners. It combines 2 proven techniques:
- RMSProp which is a cache mechanism (leaky cache)
- Momentum which speeds up training just by continuing to go in the same direction it was going before (keeping track of old gradients)
We saw that these two methods can be generalized, by observing that they estimate the first and second moments of the gradient. Adam also adds bias correction, which makes up for the fact that the exponentially smoothed average starts at zero, and hence the initial estimates are biased towards 0.
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$$$w_{t+1} \leftarrow w_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t+\epsilon}}$$At every batch (typical $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$):
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)g_t$$$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)g_t^2$$$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$$$w_{t+1} \leftarrow w_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t+\epsilon}}$$We are now going to take the time to compare the Adam Optimizer in code, vs RMSProp with Momentum.
Recall that the algorithm for RMSProp was:
RMSProp
# at every batch
cache = decay * cache + (1 - decay) * gradient **2
param = param - learning_rate * gradient / sqrt(cache + epsilon)
We can start with our imports.
import numpy as np
from sklearn.utils import shuffle
import matplotlib.pyplot as plt
from util import get_normalized_data, error_rate, cost, y2indicator
And now let's start our main method:
%matplotlib notebook
def main():
max_iter = 10
print_period = 10
X, Y = get_normalized_data()
reg = 0.01
Xtrain = X[:-1000,]
Ytrain = Y[:-1000]
Xtest = X[-1000:,]
Ytest = Y[-1000:]
Ytrain_ind = y2indicator(Ytrain)
Ytest_ind = y2indicator(Ytest)
N, D = Xtrain.shape
batch_sz = 500
n_batches = N // batch_sz
M = 300
K = 10
W1_0 = np.random.randn(D, M) / np.sqrt(D)
b1_0 = np.zeros(M)
W2_0 = np.random.randn(M, K) / np.sqrt(M)
b2_0 = np.zeros(K)
W1 = W1_0.copy()
b1 = b1_0.copy()
W2 = W2_0.copy()
b2 = b2_0.copy()
""" ----------------------------------- ADAM -----------------------------------"""
# 1st moment initialization (aka m_t at t = 0, = 0). Note, we could initialize these to
# an array of zeros, however it is not necessary considering numpy knows how to
# automatically broadcast a scalar when you add it to an array
mW1 = 0
mb1 = 0
mW2 = 0
mb2 = 0
# 2nd moment initialization
vW1 = 0
vb1 = 0
vW2 = 0
vb2 = 0
# define hyperparameters - will use the same parameters for ADAM and RMSProp
lr0 = 0.001
beta1 = 0.9 # we can think of this as momentum
beta2 = 0.999 # can think of this like RMSProp cache decay rate
eps = 1e-8
loss_adam = []
err_adam = []
t = 1
for i in range (max_iter):
for j in range(n_batches):
Xbatch = Xtrain[j*batch_sz:(j*batch_sz + batch_sz), ]
Ybatch = Ytrain_ind[j*batch_sz:(j*batch_sz + batch_sz), ]
pYbatch, Z = forward(Xbatch, W1, b1, W2, b2)
# ---- Updates ----
# 1st - calculate updated gradients for each array
gW2 = derivative_w2(Z, Ybatch, pYbatch) + reg*W2
gb2 = derivative_b2(Ybatch, pYbatch) + reg*b2
gW1 = derivative_w1(Xbatch, Z, Ybatch, pYbatch, W2) + reg*W1
gb1 = derivative_b1(Z, Ybatch, pYbatch, W2) + reg*b1
# Calculate updates for m, exponentially smoothed average of gradients, 1st moment
mW1 = beta1 * mW1 + (1 - beta1) * gW1
mb1 = beta1 * mb1 + (1 - beta1) * gb1
mW2 = beta1 * mW2 + (1 - beta1) * gW2
mb2 = beta1 * mb2 + (1 - beta1) * gb2
# Calculate updates for v, exponentially smoothed average of the square of the
# gradients, 2nd moment
vW1 = beta2 * vW1 + (1 - beta2) * gW1* gW1
vb1 = beta2 * vb1 + (1 - beta2) * gb1* gb1
vW2 = beta2 * vW2 + (1 - beta2) * gW2* gW2
vb2 = beta2 * vb2 + (1 - beta2) * gb2* gb2
# Bias correction step - by convention, first time index is 1
correction1 = 1 - beta1 ** t
hat_mW1 = mW1 / correction1
hat_mb1 = mb1 / correction1
hat_mW2 = mW2 / correction1
hat_mb2 = mb2 / correction1
correction2 = 1 - beta2 ** t
hat_vW1 = vW1 / correction2
hat_vb1 = vb1 / correction2
hat_vW2 = vW2 / correction2
hat_vb2 = vb2 / correction2
# Update t - the bias correction uses the 1-indexed step count, so it must advance once per batch
t += 1
# Finally, apply updates to the actual weights/params
W1 = W1 - lr0 * hat_mW1 / np.sqrt(hat_vW1 + eps)
b1 = b1 - lr0 * hat_mb1 / np.sqrt(hat_vb1 + eps)
W2 = W2 - lr0 * hat_mW2 / np.sqrt(hat_vW2 + eps)
b2 = b2 - lr0 * hat_mb2 / np.sqrt(hat_vb2 + eps)
# Keep track of costs and errors so we can plot later
if j % print_period == 0:
pY, _ = forward(Xtest, W1, b1, W2, b2)
l = cost(pY, Ytest_ind)
loss_adam.append(l)
print("Cost at iteration i=%d, j=%d: %.6f" % (i, j, l))
err = error_rate(pY, Ytest)
err_adam.append(err)
print("Error rate:", err)
pY, _ = forward(Xtest, W1, b1, W2, b2)
print("Final error rate:", error_rate(pY, Ytest))
print('-------------------------------- End of Adam---------------------------------')
""" ---------------------------- RMSProp with Momentum ---------------------------"""
W1 = W1_0.copy() # Same initial weights that were used in Adam
b1 = b1_0.copy()
W2 = W2_0.copy()
b2 = b2_0.copy()
loss_rms = []
err_rms = []
# Comparable hyperparameters so we can have a fair comparison
lr0 = 0.001 # Learning rate
mu = 0.9 # Momentum term, corresponds to beta1
decay_rate = 0.999 # Cache decay rate, corresponds to beta2
eps = 1e-8
# RMSProp Cache, initialized to 1 (same implementation as in TensorFlow)
# This needs to be done because RMSProp does not have bias correction
cache_W2 = 1
cache_b2 = 1
cache_W1 = 1
cache_b1 = 1
# Initialize Momentum/Velocity terms to 0, note no bias correction
dW1 = 0
db1 = 0
dW2 = 0
db2 = 0
for i in range(max_iter):
for j in range(n_batches):
Xbatch = Xtrain[j*batch_sz:(j*batch_sz + batch_sz),]
Ybatch = Ytrain_ind[j*batch_sz:(j*batch_sz + batch_sz),]
pYbatch, Z = forward(Xbatch, W1, b1, W2, b2)
# ---- Updates ----
# General form: do everything we did for RMSProp, and add momentum to it at the end.
# 1) Calculate the gradient
# 2) Calculate the cache update
# 3) Use a variation on momentum: instead of adding the full step to mu times the
#    previous velocity, we weight the step we would have taken (the RMSProp step,
#    lr0 * gradient / (sqrt(cache) + eps)) by (1 - mu). So the velocity becomes
#    mu * dW2 + (1 - mu) * (RMSProp step), and we then subtract it from the weights.
gW2 = derivative_w2(Z, Ybatch, pYbatch) + reg*W2
cache_W2 = decay_rate*cache_W2 + (1 - decay_rate)*gW2*gW2
dW2 = mu * dW2 + (1 - mu) * lr0 * gW2 / (np.sqrt(cache_W2) + eps)
W2 -= dW2
gb2 = derivative_b2(Ybatch, pYbatch) + reg*b2
cache_b2 = decay_rate*cache_b2 + (1 - decay_rate)*gb2*gb2
db2 = mu * db2 + (1 - mu) * lr0 * gb2 / (np.sqrt(cache_b2) + eps)
b2 -= db2
gW1 = derivative_w1(Xbatch, Z, Ybatch, pYbatch, W2) + reg*W1
cache_W1 = decay_rate*cache_W1 + (1 - decay_rate)*gW1*gW1
dW1 = mu * dW1 + (1 - mu) * lr0 * gW1 / (np.sqrt(cache_W1) + eps)
W1 -= dW1
gb1 = derivative_b1(Z, Ybatch, pYbatch, W2) + reg*b1
cache_b1 = decay_rate*cache_b1 + (1 - decay_rate)*gb1*gb1
db1 = mu * db1 + (1 - mu) * lr0 * gb1 / (np.sqrt(cache_b1) + eps)
b1 -= db1
if j % print_period == 0:
pY, _ = forward(Xtest, W1, b1, W2, b2)
l = cost(pY, Ytest_ind)
loss_rms.append(l)
print("Cost at iteration i=%d, j=%d: %.6f" % (i, j, l))
err = error_rate(pY, Ytest)
err_rms.append(err)
print("Error rate:", err)
pY, _ = forward(Xtest, W1, b1, W2, b2)
print("Final error rate:", error_rate(pY, Ytest))
plt.plot(loss_adam, label='adam')
plt.plot(loss_rms, label='rmsprop')
plt.legend()
plt.show()
main()
So we can see that RMSProp with momentum and Adam actually end up very close to one another. This makes sense because Adam is basically RMSProp with momentum added to it. The big difference is that Adam adds the bias correction step. We can see that this makes a big difference at the very beginning, but that this bias quickly goes away. You observe the same effect if you run just a low pass filter on a random signal. You can see that the output is initially biased towards zero, but that the bias goes away pretty quickly.