Let's take a moment to talk about the different ways that we can perform gradient descent, and their particular advantages and disadvantages. The three basic types we will go over are full gradient descent, stochastic gradient descent, and batch (mini-batch) gradient descent.
Up until now, most of my notebooks have covered full gradient descent:
$$J = \sum_{n=1}^{N} t_n \log y_n$$
Why is that? Well, it makes sense that we would want to maximize the likelihood over our entire training set! The main disadvantage, however, is that computing the gradient is now $O(N)$, since it depends on every single sample. This means full gradient descent will struggle to be effective when working with big data.
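To see where the $O(N)$ comes from, assume the model is multiclass logistic regression with a softmax output, $Y = \text{softmax}(XW + b)$, which is what we use below. The standard gradient of $J$ with respect to the weights is then:
$$\nabla_W J = \sum_{n=1}^{N} x_n (t_n - y_n)^{\top} = X^{\top}(T - Y)$$
Every update has to touch all $N$ rows of $X$, which is exactly why the per-iteration cost grows linearly with the size of the training set. (Whether the gradW helper used later computes exactly this is an assumption on my part, but the scaling argument holds either way.)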
On the other end of the spectrum we have stochastic gradient descent. This looks at one sample at a time, and the error associated with it. We are depending on the fact that all of the samples are IID (independent and identically distributed): because every sample comes from the same distribution, the error will improve in the long run. Reducing the gradient calculation from $N$ operations to 1 is a nice improvement, but there is a disadvantage. Whereas the log likelihood improves on every iteration of full gradient descent, it can sometimes get worse with stochastic gradient descent. In fact, the cost behaves quite erratically from iteration to iteration, even though it improves in the long run.
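The IID argument can be made a bit more precise: if we pick a sample $n$ uniformly at random, the single-sample gradient is an unbiased estimate of the full gradient (up to the constant factor $1/N$):
$$\mathbb{E}_n\big[\nabla J_n\big] = \frac{1}{N}\sum_{n=1}^{N}\nabla J_n = \frac{1}{N}\,\nabla J, \qquad J_n = t_n \log y_n$$
So each noisy step improves the objective on average, which is why the cost trends in the right direction over many iterations even though any individual step can make it worse.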
Batch gradient descent (often called mini-batch gradient descent) can be thought of as the happy medium between these two. In this method we split our data up into batches: for instance, with 10,000 examples we could use 100 batches of size 100, and then compute the cost and the update on one batch per iteration. The cost can still get worse on any given iteration, but it will be much less erratic than if we had done pure stochastic gradient descent. The sketch below shows how little actually changes between the three variants.
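Before we get to the actual notebook code, here is a minimal sketch (my own illustration, not taken from the course utilities) showing that the three variants differ only in how many rows each update sees. The names grad_step, train, X, and Tind are hypothetical, and shuffling and the bias term are omitted for brevity:
import numpy as np

def grad_step(W, X_slice, T_slice):
    # hypothetical helper: gradient of the log likelihood on this slice (softmax output, no bias)
    A = X_slice @ W
    expA = np.exp(A - A.max(axis=1, keepdims=True))
    Y_slice = expA / expA.sum(axis=1, keepdims=True)
    return X_slice.T @ (T_slice - Y_slice)

def train(X, Tind, batch_sz, epochs, lr=0.0001):
    N, D = X.shape
    W = np.zeros((D, Tind.shape[1]))
    for _ in range(epochs):
        for j in range(N // batch_sz):
            sl = slice(j * batch_sz, (j + 1) * batch_sz)
            W += lr * grad_step(W, X[sl], Tind[sl])  # one update per batch, ascending the log likelihood
    return W

# the only thing that changes between the three variants is the batch size:
#   full GD:       train(X, Tind, batch_sz=N,   epochs=200)
#   stochastic GD: train(X, Tind, batch_sz=1,   epochs=1)
#   batch GD:      train(X, Tind, batch_sz=500, epochs=50)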
We are now going to compare each form of gradient descent in code, particularly how they progress. We start with the standard imports:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
from datetime import datetime
from modern_dl_util import get_transformed_data, forward, error_rate, cost, gradW, gradb, y2indicator
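modern_dl_util is a course-specific helper module, so here is a hedged sketch of what the math helpers used below presumably look like, inferred only from how they are called in this notebook (the real implementations may differ; get_transformed_data, which loads the dataset, is not sketched). These are shown purely for reference, so if you have the module, use the imported versions.
# reference sketches of the imported helpers, inferred from how they are called below
def forward(X, W, b):
    A = X.dot(W) + b
    expA = np.exp(A - A.max(axis=1, keepdims=True))  # numerically stable softmax
    return expA / expA.sum(axis=1, keepdims=True)

def cost(p_y, T):
    return -(T * np.log(p_y)).sum()  # negative log likelihood (cross entropy)

def gradW(T, p_y, X):
    return X.T.dot(T - p_y)  # ascent direction for the log likelihood, matching W += lr*gradW(...)

def gradb(T, p_y):
    return (T - p_y).sum(axis=0)

def error_rate(p_y, targets):
    return np.mean(np.argmax(p_y, axis=1) != targets)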
We will be using the transformed data (i.e. after PCA has been applied), and logistic regression instead of a full neural network, just to speed things up. So we will start by getting the transformed data and taking only the first 300 columns. We will also normalize X, split off our training and test sets, and create our indicator matrices.
X, Y, _, _ = get_transformed_data()
X = X[:, :300]
# normalize X first
mu = X.mean(axis=0)
std = X.std(axis=0)
X = (X - mu) / std
Xtrain = X[:-1000,]
Ytrain = Y[:-1000]
Xtest = X[-1000:,]
Ytest = Y[-1000:]
N, D = Xtrain.shape
Ytrain_ind = y2indicator(Ytrain)
Ytest_ind = y2indicator(Ytest)
print("finished loading and preprocessing the data")
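The indicator matrices are just one-hot encodings of the labels. In case y2indicator isn't available, a minimal sketch of what it presumably does (assuming 10 digit classes) is:
# hedged sketch of y2indicator: one-hot encode integer labels into an N x 10 indicator matrix
def y2indicator(y, n_classes=10):
    ind = np.zeros((len(y), n_classes))
    ind[np.arange(len(y)), y.astype(np.int32)] = 1
    return ind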
Now we will perform full gradient descent.
# 1. full
W = np.random.randn(D, 10) / 28 # initialize weights and bias
b = np.zeros(10)
LL = []
lr = 0.0001 # set learning rate and regularization
reg = 0.01
t0 = datetime.now()
for i in range(200):
    p_y = forward(Xtrain, W, b)  # forward pass on the full training set

    # gradient updates for the weights and bias, with L2 regularization
    W += lr*(gradW(Ytrain_ind, p_y, Xtrain) - reg*W)
    b += lr*(gradb(Ytrain_ind, p_y) - reg*b)

    p_y_test = forward(Xtest, W, b)  # forward pass on the test set
    ll = cost(p_y_test, Ytest_ind)
    LL.append(ll)
    if i % 20 == 0:
        err = error_rate(p_y_test, Ytest)
        print("Cost at iteration %d: %.6f" % (i, ll))
        print("Error rate:", err)
p_y = forward(Xtest, W, b)  # final predictions on the test set
print("Final error rate:", error_rate(p_y, Ytest))
print("Elapsed time for full GD:", datetime.now() - t0)
Now let's go over stochastic gradient descent. Note that we are only going to make one pass through the data, since we have to look at every sample and make an update based on each one individually. We are also only going to go through 500 samples, since it would be very slow to go through all of them.
W = np.random.randn(D, 10) / 28
b = np.zeros(10)
LL_stochastic = []
lr = 0.0001
reg = 0.01
t0 = datetime.now()
for i in range(1):  # a single pass; computing the test cost after every sample is already slow
    tmpX, tmpY = shuffle(Xtrain, Ytrain_ind)  # shuffle the data and labels together
    for n in range(min(N, 500)):  # shortcut so it won't take so long
        x = tmpX[n,:].reshape(1,D)  # reshape x into a 2-D matrix
        y = tmpY[n,:].reshape(1,10)  # reshape y the same way
        p_y = forward(x, W, b)  # forward pass on a single sample

        # gradient updates for the weights and bias, with regularization as usual
        W += lr*(gradW(y, p_y, x) - reg*W)
        b += lr*(gradb(y, p_y) - reg*b)

        p_y_test = forward(Xtest, W, b)
        ll = cost(p_y_test, Ytest_ind)
        LL_stochastic.append(ll)
        # only print the error rate every N//2 samples; with just 500 samples this triggers once
        if n % (N//2) == 0:
            err = error_rate(p_y_test, Ytest)
            print("Cost at iteration %d: %.6f" % (i, ll))
            print("Error rate:", err)
p_y = forward(Xtest, W, b)
print("Final error rate:", error_rate(p_y, Ytest))
print("Elapsed time for SGD:", datetime.now() - t0)
The main question that usually comes up here is: "how do you select the batch?" We take the rows from the batch index times the batch size up to the batch index times the batch size plus the batch size. For example, with a batch size of 500, batch j = 3 covers rows 1500 through 1999.
# 3. batch
W = np.random.randn(D, 10) / 28
b = np.zeros(10)
LL_batch = []
lr = 0.0001
reg = 0.01
batch_sz = 500 # set batch size to 500
n_batches = N // batch_sz # get number of batches
t0 = datetime.now()
for i in range(50):
    tmpX, tmpY = shuffle(Xtrain, Ytrain_ind)
    for j in range(n_batches):
        # select the rows belonging to the current batch
        x = tmpX[j*batch_sz:(j*batch_sz + batch_sz),:]
        y = tmpY[j*batch_sz:(j*batch_sz + batch_sz),:]
        p_y = forward(x, W, b)

        # gradient updates for the weights and bias, with regularization
        W += lr*(gradW(y, p_y, x) - reg*W)
        b += lr*(gradb(y, p_y) - reg*b)

        p_y_test = forward(Xtest, W, b)
        ll = cost(p_y_test, Ytest_ind)
        LL_batch.append(ll)
        if j % n_batches == 0:  # only print once per epoch, on the first batch
            err = error_rate(p_y_test, Ytest)
            print("Cost at iteration %d: %.6f" % (i, ll))
            print("Error rate:", err)
p_y = forward(Xtest, W, b)
print("Final error rate:", error_rate(p_y, Ytest))
print("Elapsed time for batch GD:", datetime.now() - t0)
Let's take a second to compare the cost curves of each gradient descent variation. We can see that stochastic gradient descent has hardly improved at all, so we would need many more iterations (it only saw 500 samples here). Full gradient descent appears to converge in fewer iterations than batch gradient descent, but it took longer to run overall.
Also, if you zoom in on the batch and stochastic curves, you will see that they are not always smooth, whereas the full gradient descent curve is.
fig, ax = plt.subplots(figsize=(8,6))
# each method records a different number of cost values, so normalize every x-axis to [0, 1]
x1 = np.linspace(0, 1, len(LL))
plt.plot(x1, LL, label="full")
x2 = np.linspace(0, 1, len(LL_stochastic))
plt.plot(x2, LL_stochastic, label="stochastic")
x3 = np.linspace(0, 1, len(LL_batch))
plt.plot(x3, LL_batch, label="batch")
plt.legend()
plt.show()
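If you want to zoom in on that noise, one simple option (an illustrative snippet, not part of the original notebook) is to re-plot the stochastic and batch curves on their own axes; the full gradient descent curve is smooth, so it is left out here:
# zoom in on the noisier curves by plotting them separately
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(LL_stochastic)
axes[0].set_title("stochastic")
axes[1].plot(LL_batch)
axes[1].set_title("batch")
for axis in axes:
    axis.set_xlabel("update")
    axis.set_ylabel("test cost")
plt.show()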