Nathaniel Dake Blog

4. E-commerce Project

We are now going to look at training logistic regression with softmax. In my other walkthroughs on logistic regression we weren't doing multiclass classification (just binary), so we had only been using the sigmoid function, not the softmax. This will give us the chance to see how logistic regression performs compared to a neural network. Remember that for logistic regression the architecture is just an input layer connected directly to the output neuron; there is no hidden layer.

The only difference now is that instead of applying the sigmoid at the output neuron, we are going to apply the softmax, since we are performing multiclass classification.
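Concretely, whereas the sigmoid squashes a single activation into the probability of the positive class, the softmax turns the $K$ activations $a_k = w_k^T x + b_k$ into a full probability distribution over the classes:

$$p(y = k \mid x) = \frac{\exp(a_k)}{\sum_{j=1}^{K} \exp(a_j)}$$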

We can start with our imports.

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os

from sklearn.utils import shuffle

And let's define our get_data function:

In [3]:
def get_data():
    df = pd.read_csv('data/ecommerce_data.csv')
    data = df.values  # df.as_matrix() is deprecated in newer versions of pandas
    np.random.shuffle(data)
    X = data[:, :-1]
    Y = data[:, -1].astype(np.int32)

    # one-hot encode the categorical data
    # (the last column of X takes 4 values, so it expands into 4 columns)
    N, D = X.shape
    X2 = np.zeros((N, D + 3))
    X2[:, 0:(D-1)] = X[:, 0:(D-1)]  # copy over the non-categorical columns

    for n in range(N):
        t = int(X[n, D-1])
        X2[n, t + D - 1] = 1
    X = X2

    # split train and test
    Xtrain = X[:-100]
    Ytrain = Y[:-100]
    Xtest = X[-100:]
    Ytest = Y[-100:]

    # normalize columns 1 and 2 using the training mean and std
    for i in (1, 2):
        m = Xtrain[:, i].mean()
        s = Xtrain[:, i].std()
        Xtrain[:, i] = (Xtrain[:, i] - m) / s
        Xtest[:, i] = (Xtest[:, i] - m) / s

    return Xtrain, Ytrain, Xtest, Ytest

We are going to need a function to get the indicator matrix from the targets.

In [4]:
def y2indicator(y, K):
    N = len(y)
    ind = np.zeros((N,K))
    for i in range(N):
        ind[i, y[i]] = 1 
    return ind
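As an aside, the same indicator matrix can be built without the explicit loop using NumPy fancy indexing; this is just an equivalent sketch and is not used below:

def y2indicator_vectorized(y, K):
    # place a 1 in each row at the column given by that row's target
    ind = np.zeros((len(y), K))
    ind[np.arange(len(y)), y] = 1
    return ind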

Now we can get our data.

In [5]:
Xtrain, Ytrain, Xtest, Ytest = get_data()
D = Xtrain.shape[1]
K = len(set(Ytrain) | set(Ytest))

And convert our Y data into an indicator matrix.

In [6]:
Ytrain_ind = y2indicator(Ytrain, K)
Ytest_ind = y2indicator(Ytest, K)

It has shape (400, 4) because there are 400 training examples and four classes that make up Y.

In [7]:
Ytrain_ind.shape
Out[7]:
(400, 4)

Now randomly initialize our weights.

In [8]:
W = np.random.randn(D, K)
b = np.zeros(K)

And we can define our softmax function:

In [9]:
def softmax(a):
    expA = np.exp(a)
    return expA / expA.sum(axis=1, keepdims=True)
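One thing to be aware of: if the activations get large, np.exp can overflow. A numerically stable variant, shown here only as a sketch (the plain version above is what is used in the runs below), subtracts the row-wise maximum first, which leaves the result unchanged:

def softmax_stable(a):
    # subtracting the row max does not change the softmax, but keeps np.exp from overflowing
    expA = np.exp(a - a.max(axis=1, keepdims=True))
    return expA / expA.sum(axis=1, keepdims=True)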

And now let's define our forward function:

In [10]:
def forward(X, W, b):
    return softmax(X.dot(W) + b)

Next, our predict function:

In [11]:
def predict(P_Y_given_X):
    return np.argmax(P_Y_given_X, axis=1)

Our classification rate:

In [12]:
def classification_rate(Y, P):
    return np.mean(Y == P)

And the Cross Entropy:

In [13]:
def cross_entropy(T, pY):
    return -np.mean(T * np.log(pY))
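Note that np.mean averages over both the examples and the classes, so this is the usual cross entropy divided by the constant $NK$, which does not change where the minimum is:

$$J = -\frac{1}{NK} \sum_{n=1}^{N} \sum_{k=1}^{K} T_{nk} \log \big(pY_{nk}\big)$$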

Now that we have done all of that, we can enter our training loop. Note: our weight update rule in gradient descent is based on the rule we derived last lecture: Z.T.dot(T - Y). In the case of logistic regression we only have an input and an output layer, no hidden layer, so here Z is just X. We are also performing gradient descent, not ascent, so we have a minus sign in our update and we subtract the targets from the predictions.
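In other words, writing $T$ for the indicator matrix of targets, $pY$ for the softmax outputs, and $\eta$ for the learning rate, the updates in the loop below are:

$$W \leftarrow W - \eta \, X^T (pY - T), \qquad b \leftarrow b - \eta \sum_n (pY_n - T_n)$$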

In [14]:
train_costs = []
test_costs = []
learning_rate = 0.001
for i in range(10000):
    pYtrain = forward(Xtrain, W, b)      # predictions for the training set
    pYtest = forward(Xtest, W, b)        # predictions for the test set
    
    ctrain = cross_entropy(Ytrain_ind, pYtrain)     # Ytrain_ind === targets in this case
    ctest = cross_entropy(Ytest_ind, pYtest)
    train_costs.append(ctrain)
    test_costs.append(ctest)
    
    # now we can perform gradient descent
    W -= learning_rate * Xtrain.T.dot(pYtrain - Ytrain_ind)  
    b -= learning_rate * (pYtrain - Ytrain_ind).sum(axis=0)
    if i % 1000 == 0:
        print(i, ctrain, ctest)
print("Final training classification rate: ", classification_rate(Ytrain, predict(pYtrain)))
print("Final testing classification rate: ", classification_rate(Ytest, predict(pYtest)))

legend1, = plt.plot(train_costs, label='train cost')
legend2, = plt.plot(test_costs, label='test cost')
plt.legend([legend1, legend2])
plt.show()
0 0.5487036337934679 0.587425339056817
1000 0.08847309034724614 0.09187567492094431
2000 0.08377182711917026 0.09052982548097642
3000 0.08196144166596504 0.09077510228825146
4000 0.08102329620858013 0.09127851663313276
5000 0.0804589669016488 0.0918014318036505
6000 0.08008728169999321 0.09228846524706961
7000 0.07982711471431775 0.09272871265757494
8000 0.0796368998911137 0.09312353966499383
9000 0.0794932153858824 0.09347752945439439
Final training classification rate:  0.93
Final testing classification rate:  0.89



E-commerce Project - Training a Neural Network

Let's now look at how a neural network performs compared to logistic regression in this case. Again, let's start with our imports:

In [15]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os

from sklearn.utils import shuffle

And again, we will want to define our y indicator function:

In [16]:
def y2indicator(y, K):
    N = len(y)
    ind = np.zeros((N,K))
    for i in range(N):
        ind[i, y[i]] = 1 
    return ind

And get our data:

In [17]:
Xtrain, Ytrain, Xtest, Ytest = get_data()

Now let's set the size of our hidden layer M, the size of our input layer D, and the number of classes K:

In [18]:
M = 5
D = Xtrain.shape[1]
K = len(set(Ytrain) | set(Ytest))

And let's convert our target labels into an indicator matrix:

In [19]:
Ytrain_ind = y2indicator(Ytrain, K)
Ytest_ind = y2indicator(Ytest, K)

Now randomly initialize our weights:

In [20]:
W1 = np.random.randn(D, M)
b1 = np.zeros(M)
W2 = np.random.randn(M, K)
b2 = np.zeros(K)

Next, define our softmax:

In [21]:
def softmax(a):
    expA = np.exp(a)
    return expA / expA.sum(axis=1, keepdims=True)

And now for the forward function. It is similar to the forward function from logistic regression, but notice that we now have an additional layer to consider. Remember, Z is the output of the hidden layer nodes, and we need to return Z because it is used in the derivative during backpropagation.

Keep in mind that as we run through this prediction, when we first perform matrix multiplication between our input matrix $X$ and our first weight matrix $W_1$, each row of $X$ has its dot product taken with each column of $W_1$ (each hidden unit's weight vector). As we know, the more similar two vectors are, the greater their dot product.

In [22]:
def forward(X, W1, b1, W2, b2):
    Z = np.tanh(X.dot(W1) + b1)        
    return softmax(Z.dot(W2) + b2), Z   
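As a quick sanity check (using the variables already defined above), the returned probabilities should have one row per example and one column per class, with each row summing to 1, while Z has one column per hidden unit:

pY, Z = forward(Xtrain, W1, b1, W2, b2)
print(pY.shape, Z.shape)     # expect (400, 4) and (400, 5) given the data above
print(pY.sum(axis=1)[:5])    # each row of pY sums to 1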

Now let's redefine predict and classification_rate:

In [23]:
def predict(P_Y_given_X):
    return np.argmax(P_Y_given_X, axis=1)

def classification_rate(Y, P):
    return np.mean(Y == P)

And let's redefine cross entropy as well:

In [24]:
def cross_entropy(T, pY):
    return -np.mean(T * np.log(pY))

We can now enter our main training loop:

In [25]:
train_costs = []
test_costs = []
learning_rate = 0.001

for i in range(10000):
    pYtrain, Ztrain = forward(Xtrain, W1, b1, W2, b2)
    pYtest, Ztest = forward(Xtest, W1, b1, W2, b2)
    
    ctrain = cross_entropy(Ytrain_ind, pYtrain)
    ctest = cross_entropy(Ytest_ind, pYtest)
    train_costs.append(ctrain)
    test_costs.append(ctest)
    
    # we can now start gradient descent
    W2 -= learning_rate * Ztrain.T.dot(pYtrain - Ytrain_ind)
    b2 -= learning_rate * (pYtrain - Ytrain_ind).sum(axis=0)
    # backpropagate the error to the hidden layer
    # note: (1 - Ztrain * Ztrain) is the derivative of tanh; if we had been using
    # the sigmoid it would be Ztrain * (1 - Ztrain)
    dZ = (pYtrain - Ytrain_ind).dot(W2.T) * (1 - Ztrain * Ztrain)
    # now we can update W1 and b1
    W1 -= learning_rate * Xtrain.T.dot(dZ)
    b1 -= learning_rate * dZ.sum(axis=0)
    
    if i % 1000 == 0:
        print(i, ctrain, ctest)
        
print("Final training classification rate: ", classification_rate(Ytrain, predict(pYtrain)))
print("Final testing classification rate: ", classification_rate(Ytest, predict(pYtest)))

legend1, = plt.plot(train_costs, label='train cost')
legend2, = plt.plot(test_costs, label='test cost')
plt.legend([legend1, legend2])
plt.show()
0 0.5112511245754163 0.5034374398825299
1000 0.040578783494983936 0.045707901918409934
2000 0.029555888166024804 0.04633345269429838
3000 0.0235877160252051 0.04479735810092709
4000 0.020515948962842093 0.04573993366731693
5000 0.018515357771577968 0.047367539403102205
6000 0.017084671294244937 0.04888953258949811
7000 0.016012764493820292 0.05013368928403129
8000 0.015181263275568237 0.05110831091060879
9000 0.014516974759324945 0.05185692286936616
Final training classification rate:  0.9775
Final testing classification rate:  0.93

The neural network reaches a test classification rate of 0.93, compared to 0.89 for logistic regression, so adding the hidden layer gives a clear improvement on this dataset.

© 2018 Nathaniel Dake