Nathaniel Dake Blog

5. Practical Issues

Up until this point we have focused mainly on theory and the two main functions of a neural network:

  1. How to Predict
  2. How to Train

The question may arise: in practice, what else do we need to focus on? Well, we have mentioned that neural networks are nonlinear classifiers, which means that they can classify things even when you can't draw a line between the two classes. Two famous examples of this are the XOR problem and the Donut problem.



We have shown that logistic regression cannot solve those two problems unless you manually create new features. This can be time consuming and rather tedious, so ideally we would have an algorithm learn those features for us. However, the ability to learn these features automatically does not come for free. What you may have started to notice is that neural networks come with a variety of options: how many hidden units to use, how many hidden layers to use, the type of activation function, etc. These are hyperparameters and choosing them is not trivial, so we will discuss the process for this as well.

This section is all about applying what we have already learned about neural networks in order to gain deeper insight into how they work, and what kind of baggage they come with.




1. XOR Problem in Code

Let's now go through the XOR problem in code, using a neural network to solve it. This should be slightly simpler than the neural networks we have been working with so far, since this is a binary classification problem. We can start with our imports as always.

In [67]:
import numpy as np
import matplotlib.pyplot as plt

And now our forward function. Note, since this is only binary classification we are basically just doing two sigmoids in a row.

In [68]:
def forward(X, W1, b1, W2, b2):
    Z = 1 / (1 + np.exp( -(X.dot(W1) + b1) ))   # output of hidden layer
    activation = Z.dot(W2) + b2                 # activation at output layer
    Y = 1 / (1 + np.exp(-activation))           # output at output layer
    return Y, Z

Our predict function will no longer use argmax, since we are not using the softmax; instead it will just use round (anything less than 0.5 is classified as 0, anything greater than 0.5 is classified as 1).

In [69]:
def predict(X, W1, b1, W2, b2): 
    Y, _ = forward(X, W1, b1, W2, b2)
    return np.round(Y)

The derivatives are mostly the same as before. Note that Z is an (N x M) matrix, and since our hidden layer uses a sigmoid, the term Z * (1 - Z) below is the derivative of that sigmoid.

In [70]:
def derivative_w2(Z, T, Y):
    return (T - Y).dot(Z)

def derivative_b2(T, Y):
    return (T - Y).sum()

def derivative_w1(X, Z, T, Y, W2):
    dZ = np.outer(T - Y, W2) * Z * (1 - Z)   # Z * (1 - Z): sigmoid derivative at the hidden layer
    return X.T.dot(dZ)

def derivative_b1(Z, T, Y, W2):
    dZ = np.outer(T - Y, W2) * Z * (1 - Z)   # Z * (1 - Z): sigmoid derivative at the hidden layer
    return dZ.sum(axis=0)

Also note, regarding the derivatives for $W_1$ and $b_1$, we utilize the np.outer function instead of the np.dot that we had been using in section 3. This may at first seem unexpected, but remember that with our given configuration we are really just doing two sigmoids in a row. Let's look at our data for the XOR problem and this will make more sense. In the process, the architecture that we are working with will become a little clearer.

We can see that our X has dimensions (4 x 2), meaning that we have 4 training examples, and each example has 2 dimensions. So our neural network is going to have an input layer containing 2 nodes.

In [71]:
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])    # we can see that X is (4 x 2)
X
Out[71]:
array([[0, 0],
       [0, 1],
       [1, 0],
       [1, 1]])

Our Y output has dimensions (4 x 1), meaning our output layer has 1 node, containing our prediction value.

In [72]:
Y = np.array([0, 1, 1, 0])                        # Y is (4 x 1)
Y
Out[72]:
array([0, 1, 1, 0])

Our W1 matrix has the shape (2 x 5), where each row holds the weights that map from a specific node in the input layer to all of the nodes in the hidden layer. The appendix of section 3 (Training a Neural Network) has a good visualization of this. Keep in mind, the shape of W1 tells us that we have 5 nodes in our hidden layer, and hence the shape of Z should be (4 x 5).

In [73]:
W1 = np.random.randn(2, 5)                        # W1 is (2 x 5)
W1
Out[73]:
array([[-0.86534048,  0.54128214, -0.13892193,  0.84884489,  0.6157514 ],
       [ 0.31201695, -0.59609343, -1.10944835, -1.08568159,  0.05351868]])

Our W2 is a vector of length 5 (numpy shape (5,)), which acts as a (5 x 1) column vector here. This is because we need W2 to be a size that brings us back to 1 node for the output layer. It should be kept in mind that our final network architecture is 2 input nodes, 5 hidden layer nodes, and 1 output node.

In [74]:
W2 = np.random.randn(5)                           # W2 is (5 x 1)
W2.shape
Out[74]:
(5,)

Let's calculate Z via the dot product to ensure that it is in fact (4 x 5).

In [75]:
X
Out[75]:
array([[0, 0],
       [0, 1],
       [1, 0],
       [1, 1]])
In [76]:
W1
Out[76]:
array([[-0.86534048,  0.54128214, -0.13892193,  0.84884489,  0.6157514 ],
       [ 0.31201695, -0.59609343, -1.10944835, -1.08568159,  0.05351868]])
In [77]:
Z = X.dot(W1)                                     # Z is (4 x 5) 
Z
Out[77]:
array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.31201695, -0.59609343, -1.10944835, -1.08568159,  0.05351868],
       [-0.86534048,  0.54128214, -0.13892193,  0.84884489,  0.6157514 ],
       [-0.55332353, -0.0548113 , -1.24837028, -0.2368367 ,  0.66927009]])

Great, Z is (4 x 5) and each row represents the values of the 5 hidden nodes for a given training example (4 training examples total). Remember: the first column of X holds the value of input node 1 for each training example, and the second column holds the value of input node 2. When we perform the dot product, we take the 1st column of W1, which holds the weight mapping from input node 1 to hidden node 1 and the weight mapping from input node 2 to hidden node 1, and apply it to X; the result becomes the first column of our Z matrix.
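As a quick sanity check (a minimal sketch using the arrays defined above), we can confirm that the first column of Z really is just the weights into hidden node 1 applied to both input columns of X:

# hidden node 1, for every training example:
# Z[:, 0] = X[:, 0] * W1[0, 0] + X[:, 1] * W1[1, 0]
manual_first_column = X[:, 0] * W1[0, 0] + X[:, 1] * W1[1, 0]
print(np.allclose(manual_first_column, X.dot(W1)[:, 0]))    # should print True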

Now, with that said, let's move on to the calculation of our prediction pY. We should expect a shape of (4 x 1) so that it matches our targets Y: one prediction for each of the 4 examples. We will apply the column vector W2 to each row of Z, performing the dot product and ending up with a (4 x 1) result.

In [78]:
W2              # quickly visualizing W2
Out[78]:
array([-0.24102211,  1.14667605,  0.23715334,  2.35934555,  0.27652735])
In [79]:
pY = Z.dot(W2)                                    # pY is (4 x 1)
pY
Out[79]:
array([ 0.        , -3.56853707,  2.96928616, -0.5992509 ])

Now that we have gone through the forward prediction step (keep in mind that no activation was applied, since this was just to demonstrate the shapes of the matrices, and element-wise operations would not affect those), the question still remains: why was the np.outer function used instead of np.dot in our derivatives for W1 and b1? Well, let's look at the shapes of (Y - pY) and W2 again:

In [80]:
Y - pY
Out[80]:
array([ 0.        ,  4.56853707, -1.96928616,  0.5992509 ])
In [81]:
W2
Out[81]:
array([-0.24102211,  1.14667605,  0.23715334,  2.35934555,  0.27652735])

So (Y - pY) has shape (4 x 1), and W2 has shape (5 x 1). Now, we need to end up with a shape that matches Z, (4 x 5). Recall from the numpy tutorial that a dot product between two vectors is only defined when they have the same length, and these two do not. The outer product is what we want instead: it multiplies every element of our error term (Y - pY) by every weight value in W2, producing exactly the (4 x 5) matrix we need.
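As a minimal sketch of what np.outer does here (reusing the arrays defined above), element [i, j] of the result is simply (Y - pY)[i] * W2[j], which is exactly what broadcasting a column against a row would produce:

error = Y - pY                                            # shape (4,)
dZ_outer = np.outer(error, W2)                            # shape (4, 5)
dZ_broadcast = error.reshape(-1, 1) * W2.reshape(1, -1)   # same result via broadcasting
print(np.allclose(dZ_outer, dZ_broadcast))                # should print True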

In [82]:
dZ = np.outer((Y - pY), W2)
dZ
Out[82]:
array([[-0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [-1.10111845,  5.23863204,  1.08344382, 10.77875761,  1.26332547],
       [ 0.47464151, -2.25813328, -0.46702279, -4.64622655, -0.54456149],
       [-0.14443272,  0.68714666,  0.14211435,  1.41383996,  0.16570927]])
In [83]:
dZ.shape
Out[83]:
(4, 5)

Excellent, our shape is (4 x 5), which matches Z. Now let's get back to the remaining functions that we need to define. The cost function is next on our list. In this case we are using the cross entropy, which is just the negative of the log-likelihood (the function below returns the log-likelihood, which we want to maximize).

In [84]:
def get_log_likelihood(T, Y):
    return np.sum(T * np.log(Y) + (1 - T)*np.log(1 - Y))

Now we can write a function to actually test the XOR problem.

In [85]:
def test_xor():
    # create XOR data
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    Y = np.array([0, 1, 1, 0])
    W1 = np.random.randn(2, 5)
    b1 = np.zeros(5)
    W2 = np.random.randn(5)
    b2 = 0
    
    LL = []                       # keep track of log-likelihoods
    
    # set our hyper parameters
    learning_rate = 1e-2
    regularization = 0.
    last_error_rate = None
    
    for i in range(30000):
        pY, Z = forward(X, W1, b1, W2, b2)
        ll = get_log_likelihood(Y, pY)
        prediction = predict(X, W1, b1, W2, b2)
        er = np.mean(prediction != Y)

        LL.append(ll)
        W2 += learning_rate * (derivative_w2(Z, Y, pY) - regularization * W2)
        b2 += learning_rate * (derivative_b2(Y, pY) - regularization * b2)
        W1 += learning_rate * (derivative_w1(X, Z, Y, pY, W2) - regularization * W1)
        b1 += learning_rate * (derivative_b1(Z, Y, pY, W2) - regularization * b1)
        if i % 1000 == 0:
            print(ll)

    print("final classification rate:", np.mean(prediction == Y))
    plt.plot(LL)
    plt.show()        
In [86]:
test_xor()
-3.0396449680446813
-2.3593317577596715
-1.4546891188117146
-1.1519849070388395
-0.9775242358298886
-0.8487809122050669
-0.7519870059780317
-0.6765650035199965
-0.6156396729478375
-0.5645945050785721
-0.520003881104273
-0.48000148712750634
-0.44364483927800213
-0.4098047288077369
-0.377633243304657
-0.3474198932797925
-0.3202045566689863
-0.29662565505996386
-0.27651631173955044
-0.259287524003444
-0.24433595374202233
-0.2312070630792137
-0.21959866381967957
-0.20930764298854757
-0.20017232257603287
-0.19203467131352922
-0.18473035543703942
-0.17809935640759117
-0.1720017598158809
-0.1663272002375039
final classification rate: 1.0

Now let's write out the code that will run through the donut problem. Some things to note: we now use the relu activation in the hidden layer (the sigmoid and tanh versions are left as comments), and we explicitly define 8 hidden units and use that to define the size of our weight matrices.

In [87]:
def forward(X, W1, b1, W2, b2):
    # sigmoid
    # Z = 1 / (1 + np.exp( -(X.dot(W1) + b1) ))

    # tanh
    # Z = np.tanh(X.dot(W1) + b1)

    # relu
    Z = X.dot(W1) + b1
    Z = Z * (Z > 0)

    activation = Z.dot(W2) + b2
    Y = 1 / (1 + np.exp(-activation))
    return Y, Z


def predict(X, W1, b1, W2, b2):
    Y, _ = forward(X, W1, b1, W2, b2)
    return np.round(Y)


def derivative_w2(Z, T, Y):
    # Z is (N, M)
    return (T - Y).dot(Z)

def derivative_b2(T, Y):
    return (T - Y).sum()


def derivative_w1(X, Z, T, Y, W2):
    # dZ = np.outer(T-Y, W2) * Z * (1 - Z) # this is for sigmoid activation
    # dZ = np.outer(T-Y, W2) * (1 - Z * Z) # this is for tanh activation
    dZ = np.outer(T-Y, W2) * (Z > 0) # this is for relu activation
    return X.T.dot(dZ)


def derivative_b1(Z, T, Y, W2):
    # dZ = np.outer(T-Y, W2) * Z * (1 - Z) # this is for sigmoid activation
    # dZ = np.outer(T-Y, W2) * (1 - Z * Z) # this is for tanh activation
    dZ = np.outer(T-Y, W2) * (Z > 0) # this is for relu activation
    return dZ.sum(axis=0)


def get_log_likelihood(T, Y):
    return np.sum(T*np.log(Y) + (1-T)*np.log(1-Y))



def test_donut():
    # donut example
    N = 1000
    R_inner = 5
    R_outer = 10

    # distance from origin is radius + random normal
    # angle theta is uniformly distributed between (0, 2pi)
    R1 = np.random.randn(N//2) + R_inner
    theta = 2*np.pi*np.random.random(N//2)
    X_inner = np.concatenate([[R1 * np.cos(theta)], [R1 * np.sin(theta)]]).T

    R2 = np.random.randn(N//2) + R_outer
    theta = 2*np.pi*np.random.random(N//2)
    X_outer = np.concatenate([[R2 * np.cos(theta)], [R2 * np.sin(theta)]]).T

    X = np.concatenate([ X_inner, X_outer ])
    Y = np.array([0]*(N//2) + [1]*(N//2))

    n_hidden = 8
    W1 = np.random.randn(2, n_hidden)
    b1 = np.random.randn(n_hidden)
    W2 = np.random.randn(n_hidden)
    b2 = np.random.randn(1)
    LL = [] # keep track of log-likelihoods
    learning_rate = 0.00005
    regularization = 0.2
    last_error_rate = None
    for i in range(3000):
        pY, Z = forward(X, W1, b1, W2, b2)
        ll = get_log_likelihood(Y, pY)
        prediction = predict(X, W1, b1, W2, b2)
        er = np.abs(prediction - Y).mean()
        LL.append(ll)
        W2 += learning_rate * (derivative_w2(Z, Y, pY) - regularization * W2)
        b2 += learning_rate * (derivative_b2(Y, pY) - regularization * b2)
        W1 += learning_rate * (derivative_w1(X, Z, Y, pY, W2) - regularization * W1)
        b1 += learning_rate * (derivative_b1(Z, Y, pY, W2) - regularization * b1)
        if i % 300 == 0:
            print("i:", i, "ll:", ll, "classification rate:", 1 - er)
    plt.plot(LL)
    plt.show()
In [88]:
test_donut()
i: 0 ll: -5766.440074093014 classification rate: 0.491
i: 300 ll: -184.19026878248826 classification rate: 0.973
i: 600 ll: -98.05124129973063 classification rate: 0.98
i: 900 ll: -70.19030055807445 classification rate: 0.984
i: 1200 ll: -57.07901411817688 classification rate: 0.986
i: 1500 ll: -49.554112402267435 classification rate: 0.986
i: 1800 ll: -44.667018955278905 classification rate: 0.986
i: 2100 ll: -41.041551652757235 classification rate: 0.988
i: 2400 ll: -37.93972500829635 classification rate: 0.989
i: 2700 ll: -35.62152425839586 classification rate: 0.99



2. Neural Networks for Regression

We are now going to take a brief look at how to utilize Neural Networks in a Regression problem. In general, they are used for classification, for a variety of reasons (most of the famous problems they have solved have been classification, they often perform better in that setting, etc). However, they can sometimes be very effective in a regression scenario, and are even utilized in certain state of the art applications such as deep reinforcement learning.

2.1 Neural Network Binary Classification Structure

For binary classification we will have a stack of nonlinear hidden layers, followed by a logistic regression at the end. We can call this logistic regression because it takes on the exact form of logistic regression: a linear transformation, $Wx+b$, and a sigmoid (shown in the circle on the right). For the hidden layers, we are not going to call them logistic regressions; although they still perform a linear transformation, the activation function is not necessarily a sigmoid (it could be tanh, relu, etc.). Still, the concept of stacking layers of neurons is important to consider, since that is what led us to these new structures in the first place.



2.2 Neural Network Multi-class Classification Structure



For multi-class classification we have nearly the same thing, only now the output layer has more units, one corresponding to each class. However, notice that everything up until the last layer is the same as before. We have stacks of linear transformations, along with nonlinear activation functions, and at the end we have a multi-class logistic regression, which is a linear transformation and a softmax.



2.3 Neural Network Regression Structure



Finally we have the regression scenario. Again, notice how everything up until the last layer is the same as before. We have stacks of linear transformations, along with nonlinear activation functions, and at the end we have only a linear transformation, which is of course exactly linear regression.



2.4 The Lesson

We know from prior studies that if we consider just linear models, we have logistic regression for classification and linear regression for regression. In a sense, everything up until the last layer is just feature extraction/transformation, where we are transforming the input into something different. After that, we are doing what we already know: logistic regression for classification, and linear regression for regression.
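To make this concrete, here is a minimal numpy sketch (not code from earlier in this post) of one shared hidden-layer transformation feeding three different output heads; only the final transformation changes between the three tasks:

import numpy as np

def hidden_features(X, W, b):
    # shared nonlinear feature transformation (a relu hidden layer)
    Z = X.dot(W) + b
    return Z * (Z > 0)

def binary_head(Z, w, c):
    # logistic regression on the learned features: linear transformation + sigmoid
    return 1 / (1 + np.exp(-(Z.dot(w) + c)))

def multiclass_head(Z, W, c):
    # multi-class logistic regression: linear transformation + softmax
    A = Z.dot(W) + c
    expA = np.exp(A - A.max(axis=1, keepdims=True))
    return expA / expA.sum(axis=1, keepdims=True)

def regression_head(Z, w, c):
    # linear regression: linear transformation only (identity activation)
    return Z.dot(w) + c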






2.5 The Problem

Because we have already derived the equations for our neural net, we won't repeat them here. However, let's take a look at the data set that we are going to use, and then we will go through the code to fit it. Note that this data set has 3 dimensions because we want to be able to visualize our data; working with 100 or 1000 dimensions is fine, but since you can't see them, it feels more abstract. We are going to use our neural network to perform regression on a curve that you can see. Our curve is just going to be a saddle, based on the equation:

$$y = x_1x_2$$

Note that in this problem, $Y$ is going to be the target, and we will be trying to make predictions that minimize the squared error between the prediction and $Y$. In the process we want to find a curve that closely follows the saddle shape of $Y$, without overfitting. Keep in mind that standard linear regression in this case would only be able to generate a plane, which clearly would not be a good fit at all. Our goal is to use a neural network to do better. Let's get a visualization of $Y$ now:

In [89]:
# imports 
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline

# create data
N = 500                                 # create 500 points sampled uniformly at random
X = np.random.random((N, 2)) * 4 - 2    # tuple (N,2) gives matrix shape, values between (-2, +2)
Y = X[:,0] * X[:,1]                     # X1 * X2, gives saddle shape

# visualize
fig = plt.figure(figsize=(14,10))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:,0], X[:,1], Y)
plt.show()

Great, now that we know the shape of the curve that we are trying to predict we can get started in building our neural network. We are going to have 2 input dimensions, $x_1, x_2$, 100 hidden units, and 1 output unit that yields a continuous number.

In [90]:
D = 2           # 2 input dimensions
M = 100         # 100 hidden units (this is a hyperparameter and can be changed as needed)

# ----- randomly initialize weights for hidden layer, and then output layer ------
# layer 1
W = np.random.randn(D, M) / np.sqrt(D)              # shape (2 x 100)
b = np.zeros(M)

# layer 2
V = np.random.randn(M) / np.sqrt(M)                 # shape (1 x 100), a 1-d vector
c = 0

Now we can write our function that will be used to calculate the output. Note that we don't pass in our weights and biases, because they are treated as global variables. Also, we utilize the relu activation function in place of the sigmoid or tanh here. The relu sets any node input that is less than 0 to 0, while any input greater than 0 keeps its value.

In [91]:
def forward(X):
    Z = X.dot(W) + b             # shapes -> (500 x 2) (2 x 100) + (1 x 100) = (500 x 100)
    Z = Z * (Z > 0)              # relu
    Yhat = Z.dot(V) + c          # shapes -> (500 x 100) (100 x 1) + scalar = (500 x 1)
    return Z, Yhat

At this point we can write our functions to calculate our derivatives and perform the gradient descent update. We can start with the derivative for $V$. Remember, $V$ has a shape of (1 x 100), so we want to make sure that is the shape our derivative returns, since we need an update for each value of the V weight vector. The difference between our target and prediction, (Y - Yhat), has a shape of (1 x 500). The shape of Z is (500 x 100). So the dot product between those two gives us the expected shape of (1 x 100).

In [92]:
def derivative_V(Z, Y, Yhat):
    return (Y - Yhat).dot(Z)         # (target value - prediction).dot(Z)
In [93]:
def derivative_c(Y, Yhat):
    return (Y - Yhat).sum()          # will return a scalar value

As for the derivative of $W$, keep in mind that it has a shape of (2 x 100), so that is the shape this function should return. We also know that we have a relu activation function at our hidden layer, so we need to utilize its derivative instead of the derivative of the sigmoid or tanh. If we quickly check our shapes, we know that (Y - Yhat) has a shape of (1 x 500), and that V has a shape of (1 x 100). If we take the outer product, applying each element of (Y - Yhat) to each element of V (think column times row), we get a (500 x 100) matrix. We then multiply that element-wise by the derivative of the relu, take X, shape (500 x 2), transpose it, and perform the dot product with the matrix above. Our result is an update matrix of shape (2 x 100).

In [94]:
def derivative_W(X, Z, Y, Yhat, V): 
    # dZ = np.outer(Y - Yhat, V) * (1 - Z * Z) -> this is for tanh activation
    dz = np.outer(Y - Yhat, V) * (Z > 0)
    return X.T.dot(dz)
In [95]:
def derivative_b(Y, Yhat, Z, V):
    # dZ = np.outer(Y - Yhat, V) * (1 - Z * Z) -> this is for tanh activation
    dz = np.outer(Y - Yhat, V) * (Z > 0)
    return dz.sum(axis=0)

With the derivatives out of the way, we can write an update function. This utilizes our derivative functions to calculate the gradients for $V, c, W, b$, and then updates each respective matrix.

In [96]:
def update(X, Z, Y, Yhat, W, b, V, c, learning_rate=1e-4):
    gV = derivative_V(Z, Y, Yhat)
    gc = derivative_c(Y, Yhat)
    gW = derivative_W(X, Z, Y, Yhat, V)
    gb = derivative_b(Y, Yhat, Z, V)
    
    V += learning_rate * gV
    c += learning_rate * gc
    W += learning_rate * gW
    b += learning_rate * gb
    
    return W, b, V, c

And let's write a function to get the cost. We will be using the mean squared error.

In [97]:
def calculate_cost(Y, Yhat):
    return ((Y - Yhat)**2).mean()

Now let's write the loop that implements the training of our network. This will perform the updates and keep track of our costs.

In [98]:
costs = []
for i in range(200):
    Z, Yhat = forward(X)
    W, b, V, c = update(X, Z, Y, Yhat, W, b, V, c)
    cost = calculate_cost(Y, Yhat)
    costs.append(cost)
    if i % 25 == 0:
        print(cost)
    
plt.plot(costs)
plt.show()
2.8755170042768743
0.07925520519395517
0.07610909178726558
0.07502038263597341
0.07411447870947158
0.07332344553406595
0.07264275305933296
0.07205616283669782

Great! We see that our cost does in fact converge. Now, we want to plot the function that our neural network has learned, and compare it against our data. To start, we will repeat the code we saw originally that created the 3d scatter plot of our data. Then we will create 20 evenly spaced points along each of the $x_1, x_2$ axes. We can do this using meshgrid, which returns every coordinate in that box. We can then flatten the two returned arrays and stack them together, which gives an (N x D) matrix that can be passed into the neural network. We are really just trying to cover the $x_1, x_2$ plane with points and make a prediction for each point's $y$ value. We can use plot_trisurf to plot all of the new predictions from the neural net.

In [99]:
# original plot of our data
fig = plt.figure(figsize=(14,10))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:,0], X[:,1], Y)

line = np.linspace(-2, 2, 20)    # 20 evenly spaced points from -2 to 2, in form of array
xx, yy = np.meshgrid(line, line)  # coordinate matrices covering every (x1, x2) combination in the grid
Xgrid = np.vstack((xx.flatten(), yy.flatten())).T
_, Yhat = forward(Xgrid)
ax.plot_trisurf(Xgrid[:,0], Xgrid[:,1], Yhat, linewidth=0.2, antialiased=True, cmap='Wistia')
plt.show()

Awesome, we can see that our neural network mapped the saddle function pretty well. However, it is slightly off in the corners, and the closer we get to the corners, the worse the neural net's prediction is.

One idea is that we can just plot the magnitude of the residuals at each point on the $x_1, x_2$ grid.

In [101]:
# plot magnitude of residuals
Ygrid = Xgrid[:,0]*Xgrid[:,1]
R = np.abs(Ygrid - Yhat)
fig = plt.figure(figsize=(14,10))
plt.scatter(Xgrid[:,0], Xgrid[:,1], c=R)
plt.show()

In this case the lighter color yellow means that our prediction was not very good (large error) and the darker color means that our prediction was good (small error).

We can also do a 3-d plot of the residuals, to see where we have the greatest error (the vertical axis in this case represents the amount of error; the blue regions are where it is greatest).

In [102]:
fig = plt.figure(figsize=(14,10))
ax = fig.add_subplot(111, projection='3d')
ax.plot_trisurf(Xgrid[:,0], Xgrid[:,1], R, linewidth=0.2, antialiased=True, cmap="RdBu")
plt.show()



Common nonlinearities and their derivatives

Let's now take a moment to look at some common nonlinearities and their derivatives. First, we have the sigmoid.

Sigmoid

The sigmoid is defined as: $$\sigma(x) = \frac{1}{1+e^{-x}}$$ Its derivative is: $$\sigma'(x) = \sigma(x)(1-\sigma(x))$$
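As a quick check, this identity follows directly from the definition above by differentiating:

$$\sigma'(x) = \frac{e^{-x}}{(1+e^{-x})^2} = \frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}} = \sigma(x)\big(1-\sigma(x)\big)$$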

Hyperbolic Tangent

The hyperbolic tangent is defined as: $$y = \tanh(x) = \frac{e^x - e^{-x}}{e^x+e^{-x}}$$ And its derivative is: $$\frac{dy}{dx} = 1 - y^2$$ tanh can be seen as a sigmoid that is just scaled horizontally and vertically.
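In fact, that relationship can be written out explicitly: the tanh is a sigmoid whose input and output are each scaled by 2, shifted down so that it is centered at 0:

$$\tanh(x) = 2\sigma(2x) - 1$$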

Rectified Linear Unit (ReLU)

We then have the ReLU. It is defined as: $$y = relu(x) = \max(0,x)$$ And its derivative is: $$\frac{dy}{dx} = 1$$ if $x$ is greater than 0, and: $$\frac{dy}{dx} = 0$$ if $x$ is less than 0.
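For reference, here is a minimal numpy sketch (consistent with how these appear in the code above, but not taken from it) of each nonlinearity and its derivative, expressed in terms of the activation output Z where possible:

import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def sigmoid_deriv(Z):         # Z = sigmoid(a)
    return Z * (1 - Z)

def tanh_deriv(Z):            # Z = np.tanh(a)
    return 1 - Z * Z

def relu(a):
    return a * (a > 0)

def relu_deriv(Z):            # Z = relu(a); derivative is 1 where Z > 0, else 0
    return (Z > 0).astype(float)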




Practical Considerations for Choosing an Activation Function

We are now going to talk about considerations that we should make when choosing the activation functions in the hidden layers of a neural network. So far we have looked at the sigmoid, tanh, and ReLU.



Note, we are not referring to the output layer; that depends only on what task you are doing. For binary classification it will be the sigmoid function, for multi-class classification it will be the softmax, and for regression it will be the identity (i.e. just return the input). This section is dealing with hidden layers only.

Sigmoid

Let's start by talking about the sigmoid. It is of theoretical interest because it goes between 0 and 1, which is a lot like a biological neuron, since they operate on the all-or-nothing principle. In reality, though, neural networks are more about function approximation than modeling the brain.

One problem the sigmoid encounters is that most of it is flat, and hence the derivative is essentially 0. Why is that a problem? Well, how does a neural network learn? It learns via the gradient, and the gradient is the derivative! So if the derivative is 0, then the neural network is not going to learn. Unless the input is in a fairly small region, roughly -4 to +4, our network is going to learn very slowly.

Note that the slope never actually reaches 0, it just approaches 0; but if you go far enough out, it rounds off to 0 due to the limits of numerical precision.
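A quick numerical check of this (a standalone sketch, independent of the code above):

import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

for x in [0.0, 2.0, 4.0, 10.0, 40.0]:
    s = sigmoid(x)
    print(x, s * (1 - s))     # the derivative shrinks rapidly, and rounds to exactly 0 far from the origin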

There is another problem we need to worry about when using the sigmoid. Remember that we often like to normalize our data before putting it into a machine learning algorithm, which means we set the data's mean to 0 and the variance to 1. Having the data in a small range helps us with the previous problem, where the derivative goes to 0 if the input gets too large. But what happens when the input to the sigmoid is 0? Well, $\sigma(0) = 0.5$, meaning that the sigmoid keeps offsetting the data! So even if your input is perfectly standardized to have 0 mean, after it goes through the sigmoid it is going to be centered at 0.5.

tanh

Next we have tanh, which solves the problem we just talked about. When you place 0 into the tanh it returns 0! However, the ends of tanh are still flat, so training is still very slow when the data is off at the edges. In general, the tanh is typically better than the sigmoid, so if you have to choose between the two, go with tanh.

Now, researchers were stuck on this s-shaped curve for a very long time. It is clearly nonlinear, and we need nonlinearities in order to create a nonlinear function approximator. And that's where the ReLU comes in...

ReLU

Nowadays we have the ReLU. The ReLU is a pretty ugly function, which may have something to do with why neural network researchers were stuck on the s-shape for so long. It is defined as:

$$ReLU(x) = \max(0, x)$$

For one thing it contains a max, which looks weird, and it also looks like a broken stick. Because of that, the derivative is not defined at 0! That may bother someone who is theoretically inclined. However, in practice, the chances of the input being exactly 0 are infinitesimally small!

So, how does the ReLU compare to the tanh and sigmoid? Well, it trains rather easily! Soon we will use the TensorFlow playground to look at the decision boundary found by the neural network as it learns. We will be able to see that the ReLU produces jagged decision boundaries, which suggests that it allows the neural network to contort itself into more interesting shapes. The ReLU also trains faster.

You may think that because half of the ReLU is flat, there is this whole region of space where the ReLU cannot learn. However, there is still the whole region of space on the right side where the ReLU is steep. In fact, the area where the ReLU is zero is a known problem in deep learning, and it leads to what are called dead neurons. Essentially what happens is that the activations go down to 0, and because the derivative is 0, the weights behind it don't get updated ever again. And so those neurons are always just computing zeros, hence the name dead neurons.

One solution to the dead neuron problem is to just add more hidden units, and hope that not too many of them go down to zero. Generally this is fine and we can still outperform the tanh and sigmoid.

Another option is to use what is called the leaky ReLU.



This is where we give a small positive slope to the left half of the ReLU, so the left side of the ReLU is no longer 0. This solves the dead neuron problem, but interestingly it could actually make your neural net perform worse. This does not mean that leaky ReLUs are worse, it just means that hyperparameters are synergistic! In order to get the best result, you have to choose the best learning rate, architecture, activation function, and so on.
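A minimal sketch of a leaky ReLU and its derivative (the slope of 0.01 is a common default, but it is itself a hyperparameter):

import numpy as np

def leaky_relu(a, alpha=0.01):
    # small positive slope on the left half instead of 0
    return np.where(a > 0, a, alpha * a)

def leaky_relu_deriv(a, alpha=0.01):
    # derivative is 1 on the right half, alpha on the left half
    return np.where(a > 0, 1.0, alpha)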




Hyperparameters and Cross-Validation

At certain points the question may definitely arise:

  • How do we choose the learning rate?
  • How do we choose the regularization parameter?
  • How do we choose the number of hidden units and hidden layers?
  • How do we choose between sigmoid, tanh, relu?

These are called hyperparameters, meaning that they are parameters that we have to choose and are not part of the model itself. There is no precise way to choose the correct hyperparameters. One method is cross validation.

K-Fold Cross-Validation

In K-fold cross-validation we split our data set into K different parts. Say K = 5 for the sake of this example. In that case we would have 5 iterations: in the first iteration we would train on parts 2-5 and test on part 1; in the second iteration we would train on parts 1, 3, 4, 5 and test on part 2. We would continue this pattern, and finally take the mean and variance of the classification rate. We could even do a statistical test (e.g. a t-test) to determine whether the means are statistically different. In code, cross-validation may look like this:

In [2]:
import numpy as np
from sklearn.utils import shuffle

def crossValidation(model, X, Y, K=5):
    X, Y = shuffle(X, Y)
    sz = len(Y) // K                 # size of each fold (integer division)
    scores = []
    for k in range(K):
        # train on everything except fold k, test on fold k
        xtr = np.concatenate([ X[:k*sz, :], X[(k*sz + sz):, :] ])
        ytr = np.concatenate([ Y[:k*sz], Y[(k*sz + sz):] ])
        xte = X[k*sz:(k*sz + sz), :]
        yte = Y[k*sz:(k*sz + sz)]
        
        model.fit(xtr, ytr)
        score = model.score(xte, yte)
        scores.append(score)
    return np.mean(scores), np.std(scores)
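As a usage sketch (the model here is hypothetical; any object exposing sklearn-style fit and score methods would work, and X, Y are assumed to hold a data set like the donut data from earlier), we could compare two hyperparameter settings like this:

from sklearn.linear_model import LogisticRegression

# hypothetical comparison of two regularization strengths
mean1, std1 = crossValidation(LogisticRegression(C=1.0), X, Y, K=5)
mean2, std2 = crossValidation(LogisticRegression(C=0.1), X, Y, K=5)
print(mean1, std1)
print(mean2, std2)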



Manually choosing the Learning Rate and Regularization

As far as the learning rate is concerned, just be sure to test different values on the right scale. In other words, increase and decrease it on a log scale (try 0.1, 0.01, 0.001, 0.0001, 0.00001, and so on).
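A minimal sketch of such a sweep (train_and_score is a hypothetical placeholder standing in for whatever training-and-validation loop you are using):

def train_and_score(learning_rate):
    # hypothetical placeholder: in practice this would train your network with the
    # given learning rate and return a validation score
    return 0.0

for exponent in range(1, 6):
    lr = 10 ** (-exponent)              # 0.1, 0.01, 0.001, 0.0001, 0.00001
    print("learning rate:", lr, "validation score:", train_and_score(lr))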

If you are wondering whether your learning rate is too high or too low, here is how to tell.

If it is too high you will see that your cost goes to infinity/NaN. The strange thing is that your neural network will keep multiplying NaNs as though they were numbers, so you will need to stop your code manually. On the other hand, if your error/cost is converging too slowly, then your learning rate is probably too low. Try turning it up by a factor of 10 so that it learns faster.

A good technique to try if things are not working right is to turn off all modifications except for regular gradient descent (no regularization or anything else you've learned). Then keep lowering the learning rate to see if things improve.

Normalizing Cost and Regularization Penalty

It is possible to train a neural network using batches of training data. When the data you are training on varies in size, the learning rate can become sensitive to it. That is because the cost function is the sum of all the individual errors, so if you have a lot of data points then the cost will be larger. To normalize this, we just divide the cost by N, the number of training samples being used at the time:

$$J = -\frac{1}{N}\sum_n\sum_k t_{nk}\log(y_{nk})$$

We can also make the regularization penalty independent of the number of parameters: just divide the sum of squares of all parameters by the number of parameters. These normalizations are not always necessary, but they can be helpful to try if things don't work right away.
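A minimal sketch of both normalizations together (T is a one-hot target matrix, Y the corresponding predictions, and params a list of weight arrays; all of these names are illustrative):

import numpy as np

def normalized_cost(T, Y, params, reg=0.01):
    N = len(T)                                          # number of training samples in this batch
    n_params = sum(p.size for p in params)              # total number of parameters
    data_cost = -np.sum(T * np.log(Y)) / N              # cross entropy divided by N
    reg_cost = reg * sum((p * p).sum() for p in params) / n_params   # penalty divided by # of params
    return data_cost + reg_cost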


© 2018 Nathaniel Dake