Nathaniel Dake Blog

13. Theano Basics

Going from regular numpy to pandas is a pretty straightforward leap; going to theano, however, is not, and certain things do not happen as you would expect. Let's start by talking about variables.

In [41]:
# We use theano.tensor so much we can call it T
import theano.tensor as T

In order to create a scalar, vector, and matrix we do so as follows:

In [42]:
# initializing various types of variables
c = T.scalar('c')
v = T.vector('v')
A = T.matrix('A')

There are also tensors in theano, which are for arrays of dimensionality 3 and greater. You may encounter these when working with images that have not been flattened. For example, we may have image vectors that are 784 in length, but that need to be reshaped to 28x28 to be viewed as images. So, if you wanted to store the images as squares and you had N images, that would be (N x 28 x 28), which is a 3-dimensional tensor. If you also had 3 color channels, you would have an (N x 3 x 28 x 28) tensor, which is 4-dimensional.
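
These higher-dimensional symbolic variables are declared just like the scalar, vector, and matrix above, using the T we imported earlier (the variable names here are only illustrative):

# 3-dimensional tensor, e.g. N images of shape 28x28
images = T.tensor3('images')

# 4-dimensional tensor, e.g. N images with 3 color channels, each 28x28
color_images = T.tensor4('color_images')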

With that said, notice that the 3 variables we created above do not have values; they are just symbols, so we can do algebra on them!

In [43]:
w = A.dot(v)

However, we still have not done any multiplication, which would be impossible since A and v don't have values yet. So, how do we assign values to A and v and find the result w? This is where theano functions come into play. Let's import the top-level theano module.

In [44]:
import theano 

We can use this to create a theano function. Each function we create specifies its inputs and outputs.

In [45]:
matrix_times_vector = theano.function(inputs=[A,v], outputs=w)

Now we can import numpy so that we can create real arrays and call the function.

In [46]:
import numpy as np
A_val = np.array([[1,2], [3,4]])
v_val = np.array([5,6])

w_val = matrix_times_vector(A_val, v_val)
display(w_val)
array([17., 39.])
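
As a quick sanity check, the first entry is 1*5 + 2*6 = 17 and the second is 3*5 + 4*6 = 39, matching the output above.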

Now, this is nothing too impressive so far. However, one of the biggest advantages of theano is that it links all of these variables up into a graph, and we can use that structure to calculate gradients using the chain rule. In theano, regular variables are not updateable; to make an updateable variable we need what is called a shared variable. Let's do that now.

In [47]:
# Creating a shared variable so that we can do gradient descent 
# This will add another layer of complexity to the theano function
# First argument is the initial value, second argument is its name
x = theano.shared(20.0, 'x')

Let's also create a simple cost function that we can solve ourselves and that we know has a global minimum.

In [48]:
# Cost function that has a minimum value
cost = x*x + x + 1
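
Before training, it helps to know what answer to expect. Setting the derivative 2x + 1 equal to zero gives x = -0.5, where the cost is (-0.5)^2 + (-0.5) + 1 = 0.75. A quick check in plain Python (just a sanity check, not part of the theano graph):

# Analytic minimum of x*x + x + 1: the derivative is 2x + 1, which is zero at x = -0.5
x_min = -0.5
print(x_min**2 + x_min + 1)   # 0.75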

Now, let's tell theano how we want to update x, by giving it an update expression.

In [49]:
# In theano you do not have to compute gradients yourself; it calculates them automatically
# The grad function takes two parameters:
# parameter 1: the expression you want to take the gradient of
# parameter 2: the variable you want the gradient taken with respect to
# You can also pass multiple variables as a list for the second parameter
x_update = x - 0.3*T.grad(cost, x)

Now we can create the theano train function. This is like the previous function we created, except we are going to add a new argument which is updates. The updates argument takes in a list of tuples, and each tuple has 2 things in it:

  1. The shared variable to update
  2. The update expression

In [50]:
# Note that there are no inputs 
train = theano.function(inputs=[], outputs=cost, updates=[(x, x_update)])

So, we have created a function to train, but we have not actually called it yet. Notice that x is not an input; it is the thing that we update. In later examples the inputs will be the data and labels: the inputs param takes in data and labels, while the updates param takes in your model parameters and their update expressions, as sketched below.
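
As a rough, self-contained sketch of that pattern (the tiny linear model and all variable names below are made up purely for illustration), data flows in through inputs while the shared parameter is changed through updates:

# Minimal sketch, not part of the example above: fit y = w*x with gradient descent
import numpy as np
import theano
import theano.tensor as T

x_data = T.vector('x_data')            # placeholder for the input data
y_data = T.vector('y_data')            # placeholder for the labels
w = theano.shared(0.0, 'w')            # model parameter (a shared, updateable variable)

y_hat = w * x_data                     # model prediction
mse = T.mean((y_hat - y_data)**2)      # squared-error cost
w_update = w - 0.1 * T.grad(mse, w)    # gradient descent update expression

train = theano.function(
    inputs=[x_data, y_data],           # data and labels go in as inputs
    outputs=mse,
    updates=[(w, w_update)],           # parameters go in as (shared variable, update) pairs
    allow_input_downcast=True
)

for _ in range(50):
    train(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0]))
print(w.get_value())                   # approaches 2.0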

Now we have to write our own loop to call the training function.

In [51]:
for i in range(25):
  cost_val = train()
  print(cost_val)
  
# Print the optimal value of x
print(x.get_value())
421.0
67.99000000000001
11.508400000000002
2.4713440000000007
1.0254150400000002
0.7940664064
0.7570506250240001
0.75112810000384
0.7501804960006143
0.7500288793600982
0.7500046206976159
0.7500007393116186
0.750000118289859
0.7500000189263775
0.7500000030282203
0.7500000004845152
0.7500000000775223
0.7500000000124035
0.7500000000019845
0.7500000000003176
0.7500000000000506
0.7500000000000082
0.7500000000000013
0.7500000000000001
0.7500000000000001
-0.4999999976919052
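
Notice that the cost converges to 0.75 and x converges to -0.5, which is exactly the minimum we worked out analytically above.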


2. Theano: Building a Neural Network

We are now going to build a neural network with theano, using the basics that we learned in part 1. We can start with our imports.

In [54]:
import numpy as np
import theano
import theano.tensor as T
import matplotlib.pyplot as plt        # Pulling in so that we can plot the log-likelihood
import seaborn as sns
from util import get_normalized_data, y2indicator # Util to get data and create ind matrix

# Seaborn Plot Styling
sns.set(style="white", palette="husl")
sns.set_context("poster")
sns.set_style("ticks")

We can now define an error rate function. Newer versions of Theano include the ReLU function, but in case yours does not, we can create one right now. Note that both of these functions take true/false values and turn them into numbers.

In [59]:
def error_rate(p, t):
  return np.mean(p != t)

def relu(a): 
  return a * (a > 0)
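
For example (the values here are chosen just for illustration), the boolean comparisons are cast to 0s and 1s when used in arithmetic:

a = np.array([-2.0, 0.5, 3.0])
print(relu(a))                                                 # negatives are zeroed out
print(error_rate(np.array([1, 2, 3]), np.array([1, 0, 3])))    # 0.333..., one of three predictions is wrong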

We can now write our main function.

In [65]:
def main():
  """------------- Step 1: Get our data and define the usual variables ------------"""
  X, Y = get_normalized_data()
  
  max_iter = 20
  print_period = 10
  
  lr = 0.00004
  reg = 0.01
  
  Xtrain = X[:-1000,]
  Ytrain = Y[:-1000]
  Xtest  = X[-1000:,]
  Ytest  = Y[-1000:]
  Ytrain_ind = y2indicator(Ytrain)
  Ytest_ind = y2indicator(Ytest)

  N, D = Xtrain.shape
  batch_sz = 500
  n_batches = N // batch_sz

  M = 300   # 300 hidden units
  K = 10    # 10 classes 
  W1_init = np.random.randn(D, M) / 28           # 28 = sqrt(784) = sqrt(D)
  b1_init = np.zeros(M)
  W2_init = np.random.randn(M, K) / np.sqrt(M)   # scale by sqrt of the input dimension
  b2_init = np.zeros(K)
  
  """------------- Step 2: Define theano variables and expressions ---------------"""
  thX = T.matrix('X')                  # Placeholder for X input matrix
  thT = T.matrix('T')                  # Placeholder for the targets
  W1 = theano.shared(W1_init, 'W1')    # All parameters will be shared variables
  b1 = theano.shared(b1_init, 'b1')    # Shared variable: first arg is initial value
  W2 = theano.shared(W2_init, 'W2')    # second arg is name
  b2 = theano.shared(b2_init, 'b2')
  
  thZ = relu( thX.dot(W1) + b1)                # Create function to solve for Z using relu
  thY = T.nnet.softmax( thZ.dot(W2) + b2)      # Create function to solve for Y using softmax
  
  # cost is sum of targets times log of predictions plus regularization
  cost = ( -(thT * T.log(thY)).sum() + reg*((W1*W1).sum() + 
                                            (b1*b1).sum() + (W2*W2).sum() + (b2*b2).sum()))
  prediction = T.argmax(thY, axis=1)    # Need prediction to calculate error rate
  
  """------------- Step 3: Create training/update expressions ---------------"""
  # We can just include regularization as part of the cost because it is also 
  # automatically differentiated!
  # update_W1 = W1 - lr*(T.grad(cost, W1) + reg*W1)
  # update_b1 = b1 - lr*(T.grad(cost, b1) + reg*b1)
  # update_W2 = W2 - lr*(T.grad(cost, W2) + reg*W2)
  # update_b2 = b2 - lr*(T.grad(cost, b2) + reg*b2)
  update_W1 = W1 - lr*T.grad(cost, W1)
  update_b1 = b1 - lr*T.grad(cost, b1)
  update_W2 = W2 - lr*T.grad(cost, W2)
  update_b2 = b2 - lr*T.grad(cost, b2)
  
  # Now we create our train function. Takes in placeholder for X input matrix and 
  # placeholder for targets matrix 
  train = theano.function(
    inputs=[thX, thT],
    updates=[(W1, update_W1), (b1, update_b1), (W2, update_W2), (b2, update_b2)]
  )
  
  # Create a function to get prediction because we want to do it over the whole dataset
  get_prediction = theano.function(
    inputs = [thX, thT],
    outputs = [cost, prediction],
  )
  
  # Training Loop
  costs = []
  for i in range(max_iter):
      for j in range(n_batches):
          Xbatch = Xtrain[j*batch_sz:(j*batch_sz + batch_sz),]
          Ybatch = Ytrain_ind[j*batch_sz:(j*batch_sz + batch_sz),]
  
          train(Xbatch, Ybatch)     # Calling the train function we created
          if j % print_period == 0:
            # calling in prediction function we created to get cost and prediction
            cost_val, prediction_val = get_prediction(Xtest, Ytest_ind)
            err = error_rate(prediction_val, Ytest)
            print("Cost / err at iteration i=%d, j=%d: %.3f / %.3f" % (i, j, cost_val, err))
            costs.append(cost_val)
            
  fig, ax = plt.subplots(figsize=(12,8))
  plt.plot(costs)
  plt.show()
if __name__ == '__main__':
  main()
Reading in and transforming data...
Cost / err at iteration i=0, j=0: 2588.946 / 0.926
Cost / err at iteration i=0, j=10: 1911.592 / 0.557
Cost / err at iteration i=0, j=20: 1512.708 / 0.360
Cost / err at iteration i=0, j=30: 1265.897 / 0.293
Cost / err at iteration i=0, j=40: 1095.560 / 0.248
Cost / err at iteration i=0, j=50: 971.813 / 0.217
Cost / err at iteration i=0, j=60: 875.982 / 0.201
Cost / err at iteration i=0, j=70: 804.008 / 0.190
Cost / err at iteration i=0, j=80: 746.732 / 0.176
Cost / err at iteration i=1, j=0: 735.884 / 0.174
Cost / err at iteration i=1, j=10: 690.698 / 0.163
Cost / err at iteration i=1, j=20: 653.627 / 0.151
Cost / err at iteration i=1, j=30: 622.062 / 0.151
Cost / err at iteration i=1, j=40: 595.092 / 0.151
Cost / err at iteration i=1, j=50: 570.609 / 0.145
Cost / err at iteration i=1, j=60: 549.366 / 0.140
Cost / err at iteration i=1, j=70: 531.280 / 0.134
Cost / err at iteration i=1, j=80: 515.049 / 0.134
Cost / err at iteration i=2, j=0: 511.591 / 0.133
Cost / err at iteration i=2, j=10: 497.448 / 0.132
Cost / err at iteration i=2, j=20: 484.829 / 0.129
Cost / err at iteration i=2, j=30: 473.463 / 0.127
Cost / err at iteration i=2, j=40: 463.852 / 0.127
Cost / err at iteration i=2, j=50: 453.538 / 0.118
Cost / err at iteration i=2, j=60: 444.255 / 0.119
Cost / err at iteration i=2, j=70: 436.443 / 0.117
Cost / err at iteration i=2, j=80: 428.642 / 0.117
Cost / err at iteration i=3, j=0: 426.851 / 0.114
Cost / err at iteration i=3, j=10: 419.732 / 0.115
Cost / err at iteration i=3, j=20: 413.157 / 0.116
Cost / err at iteration i=3, j=30: 407.223 / 0.114
Cost / err at iteration i=3, j=40: 402.436 / 0.115
Cost / err at iteration i=3, j=50: 396.783 / 0.114
Cost / err at iteration i=3, j=60: 391.615 / 0.115
Cost / err at iteration i=3, j=70: 387.294 / 0.115
Cost / err at iteration i=3, j=80: 382.475 / 0.112
Cost / err at iteration i=4, j=0: 381.331 / 0.112
Cost / err at iteration i=4, j=10: 376.878 / 0.109
Cost / err at iteration i=4, j=20: 372.680 / 0.108
Cost / err at iteration i=4, j=30: 368.901 / 0.106
Cost / err at iteration i=4, j=40: 366.152 / 0.107
Cost / err at iteration i=4, j=50: 362.569 / 0.103
Cost / err at iteration i=4, j=60: 359.298 / 0.104
Cost / err at iteration i=4, j=70: 356.609 / 0.103
Cost / err at iteration i=4, j=80: 353.274 / 0.102
Cost / err at iteration i=5, j=0: 352.448 / 0.101
Cost / err at iteration i=5, j=10: 349.250 / 0.100
Cost / err at iteration i=5, j=20: 346.222 / 0.099
Cost / err at iteration i=5, j=30: 343.674 / 0.100
Cost / err at iteration i=5, j=40: 341.886 / 0.101
Cost / err at iteration i=5, j=50: 339.333 / 0.100
Cost / err at iteration i=5, j=60: 337.053 / 0.098
Cost / err at iteration i=5, j=70: 335.219 / 0.099
Cost / err at iteration i=5, j=80: 332.712 / 0.097
Cost / err at iteration i=6, j=0: 332.052 / 0.096
Cost / err at iteration i=6, j=10: 329.537 / 0.096
Cost / err at iteration i=6, j=20: 327.149 / 0.094
Cost / err at iteration i=6, j=30: 325.299 / 0.096
Cost / err at iteration i=6, j=40: 324.058 / 0.093
Cost / err at iteration i=6, j=50: 322.090 / 0.096
Cost / err at iteration i=6, j=60: 320.407 / 0.095
Cost / err at iteration i=6, j=70: 319.103 / 0.093
Cost / err at iteration i=6, j=80: 317.102 / 0.094
Cost / err at iteration i=7, j=0: 316.545 / 0.094
Cost / err at iteration i=7, j=10: 314.519 / 0.093
Cost / err at iteration i=7, j=20: 312.516 / 0.091
Cost / err at iteration i=7, j=30: 311.083 / 0.091
Cost / err at iteration i=7, j=40: 310.204 / 0.090
Cost / err at iteration i=7, j=50: 308.625 / 0.090
Cost / err at iteration i=7, j=60: 307.340 / 0.092
Cost / err at iteration i=7, j=70: 306.404 / 0.091
Cost / err at iteration i=7, j=80: 304.766 / 0.091
Cost / err at iteration i=8, j=0: 304.283 / 0.092
Cost / err at iteration i=8, j=10: 302.569 / 0.090
Cost / err at iteration i=8, j=20: 300.833 / 0.090
Cost / err at iteration i=8, j=30: 299.677 / 0.090
Cost / err at iteration i=8, j=40: 299.028 / 0.089
Cost / err at iteration i=8, j=50: 297.739 / 0.090
Cost / err at iteration i=8, j=60: 296.748 / 0.089
Cost / err at iteration i=8, j=70: 296.084 / 0.088
Cost / err at iteration i=8, j=80: 294.722 / 0.090
Cost / err at iteration i=9, j=0: 294.294 / 0.089
Cost / err at iteration i=9, j=10: 292.807 / 0.088
Cost / err at iteration i=9, j=20: 291.247 / 0.089
Cost / err at iteration i=9, j=30: 290.289 / 0.087
Cost / err at iteration i=9, j=40: 289.803 / 0.089
Cost / err at iteration i=9, j=50: 288.691 / 0.087
Cost / err at iteration i=9, j=60: 287.928 / 0.089
Cost / err at iteration i=9, j=70: 287.475 / 0.089
Cost / err at iteration i=9, j=80: 286.325 / 0.088
Cost / err at iteration i=10, j=0: 285.937 / 0.087
Cost / err at iteration i=10, j=10: 284.629 / 0.085
Cost / err at iteration i=10, j=20: 283.203 / 0.085
Cost / err at iteration i=10, j=30: 282.409 / 0.086
Cost / err at iteration i=10, j=40: 282.032 / 0.086
Cost / err at iteration i=10, j=50: 281.069 / 0.085
Cost / err at iteration i=10, j=60: 280.476 / 0.086
Cost / err at iteration i=10, j=70: 280.183 / 0.086
Cost / err at iteration i=10, j=80: 279.185 / 0.085
Cost / err at iteration i=11, j=0: 278.824 / 0.085
Cost / err at iteration i=11, j=10: 277.656 / 0.085
Cost / err at iteration i=11, j=20: 276.334 / 0.085
Cost / err at iteration i=11, j=30: 275.648 / 0.084
Cost / err at iteration i=11, j=40: 275.355 / 0.085
Cost / err at iteration i=11, j=50: 274.482 / 0.084
Cost / err at iteration i=11, j=60: 274.012 / 0.086
Cost / err at iteration i=11, j=70: 273.832 / 0.086
Cost / err at iteration i=11, j=80: 272.950 / 0.086
Cost / err at iteration i=12, j=0: 272.614 / 0.084
Cost / err at iteration i=12, j=10: 271.559 / 0.084
Cost / err at iteration i=12, j=20: 270.321 / 0.084
Cost / err at iteration i=12, j=30: 269.729 / 0.083
Cost / err at iteration i=12, j=40: 269.517 / 0.083
Cost / err at iteration i=12, j=50: 268.713 / 0.083
Cost / err at iteration i=12, j=60: 268.344 / 0.083
Cost / err at iteration i=12, j=70: 268.269 / 0.084
Cost / err at iteration i=12, j=80: 267.487 / 0.083
Cost / err at iteration i=13, j=0: 267.172 / 0.082
Cost / err at iteration i=13, j=10: 266.216 / 0.082
Cost / err at iteration i=13, j=20: 265.048 / 0.082
Cost / err at iteration i=13, j=30: 264.524 / 0.083
Cost / err at iteration i=13, j=40: 264.367 / 0.082
Cost / err at iteration i=13, j=50: 263.599 / 0.082
Cost / err at iteration i=13, j=60: 263.309 / 0.083
Cost / err at iteration i=13, j=70: 263.322 / 0.085
Cost / err at iteration i=13, j=80: 262.618 / 0.084
Cost / err at iteration i=14, j=0: 262.322 / 0.083
Cost / err at iteration i=14, j=10: 261.439 / 0.081
Cost / err at iteration i=14, j=20: 260.337 / 0.083
Cost / err at iteration i=14, j=30: 259.865 / 0.081
Cost / err at iteration i=14, j=40: 259.756 / 0.081
Cost / err at iteration i=14, j=50: 259.020 / 0.079
Cost / err at iteration i=14, j=60: 258.798 / 0.081
Cost / err at iteration i=14, j=70: 258.876 / 0.082
Cost / err at iteration i=14, j=80: 258.250 / 0.081
Cost / err at iteration i=15, j=0: 257.971 / 0.080
Cost / err at iteration i=15, j=10: 257.157 / 0.077
Cost / err at iteration i=15, j=20: 256.117 / 0.080
Cost / err at iteration i=15, j=30: 255.685 / 0.078
Cost / err at iteration i=15, j=40: 255.609 / 0.081
Cost / err at iteration i=15, j=50: 254.898 / 0.077
Cost / err at iteration i=15, j=60: 254.726 / 0.078
Cost / err at iteration i=15, j=70: 254.861 / 0.081
Cost / err at iteration i=15, j=80: 254.286 / 0.079
Cost / err at iteration i=16, j=0: 254.020 / 0.078
Cost / err at iteration i=16, j=10: 253.256 / 0.077
Cost / err at iteration i=16, j=20: 252.264 / 0.078
Cost / err at iteration i=16, j=30: 251.865 / 0.077
Cost / err at iteration i=16, j=40: 251.836 / 0.078
Cost / err at iteration i=16, j=50: 251.130 / 0.075
Cost / err at iteration i=16, j=60: 250.990 / 0.078
Cost / err at iteration i=16, j=70: 251.173 / 0.078
Cost / err at iteration i=16, j=80: 250.653 / 0.077
Cost / err at iteration i=17, j=0: 250.403 / 0.076
Cost / err at iteration i=17, j=10: 249.691 / 0.075
Cost / err at iteration i=17, j=20: 248.751 / 0.076
Cost / err at iteration i=17, j=30: 248.380 / 0.075
Cost / err at iteration i=17, j=40: 248.377 / 0.075
Cost / err at iteration i=17, j=50: 247.684 / 0.074
Cost / err at iteration i=17, j=60: 247.555 / 0.075
Cost / err at iteration i=17, j=70: 247.782 / 0.075
Cost / err at iteration i=17, j=80: 247.303 / 0.075
Cost / err at iteration i=18, j=0: 247.064 / 0.074
Cost / err at iteration i=18, j=10: 246.395 / 0.074
Cost / err at iteration i=18, j=20: 245.506 / 0.075
Cost / err at iteration i=18, j=30: 245.156 / 0.074
Cost / err at iteration i=18, j=40: 245.172 / 0.074
Cost / err at iteration i=18, j=50: 244.490 / 0.073
Cost / err at iteration i=18, j=60: 244.394 / 0.074
Cost / err at iteration i=18, j=70: 244.664 / 0.073
Cost / err at iteration i=18, j=80: 244.235 / 0.075
Cost / err at iteration i=19, j=0: 244.009 / 0.074
Cost / err at iteration i=19, j=10: 243.380 / 0.073
Cost / err at iteration i=19, j=20: 242.522 / 0.074
Cost / err at iteration i=19, j=30: 242.178 / 0.074
Cost / err at iteration i=19, j=40: 242.226 / 0.073
Cost / err at iteration i=19, j=50: 241.547 / 0.073
Cost / err at iteration i=19, j=60: 241.475 / 0.074
Cost / err at iteration i=19, j=70: 241.772 / 0.073
Cost / err at iteration i=19, j=80: 241.383 / 0.073
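
After 20 passes over the training data, the test error rate is down to roughly 7.3%, and the cost values we collected for plotting decrease steadily throughout training.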

© 2018 Nathaniel Dake