Nathaniel Dake Blog

11. Hyperparameter Optimization

We have just introduced a lot of new hyperparameters, and this is only going to get worse as our neural networks get more and more complex. This leads us into a topic called hyperparameter optimization. So, what are the hyperparameters that we have learned about so far?

  • Learning rate (or, if it is adaptive, the initial learning rate and decay rate)
  • Momentum
  • Regularization weight
  • Hidden layer size
  • Number of hidden layers

So, what are some approaches to choosing these parameters?

1.1 K-Fold Cross-Validation

Let's go over K-fold cross-validation as a review. The idea is simple; we do the following:

  1. Split data into K parts
  2. Loop K times
  3. On each iteration, hold out one part as the validation set, and use the rest of the data as the training set

So in the first iteration of the loop, we validate with the first part, and train on the rest.

Here is some code that does this:

import numpy as np
from sklearn.utils import shuffle

def crossValidation(model, X, Y, K=5):
    X, Y = shuffle(X, Y)
    sz = len(Y) // K                                 # size of each fold
    scores = []
    for k in range(K):
        # Hold out fold k for validation, train on everything else
        xtr = np.concatenate([ X[:k*sz, :], X[(k*sz + sz):, :] ])
        ytr = np.concatenate([ Y[:k*sz], Y[(k*sz + sz):] ])
        xte = X[k*sz:(k*sz + sz), :]
        yte = Y[k*sz:(k*sz + sz)]

        model.fit(xtr, ytr)
        score = model.score(xte, yte)
        scores.append(score)
    return np.mean(scores), np.std(scores)

Now, this procedure produces K different scores. We can simply use the mean of these scores as a measure of how good this particular hyperparameter setting is. Another thing we could do is run a statistical test to determine whether the difference between two hyperparameter settings is statistically significant.
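Since the per-fold scores for two settings are paired (they come from the same folds), one option is a paired t-test. Here is a minimal sketch assuming SciPy is available; the score arrays below are made-up numbers purely for illustration:

import numpy as np
from scipy import stats

# Hypothetical per-fold scores for two hyperparameter settings (K=5 folds each)
scores_a = np.array([0.91, 0.93, 0.90, 0.92, 0.94])
scores_b = np.array([0.89, 0.92, 0.88, 0.90, 0.91])

# Paired t-test: each fold contributes one paired observation
t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
print("t = %.3f, p = %.3f" % (t_stat, p_value))   # a small p-value suggests the difference is not just chance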

1.2 Sci-Kit Learn K-Folds

Sci-Kit learn has its own K-folds implementation that is great to use:

from sklearn.model_selection import cross_val_score   # formerly sklearn.cross_validation
scores = cross_val_score(model, X, Y, cv=K)

Note that the SKLearn implementation does require you to conform to certain aspects of the SKLearn API. For example, you must provide a class with at least the 3 methods fit(), predict(), and score().
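As a hedged illustration (this class is not from the original post), a toy model satisfying that API might look like the following; in recent scikit-learn versions, cross_val_score also needs a get_params() method so it can clone the model:

import numpy as np
from sklearn.model_selection import cross_val_score

class MajorityClassifier:
    """Toy model that always predicts the most common training class."""
    def fit(self, X, Y):
        values, counts = np.unique(Y, return_counts=True)
        self.majority_ = values[np.argmax(counts)]    # remember the majority class
        return self

    def predict(self, X):
        return np.full(len(X), self.majority_)

    def score(self, X, Y):
        return np.mean(self.predict(X) == Y)          # classification accuracy

    def get_params(self, deep=True):                  # lets scikit-learn clone the model
        return {}

# scores = cross_val_score(MajorityClassifier(), X, Y, cv=5)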

1.3 Leave-One-Out Cross-Validation

One special variation of K-Folds cross-validation is where we set K = N. We will talk more about this later. But the basic idea is:

  1. We do a loop N times
  2. Every iteration of the loop we train on everything but one point
  3. We test on the one point that was left out
  4. We do this N times for all points
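As a sketch (assuming model, X, and Y are defined as in the earlier examples), leave-one-out is just K-fold cross-validation with K equal to the number of samples:

# Using the crossValidation function from above, with one fold per data point
mean_score, std_score = crossValidation(model, X, Y, K=len(Y))

# Or with scikit-learn's built-in splitter
from sklearn.model_selection import LeaveOneOut, cross_val_score
loo_scores = cross_val_score(model, X, Y, cv=LeaveOneOut())
print(loo_scores.mean())   # fraction of points classified correctly when held out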

Now with all of that discussed, what are some of the different approaches to hyperparameter optimization?



1.4 Grid Search

Grid search is an exhaustive search. This means that you choose a set of learning rates that you want to try, a set of momentums you want to try, and a set of regularizations that you want to try, at which point you try every combination of them. In code that may look like:

learning_rates = [0.1, 0.01, 0.001, 0.0001, 0.00001]
momentums = [1, 0.1, 0.01, 0.001]
regularizations = [1, 0.1, 0.01]

for lr in learning_rates: 
    for mu in momentums:
        for reg in regularizations:
            score = cross_validation(lr, mu, reg, data)

As you can imagine, this is very slow! But, since each model is independent of the others, there is a great opportunity here for parallelization! Frameworks like Hadoop and Spark are ideal for this type of problem.
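Even on a single machine, the independent combinations can be farmed out to worker processes. Here is a hedged sketch using Python's multiprocessing module; cross_validation, data, and the hyperparameter lists are the hypothetical pieces from the pseudocode above:

from itertools import product
from multiprocessing import Pool

def evaluate(setting):
    # Each worker evaluates one (learning rate, momentum, regularization) combination
    lr, mu, reg = setting
    return setting, cross_validation(lr, mu, reg, data)

if __name__ == '__main__':
    settings = list(product(learning_rates, momentums, regularizations))
    with Pool(processes=4) as pool:
        results = pool.map(evaluate, settings)          # combinations run in parallel
    best_setting, best_score = max(results, key=lambda r: r[1])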



1.5 Random Search

On the other hand we have random search, which, instead of looking at every possibility, just moves in random directions through hyperparameter space, keeping a move whenever the score improves. A basic algorithm could look like:

theta = random position in hyperparameter space
score1 = cross_validation(theta, data)
for i in range(max_iterations):
    next_theta = sample from hypersphere around theta
    score2 = cross_validation(next_theta, data)
    if score2 is better than score1:
        theta = next_theta
        score1 = score2


2. Sampling Logarithmically

Let's now talk about how to sample random numbers when performing random search. It is not quite as straightforward as you may assume.

2.1 Main Problem

Suppose we want to randomly sample the learning rate. We know that the difference between 0.001 and 0.0011 is not that significant. In general, we want to try different numbers on a log scale, such as $10^{-2}$, $10^{-3}$, $10^{-4}$, etc. So if you sample uniformly between $10^{-7}$ and $10^{-1}$, what is going to happen?

Well, most of what we would sample is on the same scale as $10^{-1}$ (roughly 90% of a uniform draw from that range lands above $10^{-2}$), while every smaller order of magnitude is underrepresented!
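A quick numerical check (not in the original post) makes the skew concrete:

import numpy as np

# Draw 100,000 candidate learning rates uniformly between 10^-7 and 10^-1
samples = np.random.uniform(1e-7, 1e-1, size=100000)

print(np.mean(samples > 1e-2))   # ~0.90 -- most draws are on the 10^-1 / 10^-2 scale
print(np.mean(samples < 1e-4))   # ~0.001 -- the small scales are almost never sampled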

So, how can we fix this problem? Well, we will sample on a log scale! That way we get an even distribution across each power of 10, which is exactly what we want! Algorithmically this may look like:

R ~ U(-7, -1)             # Sample the exponent uniformly, using whatever limits you want
Learning Rate = 10^R

This can also be used for hyperparameters like the decay rate, where we want to try numbers like 0.9, 0.99, 0.999, etc.!

It may not seem intuitive that these numbers are still on a log scale, but if we rewrite those numbers as: $$0.9 = 1 - 10^{-1}$$ $$0.99 = 1 - 10^{-2}$$ $$0.999 = 1 - 10^{-3}$$

We can see that they are indeed still on a log scale! These settings can give very different results, so being able to sample them effectively is very important. Algorithmically it may look like:

R ~ U(lower, upper)
Decay Rate = 1 - 10^R
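In NumPy, both samplers might look like this (a minimal sketch; the bounds are just examples):

import numpy as np

# Learning rate: sample the exponent uniformly, then exponentiate
R = np.random.uniform(-7, -1)
learning_rate = 10 ** R          # always positive, evenly spread across orders of magnitude

# Decay rate: sample 1 - 10^R so that 0.9, 0.99, 0.999 are all equally likely scales
R = np.random.uniform(-3, -1)
decay_rate = 1 - 10 ** R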


3. Grid Search in Code

The dataset that we are going to look at is the spiral dataset. It is a set of arms extending out from the origin, and each arm is a different class than the one next to it. Why are we using this dataset and not something like MNIST? Well, images take a long time to train on. If we wanted to test out 4 different hyperparameters with 3 different settings each, that is $3^4 = 81$ different combinations. If each combination takes 1 hour to train, that is 81 hours! A bit too long for our liking.

So, inside of our get_spiral function, we are going to create the spirals using polar coordinates. We want the radius to stretch outward from a small starting value to some maximum, and we want the angle to change with the radius. So, as the radius goes from its minimum to its maximum, the angle goes from its start angle to its start angle plus $\frac{\pi}{2}$. Because $\frac{\pi}{2}$ is just 90 degrees, each spiral is going to travel 90 degrees in the angular direction as the radius increases. We will have 6 spiral arms, which is why we split up the start angles by $\frac{2\pi}{6}$ or $\frac{\pi}{3}$, aka 60 degrees.

Next we convert the polar coordinates into cartesian coordinates. The formula is: $$x_1 = r\cos(\theta)$$ $$x_2 = r\sin(\theta)$$ We are referring to the variables as $x_1$ and $x_2$ because we are going to use the letter $y$ for the targets.

The next step is to add $x_1$ and $x_2$ to the $X$ matrix that is going to represent the input data. $x_1$ and $x_2$ begin as arrays of size (6 x 100), but we need to flatten them to length 600, since there are 600 data points: 100 for each of the 6 spiral arms. Next we add some noise to the data so it isn't a perfect set of spirals.

We then create the targets, which is just 100 zeros followed by 100 ones, and so on. This is because we want each spiral arm to be a different class than the arms beside it.

In [6]:
"""Function to get the spiral dataset"""
def get_spiral():
    radius = np.linspace(1, 10, 100)
    thetas = np.empty((6, 100))
    for i in range(6):
        start_angle = np.pi*i / 3.0
        end_angle = start_angle + np.pi / 2
        points = np.linspace(start_angle, end_angle, 100)
        thetas[i] = points

    # convert into cartesian coordinates
    x1 = np.empty((6, 100))
    x2 = np.empty((6, 100))
    for i in range(6):
        x1[i] = radius * np.cos(thetas[i])
        x2[i] = radius * np.sin(thetas[i])

    # inputs
    X = np.empty((600, 2))
    X[:,0] = x1.flatten()
    X[:,1] = x2.flatten()

    # add noise
    X += np.random.randn(600, 2)*0.5

    # targets
    Y = np.array([0]*100 + [1]*100 + [0]*100 + [1]*100 + [0]*100 + [1]*100)
    return X, Y
In [12]:
import theano.tensor as T
from theano_ann import ANN
from util import get_spiral, get_clouds
from sklearn.utils import shuffle
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Seaborn Plot Styling
sns.set(style="white", palette="husl")
sns.set_context("poster")
sns.set_style("ticks")

Now let's just take a quick look at our dataset.

In [11]:
"""Visualize Dataset"""
X, Y = get_spiral()
fig, ax = plt.subplots(figsize=(12,8))
plt.scatter(X[:,0], X[:,1], s=100, c=Y, alpha=0.5, cmap="viridis")
plt.show()

And now we can finally get back to grid search! Note that we have imported theano_ann, which makes use of a custom class that we have not gone over yet. However, it is fully encapsulated, so it is just like using Scikit-Learn! You don't need to know how it is written, just how to use its API.

Something else to note is that while it is standard practice in machine learning to use cross-validation and a test set, we will not do that here. It is not very common in the subfield of deep learning, because you generally have so much data that a single validation set is fine. Plus, if training takes 1 hour and you want to do 5-fold cross-validation, then it is going to take 5 hours.

In [ ]:
"""Grid Search Function"""
def grid_search(): 
    X, Y = get_spiral()                          # Get the data
    X, Y = shuffle(X, Y)
    Ntrain = int(0.7 * len(X))                   # Shuffle and split data into train & test 
    Xtrain, Ytrain = X[:Ntrain], Y[:Ntrain]
    Xtest, Ytest = X[Ntrain:], Y[Ntrain:]
    
    # Create arrays of hyperparameters to try
    hidden_layer_sizes = [
        [300],
        [100,100],
        [50,50,50],
    ]
    learning_rates = [1e-4, 1e-3, 1e-2]
    l2_penalties = [0., 0.1, 1.0]
    
    # Loop through all possible hyperparameter settings
    best_validation_rate = 0
    best_hls = None
    best_lr = None
    best_l2 = None
    for hls in hidden_layer_sizes:       # Nested for loop through all settings
        for lr in learning_rates:
            for l2 in l2_penalties:
                model = ANN(hls)                 # Create an ANN
                model.fit(Xtrain, Ytrain, learning_rate=lr, reg=l2, 
                          mu=0.99, epochs=3000, show_fig=False) # Fit with chosen hyperparams
                validation_accuracy = model.score(Xtest, Ytest) # Get validation accuracy
                train_accuracy = model.score(Xtrain, Ytrain)    # Get train accuracy
                print(
                  "validation_accuracy: %.3f, train_accuracy: %.3f, settings: %s, %s, %s" %
                    (validation_accuracy, train_accuracy, hls, lr, l2)
                )
                # If best accuracy, keep the current settings
                if validation_accuracy > best_validation_rate:
                    best_validation_rate = validation_accuracy
                    best_hls = hls
                    best_lr = lr
                    best_l2 = l2
                    
    print("Best validation_accuracy:", best_validation_rate)
    print("Best settings:")
    print("hidden_layer_sizes:", best_hls)
    print("learning_rate:", best_lr)
    print("l2:", best_l2)
In [19]:
grid_search()
validation_accuracy: 0.650, train_accuracy: 0.707, settings: [300], 0.0001, 0.0
validation_accuracy: 0.644, train_accuracy: 0.714, settings: [300], 0.0001, 0.1
validation_accuracy: 0.672, train_accuracy: 0.705, settings: [300], 0.0001, 1.0
validation_accuracy: 0.939, train_accuracy: 0.962, settings: [300], 0.001, 0.0
validation_accuracy: 0.950, train_accuracy: 0.967, settings: [300], 0.001, 0.1
validation_accuracy: 0.928, train_accuracy: 0.955, settings: [300], 0.001, 1.0
validation_accuracy: 0.967, train_accuracy: 0.988, settings: [300], 0.01, 0.0
validation_accuracy: 0.967, train_accuracy: 0.988, settings: [300], 0.01, 0.1
validation_accuracy: 0.967, train_accuracy: 0.990, settings: [300], 0.01, 1.0
validation_accuracy: 0.678, train_accuracy: 0.752, settings: [100, 100], 0.0001, 0.0
validation_accuracy: 0.672, train_accuracy: 0.731, settings: [100, 100], 0.0001, 0.1
validation_accuracy: 0.644, train_accuracy: 0.717, settings: [100, 100], 0.0001, 1.0
validation_accuracy: 0.983, train_accuracy: 0.990, settings: [100, 100], 0.001, 0.0
validation_accuracy: 0.978, train_accuracy: 0.981, settings: [100, 100], 0.001, 0.1
validation_accuracy: 0.983, train_accuracy: 0.986, settings: [100, 100], 0.001, 1.0
validation_accuracy: 0.967, train_accuracy: 0.990, settings: [100, 100], 0.01, 0.0
validation_accuracy: 0.956, train_accuracy: 1.000, settings: [100, 100], 0.01, 0.1
validation_accuracy: 0.972, train_accuracy: 0.995, settings: [100, 100], 0.01, 1.0
validation_accuracy: 0.650, train_accuracy: 0.719, settings: [50, 50, 50], 0.0001, 0.0
validation_accuracy: 0.717, train_accuracy: 0.810, settings: [50, 50, 50], 0.0001, 0.1
validation_accuracy: 0.772, train_accuracy: 0.819, settings: [50, 50, 50], 0.0001, 1.0
validation_accuracy: 0.972, train_accuracy: 0.990, settings: [50, 50, 50], 0.001, 0.0
validation_accuracy: 0.983, train_accuracy: 0.983, settings: [50, 50, 50], 0.001, 0.1
validation_accuracy: 0.983, train_accuracy: 0.986, settings: [50, 50, 50], 0.001, 1.0
validation_accuracy: 0.967, train_accuracy: 1.000, settings: [50, 50, 50], 0.01, 0.0
validation_accuracy: 0.961, train_accuracy: 1.000, settings: [50, 50, 50], 0.01, 0.1
validation_accuracy: 0.967, train_accuracy: 1.000, settings: [50, 50, 50], 0.01, 1.0
Best validation_accuracy: 0.9833333333333333
Best settings:
hidden_layer_sizes: [100, 100]
learning_rate: 0.001
l2: 0.0

One thing to be mindful of with grid search is that people are usually looking for ways to automate choosing hyperparameters, because it is kind of unscientific to choose them yourself. Grid search doesn't really solve that problem: you still need to choose which hyperparameter settings to try, and if you choose the wrong ones, you are still going to arrive at a poor result. So, unfortunately, there is no way of completely getting around trial and error.



4. Modifying Grid Search

We are now going to talk about the limitations of grid search, and what you might want to do instead. Let's go over the main problem with grid search. Say we have two hyperparameters that we are trying to optimize: the learning rate and the RMSProp epsilon. However, suppose that epsilon doesn't really have any effect on performance. Since we are still going to try several different values for epsilon (say, five of them) even though they have no impact, we are effectively doing five times as much work as necessary to achieve the same result.

For instance, the results down the first column of the grid (learning rate held constant, only epsilon changing) will all be the same, since changing epsilon has no effect!


The same thing would happen for the second column, and so on.

What is the solution to this? Instead of evenly dividing a grid into equally spaced points, just randomly choose points within the grid. You may get strange learning rates like 0.91324, but this is okay, and it greatly reduces the chance that you are going to duplicate work unnecessarily.
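As a hedged sketch of that idea, the fixed lists from the grid-search pseudocode could be replaced by random draws (the bounds and the commented-out cross_validation call are illustrative, not from the original post):

import numpy as np

n_trials = 15   # same budget as a 5 x 3 grid, but no duplicated work along a useless axis

for _ in range(n_trials):
    lr = 10 ** np.random.uniform(-5, -1)         # learning rate drawn on a log scale
    eps = 10 ** np.random.uniform(-10, -4)       # RMSProp epsilon drawn on a log scale
    # score = cross_validation(lr, eps, data)    # evaluate with the sampled values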



5. Random Search in Code

We will now implement random search in code. It will be very similar to the grid search that we did earlier, but instead of selecting the hyperparameters from a grid, they will be selected randomly.

We can start with our imports.

In [21]:
import theano.tensor as T
from theano_ann import ANN
from util import get_spiral, get_clouds
from sklearn.utils import shuffle
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

And now we can write our random_search function. A few things to note:

  • For the learning rate and L2 penalty we are actually going to search through the log of these hyperparameters. There are two reasons for this. The first is that the learning rate and L2 penalty have to be positive, so if we work in log space and later exponentiate these values, we can guarantee that they will always be positive. The second reason is that this is the scale at which we usually test hyperparameters such as the learning rate and regularization. Typically, if you have tried a learning rate of 0.1, you won't then try a learning rate of 0.099 or 0.098. Instead you would try 0.1, 0.01, 0.001, and so on: a log scale.
  • When we randomly select new hyperparameters, we can think of each hyperparameter as a dimension, so the complete set is like a vector, and we are essentially searching within a hypersphere around this vector. In simple terms, we go plus or minus a certain amount in each direction, i.e. for each parameter. For example, when we select a new number of hidden layers, the new number will be the current number minus 1, plus 0, or plus 1. Of course, the number of hidden layers cannot be less than 1, so we take the max of that value and 1. We do the same thing for M, the number of hidden units, but we search in steps of 10.
In [27]:
"""Function to perform a random search of hyperparameters"""
def random_search(): 
    X, Y = get_spiral()                          # Get the data
    X, Y = shuffle(X, Y)
    Ntrain = int(0.7 * len(X))                   # Shuffle and split data into train & test 
    Xtrain, Ytrain = X[:Ntrain], Y[:Ntrain]
    Xtest, Ytest = X[Ntrain:], Y[Ntrain:]
    
    # Choose starting set of hyperparameters
    M = 20                                       # Number hidden units
    nHidden = 2                                  # Number hidden layers
    log_lr = -4                                  # Log of learning rate 
    log_l2 = -2                                  # Log of L2 penalty
    max_tries = 30                               # Number of diff hyperparam settings to try
    
    # Initialize the best settings found so far (start from the initial hyperparameters)
    best_validation_rate = 0
    best_M = M
    best_nHidden = nHidden
    best_lr = log_lr
    best_l2 = log_l2
    
    # Loop through all possible hyperparameter settings
    for _ in range(max_tries):
        model = ANN([M]*nHidden)                  # Create model with current hiddenlayer size
        model.fit(
            Xtrain, Ytrain,
            learning_rate=10**log_lr, reg=10**log_l2,
            mu=0.99, epochs=3000, show_fig=False
        )                                         # Fit model with current l-rate and l2 pen
        validation_accuracy = model.score(Xtest, Ytest)
        train_accuracy = model.score(Xtrain, Ytrain)
        print(
            "validation_accuracy: %.3f, train_accuracy: %.3f, settings: %s, %s, %s" %
            (validation_accuracy, train_accuracy, [M]*nHidden, log_lr, log_l2)
        )
        
        # If validation accuracy is better than the best so far, set these as new hyperparms
        if validation_accuracy > best_validation_rate:
            best_validation_rate = validation_accuracy
            best_M = M
            best_nHidden = nHidden
            best_lr = log_lr
            best_l2 = log_l2
            
        # Randomly select new hyperparameters (this is different from grid search)
        nHidden = best_nHidden + np.random.randint(-1, 2) # -1, 0, or 1
        nHidden = max(1, nHidden)
        M = best_M + np.random.randint(-1, 2)*10
        M = max(10, M)
        log_lr = best_lr + np.random.randint(-1, 2)
        log_l2 = best_l2 + np.random.randint(-1, 2)

    print("Best validation_accuracy:", best_validation_rate)
    print("Best settings:")
    print("best_M:", best_M)
    print("best_nHidden:", best_nHidden)
    print("learning_rate:", best_lr)
    print("l2:", best_l2)
    
if __name__ == '__main__':
    random_search()
validation_accuracy: 0.672, train_accuracy: 0.745, settings: [20, 20], -4, -2
validation_accuracy: 0.528, train_accuracy: 0.619, settings: [10], -4, -1
validation_accuracy: 0.978, train_accuracy: 0.995, settings: [30, 30, 30], -3, -1
validation_accuracy: 0.978, train_accuracy: 1.000, settings: [40, 40], -2, 0
validation_accuracy: 0.978, train_accuracy: 1.000, settings: [20, 20, 20, 20], -3, -2
validation_accuracy: 0.511, train_accuracy: 0.495, settings: [20, 20, 20, 20], -2, 0
validation_accuracy: 0.972, train_accuracy: 0.998, settings: [20, 20, 20, 20], -3, -2
validation_accuracy: 0.967, train_accuracy: 0.995, settings: [30, 30], -3, 0
validation_accuracy: 0.967, train_accuracy: 1.000, settings: [40, 40, 40], -3, -1
validation_accuracy: 0.967, train_accuracy: 1.000, settings: [30, 30], -2, -1
validation_accuracy: 0.978, train_accuracy: 1.000, settings: [40, 40, 40], -2, -2
validation_accuracy: 0.983, train_accuracy: 0.995, settings: [20, 20], -3, -2
validation_accuracy: 0.978, train_accuracy: 0.998, settings: [30, 30], -3, -3
validation_accuracy: 0.983, train_accuracy: 0.998, settings: [20, 20], -2, -1
validation_accuracy: 0.972, train_accuracy: 0.998, settings: [30, 30], -3, -1
validation_accuracy: 0.928, train_accuracy: 0.957, settings: [30], -3, -2
validation_accuracy: 0.561, train_accuracy: 0.650, settings: [10], -4, -3
validation_accuracy: 0.567, train_accuracy: 0.645, settings: [30], -4, -2
validation_accuracy: 0.650, train_accuracy: 0.729, settings: [30], -4, -2
validation_accuracy: 0.944, train_accuracy: 0.948, settings: [20], -3, -1
validation_accuracy: 0.944, train_accuracy: 0.962, settings: [20], -3, -2
validation_accuracy: 0.861, train_accuracy: 0.905, settings: [10], -3, -2
validation_accuracy: 0.972, train_accuracy: 0.993, settings: [30], -2, -1
validation_accuracy: 0.983, train_accuracy: 1.000, settings: [10, 10, 10], -2, -1
validation_accuracy: 0.972, train_accuracy: 0.995, settings: [30, 30], -3, -1
validation_accuracy: 0.578, train_accuracy: 0.629, settings: [30], -4, -1
validation_accuracy: 0.956, train_accuracy: 0.988, settings: [10, 10], -2, -2
validation_accuracy: 0.967, train_accuracy: 0.950, settings: [10], -2, -3
validation_accuracy: 0.972, train_accuracy: 1.000, settings: [20, 20, 20], -2, -3
validation_accuracy: 0.972, train_accuracy: 0.993, settings: [10, 10], -3, -2
Best validation_accuracy: 0.9833333333333333
Best settings:
best_M: 20
best_nHidden: 2
learning_rate: -3
l2: -2

© 2018 Nathaniel Dake