Nathaniel Dake Blog

5. AdaBoost Algorithm

We are now going to talk about boosting, and introduce a realization of the boosting idea: the algorithm AdaBoost. It remains one of the most powerful ensemble methods in existence, and like the random forest it is considered a good off-the-shelf, plug-and-play model.

The idea behind boosting is very different from bagging and random forest. Recall that in those instances we wanted low bias, high variance models, and averaging reduced the variance. In boosting, we actually want high bias models; in boosting nomenclature these are called weak learners. Weak learners generally have classification accuracies not much better than chance, in general about 50-60%. We will show later, though, that even after ensembling the variance will remain low, the test error will remain low, and the ensemble will not overfit, even when we keep adding more trees.


1.1 Boosting

The hypothesis behind boosting is that if we combine many weak learners, as a whole they will be a strong learner. As in our stacking example earlier, what we are going to do is weight each learner so that each base model has a different amount of influence on the output.

$$F(x) = \sum_{m=1}^M \alpha_mf_m(x)$$


1.1.1 Weak Learners

So, how do we create or find weak learners? A typical way to do that with boosting is to use decision trees with a max depth of 1. These are also called decision stumps. A decision stump is just one split, in one dimension; in other words, each stump only splits the space in half once. Given how crude each individual stump is, it is actually remarkable that a combination of them can yield a good classifier.
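To make the stump idea concrete, here is a minimal sketch using scikit-learn and made-up synthetic data (the dataset here is purely hypothetical, for illustration):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: the class depends only on the sign of the first feature
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
Y = np.sign(X[:, 0])                          # targets in {-1, +1}

stump = DecisionTreeClassifier(max_depth=1)   # a decision stump: a single split
stump.fit(X, Y)
print(stump.get_depth())                      # 1 -> the space is split in half once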

Another way is just to use a linear classifier like logistic regression. An added bonus of using such simple models is that they train extremely fast. This means that you can quickly train an ensemble of thousands of base models.


1.1.2 Details

In AdaBoost, which we will get to shortly, there are a couple of details we have to make note of. First and foremost, AdaBoost uses {-1, +1} as targets, rather than {0, 1}. Why this matters will become clear when we see the algorithm. So, for example, if we were doing logistic regression we would rescale the output by multiplying by 2 and subtracting 1.

$$\text{ModifiedLogisticOutput}(x) = 2 \cdot \text{LogisticOutput}(x) - 1$$
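In code, assuming a hypothetical fitted logistic regression whose predict method returns values in {0, 1}, the rescaling is a one-liner:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
Y = (X[:, 0] + X[:, 1] > 0).astype(int)   # targets in {0, 1}

lr = LogisticRegression().fit(X, Y)
Yhat_pm1 = 2 * lr.predict(X) - 1          # rescale {0, 1} predictions to {-1, +1}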

The final output of AdaBoost is:

$$F_M(x) = sign(\sum_{m=1}^M \alpha_mf_m(x))$$

Where $f_m$ represents the individual base learners, $F_M$ is the ensemble model, with $M$ being the number of base learners. Since the targets are -1 and +1, the decision boundary is 0, so we just take the sign to determine the final prediction. We typically use $\alpha$ as the symbol to weight each classifier, since $w$ will be used for something else: in particular, to tell us how important each sample is.


1.2 AdaBoost

Let's now talk about the AdaBoost algorithm itself. The idea is that we are going to add each base model one at a time, which is called additive modeling. We train each base model on all of the training data, which means there is no resampling or bootstrapping here. The difference between this and the other algorithms we have studied is that we are going to weight how important each sample is, using $w_i$ for $i = 1...N$, and we will modify $w_i$ on each round. Initially the weights will all be equal, but if we get the prediction for the pair $(x_i, y_i)$ wrong, we will increase $w_i$ for the next round; otherwise we will decrease it. In this way, the next base model knows which samples are more important.

You can imagine that this may require us to modify the cost function. For example, for logistic regression, you would need to multiply each individual cross entropy term by the weight for that sample.

$$J_{weighted} = -\sum_{i=1}^N w_i\Big[t_i \log y_i + (1-t_i) \log(1-y_i) \Big]$$
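As a sketch, this weighted cross entropy is easy to compute in numpy; the arrays below are hypothetical, with t the 0/1 targets, y the predicted probabilities, and w the sample weights:

import numpy as np

t = np.array([1, 0, 1, 1])          # binary targets
y = np.array([0.9, 0.2, 0.6, 0.8])  # predicted probabilities
w = np.array([0.1, 0.4, 0.3, 0.2])  # sample weights (normalized to sum to 1)

# Each sample's cross entropy is scaled by that sample's weight
J_weighted = -np.sum(w * (t * np.log(y) + (1 - t) * np.log(1 - y)))
print(J_weighted)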

For the decision tree, luckily the scikit learn API already allows us to pass in sample weights to our fit function!

Once we have trained the base model (on all the data $X, Y$ with weights $w_i$, where $i=1..N$), we calculate its error, weighted by $w_i$. We then compute $\alpha_m$, which represents how important this base model is to the final model, as a function of the error. Note that if we had less error (our model was more accurate), then $\alpha_m$ should be bigger. Then we store $\alpha_m$ and the base model $f_m$. Once we hit the specified number of base models, we exit the loop and we are done training.


1.2.1 AdaBoost Pseudocode


Explained/with Equations

  • We want to start by giving $w_i$ a uniform distribution, so each sample has equal importance.
  • Then in a loop, we create a base model $f_m$ and train it on all the data with the weights in $w$
  • We then calculate the error for this iteration, $\epsilon_m$, which is also weighted by the sample weights $$\epsilon_m = \frac{\sum_{i=1}^N w_i I(y_i \neq f_m(x_i))}{\sum_{i=1}^N w_i}$$
  • We calculate $\alpha_m$, which is half the log ratio of the weighted correct rate to the weighted error rate. $$\alpha_m = \frac{1}{2}log \Big[\frac{1-\epsilon_m}{\epsilon_m} \Big]$$
  • Essentially this means that if the model is more correct, it gets a higher weight $\alpha$
  • Note, dividing by the sum over $w$ when we calculate the error rate $\epsilon_m$ is not necessary if we normalize $w$ like we do in the second-to-last step. It is left there for completeness' sake, and to remind us that the error will be a number between 0 and 1.
  • Next, we update the $w_i$'s. We can see that if we are correct, then $y_i$ and $f_m(x_i)$ are the same sign, hence $w_i$ decreases. If we are wrong, then $y_i$ and $f_m(x_i)$ are of opposite signs, so $w_i$ increases. This is why we require the targets to be either -1 or +1. $$w_i = w_i* exp\Big[-\alpha_my_if_m(x_i)\Big], i =1,...,N$$
  • We then normalize $w$ because we treat it like a probability distribution $$w_i = \frac{w_i}{\sum_{i=1}^N w_i}$$
  • The last step is to save $\alpha_m$ and $f_m(x)$

Pseudocode

Initialize w[i] = 1/N for i=1..N
for m=1..M:
  Fit f_m(x) with sample weights w[i]
  e_m = sum(w * (y != f_m(x))) / sum(w)
  a_m = 0.5 * log((1 - e_m) / e_m)
  w = w * exp(-a_m * y * f_m(x))
  w = w / sum(w)
  save a_m, f_m(x)

We can notice that the AdaBoost algorithm as presented is very specific to binary classification, requiring the two classes to be {-1, +1}. There are extensions in the literature that discuss AdaBoost modifications for multiclass classification and regression; our goal here is just to get the main idea down. If you do desire to work with AdaBoost for multiclass classification or regression, scikit-learn comes packaged with those already. As with random forest, the authors recommend using trees as the ideal base model, but linear classifiers are common as well.
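If you just want to use those packaged versions, a minimal sketch (with default parameters) looks like:

from sklearn.ensemble import AdaBoostClassifier, AdaBoostRegressor

# Both default to tree base learners
clf = AdaBoostClassifier(n_estimators=200)   # classification (multiclass supported)
reg = AdaBoostRegressor(n_estimators=200)    # regression (AdaBoost.R2)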



2. Additive Modeling

We are now going to look at AdaBoost as an instance of additive modeling. Particularly, we are going to talk about a specific instance of additive modeling, called forward stagewise additive modeling. It is called this because at each stage we add a new base model, without modifying any of the existing base models we have already added.

2.1 Forward Stagewise Additive Modeling

The general algorithm is seen below. Note the conventions: $L(y, f(x))$ is the loss/cost function given target $y$ and model $f(x)$; $F$ is the full model, and $f$ is the base model.

  1. We initialize our full model to 0
  2. We begin by going into a loop from 1..M
  3. We then create a base model, and find the best alpha and best model parameters, such that we minimize the total cost of the full model so far. $$(\alpha_m^*, \theta_m^*) = argmin_{\alpha_m, \theta_m} \sum_{i=1}^N L(y_i, F_{m-1}(x_i) + \alpha_mf_m(x_i;\theta_m))$$
  4. The final step inside of the loop is to add this weighted base model to the full model. Notice that this is recursive. $$F_m(x) = F_{m-1}(x) + \alpha_m^* f_m(x;\theta_m^*)$$ Note what this algorithm does not specify. It doesn't tell us that we have to weight the data points in any particular way. It doesn't tell us what cost function to use. Those are specific to any implementation of this; AdaBoost is one such example. You can imagine that if we chose the wrong cost function, or the wrong type of base model, or did not weight the samples at all, then we would get worse results.
Initialize F_0(x) = 0
for m = 1..M:
  Create base model
  Find the best alpha and best model parameters such that we minimize the cost of the full model so far
  Add the weighted base model to the full model
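To make this concrete, here is a minimal sketch of forward stagewise additive modeling, assuming squared error as the loss and regression stumps as base models (forward_stagewise is a hypothetical helper, not AdaBoost itself). For squared error, fitting the base model to the residuals $y - F_{m-1}(x)$ handles the $\theta_m$ step, and the $\alpha_m$ step is a one-dimensional line search:

import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.tree import DecisionTreeRegressor

def forward_stagewise(X, y, M=50):
  models, alphas = [], []
  Fx = np.zeros(len(y))                      # F_0(x) = 0
  for m in range(M):
    # theta_m step: fit f_m to the residuals of the model so far
    f = DecisionTreeRegressor(max_depth=1).fit(X, y - Fx)
    fx = f.predict(X)
    # alpha_m step: 1-D line search over the squared error
    alpha = minimize_scalar(lambda a: np.sum((y - (Fx + a * fx)) ** 2)).x
    Fx = Fx + alpha * fx                     # F_m = F_{m-1} + alpha_m * f_m
    models.append(f)
    alphas.append(alpha)
  return models, alphas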


3. AdaBoost Loss Function - Exponential Loss

We have looked at many loss functions already:

  • Binary cross-entropy (binary classification)
  • Multiclass cross-entropy (multiclass classification)
  • Squared Error (regression)
  • Absolute Error (regression)
  • Accuracy (perceptron)

Now we are going to look at one more - the loss function that AdaBoost uses - Exponential Loss. Because our outputs are {-1, +1} it does not make sense to use cross entropy.

3.1 Exponential Loss

The exponential loss function is defined as follows:

$$L\Big(y, f(x)\Big) = exp\Big(-yf(x)\Big)$$

We can see that, as always, it takes in two arguments: the target $y$, and $f(x)$, which is the model output. When $y$ and $f(x)$ have the same sign, the loss approaches 0 as the magnitude of $f(x)$ grows. When $y$ and $f(x)$ have opposite signs, the loss grows toward infinity. This means we still have the same asymptotic effect as the cross entropy function.
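A few sample values make this behavior concrete; this is just a quick numeric check:

import numpy as np

margins = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])   # values of y * f(x)
print(np.exp(-margins))                           # approx [7.39 2.72 1.00 0.37 0.14]

Wrong-sign margins are punished harder and harder, while correct ones push the loss toward 0.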

3.2 Cross Entropy Comparison

Let's consider the positive class, $y=1$, for cross entropy. Recall that with cross entropy and logistic regression, the input into the sigmoid, which we call the logit, has to equal infinity for the sigmoid to equal 1. Otherwise, we can get very close to 1, but never quite equal to 1. Mathematically that looks like:

$$Let \; f(x) = \sigma(w^Tx)$$

$$CrossEntropy(y, f(x)) = -\Big[y \log(f(x)) + (1-y) \log(1 - f(x))\Big]$$

If our class is 1, i.e. $y = 1$, then if our prediction $f(x)$ is 1, we have 0 loss. However, since $f(x)$ is the output of the sigmoid function, we would need the input to the sigmoid (the logit), $w^Tx$, to equal infinity. Since $w^Tx$ can never reach infinity, the loss never hits exactly 0, and we could keep increasing the weights forever.

3.3 Relationship to Squared Error

Squared error, on the other hand, is a loss we do not want to use for classification. Recall that squared error is defined as: $$L = \Big[y - f(x)\Big]^2$$

Squared error does not care which side of $y$ the prediction $f(x)$ is on: anything far away from $y$ increases the error, even a confidently correct prediction. For classification, if $f(x)$ has the right sign, it shouldn't matter how big it is, since it still leads to the right prediction. Exponential loss respects this; squared error does not.
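A quick worked comparison: take $y = 1$ and a confidently correct output of $f(x) = 5$. Squared error gives $(1 - 5)^2 = 16$, a large penalty, while exponential loss gives $e^{-5} \approx 0.0067$, essentially none:

import numpy as np

y, fx = 1, 5.0              # confidently correct prediction
print((y - fx) ** 2)        # 16.0    -> squared error punishes it
print(np.exp(-y * fx))      # ~0.0067 -> exponential loss barely does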

3.4 Additive Modeling Algorithm - AdaBoost Derivation

Now we know that using exponential loss "makes sense" in this situation, but we can take it one step further. In particular, we are going to use the additive modeling definition, solve for the best model weight, and in the process recover all of the equations that we saw in the AdaBoost algorithm. That means we will recover the equations for $\alpha$, the sample weights $w$, and the weighted errors.


3.4.1 Replace General Cost function with Exponential Loss

So, we are starting at this point:

$$(\alpha_m^*, \theta_m^*) = argmin_{\alpha_m, \theta_m} \sum_{i=1}^N L(y_i, F_{m-1}(x_i) + \alpha_mf_m(x_i;\theta_m))$$

The above equation is just saying that the best values for $\alpha$ and $\theta$ are those which minimize the total cost over the $N$ training examples. So, our first step is to replace the general cost function with the exponential loss that we discussed earlier (folding the parameters $\theta_m$ into the choice of $f_m$). $$(\alpha_m^*, f_m^*) = argmin_{\alpha_m, f_m} \sum_{i=1}^N exp\Big[-y_i\Big(F_{m-1}(x_i) + \alpha_mf_m(x_i)\Big)\Big]$$

Recall that we are using the convention that $F_M$ means the full ensemble model, and $f_m$ means the $m$th base model.


3.4.2 Expand Terms in Exponential

We can now expand the inner terms of the exponential above. Meaning we start with: $$J = \sum_{i=1}^N exp\Big[-y_i\Big(F_{m-1}(x_i) + \alpha_mf_m(x_i)\Big)\Big]$$ And, using $e^{a+b} = e^a e^b$, expand it to be: $$J = \sum_{i=1}^N exp\Big[-y_iF_{m-1}(x_i) \Big] exp \Big[- y_i\alpha_mf_m(x_i)\Big]$$


3.4.3 1st Exponential is Proportional to $w_i$ at iteration $m$

Let's look a little closer at the first exponential for a second: $$exp\Big[-y_iF_{m-1}(x_i) \Big]$$ We are going to prove that this exponential is proportional to $w_i^{(m)}$, meaning we can write it like so: $$J = \sum_{i=1}^N w_i^{(m)}exp \Big[- y_i\alpha_mf_m(x_i)\Big]$$ Let's show that proof now.


3.4.3.1 Proof of $w_i$ proportionality

To show that the above idea is true, we must recall the update rule for $w_i$:

$$w_i^{(m)} = w_i^{(m-1)} exp\Big(-y_i \alpha_{m-1}f_{m-1}(x_i)\Big)$$

We can keep defining $w_i$ in terms of earlier and earlier $w_i$'s! For instance, we can plug in the value for $w_i^{(m-1)}$ and find:

$$w_i^{(m)} = w_i^{(m-2)} exp\Big(-y_i \alpha_{m-2}f_{m-2}(x_i)\Big)exp\Big(-y_i \alpha_{m-1}f_{m-1}(x_i)\Big)$$

Unrolling all the way down, this means we can define $w_i^{(m)}$ as:

$$w_i^{(m)} = w_i^{(0)} exp\Big( - y_i \sum_{m'=1}^{m-1} \alpha_{m'}f_{m'}(x_i) \Big)$$

And we know that $w_i^{(0)}$ is a constant, because we made $w$ take on the uniform distribution:

$$w_i^{(0)} = \frac{1}{N}$$

Since $w$ gets normalized anyway, we can drop that constant (i.e. replace $w_i^{(0)}$ with 1) and notice that the sum in the exponent is just $F_{m-1}(x_i)$! Meaning our equation for $w_i^{(m)}$ turns into:

$$w_i^{(m)} = 1 * exp \Big( - y_i F_{m-1} (x_i)\Big)$$

Awesome! We have now recovered our first update rule from the AdaBoost algorithm! We know how to update $w_i$ in terms of its previous value. To recap, we started with:
$$J = \sum_{i=1}^N exp\Big[-y_iF_{m-1}(x_i) \Big] exp \Big[- y_i\alpha_mf_m(x_i)\Big]$$ And we then proved that the first exponential could be defined as $w_i^{(m)}$. Once we sub that in, we arrive at our cost: $$J = \sum_{i=1}^N w_i^{(m)}exp \Big[- y_i\alpha_mf_m(x_i)\Big]$$

Let's now return to the cost function.


3.4.4 Realize $y*f$ will always be +1 or -1

We can then realize that $y_i$ times $f_m(x_i)$ is always going to be either $+1$ or $-1$: since both are in $\{-1, +1\}$, the product is $+1$ when the prediction is right and $-1$ when it is wrong. So, we can split the sum into two terms, the correct terms and the incorrect terms, pulling the constant $e^{-\alpha_m}$ or $e^{\alpha_m}$ out of each:

$$J = e^{-\alpha_m} \sum _{y_i = f_m(x_i)} w_i^{(m)} + e^{\alpha_m} \sum _{y_i \neq f_m(x_i)} w_i^{(m)}$$


3.4.5 Rename and Simplify

We can simplify this a little bit by renaming things. We will call the first sum, the weighted corrects, $A$, and we will call the second sum, the weighted incorrects, $B$. We will also just drop the subscript $m$ since it is just getting in our way.

$$J = e^{-\alpha}A + e^\alpha B$$

This has taken us to our new cost function!


3.4.6 Take derivative, set to 0, solve for alpha

We now want to take the derivative with respect to $\alpha$, set it to 0, and solve for $\alpha$. That will allow us to find the value of $\alpha$ that minimizes our cost, $J$.

$$\frac{\partial J}{\partial \alpha} = -e ^{-\alpha}A + e^\alpha B = 0$$

$$e ^{-\alpha}A = e^\alpha B$$

$$-\alpha + lnA = \alpha + lnB$$

$$2\alpha = lnA - lnB $$

$$\alpha_m = \frac{1}{2} ln\Big(\frac{A}{B}\Big) = \frac{1}{2} ln\Big(\frac{number \;weighted \; correct}{number \;weighted\;incorrect}\Big)$$

We can see that this is equal to our update equation for $\alpha$ in the AdaBoost algorithm! Note that the number correct and the number incorrect are represented as error rates in the original algorithm, but since we would be dividing by the same normalizer on both top and bottom, it cancels out.
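As a sanity check with made-up numbers: suppose the weighted corrects are $A = 0.9$ and the weighted incorrects are $B = 0.1$. The closed form gives $\alpha = \frac{1}{2}ln(9) \approx 1.099$, and a brute-force search over $J = e^{-\alpha}A + e^{\alpha}B$ agrees:

import numpy as np

A, B = 0.9, 0.1
alpha_star = 0.5 * np.log(A / B)               # closed form, ~1.099
alphas = np.linspace(0, 2, 2001)
J = np.exp(-alphas) * A + np.exp(alphas) * B   # cost as a function of alpha
print(alpha_star, alphas[np.argmin(J)])        # both ~1.099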



3.5 Summary

Let's try and summarize what we just did, since that was rather long.

  • In the previous section, we looked at a general algorithm for forward stagewise additive modeling, and we claimed that AdaBoost was an instance of that algorithm.
  • The general additive algorithm does not specify the loss function or any other details.
  • We looked at the exponential loss function as a reasonable loss function for adaboost, where the targets are -1 and +1.
  • We then plugged that into the additive modeling equation for finding the next $\alpha$
  • By manipulating the loss function, we recovered our update equation for $w_i$ - the sample weights for telling us how important each sample is for each base model we add
  • We saw that the recursive definition for $w_i$ exactly fits in this framework
  • We then took the derivative of the cost $J$ with respect to $\alpha$ and set it to 0, recovering our update equation for $\alpha_m$. This is very intellectually satisfying, since it places AdaBoost in the same framework as many of the other models we have worked with, especially the perceptron, logistic regression, and neural networks.


4. AdaBoost Implementation

We are now going to implement AdaBoost in code. Note that unlike random forest, AdaBoost doesn't require us to reimplement parts of the decision tree. This is because the scikit-learn API allows us to pass sample weights into the fit function.

We can start with our usual imports.

In [4]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier
from rf_classification import get_data

# Seaborn Plot Styling
sns.set(style="white", palette="husl")
sns.set_context("poster")
sns.set_style("ticks")

We now want to define our AdaBoost Class.

In [10]:
class AdaBoost: 
  def __init__(self, M):         
    self.M = M                      # Number of Base Learners
    
  """Fit Function"""
  def fit(self, X, Y):
    self.models = []                # Instance variable to store decision stumps
    self.alphas = []                # Instance variable to store the corresponding weight alpha
    
    N, _ = X.shape                  # Get number of samples 
    W = np.ones(N) / N              # Initialize W to be uniform distribution, 1/N
    
    """ ----------- Loop through total number of base learners ----------- """
    for m in range(self.M):
      tree = DecisionTreeClassifier(max_depth=1)     # Create new decision stump instance 
      tree.fit(X, Y, sample_weight=W)           # Fit tree to data, pass in sample weights 
      P = tree.predict(X)              # Get Predictions, so we can calculate weighted errors 
      
      # Note that we do not need to divide by w here, because we already normalized it 
      # in the w update, thus w is always a probability distribution
      # Recall that in numpy we like to vectorize as many operations as possible. So, w is an
      # N length vector, and P != Y is also an N length vector. Therefore, the sum over the 
      # element wise multiplication of those two things is also the dot product. 
      err = W.dot(P != Y)
      
      """ ----------- Calculate updates for alpha and w ----------- """
      # Alpha calculation differs from that of the pseudocode, but by the rules of logarithms it is
      # also correct. If we can turn multiplication and division into addition and subtraction
      # we generally do because it is faster and more numerically stable 
      alpha = 0.5 * (np.log(1 - err) - np.log(err))
      W = W * np.exp(-alpha * Y * P)                  # vectorized update
      W = W / W.sum()                                 # Normalize so it sums to 1
    
      """ ----------- Add decision stump (model) and alpha to list ----------- """
      self.models.append(tree)
      self.alphas.append(alpha)
      
  def predict(self, X):
    # Notice that this does not look like the scikit learn API, since we want to return
    # 2 things. The reason is that later we will want to compare plots of both exponential
    # loss and accuracy simultaneously. To calculate the accuracy, we need the prediction, 
    # which is the sign of F(X). But, to calculate the loss, we need F(X) itself before
    # we take the sign, so we just return both. Also, note that there is no way to vectorize
    # the prediction, since we need to call predict for every stump object, meaning they 
    # will be separate function calls.
    N, _ = X.shape
    FX = np.zeros(N)
    for alpha, tree in zip(self.alphas, self.models):
      FX += alpha * tree.predict(X)                # update F(X) with each weighted prediction
    return np.sign(FX), FX                         # Return prediction and actual F(X)
    
  def score(self, X, Y):
    # Similar to Scikit Learn API, but will also be returning loss in addition to accuracy
    # Notice the loss is normalized by the number of samples, in other words we take the mean
    # instead of the sum. This ensures that the loss and the accuracy will be on the same 
    # scale, somewhere between 0 and 1
    P, FX = self.predict(X)              # Get predictions and actual F(X)
    L = np.exp(-Y*FX).mean()             # Calculate final exponential loss (normalized)
    return np.mean(P == Y), L            
    
    
if __name__ == '__main__':
  X, Y = get_data()              
  Y[Y == 0] = -1                         # Make targets -1, +1, since adaboost requires that
  Ntrain = int(0.8*len(X))
  Xtrain, Ytrain = X[:Ntrain], Y[:Ntrain]
  Xtest, Ytest = X[Ntrain:], Y[Ntrain:]
  
  # Main loop, we will test 1 up to 200 trees, which is more than enough to see convergence 
  T = 200
  train_errors = np.empty(T)
  test_losses = np.empty(T)
  test_errors = np.empty(T)
  for num_trees in range(T):
    if num_trees == 0:
      train_errors[num_trees] = None
      test_errors[num_trees] = None
      test_losses[num_trees] = None
      continue
    if num_trees % 20 == 0:
      print(num_trees)
  
    model = AdaBoost(num_trees)           # Create AdaBoost instance with num_trees instances
    model.fit(Xtrain, Ytrain)             # Fit to train data
    acc, loss = model.score(Xtest, Ytest)
    acc_train, _ = model.score(Xtrain, Ytrain)
    train_errors[num_trees] = 1 - acc_train
    test_errors[num_trees] = 1 - acc
    test_losses[num_trees] = loss
    
    if num_trees == T - 1:
      print("final train error:", 1 - acc_train)
      print("final test error:", 1 - acc)

  fig, ax = plt.subplots(figsize=(12,8))
  plt.plot(test_errors, label='test errors')
  plt.plot(test_losses, label='test losses')
  plt.legend()
  plt.show()

  fig, ax = plt.subplots(figsize=(12,8))
  plt.plot(train_errors, label='train errors')
  plt.plot(test_errors, label='test errors')
  plt.legend()
  plt.show()
dimensionality: 139
20
40
60
80
100
120
140
160
180
final train error: 0.0
final test error: 0.0

Awesome! We can see that as we increase the number of models we do not overfit. The test error drops to 0 and stays there!



5. AdaBoost vs. Stacking

We are now going to take a minute to compare AdaBoost vs. stacking. We will see that AdaBoost is a big improvement on the old stacking idea. Remember, both of these methods use a weighted ensemble of base models:

$$F(x) = \sum_{m=1}^M \alpha_m f_m(x)$$

The next step in our stacking setup was to set up a cost function to minimize. We used the squared error in our earlier example, but that need not be the case. We can choose any cost function for this, and the idea remains the same; it was only because we chose squared error the first time that we could arrive at the quadratic programming solution. If we don't use squared error, we can still attempt to minimize the loss, we just need to use a different method, perhaps gradient descent. Now, we can just call the loss $L$ (being agnostic toward the loss function). So let's compare these two algorithms side by side:

$$\textbf{AdaBoost} \rightarrow J = \frac{1}{N} \sum_{i=1}^N L \Big[y_i, F_M(x_i)\Big] = \frac{1}{N} \sum_{i=1}^N L \Big[y_i, \sum_{m=1}^M \alpha_m f_m(x_i|W^{(m)})\Big]$$

$$\textbf{Stacking} \rightarrow J = \frac{1}{N} \sum_{i=1}^N L \Big[y_i, F_M(x_i)\Big] = \frac{1}{N} \sum_{i=1}^N L \Big[y_i, \sum_{m=1}^M \alpha_m f_m^{-i}(x_i)\Big]$$

We can see above that in AdaBoost we just have some base classifier $f_m$ that is trained given some sample weights $W^{(m)}$. In stacking, the base classifier is like a leave-one-out cross-validated classifier. Recall, the $-i$ means that this $f_m$ is trained on every sample except the $i$th sample.


General Stacking Method
So, the cost functions were not extremely different, but they were a little different. Next, let's talk about how we find the alphas, which are the model weights. We can think about a general method to train the stacking ensemble. The main thing is that we need to construct this cost function, and then optimize it with respect to alpha. To do that, we need to find all of the $f_m^{-i}$. In other words, we need to know how to construct the whole cost function before we even begin optimizing the alphas: we must know all of the $f_m^{-i}$ values for $i=1..N$ and $m=1..M$ before we can optimize $L$! Because $f_m^{-i}$ is indexed by both $m$ and $i$, those values will be stored in a matrix of size (M x N). Therefore, the number of models we need to train is O(NM). That is quadratic complexity! And since model training isn't fast, it is not a scalable solution.
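To see where the O(NM) comes from, here is a hedged sketch of building that (M x N) matrix of leave-one-out predictions; the helper and the way the M base models are varied (by tree depth) are hypothetical, purely to expose the nested loop:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def loo_prediction_matrix(X, Y, M):
  N = len(Y)
  F = np.zeros((M, N))
  for m in range(M):
    for i in range(N):
      mask = np.arange(N) != i                     # leave sample i out
      f = DecisionTreeClassifier(max_depth=m + 1)  # hypothetical m-th base model
      f.fit(X[mask], Y[mask])
      F[m, i] = f.predict(X[i:i + 1])[0]           # one fit per (m, i): N*M fits
  return F                                         # then optimize the alphas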


Complexity
You can see that we are stepping away from machine learning for a minute in order to think in terms of algorithms and complexity. In terms of algorithms, what does AdaBoost do? It is actually a greedy algorithm! It doesn't try to optimize all of the alphas at the same time. You would think that being a greedy algorithm, the solution would be suboptimal, but we have already seen that AdaBoost performs very well. AdaBoost is greedy because it only uses the current weights $W^{(m)}$ to find the weighted error, $err_m$, and to calculate $\alpha_m$. Then, on every subsequent iteration, $\alpha_m$ never changes: once we calculate $\alpha_1$ we never return to recalculate it, and the same is true for all earlier parameters. By building the classifier in an additive, greedy way, we only need to train O(M) models. So, rather than having a problem with quadratic complexity, we have a problem with linear complexity!



6. AdaBoost Connection to Deep Learning

We are now going to make a connection between AdaBoost and deep learning! Recall that we do not need to use a tree as a base learner; we can also use logistic regression (with output scaled to -1 and +1). If we write out the entire equation for the AdaBoost output in terms of the individual base logistic regressions, we see a form that is very similar to that of a neural network!

$$\textbf{AdaBoost} \rightarrow \hat{y} = sign \Big(\sum_{m=1}^M \alpha_m sign(w_m^Tx)\Big)$$

$$\textbf{Neural Net} \rightarrow \hat{y} = sign \Big(\sum_{m=1}^M \alpha_m tanh(w_m^Tx)\Big)$$

The main difference is that AdaBoost uses base classifiers with hard outputs, i.e. actual predictions equal to -1 or +1, rather than soft outputs where you get a number between -1 and +1. So the structure of these is the same, even though the training algorithms are different.
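The structural similarity is easy to see in code; this toy sketch uses random, untrained weights purely to show the shared shape of the computation:

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5)             # one input vector
W = rng.standard_normal((3, 5))        # M=3 base learner weight vectors w_m
alpha = np.array([0.5, 1.2, 0.8])      # model weights alpha_m

adaboost_yhat = np.sign(alpha @ np.sign(W @ x))   # hard hidden units
neuralnet_yhat = np.sign(alpha @ np.tanh(W @ x))  # soft hidden units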

So an interesting conclusion we can draw from this is that AdaBoost (with a logistic regression base learner) has the same structure as a 1-hidden-layer neural network that can dynamically expand the size of its hidden layer. This is about as far as the analogy goes, since training AdaBoost is greedy: you only get to set the current alpha given all the previous alphas and $w$'s. With neural networks, you update all of the weights at the same time, so it is more of a global optimization.


© 2018 Nathaniel Dake