We started off this series talking about the bias-variance trade-off. We showed that the error of any classification or regression model is a combination of bias, variance, and irreducible error. We then demonstrated that while the irreducible error can't be reduced, bias and variance can! In the ideal situation both bias and variance are low, but the dilemma we saw was that as we decrease one, the other tends to increase. So the goal is to find a happy medium that optimizes the test error. We learned that ensemble methods are a way to make the trade-off less of a trade-off (i.e., attain both low bias and low variance)!
We began our discussion of ensemble methods with the bootstrap technique. We showed that by using the bootstrap we can not only estimate confidence intervals, but also reduce the variance of whatever we are trying to estimate, simply by estimating it over and over again on resampled data sets. The key equation below showed us that the variance of the bootstrap estimate is a function of the original variance, the correlation between the bootstrap samples, and the number of bootstrap samples.
$$\text{var}(\bar{\theta}_B) = \frac{1 - \rho}{B}\sigma^2 + \rho \sigma^2$$This equation showed us that when the correlation between the bootstrap samples is 1, we get no reduction in variance at all. But when the correlation between the bootstrap samples is 0, we get the maximum reduction in variance: a full $\frac{1}{B}$ decrease.
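To make those two limits concrete, here is a tiny numeric check of the formula above; the values of $\sigma^2$, $\rho$, and $B$ are purely illustrative, not estimates from any real data.

```python
# Quick numeric check of the variance formula above; sigma2, rho, and B
# are illustrative values chosen just to show the two limiting cases.

def bagged_variance(sigma2, rho, B):
    """Variance of the average of B estimates that each have variance sigma2
    and pairwise correlation rho: (1 - rho)/B * sigma2 + rho * sigma2."""
    return (1 - rho) / B * sigma2 + rho * sigma2

sigma2, B = 1.0, 25
for rho in (0.0, 0.5, 1.0):
    print(f"rho = {rho:.1f} -> var = {bagged_variance(sigma2, rho, B):.3f}")

# rho = 0.0 -> var = 0.040   (the full 1/B reduction)
# rho = 0.5 -> var = 0.520   (only partial reduction)
# rho = 1.0 -> var = 1.000   (no reduction at all)
```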
The bootstrap technique motivated the idea of bagging, where we create an ensemble of models by training each one on a bootstrapped sample of the training data. We reasoned that trees, which have very low bias and very high variance, were a natural choice of base learner, and that training them on different bootstrap samples gives the desired decorrelation effect. We then combined the trees to lower the variance.
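As a refresher, here is a minimal bagging sketch using scikit-learn's `BaggingClassifier`. The synthetic dataset and hyperparameters are illustrative assumptions rather than the data used in this series, and the `estimator` keyword assumes scikit-learn ≥ 1.2.

```python
# A minimal bagging sketch: fully grown trees (low bias, high variance),
# each fit on a bootstrap sample of the training data, then combined.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
bagged = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=0),
    n_estimators=100,
    bootstrap=True,        # each tree sees a bootstrap sample
    random_state=0,
).fit(X_train, y_train)

print("single tree :", single_tree.score(X_test, y_test))
print("bagged trees:", bagged.score(X_test, y_test))
```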
Next we looked at the random forest, which further decorrelated the ensemble of trees by sampling the features at each split as well.
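A random forest sketch looks almost identical to the bagging one; again the synthetic dataset is just an illustrative assumption, with `max_features="sqrt"` doing the extra feature sub-sampling.

```python
# Random forest: bagging plus per-split feature sub-sampling (max_features),
# which decorrelates the trees further than bagging alone.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0)
print("CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())
```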
Next we looked at AdaBoost, which, unlike bagged trees and random forests, did not aim to use low-bias, high-variance models. Instead, the idea behind boosting was that many individual weak learners could be combined into a strong learner if weighted properly. We demonstrated that this is true by showing that AdaBoost achieved the best error rate on our data.
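Here is a hedged AdaBoost sketch in the same spirit, again on an illustrative synthetic dataset and assuming scikit-learn ≥ 1.2 for the `estimator` keyword. The base learners are depth-1 stumps, i.e. high-bias, low-variance weak learners.

```python
# AdaBoost: many weak learners (decision stumps) combined, with each new
# stump focusing on the examples the previous ones got wrong.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
stump = DecisionTreeClassifier(max_depth=1)   # a weak learner on its own
ada = AdaBoostClassifier(estimator=stump, n_estimators=200, random_state=0)
print("CV accuracy:", cross_val_score(ada, X, y, cv=5).mean())
```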
An interesting question is how you might go even further than boosting. First, we know that a bagging classifier just outputs the sum of all of its base learners' predictions (assuming we are using -1 and +1 as targets).
$$\textbf{Bagging} \rightarrow F(x) = \sum_{m=1}^Mf_m(x)$$Boosting extends this by adding weights to each base learner. These weights, however, are not dependent on the input. Thus, each weight has the same value no matter what x is.
$$\textbf{Boosting} \rightarrow F(x) = \sum_{m=1}^M \alpha_m f_m(x)$$One idea is to extend this by making each base learner an expert at something different. In other words, the model weight will also depend on x, telling us how good that model is at classifying that particular x. This is known as a mixture of experts.
$$\textbf{Mixture of Experts} \rightarrow F(x) = \sum_{m=1}^M \alpha_m(x)f_m(x)$$
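To see how the three combination rules differ, here is a toy numpy sketch. The "experts" and the softmax gate are stand-ins invented purely for illustration; this is not a real mixture-of-experts training procedure.

```python
# Toy contrast of the three combination rules above. The base learners and
# the gating function are made-up stand-ins, not trained models.
import numpy as np

rng = np.random.default_rng(0)
M, d = 3, 5
X = rng.normal(size=(10, d))                  # a small batch of inputs

# Pretend base learners: each returns a prediction in {-1, +1}.
W = rng.normal(size=(M, d))
preds = np.stack([np.sign(X @ W[m]) for m in range(M)])   # shape (M, n)

# Bagging: unweighted sum of the base learners.
F_bag = preds.sum(axis=0)

# Boosting: fixed weights alpha_m, the same for every x.
alpha = np.array([0.5, 1.0, 0.25])
F_boost = (alpha[:, None] * preds).sum(axis=0)

# Mixture of experts: weights alpha_m(x) depend on x, here via a softmax
# gate, so each expert is trusted more in some regions of input space.
V = rng.normal(size=(M, d))
gates = np.exp(X @ V.T)
gates /= gates.sum(axis=1, keepdims=True)     # shape (n, M), rows sum to 1
F_moe = (gates.T * preds).sum(axis=0)

print(np.sign(F_bag), np.sign(F_boost), np.sign(F_moe), sep="\n")
```

The only real difference between the three is where the weights come from: nowhere (bagging), from the training procedure but fixed across inputs (boosting), or from a gating function of the input itself (mixture of experts).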