3. Bootstrap Estimation¶

We previously looked at the bias-variance tradeoff and if you were thinking critically you may have wondered: "Could it be possible in some way to lower bias and variance simultaneously?"

In this section, we are going to take our first look into model averaging. The key tool that we need to do this is called bootstrapping, aka resampling. The fascinating result of this is that even though we are using the same data, we can get a better result. This should seem odd at first, since if we create a model from a set of samples, how can that be any different than taking the averages of different models trained on different subsets of those same samples again and again-it is the same set of samples after all.

However, model averaging does work, even if it is true that they work on the same data that you would have if you only have 1 model. Before we talk about bootstraping for models, we are going to look at bootstrapping for simple parameter estimates like the mean.

1.1 Bootstrap Estimation - Mean¶

So, how does bootstrap estimation work? We are given a set of data points from $1...N$ $$X = x_1,x_2,...,x_N$$ We then draw a sample, with replacement, from this data set, $B$ times. For each of the $b$ subsample datasets, we calculate the parameter of interest-aka the mean, variance, or any other statistic. Once the loop is done we will have $B$ different estimates of the parameter. We can use this to find the mean of the parameter, and the variance of the parameter. Why do we care about the mean and variance? First, the mean tells us the most likely value of the parameter, in other words the expected value of the parameter. The variance then can tell us how accurate that estimate is! A large variance means not that accurate, and a small variance means more accurate. So, in pseudo code the algorithm could look like this:

X = x1, x2,...xN
for b = 1..B:
    Xb = sample_with_replacement(X)         # size of Xb is N
    sample_mean[b] = sum(Xb)/N
Calculate mean and variance of {sample_mean[1],...,sample_mean[B]}

As an example, lets just say that $X$ has a size N = 5 with the following values: $$X = 1,2,3,4,5$$

Let's say that we decide to make $B = 4$, in order words we are going to have 4 iterations of sampling. The first time we sample (with replacement) we may end up with:

$$X_{b1} = 1,2,5,5,2$$

We have sampled five values from our original $X$. If we perform that 3 more times, we may end up with: $$X_{b2} = 4,3,3,1,5$$ $$X_{b3} = 2,4,1,5,4$$ $$X_{b4} = 3,1,3,2,4$$

Now, let's say that our original goal was to be finding a certain parameter of $X$, in this case the mean. We can then calculate the mean of each of above samples: $$\mu_{B1} = \frac{15}{5} = 3$$ $$\mu_{B2} = \frac{16}{5} = 3.2$$ $$\mu_{B3} = \frac{16}{5} = 3.2$$ $$\mu_{B4} = \frac{13}{5} = 2.6$$

We now have $B$ different estimates of the mean of our samples, which were taken from $X$. What we may want to do is now try and find the mean of these means. In other words, we can try and find the parameters that describe this set of sampled data:

$$mean(\mu_{B1},\mu_{B2},\mu_{B3},\mu_{B4}) = \frac{\mu_{B1} + \mu_{B2}+\mu_{B3}+\mu_{B4}}{B}$$

And the variance which can tell us how accurate our estimate is: $$var(\mu_{B1},\mu_{B2},\mu_{B3},\mu_{B4})$$

1.2 Sampling with Replacement¶

In case you have not come across sampling with replacement, let's quickly touch on it now. Suppose we have a dataset with the points 1,2,3,4,5.

$$X = 1,2,3,4,5$$

Suppose we then draw a sample and get 5. Sampling with replacements means that if we draw another sample, we can get 5 again. In fact, we could draw a sample with all 5s!

$$sample = 5,5,5,5,5$$

This is because we replace the sample after we take it from the dataset. This is the opposite of sampling without replacement. If we were to sample without replacement and we drew a number of samples equal to the dataset size, we would just draw the dataset itself. Hence, sampling with replacement is important to this process.

1.3 Why does bootstrapping work?¶

As you can see, bootstrapping is a very simple algorithm- you are just computing the parameter estimate multiple times from the same dataset. So, why does it work? Lets look at the results first and then we can derive them. Remember, we are interested in the mean and variance.

Mean
The mean of the bootstrap estimate is equal to the parameter itself:

$$E(\bar{\theta_B}) = mean(\bar{\theta_B}) = \theta$$

And, as an example, if our parameter had been the mean, then what we find is that mean of our bootstrap estimate of $\mu$, is equal to the actual value of $\mu$. In other words, the mean of our bootstrap sampled means, is the actual mean of the original data.

$$E(\bar{\mu_B}) = \mu$$

Or in the case of our example earlier: $$E(\frac{\mu_{B1}+\mu_{B2}+\mu_{B3}+\mu_{B4}}{B}) = \mu_X$$

Variance
The variance is a bit more complicated. Let's suppose the correlation coefficient between two different estimates of the parameter, $\hat{\theta}_i, \hat{\theta}_j$ is $\rho$, and the variance of each $\hat{\theta}$ is $\sigma^2$:

$$\rho = corr(\hat{\theta}_i, \hat{\theta}_j), var(\hat{\theta}) = \sigma^2$$

Then, it can be derived that the variance of the bootstrap estimate is:

$$var(\bar{\theta}_B) = \frac{1 - \rho}{B}\sigma^2 + \rho \sigma^2$$

Notice that if each bootstrap estimate is completely uncorrelated from the others, the variance would be the original variance divided by $B$. This means that for every bootstrap sample we take, we reduce the variance of our estimate. That is remarkable! Unfortunately, there will probably be correlation.

1.4 Confidence Interval¶

One application of bootstrap estimation, is that we can also estimate the confidence interval of our estimate. We assume a gaussian approximation, so let's say we want a 95% confidence interval. That means that we want the lower and upper bound of $\theta$ that covers 95% of the area under the probability distribution. This is approximately equal to the sample mean of the bootstrap $\theta$, plus or minus 1.96 times the standard deviation of the bootstrap $\theta$:

$$95\% CI \approx \bar{\theta}_B \;\pm\; 1.96 std(\hat{\theta}_B)$$

1.5 Derivation of Mean and Variance¶

Now that we know the main results of bootstrap estimation, how do we show that they are true?

1.5.1 Mean Derivation¶

Let's start with the mean. We want to be able to show that the mean value of our bootstrap estimated parameter is the value of the parameter itself. In other words, think back to our simple example at the start of lecture. Our data set was $X$, and the parameter we were looking at was $\mu$, the mean of $X$. We want to be able to prove that after performing our bootstrap sampling, and that expected value of the mean of our samples (think $\mu_{B1}$, $\mu_{b2}$, and so on) are equal to actual mean of $X$, since this was the parameter we were originally trying to estimate!

So, we know that expected value (based on its definiton: expected value of a random variable, intuitively, is the long-run average value of repetitions of the experiment it represents), is equivalent to the mean. Let's define the following:

$\bar{\theta}_B$ = sample mean of resampled sample means

$\hat{\theta}_i$ = sample mean of bootstrap sample $i$

$\theta$ = original parameter we're trying to estimate

Okay, now we can start with looking at the expected value of our resampled sample means:

$$E(\bar{\theta}_B)$$

We can expand $\bar{\theta}_B$ based on its definition:

$$E(\bar{\theta}_B) = E \Big[ \frac{1}{B} \sum_{i=1}^B \hat{\theta}_i \Big] = E\Big[\frac{1}{B}(\hat{\theta}_1 + ...+\hat{\theta}_B)\Big]$$

Because $\frac{1}{B}$ is a constant, and the expected value of constant is just itself, we can pull it out:

$$E(\bar{\theta}_B) = \frac{1}{B}*E\Big[(\hat{\theta}_1 + ...+\hat{\theta}_B)\Big]$$

And then we know the expected value of any $\hat{\theta}_i$ is going to be $\theta$, the actual parameter. We also know that there are $B$ total $\hat{\theta}$s, so we can pull that out and end up with the final equation:

$$E(\bar{\theta}_B) = \frac{1}{B}BE(\hat{\theta}) = \theta$$

We can see that the expected value of the bootstrap estimate of the parameter, is equal to the parameter, which is exactly what we were looking for.

1.5.2 Variance Derivation¶

1.5.2.1 Variance Derivation - Definitions¶

Next, let's look at the variance. We can start with some definitions. Let's suppose that the expected value of $\hat{\theta}$ (aka the expected value of the parameter that we calculate after resampling) is equal to $\mu$.

$$E(\hat{\theta}) = \mu$$

This is not necessarily equal to the original mean of data $X$. It is the mean of whatever parameter we are trying to estimate. For instance, say that the parameter we are trying to estimate is the mean (as in our simple example from earlier). We are stating that the expected value of any of the means we have sampled ($\mu_{B1},\mu_{B2}$, etc) is just equal to $\mu$. So in this case $\mu_{B1}$ and so on would be represented as $\hat{\theta}$ (the parameter we are trying to find, and $\mu$ is the mean/expected value of that parameter.

Let's also define the variance of $\hat{\theta}$ to be $\sigma^2$:

$$var(\hat{\theta}) = E \Big[(\hat{\theta} - \mu)^2\Big] = \sigma^2$$

We can next define the correlation between two different $\hat{\theta}$s to be $\rho$: $$\rho = \frac{E \Big[(\hat{\theta}_i - \mu)(\hat{\theta}_j - \mu) \Big]}{\sigma^2}$$

Note that correlation is scaled by standard deviation, so it always $[-1, 1]$. We then define the sum of all $\hat{\theta}$ to be $S_B$: $$S_B = \sum_{i=1}^B \hat{\theta}_i$$

And therefore the sample mean of the bootstrap estimates is:

$$\bar{\theta}_B = \frac{1}{B}S_B$$

1.5.2.2 Variance Derivation - Write out definition¶

Let's start by writing out the definiton for variance of $\bar{\theta}_B$:

$$var(\bar{\theta}_B) = E \Big[(\bar{\theta}_B - \mu)^2\Big]= E \Big[(\frac{1}{B}S_B - \mu)^2\Big]$$

Then we can perform several algebraic operations to the right side of the equation: $$E \Big[(\frac{1}{B}S_B - \frac{1}{B} B\mu)^2\Big]$$ $$E \Big[\Big((\frac{1}{B})(S_B - B\mu)\Big)^2\Big]$$ $$E \Big[\frac{1}{B^2}(S_B - \mu)^2\Big]$$ $$\frac{1}{B^2}E \Big[(S_B - \mu)^2\Big]$$ $$\frac{1}{B^2}E \Big[S_B^2 - 2\mu BS_B + \mu^2 B^2\Big]$$

Now, if we look specifically at the term $- 2\mu BS_B$, we can see that 2, $\mu$, and $B$ are constant, so we can pull them outside of the expected value:

$$E\Big[-2 \mu B S_B\Big] = -2 \mu B *E\Big[S_B\Big] $$

We know that the expected value of $S_B$, based on its definition, can be rewritten as:

$$-2 \mu B *E\Big[S_B\Big] = -2 \mu B *E\Big[B\hat{\theta}\Big] $$

And since $B$ is constant, it can be pulled outside of the expected value, and we have defined the expected value of $\hat{\theta}$ to be $\mu$:

$$-2 \mu B^2 *E\Big[\hat{\theta}\Big] = -2 \mu^2 B^2 $$

Now, back to the equation we branched off from:

$$\frac{1}{B^2}E \Big[S_B^2 - 2\mu BS_B + \mu^2 B^2\Big]$$

We can replace the middle term with that which we just found above, and end up with:

$$var(\bar{\theta}_B) = \frac{1}{B^2}E \Big[S_B^2 - \mu^2 B^2\Big]$$

At this point, we can note that $\mu$ and $B$ are both constant, so they can be pulled out of the expected value:

$$var(\bar{\theta}_B) = \frac{1}{B^2}\Big(E \Big[S_B^2\Big] - \mu^2 B^2\Big)$$

Which if we then multiply the fraction through, we end up with:

$$var(\bar{\theta}_B) = \frac{1}{B^2}E \Big[S_B^2\Big] - \mu^2$$

Our main focus now is to find $E\Big[S_B^2\Big]$.

1.5.2.3 Variance Derivation - Find $E\Big[S_B^2\Big]$¶

We can start by using the definition of $S_B$, which is just the sum of the individual sample $\hat{\theta}$s:

$$E\Big[S_B^2\Big] = E\Big[(\hat{\theta}_1+\hat{\theta}_2+...+\hat{\theta}_B)(\hat{\theta}_1+\hat{\theta}_2+...+\hat{\theta}_B)\Big]$$

The important point to notice here is that we will end up with two types of terms here when we multiply this out. There will be the type where it is $\hat{\theta}_i*\hat{\theta}_i$ (aka the subscript is the same for both $\hat{\theta}$s in the expected value, it will look like $(\hat{\theta}_1*\hat{\theta}_1 +\hat{\theta}_2*\hat{\theta}_2)$ and so on). There will be $B$ of these terms.

The other type of term that we will get is $\hat{\theta}_i*\hat{\theta}_j$, where $i \neq j$. Since there will be $B^2$ terms in total, and there will be $B$ where $i = j$, then there will be $B(B-1)$ terms where $i \neq j$. Both of these terms will be non zero.

$$E\Big[S_B^2\Big] = BE\Big[\hat{\theta}_i^2\Big] + B(B-1)E_{i \neq j}\Big[\hat{\theta}_i\hat{\theta}_j\Big]$$

We can find these two expected values using our previous definitions!

1.5.2.3 Variance Derivation - Rearange equation for Variance and Correlation¶

If we rearange our equation for the variance and correlation of $\hat{\theta}$, we can find these two expected values. We can start with the variance and begin looking for $E\Big[\hat{\theta}_i^2\Big]$:

$$var(\hat{\theta}_i) = \sigma^2 = E\Big[ (\hat{\theta}_i - \mu)^2 \Big]$$

And we can expand out the inside:

$$E\Big[ (\hat{\theta}_i - \mu)^2 \Big] = E\Big[ (\hat{\theta}_i^2 - 2\mu \hat{\theta}_i+ \mu^2) \Big]$$

Take the expected value of each term:

$$E\Big[\hat{\theta}_i^2\Big] - E\Big[ 2\mu \hat{\theta}_i \Big] + \mu^2 $$

Focusing on the middle term (as we did earlier), we can see that the 2 and $\mu$ just have an expected value of themselves, so they can be pulled out:

$$E\Big[\hat{\theta}_i^2\Big] - 2\mu E\Big[\hat{\theta}_i \Big] + \mu^2 $$

And the expected value of $\hat{\theta}_i$ is just $\mu$ (by definition!). So we end up with:

$$\sigma^2 = E\Big[\hat{\theta}_i^2\Big] - \mu^2 $$

$$E\Big[\hat{\theta}_i^2\Big] = \sigma^2 + \mu^2 $$

Great! Now we can look at $\rho$ and begin solving for $E_{i \neq j}\Big[\hat{\theta}_i\hat{\theta}_j\Big]$. Let's start with the definition of $\rho$:

$$\rho = \frac{E \Big[(\hat{\theta}_i - \mu)(\hat{\theta}_j - \mu) \Big]}{\sigma^2}$$

We can expand the top:

$$\rho = \frac{E \Big[(\hat{\theta}_i\hat{\theta}_j - \mu \hat{\theta}_i - \mu \hat{\theta}_j +\mu^2) \Big]}{\sigma^2}$$

And again if we look at the middle terms, we can simplify them by taking out the $\mu$, and knowing the the expected value of $\hat{\theta}$ is just $\mu$. This allows us to simplify our equation to:

$$\rho = \frac{E \Big[\hat{\theta}_i\hat{\theta}_j\Big] - \mu^2}{\sigma^2}$$

And we can rearange that to find:

$$E \Big[\hat{\theta}_i\hat{\theta}_j\Big] = \rho\sigma^2 + \mu^2$$

1.5.2.4 Variance Derivation - Plug in values to find $E\Big[S_B^2\Big]$¶

Now if we go back to the equation we branched off from:

$$E\Big[S_B^2\Big] = BE\Big[\hat{\theta}_i^2\Big] + B(B-1)E_{i \neq j}\Big[\hat{\theta}_i\hat{\theta}_j\Big]$$

We can being solving for $E\Big[S_B^2\Big]$. So let's plug in our recently determined expected values:

$$ E\Big[S_B^2\Big] = \sigma^2 + \mu^2 + B(B-1)(\rho\sigma^2 + \mu^2)$$

And then we can simplify to:

$$ E\Big[S_B^2\Big] = B \sigma^2 + B(B-1)\rho\sigma^2 + \mu^2B^2$$

Great, now we have our value for $E\Big[S_B^2\Big]$! Let's now return to the equation for variance that we had branched off from:

$$var(\bar{\theta}_B) = \frac{1}{B^2}E \Big[S_B^2\Big] - \mu^2$$

And plug in the expected value of $S_B$ we just solved for:

$$var(\bar{\theta}_B) = \frac{1}{B^2} \Big[B \sigma^2 + B(B-1)\rho\sigma^2 + \mu^2B^2 \Big] - \mu^2$$

And finally let's simplify the above equation:

$$ \Big[\frac{1}{B} \sigma^2 + \frac{(B-1)}{B}\rho\sigma^2 + \mu^2\Big] - \mu^2$$$$ \frac{\sigma^2}{B} + \frac{(B-1)}{B}\rho\sigma^2 $$$$ \frac{\sigma^2}{B} + \frac{B\rho\sigma^2}{B} - \frac{\rho\sigma^2}{B} $$$$ \frac{\sigma^2- \rho \sigma^2}{B} + \rho \sigma^2$$$$var(\bar{\theta}_B) = \frac{1- \rho}{B}\sigma^2 + \rho \sigma^2$$

We have arrived at the final solution for variance of $\bar{\theta}_B$!

1.5.3 Variance Analysis¶

One interesting question to ask now that we have arrived at the solution for variance, is what will the variance of the bootstrap estimate be, if the correlation is maximum (recalling that the maximum correlation coefficient, $\rho$, is 1, meaning that the 2 variables are perfectly correlated with eachother)? Well, if $\rho$ is 1 then we see that the $\frac{1}{B}$ term goes away and we are just left with $\sigma^2$. This makes sense, because if each individual estimate is correlated with eachother, then the variance of the bootstrap estimate will not go down at all. However, if $\rho$ is 0, then we can get the best possible decrease in variance, which is $\frac{1}{B}$.

It is import to note that using the bootstrap estimate of the mean, or using bootstrap for any linear model, does not greatly improve the variance of the model. For linear statistics, for which the sample mean is an example:

$$\rho = \frac{N}{2N - 1} \approx 0.5$$

The biggest advantage of bootstrapping occurs when you use highly nonlinear models like decision trees. On different data sets they will produce highly irregular decision boundaries that will not correlate with each other that much, so $\rho$ will be small. In other words, we will reduce the variance more by combing non linear models that are less likely to be correlated!

2. Bootstrap Demo in Code¶

We are now going to demonstrate bootstrapping in order to estimate the confidence interval of the sample mean, in order to show that it is approximately equal to the traditional method of estimating the confidence interval.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm, t

# Seaborn Plot Styling
sns.set(style="white", palette="husl")
sns.set_context("poster")
sns.set_style("ticks")

B = 200                        # B is the number of times we are going to sample 
N = 20                         # N = number of data points
X = np.random.randn(N)         # X is a standard normal distribution with N points

Note that the numpy.random.randn function creates a univariate/normal distribution, hence we are looking for a $\mu$ of 0, and a variance, $\sigma^2$ of 1.

This can be seen here in the docs: https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.randn.html

print("Sample mean of X: ", X.mean())       # This is the regular sample mean 

individual_estimates = np.empty(B)          # Create array to hold estimates

# Sample from X (with replacement) B times
for b in range(B):
    sample = np.random.choice(X, size=N)    # draw a random sample from X of size N = 20
    individual_estimates[b] = sample.mean()   # save the sample mean
    

# Calculate bootstrap estimate of the mean & STD
bmean = individual_estimates.mean()                # Mean of all individual estimates
bstd = individual_estimates.std()                  # Standard Deviation of all individual est

Sample mean of X:  0.1915196036658688

Now we can calculate the confidence interval! Remember, the confidence interval is used in finding the range of values that are most likely to contain the true mean, $\mu$. It can visualized as:

"""Find lower and upper bound of 95% confidence interval for estimate of mean of X"""
lower = bmean + norm.ppf(0.025) * bstd        # norm.ppf(0.025) == -1.96
upper = bmean + norm.ppf(0.975) * bstd        # norm.ppf(0.975) == +1.96

print("Bootstrap mean of X: ", bmean)

Bootstrap mean of X:  0.21382653327368076

And let's find the confidence interval the traditional way:

# Traditional way of calculating CI
lower2 = X.mean() + norm.ppf(0.025) * X.std() / np.sqrt(N)
upper2 = X.mean() + norm.ppf(0.975) * X.std() / np.sqrt(N)

Now we are going to want to plot this.

fig, ax = plt.subplots(figsize=(12,8))
plt.hist(individual_estimates, bins=20, ec="black")
plt.axvline(x=lower, linestyle='--', color='g', label="lower bound for 95% CI (bootstrap)")
plt.axvline(x=upper, linestyle='--', color='g', label="upper bound for 95% CI (bootstrap)")
plt.axvline(x=lower2, linestyle='--', color='b', label="lower bound for 95% CI")
plt.axvline(x=upper2, linestyle='--', color='b', label="upper bound for 95% CI")
plt.legend(fontsize=20, bbox_to_anchor=(1, 1), loc=2)
plt.show()

3. Bagging¶

We are now going to look at the application of bootstrapping to model averaging. This is called bagging, which is short for bootstrap aggregating. In particular, we will use the bootstrap resampling method to create to create multiple different models, and then average them to create a final model, which may or may not have reduced variance depending on which model you use.

The algorithm looks pretty much the same as bootstrapping, except that now instead of calculating the sample mean or some other statistic ($\hat{\theta}$), we are training the model instead. Here is how the algorithm works in pseudocode:

3.1 Training¶

# Training
models = []
for b=1..B:                  # We loop up to B
    model = Model()          # Create a model each time
    Xb, Yb = resample(X)     # Get the bootstrap sample
    model.fit(Xb, Yb)        # Train the model
    models.append(model)     # Append model to list of models

3.2 Prediction¶

Prediction is then done by averaging each model in the case of regression, or voting the case of classification. So for regression the algorithm would look like:

# Regression
def predict(X): 
    return np.mean([model.predict(X) for model in models], axis=1)

Now classification is a bit more complicated because we need to collect the votes. In the case where each model returns a class probability this is not needed since we can just take the average like we did for regression. Here is an example of how to do prediction for one sample at a time. This a pretty naive implementation that can probably be improved. Note that we don't want to both sorting the votes dictionary, since sorting is O(nlogn) and finding the max is just O(n):

# Classification
def predict_one(X):
    votes = {}
    for model in models:      # Loop through all models
        k = model.predict(X)  # Find prediction class for each 
        votes[k]++            # Increment that class value in the votes dict
    argmax = 0, max = -inf    # We don't sort that hash, since that is O(nlogn)
    for k, v in votes.iteritems():    # Iterate through votes dictionary
        if v > max:         # If value is greater than current max, set as max
            argmax = k; max = v             
    return k                  # Return class k, the class with the most votes

Another option is to create an (N x K) matrix and accumulate the predictions. Since numpy allows us to index an array by providing arrays for each index, we can use those to accumulate the votes for each corresponding sample and class pair. Since everything is inside a 2d array by the end, we can just take the argmax over one axis.

# Classification 
def predict(X):
    output = np.zeros((N, K))         # Create N x K matrix to hold predictions
    for model in models:                     # Loop through all models

        # Here we use np.arange(N) as our row index (size N), 
        # and we then use model.predict(X) As our column 
        # index (size N). For each example we are predicting 1..N, 
        # we increment the value of the class that model.predict(X)
        # returned as an output output[np.arange(N), model.predict(X)] += 1 

    return output.argmax(axis=1)              # Return the argmax over 1 axis

We can do a simpler version of that if we are doing binary classification. Suppose we have B models in total- the number of votes will always be between 0 and B, so if we just add up all of the votes and then divide by B, we will get a number between 0 and 1 which we can just round.

# Classification
def predict(X):
    output = np.zeros(N)
    for model in models:
        output += models.predict(X)
    return np.round(output / B)

4. Bagging Regression Trees¶

In this section we are going to implement bagging for regression. We can start with our imports.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeRegressor
from sklearn.utils import shuffle

# Seaborn Plot Styling
sns.set(style="white", palette="husl")
sns.set_context("poster")
sns.set_style("ticks")

Our first step is to create the data.

"""Create the data"""
T = 100                                         # 100 points to represent the x axis
x_axis = np.linspace(0, 2*np.pi, T)             # [0, 2pi] with T points
y_axis = np.sin(x_axis)                         # sin(x) is ground truth function

Now we need to actually get the training data.

"""Get the training data"""
# Training data will be 30 points
N = 30                                          

# Selecting T random points from 1..T without replacement
idx = np.random.choice(T, size=N, replace=False) 

# Selecting training data from x_axis, must then reshape to N x D (D is 1 in this example)
Xtrain = x_axis[idx].reshape(N, 1)               
Ytrain = y_axis[idx]

Now what we want to do is try a lone decision tree, so that we can compare the ensemble to the first result.

"""Create a lone decision tree"""
model = DecisionTreeRegressor()
model.fit(Xtrain, Ytrain)
prediction = model.predict(x_axis.reshape(T, 1))         # need to reshape for sklearn api

print("Score for 1 tree:", model.score(x_axis.reshape(T, 1), y_axis))    # Print R^2 score

Score for 1 tree: 0.9913636587270763

And let's plot the predictions our decision tree made vs. the actual curve.

"""Plot the lone decision tree's predictions"""
fig, ax = plt.subplots(figsize=(12,8))
plt.plot(x_axis, prediction)
plt.plot(x_axis, y_axis)
plt.show()

Awesome, now that we have created everything for our lone decision tree, we can move on to our bagged regressor!

class BaggedTreeRegressor: 
    def __init__(self, B):                     # Init function
        self.B = B
        
    """Fit function"""
    def fit(self, X, Y):
        N = len(X)                          # Set N to len(X), number training examples
        self.models = []                    # Initialize models to empty array
        for b in range(self.B):             # Loop through all B models                 
            idx = np.random.choice(N, size=N, replace=True)   # Generate bootstrap sample
            Xb = X[idx]                                       # X bootstrap sample
            Yb = Y[idx]                                       # Y bootstrap sample 
            
            model = DecisionTreeRegressor()                   # Create decision tree
            model.fit(Xb, Yb)                                 # Fit to bootstrap sample
            self.models.append(model)                         # Append to models array 
            
    """Predict function, mean of all predictions when performing regression"""
    def predict(self, X):
        predictions = np.zeros(len(X))
        for model in self.models:              # Loop through all models
            predictions += model.predict(X)    # Accumulate predictions
        return predictions / self.B            # Return mean of predictions 
        
    """Score function, used to calculate R^2, the amount of variance explained"""
    def score(self, X, Y):              # R^2 = 1 - (unexplained variation / total vari
        d1 = Y - self.predict(X)        # Observed values minus predictions
        d2 = Y - Y.mean()               # Observed values minus mean
        unexplained_var = d1.dot(d1)    # Square and sum diff between observed and prediction
        total_var = d2.dot(d2)          # Square and sum diff between observed and mean
        return 1 - (unexplained_var / total_var)

And now we can put our class into use.

model = BaggedTreeRegressor(200)      # Setting B to 200 (200-500 are common B values)
model.fit(Xtrain, Ytrain)
print("Score for bagged tree:", model.score(x_axis.reshape(T, 1), y_axis))
prediction = model.predict(x_axis.reshape(T, 1))

# Plot the bagged regressor's predictions
fig, ax = plt.subplots(figsize=(12,8))
plt.plot(x_axis, prediction)
plt.plot(x_axis, y_axis)
plt.show()

Score for bagged tree: 0.9950689398322677

4. Bagging Classification¶

In this section we are going to implement bagging for classification. We can start with our imports.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import shuffle

"""Function to plot decision boundary"""
def plot_decision_boundary(X, model):
    h = .02  # step size in the mesh
    # create a mesh to plot in
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                       np.arange(y_min, y_max, h))


    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, m_max]x[y_min, y_max].
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.contour(xx, yy, Z, cmap=plt.cm.Paired)

And let's set our seed.

np.random.seed(10)

Now we can create our data.

N = 500                                  # 500 points
D = 2                                    # Dimensionality is 2
X = np.random.randn(N, D)              # Generate some Gaussian random numbers

"""Dataset is going to be a noisy XOR"""
sep = 2                                  # Separation parameter is equal to 2
X[:125] += np.array([sep, sep])          # 1st 125 pts centered at [+2, +2]
X[125:250] += np.array([sep, -sep])      # centered at [+2, -2]
X[250:375] += np.array([-sep, -sep])     # centered at [-2, -2]
X[375:] += np.array([-sep, sep])         # centered at [-2, +2]
Y = np.array([0]*125 + [1]*125 + [0]*125 + [1]*125)

And let's plot it to see what it looks like:

# Plot the data
fig, ax = plt.subplots(figsize=(12,8))
plt.scatter(X[:,0], X[:,1], s=100, c=Y, alpha=0.5, cmap="viridis")
plt.show()

Now we can see how a lone decision tree performs.

# Lone decision tree
model = DecisionTreeClassifier()
model.fit(X, Y)
print("Score for 1 tree:", model.score(X, Y))

Score for 1 tree: 1.0

And plot the lone decision tree decision boundary:

# Plot data with lone decision tree boundary
fig, ax = plt.subplots(figsize=(12,8))
plt.scatter(X[:,0], X[:,1], s=100, c=Y, alpha=0.5, cmap="viridis")
plot_decision_boundary(X, model)
plt.show()

We can clearly see that the decision boundary that our lone decision tree came up with is very noisy. At the same time the score is 1 because we are overfitting. Let's continue and create our bagged model.

# Create the bagged model
class BaggedTreeClassifier:
    def __init__(self, B):
        self.B = B
        
    """Create B models, sample with replacement, get bootstrap sample, create decision tree"""
    def fit(self, X, Y):
        N = len(X)
        self.models = []
        for b in range(self.B):
            idx = np.random.choice(N, size=N, replace=True)
            Xb = X[idx]
            Yb = Y[idx]

            model = DecisionTreeClassifier(max_depth=2)  # limit max depth so boundary smooth
            model.fit(Xb, Yb)
            self.models.append(model)
            
    def predict(self, X):
        # No need to keep a dictionary since we are doing binary classification
        predictions = np.zeros(len(X))
        for model in self.models:
            predictions += model.predict(X)      # Sum predictions
        return np.round(predictions / self.B)    # Divide by total num predictions, round

    """Score function - the accuracy of the model"""
    def score(self, X, Y):                       
        P = self.predict(X)
        return np.mean(Y == P)

Now we can test our model out!

model = BaggedTreeClassifier(200)
model.fit(X, Y)

print("Score for bagged model:", model.score(X, Y))

# Plot data with boundary
fig, ax = plt.subplots(figsize=(12,8))
plt.scatter(X[:,0], X[:,1], s=100, c=Y, alpha=0.5, cmap="viridis")
plot_decision_boundary(X, model)
plt.show()

Score for bagged model: 0.968

We can see that the score is lower because we are not overfitting! The boundary for our 200 decision trees is much smoother!

5. Stacking¶

We are now going to talk about a special method of combining models called stacking. So far, we have been assuming that each models influence on the output should be equal. This means that every vote has the same power, and when we take the average each of the models contributes equally. But what if we were to weight each model? Now we can have better models have a greater influence on the final model, and inaccurate models have less influence on the final model. Mathematically that just looks like:

$$f(x) = \sum_{m=1}^M w_mf_m(x)$$

The big question is, how do we find these weights? There are several ways to do this, the first of which is stacking.

5.1 Stacking in detail¶

We start by thinking about what our object function might be. We will go with mean squared error since it makes the example easy to follow. As with linear regression, we again want to minimize the mean squared error. We can generalize that to be not the sample mean, but the expected value of the squared error.

$$\hat{w} = argmin_wE_{POP}\Big[(Y - f(X))^2\Big]$$$$\hat{w} = argmin_wE_{POP}\Big[(Y - \sum_{m=1}^M w_mf_m(X))^2\Big]$$

What is interesting about this is that it looks exactly like linear regression! So maybe we could solve this by using the solution to linear regression? The problem though is that we don't know the actualy population distribution and therefore can't calculate this expected value.

You may be tempted to substitute the training data for the true population, but there is at least one example where that would fail. Recall from linear regression that the more columns you have, the better your $R^2$ value is. So if we said that $f_m(X)$ is a linear model with $m$ inputs, then the last $f(X)$, $f_M(x)$, the one with the most inputs would be the best model in all cases. So in that situation $w_M$ would equal 1, and all other $w$s would be 0, since we would only want to use the linear model with the most inputs.

5.2 Stacking: Correct way to estimate expected value¶

There is a good way to estimate the expected value, and it is as follows: We turn the expected value into a sum over all $N$ data points, indexed by $i$. When we compute $f_m$ for the input $x_i$, we use the model $f_m$ on all points except $x_i$. In the equation below that is what is meant by the superscript $-i$ on the term $f_m^{-i}$.

$$\hat{w}_{stack} = argmin_w\sum_{i=1}^N\Big[(y_i - \sum_{m=1}^M w_mf_m^{-i}(x_i))^2\Big]$$

The next obvious question is "how do you solve this equation?". Well, what you can do is limit all of the $w_m$ to be greater than 0 and set a contstraint that they must sum to one. Mathematically that looks like:

$$minimize: \sum_{i=1}^N\Big[(y_i - \sum_{m=1}^M w_mf_m^{-i}(x_i))^2\Big]$$$$subject \; to: w_m \geq 0 \; \forall m, \sum_{m=1}^M w_m = 1$$

These constraints help in the case where $f(x)$ outputs a class probability. This type of problem, where you want to optimize a quadratic function subject to inequality and equality constraints is known as a quadratic programming problem. This shows up a few times in machine learning, another example being support vector machines. We won't go too much further with quadratic programming since it is very in depth, but a library can take care of this for us.

5.3 Leave-One-Out-Cross-Validation¶

You may recognize that this method looks a lot like leave one out cross validation. Recall that LOO-CV always trains on every point except one, and tests on the last point, and it does that on the entire data set. Usually when we are doing cross validation we are interested in picking the best model with the best hyper parameters, rather than combining models. By choosing the best model, that is equivalent to setting one of the $w_m$ to 1, and the rest to 0. In other words, if the best model according cross validation is $m^*$, then we set $w_{m^*}$ to 1, and the rest to 0.

$$w_{m^*} = 1, w_m = 0 \; \forall m \neq m^*$$