4. Random Forest Algorithm¶

We are now going to touch on the random forest algorithm, which touches on earlier concepts we have gone over.

Recall, if we have $B$ independent and identically distributed variables, each with a variance $\sigma^2$, then the sum (or equivalently the sample mean) of the random variables has a variance:

$$var(\bar{\theta}_B) = \frac{1}{B}\sigma^2$$

However, recall that this is not the case when the $B$ random variables are only identically distributed, and not independent. In this case there may be a correlation between two random variables, which we will call $\rho$; the variance of the sum or sample mean is given by this equation:

$$var(\bar{\theta}_B) = \frac{1- \rho}{B}\sigma^2 + \rho \sigma^2$$

The main goal of random forest is to try and reduce this correlation. In other words, it tries to build a set of trees that are decorrelated from eachother.

Recall that the idea behind bagging was to average the results from high variance/low bias models. Trees are perfect for that because they can go arbitrarily deep, and capture complex interactions. Much of the time, they can achieve 100% accuracy on the training set, and hence have 0 bias. We want this because then $\rho$ will be 0! At the same time, this results in them having a high variance. But due to the previous equation for the variance of an ensemble, we can achieve a much lower combined variance by finding trees that are not correlated with eachother.

A good question to ask at this point is: "is there anything more deliberate we can do to make sure each tree is decorrelated from the others, rather than just assuming that trees grown to maximum depth on different bootstrap samples will be very different?" We will see how to do that soon!

1.1 Random Forest - Bias¶

We know that we can achieve low bias easily with trees because the more nodes we add, the more it will overfit. So, let's suppose that each tree has zero bias. Since each tree has the same expected value, then the expected value of an ensemble of trees is the same, and thus the bias remains the same too. This can be seen in the equation below - all estimates of $f$, $\hat{f}$, are going to have the same expected value:

$$\bar{f}(x) = E\Big[\hat{f}(x)\Big]$$

(This can be seen in the previous section 1.5.1 Mean Derivation).

And we can see that $bias^2$ is simply the ground truth function $f$ (which doesn't change) minus the expected value of the estimate, $\bar{f}$.

$$bias^2 = \Big[f(x) - \bar{f}(x)\Big]^2$$

We will see later with boosting another way of combining trees with high bias.

1.2 Random Forest - Decorrelation¶

So, how does Random Forest try to decorrelate it's trees? In the same way that we can select which samples to train on, we can also randomly select which features to train on too! So, if you think of the data matrix $X$, one way to get different trees is to sample different rows, which we have done already. Another way is to sample different columns, which is equivalent to training a tree on only a subset of features.

We usually choose a dimensionality $d << D$, assuming that $X$ is an (N x D) matrix. The inventors of random forest recommend the following settings for $d$:

$$Classification: d = floor(\sqrt{D})$$$$Regression: d = floor(\frac{D}{3})$$

For classification note that it can be set as low as 1. For regression it can be set as low as 5. As always, by using a method like cross validation you can see what works best for your specific dataset.

1.3 Random Forest - Algorithm¶

The algorithm for random forest training is as follows:

for b = 1..B:                              # We loop B times
    Xb, Yb = sample_with_replacement(X, Y) # Draw sample w/ replacement
    model = DecisionTree()                 # Create a new tree

    # We do not train tree here in usual way, instead we loop until we 
    # hit a terminal node or preset maximum depth. This loop could also
    # be recursion, as we did in the supervised learning walkthroughs

    while not at terminal node and not reached max_depth:

        # Each iteration, select d features randomly from the total set 
        # features, and for these features we choose the best split 
        # using a criterion like maximum information gain. 
        select d features randomly         

        # We then add this split to the current tree, and keep recursing 
        # until the tree is complete (hit leaf node or max depth)
        add split to model

    models.append(model)                     # Add model to list of models

A few things are worth noting here.

First, this is just like bagging, expcept that in addition to getting a bootstrap sample from the training set on each round, we also sample the features too. People will sometimes refer to this as feature bagging.
Second, these are not ensembles of plain vanilla decision trees, as we did with the bagged tree. This is because we are modifying the decision tree algorithm by changing how they split. In particular, we are not allowing them to search the entire feature list. So, we can't build a random forest using the built in decision tree class. For this reason, we will not write the random forest from scratch in this walkthrough.

1.4 Random Forest - Prediction¶

We have already touched on how to do predictions for ensembles. For binary classification you can just round the average, and for regression you can just output the average. Since we have already covered that before we won't do it again here. The main complication with random forests is just the train function.

1.5 Random Forest - Training¶

We will be using the Scikit Learn random forest to code this out going forward. There are a few things to note about this, found in the docs here: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

First, you can see that the number of trees by default is 10 for both classification and regression.

If you look at the splitting criterion, you will see that the classifier uses gini impurity, while the regressor uses mean squared error

Gini impurity has the same shape as entropy, so it's not a big deal

The max features argument tells the random forest how many features to sample on each split. Notice that the default is the square root of D for classification, but the default for regression is equal to D. This does not follow the recommendation to using $\frac{D}{3}$

As we continue to go down, there are a lot of options related to how to build the tree. For instance there is maximum_depth which is by default none, min_samples_split, min_samples_leaf, and the max_leaf_nodes. Notice how a lot of these inputs pertain to the tree and not the forest.

1.6 Random Forest - Problems¶

A big question that comes up in machine learing is what is a lot of the predictors/input features are irrelevant and just noise? If most of the features are just noise, what happens to random forest if when it randomly chooses input features it chooses only noise. And, what if it does this for a large proportion of trees. For example, if we have 3 relevant variables and 100 irrelevant variables, and we have d equal $floor(\sqrt{103}) = 10$, then what is the probability that we chose at least one relevant variable?

$$1 - \frac{100}{103} * \frac{99}{102} * ...* \frac{91}{94} = 0.266$$

26% actually is not terrible, considering we only had 3 relevant variables. If we had 6 relevant variables, the probability would go up to 46%. It is, however, still a a problem if the number of relevant variables is small. Soon we will talk about an algorithm called boosting which fixes this problem.

1.7 Random Forest - Advantages¶

One big advantage of random forests is that they require very little tuning. Research has show that you can let all of the trees have arbitrary depth without incurring very much penalty. They perform well, and they are fast, making them ideal for many situations. This is why it is often recommended vs. deep learning for more basic and broad use cases. Neural networks can yield very different results based on the selected hyperparameters, and on top of that there are many more options. Hence, random forests are the way to go if you are looking for a plug and play solution.

2. Random Forest Regression - Code¶

We are now going to apply the random forest regressor to real data. It is now that we will stop looking at 2d plots for regression for 2 reasons.

We already know the smoothing effect that ensembling has on the decision tree predictions because of the plots we did earlier.
The main trick of random forest is that is selects a subset of features at each tree split, but a 2d plot only has 1 feature to choose from, making this main feature obselete.

Let's start with our imports.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression       # Will compare LR as baseline
from sklearn.tree import DecisionTreeRegressor          # Will also compare single Dtree
from sklearn.model_selection import cross_val_score

# Seaborn Plot Styling
sns.set(style="white", palette="husl")
sns.set_context("poster")
sns.set_style("ticks")

Next we can define constants for our column names:

NUMERICAL_COLS = [
  'crim', # numerical
  'zn', # numerical
  'nonretail', # numerical
  'nox', # numerical
  'rooms', # numerical
  'age', # numerical
  'dis', # numerical
  'rad', # numerical
  'tax', # numerical
  'ptratio', # numerical
  'b', # numerical
  'lstat', # numerical
]

NO_TRANSFORM = ['river']      # Do not want to transform river, since it is already 0 or 1

Define our DataTransformer class. This works like Scikit learns scaler classes (normalizes by subtracting mean and dividing by std or variance), which have the functions fit, transform, and fit_transform. This transforms data from dataframe to numerical matrix. We want to use the scales found in training when transforming the test set, so we only call fit() once, and call transform() for any subsequent data.

We want to be able to transform without fitting, because when we find the mean and variance of a feature, we only want to use the train set for that. When we transform the test set we want to use only the means and variances that we have already found.

Documentation can be found specifically related to this here: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

class DataTransformer:
    """fit finds the mean and variance"""
    def fit(self, df):
        self.scalers = {}
        for col in NUMERICAL_COLS:
            scaler = StandardScaler()                          # Sklearn standard scaler
            scaler.fit(df[col].as_matrix().reshape(-1, 1))
            self.scalers[col] = scaler

    """transform subtracts by the mean and divides by the variance, converts to np array"""
    def transform(self, df):
        N, _ = df.shape
        D = len(NUMERICAL_COLS) + len(NO_TRANSFORM)
        X = np.zeros((N, D))
        i = 0
        for col, scaler in self.scalers.items():
            X[:,i] = scaler.transform(df[col].as_matrix().reshape(-1, 1)).flatten()
            i += 1
        for col in NO_TRANSFORM:
            X[:,i] = df[col]
            i += 1
        return X

    def fit_transform(self, df):
        self.fit(df)
        return self.transform(df)

Now we can write our function to get the data. The dataset is similar to a CSV, but each value is separated by an arbitrary number of spaces.

def get_data():
  # Regex allows arbitrary number of spaces in separator (sep=r"\s*")
  # To use this regex, must use python engine
  df = pd.read_csv('./data/housing.data', header=None, sep=r"\s*", engine='python')
  
  # Renamining columns manually
  df.columns = [
    'crim', # numerical
    'zn', # numerical
    'nonretail', # numerical
    'river', # binary
    'nox', # numerical
    'rooms', # numerical
    'age', # numerical
    'dis', # numerical
    'rad', # numerical
    'tax', # numerical
    'ptratio', # numerical
    'b', # numerical
    'lstat', # numerical
    'medv', # numerical -- this is the target
  ]

  # Transform data - create an instance of DataTransformer
  transformer = DataTransformer()

  # Shuffle the data and split into 70% train, 30% test
  N = len(df)
  train_idx = np.random.choice(N, size=int(0.7*N), replace=False)
  test_idx = [i for i in range(N) if i not in train_idx]
  df_train = df.loc[train_idx]
  df_test = df.loc[test_idx]

  # Take the log of the targets (median house price). This is a common operation for 
  # scaling numerical columns that have large ranges, but you care about how correct you are
  # relative to the value. What we mean by that is if you have a house that is priced at 
  # $10,000 and your prediction is $5,000 you prediction is very wrong. However, if the house
  # was priced at $1,000,000 and your prediction was $5,000 off you don't care as much
  Xtrain = transformer.fit_transform(df_train)
  Ytrain = np.log(df_train['medv'].as_matrix())
  Xtest = transformer.transform(df_test)
  Ytest = np.log(df_test['medv'].as_matrix())
  return Xtrain, Ytrain, Xtest, Ytest

We can now enter our main loop:

def run_experiment():
  Xtrain, Ytrain, Xtest, Ytest = get_data()      # Get our data
  
  # Create an instance of random forest regressor
  model = RandomForestRegressor(n_estimators=100) # Try 10, 20, 50, 100, 200
  model.fit(Xtrain, Ytrain) 
  predictions = model.predict(Xtest)

  # Plot predictions vs targets
  # First plot - targets along x axis, predictions along y axis, if accurate should be near 
  # line y = x, so we plot that too
  fig, ax = plt.subplots(figsize=(12,8))
  plt.scatter(Ytest, predictions)
  plt.xlabel("target")
  plt.ylabel("prediction")
  ymin = np.round( min( min(Ytest), min(predictions) ) )
  ymax = np.ceil( max( max(Ytest), max(predictions) ) )
  print("ymin:", ymin, "ymax:", ymax)
  r = range(int(ymin), int(ymax) + 1)
  plt.plot(r, r, color='blue')
  plt.show()

  # Create plot of targets and predictions as line charts 
  fig, ax = plt.subplots(figsize=(12,8))
  plt.plot(Ytest, label='targets')
  plt.plot(predictions, label='predictions', color="green")
  plt.legend()
  plt.show()

  # Cross validation on all models
  # Training set baseline test
  baseline = LinearRegression()
  single_tree = DecisionTreeRegressor()
  print("CV single tree:", cross_val_score(single_tree, Xtrain, Ytrain).mean())
  print("CV baseline:", cross_val_score(baseline, Xtrain, Ytrain).mean())
  print("CV forest:", cross_val_score(model, Xtrain, Ytrain).mean())

  # Test score
  single_tree.fit(Xtrain, Ytrain)
  baseline.fit(Xtrain, Ytrain)
  print("test score single tree:", single_tree.score(Xtest, Ytest))
  print("test score baseline:", baseline.score(Xtest, Ytest))
  print("test score forest:", model.score(Xtest, Ytest))


run_experiment()

/usr/local/lib/python3.6/site-packages/pandas/io/parsers.py:2227: FutureWarning: split() requires a non-empty pattern match.
  yield pat.split(line.strip())
/usr/local/lib/python3.6/site-packages/pandas/io/parsers.py:2229: FutureWarning: split() requires a non-empty pattern match.
  yield pat.split(line.strip())
/usr/local/lib/python3.6/site-packages/sklearn/utils/validation.py:475: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.
  warnings.warn(msg, DataConversionWarning)

ymin: 2.0 ymax: 4.0

CV single tree: 0.7627342797438036
CV baseline: 0.7696955274061493
CV forest: 0.8370560373592532
test score single tree: 0.5829399527321669
test score baseline: 0.7824358648686986
test score forest: 0.8192645918177106

We can see here that the single tree performs the worst, then the baseline, and the random forest significantly outperforms the others.

3. Random Forest Classification - Code¶

Our data set is now going to be a dataset of poisonous and edible mushrooms, and we have to try to predict which ones are poisonous and which ones are edible. The data set can be found here:

https://archive.ics.uci.edu/ml/datasets/Mushroom

If we look at the data set description we can see that there are 22 columns, all of which are just attributes of the mushroom, such as size/shape/color and other things that describe what it looks like. We also can see that missing attribute values are denoted by ?'s. For categorical variables we can treat that as an additional symbol for that category, so there is nothing more to do for us.

We can start with our imports, only this time importing classifiers instead of regressors. The classification analog for linear regression is logistic regression, since they are both linear models.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

We have no numerical columns, and all categorical columns. Recall that there is no header row in the data file, so pandas will just number them as 1, 2, 3, and so on. Since the targets column name is 0, the first feature will have the name 1 and the last feature will have the column name 22. We can create an array of these numbers quickly by using numpy's arange function and adding 1.

NUMERICAL_COLS = ()
CATEGORICAL_COLS = np.arange(22) + 1 # 1..22 inclusive

We then have our data transformer class, which is very similar to the previous data transformer, except now we have code to transform categorical columns. Ideally we would put all of the code into one data transformer class and put all of the code into that one class.

For our numerical columns we will use SciKit Learn StandardScaler, and for categorical columns we will use SciKit Learns LabelEncoder. LabelEncoder will number the labels from 0 to k - 1, which allows us to use them as index's later on.

Notice, that in order to determine the dimensionality, each of the numerical columns counts as one. Also, in the transform method, notice that for the categorical columns the LabelEncoder gives us index's. Which means we need to use those index's to index X and then set those values to one.

"""
Transforms data from dataframe to numerical matrix. One-hot encodes categories 
and normalizes numerical columns. We want to use the scales found in training
when transforming the test sets, so only call fit() once, call transform() for 
any subsequent data.
"""

class DataTransformer:
  def fit(self, df):
    self.labelEncoders = {}
    self.scalers = {}
    for col in NUMERICAL_COLS:         # Use SKL StandardScaler for numerical columns
      scaler = StandardScaler()
      scaler.fit(df[col].reshape(-1, 1))
      self.scalers[col] = scaler

    for col in CATEGORICAL_COLS:
      encoder = LabelEncoder()         # Use SKL LabelEncoder for categorical columns
      # in case the train set does not have 'missing' value but test set does
      values = df[col].tolist()
      values.append('missing')
      encoder.fit(values)
      self.labelEncoders[col] = encoder

    # Determine dimensionality
    self.D = len(NUMERICAL_COLS)                       # Each numerical column counts as one
    for col, encoder in iteritems(self.labelEncoders):
      self.D += len(encoder.classes_)                  # Categorical column counts for as 
    print("dimensionality:", self.D)                   # many different categories as it has

  def transform(self, df):
    N, _ = df.shape
    X = np.zeros((N, self.D))
    i = 0
    # Transform numerical columns
    for col, scaler in iteritems(self.scalers):
      X[:,i] = scaler.transform(df[col].as_matrix().reshape(-1, 1)).flatten()
      i += 1
    
    # Transform categorical columns 
    for col, encoder in iteritems(self.labelEncoders):
      # print "transforming col:", col
      K = len(encoder.classes_)
      X[np.arange(N), encoder.transform(df[col]) + i] = 1
      i += K
    return X

  def fit_transform(self, df):
    self.fit(df)
    return self.transform(df)

We then have our get_data function. Once we load the file, we convert all of the e's and p's in the first column to 0s and 1s. We then tranform the data using DataTransformer and then return X and Y.

def get_data():
  df = pd.read_csv('../../../data/mushrooms/mushroom.data', header=None)

  # Replace label column: e/p --> 0/1
  # e = edible = 0, p = poisonous = 1
  df[0] = df.apply(lambda row: 0 if row[0] == 'e' else 1, axis=1)
  
  # transform the data
  transformer = DataTransformer()

  X = transformer.fit_transform(df)
  Y = df[0].as_matrix()
  return X, Y

The last step is the main section, where we get the data, and then run cross validation on our three models: The base case is logistic regression, and then we have the single tree and the random forest.

if __name__ == '__main__':
  X, Y = get_data()

  # do a quick baseline test
  baseline = LogisticRegression()
  print("CV baseline:", cross_val_score(baseline, X, Y, cv=8).mean())

  # single tree
  tree = DecisionTreeClassifier()
  print("CV one tree:", cross_val_score(tree, X, Y, cv=8).mean())

  model = RandomForestClassifier(n_estimators=20) # try 10, 20, 50, 100, 200
  print("CV forest:", cross_val_score(model, X, Y, cv=8).mean())

dimensionality: 139
CV baseline: 0.9274806301152012
CV one tree: 0.9455977754935805
CV forest: 0.9511391625615764

We can see again that the random forest performs the best!

4. Random Forest vs. Bagging Trees - Code¶

We are now going to compare the behavior of random forest vs. bagged trees, as we add more trees. In particular we are going to plot the test error vs. the number of trees for each model. Note that these will take a long time to run, so just be aware of that before starting.

We can begin as always with our imports. Note that scikit learn has its own bagging classifier and regressor. However, these do not let you specify the max depth of the trees, so we will import our custom classes from the util file.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor, RandomForestClassifier, BaggingClassifier
from util import BaggedTreeRegressor, BaggedTreeClassifier

# Seaborn Plot Styling
sns.set(style="white", palette="husl")
sns.set_context("poster")
sns.set_style("ticks")

We now need to get the data. The first data set is a synthetic dataset. It just the sum of all the features squared plus some gaussian noise. You can comment that out and instead use the classification data, or the house price data.

# Make simple regression data
N = 15
D = 100
X = (np.random.random((N, D)) - 0.5)*10
Y = X.sum(axis=1)**2 + 0.5*np.random.randn(N)
Ntrain = N//2
Xtrain = X[:Ntrain]
Ytrain = Y[:Ntrain]
Xtest = X[Ntrain:]
Ytest = Y[Ntrain:]

# from rf_classification import get_data
# X, Y = get_data()
# Ntrain = int(0.8*len(X))
# Xtrain, Ytrain = X[:Ntrain], Y[:Ntrain]
# Xtest, Ytest = X[Ntrain:], Y[Ntrain:]

# from rf_regression import get_data
# Xtrain, Ytrain, Xtest, Ytest = get_data()

We now can enter into our main loop. We set T to 300, so we are going to test ensembles of trees up to 300 trees. We create 2 arrays to store the test errors at each iteration. In the main loop we first check if num_trees is 0, and if so we skip it since there is not such thing as an ensemble with 0 trees. Next, we train our models and test them on the test data.

T = 300
test_error_rf = np.empty(T)
test_error_bag = np.empty(T)
for num_trees in range(T):
  if num_trees == 0:
    test_error_rf[num_trees] = None
    test_error_bag[num_trees] = None
  else:
    rf = RandomForestRegressor(n_estimators=num_trees)
    # rf = RandomForestClassifier(n_estimators=num_trees)
    rf.fit(Xtrain, Ytrain)
    test_error_rf[num_trees] = rf.score(Xtest, Ytest)

    bg = BaggedTreeRegressor(n_estimators=num_trees)
    # bg = BaggedTreeClassifier(n_estimators=num_trees)
    bg.fit(Xtrain, Ytrain)
    test_error_bag[num_trees] = bg.score(Xtest, Ytest)

  if num_trees % 10 == 0:
    print("num_trees:", num_trees)

fig, ax = plt.subplots(figsize=(12,8))
plt.plot(test_error_rf, label='rf')
plt.plot(test_error_bag, label='bag')
plt.legend()
plt.show()

num_trees: 0
num_trees: 10
num_trees: 20
num_trees: 30
num_trees: 40
num_trees: 50
num_trees: 60
num_trees: 70
num_trees: 80
num_trees: 90
num_trees: 100
num_trees: 110
num_trees: 120
num_trees: 130
num_trees: 140
num_trees: 150
num_trees: 160
num_trees: 170
num_trees: 180
num_trees: 190
num_trees: 200
num_trees: 210
num_trees: 220
num_trees: 230
num_trees: 240
num_trees: 250
num_trees: 260
num_trees: 270
num_trees: 280
num_trees: 290

Notice how the random forest eventually does better than the bagged trees! Note the lack of overfitting in all cases. We converge to an asymptotic test error which does not go up as we add more trees. It is for this reason that people say random forest does not overfit. We will observe this behavior with boosting as well. Technically, we do still need to worry about overfitting, but not to the same extent as ensemble models.

5. Not as Random Forest - Code¶

Since we will not be implementing a full random forest at this point, it would still be interesting if we took the principle of random forest and tried to implement something. So we will do just that.

We will again use bagging so that for each base model we create, we are going to train it on a bootstrap sample, but we are also only going to use a subset of features. The difference between this and a real random forest, is that in a real random forest each split on the tree is made with a newly selected random set of features. For our pseudo random forest, we are going to select a subset of features only once, and each tree will be trained using only that subset. This allows us to make use of the decision tree class in scikit learn.

So again, let's just be clear that the main difference between this and a real random forest is that in the real random forest we randomly sample M features at every node of the tree!

We will again start with our imports.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor, RandomForestClassifier, BaggingClassifier
from util import BaggedTreeRegressor, BaggedTreeClassifier

For this example we are going to be using the classification dataset, so we can import it and get our training and test sets to start.

X, Y = get_data()
Ntrain = int(0.8*len(X))
Xtrain, Ytrain = X[:Ntrain], Y[:Ntrain]
Xtest, Ytest = X[Ntrain:], Y[Ntrain:]

dimensionality: 139

Next we can define our NotAsRandomForest class, which is random but not quite as random as a true random forest. It still only takes in the number of base models and the constructor.

The big difference is in the fit and predict functions. The fit function is a lot like the bagged tree model, expect that it also requires us to save which features we used in the fit function, so that we know which model goes with which features. This will be necessary to know when we do prediction. The fit function takes in an additional parameter M, which tells us how many features to subsample.

In the predict function notice how we use features to index X, because each model has its own set of features that it is looking at.

class NotAsRandomForest:
  def __init__(self, n_estimators):
    self.B = n_estimators

  def fit(self, X, Y, M=None):
    N, D = X.shape
    if M is None:
      M = int(np.sqrt(D))

    self.models = []
    self.features = []
    for b in range(self.B):
      tree = DecisionTreeClassifier()

      # sample features - replace is False since we don't choose same feature more than once
      features = np.random.choice(D, size=M, replace=False)

      # sample training samples - Bootstrap sampling step (with replacement)
      idx = np.random.choice(N, size=N, replace=True)
      Xb = X[idx]
      Yb = Y[idx]
      
      # Train model, save model, save features. We need to save features because when we 
      # sample the features they are going to appear in a random order, so when we later
      # try to do prediction we are going to need to make sure the input is in the same order
      tree.fit(Xb[:, features], Yb)
      self.features.append(features)
      self.models.append(tree)

  def predict(self, X):
    N = len(X)
    P = np.zeros(N)
    for features, tree in zip(self.features, self.models):
      P += tree.predict(X[:, features])       # Sum up all predictions
    return np.round(P / self.B)               # Divide by total and round

  def score(self, X, Y):
    P = self.predict(X)
    return np.mean(P == Y)

The main loop is the same as the random forest vs. bagging example above, except we now have another array of errors to store the scores. The variable is named error, but it is actually accuracy.

T = 500
test_error_prf = np.empty(T)
test_error_rf = np.empty(T)
test_error_bag = np.empty(T)
for num_trees in range(T):
  if num_trees == 0:
    test_error_prf[num_trees] = None
    test_error_rf[num_trees] = None
    test_error_bag[num_trees] = None
  else:
    rf = RandomForestClassifier(n_estimators=num_trees)
    rf.fit(Xtrain, Ytrain)
    test_error_rf[num_trees] = rf.score(Xtest, Ytest)

    bg = BaggedTreeClassifier(n_estimators=num_trees)
    bg.fit(Xtrain, Ytrain)
    test_error_bag[num_trees] = bg.score(Xtest, Ytest)

    prf = NotAsRandomForest(n_estimators=num_trees)
    prf.fit(Xtrain, Ytrain)
    test_error_prf[num_trees] = prf.score(Xtest, Ytest)

  if num_trees % 10 == 0:
    print("num_trees:", num_trees)

fig, ax = plt.subplots(figsize=(12,8))
plt.plot(test_error_rf, label='rf')
plt.plot(test_error_prf, label='pseudo rf')
plt.plot(test_error_bag, label='bag')
plt.legend()
plt.show()

num_trees: 0
num_trees: 10
num_trees: 20
num_trees: 30
num_trees: 40
num_trees: 50
num_trees: 60
num_trees: 70
num_trees: 80
num_trees: 90
num_trees: 100
num_trees: 110
num_trees: 120
num_trees: 130
num_trees: 140
num_trees: 150
num_trees: 160
num_trees: 170
num_trees: 180
num_trees: 190
num_trees: 200
num_trees: 210
num_trees: 220
num_trees: 230
num_trees: 240
num_trees: 250
num_trees: 260
num_trees: 270
num_trees: 280
num_trees: 290
num_trees: 300
num_trees: 310
num_trees: 320
num_trees: 330
num_trees: 340
num_trees: 350
num_trees: 360
num_trees: 370
num_trees: 380
num_trees: 390
num_trees: 400
num_trees: 410
num_trees: 420
num_trees: 430
num_trees: 440
num_trees: 450
num_trees: 460
num_trees: 470
num_trees: 480
num_trees: 490

6. Connection to deep learning: Dropout¶

We are now going to look at a connection between deep learning and random forests, dropout. We know that the main idea behind random forest is that it not only trains its ensemble of decision trees using bootstrap samples, but it also randomly selects a subset of features when it trains its trees.

Dropout is a modern regularization technique for neural networks in the field of deep learning. The way it works is as follows: In a normal neural network, every node in a layer connects to every node in the next layer. The dropout method is that during training you randomly drop nodes so that they don't feed into the next layer. You do this with some small probability, like 20% or even 50%, So at every layer, you throw away half of the nodes during training.

So what does this have to do with ensembling? Consider that each node in a neural network can be either used or not used. In other words, there are two possible states of existence for each node. If the neural network has N nodes in total, then there are $2^N$ different possible neural networks we can get via dropping nodes.

One problem with neural networks is that they take a very long time to train. If you have a neural network that takes 1 hour to train, and you want to train an ensemble of 200 neural networks, that is going to take 200 hours to train, which is over 8 days. But, lets say your neural network has 1000 nodes, which is very reasonable. $2^{1000}$ is approximately $10^{301}$, which is a very large number. You could not train that number of networks in any scenario.

The key point is that dropout "emulates" an ensemble of $2^N$ networks by randomly dropping nodes during training, and then multiplying by 1-p(drop) during prediction. So it allows you to make an ensemble without really making an ensemble. It also works similar to random forest by randomly selecting which features it will use at every layer, similar to how a tree in random forest randomly selects which features to look at at each node.