Nathaniel Dake Blog

2. Multi-Armed Bandit

Imagine you are at a casino playing slot machines (the lever you pull is the "arm" in "multi-armed bandit"). The slot machines are "bandits" because they are taking your money. Not all of the slot machines are equal, and their win rates are generally unknown. For example, they may be:

  • 1 pays out 30% of the time
  • 1 pays out 20% of the time
  • and 1 pays out 10% of the time

Keep in mind that you wouldn't know this, and would only be able to discover it by collecting data. Your goal is to maximize your winnings. If you were psychic and you could determine the machine with the highest win rate, you would play only that machine, and that is how you would maximize your winnings. But, since you don't know the win rates and you need to estimate them by collecting data, that means you need to play each of the machines.

This is where the dilemma comes in: You need to collect enough data for your estimates to be accurate, but if you collect lots of data then you are going to spend lots of time playing suboptimal bandits. So, you need to balance between trying to maximize your earnings, and trying to obtain accurate measurements of each bandit's win rate, since in order to maximize your earnings, you need to know that information.

1.0 Traditional A/B Testing

In traditional A/B testing, you would predetermine ahead of time the number of times you need to play each bandit in order to obtain statistical significance. The number of plays you need depends on several things, such as the difference in win rates between the bandits. If the difference is bigger you need fewer samples, but if the difference is smaller you need more samples. This is kind of odd, since you don't know the difference before you start the test. An important rule of traditional A/B testing is that you do not stop your test early. So if you decide beforehand that you need to play each bandit 1000 times (N = 1000), and after 500 trials you have calculated a 90% win rate for one bandit and a 10% win rate for the other, you are not allowed to stop. However, we are not going to care about statistical significance in these notebooks, so we don't need to follow those rules.
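
To make the sample-size intuition concrete, here is a minimal sketch (my own addition, not part of the original post) of the standard two-proportion, normal-approximation sample-size formula; the win rates, significance level, and power are assumptions chosen purely for illustration:

```
import numpy as np
from scipy.stats import norm

def samples_per_bandit(p1, p2, alpha=0.05, power=0.8):
  """Rough number of plays per bandit needed to detect the gap between
  win rates p1 and p2 (two-sided z-test, normal approximation)."""
  z_alpha = norm.ppf(1 - alpha / 2)
  z_beta = norm.ppf(power)
  variance = p1 * (1 - p1) + p2 * (1 - p2)
  return int(np.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2))

print(samples_per_bandit(0.30, 0.20))  # bigger gap  -> fewer samples needed
print(samples_per_bandit(0.30, 0.28))  # smaller gap -> far more samples needed
```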

1.1 Human Behavior

What is very interesting is that humans behave in exactly the opposite way. Suppose you are playing in a real casino, and you get 2/3 on bandit A and 0/3 on bandit B. You would have a strong urge to play bandit A, even though their true win rates are probably not 67% and 0%. As a data scientist, you know that there is a low probability that the first bandit has a win rate of 67%, and a low probability that the second bandit has a win rate of 0%. You know this because 3 is a very tiny sample size. Yet, if you are playing slot machines, you feel compelled to believe that the first bandit is better.

Next, we will talk about algorithms that make an explicit trade-off between exploration and exploitation. This will allow us to improve upon the two suboptimal approaches we have just gone over: human behavior and traditional A/B testing.


2.0 Epsilon-Greedy Solution

One method of solving the explore-exploit dilemma is called the epsilon-greedy algorithm (and it is intuitive). This algorithm is used heavily in reinforcement learning. Here is how it works:

Say you have some software that is serving 2 advertisements (A & B). You will not run a full A/B test by blindly serving ad A and ad B an equal number of times; instead, you will adapt to the data you have collected. If ad A is performing better, it will be shown more often.

How does it work?

It is simpler than it sounds! We start by choosing a small number epsilon ($\epsilon$) between 0 and 1, which will serve as our probability of exploration. Typical values are 10% or 5%. Some pseudocode may look like:

epsilon = 0.1 
while True:
    r = rand()
    if r < epsilon:
        # explore
        # show random advertisement
    else:
        # exploit
        # show best advertisement (as determined by #clicks/#impressions)

Eventually, we'll discover which arm is truly the best, since the exploration step lets us update every arm's estimate.
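
To make the ad-serving example concrete, here is a minimal, self-contained sketch of the pseudocode above (my own addition; the click-through rates, epsilon, and number of impressions are assumptions chosen for illustration). The full Gaussian-bandit implementation appears later in section 4.

```
import numpy as np

np.random.seed(2)
true_ctrs = np.array([0.05, 0.03])  # assumed click-through rates for ads A and B
clicks = np.zeros(2)
impressions = np.zeros(2)
epsilon = 0.1

for i in range(10000):
  if np.random.random() < epsilon:
    j = np.random.choice(2)                             # explore: show a random ad
  else:
    j = np.argmax(clicks / np.maximum(impressions, 1))  # exploit: best CTR so far
  impressions[j] += 1
  clicks[j] += np.random.random() < true_ctrs[j]        # simulate whether the ad was clicked

print("estimated CTRs:", clicks / np.maximum(impressions, 1))
```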

Analysis

In the long run, this technique allows us to explore each arm an infinite number of times. The problem is that eventually you reach a point where you are still exploring even though you no longer need to. So, if epsilon is 10%, then you will spend 10% of the time choosing suboptimal arms. An A/B test could be useful here, where you perform the test at a predetermined time, and if you find statistical significance then set epsilon=0. However, there are better ways to adapt.


3. Updating Sample Mean

Let's assume for a moment that our rewards are not just coin tosses, so we can't simply store the number of successes and failures. The general method for solving this problem is to use the mean. This works for coin tosses too, since if you add up all of the 0's and 1's that you get, and divide by $N$, you arrive at the maximum likelihood estimate of the win probability.

So, how do we calculate the sample mean of a random variable? We know that it is the total sum, divided by the number of samples:

$$\bar{X} = \frac{1}{N}\sum_{i=1}^N X_i$$

What is the problem with this equation? The problem is that we need to keep track of ALL $N$ elements, in order to calculate the mean. And what if they can't fit into memory? Is there a way to make this calculation more efficient? There most definitely is. The trick is, the mean at the $N$th sample, can be calculated based on the mean at the $(N-1)$th sample.

$$\bar{X}_N = \frac{1}{N}\sum_{n=1}^NX_n = \frac{1}{N}\Big((N-1)\bar{X}_{N-1} + X_N \Big) = (1 - \frac{1}{N})\bar{X}_{N-1}+\frac{1}{N}X_N$$

In other words, the running update at step $N$ is:

$$\bar{X}_N = \Big(1 - \frac{1}{N}\Big)\bar{X}_{N-1} + \frac{1}{N}X_N$$

Note: When the $\frac{1}{N}$ factor is replaced by a constant learning rate, this becomes the exponentially smoothed average; more can be seen in my notebook on Modern Deep Learning, specifically the section concerning momentum and the ADAM optimizer.
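
As a quick sanity check (my own addition), the incremental update can be verified against `np.mean` on some random data:

```
import numpy as np

np.random.seed(0)
samples = np.random.randn(1000)

running_mean = 0.0
for n, x in enumerate(samples, start=1):
  # X_bar_N = (1 - 1/N) * X_bar_{N-1} + (1/N) * X_N
  running_mean = (1 - 1.0 / n) * running_mean + (1.0 / n) * x

print(running_mean, np.mean(samples))  # agree up to floating point error
```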


4. Epsilon-Greedy in Code

In [21]:
# Standard Library Imports
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Seaborn Plot Styling
sns.set(style="white", palette="husl")
sns.set_context("poster")
sns.set_style("ticks")

Create a custom plotting function.

In [22]:
def plot_function(x, y_arr, title, legend, xlabel, ylabel, scale=False):
  """Plot each series in y_arr against x, with an optional legend and log-scaled x axis."""
  fig, ax = plt.subplots(figsize=(10,6))
  for y in y_arr:
    plt.plot(x, y)

  # Create legend & labels
  if legend != False:
    ax.legend([l for l in legend], fontsize=20)
  plt.title(title)
  ax.set_xlabel(xlabel, fontsize=20)
  ax.set_ylabel(ylabel, fontsize=20)
  
  if scale == True:
    plt.xscale('log')

  return plt.show()
In [23]:
class Bandit: 
  def __init__(self, m):
    self.m = m       # True mean
    self.mean = 0    # Estimated mean that evolves as we explore 
    self.N = 0
    
  def pull(self):
    return np.random.randn() + self.m
  
  def update(self, X):
    self.N += 1
    self.mean = (1 - 1.0 / self.N) * self.mean + (1.0 / self.N) * X
    
def run_experiment_eps(m1, m2, m3, eps, N):
  bandits = [Bandit(m1), Bandit(m2), Bandit(m3)]
  
  data = np.empty(N)
  
  for i in range(N):
    
    # Epsilon Greedy
    p = np.random.random()
    if p < eps:
      j = np.random.choice(3)
    else:
      j = np.argmax([b.mean for b in bandits])
    
    x = bandits[j].pull() # Pull
    bandits[j].update(x) # Update
    
    # Get data for plot
    data[i] = x
  cumulative_average = np.cumsum(data) / (np.arange(N) + 1)
  
  print('Epsilon = ', eps)
  for b in bandits:
    print (b.mean)
  print('-----------------------------')
  return cumulative_average

if __name__ == '__main__':
  N = 100000
  m1 = 1.0
  m2 = 2.0
  m3 = 3.0
  
  c_1 = run_experiment_eps(m1, m2, m3, 0.1, N)
  c_05 = run_experiment_eps(m1, m2, m3, 0.05, N)
  c_01 = run_experiment_eps(m1, m2, m3, 0.01, N)
  
  # Plot moving average center
  x = np.arange(N)
  c1_y_arr = [c_1, np.ones(N) * m1, np.ones(N) * m2, np.ones(N) * m3]
  plot_function(x, c1_y_arr, 'Cumulative Average with $\epsilon$ = 0.1', False, 'N', 'Means', True)
  
  c05_y_arr = [c_05, np.ones(N) * m1, np.ones(N) * m2, np.ones(N) * m3]
  plot_function(x, c05_y_arr, 'Cumulative Average with $\epsilon$ = 0.05', False, 'N', 'Means', True)

  c01_y_arr = [c_01, np.ones(N) * m1, np.ones(N) * m2, np.ones(N) * m3]
  plot_function(x, c01_y_arr, 'Cumulative Average with $\epsilon$ = 0.01', False, 'N', 'Means', True)

  # Plot cumulative averages for all epsilon
  c_arr = [c_1, c_05, c_01]
  plot_function(x, c_arr, 'Cumulative Averages', ['Eps = 0.1', 'Eps = 0.05', 'Eps = 0.01'], 'N', 'Means', True)

  # Plot cumulative averages for all epsilon (Linear)
  plot_function(x, c_arr, 'Cumulative Averages', ['Eps = 0.1', 'Eps = 0.05', 'Eps = 0.01'], 'N', 'Means')
Epsilon =  0.1
1.0036947461792947
1.9678510765652517
2.9950873985972155
-----------------------------
Epsilon =  0.05
0.9760833089783673
1.9849698971726923
2.9993868196856615
-----------------------------
Epsilon =  0.1
0.9728618868726824
2.0125613754350242
2.9984838933572937
-----------------------------

5. Optimistic Initial Values

We will now look at another way of solving the explore-exploit dilemma. Suppose you know that the true means of the bandits are all much less than 10. The idea is that you pick a high ceiling as the initial estimate of each bandit's mean, and then do your updates based on that. In other words, before we had:

$$\bar{X_0} = 0$$

And now we have:

$$\bar{X_0} = 10$$

We say that this is optimistic because the initial value for the sample mean is too good to be true. Since it is too good to be true, the only thing that can ever happen is that it will go down. So, if the true mean is 1, the sample mean will still converge to 1 as we collect more samples. In fact, all of the means will eventually settle into their true values, at which point we will just be exploiting. Clearly, this helps exploration as well! If you haven't explored a bandit much, then its sample mean is going to remain very high, causing you to explore it more. In fact, it will remain higher than everything else, leaving you forced to explore it. So, in the main loop we still perform the greedy strategy, but using these optimistic means.

In [24]:
class BanditOiv: 
  def __init__(self, m, upper_limit):
    self.m = m                # True mean
    self.mean = upper_limit   # Optimistic initial estimate of the mean
    self.N = 1                # The optimistic estimate counts as one sample
    
  def pull(self):
    return np.random.randn() + self.m
  
  def update(self, X):
    self.N += 1
    self.mean = (1 - 1.0 / self.N) * self.mean + (1.0 / self.N) * X
    
def run_experiment_oiv(m1, m2, m3, N, upper_limit=10):
  bandits = [BanditOiv(m1, upper_limit), BanditOiv(m2, upper_limit), BanditOiv(m3, upper_limit)]
  
  data = np.empty(N)
  
  for i in range(N):
    
    # Pure greedy selection (no epsilon); the optimistic initial values drive exploration
    j = np.argmax([b.mean for b in bandits])
    x = bandits[j].pull() # Pull
    bandits[j].update(x) # Update
                  
    # Get data for plot
    data[i] = x
  cumulative_average = np.cumsum(data) / (np.arange(N) + 1)
    
  # Plot moving average center
  plt.plot(cumulative_average)
  plt.plot(np.ones(N) * m1)
  plt.plot(np.ones(N) * m2)
  plt.plot(np.ones(N) * m3)
  plt.xscale('log')
  plt.show()
  
  for b in bandits:
    print (b.mean)
  return cumulative_average

if __name__ == '__main__':
  oiv = run_experiment_oiv(1.0, 2.0, 3.0, 100000)
  
  # log scale plot
  plt.plot(c_1, label='eps = 0.1')
  plt.plot(oiv, label='optimistic')
  plt.legend()
  plt.xscale('log')
  plt.show()


  # linear plot
  plt.plot(c_1, label='eps = 0.1')
  plt.plot(oiv, label='optimistic')
  plt.legend()
  plt.show()
2.8323587903572154
2.876666178279565
3.0017519445266543

6. UCB1

Another method to solve the explore-exploit dilemma is the UCB1 algorithm. Intuitively, we know that if you collect 10 samples and take the mean, this is not as accurate as if you collect 1000 samples and take the mean. If you were to draw a confidence bound around the 10-sample mean, it would be much wider than the one around the 1000-sample mean.

In statistics, there is a bound called the Chernoff-Hoeffding bound. It states that the probability of the sample mean being far from the true mean shrinks exponentially with the number of samples we collect:

$$P \{|\bar{X} - \mu| \geq \epsilon\} \leq 2e^{-2\epsilon^2 N}$$

It looks complicated, but becomes less complicated when dissected slowly! It says that:

"The probability that the absolute difference between the sample mean and the true mean is greater than or equal to $\epsilon$, is less than 2 times the exponent of -2 times $\epsilon^2$ times N. Here, $N$ is the number of samples collected and $\epsilon$ is just a very small number."

The full background is outside the scope of what we are looking at in terms of RL; however, the main idea is similar to the other algorithms we have looked at so far. The idea is that we take the upper confidence bound (UCB) to be the regular sample mean, plus the square root of $2\ln N$ divided by $N_j$ (the number of times bandit $j$ has been played):

$$X_{UCB-j} = \bar{X}_j + \sqrt{\frac{2\ln N}{N_j}}$$

In other words, we are effectively choosing $\epsilon$ to be:

$$\epsilon = \sqrt{\frac{2\ln N}{N_j}}$$

So let's talk about what these symbols mean:

  • $\bar{X}_j$: Regular sample mean of the $j$th bandit
  • $N$: Total number of plays across all bandits
  • $N_j$: The number of times the $j$th bandit has been played

Now, how do we use this formula? Well, the idea is similar to the optimistic initial values method. We use the greedy strategy only, but what we are greedy with respect to is not just the sample mean, but the upper confidence bound of the sample mean. Why does this work? Well, our confidence bound depends on the ratio of two things: $\ln N$, where $N$ is the total number of times you have played, and $N_j$, the total number of times you have played bandit $j$. So, if you have played other bandits many times but you haven't played bandit $j$ very frequently ($N_j$ is small), then this ratio is going to be large. This makes the upper confidence bound higher, which causes you to select this bandit to play. At the same time, if $N_j$ is large, then the ratio is small, at which point the upper confidence bound will shrink.

One key thing about this ratio is that the top portion grows with the logarithm of $N$, hence it will grow more slowly than $N_j$ eventually. That means, as you play all of the bandits more and more, the upper confidence bounds will shrink, and you will only be using the sample means. This is okay, since we have collected lots of data at this point! At this point, we have also converged to a purely greedy strategy, based on only the sample means.
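
To get a feel for how the exploration bonus behaves, here is a tiny illustration (my own addition; the play counts are arbitrary) of $\sqrt{2\ln N / N_j}$ for a fixed total $N$ and a varying $N_j$:

```
import numpy as np

N = 10000  # total number of plays (assumed for illustration)
for nj in [1, 10, 100, 1000, 10000]:
  bonus = np.sqrt(2 * np.log(N) / nj)
  print(f"N_j = {nj:>5}  exploration bonus = {bonus:.3f}")
```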

Let's look at some pseudocode so we can understand how this works better. It is really just as simple as the optimistic initial values method, we are just replacing the formula which goes in the argmax, which is now the upper confidence bound.

for n=1..N:
    j = argmax[j']{ bandit[j'].mean + sqrt(2*ln(n) / n_j') }
    pull bandit j
    update bandit j

There is a small problem here: what happens if $N_j = 0$? We can either add a very small number to the denominator, or simply treat an unplayed bandit as having an infinite upper bound so that every bandit gets played at least once (which is what the code below does).

In [34]:
class Bandit:
  def __init__(self, m):
    self.m = m    # True mean
    self.mean = 0 # Sample mean
    self.N = 0    # Number of pulls/trials for bandit instance
    
  def pull(self):
    return self.m + np.random.randn() # True mean + noise form a random normal distribution
  
  def update(self, X):
    self.N += 1 
    self.mean = (1 - 1.0 / self.N) * self.mean + (1.0 / self.N) * X
  
def run_experiment_ucb(m1, m2, m3, N):
  bandits = [Bandit(m1), Bandit(m2), Bandit(m3)]
  data = np.empty(N)
  
  def ucb(mean, n, nj):
    # An unplayed bandit gets an infinite bound, so every bandit is tried at least once
    return float('inf') if nj == 0 else mean + np.sqrt(2 * np.log(n) / nj)
  
  for n in range(N):
    j = np.argmax([ucb(b.mean, n + 1, b.N) for b in bandits])
    x = bandits[j].pull()
    data[n] = x 
    bandits[j].update(x)

  
  for b in bandits:
    print('Bandit with true mean: ', b.m)
    print('Estimated Mean: ', b.mean)
    print('Final Number of Trials: ', b.N)
    print('------ \n')
    
  cumulative_average = np.cumsum(data) / (np.arange(N) + 1)

  plot_function(np.arange(0,N), [cumulative_average], 'UCB1 Algorithm', ['UCB1'], 
                'Iterations (N)', 'Cumulative Average', scale=True)
  return cumulative_average
  
In [35]:
if __name__ == '__main__':
  eps = run_experiment_eps(1.0, 2.0, 3.0, 0.1, 100000)
  ucb = run_experiment_ucb(1.0, 2.0, 3.0, 100000)

  # log scale plot

  plot_function(np.arange(0,100000), [eps, ucb], 'UCB1 vs. Epsilon Greedy', 
                ['Epsilon Greedy', 'UCB1'],'Iterations (N)', 
                'Cumulative Average', scale=True)

  plot_function(np.arange(0,100000), [eps, ucb], 'UCB1 vs. Epsilon Greedy', 
                ['Epsilon Greedy', 'UCB1'],'Iterations (N)', 
                'Cumulative Average', scale=False)
Epsilon =  0.1
0.9859547597835412
1.9909013725057327
3.004168615241259
-----------------------------
Bandit with true mean:  1.0
Estimated Mean:  0.6550407091416911
Final Number of Trials:  5
------ 

Bandit with true mean:  2.0
Estimated Mean:  1.556910840687059
Final Number of Trials:  11
------ 

Bandit with true mean:  3.0
Estimated Mean:  3.000485580121839
Final Number of Trials:  99983
------ 


7. Bayesian/Thompson Sampling

In order to motivate the Bayesian method (also sometimes referred to as Thompson sampling), we are going to look at confidence bounds again. Let's assume that our random variable, $X$, has a distribution associated with it:

$$X \sim Normal(\mu, \sigma^2)$$

We know intuitively that if we sample from $X$, the sample mean $\bar{X}$ calculated from 10 samples is going to be less accurate than a sample mean calculated from 1000 samples. By the central limit theorem, the sampling distribution of $\bar{X}$ is approximately Gaussian, centered at the true mean $\mu$, with variance equal to the original variance divided by the number of samples $N$:

$$\bar{X} \sim Normal\Big(\mu, \frac{\sigma^2}{N}\Big)$$

This can be seen visually below, where the blue curve is skinnier and closer to the true mean of $\mu = 0.5$, since more samples have been drawn:
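
(The figure comes from an image; here is a minimal sketch, my own addition with an assumed $\sigma = 0.5$, that produces the same kind of plot by drawing the sampling distribution of $\bar{X}$ for $N = 1000$ and $N = 10$.)

```
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

mu, sigma = 0.5, 0.5   # assumed true mean and standard deviation
x = np.linspace(0, 1, 500)

for n in [1000, 10]:
  # Sampling distribution of the sample mean: Normal(mu, sigma^2 / n)
  plt.plot(x, norm.pdf(x, loc=mu, scale=sigma / np.sqrt(n)), label=f"N = {n}")

plt.axvline(mu, linestyle="--", color="gray")
plt.legend()
plt.title("Sampling distribution of the sample mean")
plt.show()
```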

The reason that the sample mean has a distribution is because it is essentially the sum of random variables, and any function of random variables is also a random variable. So we can sum up and say that the goal of the frequentist method is:

Goal: Find $\bar{X}$, which is our estimate of the true mean (a specific point) $\mu$.

7.1 Bayesian Method

The bayesian method takes this one step further. Instead of $\mu$ being fixed, it will become a random variable as well. The basic idea behind the bayesian paradigm is that:

Data is fixed, while parameters are random.

In the bayesian paradigm, parameters have distributions, and what we would like to know is the distribution of the parameter given the data.

Goal: Using $p(\mu)$ and $p(X|\mu)$, find $p(\mu | X)$, our distribution of $\mu$ based on the data.

This should make a lot of sense; based on the data, we should be able to come up with a distribution that is far more accurate than if we didn't have the data. We refer to this as the posterior distribution:

$$p(\theta \; | \; X)$$

In the Bayesian paradigm, the key tool is Bayes' rule. We are able to flip the posterior, and compute it from the likelihood and the prior:

$$p(\theta \; | \; X) = \frac{p(X|\theta ) p(\theta)}{\int p(X|\theta)p(\theta)d\theta} \propto p(X | \theta) p(\theta) = \text{likelihood} \times \text{prior}$$

A disadvantage of the Bayesian method is that we must choose the prior ourselves, and this can sometimes have a non-negligible effect on the posterior. There are also a few problems with the above equation. The first is that it is very general, just as a differential equation is very general. The integral in the denominator cannot always be found in closed form, and we need to resort to mathematical tricks in order to make it work out. Luckily, there are special pairs of distributions (likelihood and prior) such that the posterior will have the same distribution as the prior. These are called conjugate priors.

7.2 Click-through rates

Let's do a quick example to illustrate conjugate priors with click-through rates. Click-throughs are like coin tosses, so they have a Bernoulli likelihood. If we consult a table of conjugate priors, we can see that the conjugate prior for a Bernoulli likelihood is a Beta distribution. The Beta distribution always lives between 0 and 1, which makes sense for a probability; it also has 2 parameters, $a$ and $b$. Knowing this we can set up our calculation for the posterior.

Likelihood:
The likelihood can be written as a product since each coin toss is independent: $$p(X|\theta) = \prod_{i=1}^N \theta^{1(x_i=1)}(1-\theta)^{1(x_i=0)}$$

Prior:
The prior is simply the PDF of the Beta distribution: $$p(\theta) = \frac{1}{B(a,b)}\theta^{a-1}(1-\theta)^{b-1}$$

To find the posterior, we need to combine the likelihood and the prior and solve for the posterior.

$$p(\theta|X) \propto \Big[ \prod_{i=1}^N \theta^{1(x_i=1)} (1-\theta)^{1(x_i=0)} \Big] \frac{1}{B(a,b)}\theta^{a-1}(1-\theta)^{b-1}$$

We can drop the beta function, $B(a,b)$, out of the proportionality, since it gets absorbed into the normalization constant. The key here is to see that the posterior also takes the form of a beta distribution, so we can infer what the new parameters of the Beta are:

$$p(\theta|X) \propto \theta^{a + \sum_{i=1}^N 1(x_i=1) - 1} (1-\theta)^{b + \sum_{i=1}^N 1(x_i=0) - 1 }$$

This has the form of a Beta distribution with new parameters $a'$ and $b'$: $$p(\theta|X) \propto \theta^{a'-1} (1-\theta)^{b'-1}$$

Where we can define the new parameters as:

$$a' = a + \sum_{i=1}^N 1(x_i=1)$$
$$b' = b + \sum_{i=1}^N 1(x_i=0)$$

Or, simply put:

$$a' = a + \text{number of 1's}$$
$$b' = b + \text{number of 0's}$$

We can look at a demo of this below:

In [27]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta
%matplotlib inline

def plot(a, b, trial, ctr):
  x = np.linspace(0, 1, 200)
  y = beta.pdf(x, a, b)
  mean = float(a) / (a + b)
  plt.plot(x, y)
  plt.title("Distributions after %s trials, true rate = %.1f, mean = %.2f" % (trial, ctr, mean))
  plt.show()

true_ctr = 0.3
a, b = 1, 1 # beta parameters
show = [0, 5, 10, 25, 50, 100, 200, 300, 500, 700, 1000, 1500]
for t in range(1501):
  coin_toss_result = (np.random.random() < true_ctr)
  if coin_toss_result:
    a += 1
  else:
    b += 1

  if t in show:
    plot(a, b, t+1, true_ctr)

Above, we plotted the distribution periodically. You should have been able to notice that the distribution started off very fat. As it was updated, it became skinnier and taller. The area under the PDF is always 1, so whenever it gets skinnier it must get taller. The reason it is becoming skinnier is that we are becoming more confident in our estimate of the parameter. In other words, this is a natural way to represent confidence or confidence intervals. Also, notice that this is an instance of online learning. The distribution becomes more precise after every coin toss, not just after we collect all of the data. We will see that all of the learning methods we look at are able to update iteratively.

7.3 Gaussian Likelihood/Prior

In the notebook for section 4 of Bayesian Machine Learning, we saw that a Gaussian likelihood is conjugate with a Gaussian prior on the mean. If we place a prior only on the mean $\mu$ and assume the variance is fixed, we can start with these forms of the likelihood and prior:

$$X \sim N(\mu, \tau^{-1}), \qquad \mu \sim N(m_0, \lambda_0^{-1})$$

We can see above that we are using precision parameters (the inverse of the variance), which makes the math a little easier to work with. The parameters of the $\mu$ distribution are $m$ and $\lambda$, so what we are looking for are the update equations for $m$ and $\lambda$. In other words, our prior distribution over $\mu$ is $p(\mu)$, and we are trying to find the posterior $p(\mu | X)$.

Likelihood:
So, for clarity, we define our likelihood as:

$$p(X|\mu)=\prod_{n=1}^N \frac{1}{\sqrt{\frac{2 \pi}{\tau}}} exp\Big(\frac{-\tau}{2} (x_n - \mu)^2\Big)$$

Prior:
And our prior is:

$$p(\mu) = \frac{1}{\sqrt{\frac{2 \pi}{\lambda_0}}} exp\Big(\frac{-\lambda_0}{2} (\mu - m_0)^2\Big)$$

Step 1
So our first step is to multiply the likelihood and prior to get the proportionality for the posterior:

$$p(\mu |X) \propto \Big[\prod_{n=1}^N \frac{1}{\sqrt{\frac{2 \pi}{\tau}}} exp\Big(\frac{-\tau}{2} (x_n - \mu)^2\Big) \Big] \frac{1}{\sqrt{\frac{2 \pi}{\lambda_0}}} exp\Big(\frac{-\lambda_0}{2} (\mu - m_0)^2\Big) $$

Step 2
We can now drop some constants and just keep what is in the exponent, since we don't care about the normalization constant:

$$p(\mu | X) \propto exp\Big(\frac{-\lambda_0}{2} (\mu - m_0)^2 - \frac{\tau}{2} \sum_{n=1}^N (x_n^2 - 2x_n \mu + \mu^2)\Big)$$

Step 3
We now collect terms by order of $\mu$; we do not care about any terms that don't involve $\mu$, since they go away with the normalizing constant.

$$p(\mu|X) \propto exp \Big[ \frac{-1}{2} \Big( (\lambda_0 + \tau N)\mu^2 - 2\Big(m_0\lambda_0 + \tau\sum_{n=1}^N x_n\Big)\mu + ... \Big) \Big]$$

Step 4
The next step is to recognize the form of the posterior that we are looking for. From our posterior, we want the form of what is on the left below; this is just what goes inside the exponent of the gaussian:

$$\frac{-1}{2}\lambda (\mu - m)^2 = \frac{-1}{2}(\lambda \mu^2 - 2m\lambda \mu + ...)$$

But this is what we just found!!!

Step 5
This means that our next step is to just equate the two by the term in $\mu$; the squared term goes with the squared term, and the linear term goes with the linear term.

$$\lambda = \lambda_0 + \tau N$$$$m \lambda = m_0 \lambda_0 + \tau \sum_n x_n \longleftrightarrow m = \frac{m_0\lambda_0 + \tau \sum_n x_n}{\lambda_0 + \tau N}$$

Note: the precision term is simpler to solve for, and $m$ must be solved for afterwards because it depends on $\lambda$. We now have our solution for the new $m$ and the new $\lambda$, based on the prior parameters and the data we have collected.
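
As a concrete check of these update equations (my own addition; the precision, prior, and data are arbitrary choices), the batch posterior parameters can be computed directly:

```
import numpy as np

np.random.seed(1)
tau = 1.0                # known precision of the likelihood (assumed)
m0, lambda0 = 0.0, 1.0   # prior: mu ~ N(m0, 1 / lambda0)
X = np.random.randn(50) + 2.0   # 50 samples from a bandit with true mean 2

# Posterior parameters from Step 5
lam = lambda0 + tau * len(X)
m = (m0 * lambda0 + tau * X.sum()) / lam

print("posterior precision lambda:", lam)
print("posterior mean m:          ", m)
```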

7.4 Application to the bandit problem

The last step of the above process is to talk about how we apply it to the bandit problem. The answer is sampling. The key part is again the line of code where we choose the bandit. Rather than choose the argmax of the upper confidence bound as we did with UCB1, we are going to choose the argmax of samples drawn from each bandit's mean. Remember, the means are now probability distributions, so we can draw samples from them.

How does this help us explore and exploit? Well, in the demo above we saw that the distribution always starts out as fat, and then gradually becomes skinnier. When the distribution is fat, we still have a high chance of sampling a very high number, and hence we are encouraged to explore more often. When the distribution becomes skinny, and we are more confident in our estimates, we explore less and exploit more, since the sample will always be in a very narrow range. So, sampling from the bayesian posterior automatically controls how much exploration and exploitation we do!


7.5 Bayesian Sampling in Code

Below we will implement Bayesian sampling in code. Before we do, however, there are a few things to clarify.

  • Our implementation is going to work in the context of online learning. This means that we are able to update our parameters after each sample is collected. In our derivation above, the update of the parameters occurred when considering all of the samples simultaneously, meaning we made an update when looking at $X_1...X_N$. This means that our update rules will be slightly modified, since each update only incorporates one new sample:

    Modified Updates:

    $$\lambda \leftarrow \lambda + \tau$$
    $$m = \frac{m_0\lambda_0 + \tau \sum_n x_n}{\lambda} = \frac{m_0\lambda_0 + \tau \sum_n x_n}{\lambda_0 + \tau N}$$

    With a prior mean of $m_0 = 0$ (which is what the code below uses), the second update reduces to $m = \frac{\tau \sum_n x_n}{\lambda}$.


  • When implementing, it becomes tricky to understand what is doing what and why, so let's walk through a quick potential iteration. Assuming we are dealing with 3 bandits, $bandit_1$, $bandit_2$, and $bandit_3$, we need to keep in mind that each one is essentially a random variable! $bandit_1$, which could be represented as $X_1$, is defined as:

    $$X_1 \sim N(\mu_1, \tau_1^{-1})$$
    This random variable is parameterized by its true mean, $\mu_1$, and its variance (here we are using the precision $\tau_1$). Now, the frequentist method would have us treat $\mu_1$ as fixed. However, the Bayesian approach is to state that $\mu_1$ is also a random variable and that it has its own distribution! That is defined as:

    $$\mu_1 \sim N(m_0^1, \lambda_0^{-1})$$
    Now, this should make sense so far. However, where things start to become confusing is when we begin sampling, pulling, and updating. The exact process is as follows.

1) Sample (from prior)
So at the start, each bandit has a distribution defining its mean, $\mu$. Again, these are: $$\mu_1 \sim N(m_0^1, \lambda_0^{-1})$$
$$\mu_2 \sim N(m_0^2, \lambda_0^{-1})$$
$$\mu_3 \sim N(m_0^3, \lambda_0^{-1})$$

These represent our prior knowledge of $\mu$, and can be called: $$p(\mu)$$

The first step is to sample from each bandit's prior distribution over its mean. Because it is a distribution, this process is inherently probabilistic and has exploring and exploiting beautifully built right in. After we sample from each respective bandit's prior $\mu$ distribution, we take the bandit that gives us the argmax of the samples. Let's say that sampling from $bandit_1$'s $p(\mu_1)$ returned 1.4, $bandit_2$'s returned 1.6, and $bandit_3$'s returned 1.1. We would then use $bandit_2$ for the next two steps.

2) Pull
So, now that we have selected $bandit_2$, we go ahead and pull it. Keep in mind that this means we are working with the true distribution again! In other words, we are now pulling the bandit, which samples from the real underlying distribution (not our predicted/estimated one!): $$X_2 \sim N(\mu_2, \tau_2^{-1})$$
Note: it may be confusing, but keep in mind that in the above equation $\mu_2$ is the true mean of the bandit's distribution, which is different from the predicted mean that we were using in step one.

Now, after pulling (sampling from) the true distribution of $bandit_2$, we have just generated a new data point, $X_2$! This new data point allows us to work with the likelihood: $$p(X|\mu)$$
This should make sense! We have just gotten more data for $bandit_2$, which we can incorporate into a likelihood, and use in conjunction with our prior...

3) Update (to get posterior)
Now that we have our new piece of data, $X$, we can multiply our prior by the likelihood to get an updated posterior over $bandit_2$'s mean, $\mu_2$! $$p(\mu_2|X)$$
The entire reason that we went through the extensive derivation earlier was so that we would not need to do this multiplication by hand, but could instead just apply the update rules in code. Specifically, in code our updates will be defined as: $$\lambda \leftarrow \lambda + \tau$$

``` self.lambda_ += self.tau ```


$$m = \frac{\tau \sum_n x_n}{\lambda}$$ (the $m$ update from above, with the prior mean $m_0 = 0$)

``` self.predicted_mean = self.tau * self.sum_x / self.lambda_ ```
In [28]:
import numpy as np
import matplotlib.pyplot as plt
In [29]:
class BayesianBandit:
  def __init__(self, true_mean):
    """mu is no longer something we are trying to find a fixed estimate of;
    it is a random variable whose distribution has fixed parameters. Bayesian methods 
    are a little tricky in the sense that your parameters are random variables, but the 
    parameters of the distributions of the original parameters can also be rv's."""
    self.true_mean = true_mean
    # parameters for mu - prior is N(0,1)
    self.predicted_mean = 0 
    self.lambda_ = 1
    self.sum_x = 0 # For convenience 
    self.tau = 1 
    
  def pull(self):
    return np.random.randn() + self.true_mean
  
  def sample(self):
    "New method, generates sample from gaussian with mean m0 and precision lambda0."
    return np.random.randn() / np.sqrt(self.lambda_) + self.predicted_mean
  
  def update(self, x):
    # Assume tau is 1 
    self.lambda_ += self.tau 
    self.sum_x += x 
    self.predicted_mean = self.tau * self.sum_x / self.lambda_
    
def run_experiment_bayes(m1, m2, m3, N):
  bandits = [BayesianBandit(m1), BayesianBandit(m2), BayesianBandit(m3)]
  
  data = np.empty(N)
  
  for i in range(N):
    # Thompson sampling: choose the bandit whose posterior sample is largest
    j = np.argmax([b.sample() for b in bandits])
    x = bandits[j].pull()
    bandits[j].update(x)
    
    # For the plot
    data[i] = x
  cumulative_average = np.cumsum(data) / (np.arange(N) + 1)
  
  # plot of moving average
  plt.plot(cumulative_average)
  plt.plot(np.ones(N)*m1)
  plt.plot(np.ones(N)*m2)
  plt.plot(np.ones(N)*m3)
  plt.xscale('log')
  plt.show()

if __name__ == '__main__':
  run_experiment_bayes(1.0, 2.0, 3.0, 100000)

9. Compare Explore-Exploit Solutions

In this lecture we are going to compare all of the explore-exploit solutions that we have looked into so far:

  • Epsilon-greedy
  • Optimistic Initial Values
  • UCB1
  • Thompson Sampling

Because we already know that epsilon-greedy performs suboptimally, we are going to modify it. We will use a decaying epsilon, setting $\epsilon(t) = \frac{1}{t}$, where $t$ is the iteration number. We will see that in the end, all of these methods perform approximately the same, which justifies using epsilon-greedy during the course.

In [36]:
"Ensure we have correct bandit class defined."
class Bandit:
  def __init__(self, m):
    self.m = m
    self.mean = 0
    self.N = 0

  def pull(self):
    return np.random.randn() + self.m

  def update(self, x):
    self.N += 1
    self.mean = (1 - 1.0/self.N)*self.mean + 1.0/self.N*x
    
def run_experiment_decaying_epsilon(m1, m2, m3, N):
  bandits = [Bandit(m1), Bandit(m2), Bandit(m3)]

  data = np.empty(N)
  
  for i in range(N):
    # epsilon greedy
    p = np.random.random()
    if p < 1.0/(i+1):
      j = np.random.choice(3)
    else:
      j = np.argmax([b.mean for b in bandits])
    x = bandits[j].pull()
    bandits[j].update(x)

    # for the plot
    data[i] = x
  cumulative_average = np.cumsum(data) / (np.arange(N) + 1)

  # plot moving average ctr
  plt.plot(cumulative_average)
  plt.plot(np.ones(N)*m1)
  plt.plot(np.ones(N)*m2)
  plt.plot(np.ones(N)*m3)
  plt.xscale('log')
  plt.show()

  for b in bandits:
    print(b.mean)

  return cumulative_average

def run_experiment(m1, m2, m3, N):
  bandits = [BayesianBandit(m1), BayesianBandit(m2), BayesianBandit(m3)]

  data = np.empty(N)
  
  for i in range(N):
    # Thompson sampling: choose the bandit whose posterior sample is largest
    j = np.argmax([b.sample() for b in bandits])
    x = bandits[j].pull()
    bandits[j].update(x)

    # for the plot
    data[i] = x
  cumulative_average = np.cumsum(data) / (np.arange(N) + 1)

  # plot moving average ctr
  plt.plot(cumulative_average)
  plt.plot(np.ones(N)*m1)
  plt.plot(np.ones(N)*m2)
  plt.plot(np.ones(N)*m3)
  plt.xscale('log')
  plt.show()

  return cumulative_average

if __name__ == '__main__':
  m1 = 1.0
  m2 = 2.0
  m3 = 3.0
  eps = run_experiment_decaying_epsilon(m1, m2, m3, 100000)
  oiv = run_experiment_oiv(m1, m2, m3, 100000)
  ucb = run_experiment_ucb(m1, m2, m3, 100000)
  bayes = run_experiment(m1, m2, m3, 100000)

  # log scale plot
  plt.plot(eps, label='decaying-epsilon-greedy')
  plt.plot(oiv, label='optimistic')
  plt.plot(ucb, label='ucb1')
  plt.plot(bayes, label='bayesian')
  plt.legend()
  plt.xscale('log')
  plt.show()


  # linear plot
  plt.plot(eps, label='decaying-epsilon-greedy')
  plt.plot(oiv, label='optimistic')
  plt.plot(ucb, label='ucb1')
  plt.plot(bayes, label='bayesian')
  plt.legend()
  plt.show()
1.0281271119460387
2.040459091302857
2.9973873612854183
2.855670402675746
2.958728468522876
3.004105183100729
Bandit with true mean:  1.0
Estimated Mean:  0.9410709449352603
Final Number of Trials:  6
------ 

Bandit with true mean:  2.0
Estimated Mean:  1.7489698462339933
Final Number of Trials:  16
------ 

Bandit with true mean:  3.0
Estimated Mean:  2.9968687878746367
Final Number of Trials:  99977
------ 


10. Non-Stationary Bandits

What exactly do we mean by nonstationarity?

A stationary process is a process whose statistics do not change over time.

We are interested in the mean, but strictly speaking, there are different types of stationarity.

Weak-sense stationarity is when the mean (1st order statistic) and autocovariance (2nd order statistic) don't change over time.

Strong-sense stationarity is when the entire PDF must remain the same over time.

Strong-sense stationarity is the type that we are dealing with in the bandit scenario, because each bandit has the same distribution every time we pull its arm. The question is, what if our bandits were not stationary? Would it then make sense to calculate the mean as we have been?

10.1 Mean Update Equation

Recall the mean update equation that we went over earlier:

$$\bar{X}_t = (1 - \frac{1}{t})\bar{X}_{t-1} + \frac{1}{t}X_t$$

Let's call $\bar{X}$, $Q$ for convenience: $$Q_t = \bar{X}_t$$

$$Q_t = (1 - \frac{1}{t})Q_{t-1} + \frac{1}{t}X_t$$

We can rearrange the terms, so that we have the $\frac{1}{t}$ term all by itself:

$$Q_t = Q_{t-1} - \frac{1}{t}(Q_{t-1} - X_t)$$

We can generalize this even further by replacing the $\frac{1}{t}$ term with a learning rate $\alpha$, so that it looks a bit like gradient descent. The key with this equation is that we don't have to use $\alpha = \frac{1}{t}$ as the learning rate. We can rewrite our above equation as:

$$Q_t = (1 - \alpha)Q_{t-1} + \alpha X_t$$

If we unroll the recurrence (taking $Q_0 = 0$):

$$Q_t = \alpha \sum_{\tau = 0}^{t-1} (1 - \alpha)^\tau X_{t - \tau}$$

We can now see that $Q$ has an exponentially decaying dependence on $X$. This means that there is more emphasis on recent values of $X$ than on past values of $X$, so it acts like a moving average or low-pass filter.

In [37]:
import numpy as np
import matplotlib.pyplot as plt
import theano
import theano.tensor as T


X = 2*np.random.randn(300) + np.sin(np.linspace(0, 3*np.pi, 300))
plt.plot(X)
plt.title("original")
plt.show()

decay = T.scalar('decay')
sequence = T.vector('sequence')

def recurrence(x, last, decay):
  return (1-decay)*x + decay*last

outputs, _ = theano.scan(
  fn=recurrence,
  sequences=sequence,
  n_steps=sequence.shape[0],
  outputs_info=[np.float64(0)],
  non_sequences=[decay]
)

lpf = theano.function(
  inputs=[sequence, decay],
  outputs=outputs,
)

Y = lpf(X, 0.99)
plt.plot(Y)
plt.title("filtered")
plt.show()
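
Theano is no longer maintained, so here is an equivalent NumPy-only sketch (my own addition) of the same exponential filter; the recurrence and decay value match the Theano version above, only the framework differs.

```
import numpy as np
import matplotlib.pyplot as plt

X = 2 * np.random.randn(300) + np.sin(np.linspace(0, 3 * np.pi, 300))

def low_pass(sequence, decay):
  """y[t] = (1 - decay) * x[t] + decay * y[t-1], starting from y = 0."""
  out = np.empty(len(sequence))
  last = 0.0
  for t, x in enumerate(sequence):
    last = (1 - decay) * x + decay * last
    out[t] = last
  return out

plt.plot(low_pass(X, 0.99))
plt.title("filtered (NumPy)")
plt.show()
```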

Convergence Criteria for Q

There is a very popular result that is often seen in RL literature, which talks about the convergence of $Q$. The result is that $Q$ will converge if the sum of all $\alpha$'s is infinity:

$$\sum_{t=1}^\infty \alpha(t) = \infty$$

And the sum of all of the squares of the $\alpha$'s is finite:

$$\sum_{t=1}^\infty \alpha (t)^2 < \infty$$

If you think back to calculus, it should be clear that a constant $\alpha$ will not meet these two conditions. However, $\alpha = \frac{1}{t}$ does! The reason that you often see a constant $\alpha$ used is because RL problems are often non-stationary, and we do not want convergence! It does not make sense to calculate a sample mean over every sample ever collected.
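
A quick numerical illustration (my own addition): partial sums of $\alpha(t) = \frac{1}{t}$ keep growing without bound, while partial sums of $\alpha(t)^2 = \frac{1}{t^2}$ level off near $\pi^2/6 \approx 1.645$:

```
import numpy as np

t = np.arange(1, 10**6 + 1)
alpha = 1.0 / t

print("partial sum of alpha  :", alpha.sum())         # grows like ln(t); diverges
print("partial sum of alpha^2:", (alpha ** 2).sum())  # converges towards pi^2 / 6
```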


© 2018 Nathaniel Dake