Nathaniel Dake Blog

2. Bayesian A/B Testing

In this post we are going to discuss Bayesian Methods and their application to Bayesian A/B testing. In particular we will:

  • Look at the shortcomings of standard frequentist A/B testing
  • Motivate why Bayesian Methods are an improvement
  • Implement Bayesian A/B Testing in code

Note that I am writing with the assumption that you have gone through my post on statistical inference and frequentist A/B testing, or are very comfortable with both of those topics. With that out of the way, let's get started.

1. Drawbacks of Frequentist A/B Testing

To begin, consider the following example:

A drug you are testing is working well. Can you stop the test early and improve the quality of life of all participants?

Standard frequentist statistics would say no. This is because it is bad to stop early when doing frequentist statistics, since it increases the chances that you reach a false positive conclusion. Remember, the $p$-value can drift below and above the significance threshold over the course of an experiment. Clearly this can be a shortcoming in many scenarios, and in some it can even be deadly.

1.1 Multi-Armed Bandit

Now, imagine the following scenario, commonly referred to as the Multi-Armed Bandit problem: you are at the casino playing slots (each machine has an arm that you pull, hence the term). The slot machines are bandits because they are taking your money. Not all of the slot machines are equal. Specifically:

  • 1 pays out 30% of the time
  • 1 pays out 20% of the time
  • 1 pays out 10% of the time

This is the same problem as the click-through or conversion rate! Only now we substitute a click for getting a prize. So, if we wanted to find out which arm we should pull to receive the largest payout over time, we would set up an A/B test, run the experiment, and by the end we would know if any one arm was significantly better than the others. In other words, we would run the A/B test for a specific number of trials $N$, then calculate the $p$-value.

Now, let me ask another question: what would you do in real life? For instance, say that one arm gives you a prize 3 out of 3 times, and the other arm gives you a prize 0 out of 3 times. Well, most people (myself included) would intuitively play the arm that paid out 3 times! Why is this? Because we adapt. Even though three plays per arm is most likely not enough to reach statistical significance, most people still feel compelled to believe that the first arm is better.

1.2 Reinforcement Learning

If you have gone through my posts on AI, you will certainly have seen those related to Reinforcement Learning, which faces the same problem we face here. Specifically, in RL we are trying to teach a machine to play a game. It models the rewards it gets based on the actions it takes. The problem is that this process is stochastic, and the model of the rewards is only an estimate (it is not feasible to try every possible action in every possible state). In general, early on few actions have been taken, and the machine is unsure about most of the rewards. It can't "choose the action that leads to the best reward", because its knowledge of the rewards is still minimal. Only after collecting a lot of data will the estimate be accurate.

1.3 Explore vs. Exploit

This problem that we are dealing with has a name: The Explore-Exploit Dilemma. We can use this name to frame the following question:

If we have gotten 3/3 on bandit 1, and 0/3 from bandit 2, should we:

  • Exploit bandit 1 more?
  • Explore at random to gather more data?

This is the problem that we are going to try to solve in this post. We will look at several solutions, all of which are adaptive, but only some of which are Bayesian. We will cover the Bayesian method last.

2. Adaptive Learning Methods

2.1 Epsilon-Greedy Solution

One method of solving the explore-exploit dilemma is called the epsilon-greedy algorithm (and it is intuitive). It just so happens that this algorithm is also used in reinforcement learning. Here is how it works:

Say you have some software that is serving 2 advertisements ($A$ and $B$). You will not run a full A/B test by blindly serving ad $A$ and ad $B$ an equal number of times; instead, you will adapt to the data you have collected. If ad $A$ is performing better, it will be shown more often.

Note: even with this adaptive system, you can still run a traditional A/B test after collecting data. Remember, the contingency table for the chi-square test doesn't require both samples to be the same size. The thing changing here is not the test, but the machine that serves the ads: it will adapt based on the performance of each ad.

How does it work?

It is simpler than it sounds! We start by choosing a small number epsilon ($\epsilon$) between 0 and 1 which will serve as our probability of exploration. Typical values are 5% or 10%. In pseudocode it may look like:

epsilon = 0.1 
while True:
    r = rand()
    if r < epsilon:
        # explore
        # show random advertisement
    else:
        # exploit
        # show best advertisement (as determined by #clicks/#impressions)
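
To make this concrete, here is a minimal runnable sketch of the loop above. The ground-truth CTRs, seed, and variable names are hypothetical, chosen purely for illustration:

import numpy as np

np.random.seed(0)
true_ctrs = [0.20, 0.50, 0.75]  # ground-truth CTRs (unknown to the algorithm)
clicks = np.zeros(len(true_ctrs))
impressions = np.zeros(len(true_ctrs))
epsilon = 0.1

for _ in range(10000):
    if np.random.random() < epsilon:
        j = np.random.randint(len(true_ctrs))  # explore: pick a random ad
    else:
        # exploit: ad with the best empirical CTR so far (avoid dividing by zero)
        j = np.argmax(clicks / np.maximum(impressions, 1))
    x = np.random.random() < true_ctrs[j]      # did the user click?
    impressions[j] += 1
    clicks[j] += x

print(clicks / impressions)  # empirical CTR estimates per ad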

Analysis of Epsilon-Greedy

That was pretty simple, so let's analyze it further. One problem is that this algorithm does the same thing forever. For instance, even when ad $A$ is statistically significantly better than $B$, it will still sometimes choose $B$. Suppose the test has converged and $A$ is better than $B$; how much reward will we get? If we say that:

$$reward(click) = 1 \;,\; reward(noclick) =0$$

Then after $N$ impressions:

$$reward = N(1-\frac{\epsilon}{2})$$

In other words, for $\epsilon = 0.1$, we get $0.95N$ instead of $N$. However, this is still better than traditional A/B testing with no adaptation.

2.2 UCB1

Alright, so we have now looked at one adaptive learning method. The question is, can we improve upon it? To find out, let's look at the UCB1 algorithm. Recall earlier discussions about confidence intervals. We determined an upper and lower limit to represent where we believe the true CTR is. We'll now look at a very similar idea, but we are not going to use the confidence bound we discussed before.

If you are unsure how confidence intervals relate to the scenario at hand, consider the following: intuitively, we know that if you collect 10 samples and take the mean, it is not as accurate as if you collect 1000 samples and take the mean. If you were to draw a confidence bound around the 10-sample mean, it would be much wider than the one around the 1000-sample mean. This concept can be used algorithmically in order to dynamically decide when each advertisement should be served.

Now, to be clear, we are not going to be using the exact same confidence interval that we have worked with before. Rather, we are going to use a tighter bound, the Chernoff-Hoeffding bound. It says that for any $\epsilon > 0$, the following probability inequality holds:

$$P \big(\hat{\mu}>\mu+\epsilon \big) \leq e^{-2\epsilon^2N} $$

It also follows that the opposite side is true:

$$P \big(\hat{\mu} < \mu - \epsilon \big) \leq e^{-2\epsilon^2N} $$

Note, this bound is a little different from the confidence interval that we have seen before. Namely, the confidence interval gave us specific numbers, and said that the probability that the true parameter lies within that interval is 95% (this isn't entirely accurate, but it is an approximate description):

$$P(\text{lower bound} < \mu < \text{upper bound} ) = 95\%$$

On the other hand, our new equation says that:

$$P(\mu \text{ in interval}) > f(N)$$

Or in English, "the probability that the true parameter lies within this range is greater than some function of $N$". In order to get the greater-than sign, we just need to combine and rearrange the two probabilities:

$$P(\hat{\mu} - \mu > \epsilon) \leq e^{-2\epsilon^2N} \longleftrightarrow P(\hat{\mu} - \mu \leq \epsilon) > 1 - e^{-2\epsilon^2N} $$

$$P(\hat{\mu} - \mu < - \epsilon) \leq e^{-2\epsilon^2N} \longleftrightarrow P(\hat{\mu} - \mu \geq - \epsilon) > 1 - e^{-2\epsilon^2N} $$

$$P \big( |\hat{\mu} - \mu| \leq \epsilon \big) > 1 - 2 e^{-2\epsilon^2N} $$
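
As a quick sanity check, here is a small sketch that evaluates this lower bound on the coverage probability for a few sample sizes. The choice $\epsilon = 0.05$ is just an illustration; note that for small $N$ the bound can be vacuous (negative):

import numpy as np

eps = 0.05  # hypothetical accuracy level
for N in [100, 1000, 10000]:
    bound = 1 - 2 * np.exp(-2 * eps**2 * N)
    print(f"N = {N:>5}: P(|mu_hat - mu| <= {eps}) > {bound:.4f}")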

Upper Confidence Bound

In any case, we are only interested in the upper bound, because we want the highest possible reward. How exactly does this help us? Well, intuitively, a tighter upper bound means we can be more confident about the maximum possible win rate of any bandit. We are skipping some math here, but essentially for any arm $j$, we "choose" $\epsilon$ to be:

$$\epsilon_j = \sqrt{\frac{2 \ln N}{N_j}}$$

Where $N$ is the total number of games played so far, and $N_j$ is the total times that arm $j$ has been played. We can define the upper confidence bound via:

$$\mu_{UCB-j} = \hat{\mu}_j + \epsilon_j$$

$$\mu_{UCB-j} = \hat{\mu}_j + \sqrt{\frac{2 \ln N}{N_j}}$$

So, in other words, we are picking whichever bandit arm gives us the largest upper confidence bound for any given turn. Again, as with epsilon-greedy, we will be using a greedy strategy. However, we are not simply greedy with respect to the sample mean $\hat{\mu}_j$, but rather the upper confidence bound of the sample mean. In pseudocode, we would have:

N = 0
N_j = 0 for all bandits j

while True:
    j* = argmax(mu_j + sqrt( (2ln(N)) / (N_j) ))
    Play arm j*
    Update mu_j*
    N++; Nj*++

In English, the above can be described as follows:

  1. We set $N$ and $N_j$ equal to 0, for all bandits $j$.
  2. Then, in a loop, we take the $argmax$ of the estimated mean, $\hat{\mu}_j$, plus the epsilon that we just defined, $\epsilon_j = \sqrt{\frac{2 \ln N}{N_j}}$. In other words, we choose the bandit with the highest upper bound.
  3. We play that bandit's arm, and then we update the estimated mean for that arm, $\hat{\mu}_j$.
  4. Finally, we increment $N$ and $N_j$.
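
Below is a minimal runnable UCB1 sketch. The ground-truth CTRs and the initialization (playing each arm once so that $N_j > 0$ before the formula is applied) are my own illustrative assumptions:

import numpy as np

np.random.seed(1)
true_ctrs = [0.20, 0.50, 0.75]
wins = np.zeros(len(true_ctrs))
plays = np.zeros(len(true_ctrs))

# Play each arm once so that N_j > 0
for j in range(len(true_ctrs)):
    wins[j] += np.random.random() < true_ctrs[j]
    plays[j] += 1

N = len(true_ctrs)
for _ in range(10000):
    ucb = wins / plays + np.sqrt(2 * np.log(N) / plays)
    j = np.argmax(ucb)  # arm with the highest upper confidence bound
    wins[j] += np.random.random() < true_ctrs[j]
    plays[j] += 1
    N += 1

print(wins / plays, plays)  # empirical means and play counts per arm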

How does this help with the Explore vs. Exploit problem?

You may look at this algorithm and wonder how it helps us explore and exploit. Well, let's look at where the main logic of this algorithm is held, the following line:

$$j^* = argmax_j \Big( \hat{\mu}_j + \sqrt{\frac{2 \ln N}{N_j}} \Big)$$

Let's now look at each term individually. The first term is just the estimated mean:

$$\hat{\mu}_j$$

If that is high, then we will be exploiting it more often. If you look at the second term:

$$\sqrt{\frac{2 \ln N}{N_j}}$$

You will see that it depends on $N$, the total number of plays, and $N_j$, the number of plays for arm $j$. If $N$ is large while $N_j$ is low, the term is large, which encourages us to explore arm $j$ more because it inflates the upper confidence bound. We should also consider the asymptotic behavior of:

$$\frac{\ln N}{N}$$

As $N \rightarrow \infty$, $\frac{\ln N}{N}$ approaches 0, so the exploration bonus vanishes and for large $N$ we rely only on the estimated means $\hat{\mu}_j$.

We can see that from a coding/implementation perspective, this is no more complex than the epsilon-greedy algorithm. The right question to ask at this point, however, is what our expected loss from this algorithm is. With a good deal of math it can be shown that the loss is proportional to $\ln N$. Compare this to epsilon-greedy, where the loss is proportional to $N$. This means that in the long run, UCB1 will perform much better than epsilon-greedy.

3. Bayesian Adaptive Learning

3.1 Bayesian Paradigm

We are finally approaching the point of implementing full-blown Bayesian learning. But before we do, I think it would be useful to make a distinction as to what the Bayesian Paradigm really is, and how it differs from the frequentist paradigm.

Now, intuitively we know that as we collect more data, our estimates will become more accurate. As far as frequentist statistics are concerned, the method that we use to determine how accurate our measurements are is the confidence interval.

One key point to remember is that the estimated click-through rate (or whatever we are trying to measure) has a distribution! To be clear, an estimated parameter (like the mean) is a random variable, because it is the sum of independent and identically distributed random variables. Therefore, it has a distribution that is approximately Gaussian, due to the central limit theorem.

So, a very important distinction that we want to make is that the true CTR is fixed. It is something that we are trying to find by obtaining data. We call this Frequentist Statistics. In frequentist statistics:

  • The parameters of the distribution are fixed (think population mean); we just don't know what they are.
  • Data is then randomly generated from those distributions (which are based on the population parameters)

On the other hand, we have Bayesian Statistics. This essentially works in the opposite way. We give the parameter a distribution. It becomes a random variable that has an actual probability distribution. The data on the other hand is fixed. You can make the argument that this better reflects reality because data is fixed. This is what happens in the real world. There is no randomness to a coin toss result after the fact. In this way we can then model the distribution of the parameter, given some data (i.e. $p(parameter \mid data)$).

To be clear, we can classify the two paradigms as follows:

Frequentist Statistics

When doing frequentist statistics we are trying to find the argmax of the likelihood. In other words, the value of the parameter that maximizes the likelihood of our data, $X$. This is a point estimate.

$$\hat{\theta} = argmax_{\theta} P(X \mid \theta)$$

For example, when using the frequentist paradigm, we measure things like mean and click through rate with point estimates:

$$\frac{\sum X} {N}$$

The problem that you may see is that the above method doesn't take into account how accurate those estimates are. The frequentist solution is to use the central limit theorem to show that the sampling distribution of the estimate is approximately Gaussian, and from there we can get the upper and lower bounds of a 95% confidence interval.

So, the frequentist paradigm:

  1. Calculates the likelihood: the probability of your data, given the parameter. For example, the probability that you observed the data that you did, given a Gaussian distribution with mean $\mu = 4$.
  2. Maximizes the likelihood with respect to the parameter in question. For example, find the mean $\mu$ (and associated Gaussian distribution) that maximizes the probability of observing the data that you did.
  3. This leaves us with a maximum likelihood estimate for that parameter.

This can all be represented by the equation we saw earlier:

$$\hat{\theta} = argmax_{\theta} P(X \mid \theta)$$

It is general, but think of the case: the likelihood of the data we observed ($X$) maximized by a given value of the mean $(\theta = \mu)$. The final value for $\hat{\theta}$ is a point estimate, not a distribution.

Bayesian Statistics

When we do Bayesian Statistics, we are trying to find the posterior distribution:

$$P(\theta \mid X)$$

This is the probability of a parameter, given the data. Data is given, and data is fixed. So, in bayesian statistics our parameter, $\theta$, has a distribution! This is the opposite of frequentist statistics where $\theta$ is a point estimate.

So, in Bayesian Statistics we treat $\theta$ as a random variable too. Because of this, and the subsequent fact that it has a distribution, we can use the shape of its distribution to tell us how confident we are in any value of that parameter. We can apply Bayes' rule to help us with this:

$$P(\theta \mid X) = \frac{P(X \mid \theta) P(\theta)}{P(X)} = \frac{P(X \mid \theta) P(\theta)}{Z}$$

Where above:

$$Z \rightarrow \text{normalizing constant}$$

$$P(X \mid \theta) \rightarrow \text{Likelihood of data given current } \theta$$

$$P(\theta) \rightarrow \text{Prior, old beliefs about } \theta$$

$$P(\theta \mid X) \rightarrow \text{Posterior, new beliefs about } \theta \text{ after seeing data}$$

Now, at this point you may be thinking that this should be easy! This is just multiplication and division! Not so fast. These are all probability distributions. In particular, remember that $P(X)$ is the integral:

$$\int P(X \mid \theta) P(\theta) d\theta$$

Generally speaking, this integral is either very hard or impossible to solve. One solution is to use sampling methods like Markov Chain Monte Carlo, but we are not going to cover that since we are going to arrive at a much more elegant solution.

3.2 Conjugate Priors

The answer to our problem from above is conjugate priors, which give us perfectly elegant solutions for $P(\theta \mid X)$. It turns out that for specific likelihood distributions, $P(X \mid \theta)$, and specific prior distributions, $P(\theta)$, when you solve Bayes' equation, the posterior distribution $P(\theta \mid X)$ will have the same type of distribution as the prior! The easiest way to see this is via an example.

Let's come back to our old friend, the Bernoulli Distribution, and its relation to click-through rates. We know that the likelihood for the CTR is Bernoulli:

$$P(X \mid \theta) = \prod_{i=1}^N \theta^{x_i} (1 - \theta)^{1 - x_i}$$

Now, what do we know that could help us find a distribution to represent the prior? Well, we know that $\theta$ must lie in $[0,1]$, because it is the probability of a click. Remember, the prior represents our prior beliefs about $\theta$, in this case $p(click)$. So, it must be a distribution between 0 and 1!

What probability distributions fit the description above? Well, if you browse your probability reference book or look on Wikipedia, you will arrive at the Beta Distribution:

In [1]:
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from matplotlib import rc, animation
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter
import numpy as np
from scipy.stats import norm, beta
from scipy import stats
import pandas as pd
import seaborn as sns
from IPython.core.display import HTML

sns.set(style="white", palette="husl")
sns.set_context("talk")
sns.set_style("ticks")

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
In [37]:
fig, ax = plt.subplots(figsize=(8,5))

# plotting distribution
x = np.linspace(0, 1, 1000)

a = [0.5, 5, 1, 2, 2, 1]
b = [0.5, 1, 3, 2, 5, 1 ]
legend = []

for a, b in zip(a,b):
    y = beta.pdf(x, a, b)
    plt.plot(x, y, lw=2)
    legend.append(f"a = {a}, b = {b}")

plt.xlabel(r'$\theta$')
plt.ylabel('Probability Density')
plt.title('Beta Distribution')
plt.legend(legend, bbox_to_anchor=(1, 1))
plt.ylim(0,2.6)
plt.show()

Which has a PDF defined as:

$$P(\theta) = Beta(a,b) = \frac{\theta^{a-1} (1- \theta)^{b-1}}{B(a,b)}$$

Where we define $B(a,b)$ as:

$$B(a,b) = \frac{\Gamma(a) \Gamma(b)}{\Gamma(a+b)}$$

And $\Gamma$ is defined as:

$$\Gamma(n) = (n-1)! \quad \text{for positive integers } n$$

More generally, $\Gamma$ is the factorial extended to real numbers.
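
As a quick check, scipy exposes the gamma function directly (the sample arguments here are arbitrary):

from scipy.special import gamma

print(gamma(5))    # 24.0, i.e. (5-1)! = 4!
print(gamma(2.5))  # ~1.3293; defined for non-integer arguments as well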

Combine Beta Prior with Bernoulli Likelihood to Solve for Posterior

If we let our prior distribution be the Beta Distribution, does that let us solve for the posterior? Well, let's try combining it with the likelihood via Bayes' rule to find out. We can start as follows:

$$P(\theta \mid X) \propto P(X \mid \theta) P(\theta)$$

Where we are using the proportionality symbol because we are not considering the normalizing constant.

$$ P(\theta \mid X) \propto \Big[ \overbrace{ \prod_{i=1}^N \theta^{x_i}(1-\theta)^{1-x_i} }^\text{Bernoulli likelihood}\Big] \overbrace{\theta^{a-1}(1-\theta)^{b-1}}^\text{Beta Prior} $$

We can then combine like terms so that everything that has $\theta$ on the base goes together, and everything that has $1 - \theta$ goes together:

$$P(\theta \mid X) \propto \theta^{a - 1 + \sum_{i=1}^N x_i} (1 - \theta)^{b - 1 + \sum_{i=1}^N (1 - x_i)} $$

And you may see that this has the same shape as the Beta Distribution! Remember, our normalization constant doesn't matter because it does not depend on $\theta$. In particular, we can see that $P(\theta \mid X)$ is also a Beta distribution:

$$P(\theta \mid X) = Beta(a', b')$$

But the parameters are updated (we will call them $a'$ and $b'$). They are equal to:

$$a' = a + \sum_{i=1}^N x_i$$

$$b' = b + N - \sum_{i=1}^N x_i$$

However, we can also define them in terms of our CTR problem:

$$a' = a + \text{# of clicks}$$

$$b' = b + \text{# of no clicks}$$
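
As a small worked sketch of this update rule (the prior and the click counts here are hypothetical):

from scipy.stats import beta

a, b = 1, 1                          # uniform (non-informative) prior
clicks, impressions = 12, 50         # hypothetical data

a_post = a + clicks                  # a' = a + # of clicks
b_post = b + (impressions - clicks)  # b' = b + # of no clicks

posterior = beta(a_post, b_post)
print(posterior.mean())              # 0.25, close to the raw CTR 12/50 = 0.24
print(posterior.interval(0.95))      # central 95% credible interval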

Interesting Things to Note

One interesting fact about the beta distribution is that its mean is equal to:

$$E(\theta) = \frac{a}{a + b}$$

Which is essentially the maximum likelihood estimate of the click-through rate (exactly so when $a$ and $b$ are just the click and no-click counts). However, this solution is far more elegant, because this distribution doesn't require an approximation.

Another interesting fact arises when we analyze the variance of $\theta$:

$$var(\theta) = \frac{ab}{(a+b)^2 (a+b+1)}$$

Now what is interesting about this? Well, we know that as we collect more data, the larger $a$ and $b$ will become. Therefore, as $a$ and $b$ get bigger, the variance gets smaller. So this automatically follows the same behavior that we observe with the approximate confidence interval, which is that the variance shrinks as we gather more data.

Selecting $a$ and $b$

One question you may have is: How do we choose $a$ and $b$? Well, it turns out that if you set $a = 1$ and $b=1$, the beta distribution is equal to the Uniform Distribution. This makes a lot of sense, because if you don't know anything about your CTR prior to the experiment, then all possible CTR's are equally probable. We refer to this as a non-informative prior.

One important fact to keep in mind is that as we collect more and more data, the influence of the prior parameters becomes negligibly small.

Summary of what we just did

I want to take a quick moment to summarize what we just did before moving on to the code implementation.

  • We started by discussing how we want to treat the parameter as a random variable with a probability distribution.
  • We identified a pair of probability distributions, in this case the Bernoulli and the Beta, such that when you combine them using Bayes' rule, the prior and the posterior are the same type of probability distribution. We call these conjugate priors.
  • Lastly, we showed that the Beta posterior's mean matches the maximum likelihood CTR of the Bernoulli, and that its variance automatically shrinks as you collect more data, just as the frequentist confidence interval did.

4. Bayesian A/B Testing Algorithm

Finally, we have arrived at the actual implementation of the A/B testing algorithm. We just looked at how we can get a probability distribution of our click through rates, but two questions still remain:

  1. How can we actually use these distributions to solve the explore-exploit dilemma?
  2. How can we ensure that our solution is better than that of epsilon-greedy or UCB1?

4.1 Sampling

The answer to the above questions is sampling. Whenever you generate random numbers, you have to ask: what distribution am I sampling from? In general, most programming languages give you a rand() function that provides a uniformly distributed number in $[0,1]$. numpy gives us randn(), which samples from a Gaussian with mean 0 and variance 1. scipy gives us the ability to sample from all of the distributions that it contains, the Beta distribution being one of them.

So, the question now is how does sampling from the Beta Distribution help us? To make the answer to this clear, it helps to consider a few scenarios.

Scenario 1

The first scenario is when we have already spent a lot of time exploring, so we have pretty sharp, small variance estimates of our CTR.

In [94]:
def plot_sampling_scenario(a1, b1, a2, b2):
    fig, ax = plt.subplots(figsize=(8,5))

    x = np.linspace(0, 1, 1000)
    y1 = beta.pdf(x, a1, b1)
    plt.plot(x, y1, c=sns.xkcd_rgb["red"], lw=2)
    plt.fill_between(x, y1, color=sns.xkcd_rgb["light red"], alpha=0.2)

    y2 = beta.pdf(x, a2, b2)
    plt.plot(x, y2, c=sns.xkcd_rgb["blue"], lw=2)
    plt.fill_between(x, y2, color=sns.xkcd_rgb["light blue"], alpha=0.2)
    plt.ylim(0,32)

    plt.xlabel(r'$\theta$')
    plt.ylabel('Probability Density')
    plt.title("Beta Distributions for Two Different Ads CTRs")

    plt.legend(["CTR 1", "CTR 2"], bbox_to_anchor=(1, 1))
    
#     plt.tight_layout()
    
    return plt.show()

plot_sampling_scenario(500, 500, 800, 300)

If we sample a number from both of these beta distributions, we would expect with very high probability that the one with the higher CTR will give us a larger random number. So, in terms of our Bayesian Bandit problem, we should choose to play the bandit that gives us the largest sample. Note that it is still possible for the worse bandit/advertisement to give us a higher random number, but it becomes increasingly unlikely as the distributions get sharper (variance decreases).

Scenario 2

Let's now consider a second scenario. One advertisement has been exploited a lot, so we are very sure of its CTR. However, we have not collected that much data for the other advertisement:

In [95]:
plot_sampling_scenario(4, 3, 800, 300)

Sampling and choosing the advertisement that returned the maximum will help us in this scenario as well! Now, CTR 1, which has a higher variance, has a much higher chance of giving us a random number that is larger. So, you can see that:

It is the distributions themselves that tell us how much to explore and exploit.

More data gives us less variance, and in turn that leads to less exploration. The same is true in the reverse direction. Less data means more variance, and that means more exploration.

Now, sticking with this same scenario, notice that if we split the distribution of CTR 1 at the mean of CTR 2, it has more probability weight (shaded light red) below CTR 2's mean. It has a bit of weight above (shaded dark red), but it is still more probable to draw a lower number than a higher one (relative to CTR 2).

In [96]:
fig, ax = plt.subplots(figsize=(10,5))

a1, b1, a2, b2 = 4, 3, 800, 300
x = np.linspace(0, 1, 1000)

mean, var = beta.stats(a2, b2, moments='mv')
y2 = beta.pdf(x, a2, b2)
plt.axvline(x=mean, color=sns.xkcd_rgb["blue"], linestyle="dashed")
plt.ylim(0,32)

y1 = beta.pdf(x, a1, b1)
plt.plot(x, y1, c=sns.xkcd_rgb["red"], lw=2)
plt.fill_between(x, y1, where=(x < mean), color=sns.xkcd_rgb["light red"], alpha=0.2)
plt.fill_between(x, y1, where=(x > mean), color=sns.xkcd_rgb["dark red"], alpha=0.6)

plt.xlabel(r'$\theta$')
plt.ylabel('Probability Density')
plt.title('Beta Distributions for Two Different CTRs')

plt.legend(["Mean of CTR 2", "CTR 1"], bbox_to_anchor=(1, 1))
plt.tight_layout()
plt.show()

This means that CTR 2 is already pretty good, so we will spend more time exploiting that, and less time exploring.

Scenario 3

Now, what if we had a scenario where the CTR with the sharp distribution (CTR 2) were lower than CTR 1? In that case, the probability of exploring the other CTR would be higher, because more of CTR 1's probability weight lies to the right of it.

In [97]:
plot_sampling_scenario(4, 3, 300, 800)

Thompson Sampling

Note that this method is also referred to as Thompson Sampling, and it can also be used in the context of reinforcement learning.

Pseudocode

With that said, we can put this all together with a bit of pseudocode as follows:

while True: 
    # Draw a random sample from each bandit's current Beta(a_j, b_j)
    j* = the bandit that gave the maximum sample
    x = result of pulling bandit j*'s arm (1 = click/win, 0 = no click/loss)
    a_j* = a_j* + x          # update bandit j*'s a
    b_j* = b_j* + 1 - x      # update bandit j*'s b

Note that this is completely independent of the number of CTRs there are, which was one of the things we needed to correct from the frequentist A/B test. Also note that I have been using the terms bandit and CTR somewhat interchangeably, since they serve the same purpose in this overview.

4.2 Bayesian A/B in Code

Let's finally put everything together and create our Bayesian A/B test in code. We are going to define the following:

  • Class: Advertisement
  • Probability of winning, p
  • Beta parameters a and b, set equal to 1, which gives the uniform distribution
  • Method show_ad: returns 1 or 0 (a click or a no click)
  • Method sample: samples from the instance's current Beta distribution
  • Method update: updates the instance's Beta parameters, where x is the click outcome (1 or 0)

We can run this experiment for 500 iterations and see the outcome visually in the animation below.

In [4]:
NUM_TRIALS = 2000
ADVERTISEMENT_PROBABILITIES = [0.2, 0.5, 0.75] # Ground truth advertisement CTR probabilities, we don't know these!

class Advertisement:
    def __init__(self, p):
        self.p = p
        self.a = 1
        self.b = 1
        
    def show_ad(self):
        return np.random.random() < self.p
    
    def sample(self):
        return np.random.beta(self.a, self.b)
    
    def update(self, x):
        self.a += x
        self.b += 1 - x
        
def plot(bandits, trial):
    x = np.linspace(0,1,200)
    for b in bandits:
        y = beta.pdf(x, b.a, b.b)
        plt.plot(x, y, label=f"real p: {b.p}")
    plt.title(f"Bandit distributions after {trial} trials")
    plt.legend()
    plt.show()
        
def experiment():
    # initializing array of advertisements
    advertisements = [Advertisement(p) for p in ADVERTISEMENT_PROBABILITIES]
    
    # these are points where we will show a plot
    sample_points = [1999]
    
    # loop through each trial 
    for i in range(NUM_TRIALS):
        all_samples = [ad.sample() for ad in advertisements]
        bestb = advertisements[np.argmax(all_samples)]
#         if i in sample_points:
#             print( "current samples: " , all_samples)
#             plot(advertisements, i)
        x = bestb.show_ad()
        bestb.update(x)
    return advertisements

lower_bound = 0
upper_bound = 1

plt.ioff()                          

fig, ax1 = plt.subplots(figsize=(8, 5), dpi=150)       

x_axis = np.arange(lower_bound, upper_bound, 0.001)

# Initialize array of advertisements
advertisements = [Advertisement(p) for p in ADVERTISEMENT_PROBABILITIES]
colors = [sns.xkcd_rgb["red"], sns.xkcd_rgb["blue"], sns.xkcd_rgb["green"]]
shade_colors = [sns.xkcd_rgb["light red"], sns.xkcd_rgb["light blue"], sns.xkcd_rgb["light green"]]
legend = [f"Ground Truth Probability: {ad_prob}" for ad_prob in ADVERTISEMENT_PROBABILITIES]

def animate(trial_num):
    # ------------- Run Experiment -------------
    all_samples = [ad.sample() for ad in advertisements]
    best_ad = advertisements[np.argmax(all_samples)]
    
    x = best_ad.show_ad()
    best_ad.update(x)

    ax1.clear()
    ax1.set_xlim(0, 1)                        
    ax1.set_ylim(0, 30) 
    
    for ad, c, sc in zip(advertisements, colors, shade_colors):
        y = beta.pdf(x_axis, ad.a, ad.b)
        ax1.plot(x_axis, y, c=c)
        ax1.fill_between(x_axis, y, color=sc, alpha=0.3)
        
    ax1.legend(legend, loc=2)
    plt.title(f"Advertisement distributions after {trial_num} trials", pad="10")
    
""" Define steps and create animation object """
step = 1
steps = np.arange(0, 500, step)

plt.tight_layout()
plt.subplots_adjust(top=0.9)

# For rendering html video in cell
# html_video = HTML(
#     animation.FuncAnimation(
#         fig,
#         animate,
#         steps,
#         interval=100,
#     ).to_html5_video()
# )
# display(html_video)
# gif_video = animation.FuncAnimation(
#         fig,
#         animate,
#         steps,
#         interval=100,
#     )

# gif_video.save('bayesian_experiment.gif', writer='imagemagick')

plt.close()

Above, we see that we have 3 different advertisements, and for the sake of simplicity we can refer to them as the red, blue, and green advertisements (this isn't even that much of a leap; in online Bayesian A/B testing, something as simple as text color is often what is being compared). The advertisements have a ground truth probability of being clicked of:

$$P(Click \mid red) = 0.2$$

$$P(Click \mid blue) = 0.5$$

$$P(Click \mid green) = 0.75$$

However, keep in mind that we do not actually know this when conducting the experiment. I am simply including it for demonstration/clarity purposes. So, we assign each advertisement a Beta distribution with $a = 1$ and $b = 1$. From a machine learning perspective we can think of these as initial, unlearned parameters (think of how a neural network's weights start off randomly initialized).

We then sample from each of these (unlearned) beta distributions, at which point all are equally likely to result in the highest sampled number. Whichever returns the highest number is then selected as the advertisement to show to a customer. The customer will then either click on that advertisement or not. Based on that result, we update $a$ and $b$ of the advertisement that was shown (we update its beta distribution). We then repeat this process for 500 iterations.

What we see is that for the first 5 to 10 iterations all of the advertisements' distributions are nearly the same. However, something very elegant begins to happen over time. Let's look at the red advertisement for a moment. If on iteration 5 the red distribution ends up returning the highest sampled value (which isn't unrealistic, since all of the beta distributions begin at the uniform distribution), then what happens? Well, we then actually show the red ad to a user. But the red ad isn't very effective, and has only a 20% chance of being clicked on. If the user then doesn't click on it (which is probable), then the number of no clicks for the red ad increases. This subsequently changes the values of $a$ and $b$ for the red beta distribution, making it less probable to return a high sampled number the next time it is sampled from. This can be seen in the flow chart below:

Now, as the number of iterations increases, the fact that the green advertisement has a ground truth probability of 75% of being clicked on is reflected clearly. Let's say that we are once again at iteration 5, and this time the green beta distribution draws the highest sample. This is not unexpected; remember, at iteration 5 all beta distributions are very similar. Well, that means that the green advertisement is shown to a user:

Remember, users really like the green advertisement, and there is a 75% chance that they are going to click on it. Assuming that they do, at that point the green beta distribution would have its $a$ and $b$ updated, this time increasing the probability of a higher number being sampled from it on the next iteration. This continues and becomes amplified with each iteration.

What is really cool is that as the number of iterations increases, the variance of the green beta can clearly be seen to decrease (it becomes a tighter/skinnier distribution). It begins to form a very tight distribution around 0.75, the true parameter! This means that we essentially were able to learn the true parameter of the green advertisement's click probability!

How did we just use Bayes Rule?
Now, you may be wondering how exactly we just used Bayes' rule. Bayes' rule, as we have seen several times, involves multiplying the prior by the likelihood in order to get the posterior. So, where exactly did that occur? Remember, we actually derived that earlier! We found that the Beta and Bernoulli distributions form what is known as a conjugate prior pair, and that multiplying them together results in a Beta distribution with updated $a$ and $b$ parameters. So, Bayes' rule was baked into this process via our earlier derivation.

4.3 Thompson Sampling Convergence

As an additional visualization, I want to take a brief moment to show how our click-through rate converges to the best advertisement. The animation above showed how each advertisement's respective distribution is updated over time, while the plot below shows how the cumulative average click-through rate of our experiment converges to that of the advertisement with the highest ground truth probability:

In [16]:
def run_experiment(p1, p2, p3, N):
    bandits = [Advertisement(p1), Advertisement(p2), Advertisement(p3)]
    
    data = np.empty(N)
    
    for i in range(N):
        j = np.argmax([b.sample() for b in bandits])
        x = bandits[j].show_ad()
        bandits[j].update(x)
        
        data[i] = x
    cumulative_average_ctr = np.cumsum(data) / (np.arange(N) + 1)
    fig, ax = plt.subplots(figsize=(8,5))
    plt.plot(cumulative_average_ctr, zorder=3, color=sns.xkcd_rgb["black"])
    plt.plot(np.ones(N)*p1, color=sns.xkcd_rgb["light red"])
    plt.plot(np.ones(N)*p2, color=sns.xkcd_rgb["light blue"])
    plt.plot(np.ones(N)*p3, color=sns.xkcd_rgb["light green"])
    ax.set_xlabel('Number of Iterations (N)', fontsize=14)
    ax.set_ylabel('Probability', fontsize=14)
    ax.tick_params(labelsize="large")
    ax.legend(['Our experiment CTR probability', 'Ad 1, p = 0.2','Ad 2, p = 0.25','Ad 3, p = 0.3'], fontsize=14)
    plt.ylim((0,1))
    plt.xscale('log')
    plt.show()
    
run_experiment(0.2, 0.25, 0.3, 100000)

5. The Online Nature of A/B Testing

5.1 Bayesian Sampling

I want to take a moment and pause to consider how far we have come at this point.

  • We started with traditional A/B testing, making use of all sorts of approximations, awkwardly deciding how many samples you need to collect with more approximations, and then finally running a full experiment without any adaptation to results collected so far.
  • If we wanted to stop early, frequentist statistics told us that we would not get a valid result.
  • Now, we have a way to create an adaptive system! Which, if you think about it, doesn't even really require any kind of test; we are just able to start it and let it run.
  • It eventually converges, so that you are doing the best possible thing given the data that you have collected.
  • In other words, there is no threshold at which the results become valid, rather you will naturally converge to equilibrium.

Clearly, we have made some drastic improvements, while at the same time making our methodology far easier to explain, and elegant to implement.

5.2 The Online Nature of Bayesian Methods

Now, one very interesting aspect of the Bayesian sampling method is that your measurements become more accurate after every single sample you collect. Why exactly is that? Well, as mentioned earlier, we don't just update a bandit's $a$ and $b$ parameters after collecting all the data; rather, we update them after every arm pull! Put another way, the posterior that we have now becomes the prior when we incorporate even more data:

$$P(\theta \mid X_{1:t}) \propto P(X_t \mid \theta) \, P(\theta \mid X_{1:t-1})$$

$$P(\theta \mid X_{1:t+1}) \propto P(X_{t+1} \mid \theta) \, P(\theta \mid X_{1:t})$$

We can compare this to a traditional machine learning model, where you fit all of your training data at the same time:

fit(Xtrain, Ytrain)

If we get more data in that case, we need to retrain from scratch. With our Bayesian sampling approach, we just add a number to $a$ and a number to $b$. This is an example of what is known as online learning. The model learns and improves after every single sample. Additionally, there won't be any RAM issues, because $a$ and $b$ take up constant space. In the same vein, there will be no speed issues either.
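
A tiny sketch of what this online update looks like in practice (the click stream below is hypothetical):

a, b = 1, 1                     # uniform prior
stream = [1, 0, 0, 1, 1, 1, 0]  # hypothetical click stream (1 = click)

for x in stream:
    a += x       # yesterday's posterior becomes today's prior
    b += 1 - x   # constant memory: just two integers

print(a, b)      # Beta(5, 4) after seven observations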

In a sense, this is more similar to how humans and animals learn. Specifically, we take our most recent experience and form some opinion about it. Never do we take every experience we have ever had and try to reach some global optimum. In fact, we can't even do that because our memory systems are imperfect. All said and done, there are a lot of satisfying results from using the bayesian method!

6. Finding a Threshold without $p$-values

Now, even though Bayesian Testing is very efficient, it still takes up resources. For instance, we must store $a$ and $b$ somewhere (a database), generate random numbers for the bandits, and so on. So, if we are certain that one advertisement is better than another, why not just display that advertisement in a static way?

Perhaps even more importantly, on a deeper level think about the probability that you actually want to know. Most often in the role of a data scientist, especially when performing A/B testing, we are asked the following question:

What is the probability that version $A$ is better than version $B$?

Mathematically, we are being asked to find $P(A > B)$. Incredibly, frequentist statistics cannot answer this question. Frequentist statistics are based upon powerful, but often unintuitive assumptions and verbiage, leaving us in a bind to provide answers that the common person can understand. For instance, if we were asked the above question and only had frequentist statistics at our disposal, we would most likely need to:

  • Choose a null hypothesis, $H_0$, that $A$ is equal to $B$.
  • Run a hypothesis test, generate a test statistic, and compute a $p$-value.

Finally with a $p$-value in hand, we would provide them with the result, and more importantly, the following addendum:

"Remember, a $p$-value is the probability of obtaining a result equal to or 'more extreme' than what was actually observed, when the null hypothesis is true".

Yes, they may say they understand, but as pointed out in Thinking, Fast and Slow, if they are not well versed in statistics they are most likely substituting the unintuitive frequentist metric you provided with what they intuitively wanted to know (the probability that $A$ was better than $B$).

Well, Bayesian Statistics will once again come to our rescue! Because both advertisement $A$ and $B$ have distributions, we are able to calculate the probability that people intuitively want! Specifically, if we define the click through rate as $\lambda$, then we can find:

$$P(\lambda_A > \lambda_B)$$

6.1 The Joint Posterior for Two Variants

I am going to utilize the same experiment from earlier, only now focusing on the green and blue advertisements. To make the derivations easier to follow, I am going to refer to the green ad as $A$ and the blue ad as $B$.

The goal of the earlier experiment was to find the posterior distributions of the advertisements' click-through rates. These posteriors were Beta Distributions with their own respective $a$ and $b$:

$$P(\lambda_A) = Beta(a_A, b_A) = \frac{\lambda_A^{a_A-1} (1- \lambda_A)^{b_A-1}}{B(a_A,b_A)}$$

$$P(\lambda_B) = Beta(a_B, b_B) = \frac{\lambda_B^{a_B-1} (1- \lambda_B)^{b_B-1}}{B(a_B,b_B)}$$

The final shape of each was as follows:

In [37]:
NUM_TRIALS = 2000
ADVERTISEMENT_PROBABILITIES = [0.5, 0.75]
x_axis = np.arange(lower_bound, upper_bound, 0.001)

advertisements = experiment()
colors = [sns.xkcd_rgb["blue"], sns.xkcd_rgb["green"]]
shade_colors = [sns.xkcd_rgb["light blue"], sns.xkcd_rgb["light green"]]

fig = plt.figure(figsize=(13, 5))       
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)

y = beta.pdf(x_axis, advertisements[0].a, advertisements[0].b)
ax1.plot(x_axis, y, c=colors[0])
ax1.fill_between(x_axis, y, color=shade_colors[0], alpha=0.3) 
ax1.set_xlabel(f'$\lambda_B$')
ax1.set_title(f"Final Posterior Beta Distribution for ad $B$")    
ax1.set_ylabel(f'PDF($\lambda$)')
ax1.set_ylim(-1, 45) 
ax1.legend([f"Beta(a={advertisements[0].a}, b={advertisements[0].b})"])

y = beta.pdf(x_axis, advertisements[1].a, advertisements[1].b)
ax2.plot(x_axis, y, c=colors[1])
ax2.fill_between(x_axis, y, color=shade_colors[1], alpha=0.3)
ax2.set_xlabel(f'$\lambda_A$')
ax2.set_title(f"Final Posterior Beta Distribution for ad $A$")    
ax2.set_ylim(-1, 45) 
ax2.legend([f"Beta(a={advertisements[1].a}, b={advertisements[1].b})"])

plt.show()

Now, in our original experiment, after finding the two distributions we kept them as univariate and sampled from them to accomplish our goal. But having these in front of us actually gives us the ability to answer the intuitive question from earlier, namely:

$$P(\lambda_A > \lambda_B)$$

Specifically, we can find the joint distribution of $\lambda_A$ and $\lambda_B$:

$$P(\lambda_A) = Beta(a_A, b_A) = \frac{\lambda_A^{a_A-1} (1- \lambda_A)^{b_A-1}}{B(a_A,b_A)}$$

$$P(\lambda_B) = Beta(a_B, b_B) = \frac{\lambda_B^{a_B-1} (1- \lambda_B)^{b_B-1}}{B(a_B,b_B)}$$

$$P(\lambda_A, \lambda_B) = Beta(a_A, b_A) \cdot Beta(a_B, b_B) = \frac{\lambda_A^{a_A-1} (1- \lambda_A)^{b_A-1}}{B(a_A,b_A)} \cdot \frac{\lambda_B^{a_B-1} (1- \lambda_B)^{b_B-1}}{B(a_B,b_B)}$$

While this may look rather intimidating, its 3D representation is quite intuitive, as seen below:

In [45]:
fig = plt.figure(figsize=(15, 5.5), dpi=200)       
ax1 = fig.add_subplot(1, 2, 1, projection='3d')
ax2 = fig.add_subplot(1, 2, 2)

def init():
    # ---- Make data ----
    ad_B = advertisements[0]
    ad_A = advertisements[1]
    lambda_A = np.arange(0, 1, 0.01)
    lambda_B = np.arange(0, 1, 0.01)
    lambda_A, lambda_B = np.meshgrid(lambda_A, lambda_B)
    joint_prob = beta.pdf(lambda_B, ad_B.a, ad_B.b) * beta.pdf(lambda_A, ad_A.a, ad_A.b)
    
    # ---- LEFT: 3D Plot ----
    # Plot the surface.
    surf = ax1.plot_surface(
        lambda_A,
        lambda_B,
        joint_prob,
        cmap="magma",
        linewidth=0, 
        antialiased=True,
        rstride=1, # rstride and cstride are difference makers
        cstride=1
    ) 

    # Customize the z axis.
    ax1.set_zlim(0, 150)
    ax1.zaxis.set_major_locator(LinearLocator(10))
    ax1.zaxis.set_major_formatter(FormatStrFormatter('%.0f'))
    ax1.view_init(15,15)
    ax1.set_zticks([0,50,100,150])


    # Labels
    ax1.set_xlabel(f'$\lambda_A$', labelpad=8)
    ax1.set_ylabel(f'$\lambda_B$', labelpad=8)
    ax1.set_zlabel(f'$P(\lambda_A, \lambda_B)$', labelpad=8)

    
    # ---- RIGHT: Mesh ---- 
    c = ax2.pcolormesh(lambda_A, lambda_B, joint_prob, cmap="magma")
    fig.colorbar(c, ax=ax2)

    y = lambda_A[0]
    ax2.plot(lambda_A[0], y, color="red")

    ax2.set_xlabel(f'$\lambda_A$', labelpad=8)
    ax2.set_ylabel(f'$\lambda_B$', labelpad=8)
    ax2.set_title('Heat Map of Joint Distribution')
    

def animate(i):
    ax1.view_init(elev=10, azim=i*4)
    return fig,
    
plt.tight_layout()
plt.subplots_adjust(wspace=0.3, bottom=0.15, top=0.9)

# For rendering html video in cell
# html_video = HTML(
#     animation.FuncAnimation(
#         fig,
#         animate,
#         init_func=init,
#         frames=90,
#         interval=100,
#     ).to_html5_video()
# )
# display(html_video)
gif_video = animation.FuncAnimation(
        fig,
        animate,
        init_func=init,
        frames=90,
        interval=100,
    )

gif_video.save('joint_distribution_heat_map_animation.gif', writer='imagemagick')

plt.close()

On the left you can see the three dimensional representation of the joint distribution. It almost looks like the distribution of ad $A$ was elongated along the $\lambda_B$ axis. In a sense this is what happened. When thinking about how a 3-dimensional plot is actually created, it is helpful to think of sweeping across all values of $\lambda_A$ and $\lambda_B$; in other words, a nested for loop:

for val_b in lambda_B:
    for val_a in lambda_A:
        joint_probability_density = lambda_B_pdf(val_b) * lambda_A_pdf(val_a)

# joint_probability_density is a bivariate function that is represented as a surface

So this should make sense. We can clearly see that the global maximum is located where the respective univariate distributions of $\lambda_A$ and $\lambda_B$ have maximums.

Now, on the right there is an equivalent representation of this 3-d plot in the form of a heat map. We still have our $\lambda_A$ and $\lambda_B$ axes, only now our third dimension (the joint PDF) is represented in the form of varying shades of color. The corresponding color bar can be seen to the right. What is very nice about this heat map is that it allows us to overlay the line (sheet in three dimensions) $\lambda_A = \lambda_B$. As was the case in basic algebra, this line simply represents where our two axis values are equivalent.

Why is this so useful? Well, let's think back to the original question we wanted to answer, aka the value we wanted to obtain:

$$P(\lambda_A > \lambda_B)$$

This line allows us to get a visual representation of just that! In order to determine the probability that ad $A$ is more effective than ad $B$, we simply need to determine the volume under the above three dimensional sheet, in the region where $\lambda_A$ is greater than $\lambda_B$ (in this case the lower/right triangle created by the red line).

If this isn't quite intuitive, consider the following two dimensional scenario: You are given a pdf, $P(X)$, and asked to find the probability that $X > 1$. In order to do so, you know that you must find the area under the curve where $X > 1$:

$$\int_{x=1}^\infty P(x) \, dx$$

The same principle applies in our situation. We have a joint distribution across $\lambda_A$ and $\lambda_B$. There is a region in our $\lambda_A$ and $\lambda_B$ axes where $\lambda_A > \lambda_B$. We want to find the volume under the curve in this region in order to determine $P(\lambda_A > \lambda_B)$. Now, just looking at the right heat map we can clearly see that nearly all of the volume under the curve lies in this region! Hence, we confirm what we already knew, there is a very high probability that $\lambda_A > \lambda_B$!
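
Although we will treat this more formally in the next section, a quick Monte Carlo estimate of this probability is easy to sketch: sample from each posterior many times and count how often the $A$ sample exceeds the $B$ sample. The posterior parameters below are hypothetical stand-ins; in practice you would plug in the $a'$, $b'$ values from your own experiment:

import numpy as np

a_A, b_A = 1500, 500    # hypothetical posterior for ad A (mean ~ 0.75)
a_B, b_B = 1000, 1000   # hypothetical posterior for ad B (mean ~ 0.50)

samples_A = np.random.beta(a_A, b_A, size=100000)
samples_B = np.random.beta(a_B, b_B, size=100000)

print(np.mean(samples_A > samples_B))  # estimate of P(lambda_A > lambda_B)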

We will formalize this probability, and also its counterpart, $P(\lambda_B > \lambda_A)$, momentarily. However, it would be remiss of me not to include a visualization of how we arrived at these final posterior beta distributions. Below, we can see the entire experiment play out from this new perspective:

In [30]:
fig = plt.figure(figsize=(13.5, 5.5), dpi=150)       
ax1 = fig.add_subplot(1, 2, 1, projection='3d')
ax2 = fig.add_subplot(1, 2, 2)

# ---- Make data ----
ADVERTISEMENT_PROBABILITIES = [0.5, 0.75]
advertisements = [Advertisement(p) for p in ADVERTISEMENT_PROBABILITIES]
ad_B = advertisements[0]
ad_A = advertisements[1]
lambda_A = np.arange(0, 1, 0.01)
lambda_B = np.arange(0, 1, 0.01)
lambda_A, lambda_B = np.meshgrid(lambda_A, lambda_B)

def animate(i):
    # ------------- Run Experiment -------------
    all_samples = [ad.sample() for ad in advertisements]
    best_ad = advertisements[np.argmax(all_samples)]
    x = best_ad.show_ad()
    best_ad.update(x)

    joint_prob = beta.pdf(lambda_B, ad_B.a, ad_B.b) * beta.pdf(lambda_A, ad_A.a, ad_A.b)    

    # ------------ LEFT PLOT -------------------
    ax1.clear()
    
    surf = ax1.plot_surface(
        lambda_A,
        lambda_B,
        joint_prob,
        cmap="magma",
        linewidth=0, 
        antialiased=True,
        rstride=1, # rstride and cstride are difference makers
        cstride=1
    ) 
    # Customize the z axis.
    ax1.set_zlim(0, 150)
    ax1.zaxis.set_major_locator(LinearLocator(10))
    ax1.zaxis.set_major_formatter(FormatStrFormatter('%.0f'))
    ax1.set_zticks([0,50,100,150,200])


    # Labels
    ax1.set_xlabel(f'$\lambda_A$', labelpad=8)
    ax1.set_ylabel(f'$\lambda_B$', labelpad=8)
    ax1.set_zlabel(f'$P(\lambda_A, \lambda_B)$', labelpad=8)
    ax1.set_title(f"Trial: {i}")

    # ------------ RIGHT: Mesh ------------ 
    ax2.clear()
    c = ax2.pcolormesh(lambda_A, lambda_B, joint_prob, cmap="magma")
    
    y = lambda_A[0]
    ax2.plot(lambda_A[0], y, color="red")
    ax2.set_xlabel(f'$\lambda_A$', labelpad=8)
    ax2.set_ylabel(f'$\lambda_B$', labelpad=8)
    ax2.set_title('Heat Map of Joint Distribution')
    
    ax1.view_init(elev=10, azim=i)
    return fig,
    
plt.tight_layout()
plt.subplots_adjust(wspace=0.3, bottom=0.15, top=0.9)

# For rendering html video in cell
# html_video = HTML(
#     animation.FuncAnimation(
#         fig,
#         animate,
#         frames=500,
#         interval=100,
#     ).to_html5_video()
# )
# display(html_video)
# gif_video = animation.FuncAnimation(
#         fig,
#         animate,
#         frames=500,
#         interval=100,
#     )

# gif_video.save('joint_distribution_heat_map_animation_experiment.gif', writer='imagemagick')

plt.close()

Above we can see how the posterior distribution takes shape with each iteration. As always, the key thing to keep in mind is that we are updating our beliefs based on new evidence/observations.

6.2 Error and Loss Functions

Now that we have a handle on how the joint posterior can be utilized to find $P(\lambda_B > \lambda_A)$, we can dig a bit deeper into how to use this in practice. At the beginning of this section, I brought up the idea of eventually stopping the experiment and just serving the best advertisement. Something we most likely want to know is:

What is the probability that we made a mistake?

For instance, say we gather some evidence and we then choose to display advertisement $A$. What is the probability that our choice was wrong, and $B$ in fact was better? Well, intuitively we can define the probability that we made a mistake as:

$$P(\lambda_B > \lambda_A)$$

Which is the volume under the surface where $\lambda_B > \lambda_A$. We will define this error probability as follows:

$$E[error](A) = P(\lambda_B > \lambda_A) = \int_0^1 \int_{\lambda_A}^1 P(\lambda_A, \lambda_B) \, d\lambda_B \, d\lambda_A$$

Which is simply the volume under the surface over the upper-left triangle of our heat map (the region where $\lambda_B > \lambda_A$). Notice that the error is a function of the advertisement we choose; hence, when we choose ad $B$ our error is:

$$E[error](B) = P(\lambda_A > \lambda_B) = \int_0^1 \int_0^{\lambda_A} P(\lambda_A, \lambda_B) \, d\lambda_B \, d\lambda_A$$

If you are slightly rusty when it comes to double integrals, think of it as follows:

$$E[error](A) = \overbrace{\int_0^1 \overbrace{\int_{\lambda_A}^1 P(\lambda_A, \lambda_B) \, d\lambda_B}^\text{Inner Integral} \, d\lambda_A}^\text{Outer Integral}$$

We have an outer and an inner integral. In combination they find the volume under a surface: the inner integral finds the cross-sectional area under the curve for a particular value of the outer variable; then the outer variable increments, and the inner integral finds the area at that new value. This repeats for all values in the interval of the outer integral, and at the end the cross-sectional areas are summed to give the total volume. Visually, finding $E[error](A)$ looks like:

In [148]:
fig = plt.figure(figsize=(7, 6))       
ax = fig.add_subplot(1, 1, 1)
advertisements = experiment()
ad_B = advertisements[0]
ad_A = advertisements[1]
lambda_A = np.arange(0, 1, 0.01)
lambda_B = np.arange(0, 1, 0.01)
lambda_A, lambda_B = np.meshgrid(lambda_A, lambda_B)
joint_prob = beta.pdf(lambda_B, ad_B.a, ad_B.b) * beta.pdf(lambda_A, ad_A.a, ad_A.b)

c = ax.pcolormesh(lambda_A, lambda_B, joint_prob, cmap="magma")
fig.colorbar(c, ax=ax)

y = lambda_A[0]
ax.plot(lambda_A[0], y, color="red")
for x_1 in np.arange(0,1,0.02):
    ax.axvline(x_1, x_1, alpha=0.5)

ax.set_xlabel(f'$\lambda_A$', labelpad=8)
ax.set_ylabel(f'$\lambda_B$', labelpad=8)
ax.set_title(f'$error(A)$')
plt.show()

Where above we are taking all of the cross sectional areas signified by the vertical pink lines, and summing them up to find the total volume under the curve. Clearly, the volume will be incredibly small in this case, since nearly all of the volume under the curve is held in the lower right triangle (in the bright spot). However, that is what we would ideally want. If we choose ad $A$, we want the error probability to be small. We can determine this volume via numerical integration:

# Numerical integration over a discrete grid of lambda values.
# Assumes joint_posterior[i, j] = P(lambda_A = grid[j], lambda_B = grid[i]),
# matching the meshgrid layout of joint_prob above.
step = 0.01
grid = np.arange(0, 1, step)

# Error when choosing A: volume over the region where lambda_B > lambda_A
error_A = 0
for i, lambda_b in enumerate(grid):
    for j, lambda_a in enumerate(grid):
        if lambda_b > lambda_a:
            error_A += joint_posterior[i, j] * step**2

# Error when choosing B: volume over the region where lambda_A > lambda_B
error_B = 0
for i, lambda_b in enumerate(grid):
    for j, lambda_a in enumerate(grid):
        if lambda_a > lambda_b:
            error_B += joint_posterior[i, j] * step**2

There is a drawback to this metric that hurts its utility in making decisions: it treats all errors as equally bad.

6.3 The Loss Function

We can fix the issue raised above by creating a Loss Function that treats small errors as less bad than big ones.

We can define the loss function as follows:

The loss function is the amount of uplift that one can expect to be lost by choosing a given variant, given particular values of $\lambda_A$ and $\lambda_B$.

It is defined as follows:

$$L(\lambda_A, \lambda_B, A) = max(\lambda_B - \lambda_A, 0)$$

$$L(\lambda_A, \lambda_B, B) = max(\lambda_A - \lambda_B, 0)$$

For example, suppose we choose to display ad $B$. If the conversion rate for ad $B$ is $\lambda_B = 0.1$ and the conversion rate for ad $A$ is $\lambda_A = 0.15$, then our loss would be:

$$L(\lambda_A, \lambda_B, B) = max(0.15 - 0.1, 0) = 0.05$$

In contrast if we choose $A$ our loss would be:

$$L(\lambda_A, \lambda_B, A) = max(0.1 - 0.15, 0) = 0$$

While this loss function makes sense when inspected and applied to a single $(\lambda_A, \lambda_B)$ pair, a visualization should make it more clear how it will be applied alongside our joint posterior:

In [13]:
def loss_func(lambda_A, lambda_B, ad):
    if ad == "A":
        return np.maximum(lambda_B - lambda_A, 0)
    elif ad == "B":
        return np.maximum(lambda_A - lambda_B, 0)
In [100]:
fig = plt.figure(figsize=(13.5, 5.5), dpi=150)       
ax1 = fig.add_subplot(1, 2, 1, projection='3d')
ax2 = fig.add_subplot(1, 2, 2)

def init():
    # Make data.
    lambda_A = np.arange(0, 1, 0.01)
    lambda_B = np.arange(0, 1, 0.01)
    lambda_A, lambda_B = np.meshgrid(lambda_A, lambda_B)
    Z = loss_func(lambda_A, lambda_B, "A")

    # Plot the surface.
    surf = ax1.plot_surface(
        lambda_A,
        lambda_B,
        Z,
        cmap="magma",
        linewidth=0,
        antialiased=True)

    # Customize the z axis.
    ax1.set_zlim(0, 1)
    ax1.zaxis.set_major_locator(LinearLocator(10))
    ax1.zaxis.set_major_formatter(FormatStrFormatter('%.02f'))

    # Labels
    ax1.set_xlabel(f'$\lambda_A$', labelpad=8)
    ax1.set_ylabel(f'$\lambda_B$', labelpad=8)
    ax1.set_zlabel('Z', labelpad=8)
    ax1.set_title(f'Loss Function: $L(\lambda_A, \lambda_B, A)$', pad="10")
    
    # Heat Map
    c = ax2.pcolormesh(lambda_A, lambda_B, Z, cmap="magma")
    fig.colorbar(c, ax=ax2)

def animate(i):
    ax1.view_init(elev=10, azim=i)
    return fig,
    
""" Define steps and create animation object """

# For rendering html video in cell
# html_video = HTML(
#     animation.FuncAnimation(
#         fig,
#         animate,
#         init_func=init,
#         frames=360,
#         interval=100,
#     ).to_html5_video()
# )
# display(html_video)

# gif_video = animation.FuncAnimation(
#         fig,
#         animate,
#         init_func=init,
#         frames=360,
#         interval=100,
#     )

# gif_video.save('loss_function_animation.gif', writer='imagemagick')

What we can see is that the loss function linearly scales the volume moving away from the line $\lambda_A = \lambda_B$, so that volume (probability) further away from the midline is considered worse. This makes intuitive sense: if a point is right on the line, in other words $A$ and $B$ are nearly identical, then a mistake is not very costly. However, if the error is far away from the midline, it should be weighted more heavily since it is more egregious.

Now, to find the expected loss given a joint posterior, we would solve for the expected value of the loss function:

$$E[L](ad) = \int_0^1 \int_0^1 L(\lambda_A, \lambda_B, ad) P(\lambda_A, \lambda_B) d\lambda_A d\lambda_B$$

Above we are finding the total volume under the entire surface. The difference is that now the surface is the loss function times the posterior distribution:

$$Surface \rightarrow L(\lambda_A, \lambda_B, ad) \cdot P(\lambda_A, \lambda_B)$$

Visually, we can see the resulting surface when selecting ad $A$ below:

In [31]:
advertisements = experiment()
ad_B = advertisements[0]
ad_A = advertisements[1]
In [37]:
fig = plt.figure(figsize=(12, 4), dpi=100)       
ax1 = fig.add_subplot(1, 2, 1, projection='3d')
ax2 = fig.add_subplot(1, 2, 2)

lambda_A = np.arange(0, 1, 0.01)
lambda_B = np.arange(0, 1, 0.01)
lambda_A, lambda_B = np.meshgrid(lambda_A, lambda_B)
joint_prob = beta.pdf(lambda_B, ad_B.a, ad_B.b) * beta.pdf(lambda_A, ad_A.a, ad_A.b)
loss = loss_func(lambda_A, lambda_B, "A")
Z = joint_prob * loss

# Plot the surface.
surf = ax1.plot_surface(
    lambda_A,
    lambda_B,
    Z,
    cmap="magma",
    linewidth=0,
    antialiased=True,
    rstride=1, # rstride and cstride are difference makers
    cstride=1
)

# Customize the z axis.
ax1.set_zlim(0, 1)
ax1.zaxis.set_major_locator(LinearLocator(10))
ax1.zaxis.set_major_formatter(FormatStrFormatter('%.01f'))
ax1.xaxis.set_major_formatter(FormatStrFormatter('%.01f'))
ax1.yaxis.set_major_formatter(FormatStrFormatter('%.01f'))
ax1.set_zticks([0,0.2,0.4,0.6,0.8, 1])

# Labels
ax1.set_xlabel(f'$\lambda_A$', labelpad=8)
ax1.set_ylabel(f'$\lambda_B$', labelpad=8)
ax1.set_zlabel('Z', labelpad=8)
ax1.set_title(f'$L(\lambda_A, \lambda_B, A) \cdot Posterior$', pad="15", size=13)

# Heat Map
c = ax2.pcolormesh(lambda_A, lambda_B, Z, cmap="magma")
fig.colorbar(c, ax=ax2)

y = lambda_A[0]
ax2.plot(lambda_A[0], y, color="red")
ax2.set_xlabel(f'$\lambda_A$', labelpad=8)
ax2.set_ylabel(f'$\lambda_B$', labelpad=8)

plt.subplots_adjust(wspace=0.4, bottom=0.15, top=0.9)

plt.show()

And when selecting ad $B$:

In [36]:
fig = plt.figure(figsize=(12, 4), dpi=100)         
ax1 = fig.add_subplot(1, 2, 1, projection='3d')
ax2 = fig.add_subplot(1, 2, 2)

lambda_A = np.arange(0, 1, 0.01)
lambda_B = np.arange(0, 1, 0.01)
lambda_A, lambda_B = np.meshgrid(lambda_A, lambda_B)
joint_prob = beta.pdf(lambda_B, ad_B.a, ad_B.b) * beta.pdf(lambda_A, ad_A.a, ad_A.b)
loss = loss_func(lambda_A, lambda_B, "B")
Z = joint_prob * loss

# Plot the surface.
surf = ax1.plot_surface(
    lambda_A,
    lambda_B,
    Z,
    cmap="magma",
    linewidth=0,
    antialiased=True,
    rstride=1, # rstride and cstride are difference makers
    cstride=1
)

# Customize the z axis.
ax1.set_zlim(0, 1)
ax1.zaxis.set_major_locator(LinearLocator(10))
ax1.zaxis.set_major_formatter(FormatStrFormatter('%.0f'))
ax1.xaxis.set_major_formatter(FormatStrFormatter('%.01f'))
ax1.yaxis.set_major_formatter(FormatStrFormatter('%.01f'))
ax1.set_zticks([0,10,20,30,40,50])

# Labels
ax1.set_xlabel(f'$\lambda_A$', labelpad=8)
ax1.set_ylabel(f'$\lambda_B$', labelpad=8)
ax1.set_zlabel('Z', labelpad=8)
ax1.set_title(f'$L(\lambda_A, \lambda_B, B) \cdot Posterior$', pad="15", size=13)
ax1.set_zlim(0, 50)

# Heat Map
c = ax2.pcolormesh(lambda_A, lambda_B, Z, cmap="magma")
fig.colorbar(c, ax=ax2)
y = lambda_A[0]
ax2.plot(lambda_A[0], y, color="red")
ax2.set_xlabel(f'$\lambda_A$', labelpad=8)
ax2.set_ylabel(f'$\lambda_B$', labelpad=8)

plt.subplots_adjust(wspace=0.4, bottom=0.15, top=0.9)

plt.show()

This again makes perfect sense! We want our loss to be very low when choosing $A$ and very high when choosing $B$! I also want to note the difference in scale between the $Z$ values of the plots above; they differ by more than an order of magnitude, further evidence that our loss function is working correctly. In order to find the volume under the respective surfaces above, we could implement the following:

def loss_func(lambda_A, lambda_B, ad):
    if ad == "A":
        return np.maximum(lambda_B - lambda_A, 0)
    elif ad == "B":
        return np.maximum(lambda_A - lambda_B, 0)

# Expected loss when choosing A, via numerical integration over the grid.
# joint_prob[i, j] = P(lambda_A = grid[j], lambda_B = grid[i]) from the meshgrid above.
d = 0.01
grid = np.arange(0, 1, d)
loss_A = 0
for i, lam_b in enumerate(grid):
    for j, lam_a in enumerate(grid):
        loss_A += joint_prob[i, j] * loss_func(lam_a, lam_b, "A") * d**2

Now, I should note again that the volume under these curves is currently being determined via numerical integration. However, a closed form solution can be found and is derived in the appendix.

7. Confidence Interval Approximation vs. Beta Posterior

I want to now take a moment to look at how accurate the confidence interval we derived earlier actually was. In particular, we will visualize the approximate Gaussian that arises from the Central Limit Theorem and compare it to the Bayesian Beta posterior. The code is of course hidden, but it is very straightforward. We can see below that the two methods converge by the time $N = 500$.

In [40]:
import warnings
warnings.filterwarnings('ignore')

T = 501
true_ctr = 0.5

a = 1
b = 1 

data = np.empty(T)    

lower_bound = 0
upper_bound = 1

plt.ioff()                          

fig, ax1 = plt.subplots(figsize=(8, 5), dpi=150)       

x_axis = np.arange(lower_bound, upper_bound, 0.001)

def animate(i):
    # ------------- Run Experiment -------------
    # determine if click or no click (i.e. show the advertisement)
    x = 1 if np.random.random() < true_ctr else 0 
    data[i] = x
    
    # update beta parameters based on click or no click
    global a
    global b
    a += x
    b += 1 - x
    
    # determine current p(ctr) (our parameter), the number of samples, and the std (frequentist way)
    p = data[:i].mean()
    n = i + 1
    std = np.sqrt(p*(1-p)/n)

    # plot gaussian approximation (confidence interval, because it is based on std) (frequentist way)
    frequentist_gaussian = norm.pdf(x_axis, loc=p, scale=std)

    # plot posterior (bayesian way)
    bayesian_posterior = beta.pdf(x_axis, a=a, b=b)
    
    ax1.clear()
    ax1.set_xlim(0, 1)                        
    ax1.set_ylim(0, 20) 
    
    ax1.plot(x_axis, frequentist_gaussian, c=sns.xkcd_rgb["red"])
    ax1.plot(x_axis, bayesian_posterior, c=sns.xkcd_rgb["blue"])
    ax1.legend(["Gaussian Approximation" , "Beta Posterior"])
    ax1.annotate('N = {}'.format(i), [0.1,17.5], size=20)
    
""" Define steps and create animation object """
step = 1
steps = np.arange(0, T, step)

plt.title("Confidence Interval Approximation vs. Beta Posterior")

plt.tight_layout()
plt.subplots_adjust(top=0.9)

# For rendering html video in cell
# html_video = HTML(
#     animation.FuncAnimation(
#         fig,
#         animate,
#         steps,
#         interval=100,
#     ).to_html5_video()
# )
# display(html_video)
# gif_video = animation.FuncAnimation(
#         fig,
#         animate,
#         steps,
#         interval=100,
#     )

# gif_video.save('confidence_interval_approximation_vs_beta_posterior.gif', writer='imagemagick')

plt.close()

8. Alternative Conjugate Priors

Working with conjugate priors is a central part of Bayesian Machine Learning. We have already seen that a Beta prior and a Bernoulli likelihood form a conjugate pair for experiments with a binary outcome. What about experiments that are non-binary? That is what we are going to discuss now.

As we do, it is very important to remember what our goal is: We are trying to find the probability of a parameter, given the data:

$$P(parameter \mid data)$$

In the case of earlier experiments the parameter was the click through rate probability:

$$P(p_{CTR} \mid data)$$

In the categorical case it will remain similar, while in the Gaussian case it will differ slightly.

8.1 Conjugate Prior for a Die Roll (Categorical Outcome)

In an experiment where we are rolling a die, we will be using the categorical likelihood as our likelihood:

$$P(X \mid \theta) = \prod_{n=1}^N \prod_{k=1}^K \theta_k ^{I(x_n = k)}$$

Where $I$ is the indicator function:

$$I(true) = 1 \;\;, \;\; I(false) = 0$$

Our prior will be the Dirichlet Prior, which can be thought of as a multivariate version of the Beta:

$$P(\theta) = Dirichlet(\vec{\alpha}) = \frac{1}{B(\vec{\alpha})} \prod_{k=1}^K \theta_k ^{\alpha_k -1}$$$$B(\vec{\alpha}) = \frac{\prod_{k=1}^K \Gamma(\alpha_k)}{\Gamma \Big(\sum_{k=1}^K \alpha_k \Big)}$$$$\vec{\alpha} = (\alpha_1, ..., \alpha_K)$$

The next step is to combine the likelihood and the prior to determine the posterior:

$$P(\theta \mid X) \propto \prod_{n=1}^N \prod_{k=1}^K \theta_k ^{I(x_n = k)} \frac{1}{B(\vec{\alpha})} \prod_{k=1}^K \theta_k ^{\alpha_k -1}$$

We can then combine all of the $\theta_k$'s and move the product over $N$ into the exponent:

$$P(\theta \mid X) \propto \overbrace{ \prod_{k=1}^K \theta_k ^{\alpha_k -1 + \sum_{n=1}^N I(x_n = k)}}^\text{Form of Dirichlet Distribution}$$

Which is in the form of a Dirichlet Distribution, just as we wanted. So, our posterior can be written as:

$$P(\theta \mid X) = Dirichlet(\vec{\alpha}')$$$$\alpha_k' = \alpha_k + \sum_{n=1}^N I(x_n = k)$$$$\alpha_k' = \alpha_k + \#\text{ times } k \text{ appeared}$$
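To make the update concrete, here is a minimal sketch of the Dirichlet-Categorical update in code. This is my own addition; the flat prior, the die, and the observed rolls are made up purely for illustration:

import numpy as np

K = 6                  # number of die faces
alpha = np.ones(K)     # flat Dirichlet prior: alpha_k = 1 for all k

# Hypothetical observed rolls (faces are 1..6)
rolls = np.array([1, 3, 3, 6, 2, 3, 5, 6, 1, 3])

# Posterior update: alpha_k' = alpha_k + (# times face k appeared)
counts = np.bincount(rolls, minlength=K + 1)[1:]  # counts for faces 1..K
alpha_posterior = alpha + counts                  # [3. 2. 5. 1. 2. 3.]

# Draw samples of theta from the posterior; each sample sums to 1
theta_samples = np.random.dirichlet(alpha_posterior, size=5)

Just as with the Beta, the prior acts as pseudo-counts: each $\alpha_k$ behaves as if we had already seen face $k$ that many times before the experiment began.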

8.2 Gaussian Conjugate Priors

Sometimes we will be modeling a real number instead of something that is binary or categorical. Gaussians are a common distribution for modeling real numbers, so we will be using a Gaussian for now.

In the Gaussian case we will actually be looking for the probability of the mean, $\mu$, given the data:

$$P(\mu \mid data)$$

So $\mu$ is now our parameter, and we are trying to update it via Bayesian inference.

One thing that immediately makes this more complicated is that the Gaussian has two parameters: the mean and the variance. Since we don't want to overcomplicate things right off the bat, for now we will only place a prior on the mean and assume the variance is fixed.

To start, we are going to be using the precision (the inverse of the variance) instead of the variance itself:

$$\lambda^{-1} = \sigma^2$$

We can look at the table of conjugate priors on Wikipedia and see that the conjugate prior for a Gaussian likelihood (with known precision) is also a Gaussian:

$$\text{Likelihood: } P(X \mid \mu) \longrightarrow X \sim N(\mu, \tau^{-1})$$$$\text{Prior: } P(\mu) \longrightarrow \mu \sim N(m_0, \lambda_0^{-1})$$

As always, our first step is to multiply the prior and likelihood to get a proportionality for the posterior:

$$P(\mu \mid X) \propto \overbrace{\Big[ \prod_{n=1}^N \frac{1}{\sqrt{\frac{2\pi}{\tau}}} e^{\frac{-\tau}{2} (x_n -\mu)^2}\Big]}^\text{Likelihood} \overbrace{\frac{1}{\sqrt{\frac{2\pi}{\lambda_0}}} e^{\frac{-\lambda_0}{2} (\mu - m_0)^2}}^\text{Prior}$$

We can then drop our constants and only keep what is in the exponent:

$$P(\mu \mid X) \propto exp \Big( \frac{-\lambda_0}{2} (\mu - m_0)^2 - \frac{\tau}{2} \sum_{n=1}^N (x_n^2 - 2x_n \mu + \mu^2) \Big) $$

We can then collect all of the terms by order of $\mu$, ignoring anything that does not depend on $\mu$ because it goes away with the normalizing constant:

$$P(\mu \mid X) \propto exp \Big[ -\frac{1}{2} \big\{ (\lambda_0 + \tau N)\mu^2 - 2(m_0\lambda_0 + \tau(\sum_{n=1}^N x_n))\mu + ... \big\} \Big] $$

The next step is to recognize what form of the equation we are looking for. For the posterior we want:

$$-\frac{1}{2}\lambda(\mu -m)^2 = -\frac{1}{2}(\lambda \mu^2 - 2m\lambda\mu + ...)$$

Which is actually the exact same form that we just found! This means we just need to equate the two by the degree in $\mu$:

$$\lambda = \lambda_0 + \tau N$$$$m\lambda = m_0 \lambda_0 + \tau \sum_n x_n \leftrightarrow m = \frac{m_0 \lambda_0 + \tau \sum_n x_n}{\lambda_0 + \tau N}$$

And we have arrived at our closed form solution! Note that the precision term is simpler to solve, while the $m$ term must be solved after because it depends on $\lambda$. We now have our solution for the new $m$ and new $\lambda$ based on the prior parameters and the data we collected.

It is interesting to note that the posterior precision $\lambda$ (the precision for $\mu$) increases linearly with $N$; as $N \rightarrow \infty$, the precision also goes to $\infty$. Next, if you look at the expression for $m$, the numerator depends on the sum of the $x_n$ and the denominator grows with $N$. This means that when $N$ is large, $m$ converges to the sample mean of $X$, which in the limit is equal to $\mu$.
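As a quick sanity check on these update equations, here is a minimal sketch in code. This is my own addition; the prior parameters, the known precision $\tau$, and the data are made up for illustration:

import numpy as np

def update_gaussian_mean(m0, lam0, tau, x):
    """Conjugate update for the mean of a Gaussian with known precision tau."""
    n = len(x)
    lam = lam0 + tau * n                     # lambda = lambda_0 + tau * N
    m = (m0 * lam0 + tau * x.sum()) / lam    # m = (m_0 lambda_0 + tau sum(x)) / lambda
    return m, lam

# Hypothetical data drawn from N(5, 1), so tau = 1
x = np.random.normal(loc=5, scale=1, size=1000)
m, lam = update_gaussian_mean(m0=0, lam0=1, tau=1, x=x)
# m is pulled toward the sample mean (~5); lam = 1 + 1000

Consistent with the discussion above, with $N = 1000$ observations the posterior mean $m$ is dominated by the data and sits very close to the sample mean, despite the prior mean of 0.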

Appendix

A.1 Closed Form Solution $ \rightarrow P(\lambda_B > \lambda_A)$

As mentioned earlier when we were computing $P(\lambda_B > \lambda_A)$ via numerical integration, a closed form solution does exist. I didn't feel the need to use it earlier since it would most likely not be used in practice (it is not extensible to more than two variants), but it is a great exercise to go through. This derivation is based on derivations that can be found here.

Recall the definition of $P(\lambda_B > \lambda_A)$:

$$P(\lambda_B > \lambda_A) = \int_0^1 \int_{\lambda_A}^1 P(\lambda_A, \lambda_B) d\lambda_B d\lambda_A$$

Where our joint distribution is defined as:

$$P(\lambda_A, \lambda_B) = \frac{\lambda_A^{a_A-1} (1- \lambda_A)^{b_A-1}}{B(a_A,b_A)} \cdot \frac{\lambda_B^{a_B-1} (1- \lambda_B)^{b_B-1}}{B(a_B,b_B)}$$

And hence, we want to find a closed form solution to the following integral:

$$P(\lambda_B > \lambda_A) = \int_0^1 \int_{\lambda_A}^1 \frac{\lambda_A^{a_A-1} (1- \lambda_A)^{b_A-1}}{B(a_A,b_A)} \cdot \frac{\lambda_B^{a_B-1} (1- \lambda_B)^{b_B-1}}{B(a_B,b_B)} d\lambda_B d\lambda_A$$

To give you an idea of where we are heading, the solution to this integral will turn out to be:

$$P(\lambda_B > \lambda_A) = \sum_{i=0}^{a_B-1} \frac{B(a_A + i, b_A + b_B)}{(b_B + i)B(1+i, b_B)B(a_A, b_A)}$$

Where:

$$a_A \rightarrow \text{number of successes for } A + 1$$$$b_A \rightarrow \text{number of failures for } A + 1$$$$a_B \rightarrow \text{number of successes for } B + 1$$$$b_B \rightarrow \text{number of failures for } B + 1$$

Derivation

With that said, let's finally get into the derivation. To begin, I want to start with a few simple definitions/clarifications. First, we have the Beta Distribution:

$$Beta(a, b) = \frac{\lambda^{a-1}(1 - \lambda)^{b - 1}}{B(a, b)}$$

A continuous probability distribution commonly used as a conjugate prior in Bayesian inference. We then have the Beta Function:

$$B(a,b) = \int_0^1 \lambda^{a-1}(1-\lambda)^{b-1}d\lambda$$

Where the beta function also has a special relationship to the $\Gamma$ function:

$$B(a,b) = \frac{\Gamma(a) \Gamma(b)}{\Gamma(a+b)}$$

And, for positive integers $n$, $\Gamma$ is simply:

$$\Gamma(n) = (n-1)!$$

Our goal is defined as follows:

Find $P(\lambda_B > \lambda_A)$, the probability that $B$ will beat $A$ in the long run.

Original Univariate Distributions

So, for both variant $A$ and $B$ we hold beliefs (probability distributions):

$$\lambda_A \sim Beta(a_A, b_A) = \frac{\lambda_A^{a_A-1} (1- \lambda_A)^{b_A-1}}{B(a_A,b_A)}$$$$\lambda_B \sim Beta(a_B, b_B) = \frac{\lambda_B^{a_B-1} (1- \lambda_B)^{b_B-1}}{B(a_B,b_B)}$$

Joint Distributions

Because these distributions are independent of each other, we can find the joint distribution by simply multiplying them together. In order to get the total probability that $\lambda_B > \lambda_A$, we must integrate the joint distribution over the region where $\lambda_B > \lambda_A$:

$$P(\lambda_B > \lambda_A) = \int_0^1 \overbrace{\int_{\lambda_A}^1 \frac{\lambda_A^{a_A-1} (1- \lambda_A)^{b_A-1}}{B(a_A,b_A)} \cdot \frac{\lambda_B^{a_B-1} (1- \lambda_B)^{b_B-1}}{B(a_B,b_B)} d\lambda_B}^\text{Inner Integral} d\lambda_A$$

Evaluate Inner Integral using Incomplete Beta Function

Let's start by evaluating the inner integral. The beta distribution containing $\lambda_A$ can be pulled out because the inner integral is with respect to $\lambda_B$:

$$\int_0^1 \frac{\lambda_A^{a_A-1} (1- \lambda_A)^{b_A-1}}{B(a_A,b_A)} \overbrace{\int_{\lambda_A}^1\frac{\lambda_B^{a_B-1} (1- \lambda_B)^{b_B-1}}{B(a_B,b_B)} d\lambda_B}^\text{Inner Integral} d\lambda_A$$

Now, reviewing what we know about the Beta function, the inner integral above has exactly the shape of the regularized incomplete beta function:

$$I_x(a,b) = \frac{B(x; a,b)}{B(a,b)} = \int_0^x\frac{t^{a-1} (1-t)^{b-1}}{B(a,b)}dt$$$$\overbrace{\int_{\lambda_A}^1\frac{\lambda_B^{a_B-1} (1- \lambda_B)^{b_B-1}}{B(a_B,b_B)} d\lambda_B}^\text{Inner Integral} \; \; \; \longleftrightarrow \int_0^x\frac{t^{a-1} (1-t)^{b-1}}{B(a,b)}dt$$

So, we can replace our inner integral with $I_x$, but there is one thing we must look out for: the bounds of our inner integral are swapped relative to those in the incomplete beta function. To resolve this, we need an intuitive grasp of the situation at hand. We are trying to solve a double integral and find the volume under a portion of our joint distribution surface. We know that the total volume under the joint distribution must be equal to 1:

$$\int_0^1 \int_0^1 P(\lambda_A, \lambda_B) d\lambda_A d\lambda_B= 1$$

Hence, in order to account for the reversed bounds of integration in our inner integral, we simply need to subtract our expression from 1:

$$P(\lambda_B > \lambda_A) = 1 - \int_0^1 \frac{\lambda_A^{a_A-1} (1- \lambda_A)^{b_A-1}}{B(a_A,b_A)} I_{\lambda_A}(a_B, b_B)d\lambda_A$$

Again, this is to account for the fact that our inner integral was $\int_{\lambda_A}^1$ and not $\int_0^{\lambda_A}$.

Express $I$ without integral

From here our goal is to now define $I$ in terms that involve no integration. In order to do this we will make use of a recurrence relationship that $I$ has, as well as a base case/identity. We can start by trying to find the identity, $I_{\lambda_A}(1, b)$:

$$I_{\lambda_A} (a,b) = \int_0^{\lambda_A} \frac{t^{a-1} (1-t)^{b-1}}{B(a,b)}dt$$$$I_{\lambda_A} (1,b) = \int_0^{\lambda_A} \frac{(1-t)^{b-1}}{B(1,b)}dt$$

Where we can solve the integral via u-substitution:

$$\int (1-t)^{b-1}dt$$$$\text{let: } u = 1- t \text{; } du = -dt$$$$\int u^{b-1}du$$

Via the power rule we end up with:

$$\frac{-u^b}{b} = \frac{-(1-t)^b}{b} + C$$

We also can make use of the following identity concerning the Beta Function:

$$B(1,b) = \frac{1}{b}$$

Which we can substitute back into our original expression in order to arrive at a base case identity:

$$I_{\lambda_A} (1,b) = \frac{\frac{-(1-t)^b}{b}}{\frac{1}{b}} \Big|_0^{\lambda_A} = -(1-t)^b \Big|_0^{\lambda_A} = 1 - (1 - \lambda_A)^b$$$$I_{\lambda_A} (1,b) = 1 - (1 - \lambda_A)^b$$

Now, we can also make use of the following recursive relationship that $I$ holds:

$$I_x(a+1, b) = I_x(a,b) - \frac{x^a(1-x)^b}{aB(a,b)}$$

Which we can tweak to our case by substituting $a_A - 1$ for $a$:

$$I_{\lambda_A}(a_A,b_A) = I_{\lambda_A}(a_A-1,b_A) - \frac{\lambda_A^{a_A - 1} (1 - \lambda_A)^{b_A}}{(a_A - 1)B(a_A -1, b_A)}$$

Now, we still have an $I$ on the right hand side (which contains an integral we want to get rid of), but we are a bit closer to the expression we want. We can now expand the $I$ on the right hand side using the same recursive relationship, and continue doing so until we hit the base case:

$$ I_{\lambda_A}(a_A,b_A) = \overbrace{1 - (1 - \lambda_A)^{b_A}}^\text{Identity/base case} - \overbrace{ \sum_{j=1}^{a_A-1} \frac{\lambda_A^{a_A - j} (1 - \lambda_A)^{b_A}}{(a_A - j)B(a_A - j, b_A)} }^\text{Recursive Portion} $$

Great, we have arrived at an expression in which no $I$ (and hence no integral) appears on the right hand side! Equivalently, we can express the above equation as:

$$ I_{\lambda_A}(a_A,b_A) = 1 - \sum_{i=0}^{a_A-1} \frac{\lambda_A^{i} (1 - \lambda_A)^{b_A}}{(b_A + i)B(1 + i, b_A)} $$
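As a numerical check on this expression, we can compare it against scipy.special.betainc, which computes the regularized incomplete beta function $I_x(a,b)$ directly. This verification snippet is my own addition, with arbitrary example values:

import numpy as np
from scipy.special import betainc, beta as beta_fn

def I_closed_form(x, a, b):
    """I_x(a, b) for integer a, via the closed form sum above."""
    i = np.arange(a)
    return 1 - np.sum(x**i * (1 - x)**b / ((b + i) * beta_fn(1 + i, b)))

print(I_closed_form(0.3, a=4, b=7))  # closed form sum
print(betainc(4, 7, 0.3))            # scipy's regularized incomplete beta

Both lines should print the same value, confirming the recursion was unrolled correctly.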

Substitute back into expression for joint distribution

Now that we have found $I$ in a form that involves no integration, we can substitute it into our last expression for the joint distribution:

$$P(\lambda_B > \lambda_A) = 1 - \int_0^1 \frac{\lambda_A^{a_A-1} (1- \lambda_A)^{b_A-1}}{B(a_A,b_A)} I_{\lambda_A}(a_B, b_B)d\lambda_A$$$$P(\lambda_B > \lambda_A) = 1 - \int_0^1 \frac{\lambda_A^{a_A-1} (1- \lambda_A)^{b_A-1}}{B(a_A,b_A)} \Big(1 - \sum_{i=0}^{a_B-1} \frac{\lambda_A^{i} (1 - \lambda_A)^{b_B}}{(b_B + i)B(1 + i, b_B)} \Big) d\lambda_A $$

We can make a simplification by splitting the integral into two parts and seeing that the first part is the integral of a probability distribution, which will evaluate to 1:

$$P(\lambda_B > \lambda_A) = 1 - \overbrace{\int_0^1 \frac{\lambda_A^{a_A-1} (1- \lambda_A)^{b_A-1}}{B(a_A,b_A)} d\lambda_A}^\text{Probability distribution, evaluates to 1} + \int_0^1 \frac{\lambda_A^{a_A-1} (1- \lambda_A)^{b_A-1}}{B(a_A,b_A)} \sum_{i=0}^{a_B-1} \frac{\lambda_A^{i} (1 - \lambda_A)^{b_B}}{(b_B + i)B(1 + i, b_B)} d\lambda_A$$

Next, we can bring the expression outside of the summation inside, collecting like terms:

$$P(\lambda_B > \lambda_A) = \int_0^1 \overbrace{\frac{\lambda_A^{a_A-1} (1- \lambda_A)^{b_A-1}}{B(a_A,b_A)}}^\text{Bring inside, collect terms} \sum_{i=0}^{a_B-1} \frac{\lambda_A^{i} (1 - \lambda_A)^{b_B}}{(b_B + i)B(1 + i, b_B)} d\lambda_A$$$$ P(\lambda_B > \lambda_A) = \int_0^1 \sum_{i=0}^{a_B-1} \frac{\lambda_A^{a_A - 1 + i} (1 - \lambda_A)^{b_A + b_B - 1}}{(b_B + i)B(a_A, b_A)B(1 + i, b_B)} d\lambda_A $$

And then swap the order of the integral and summation:

$$ P(\lambda_B > \lambda_A) = \sum_{i=0}^{a_B-1} \int_0^1 \frac{\lambda_A^{a_A - 1 + i} (1 - \lambda_A)^{b_A + b_B - 1}}{(b_B + i)B(a_A, b_A)B(1 + i, b_B)} d\lambda_A $$

Followed by pulling out the terms that do not involve the variable of integration ($\lambda_A$), and multiplying by $\frac{B(a_A + i, b_A + b_B)}{B(a_A + i, b_A + b_B)}$. This leaves us with a beta distribution inside of the integral, which evaluates to 1:

$$ P(\lambda_B > \lambda_A) = \sum_{i=0}^{a_B-1} \frac{B(a_A + i, b_A + b_B)}{(b_B + i)B(a_A, b_A)B(1 + i, b_B)} \overbrace{\int_0^1 \frac{\lambda_A^{a_A - 1 + i} (1 - \lambda_A)^{b_A + b_B - 1}}{B(a_A + i, b_A + b_B)}}^\text{A beta distribution} d\lambda_A $$

Finally, we arrive at our closed form solution:

$$ P(\lambda_B > \lambda_A) = \sum_{i=0}^{a_B-1} \frac{B(a_A + i, b_A + b_B)}{(b_B + i)B(a_A, b_A)B(1 + i, b_B)} $$
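To make the result concrete, here is a minimal sketch of this closed form in code. This is my own addition; it uses scipy.special.betaln (the log of the Beta function) because the raw Beta function under- and overflows for realistic success/failure counts:

import numpy as np
from scipy.special import betaln

def prob_B_beats_A(a_A, b_A, a_B, b_B):
    """Closed form P(lambda_B > lambda_A); a_B must be a positive integer."""
    total = 0.0
    for i in range(a_B):
        total += np.exp(
            betaln(a_A + i, b_A + b_B)
            - np.log(b_B + i)
            - betaln(1 + i, b_B)
            - betaln(a_A, b_A)
        )
    return total

# e.g. A: 40 successes / 160 failures, B: 50 successes / 150 failures
print(prob_B_beats_A(41, 161, 51, 151))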

A.2 Closed Form Loss Function $ \rightarrow L(\lambda_A, \lambda_B, ad)$

We can build off of the closed form solution derived above in order to find a closed form loss function. This derivation draws from work here.

Recall, the idea behind the loss function (and behind solving for $P(\lambda_B > \lambda_A)$ in general) was that we wanted to know when we could stop running the test, i.e. when one variant had been shown, within a certain threshold of caring, to be better than the other. This looked like:

$$P(\lambda_B > \lambda_A) = \int_0^1 \int_{\lambda_A}^1 P(\lambda_A, \lambda_B) d\lambda_B d\lambda_A$$$$P(\lambda_B > \lambda_A) = \int_0^1 \int_{\lambda_A}^1 \frac{\lambda_A^{a_A-1} (1- \lambda_A)^{b_A-1}}{B(a_A,b_A)} \cdot \frac{\lambda_B^{a_B-1} (1- \lambda_B)^{b_B-1}}{B(a_B,b_B)} d\lambda_B d\lambda_A$$

And we stated that:

The test will be stopped when $P(\lambda_B > \lambda_A)$ is less than some threshold of caring.

In other words:

$$P(\lambda_B > \lambda_A) \leq \text{Threshold of caring}$$

To help simplify this derivation, we are going to let our closed form solution from earlier equal $h$:

$$ P(\lambda_B > \lambda_A) = \sum_{i=0}^{a_B-1} \frac{B(a_A + i, b_A + b_B)}{(b_B + i)B(a_A, b_A)B(1 + i, b_B)} = h(a_A, b_A, a_B, b_B) $$

So, if we want to apply what we learned in A.1 to the loss function, we must first update the equation above to include the loss. Here we compute the expected loss when choosing ad $A$; over the region of integration we have $\lambda_B \geq \lambda_A$, so the $max$ disappears and the loss is simply $(\lambda_B - \lambda_A)$:

$$L = \int_0^1 \int_{\lambda_A}^1 (\lambda_B - \lambda_A) \frac{\lambda_A^{a_A-1} (1- \lambda_A)^{b_A-1}}{B(a_A,b_A)} \cdot \frac{\lambda_B^{a_B-1} (1- \lambda_B)^{b_B-1}}{B(a_B,b_B)} d\lambda_B d\lambda_A$$

We can distribute our loss, $(\lambda_B - \lambda_A)$, across the joint distribution:

$$L = \int_0^1 \int_{\lambda_A}^1 \frac{\lambda_A^{a_A - 1} (1- \lambda_A)^{b_A-1}}{B(a_A,b_A)} \cdot \frac{\lambda_B^{a_B} (1- \lambda_B)^{b_B-1}}{B(a_B,b_B)} d\lambda_B d\lambda_A - \int_0^1 \int_{\lambda_A}^1 \frac{\lambda_A^{a_A} (1- \lambda_A)^{b_A-1}}{B(a_A,b_A)} \cdot \frac{\lambda_B^{a_B-1} (1- \lambda_B)^{b_B-1}}{B(a_B,b_B)} d\lambda_B d\lambda_A $$

And then multiply each term by a convenient form of $1$ (a ratio of identical Beta functions):


$$L = \frac{B(a_B + 1, b_B)}{B(a_B + 1 , b_B)} \int_0^1 \int_{\lambda_A}^1 \frac{\lambda_A^{a_A-1} (1- \lambda_A)^{b_A-1}}{B(a_A,b_A)} \cdot \frac{\lambda_B^{a_B} (1- \lambda_B)^{b_B-1}}{B(a_B,b_B)} d\lambda_B d\lambda_A - \frac{B(a_A + 1, b_A)}{B(a_A + 1 , b_A)} \int_0^1 \int_{\lambda_A}^1 \frac{\lambda_A^{a_A} (1- \lambda_A)^{b_A-1}}{B(a_A,b_A)} \cdot \frac{\lambda_B^{a_B-1} (1- \lambda_B)^{b_B-1}}{B(a_B,b_B)} d\lambda_B d\lambda_A $$


$$L = \frac{B(a_B + 1, b_B)}{B(a_B,b_B) } \int_0^1 \int_{\lambda_A}^1 \frac{\lambda_A^{a_A-1} (1- \lambda_A)^{b_A-1}}{B(a_A,b_A)} \cdot \frac{\lambda_B^{a_B} (1- \lambda_B)^{b_B-1}}{B(a_B + 1 , b_B)} d\lambda_B d\lambda_A - \frac{B(a_A + 1, b_A)}{B(a_A,b_A)} \int_0^1 \int_{\lambda_A}^1 \frac{\lambda_A^{a_A} (1- \lambda_A)^{b_A-1}}{B(a_A + 1 , b_A)} \cdot \frac{\lambda_B^{a_B-1} (1- \lambda_B)^{b_B-1}}{B(a_B,b_B)} d\lambda_B d\lambda_A $$

$$L = \frac{B(a_B + 1, b_B)}{B(a_B,b_B) } h(a_A, b_A, a_B + 1, b_B) - \frac{B(a_A + 1, b_A)}{B(a_A,b_A)} h(a_A + 1, b_A, a_B, b_B) $$

And we end up with a closed form expression for the expected loss, written entirely in terms of the function $h$ that we derived in A.1.
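To close the loop, here is a minimal sketch of this expected loss in code, reusing the prob_B_beats_A function (our $h$) sketched in A.1 and betaln for numerical stability. Again, this is my own addition, with the same illustrative counts as before:

import numpy as np
from scipy.special import betaln

def expected_loss_A(a_A, b_A, a_B, b_B):
    """Closed form expected loss (lost uplift) when choosing ad A."""
    term_B = np.exp(betaln(a_B + 1, b_B) - betaln(a_B, b_B)) * prob_B_beats_A(a_A, b_A, a_B + 1, b_B)
    term_A = np.exp(betaln(a_A + 1, b_A) - betaln(a_A, b_A)) * prob_B_beats_A(a_A + 1, b_A, a_B, b_B)
    return term_B - term_A

# Stop the test once the expected loss of the chosen variant drops
# below our threshold of caring, e.g. 0.001 (0.1% absolute uplift)
print(expected_loss_A(41, 161, 51, 151) < 0.001)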


© 2018 Nathaniel Dake