Nathaniel Dake Blog

1. Bayesian Inference

So, you want to know about Bayesian Techniques and how they are utilized in Machine Learning? Maybe you have heard about the Causal Revolution and want to get a better understanding of the role these techniques played in getting there? Or, you have determined that your problem would be best solved with Bayesian A/B Testing? Whatever your reasoning, you have come to the right place. However, before we dissect the techniques listed above, and many others, we need to determine two things:

  1. What flaws exist with Frequentist Statistics?
  2. What is meant by Bayesian Statistics, and how do they help counteract the shortcomings of 1?

Now, I have several posts concerning statistics and probability (in my math section) that I highly recommend reading before continuing. However, if you are comfortable with those subjects, feel free to continue. The outline of frequentist statistics below is comparatively sparse, since I already have several posts dedicated to its mechanics and intuitions. Below, we cover just enough to see what drawbacks exist in the subfield. With that said, let's get started!

1.1 Frequentist Statistics

Frequentist statistics test whether an event (hypothesis) occurs or not. They calculate the probability of an event in the long run of the experiment, i.e. the experiment is repeated under the same conditions to obtain the outcome.

Here, sampling distributions of fixed size are taken. Then, the experiment is theoretically repeated an infinite number of times, but practically done with a stopping criterion. For example, I perform an experiment with a stopping criterion in mind, say that I will stop the experiment after it has been repeated 1000 times.

This can be demonstrated with a simple example involving a coin toss. Here, we are trying to estimate the fairness of the coin:

| No. of tosses | No. of heads | Difference |
|--------------:|-------------:|-----------:|
| 10            | 4            | -1         |
| 50            | 25           | 0          |
| 100           | 44           | -6         |
| 500           | 255          | 5          |
| 1000          | 502          | 2          |
| 5000          | 2533         | 33         |
| 10000         | 5067         | 67         |

Now, we know that the probability of getting a head on tossing a fair coin is 0.5. The "No. of heads" column records how many heads were actually obtained. And the difference is:

$$(\text{No. of heads}) - 0.5*(\text{No. of tosses})$$

An important thing to note is that, though the absolute difference between the actual number of heads and the expected number of heads (50% of the number of tosses) tends to grow as the number of tosses increases, the proportion of heads to total tosses approaches 0.5 (for a fair coin).
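We can reproduce this behavior with a quick simulation (a sketch using numpy; the exact counts will vary with the seed):

In [ ]:
import numpy as np

rng = np.random.default_rng(42)
for n in (10, 100, 1000, 10_000, 100_000):
    heads = rng.binomial(n, 0.5)  # n tosses of a fair coin
    diff = heads - 0.5 * n        # actual heads minus expected heads
    print(f"n={n:>6}: heads={heads:>6}, diff={diff:>7.1f}, proportion={heads / n:.4f}")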

And just like that, our incredibly simple experiment presents us with a very common flaw found in the frequentist approach:

The result of an experiment is dependent on the number of times the experiment is repeated.

1.2 Flaws in Frequentist Statistics

So, let's dig into the flaws of frequentist statistics a bit more thoroughly. There are three main ones that rise above the rest:

$p$-values depend on sample size

The $p$-value is the probability of obtaining an effect at least as extreme as the one in your data, if the null hypothesis is true. Because the $p$-value is determined by the value of a test statistic, and that test statistic is a function of the number of iterations (i.e. the number of coin tosses), two people working with the exact same data generating process may get different $p$-values simply because their sample sizes differ. This is clearly undesirable!

An example may make this more concrete: Person A may choose to stop tossing a coin when the total count reaches 100, while person B stops at 1000. For different sample sizes, we get different test statistics, and therefore different $p$-values. Similarly, the intention to stop may change from a fixed number of flips to a total duration of flipping; in this case too, we are bound to get different $p$-values.
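To see this concretely, here is a small simulation (a sketch; the slightly biased coin with $p = 0.52$ is an assumption for illustration). The same data generating process yields very different $p$-values at different stopping points:

In [ ]:
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)
p_true = 0.52  # hypothetical, slightly biased coin

# Same data generating process, three different stopping points
for n in (100, 1000, 10000):
    heads = rng.binomial(n, p_true)
    result = binomtest(heads, n, p=0.5)  # H0: the coin is fair
    print(f"n={n:>6}: heads={heads:>5}, p-value={result.pvalue:.4f}")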

Confidence Interval depends on sample size

Like the $p$-value, our confidence interval depends heavily on the sample size. This flies in the face of the goal of statistics! We are trying to determine the ground truth underneath our data. If two people are working with the exact same data set and generating processes, their results (and certainly the significance) should be consistent.

Confidence Intervals are not probability distributions

Because confidence intervals are not probability distributions, they do not provide the most probable value for a parameter. They simply state that (in the case of a 95% confidence interval) if we were to repeat the process of performing the experiment and generating the confidence interval many times, 95% of the confidence intervals would contain the true population mean. This is described in more detail in my post concerning statistical inference and hypothesis testing.
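This coverage statement can itself be checked by simulation (a sketch, with arbitrary choices for the true mean, spread, and sample size): roughly 95% of the intervals constructed this way contain the true mean, but no single interval tells you the most probable value.

In [ ]:
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu_true, sigma_true, n, trials = 10.0, 2.0, 30, 10_000

covered = 0
for _ in range(trials):
    sample = rng.normal(mu_true, sigma_true, n)
    se = sample.std(ddof=1) / np.sqrt(n)
    # 95% t-interval for the mean of this sample
    lo, hi = stats.t.interval(0.95, n - 1, loc=sample.mean(), scale=se)
    covered += lo <= mu_true <= hi

print(f"Fraction of CIs containing the true mean: {covered / trials:.3f}")  # ~0.95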

Now, what approach could we take that would allow us to solve the issues above? Specifically, how can Bayesian statistics help us get past frequentist statistics' shortcomings?

2. Bayesian Statistics

A very simple, yet precise, description of Bayesian statistics is as follows:

Bayesian statistics is a mathematical procedure that applies probabilities to statistical problems. It provides people the tools to update their beliefs in light of new data.

However, we can't dissect Bayesian statistics without first understanding Bayes Theorem. So, that is where we will begin.

2.1 Bayes Theorem

Let's say that we are faced with the following scenario:

  • 1% of women have breast cancer (and therefore 99% do not).
  • 80% of mammograms detect breast cancer when it is there (and therefore 20% miss it).
  • 9.6% of mammograms detect breast cancer when it’s not there (and therefore 90.4% correctly return a negative result).

We can put this into a table that looks like this:

|               | Cancer (1%) | No Cancer (99%) |
|---------------|-------------|-----------------|
| Test Positive | 80%         | 9.6%            |
| Test Negative | 20%         | 90.4%           |

Here is how we read the table:

  • 1% of women have cancer
  • If you already have cancer, you are in the first column. There’s an 80% chance you will test positive. There’s a 20% chance you will test negative.
  • If you don’t have cancer, you are in the second column. There’s a 9.6% chance you will test positive, and a 90.4% chance you will test negative.

And, here is how we could represent this graphically:

Keep the above diagram in mind as we go through this example, it is invaluable in thinking of the situation intuitively.

How accurate is the test? $\rightarrow$ Intuitive Reasoning

Now suppose that you get a positive test result. What are the chances that you have cancer? 80%? 99%? 15%? Let's walk through it.

  • Ok, we got a positive result. It means we’re somewhere in the top row of our table. Let’s not assume anything — it could be a true positive or a false positive.
  • The chances of a true positive are equal to: (the chance you have cancer) × (the chance the test caught it) = 1% × 80% = 0.008
  • The chances of a false positive are equal to: (the chance you don't have cancer) × (the chance the test caught it anyway) = 99% × 9.6% = 0.09504

Let's update our table with this information:

|               | Cancer (1%)      | No Cancer (99%)       |
|---------------|------------------|-----------------------|
| Test Positive | 1% × 80% = 0.008 | 99% × 9.6% = 0.09504  |
| Test Negative | 1% × 20% = 0.002 | 99% × 90.4% = 0.89496 |

What was our question again? What's the chance we really have cancer if we get a positive result? Well, the chance of an event is the number of ways it could happen, out of all possible outcomes. This leads us to some general intuition about probability!

$$Probability = \frac{\text{desired event}}{\text{all possibilities}}$$

What does that look like in our case? Well we have:

  • Desired event = a positive test and having cancer
  • All possibilities = a positive test and having cancer + a positive test and having no cancer

In other words:

$$Probability \; of \;cancer \;given\;positive\;test = \frac{positive\;test\;and\;cancer}{positive\;test\;and\;cancer+positive\;test\;and\;no\;cancer}$$

And if we fill in our values from the table we arrive at:

$$Probability = \frac{0.01*0.8}{0.01*0.8+0.99*0.096}=0.0776 \approx 7.8\%$$

Interesting! — a positive mammogram only means you have a 7.8% chance of cancer, rather than 80% (the supposed accuracy of the test). It might seem strange at first but it makes sense: the test gives a false positive 9.6% of the time (quite high), so there will be many false positives in a given population. For a rare disease, most of the positive test results will be wrong.

Let’s test our intuition by drawing a conclusion from simply eyeballing the table. If you take 100 people, only 1 person will have cancer (1%), and they’re most likely going to test positive (80% chance). Of the 99 remaining people, about 10% will test positive, so we’ll get roughly 10 false positives. Considering all the positive tests, just 1 in 11 is correct, so there’s a 1/11 chance of having cancer given a positive test. The real number is 7.8% (closer to 1/13, computed above), but we found a reasonable estimate without a calculator.

How accurate is the test? $\rightarrow$ Application of Bayes Theorem

Now, we can use a more concrete mathematical framework to solve this problem! Recall (from previous posts), Bayes theorem is defined as:

$$P(A|B)=\frac{P(A)*P(B|A)}{P(B)}$$

In our example, the application of Bayes Theorem would look like:

$$P(cancer\;|\;positive\;test)=\frac{P(cancer)*P(positive\;test\;|\;cancer)}{P(positive\;test)}$$

$$\frac{0.01*0.8}{0.01*0.8+0.99*0.096}=7.8\%$$

This yields the same result as our general intuition equation! Maybe Bayes theorem isn't so hard after all...
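Since the whole computation is just arithmetic, it is easy to verify in a few lines of Python, using the numbers from the scenario above:

In [ ]:
p_cancer = 0.01              # prior: 1% of women have breast cancer
p_pos_given_cancer = 0.80    # the test detects cancer when it is there
p_pos_given_healthy = 0.096  # false positive rate

# Total probability of a positive test (the denominator of Bayes theorem)
p_pos = p_cancer * p_pos_given_cancer + (1 - p_cancer) * p_pos_given_healthy

p_cancer_given_pos = p_cancer * p_pos_given_cancer / p_pos
print(f"P(cancer | positive test) = {p_cancer_given_pos:.4f}")  # ~0.0776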

2.2 A Scientific Perspective: Induction vs. Deduction

After going through the above examples, you may have the following feeling: "This all makes sense when we break it down and walk through it step by step, but it didn't feel intuitive." If that is an accurate description of how you feel right now, take solace in the fact that I was (and still am, for the most part) the same way. You see, there are a few factors that leave our intuition lacking when dealing with Bayes' rule.

2.2.1 Induction vs Deduction

In the scientific process, there are two main forms of reasoning: deduction and induction. Deduction is the more familiar of the two, and works as follows:

So, we form some hypothesis about the state of the world or its workings, gather data about that state, and then see if our data confirms or denies our hypothesis. That is deduction, and we perform it every day with relative ease.

Then, there is induction, which looks like:

Here, we gather data about the world around us, and after picking up on a pattern we induce a hypothesis, then work towards concluding its validity. Now, it should be noted that this idea of reasoning from evidence to hypothesis can be thought of as a parallel of reasoning from effect to cause.

And this is where Thomas Bayes stepped in. He was concerned with the specific probabilities of two events, one (the hypothesis) occurring before the other (the evidence). This led him to coin the term inverse probability.

2.2.2 Inverse and Forward Probabilities

Thomas Bayes showed that you can deduce the probability of a cause from an effect. Now, if you know the cause of something, it is easy to estimate the probability of the effect; this is known as the forward probability. In other words, the probability in the causal direction. For instance, if you know it is raining (the cause) you would experience no discomfort in estimating the probability that someone is holding an umbrella (the effect).

However, going in the opposite direction, i.e. finding the inverse probability, is much harder. To demonstrate why, we need an example derived from Bayes' original paper. Imagine that you shoot a billiard ball on a table, making sure that it bounces as many times as possible so that we have no idea where it will end up (i.e. making its final position random). We want to know the probability that the ball will land within $x$ feet of the end of the table.

Well, if we know the length of the table, $L$, this is a very easy question to answer. For instance, if $L$ is 10 feet, and we want to know the probability of the ball stopping within $x = 1 ft$ of the end of the table, we know that the probability is:

$$p(x =1 \mid L=10) = \frac{1}{10}$$

What we just solved for was the forward probability. Our intuitive understanding of physics tells us that the longer the table, the lower the probability of landing in the subregion, because there are more final resting places where the ball can end up. Likewise, the larger the subregion is (the larger $x$ is), the higher the probability of landing there, because it contains a larger set of stopping positions.

Now, let's consider the inverse probability problem:

Here, we observe the final position of the ball to be $x=1$ foot away from the end of the table. However, we are not given the length $L$ of the table. The question that can be asked then is: What is the probability that the table is 100 feet long?

$$P(L=100 \mid x=1)$$

Common sense tells us that the table is more likely to be 50 feet long than 100 feet long; however, common sense alone gives no clear guidance as to how much more likely. Bayes' rule does, as the sketch below shows.
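Here is a minimal sketch (the three candidate lengths and the uniform prior over them are assumptions for illustration): the likelihood of the ball stopping within 1 foot of the end is $1/L$, so a 50 foot table is exactly twice as likely as a 100 foot one.

In [ ]:
# Candidate table lengths (ft); we observed the ball within x = 1 ft of the end
lengths = [10, 50, 100]
likelihoods = [1 / L for L in lengths]  # P(x = 1 | L) = 1/L

# With a uniform prior over the candidates, the posterior is just the
# normalized likelihood
total = sum(likelihoods)
for L, lik in zip(lengths, likelihoods):
    print(f"P(L={L:>3} | x=1) = {lik / total:.3f}")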

Why exactly was the forward probability so much easier to solve for than the inverse probability? In this case, the asymmetry comes from the fact that $L$ acts as the cause, and $x$ as the effect. Note that we are starting to work towards more fundamental principles here. Imagine that you observe someone throwing a rock at a window (a cause). All of us can predict, without a problem, that the window is most likely going to break. Human cognition is designed to work in this direction. However, if we are just given the effect (the broken window), we need much more information in order to deduce the cause (that it was even a rock that broke the window in the first place, since it could easily have been many other things).

Bayes set out to break this cognitive asymmetry and explain how to assess inverse probabilities. Perhaps the most important role of Bayes' rule in statistics (remember, we used it earlier in the cancer example) is:

We can estimate the conditional probability directly in one direction, in which our human judgement is more reliable, and use mathematics to derive the conditional probability in the other direction, for which our judgement is rather hazy.

2.2.3 A Philosophical Question

One way to look at Bayes theorem is as a method for updating your beliefs in a certain hypothesis. This is very important to understand, because a large part of human belief about future events rests on the frequency with which they, or similar events, have occurred in the past.

For instance, in the cancer example, we can think of the application of Bayes' rule as updating our belief that we have cancer, given a positive test. We perform the update by combining the new evidence (the test result) with our prior, namely the overall probability of having cancer in the general population.

This raises an interesting philosophical question:

Can we legitimately translate the expression "given that I know" into the language of probabilities? For example, is "given that I know $T$" the same as "among cases where $T$ occurred"?

Originally the language of probability, expressed in symbols like $P(L \mid X)$, was intended to capture the concept of frequencies in games of chance. However, the expression "given that I know" is epistemological and should be governed by the logic of knowledge, not by frequencies and proportions. What Thomas Bayes provided was a framework in which we can utilize the concept of probability to approximate these epistemological scenarios.

2.2.4 Human Psychology $\rightarrow$ Multiplication of Probabilities and Substitution

There is one final angle that I want to discuss with regards to why the application of Bayes' rule is often not intuitive. In the book Thinking, Fast and Slow, Daniel Kahneman highlights two key biases that humans often exhibit when dealing with probabilistic reasoning.

The first relates to the multiplication of probabilities. In general, humans have a difficult enough time dealing with a single probability, let alone several multiplied together. This is exactly the situation that we encounter with Bayes' rule: we end up multiplying the probability of having cancer in the general population by the probability of getting a positive test given that you have cancer. This multiplication process is not one that humans handle gracefully, and it often feels unintuitive (until broken down step by step).

Secondly, and more importantly, when we are faced with a tough question pertaining to probabilities, we often perform substitution. Kahneman highlights that the problem solving portion of our mind is often lazy, and will substitute a hard question for a similar, albeit easier, one. In the cancer scenario, if asked:

What is the probability you have cancer, given a positive test?

We will substitute that for an easier question:

What is the likelihood you have a positive test, given you have cancer?

From which we have the probability already at our disposal: 80%! What we have done is forget our base rate, i.e. the probability that anyone in the general population has cancer to begin with, in this case 1%. This substitution happens without us even realizing it, meaning it is nearly automatic.

You may wonder how you can prevent this incorrect intuition of substitution from occurring. Well, first and foremost it should be known that in general this substitution isn't a bad thing. In fact, it is an optimization that arose via evolution to allow humans to maximize their ability to make decisions pertaining to survival. However, to get around this bias and apply more effective statistical reasoning, the only thing you can do is train your mind to look out for these scenarios. Kahneman recommends always keeping the base rate in mind, and imagining that you are updating it with new evidence.

3. Bayesian Inference

The above example is very helpful for explaining where Bayesian Inference comes from and showing the mechanics in action. However, in Data Science applications it is generally used to interpret data! By pulling in prior knowledge about what we already know, we can draw stronger conclusions from small data sets!

3.1 Probability Distributions as Beliefs

It's helpful to think of probabilities as beliefs about the world; specifically, the confidence associated with particular beliefs. For instance, say we wanted to develop beliefs surrounding the heights of people. One way to do that is to utilize a probability distribution. So, say we have access to the following height data:

In [1]:
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from matplotlib import rc, animation
from scipy.stats import norm
import numpy as np
import pandas as pd
from IPython.core.display import HTML

import seaborn as sns
sns.set(style="white", palette="husl")
sns.set_context("talk")
sns.set_style("ticks")

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
In [35]:
mu, sigma = 5.83, 0.4
normal_dist = np.random.normal(mu, sigma, 1000)
heights = np.around(normal_dist, decimals=2)
df_heights = pd.DataFrame({
    'Height (ft)': heights
})
df_heights.index.rename('Person', inplace=True)
In [36]:
display(df_heights.head(10))
Person  Height (ft)
0       5.33
1       5.87
2       6.08
3       6.86
4       5.45
5       5.58
6       5.88
7       6.08
8       5.49
9       5.61

We can then plot a discrete distribution (limited in the number of values it can take on), breaking it down into smaller and smaller bins (shown below). For instance, if I asked you to predict whether someone's height (without seeing them) was greater or less than 5 feet 8 inches, your beliefs about the probability may be as follows:

In [37]:
def normal_dist_experiment(num_bins, y_label, title, normed=False, plot_continuous_dist=False):
    fig = plt.figure(figsize=(8,5))

    count, bins, ignored = plt.hist(
        heights,
        num_bins,
        density=normed,
        color='forestgreen',
        alpha=0.5,
        linewidth=1,
        edgecolor='black'
    )
    if plot_continuous_dist:
        y = 1/(sigma * np.sqrt(2 * np.pi)) *np.exp( - (bins - mu)**2 / (2 * sigma**2) )
        plt.plot(
            bins,
            y,
            linewidth=2,
            color='r'
        )
        plt.fill_between(bins, y, color=sns.xkcd_rgb["light green"], alpha=0.4)
    plt.xlabel(f'Height (ft)')
    plt.ylabel(y_label, size=14)
    plt.title(title)
    plt.show()
    
normal_dist_experiment(2, 'Probability Density of being in group', f'Probability group with {2} bins', normed=True)

If I then changed the question, asking which of the 5 bins below their height fell into, your beliefs may look something like:

In [38]:
normal_dist_experiment(5, 'Probability Density of being in group', f'Probability group with {5} bins', normed=True)

And we could keep breaking the distributions into finer and finer bins, slowly creating a finer-grained set of beliefs:

In [39]:
normal_dist_experiment(10, 'Probability Density of being in group', f'Probability group with {10} bins', normed=True)
In [40]:
normal_dist_experiment(25, 'Probability Density of being in group', f'Probability group with {25} bins', normed=True)
In [41]:
normal_dist_experiment(100, 'Probability Density of being in group', f'Probability group with {100} bins', normed=True)

Until the distribution has finally become continuous! It is important to remember that this continuous distribution still shows how your belief is allocated!

In [42]:
normal_dist_experiment(
    100,
    'Probability Density of being in group',
    f'Probability group with {100} bins',
    normed=True,
    plot_continuous_dist=True
)

Once you have your beliefs, you can determine things such as the average, median, and standard deviation:

In [10]:
print('Mean height: ', np.around(heights.mean(), decimals=2), 'ft')
print('Median height: ', np.around(np.median(heights), decimals=2), 'ft')
print('Standard deviation in height: ', np.around(heights.std(), decimals=2), 'ft')
Mean height:  5.84 ft
Median height:  5.84 ft
Standard deviation in height:  0.39 ft

The key intuition to remember here is that we can think of probability distributions as:

For all of these heights, where am I placing my bets? What do I believe, and how much?

3.2 Bayesian Inference Example

Let's look at an example of Bayesian Inference in the context of updating our beliefs. Say we are weighing our dog at the vet's office. When we take the dog to the vet, she squirms on the scale, which makes it difficult to get an accurate reading. At the last visit, we got 3 readings for the dog's weight:

$$\text{13.9 lb, 17.5 lb, 14.1 lb}$$

There is a standard statistical interpretation for this. We are able to calculate the mean, standard deviation, and standard error for this set of numbers, and use them to create a distribution for the dog's actual weight!

Calculate the Mean

In [4]:
sample_weights = np.array([13.9, 17.5, 14.1])
mu = sample_weights.mean()
print("Sample Mean: ", mu)
Sample Mean:  15.166666666666666

Calculate the Standard Deviation (sample)

In [5]:
N = len(sample_weights)
# Sample (unbiased) standard deviation: divide by N - 1, i.e. ddof=1
sigma = sample_weights.std(ddof=1)
print("Sample Standard Deviation: ", sigma)
Sample Standard Deviation:  2.023198787399136

Calculate the Standard Error

In [6]:
std_error = sigma / np.sqrt(N)
print("Standard Error: ", std_error)
Standard Error:  1.1680943645290156

From these statistics, we can now formally represent our beliefs about the dog's weight via a probability distribution. Specifically, we have what is needed to parameterize a normal distribution. Recall that the sample mean, $\hat{\mu}$, is a random variable and will tend to be Gaussian distributed (by the central limit theorem):

$$\hat{\mu} \approx N(\mu, \frac{\sigma^2}{N})$$

Where this variance is simply the standard error squared:

$$\hat{\mu} \approx N(\mu, stderr^2)$$

Note that while this distribution is parameterized by $\mu$, the population mean, and we only have access to $\hat{\mu}$, this is okay! The population mean is the true mean of $\hat{\mu}$'s sampling distribution, and we can use our estimate as an approximation. This is why I used the approximation symbol above.
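As a quick sanity check of that claim, here is a small simulation (a sketch; it reuses mu and sigma, the sample mean and sample standard deviation computed above): if we could re-weigh the dog many times, the spread of the 3-measurement sample means should match the standard error.

In [ ]:
# Simulate many hypothetical 3-measurement visits; the spread of the
# resulting sample means should match the analytic standard error
rng = np.random.default_rng(0)
simulated_means = rng.normal(mu, sigma, size=(100_000, 3)).mean(axis=1)
print("Std of simulated sample means:", simulated_means.std().round(3))  # ~1.17
print("Analytic standard error:      ", (sigma / np.sqrt(3)).round(3))   # ~1.17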

So, with that said, let's get a visual of the dog's weight:

In [9]:
def get_y_height(y_min, y_max, actual_y_pos):
    total_height = abs(y_min) + abs(y_max)
    height_of_point = abs(y_min) + actual_y_pos
    return height_of_point / total_height

def plot_mean(y_min, y_max):
    # Plotting mean
    plt.axvline(
        x=mu,
        color=sns.xkcd_rgb["grey"], 
        ymax=get_y_height(y_min, y_max, norm.pdf(mu, mu, std_error)),
        ymin=get_y_height(y_min, y_max, 0),
        alpha=0.5,
        linestyle="dashed",
        lw=1
    )
    

def plot_standard_error(y_min, y_max):
    # Plotting standard error
    plt.axvline(
        x=mu-std_error,
        color=sns.xkcd_rgb["sea blue"],
        ymax=get_y_height(y_min, y_max, norm.pdf(mu-std_error, mu, std_error)),
        ymin=get_y_height(y_min, y_max, 0),
        linestyle="dashed",
        lw=1
    )
  
    plt.axvline(
        x=mu+std_error,
        color=sns.xkcd_rgb["sea blue"],
        ymax=get_y_height(y_min, y_max, norm.pdf(mu+std_error, mu, std_error)),
        ymin=get_y_height(y_min, y_max, 0),
        linestyle="dashed",
        lw=1
    )

    
def plot_measurements(measurements, y_min, y_max):
    # Plotting our measurements
    for measurement in measurements:
        plt.axvline(
            x=measurement,
            color=sns.xkcd_rgb["light red"],
            ymax=get_y_height(y_min, y_max, norm.pdf(measurement, mu, std_error)),
            ymin=get_y_height(y_min, y_max, 0),
        )
        plt.scatter(
            measurement,
            norm.pdf(measurement, mu, std_error),
            c=sns.xkcd_rgb["light red"],
            zorder=5,
            alpha=0.5
        )
        plt.scatter(measurement, 0, c=sns.xkcd_rgb["red"], zorder=5)

        
def setup_axes(ax):
    # setting axes
    ax.set_xlabel('Weight in Pounds', fontsize=14)
    ax.set_ylabel('Probability Density', fontsize=14)
    ax.tick_params(labelsize="large")
    ax.set_title('Normal Distribution of Dogs weight',fontsize=16)
    ax.legend(
        ['Weight Distribution',
         'mean',
         'standard error',
         'standard error',
         'measurements'
        ], 
        fontsize=14,
        bbox_to_anchor=(1, 1)
    )
    
    
def plot_dist(measurements, mu, std_error):
    fig, ax = plt.subplots(figsize=(8,5))

    # plotting distribution
    x = np.linspace(mu - 3.5*std_error, mu + 3.5*std_error, 100)
    y = norm.pdf(x, mu, std_error)
    plt.plot(x, y, c=sns.xkcd_rgb["grey"], lw=1)
    plt.fill_between(x, y, color=sns.xkcd_rgb["light blue"], alpha=0.2)
    plt.ylim(-0.01)
    y_min, y_max = ax.get_ylim()

    plot_mean(y_min, y_max)
    
    plot_standard_error(y_min, y_max)
    
    plot_measurements(measurements, y_min, y_max)
    
    setup_axes(ax)

    return plt.show()
In [10]:
plot_dist(sample_weights, mu, std_error)

So, the distribution shows what we believe using this approach: the weight is normally distributed with a mean of 15.2 lbs and a standard error of 1.2 lbs. The actual measurements are shown in red. Unfortunately, this curve is unsatisfyingly wide. While the peak is at 15.2 lbs, the true weight could easily be as low as 13 lbs or as high as 17 lbs. This is much too wide to make any kind of confident decision.

What options do we have at this point? It is common at times like this to go back and try to gather more data, but that may be infeasible or too expensive. In that case we are stuck with the measurements we already have; here, the dog's patience is used up, and we must move on with our 3 initial measurements.

And that is where Bayes Theorem comes in...

3.2.1 Bayes Theorem to the Rescue

Bayes Theorem is very useful for making the most out of small data sets. Let's take a minute to go over what it looks like again.

$$P(A \mid B)=\frac{P(B \mid A)*P(A)}{P(B)}$$

or in our case:

$$P(weight\;|\;measurements)=\frac{P(measurements\;|\;weight)*P(weight)}{P(measurements)}$$

simplified to:

$$P(w \mid m)=\frac{P(m \mid w)*P(w)}{P(m)}$$

We can clearly see 4 terms above. Each of these terms represents a different part of the process.

$P(w)$ - Prior

This represents our prior beliefs. In this case, it is what we believe about the dog's weight before we even put her on the scale.

"The probability of cancer before you know the test result"

$P(m \mid w)$ - Likelihood

This is the likelihood that our measurements would occur, given a particular weight. For example, given that the dog's weight is 200 lbs, what is the likelihood that we measure her to be 13.9, 14.1, and 17.5 lbs?

"If you know the patient has cancer, how likely are you to get a positive test result?"

$P(w \mid m)$ - Posterior

This is the posterior, which shows the probability of the dog being a weight, given the measurements that we made; this is what we are most interested in finding.

"What do we believe the probability of cancer is, given the test result?"

$P(m)$

This is the probability of the data, i.e. the probability that any given measurement will be observed. We will assume this is constant, i.e. that the scale is unbiased.

We can think of this formula as saying: "let's start with a belief $P(w)$, take some measurements and update it with $P(m \mid w)$, and then we have a new belief once done, $P(w \mid m)$."

3.2.2 Uniform Prior

Okay great. At this point, it is not a terrible idea to start perfectly agnostic, making no assumptions about the result. In this case, we assume that the dog's weight is equally likely to be 13 lbs, or 15 lbs, or 1 lb, or 100,000 lbs, and we will let the data speak.

To do this we assume a uniform prior, meaning that its probability is a constant for all values. Since $P(m)$ is also being treated as a constant, our equation reduces to a proportionality:

$$P(w \mid m) \propto P(m \mid w)$$

So we now want to calculate the probability of our measurements occurring, given a weight. We will want to do this for all of the possible weights!

Visual Representation

Before going through a few of these calculations by hand, I think it is very helpful to get a visual representation of what this situation actually looks like. When speaking of the likelihood, a good intuitive way to think about it is:

What is the total likelihood of observing our measurements, given that the dogs weight was normally distributed with a mean of x?

Notice the extra information that I added to this description: "normally distributed with a mean of...". Generally, for brevity, this portion is just left off, but I feel that omitting it makes understanding the likelihood (especially in this case) much more difficult. What we really are saying is: "Suppose the dog's weight is normally distributed with a mean of 16; how likely/probable would our observations be?"

So, we would have a normal curve, and could then determine the likelihood of each individual observation on that curve (in the animation below, this is the height of the red lines, or in other words the $y$ value). These likelihoods are not probabilities; rather, they are probability densities. Because each observation is independent, we can multiply them all together in order to get a total likelihood of the observations.

We can then repeat this, only now asking: "Suppose the dog's weight is normally distributed with a mean of 14; how likely/probable would our observations be?" We could then determine the total likelihood again. Finally, we can compare the total likelihood given a weight of 16 against the total likelihood given a weight of 14. Whichever is greater represents a weight that is more likely, given our observations. Visually, this can be seen below:

In [11]:
def mle_gaussian(x, mu, sigma):
    # Total log likelihood of the observations x under a Gaussian
    # with mean mu and standard deviation sigma
    likelihood_of_data = norm.pdf(x, mu, sigma)
    log_likelihood_of_data = np.log(likelihood_of_data)
    return np.sum(log_likelihood_of_data)
In [12]:
lower_bound = 12
upper_bound = 20
length = 2000

plt.ioff()                          

fig = plt.figure(figsize=(12, 5), dpi=150)       
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)

ax2.set_title(r"Total Likelihood of Observations vs. $w$", pad="10")
ax2.set_xlabel(r"$w$ (weight, lbs)", fontsize=20)

means = np.linspace(lower_bound, upper_bound, length)
sigma = std_error
expected_mean = mu
x_axis = np.arange(12, 20, 0.001)

# Set expected weight/mean in MLE plot
ax2.axvline(x=expected_mean, color=sns.xkcd_rgb["greenish"], linestyle="dashed")

# Create ax_mle to update on plot 2, create mu and likelihood arrays (needed as well)
ax_mle, = ax2.plot(0, 0, lw=3, c=sns.xkcd_rgb["red"])                         # Initialize plot object for the likelihood curve
marker1, = ax2.plot(lower_bound, 0, 'or')
mu_arr = np.array([])
likelihoods = np.array([])

mle_annotation = ax2.annotate(
    '',
    xy=(16, -3),
    size=18,
)

def animate(current_mu):
    # ------------- Gaussian Plot - left -------------
    ax1.clear()
    ax1.set_ylim(-0.015, 0.36) 
    
    # Plot Dist
    gaussian_dist = norm.pdf(x_axis, current_mu, sigma)
    ax1.plot(x_axis, gaussian_dist, c=sns.xkcd_rgb["grey"],linewidth=1)
    ax1.fill_between(x_axis, gaussian_dist, color=sns.xkcd_rgb["light red"], alpha=0.1)

    # Plot current mean/weight and expected mean/weight
    ax1.axvline(x=current_mu, color=sns.xkcd_rgb["dusty blue"], linestyle="dashed")
    ax1.axvline(x=expected_mean, color=sns.xkcd_rgb["greenish"], linestyle="dashed")
    
    # Annotate and set title/labels
    ax1.annotate(r"$w$ = {}".format(round(current_mu, 1)), [18,0.32], size=18)
    ax1.set_title(r"Gaussian Distribution: Parameterized by $w$", pad="10")
    ax1.set_xlabel(r"$m$ (measurements)", fontsize=18)
    
    # Plot Measurements
    y_min, y_max = ax1.get_ylim()
    for weight in sample_weights:
        ax1.axvline(
            x=weight,
            color=sns.xkcd_rgb["light red"],
            ymax=get_y_height(y_min, y_max, norm.pdf(weight, current_mu, std_error)),
            ymin=get_y_height(y_min, y_max, 0),
        )
        ax1.scatter(
            weight,
            norm.pdf(weight, current_mu, std_error),
            c=sns.xkcd_rgb["light red"],
            zorder=5,
            alpha=0.5
        )
        ax1.scatter(weight, 0, c=sns.xkcd_rgb["neon red"], zorder=5)
    

    # ------------- MLE Update - right -------------
    global mu_arr
    global likelihoods

    # Calculate current likelihood of mu/weight given the observation weights
    current_mu_likelihood = mle_gaussian(sample_weights, current_mu, sigma)
    mu_arr = np.append(mu_arr, current_mu)
    likelihoods = np.append(likelihoods, current_mu_likelihood)
    
    # Update plot with new likelihood included
    ax_mle.set_data(mu_arr, likelihoods)  
    marker1.set_data(current_mu, mle_gaussian(sample_weights, current_mu, sigma))

    # Annotate
    mle_annotation.set_text(r"$P(w = {} \mid m)$".format(round(current_mu, 1)))
        
    return ax_mle,

def init():
    ax2.set_xlim(12, 20)                               # Initialize x and y limits
    ax2.set_ylim(-30, 0) 
    return ax_mle,

""" Define steps and create animation object """
step = 0.01
steps = np.arange(lower_bound, upper_bound, step)

plt.tight_layout()

# For rendering html video in cell
# html_video = HTML(
#     animation.FuncAnimation(
#         fig,
#         animate,
#         steps,
#         init_func=init, 
#         interval=25,
#     ).to_html5_video()
# )
# display(html_video)

gif_video = animation.FuncAnimation(
        fig,
        animate,
        steps,
        init_func=init, 
        interval=25,
    )

gif_video.save('mle_estimation_dog.gif', writer='imagemagick')

plt.close()

Above, as we change the value of $w$, the dog's mean weight, we get a new normal distribution (shown on the left). For each distribution, we take the likelihood of each individual observation and multiply them together (multiplying the heights of the red vertical lines), getting a total likelihood. This total likelihood is plotted on the right (note, this total likelihood is actually the $\log$ of the likelihood, hence the negative values). The maximum likelihood is shown by the green dotted line. We can see that the maximum likelihood estimate (given a uniform prior) is just equal to the mean of our observations (15.2 lbs).

Example Calculation

To make this more concrete, let's run through an example and determine the likelihood of our measurements given a specific weight: 17 lbs.

$$P(w=17 \mid m=[13.9,14.1,17.5]) \propto P(m=[13.9,14.1,17.5] \mid w=17)$$

$$P(m=[13.9,14.1,17.5] \mid w=17) = P(m=13.9 \mid w=17) * P(m=14.1 \mid w=17) * P(m=17.5 \mid w=17)$$

$$P(m=[13.9,14.1,17.5] \mid w=17) = N(m=13.9 \mid w=17, \sigma^2) * N(m=14.1 \mid w=17, \sigma^2) * N(m=17.5 \mid w=17, \sigma^2)$$

Here, the likelihood is the product of the three terms on the right. This can be repeated for all values of the weight (in the visualization I repeat it for $[12, 20]$). For additional clarity, the terms on the right each constitute a normal distribution with mean $w = 17$ and variance $\sigma^2$ (in this case $stderr^2$), evaluated at the respective observation (13.9, 14.1, 17.5). Each evaluation yields a probability density, which we then multiply together in order to get a final likelihood.
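Numerically, this calculation is only a few lines (reusing sample_weights and std_error from above):

In [ ]:
# Evaluate the normal density at each measurement, assuming w = 17,
# then multiply the three densities to get the total likelihood
w = 17
densities = norm.pdf(sample_weights, w, std_error)
print("Individual densities:", np.round(densities, 5))
print("Total likelihood:    ", np.prod(densities))  # ~4.9e-05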

Likelihood vs. Probability

One thing that may be slightly confusing at this point is the difference between probability and likelihood. If you have gone through my posts on probability and statistics, you may remember that I said you could not utilize a continuous probability distribution to get a point estimate for the probability of a specific value. For instance, if you had a distribution of the dog's weight (say with mean 15 lbs), you could not use it to get the probability of observing a weight of exactly 13.2 lbs. The reason is that a continuous distribution has a $y$ axis that represents probability density, not probability. The entire area under the curve must sum to one if it is to be a valid distribution, and if you want to determine the probability of something, you must utilize the area under the curve. (Note that the term probability density derives from the term probability mass. Probability mass was used first, essentially to represent a standard probability. When dealing with a curve that needed to be integrated to obtain the probability/probability mass, the term probability density made sense, considering that the integration of a density yields a mass.)

Specifically, in our case above, the $y$ axis of the distribution represents probability per lb. Hence, the area under the curve is:

$$\frac{probability}{lb} * lb = probability$$

So, clearly we need to multiply our $y$ value by a certain interval of $x$ (lbs) in order to get a valid probability. And yet, above it seems as though we are using point estimates and not performing this multiplication/normalization. That is because we are dealing with the likelihood, and the likelihood does not need to be a valid probability. A way of thinking about this is as follows; consider our original equation for the posterior distribution:

$$P(A \mid B)=\frac{P(B \mid A)*P(A)}{P(B)}$$

$$Posterior=\frac{Likelihood*Prior}{Probability \; of \; data}$$

In order to ensure that the posterior distribution is valid, we must divide by this probability of data term. However, our likelihood has no such division occurring. On its own, it has not been normalized in a way that ensures summation to one.

One thing that makes this extra confusing is standard notation. Generally, $p(x)$ is thought of as probability, but in actuality:

$$p(x) = \text{Probability Density}$$

And, subsequently:

$$p(x) \delta x = \text{Probability of measuring X in } [x, x + \delta x]$$

$$\delta x = \text{interval length}$$
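To put numbers on this, here is a small check (it reuses mu and std_error from the dog example above): the density at a point times a small interval width closely matches the exact area under the curve over that interval.

In [ ]:
# Probability that a measurement lands in the small interval [x, x + dx],
# approximated as density * interval width, versus the exact area
x, dx = 15.2, 0.1
approx = norm.pdf(x, mu, std_error) * dx
exact = norm.cdf(x + dx, mu, std_error) - norm.cdf(x, mu, std_error)
print(f"density * dx: {approx:.5f}")  # ~0.034
print(f"exact area:   {exact:.5f}")   # ~0.034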

A derivation for this entire concept is included in the appendix.

Maximum Likelihood Estimate

The main takeaway here is that by shifting our beliefs about what the dog's weight may be, we were able to find the weight value that made the measurements most likely; in this case it was 15.2 lbs! This is known as the Maximum Likelihood Estimate!

This is very interesting! When you take the average and calculate the standard deviation and standard error, you get the same answer as applying Bayes' method with a uniform prior! So, we found the exact same mean and probability distribution as before, only this time we used Bayes' method.

However, this means we are not much closer to a useful estimate. To get one, we will need to make our prior non-uniform.

3.2.3 Non-Uniform Prior (True Bayesian Inference)

The key to Bayesian Inference is the use of a non-uniform prior. This allows us to essentially start off with a base idea of what we expect, and then update it based on newly acquired evidence (the likelihood). For instance, let's say that the last time we came to the vet's office to weigh the dog, we found that she weighed 14.2 lbs, with a standard error of 1 lb. The dog does not seem noticeably heavier than last time, so let's define this as our prior, namely:

$$p(w) = N(\mu = 14.2, \sigma = 1)$$

This probability distribution is what we believe about the dog's weight before we even put her on the scale. It is the prior probability of her being a given weight. Visually, this distribution can be seen below:

In [123]:
def get_y_height(y_min, y_max, actual_y_pos):
    total_height = abs(y_min) + abs(y_max)
    height_of_point = abs(y_min) + actual_y_pos
    return height_of_point / total_height

def plot_mean(y_min, y_max, mu, std_error):
    # Plotting mean
    plt.axvline(
        x=mu,
        color=sns.xkcd_rgb["grey"], 
        ymax=get_y_height(y_min, y_max, norm.pdf(mu, mu, std_error)),
        ymin=get_y_height(y_min, y_max, 0),
        alpha=0.5,
        linestyle="dashed",
        lw=1
    )
    

def plot_standard_error(y_min, y_max, mu):
    # Plotting standard error
    plt.axvline(
        x=mu-std_error,
        color=sns.xkcd_rgb["sea blue"],
        ymax=get_y_height(y_min, y_max, norm.pdf(mu-std_error, mu, std_error)),
        ymin=get_y_height(y_min, y_max, 0),
        linestyle="dashed",
        lw=1
    )
  
    plt.axvline(
        x=mu+std_error,
        color=sns.xkcd_rgb["sea blue"],
        ymax=get_y_height(y_min, y_max, norm.pdf(mu+std_error, mu, std_error)),
        ymin=get_y_height(y_min, y_max, 0),
        linestyle="dashed",
        lw=1
    )

    
def plot_measurements(measurements, y_min, y_max):
    # Plotting our measurements
    for measurement in measurements:
        plt.axvline(
            x=measurement,
            color=sns.xkcd_rgb["light purple"],
            ymax=get_y_height(y_min, y_max, norm.pdf(measurement, mu, std_error)),
            ymin=get_y_height(y_min, y_max, 0),
        )
        plt.scatter(
            measurement,
            norm.pdf(measurement, mu, std_error),
            c=sns.xkcd_rgb["light purple"],
            zorder=5,
            alpha=0.5
        )
        plt.scatter(measurement, 0, c=sns.xkcd_rgb["neon purple"], zorder=5)

        
def setup_axes(ax):
    # setting axes
    ax.set_xlabel('Weight in Pounds', fontsize=14)
    ax.set_ylabel('Probability Density', fontsize=14)
    ax.tick_params(labelsize="large")
    ax.set_title('Prior Distribution of Dogs weight',fontsize=16)
    ax.legend(
        ['Weight Distribution',
         'mean',
         'standard error',
         'standard error',
        ], 
        fontsize=14,
        bbox_to_anchor=(1, 1)
    )
    
    
def plot_dist(measurements, mu, std_error, fill_color):
    fig, ax = plt.subplots(figsize=(8,5))

    # plotting distribution
    x = np.linspace(mu - 3.5*std_error, mu + 3.5*std_error, 100)
    y = norm.pdf(x, mu, std_error)
    plt.plot(x, y, c=sns.xkcd_rgb["grey"], lw=1)
    plt.fill_between(x, y, color=sns.xkcd_rgb[fill_color], alpha=0.5)
    plt.ylim(-0.01)
    y_min, y_max = ax.get_ylim()

    plot_mean(y_min, y_max, mu, std_error)
    
    plot_standard_error(y_min, y_max, mu)
    
    plot_measurements(measurements, y_min, y_max)
    
    setup_axes(ax)

    return plt.show()

plot_dist([], 14.2, 1, "pale yellow")

So, now in order to find our posterior distribution, our equation will look like:

$$P(w \mid m) \propto P(m \mid w)*P(w)$$

From a mathematical perspective, we are trying to find the value of $w$ that maximizes the right hand side of the equation:

$$\hat{w} = argmax_w \Big( P(m \mid w)*P(w) \Big) $$

Where $\hat{w}$ is known as the maximum a posteriori (MAP) estimate. The way this plays out is that we iterate through all values of $w$; for each one we evaluate the prior density at $w$ and the likelihood of the measurements given $w$, and multiply them together to evaluate the right hand side of the equation. A direct grid-search sketch of this procedure is shown below.
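Before animating the full procedure, here is a minimal grid-search sketch of it (an illustration with assumptions: it reuses sample_weights and std_error from above, takes each individual measurement's noise to be the sample standard deviation, std_error * sqrt(N), and uses the prior N(14.2, 1) defined above):

In [ ]:
# Each individual measurement's noise is the sample standard deviation;
# std_error is the standard error of the mean, so multiply by sqrt(N)
measurement_std = std_error * np.sqrt(len(sample_weights))

w_grid = np.linspace(12, 20, 2000)
log_posterior = [
    norm.logpdf(sample_weights, w, measurement_std).sum()
    + norm.logpdf(w, 14.2, 1.0)  # prior N(14.2, 1)
    for w in w_grid
]
w_map = w_grid[np.argmax(log_posterior)]
print(f"MAP estimate of the weight: {w_map:.2f} lbs")  # ~14.6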

Again, this is done for all $w$, and we are trying to find the $w$ that maximizes the right hand side of the above equation. Visually, this process can be seen in the animation below:

In [138]:
def bayesian_inference(x, mu, sigma, prior):
    # sigma passed in is the standard error of the mean; each individual
    # measurement's noise is the sample standard deviation, sigma * sqrt(N)
    measurement_std = sigma * np.sqrt(len(x))
    # Log posterior (up to a constant): the summed log likelihood of the
    # observations, plus the log of the prior density (added once)
    log_likelihood_of_data = np.log(norm.pdf(x, mu, measurement_std))
    return np.sum(log_likelihood_of_data) + np.log(prior)


lower_bound = 12
upper_bound = 20
length = 2000

plt.ioff()                          

fig = plt.figure(figsize=(15, 5), dpi=150)   
ax1a = fig.add_subplot(1, 3, 1)
ax1 = fig.add_subplot(1, 3, 2)
ax2 = fig.add_subplot(1, 3, 3)

ax2.set_title(r"Likelihood of Measurements x Prior", pad="10", size=14)
ax2.set_xlabel(r"$w$ (weight, lbs)", fontsize=14)

means = np.linspace(lower_bound, upper_bound, length)
sigma = std_error
expected_mean = mu
prior_mu = 14.2
prior_stderr = 1
x_axis = np.arange(lower_bound, upper_bound, 0.001)
x_axis_prior = np.arange(10, upper_bound, 0.001)
prior_dist = norm.pdf(x_axis_prior, prior_mu, prior_stderr)
posterior_mu_expected = 14.6

# Set expected weight/mean in MLE plot
ax2.axvline(x=posterior_mu_expected, color=sns.xkcd_rgb["greenish"], linestyle="dashed")

# Create ax_mle to update on plot 2, create mu and likelihood arrays (needed as well)
ax_mle, = ax2.plot(0, 0, lw=3, c=sns.xkcd_rgb["purple"])                       
marker1, = ax2.plot(lower_bound, 0, 'o', c="purple")
mu_arr = np.array([])
posteriors = np.array([])

mle_annotation = ax2.annotate(
    '',
    xy=(16.5, -10),
    size=14,
)

def animate(current_mu):
    # ------------- Prior Plot - Left -------------
    ax1a.clear()
    ax1a.set_ylim(-0.015, 0.42) 
    ax1a.plot(x_axis_prior, prior_dist, c=sns.xkcd_rgb["grey"], linewidth=1)
    ax1a.fill_between(x_axis_prior, prior_dist, color=sns.xkcd_rgb["pale yellow"], alpha=0.5)
    ax1a.axvline(x=prior_mu, color=sns.xkcd_rgb["grey"], linestyle="dashed", linewidth=1)
    
    # Plot prior mean
    y_min, y_max = ax1a.get_ylim()
    ax1a.axvline(
        x=current_mu,
        color=sns.xkcd_rgb["blue"],
        ymax=get_y_height(y_min, y_max, norm.pdf(current_mu, prior_mu, prior_stderr)),
        ymin=get_y_height(y_min, y_max, 0),
    )
    ax1a.scatter(
        current_mu,
        norm.pdf(current_mu, prior_mu, prior_stderr),
        c=sns.xkcd_rgb["blue"],
        zorder=5,
        alpha=0.5
    )
    ax1a.scatter(current_mu, 0, c=sns.xkcd_rgb["blue"], zorder=5)
    ax1a.axvline(x=posterior_mu_expected, color=sns.xkcd_rgb["greenish"], linestyle="dashed")

    # Annotate and set title/labels
    ax1a.annotate(r"$w$ = {}".format(round(current_mu, 1)), [17,0.37], size=14)
    ax1a.set_title(f"Prior Distribution: $P(w ; \mu$ = {prior_mu})", pad="10", size=14)
    ax1a.set_xlabel(r"$w$ (weight, lbs)", fontsize=14)
    
    # ------------- Gaussian Plot - Middle -------------
    ax1.clear()
    ax1.set_ylim(-0.015, 0.36) 
    
    # Plot Dist
    gaussian_dist = norm.pdf(x_axis, current_mu, sigma)
    ax1.plot(x_axis, gaussian_dist, c=sns.xkcd_rgb["grey"],linewidth=1)
    ax1.fill_between(x_axis, gaussian_dist, color=sns.xkcd_rgb["light red"], alpha=0.1)

    # Plot current mean/weight and expected mean/weight
    ax1.axvline(x=current_mu, color=sns.xkcd_rgb["dusty blue"], linestyle="dashed")
    ax1.axvline(x=posterior_mu_expected, color=sns.xkcd_rgb["greenish"], linestyle="dashed")
    
    # Annotate and set title/labels
    ax1.annotate(r"$w$ = {}".format(round(current_mu, 1)), [18,0.32], size=14)
    ax1.set_title(r"Likelihood of Measurements", pad="10", size=14)
    ax1.set_xlabel(r"$m$ (measurements)", fontsize=15)
    
    # Plot Measurements
    y_min, y_max = ax1.get_ylim()
    for weight in sample_weights:
        ax1.axvline(
            x=weight,
            color=sns.xkcd_rgb["light red"],
            ymax=get_y_height(y_min, y_max, norm.pdf(weight, current_mu, std_error)),
            ymin=get_y_height(y_min, y_max, 0),
        )
        ax1.scatter(
            weight,
            norm.pdf(weight, current_mu, std_error),
            c=sns.xkcd_rgb["light red"],
            zorder=5,
            alpha=0.5
        )
        ax1.scatter(weight, 0, c=sns.xkcd_rgb["neon red"], zorder=5)
    
    # ------------- Bayesian Inference Update - right -------------
    global mu_arr
    global posteriors

    # Bayesian inference: evaluate the prior density at current_mu, then
    # compute the log posterior of current_mu given the observed weights
    prior = norm.pdf(current_mu, prior_mu, prior_stderr)
    current_mu_posterior = bayesian_inference(sample_weights, current_mu, sigma, prior)
    mu_arr = np.append(mu_arr, current_mu)
    posteriors = np.append(posteriors, current_mu_posterior)
    
    # Update plot with new posterior value included
    ax_mle.set_data(mu_arr, posteriors)  
    marker1.set_data(current_mu, current_mu_posterior)

    # Annotate
    mle_annotation.set_text(r"$\prod \; P(w \mid m) P(w)$")
        
    return ax_mle,

def init():
    ax2.set_xlim(12, 20)                               # Initialize x and y limits
    ax2.set_ylim(-100, 0) 
    return ax_mle,

""" Define steps and create animation object """
step = 0.01
steps = np.arange(lower_bound, upper_bound, step)

plt.tight_layout()

# For rendering html video in cell
# html_video = HTML(
#     animation.FuncAnimation(
#         fig,
#         animate,
#         steps,
#         init_func=init, 
#         interval=25,
#     ).to_html5_video()
# )
# display(html_video)

gif_video = animation.FuncAnimation(
        fig,
        animate,
        steps,
        init_func=init, 
        interval=25,
    )

gif_video.save('bayesian_inference.gif', writer='imagemagick')


plt.close()

We see our prior on the left, our likelihood distribution in the center, and our likelihood times prior on the right. For the prior, we evaluate the distribution at $w$, which gives us a density (the height of the blue line). For the likelihood, we determine how likely our data is given the distribution centered at $w$, yielding the heights of the red lines. The likelihood densities and the prior density are multiplied together to get the purple curve seen on the right. The maximum value of that curve corresponds to the maximum a posteriori estimate of the dog's weight.

As a small example, if we were to freeze the animation above when $w = 17$, our calculation would look like:

$$P(w=17 \mid m=[13.9,14.1,17.5]) \propto P(m=[13.9,14.1,17.5] \mid w=17) * P(w=17)$$

Remember, we can think of this process as updating our beliefs surrounding the dog's weight. With the prior we utilized, we began by saying "We think the weight is 14.2 lbs". We then incorporated our new evidence (the measurements), and updated our belief to: "We think the weight, based on the new evidence, is 14.6 lbs".

A. Appendix

A.1 Likelihood vs. Probability

Likelihood is an interesting concept. It is not a probability, but it is proportional to a probability. A likelihood is what you get when a probability density function is evaluated at a single point. The evaluation yields an output with units:

$$\frac{\text{Probability}}{\text{unit of } x}$$

In the Bayesian context, the likelihood of a hypothesis, $H$, given some data, $D$, is proportional to the probability of obtaining $D$ given that $H$ is true, multiplied by a positive constant $k$. In other words:

$$L(H \mid D) = k * P(D \mid H)$$

Since the likelihood is not an actual probability, it doesn't obey the various rules of probability. For example, likelihoods do not need to sum to 1.

A critical difference between probability and likelihood is in the interpretation of what is fixed and what can vary. In the case of conditional probability, $P(D \mid H)$, the hypothesis is fixed and the data are free to vary. Likelihood, however, is the opposite. The likelihood of a hypothesis, $L(H \mid D)$, conditions on the data as if they are fixed while allowing the hypothesis to vary.

To be clear, likelihoods are meaningless in isolation. This lack of meaning is due to the arbitrary constant shown above, $k$. Only by comparing likelihoods do they become interpretable, because the constants cancel out. The easiest way to explain this is via a simple example.

Suppose that I flip a coin 10 times and it comes up 6 heads and 4 tails. If the coin were fair, $p(heads)=0.5$, the probability of this occurrence is defined by the binomial distribution:

$$P(X=x) = {n\choose x} p^x (1-p)^{n-x}$$

Where ${n\choose x}$ is the binomial coefficient:

$${n\choose x} = \frac{n!}{x!(n-x)!}$$

If we substitute in our values we get:

$$P(X=6) = \frac{10!}{6!4!} 0.5^6 (1-0.5)^{4} \approx 0.21$$

Now, if the coin happened to be a trick coin, so that $p(heads)=0.75$, the probability of 6 heads in 10 tosses is:

$$P(X=6) = \frac{10!}{6!4!} 0.75^6 (1-0.75)^{4} \approx 0.15$$

To quantify the statistical evidence for the first hypothesis against the second, we simply divide one probability by the other, yielding a likelihood ratio. In this case it is:

$$LR = \frac{0.21}{0.15} = 1.4$$

In English, this just means that the data are 1.4 times as probable under the fair coin hypothesis as under this particular trick coin hypothesis. Notice that when performing the division, the leading terms, $\frac{10!}{6!4!}$, cancel each other out. In other words:

Both likelihoods have the same data. Both have the same constant. Those constants cancel out.

The $\frac{10!}{6!4!}$ term in the equations above details the journey that we take in obtaining 6 heads out of 10 tosses. If we change the journey (implement a different sampling plan), then the term's value may change, but crucially, since it is the same term in both likelihoods, it will always cancel out.

Now, so far we have only compared two likelihoods. However, we can actually plot the entire likelihood function! This allows us to find where it has a maximum, and hence the hypothesis that maximizes the likelihood of our data.
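As a sketch of that idea, using the 6-heads-in-10-flips data from above, we can evaluate the binomial likelihood over a grid of $p$ values and plot it; the maximum sits at $p = 0.6$:

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom

p_grid = np.linspace(0.01, 0.99, 200)
likelihood = binom.pmf(6, 10, p_grid)  # L(p | 6 heads in 10 flips)

plt.plot(p_grid, likelihood)
plt.axvline(x=0.6, linestyle="dashed", color="grey")  # maximum at p = 6/10
plt.xlabel("p (probability of heads)")
plt.ylabel("Likelihood")
plt.show()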

A.2 Why do we use the density to find the MLE for continuous distributions?

The idea of the maximum likelihood estimate is to find the value of the parameter(s) for which the data has the highest probability. In reality, we do this with densities. To illustrate, let's use an example.

Let's say that we have two light bulbs whose lifetimes follow an exponential$(\lambda)$ distribution. Suppose that we independently measure their lifetimes and get the data $x_1 = 2$ years and $x_2 = 3$ years. We want to find the value of $\lambda$ that maximizes the probability of observing this data.

The main paradox to deal with is that for a continuous distribution the probability of a single value, say $x_1=2$ is zero. We can resolve this paradox by remembering that a single measurement really means a range of values, e.g. in this example we may check the light bulb once per day. So the data $x_1 = 2$ years really means $x_1$ is somewhere in a range of 1 day around 2 years.

If the range is small, we call it $dx_1$. The probability that $X_1$ is in the range is approximated by $f_{X_1}(x_1 \mid \lambda) \cdot dx_1$, as seen in the formula below. The data value $x_2$ is treated in exactly the same way.

Since the data is collected independently the joint probability is the product of the individual probabilities. Stated carefully:

$$P(X_1 \text{ in range}, X_2 \text{ in range}) \approx f_{X_1}(x_1 \mid \lambda) dx_1 \cdot f_{X_2}(x_2 \mid \lambda) dx_2$$

Finally, using the values $x_1 = 2$ and $x_2 = 3$ and the formula for an exponential pdf we have:

$$P(X_1 \text{ in range}, X_2 \text{ in range} \mid \lambda) \approx \lambda e^{-2\lambda} dx_1 \cdot \lambda e^{-3 \lambda} dx_2 = \lambda^2 e^{-5 \lambda} dx_1 dx_2$$

Now that we have a genuine probability we can look for the value of $\lambda$ that maximizes it. Looking at the formula, we can see that the factor $dx_1 dx_2$ will play no role in finding the maximum! So for the MLE we drop it and simply call the density the likelihood:

$$Likelihood = f(x_1, x_2 \mid \lambda) = \lambda^2 e^{-5 \lambda}$$

And by utilizing the log likelihood and a bit of calculus, we arrive at our final answer:

$$Log \; Likelihood = 2 ln(\lambda) - 5 \lambda$$

$$\frac{d}{d \lambda} (Log \; Likelihood ) = \frac{2}{\lambda} - 5 = 0$$

$$\lambda = \frac{2}{5}$$
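We can sanity check this answer numerically (a sketch using scipy's bounded scalar minimizer on the negative log likelihood):

In [ ]:
import numpy as np
from scipy.optimize import minimize_scalar

# Negative log likelihood for the two observed lifetimes (2 and 3 years)
def neg_log_likelihood(lam):
    return -(2 * np.log(lam) - 5 * lam)

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 10), method="bounded")
print(f"MLE for lambda: {result.x:.4f}")  # ~0.4 = 2/5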


© 2018 Nathaniel Dake