The main goal of this post is to dig a bit further into Bayes Rule, from a purely probabilistic perspective! Before we begin, I do want to make one note: a great deal of the power of Bayes Rule comes in the form of Bayesian inference and Bayesian statistics, which can be found in the statistics section. I would recommend reading both of those posts as well if you are interested, since they demonstrate the application of Bayes Rule to real world problems. If you have caught the Bayesian bug at that point, then I recommend reading my posts on Bayesian AB testing, found in the Machine Learning section.
One more thing to note: I am going to hold off on explaining the importance of Bayes Rule until the end, and its many use cases will in reality be spread throughout the aforementioned posts. Just another reason to go through them all. With that out of the way, let's begin!
We worked with Bayes Rule briefly in the probability introduction, but just to recap, it can be derived as follows:
We know that the below statement represents the conditional probability of $A$ given $B$:
$$p(A \mid B)=\frac{p(A,B)}{p(B)}$$And we also know that the same holds with the roles of $A$ and $B$ swapped:
$$p(B \mid A)=\frac{p(B,A)}{p(A)}$$And since:
$$p(A,B)=p(B,A)$$We can write:
$$p(A \mid B)=\frac{p(B \mid A)*p(A)}{p(B)}$$Now, oftentimes we may not have $p(B)$ directly, but this is just the marginal distribution of the joint probability $p(A,B)$, obtained by summing over all values of $A$. It looks like:
$$p(B)=\sum_ip(A_i,B) = \sum_ip(B \mid A_i)*p(A_i)$$If we are working with continuous distributions, the sum turns into an integral.
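To make this concrete, here is a minimal Python sketch (the prior and likelihood numbers are made up, purely for illustration) that computes $p(B)$ by summing $p(B \mid A_i)*p(A_i)$ over all values of $A$, and then uses it to normalize the posterior:

```python
import numpy as np

# Hypothetical prior over three mutually exclusive values of A.
prior = np.array([0.5, 0.3, 0.2])        # p(A_i)
# Hypothetical likelihood of observing B under each value of A.
likelihood = np.array([0.9, 0.5, 0.1])   # p(B | A_i)

# Marginalize: p(B) = sum_i p(B | A_i) * p(A_i)
p_B = np.sum(likelihood * prior)

# Bayes Rule: p(A_i | B) = p(B | A_i) * p(A_i) / p(B)
posterior = likelihood * prior / p_B

print(p_B)        # 0.62
print(posterior)  # approximately [0.726, 0.242, 0.032], sums to 1
```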
Another way to think of this is that the term on the bottom is just a normalization constant $Z$ that ensures the distribution sums to one.
$$p(A \mid B)=\frac{p(B \mid A)*p(A)}{Z}$$Another way of saying this is that they are proportional:
$$p(A \mid B)\propto p(B \mid A)*p(A)$$Now this is a very powerful fact! Because the denominator ($p(B)$) does not depend on $A$, if we are simply trying to find the value of $A$ that maximizes the conditional probability $p(A \mid B)$, we can ignore the denominator! In other words, this is used when we are trying to find the argmax of a distribution:
$$\operatorname{argmax}_A \, p(A \mid B)$$So, we don't need to know the actual value of the probability, just the particular $A$ that gives us the maximum probability. Because $Z$ is independent of $A$:
$$\operatorname{argmax}_A \, p(A \mid B) = \operatorname{argmax}_A \, p(B \mid A)\,p(A)$$This leads us into one of the main uses for Bayes Rule.
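As a quick sanity check, here is a small sketch (reusing the same hypothetical numbers as above) showing that the argmax of the unnormalized quantity $p(B \mid A)\,p(A)$ matches the argmax of the full posterior, precisely because $Z$ does not depend on $A$:

```python
import numpy as np

prior = np.array([0.5, 0.3, 0.2])        # p(A_i), hypothetical
likelihood = np.array([0.9, 0.5, 0.1])   # p(B | A_i), hypothetical

unnormalized = likelihood * prior                # proportional to p(A_i | B)
posterior = unnormalized / unnormalized.sum()    # divide by Z = p(B)

# Z does not depend on A, so both give the same index.
print(np.argmax(unnormalized), np.argmax(posterior))  # 0 0
```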
In the context of the Bayes Classifier, $y$ represents the class, and $x$ represents the data.
$$p(y \mid x)=\frac{p(x \mid y)*p(y)}{p(x)}$$We refer to $p(x \mid y)$ as the generative distribution, because it tells us what the features look like for a given class $y$.
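To illustrate the idea, here is a minimal toy sketch of a generative classifier; the class names, priors, and class-conditional distributions are all hypothetical, chosen only to show that scoring classes by $p(x \mid y)*p(y)$ is enough to pick the most probable class:

```python
import numpy as np

# Toy generative classifier with a single discrete feature x in {0, 1, 2}.
# All names and numbers here are hypothetical, purely for illustration.
p_y = {"spam": 0.4, "ham": 0.6}              # class priors p(y)
p_x_given_y = {
    "spam": np.array([0.1, 0.3, 0.6]),       # p(x | y = spam)
    "ham":  np.array([0.5, 0.4, 0.1]),       # p(x | y = ham)
}

def classify(x):
    # Score each class by the unnormalized posterior p(x | y) * p(y).
    scores = {y: p_x_given_y[y][x] * p_y[y] for y in p_y}
    return max(scores, key=scores.get), scores

print(classify(2))  # ('spam', {'spam': 0.24, 'ham': 0.06})
```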
Note that while the Bayes classifier does make use of Bayes Rule, it does NOT necessarily make use of Bayesian statistics. For more information on exactly what that means, please see the posts on Bayesian Statistics. Again, the purpose of this post is really just to demonstrate its role when purely confined to basic probability problems.
We are now going to go over a few brief examples where Bayes Rule can be applied in a simple probabilistic setting. First we can start with a very famous problem in probability known as The Monty Hall Problem. Imagine you are on a game show and you have to pick a door. There are 3 doors, and behind 1 of the doors there is a car, and behind the other two doors there are goats. Here is how the game works:
1. You pick one of the three doors (say, door 1).
2. Monty Hall, who knows where the car is, opens one of the other two doors, always revealing a goat.
3. You are then given the choice to stick with your original door or switch to the remaining closed door.
The big question is, which door should you choose?
So, remember, you choose door 1, and each probability is conditioned on this. We then define the following:
$$ C = \text{where the car really is}$$$$ p(C=1) = p(C=2) = p(C=3) = 1/3$$For example, $p(C=1)$ represents the probability that a car is behind door 1. We can then define the random variable $H$:
$$ H = \text{random variable to represent the door that Monty Hall opens}$$We can assume he opens door 2 without loss of generality, since the problem is symmetric.
$$p(H=2 \mid C=1) = 0.5$$Remember that you chose door 1. So if the car is behind door 1, he can choose either door 2 or 3 since they will each be a goat. If the car is behind door 2, he cannot open door 2, so the probability is 0:
$$ p(H=2 \mid C=2) = 0$$Similarly, if the car is behind door 3, then Monty Hall has to open door 2, since that is the only door left with a goat:
$$p(H=2 \mid C=3) = 1$$Now, what probability do we actually want? We want to know if we should stick with door 1 or switch to door 3. In other words we want to compare:
$$p(C=1 \mid H=2) \text{ vs. } p(C=3 \mid H=2)$$Now, we can do that using Bayes Rule!
$$p(A \mid B)=\frac{p(B \mid A)*p(A)}{p(B)}$$$$p(A \mid B)=\frac{p(B \mid A)*p(A)}{\sum_ip(B \mid A_i)*p(A_i)}$$Where in our case:
$$A: C=3 \;, B: H=2$$$$p(C=3 \mid H=2) = \frac{p(H=2 \mid C=3)p(C=3)}{p(H=2)}$$$$p(C=3 \mid H=2) = \frac{p(H=2 \mid C=3)p(C=3)}{p(H=2 \mid C=1)p(C=1)+p(H=2 \mid C=2)p(C=2)+p(H=2 \mid C=3)p(C=3)}$$$$p(C=3 \mid H=2) = \frac{\frac{1}{3}}{\frac{1}{2}*\frac{1}{3}+0*\frac{1}{3}+1*\frac{1}{3}} = \frac{2}{3}$$And we can similarly show:
$$p(C=1 \mid H=2) = \frac{1}{3}$$Hence, by the above application of Bayes Rule it is clear that we should always switch doors!
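If you would rather trust code than algebra, here is a small Python sketch that mirrors the calculation above and also verifies it with a brute-force simulation (the trial count is an arbitrary choice):

```python
import random

# Exact calculation, mirroring the derivation above (we always pick door 1).
prior = {1: 1/3, 2: 1/3, 3: 1/3}        # p(C = i)
likelihood = {1: 0.5, 2: 0.0, 3: 1.0}   # p(H = 2 | C = i)

p_H2 = sum(likelihood[c] * prior[c] for c in prior)              # p(H = 2) = 1/2
posterior = {c: likelihood[c] * prior[c] / p_H2 for c in prior}
print(posterior)  # {1: 0.333..., 2: 0.0, 3: 0.666...}

# Brute-force simulation as a sanity check.
def play(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        doors = [1, 2, 3]
        car = random.choice(doors)
        pick = 1
        # Monty opens a door that is neither our pick nor the car.
        opened = random.choice([d for d in doors if d != pick and d != car])
        if switch:
            pick = next(d for d in doors if d != pick and d != opened)
        wins += (pick == car)
    return wins / trials

print(play(switch=False))  # ~ 1/3
print(play(switch=True))   # ~ 2/3
```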
We can also think about the problem like so:
$$ p(C=1) = 1/3 $$$$ p(C=2) = 1/3$$ $$ p(C=3) = 1/3$$ $$ p(C=2 \text{ or } C=3) = 2/3$$Now let's say that we pick door 1, and Monty Hall opens door 2, showing us there is a goat behind it. We now know that $p(C=2) = 0$. In other words, Monty has revealed certain information to us that we did not have originally. Hence, our equation $p(C=2 \text{ or } C=3) = 2/3$ still remains true, which means that $p(C=3) = 2/3$ and $p(C=1) = 1/3$. So we want to pick door 3! Note the reason this happens is because once door 2 is opened, it is known and is no longer a random variable.
Now, this problem is often referred to as a paradox. The reason it is viewed as a paradox is because it violates general human intuition and common sense. Now, this section will touch on some more advanced topics such as causal analysis (which will be covered in later posts), but I would feel remiss if I did not add a few sentences on the topic.
In general, human intuition operates under the logic of causation, while data conform to the logic of probabilities and proportions. Paradoxes often arise when we misapply the rules we have learned in one realm to another. In the case of the Monty Hall problem, the main thing needed to resolve this apparent paradox is that we must take into account not only the data, but also the data generating process (the rules of the game). The main idea is as follows:
The way that we obtain information is no less important than the information itself.
Based on the rules of the game, we can deduce the following: since we picked door 1, Monty cannot open door 1. However, he could have opened door 3. If, instead, he chooses to open door 2, it is more likely that he opened door 2 because he was forced to (the car is behind door 3). This gives us more evidence than before that the car is behind door 3.
If we start wading into the waters of causation, we learn that our minds rebel at the possibility of a correlation without a causation, because we have learned to associate the two since birth. Causeless correlation violates our common sense.
Let's look at another example of where Bayes Rule comes into play. Suppose we are doing disease testing. We would take a blood sample, extract some features from it, and output whether or not that person has the disease. So, we would have: $x$ representing the features extracted from the blood sample, and $y$ representing whether or not the person has the disease ($y=1$ for diseased, $y=0$ for healthy).
Let's look further at a realistic scenario where this is involved. Most people are healthy most of the time, so suppose that only 1% of the population has the disease. We can build a classifier that just predicts "no" each time. In other words, it doesn't learn anything, yet it is already correct for 99% of cases! Hence, accuracy is not always the best metric to utilize; perhaps overall accuracy is not what we actually care about.
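Here is a tiny sketch of that scenario (the 1% prevalence comes from the text; the population size and random seed are arbitrary choices for illustration): a classifier that always predicts "no" achieves roughly 99% accuracy while flagging none of the people who actually have the disease.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical population of 100,000 people where 1% have the disease.
disease = rng.random(100_000) < 0.01

# A "classifier" that learns nothing and always predicts "no disease".
prediction = np.zeros_like(disease)

accuracy = np.mean(prediction == disease)
fraction_of_sick_caught = prediction[disease].mean()  # flagged among the truly diseased

print(accuracy)                 # ~0.99
print(fraction_of_sick_caught)  # 0.0
```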
What we actually want to measure is $p(prediction=1 \mid disease=1)$. This is called the true positive rate. In medical terminology this is referred to as sensitivity. In information retrieval it is known as hit rate or recall.
We can solve for the above using the definition of conditional probability (the same identity Bayes Rule is built from):
$$p(prediction=1 \mid disease=1) = \frac{p(prediction=1, disease=1)}{p(disease=1)}$$Typically, we count 4 things:
| | Prediction = 1 | Prediction = 0 |
|---|---|---|
| Disease = 1 | True Positive | False Negative |
| Disease = 0 | False Positive | True Negative |
With that said, we can calculate sensitivity as follows:
$$p(prediction=1 \mid disease=1) = \frac{p(prediction=1, disease=1)}{p(disease=1)}$$$$sensitivity = recall = \frac{TP}{TP+FN}$$And we can then calculate the specificity (the true negative rate):
$$p(prediction=0 \mid disease=0) = \frac{p(prediction=0, disease=0)}{p(disease=0)}$$$$specificity = \frac{TN}{TN+FP}$$Now, in information retrieval, rather than specificity, we are interested in precision.
$$precision = \frac{TP}{TP+FP}$$What is this the probability of? Well, $TP$ can be defined as:
$$TP = p(prediction=1, disease=1)$$And $TP + FP$:
$$TP+FP = p(prediction=1)$$Which then looks like:
$$precision = \frac{TP}{TP+FP} = \frac{p(prediction=1, disease=1)}{p(prediction=1)}$$Which equals:
$$p(disease=1 \mid prediction=1) = \frac{p(prediction=1, disease=1)}{p(prediction=1)}$$This is a useful measure! Just because your results come back positive does not mean that you have the disease! Generally, more testing is required! This will be explored further in the Bayesian statistics section.
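To tie the pieces together, here is a short sketch that computes sensitivity, specificity, and precision from hypothetical confusion-matrix counts (assuming a population of 100,000, 1% prevalence, 90% sensitivity, and 95% specificity); even with a seemingly good test, the precision, i.e. $p(disease=1 \mid prediction=1)$, comes out to only about 15%:

```python
# Hypothetical counts for a population of 100,000 with 1% prevalence,
# tested with 90% sensitivity and 95% specificity.
TP, FN = 900, 100        # the 1,000 people who have the disease
TN, FP = 94_050, 4_950   # the 99,000 people who do not

sensitivity = TP / (TP + FN)   # p(prediction=1 | disease=1) = 0.90
specificity = TN / (TN + FP)   # p(prediction=0 | disease=0) = 0.95
precision   = TP / (TP + FP)   # p(disease=1 | prediction=1)

print(sensitivity, specificity, precision)  # 0.9 0.95 ~0.154
```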