
3. Bayesian Classifiers

This is an article that I have been excited to write for quite some time! While the model and classification techniques are nothing new, Naive Bayes (and its Bayes Net generalizations) builds upon all of the intuitions that I have laid out in past posts on probability theory, classic frequentist statistical inference, and bayesian inference. I highly recommend that you take some time to read them over if your background in statistics, probability, and bayesian inference is rusty (I will particularly be assuming the reader is comfortable with bayesian inference).

With that said let's get started!

1. Setting the Stage: A Continuous Classification Problem

I want to begin by introducing a simple toy classification problem that can help create a holistic picture of what we are trying to accomplish. Imagine that you are working as a Data Scientist for an energy company, and you are trying to predict whether a given customer will be a "big spender" on energy that month, based on their energy consumption after 10 days. Now "big spender" is clearly a rather arbitrary concept, but let's say that you have spoken to management and they classify a big spender as anyone using over 2000 kWh in a given month. To make this a bit more clear, we may have historical data that looks something like:

| Month | Energy Usage (After 10 Days) | Energy Usage (End of Month) | Big Spender? |
|-------|------------------------------|-----------------------------|--------------|
| 1     | 1200                         | 3000                        | Y            |
| 2     | 450                          | 890                         | N            |
| 3     | 600                          | 1745                        | N            |
| 2     | 800                          | 2100                        | Y            |
| 1     | 100                          | 290                         | N            |
| 3     | 724                          | 2020                        | Y            |
| 1     | 1800                         | 1950                        | N            |
| ...   | ...                          | ...                         | ...          |

As a side note, there are many reasons that we may want to classify a consumer as a big spender early in the month, particularly if we want to offer them insights into why they are consuming more energy, tips as to how they can reduce energy consumption, or a paper report in the mail (proven to be more engaging, while also the most costly form of engagement) that may help them reduce energy and hence their current bill. Reasons aside, the question that arises is:

After 10 days, how do you classify users as big spenders?

Of course there are many options! We could train a logistic regression model to find statistical patterns between our continuous input, $X$ (energy usage after 10 days), and our output, $Y$ (big spender?), or we could fit a decision tree to the data, or a neural network; our choices are endless. As the data scientist, your job is to understand the context of the problem, edge cases, business requirements, etc., and select the most appropriate model. In this case, considering the subject of the post, we will choose to use a Naive Bayes model.

Note, I want to again make clear that this is of course a toy problem. Training any model to make robust predictions from a single feature is a tall order, working well only if that feature happens to explain most of the variance in the dependent variable$^1$ (for more on this see my post on dimensionality reduction, particularly the example pertaining to height vs. urefu, where one almost perfectly explains the other). Additionally, in order to keep things relatively simple I am not going to consider adding autoregressive features (a time series concept where you find a relationship between the current value of a variable and its prior values).

With that said, let's take a look at some of this data (I want to note that as always you can choose to show or hide the code that makes up this post by scrolling to the bottom and clicking the corresponding button). We can see our two groups (big spenders and not big spenders) and their corresponding usage after 10 days shown in the histograms below:

In [1]:
import numpy as np
# multinomial is needed for the spam example later in this post
from scipy.stats import bernoulli, binom, norm, multivariate_normal, multinomial
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import rc, animation
from IPython.core.display import display, HTML
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.ticker import LinearLocator, FormatStrFormatter

from _plotly_future_ import v4_subplots
import cufflinks
import plotly.plotly as py
import plotly
import plotly.graph_objs as go
from plotly.offline import iplot
import plotly.figure_factory as ff

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

sns.set(style="white", palette="husl")
sns.set_context("talk")
sns.set_style("ticks")
In [31]:
big_spender_mean = 2000
big_spender_std = 800
num_big_spenders = 10000
big_spender_usage_ten_days = np.random.normal(big_spender_mean, big_spender_std, num_big_spenders)

low_spender_mean = 200
low_spender_std = 100
num_low_spenders = 20000

low_spender_usage_ten_days = np.random.normal(low_spender_mean, low_spender_std, num_low_spenders)
In [9]:
trace1 = go.Histogram(
    x=big_spender_usage_ten_days,
    nbinsx=100,
    name="Big Spender",
    marker_color='blue',
)

trace2 = go.Histogram(
    x=low_spender_usage_ten_days,
    nbinsx=100,
    name="Low Spender",
    marker_color='crimson',
)

data = [trace1, trace2]

layout = go.Layout(
    width=750,
    height=450,
    title="Big Spender vs. Non Big Spender Distribution",
    xaxis=dict(title="Usage after 10 days (kWh)"),
    yaxis=dict(title="Number of Customers"),
    barmode='stack'
)

fig = go.Figure(data=data, layout=layout)
fig.update_traces(opacity=0.75)

# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))

We can calculate a few descriptive statistics from the above data and see that our low spenders have a mean of ~$200$ kWh of energy used after 10 days, while our big spenders have a mean of $2000$ kWh and a much higher standard deviation (spread) of $800$ kWh. For the sake of this problem our data was generated via a gaussian process (via numpy.random.normal), but in reality we could expect to see more of a long tail distribution for our big spenders. In other words, a business insight we could glean is that big spenders will consist mainly of homes that have used one to two thousand kWh of energy after 10 days, plus a few rare cases of large companies and office facilities using twenty times more energy after 10 days. Data that is generated via a gaussian process does not allow for that type of spread, so our real-world process surely would not be gaussian in nature.

Now that we have seen the data set that we are provided to work with, let's restate the question we are trying to answer:

In the upcoming month, how can we classify customers as either big or low spenders based on this past dataset?

In other words, next month if we are given a customer who has used $700$ kWh in the first 10 days, do we predict that they are going to be a big spender?

2. Placing our objective into a mathematical context

2.1 Bayes Rule in a univariate setting

Now, one of the main jobs of a data scientist, applied mathematician, or really any analytical problem solver, is to place a qualitative question within a mathematical framework. Here we can put this into a probabilistic context by stating that our goal is to find:

$$P(\text{Big Spender} \mid \text{Usage after 10 days})$$

Where it is common to refer to our data, usage after 10 days, as $X$, and our output variable that we are trying to predict, big spender, as $Y$:

$$P(Y \mid X)$$

In English (but in a mathematical context), we can define our goal as:

Find the probability that a customer is a big spender, given their first 10 days of usage.

How can we go about finding that? Well, let's think for a minute about how the probability above is defined based on probability theory. Via the product rule$^2$, more commonly referred to as the rule of conditional probability, we can see that:

$$P(XY) = P(X \mid Y) \cdot P(Y) = P(Y \mid X) \cdot P(X)$$

And via some simple algebra we arrive at:

$$P(Y \mid X) = \frac{P(X \mid Y) \cdot P(Y)}{P(X)}$$

What does each of these quantities represent, in English? We have:

$$P(X \mid Y) \longrightarrow \text{Probability of observing our data, X, given the consumer is a big spender}$$$$P(Y) \longrightarrow \text{Prior probability that the user is a big spender}$$$$P(X) \longrightarrow \text{Probability of observing the data we did (across all types of spenders)}$$

In a very general sense, this is the framework of bayesian inference (and the equation above is often referred to as bayes rule). In what I find to be the most intuitive notation, we can write the equation as:

$$P(H \mid DX) = \frac{P(D \mid HX) \cdot P(H \mid X)}{P(D \mid X)}$$

Where $H$ represents a particular hypothesis that we are interested in (someone is a big spender), $D$ represents our observed data, and $X$ represents all background information that our probabilities are conditional on. Each term is generally referred to as:

$$\overbrace{P(H \mid DX)}^{\text{posterior probability}} = \frac{\overbrace{P(D \mid HX)}^\text{likelihood} \cdot \overbrace{P(H \mid X)}^\text{prior probability}}{\underbrace{P(D \mid X)}_\text{probability of data}}$$

Now for an incredibly enlightening take on probability theory as extended logic (where the above notation is taken from), I highly recommend checking out Probability Theory: The Logic of Science.

With that said, why is this relationship useful? Well, as stated earlier it can actually allow us to conduct a general inference. To see what I mean and build a bit of intuition around this process, I want us to think about what $P(X \mid Y)$ means for a moment. Do we have enough information to calculate it?

Keep that thought in mind and take a look at the distribution plot below. We have essentially fit two normal distributions to our historical data, one distribution was fit to the big spenders, the other to the non big spenders:

In [11]:
hist_data = [big_spender_usage_ten_days, low_spender_usage_ten_days ]

group_labels = ['Big Spender', 'Non Big Spender']
group_colors = ['blue', 'crimson']

fig_traces = ff.create_distplot(
    hist_data, 
    group_labels,
    colors=group_colors,
    bin_size=5,
    curve_type='normal'
)

layout = go.Layout(
    autosize=False,
    width=750,
    height=450,
    title="Big Spender vs. Non Big Spender Distribution",
    xaxis=dict(title="Usage after 10 days (kWh)"),
    yaxis=dict(title="Number of Customers"),
    barmode='stack'
)

fig = go.Figure(data=fig_traces, layout=layout)

# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))

What exactly do I mean by fitting a distribution to our data? Well, if you recall, the normal distribution is defined mathematically as:

$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} exp(-\frac{(x-\mu)^2}{2\sigma^2})$$

And it is entirely parameterized by the mean and variance (in the univariate case). So, we can calculate the mean and variance for both of our classes respectively (big spenders and non big spenders, only using the data belonging to each class when calculating the class parameters):

$$\mu = \frac{1}{N} \sum_{i=1}^N x_i$$$$\sigma^2 = \frac{1}{N}\sum_{i = 1}^N (x_i - \mu)^2$$

And we can then plot the corresponding normal distributions fit to those parameters. In other words, we are finding two normal distributions:

$$\text{Big Spender} \longrightarrow f(x \mid \mu_{BS}, \sigma_{BS}^2) = \frac{1}{\sqrt{2 \pi \sigma_{BS}^2}} exp(-\frac{(x-\mu_{BS})^2}{2\sigma_{BS}^2})$$$$\text{Low Spender} \longrightarrow f(x \mid \mu_{LS}, \sigma_{LS}^2) = \frac{1}{\sqrt{2 \pi \sigma_{LS}^2}} exp(-\frac{(x-\mu_{LS})^2}{2\sigma_{LS}^2})$$

Where, from our data, our big spender distribution is parameterized via $\mu_{BS} = 2000$ and $\sigma_{BS} = 800$, and our non big spender distribution is parameterized via $\mu_{LS} = 200$ and $\sigma_{LS} = 100$.

In [32]:
big_spender_mean = big_spender_usage_ten_days.mean()
big_spender_std = big_spender_usage_ten_days.std()

low_spender_mean = low_spender_usage_ten_days.mean()
low_spender_std = low_spender_usage_ten_days.std()

usage = np.arange(-300, 5000, 1)

big_spender_normal_dist = norm.pdf(usage, big_spender_mean, big_spender_std)
low_spender_normal_dist = norm.pdf(usage, low_spender_mean, low_spender_std)
In [12]:
trace2 = go.Scatter(
    x=usage,
    y=big_spender_normal_dist,
    marker = dict(
        color = 'blue',
    ),
    fill='tozeroy',
    name="Normal Distribution fit to Big Spender Data"
)

trace1 = go.Scatter(
    x=usage,
    y=low_spender_normal_dist,
    marker = dict(
        color = 'crimson',
    ),
    fill='tozeroy',
    name="Normal Distribution fit to Low Spender Data"
)



data = [trace1, trace2]
layout = go.Layout(
    legend=dict(x=0.5, y=0.97),
    width=750,
    height=450,
    title="Normal Distributions fit to Big and Non Big Spenders Historical Data",
    xaxis=dict(title="Usage in first 10 days of month"),
    yaxis=dict(title=r"$\text{Probability Density: } f(x \mid \mu, \sigma^2)$")
)

fig = go.Figure(data=data, layout=layout)

# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))

Those distributions above are $P(X \mid Y)$! In other words, we have one of the key pieces to finding $P(Y \mid X)$. Now, how can that be useful? Well, say we are given a consumer that we want to classify; they have used $250$ kWh of energy in the first 10 days of the month. What is the likelihood, $P(X \mid Y)$, of the data point for each respective distribution? Let's take a look at the plot below:

In [72]:
trace2 = go.Scatter(
    x=usage,
    y=big_spender_normal_dist,
    marker = dict(
        color = 'blue',
    ),
    fill='tozeroy',
    name="Normal Distribution fit to Big Spender Data"
)

trace1 = go.Scatter(
    x=usage,
    y=low_spender_normal_dist,
    marker = dict(
        color = 'crimson',
    ),
    fill='tozeroy',
    name="Normal Distribution fit to Low Spender Data"
)

offset = 300  # usage = np.arange(-300, 5000, 1), so index (250 + offset) corresponds to 250 kWh
trace3 = go.Scatter(
    x=[usage[250 + offset]],
    y=[low_spender_normal_dist[250 + offset]],
    marker = dict(
        color = 'green',
        size=10
    ),
    mode="markers",
    name=f"Likelihood of observed data point, given low spender: {round(low_spender_normal_dist[250 + offset], 5)}"
)

trace4 = go.Scatter(
    x=[usage[250 + offset]],
    y=[big_spender_normal_dist[250 + offset]],
    marker = dict(
        color = 'gold',
        size=10
    ),
    mode="markers",
    name=f"Likelihood of observed data point, given big spender: {round(big_spender_normal_dist[250 + offset], 5)}"
)

data = [trace1, trace2, trace3, trace4]
layout = go.Layout(
    legend=dict(
        x=0.5, y=0.97,
        font_size=10,
    ),
    width=750,
    height=450,
    title="Normal Distributions fit to Big and Non Big Spenders Historical Data",
    xaxis=dict(title="Usage in first 10 days of month"),
    yaxis=dict(title=r"$\text{Probability Density: } f(x \mid \mu, \sigma^2)$")
)

fig = go.Figure(data=data, layout=layout)

# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))

We can see that, given the big spender distribution, the likelihood of a consumer using $250$ kWh of energy in the first 10 days is $0.00005$. On the other hand, the likelihood of a non big spender using $250$ kWh in the first 10 days is $0.0035$. In other words:

$$P(\text{250 kWh in first 10 days} \mid \text{Low spender}) > P(\text{250 kWh in first 10 days} \mid \text{Big spender})$$$$P(X \mid Y_{LS}) > P(X \mid Y_{BS})$$

At this point the gears should be turning and you should start to see potential ways to create an algorithmic decision procedure based on these likelihoods. A simple procedure would be to just predict that the consumer belongs to the class corresponding to a higher likelihood.
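To make this concrete, here is a minimal sketch of that likelihood-only decision rule, reusing the class-conditional normal parameters fit in the code above:

def predict_by_likelihood(usage_10_days):
    # evaluate the likelihood of the observation under each class density
    l_big = norm.pdf(usage_10_days, big_spender_mean, big_spender_std)
    l_low = norm.pdf(usage_10_days, low_spender_mean, low_spender_std)
    return "Big Spender" if l_big > l_low else "Low Spender"

predict_by_likelihood(250)  # => 'Low Spender'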

However, let's slow down for a moment. There are two things that we are not currently considering that can have a large impact on our prediction accuracy: inclusion of priors via bayes rule and the inclusion of additional input features.

By priors I simply mean prior knowledge about the probability of a customer being a big spender or a non big spender. If for instance we know that historically approximately $10\%$ of customers are big spenders each month, then we can include that prior information in our decision procedure via Bayes rule. We will come back to priors shortly.

As for additional input features: clearly, even in this simple example there is a good deal of overlap between the two distributions. In the real world there will be even more overlap, and hence more ambiguity that the model will likely struggle to handle. One solution, as mentioned earlier, is to include more features. You can imagine that time of year will surely affect this prediction. For instance, if we are in the month of June, the first 10 days are likely to consume less energy than the last 10 days (because as the month goes on, the temperature rises and air conditioning is more likely to be utilized). Similarly, a weather forecast could be included, and a forecasted heat wave or cold front would clearly impact our prediction. So clearly adding more features would be beneficial; the question is, how exactly can we achieve this?

2.2 Extending to a multivariate domain

Fortunately, extending our idea of likelihood maximization is not difficult at all! We simply need to make use of the multivariate gaussian, defined as:

$$f(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2 \pi)^D |\Sigma|}} exp\Big(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\Big)$$

Where $x$ is now a vector containing each feature, $\mu$ is a vector holding the mean values of each feature, and $\Sigma$ is the covariance matrix (a $D \times D$ matrix, where $D$ is the number of features present, i.e. our dimensionality). If this all seems a little abstract, fear not, we are going to walk through an example to clear up any ambiguity. I am going to add an additional feature: the difference between the forecasted temperature on day 15 of the month and the actual temperature on day 5. This feature will serve as a proxy/heuristic for whether the remaining half of the month will be hotter or colder than the first. As a simple example, imagine the temperature on day 5 of the month was 70 degrees F, while the forecasted temperature for day 15 was 85 degrees F. This implies that the temperature in the second half of the month is greater than the first, and hence energy usage may increase by more than a factor of 3 before the month has ended (a factor of exactly 3 would imply that each 10 day interval, i.e. the first, middle, and last 10 days of the month, uses the same amount of energy). Our data set will now look like:

| Month | Energy Usage (After 10 Days) | Temp Diff (F) | Energy Usage (End of Month) | Big Spender? |
|-------|------------------------------|---------------|-----------------------------|--------------|
| 1     | 1200                         | 9             | 3000                        | Y            |
| 2     | 450                          | 2             | 890                         | N            |
| 1     | 1800                         | 5             | 1950                        | N            |
| ...   | ...                          | ...           | ...                         | ...          |

From this we can see how to calculate our distribution parameters, where again each class (big and non big spender) has their own mean, and now their own covariance:

$$\mu = \frac{1}{N} \sum_{i=1}^N x_i$$$$\Sigma = \frac{1}{N - 1} \sum_{i=1}^N (x_i - \mu)(x_i - \mu)^T$$
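As a quick numpy sketch (assuming X is an (N, D) array of [usage, temp diff] rows and y is an array of class labels, both hypothetical names), these per-class estimates could be computed as:

X_big = X[y == 1]                        # rows belonging to the big spender class (assuming label 1)
mu_big = X_big.mean(axis=0)              # per-feature mean vector
Sigma_big = np.cov(X_big, rowvar=False)  # D x D covariance matrix (1/(N-1) normalization)

And likewise for the non big spender rows.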

I want to emphasize again that:

Each class, big spender and non big spender, has its own mean and covariance parameters. It is the fact that these parameters differ that subsequently allows each class to have its own distribution.

If we let big_temp_diff_mean = 7 and low_temp_diff_mean = 2, we can make use of scipy's multivariate_normal.pdf to see how our distributions look in this multivariate case (note: usage was scaled down by a factor of 100 to allow the 3-d plot to render in a reasonable time):

In [104]:
fig = plt.figure(figsize=(15, 5.5), dpi=200)       
ax1 = fig.add_subplot(1, 2, 1, projection='3d')
ax2 = fig.add_subplot(1, 2, 2)

def init():
    #Parameters to set
    usage = np.arange(-3, 50, 0.5)
    temp_diff = np.arange(-20, 20, 0.5)
    usage, temp_diff = np.meshgrid(usage, temp_diff)

    big_usage_mean = 20
    big_usage_std = 8
    big_temp_diff_mean = 7
    big_temp_diff_std = 5

    low_usage_mean = 2
    low_usage_std = 1
    low_temp_diff_mean = 2
    low_temp_diff_std = 5

    pos = np.empty(usage.shape + (2,))
    pos[:, :, 0] = usage
    pos[:, :, 1] = temp_diff

    # Note: scipy expects a covariance matrix here, so the diagonal entries
    # below are treated as variances (despite the "_std" variable names)
    likelihood_dist_non_big_spender = multivariate_normal(
        [low_usage_mean, low_temp_diff_mean], 
        [
            [low_usage_std, 0], 
            [0, low_temp_diff_std]
        ]
    )
    likelihood_dist_big_spender = multivariate_normal(
        [big_usage_mean, big_temp_diff_mean], 
        [
            [big_usage_std, 0], 
            [0, big_temp_diff_std]
        ]
    )

    likelihood_nbs = likelihood_dist_non_big_spender.pdf(pos)
    likelihood_bs = likelihood_dist_big_spender.pdf(pos)

    # ---- LEFT: 3D Plot ----
    # Plot the surface.
    surf_low = ax1.plot_surface(
        usage,
        temp_diff,
        likelihood_nbs,
        cmap="Reds",
        linewidth=0, 
        alpha=0.5,
        antialiased=True,
        rstride=1, # rstride and cstride are difference makers
        cstride=1
    ) 
    surf_big = ax1.plot_surface(
        usage,
        temp_diff,
        likelihood_bs,
        cmap="Blues",
        linewidth=0, 
        alpha=0.5,
        antialiased=True,
        rstride=1, # rstride and cstride are difference makers
        cstride=1
    ) 

    # ---- RIGHT: Mesh ---- 
    c = ax2.pcolormesh(usage, temp_diff, likelihood_nbs, cmap="Reds", alpha=0.6)
    fig.colorbar(c, ax=ax2)
    c2 = ax2.pcolormesh(usage, temp_diff, likelihood_bs, cmap="Blues", alpha=0.4)
    fig.colorbar(c2, ax=ax2)             

    ax2.set_xlabel("Usage", labelpad=7)
    ax2.set_ylabel("Temp Diff", labelpad=7)
    ax2.set_title('Heat Map of Joint Distribution')

    ax1.set_xlabel('Usage', labelpad=9)
    ax1.set_ylabel('Temp Diff', labelpad=9)
    ax1.set_zlabel('Likelihood', labelpad=9)
    ax1.set_yticks([-15,-5, 5, 15])
    ax1.set_xticks([10,30,50])
    ax1.set_zticks([0.01, 0.03, 0.05, 0.07])

def animate(i):
    ax1.view_init(elev=10, azim=i*4)
    return fig,
    
plt.tight_layout()
plt.subplots_adjust(wspace=0.3, bottom=0.15, top=0.9)

# For rendering html video in cell
# html_video = HTML(
#     animation.FuncAnimation(
#         fig,
#         animate,
#         init_func=init,
#         frames=90,
#         interval=100,
#     ).to_html5_video()
# )
# display(html_video)
gif_video = animation.FuncAnimation(
        fig,
        animate,
        init_func=init,
        frames=90,
        interval=100,
    )

gif_video.save("bayes_joint_likelihood_distribution_heat_map_animation.gif", writer='imagemagick')

plt.close()

Again, we can envision a decision procedure by which we take the 2-dimensional data point that we are trying to classify, and determine which class distribution has a higher likelihood of observing that data point. This still raises the question: what about this process makes it a Bayesian classifier?

2.3 Creating a Bayesian Classifier by including prior knowledge

Up until now we have mainly been dealing with what is known as the likelihood distribution, $P(X \mid Y)$, of which we have had one per class. Specifically, it can be viewed as: a probability distribution over our input features, given that the class takes some value. In other words, we can calculate quantities such as:

$$P(X \mid Y=\text{"Big Spender"}) = P([\text{usage=300, temp diff = 4}] \mid Y=\text{"Big Spender"})$$

Or:

$$P(X \mid Y=\text{"Non big Spender"}) = P([\text{usage=800, temp diff = 9}] \mid Y=\text{"Non big Spender"})$$

And, as of now, we simply have a decision procedure to select the $Y$ that yields the higher probability density of our input features. This makes sense, but how could we improve upon this prediction procedure? Well, as we said earlier, what we are truly interested in finding is $P(Y \mid X)$, known as the posterior probability; this is the probability that a new data point belongs to a certain class, given its observed features. As defined earlier, we know that this quantity is given by Bayes rule:

$$P(Y \mid X) = \frac{P(X \mid Y) \cdot P(Y)}{P(X)}$$

We can see that we will still be making use of the likelihood distribution, $P(X \mid Y)$, as well as two new quantities: the prior probability, $P(Y)$, and the probability of our data across all classes, $P(X)$.

The prior probability can be defined as the probability of being a big spender (or non big spender), based on no observed data. In other words, we can treat this as a simple frequency. If historically 20% of customers are big spenders, then we can write:

$$P(Y = \text{Big spender}) = 0.2$$$$P(Y = \text{Non Big spender}) = 0.8$$

Our probability of data can be determined via marginalizing the joint probability of $X$ and $Y$:

$$P(XY) = P(X \mid Y) P(Y)$$$$P(X) = \sum_Y P(XY)$$

Where we are essentially determining the probability of observing a specific data point across all hypotheses. For example, if we had a data point where the usage was 200 kWh and the temperature difference 2 degrees F, we could calculate $P(X)$ by finding the probability of observing a usage of 200 and a temperature difference of 2 given the class is big spender, and adding that to the probability of observing a usage of 200 and a temperature difference of 2 given the class is non big spender. In the continuous case we would utilize integration instead of summation:

$$P(X) = \int_Y P(X, Y) \, dY$$

In our case, we can find $P(X)$ by expanding the sum via the product rule:

$$P(X) = P(X \mid Y=\text{Big spender})P(Y = \text{Big spender}) + P(X \mid Y=\text{Non big spender})P(Y = \text{Non big spender})$$

And at this point we have everything we need in order to create a full Bayes Classifier! We can walk through our earlier example to make sure this is clear. We know the following:

$$X = [\text{usage = 200, temp diff = 2}]$$$$\text{Big Spender Likelihood Parameters} \longrightarrow \mu_{usage} =2000, \mu_{temp} = 7, \sigma_{usage} = 800, \sigma_{temp} = 5 $$$$\text{Non Big Spender Likelihood Parameters} \longrightarrow \mu_{usage} = 200, \mu_{temp} = 2, \sigma_{usage} = 100, \sigma_{temp} = 5 $$$$P(Y = \text{Big spender}) = 0.2$$$$P(Y = \text{Non Big spender}) = 0.8$$

And we can calculate our posterior distributions as follows; starting with the posterior for big spender:

$$P(Y = \text{Big Spender} \mid X) = \frac{P(Y = \text{Big Spender}) P(X \mid Y=\text{Big Spender})}{P(X)}$$$$P(Y = \text{Big Spender} \mid X) = \frac{0.2 \cdot N\Big(X, [2000, 7], [800, 5] \Big)} {N\Big(X, [2000, 7], [800, 5] \Big) \cdot 0.2 + N\Big(X, [200, 2], [100, 5] \Big) \cdot 0.8}$$

Where our likelihoods can be calculated as follows (note that, as in the plotting code above, usage is scaled down by a factor of 100 and the "std" values are passed to scipy as the diagonal variances):

big_usage_mean = 20
big_usage_std = 8
big_temp_diff_mean = 7
big_temp_diff_std = 5

low_usage_mean = 2
low_usage_std = 1
low_temp_diff_mean = 2
low_temp_diff_std = 5

likelihood_non_big_spender = multivariate_normal(
    [low_usage_mean, low_temp_diff_mean], 
    [
        [low_usage_std, 0], 
        [0, low_temp_diff_std]
    ]
)
likelihood_big_spender = multivariate_normal(
    [big_usage_mean, big_temp_diff_mean], 
    [
        [big_usage_std, 0], 
        [0, big_temp_diff_std]
    ]
)

likelihood_non_big_spender.pdf(np.array([2., 2.])) 
=> 0.07117625434171772

likelihood_big_spender.pdf([2., 2.])
=> 3.3158179074871642e-12

Leaving us with a final value of:

$$P(Y = \text{Big Spender} \mid X) = \frac{0.2 \cdot 3.315 \cdot 10^{-12}} {3.315 \cdot 10^{-12} \cdot 0.2 + 0.07118 \cdot 0.8} = 1.165 \cdot 10^{-11}$$

And the same calculation can be performed for the non big spender class:

$$P(Y = \text{Non big spender} \mid X) = \frac{P(Y = \text{Non big spender}) P(X \mid Y=\text{Non big spender})}{P(X)}$$$$P(Y = \text{Non big spender} \mid X)= \frac{0.8 \cdot N\Big(X, [200, 2], [100, 5] \Big)} {N\Big(X, [2000, 7], [800, 5] \Big) \cdot 0.2 + N\Big(X, [200, 2], [100, 5] \Big) \cdot 0.8}$$$$P(Y = \text{Non big spender} \mid X) = \frac{0.8 \cdot 0.07118} {3.315 \cdot 10^{-12} \cdot 0.2 + 0.07118 \cdot 0.8} = 0.999$$

Now of course in this case we will predict that the data point was generated by (belongs to) the non big spender class. There will obviously be many scenarios where the prediction is not quite as one-sided.

2.4 Simplifying our Calculation

Now, in practice there are actually a few things that we can do to simplify our calculations. First and foremost, we can see that in the calculation of the posterior probabilities, $P(X)$ occurs in the denominator of both expressions. Hence, we can actually just leave it out altogether and simply take the $argmax$ over $P(X \mid Y = k)P(Y = k)$:

$$k^* = argmax_k \big\{ P(X \mid Y = k) P(Y = k) \big\}$$

Of course we will no longer be yielding a valid probability, but in the classification context that is not quite our goal. We simply want to choose the class that maximizes the posterior probability; we aren't especially concerned with the exact value of that posterior. Note, for those unfamiliar with the $argmax$, it can be viewed similarly to an if statement:

if p(x | y=k1) * p(y=k1) > p(x | y=k2) * p(y=k2):
    return k1
else:
    return k2
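For more than two classes this generalizes to an argmax over the per-class scores; a quick sketch, assuming likelihoods and priors are numpy arrays indexed by class:

scores = likelihoods * priors  # P(X | Y=k) * P(Y=k) for each class k
k_star = np.argmax(scores)     # index of the predicted class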

Additionally, in the real world we often deal with very high dimensional data; in certain cases, such as image classification, $D > 1000$. Due to the curse of dimensionality, as we perform our calculations our probabilities will approach $0$. In order to prevent our computations from underflowing as they approach $0$, we will actually take the $log$ of our probability product:

$$log \big( P(X \mid Y = k) P(Y = k) \big) = log\big( P(X \mid Y = k)\big) + log \big( P(Y = k) \big)$$$$k^* = argmax_k \Big\{ log\big( P(X \mid Y = k)\big) + log \big( P(Y = k) \big) \Big\}$$

Because the logarithm, by definition, turns multiplication into addition, this will prevent the successive multiplication of small probability values, and instead simply leave us adding their logs. This is allowed because the logarithm transform is monotonically increasing, and since we are interested in the $argmax$ and not the actual probability values, the maxima will still occur at the same inputs. As a quick example take a look at the visualization below: the maximum of both the red and blue curves occurs at $x=0$:

In [2]:
x = np.arange(-4.9, 4.9, 0.1)
y = -x**2+ 25
log_y = np.log(y)
In [8]:
trace1 = go.Scatter(
    x=x,
    y=y,
    marker = dict(
        color = 'blue',
    ),
    name=r"$y = -x^2 + 25$"
)

trace2 = go.Scatter(
    x=x,
    y=log_y,
    marker = dict(
        color = 'crimson',
    ),
    name=r"$log(y)$"
)



data = [trace1, trace2]
layout = go.Layout(
    legend=dict(x=0.8, y=0.97),
    width=650,
    height=400,
    title=r"$y \text{ and } log(y)$",
    xaxis=dict(title=r"$x$"),
    yaxis=dict(title=r"response")
)

fig = go.Figure(data=data, layout=layout)

# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))

Additionally, addition is faster (in most cases) than multiplication from a computational perspective; this speed up is just another added benefit.
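To see the underflow problem concretely, here is a quick sketch comparing a direct product of many small likelihoods against the sum of their logs:

p = np.full(1000, 1e-5)  # 1000 small per-dimension likelihoods
np.prod(p)               # => 0.0 (underflows; the true value is 1e-5000)
np.sum(np.log(p))        # => -11512.9..., perfectly representable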

2.5 Why not simply use the likelihood?

If you are like many people and are left wondering why we should not simply let the posterior probability equal the likelihood, I highly encourage you to read my post on bayesian inference, particularly the first example regarding cancer classification.

In essence, without diving into too much theory and staying in the machine learning classification context, using prior probabilities is incredibly useful when it comes to dealing with unbalanced classes. For instance, if we were trying to predict whether a person has cancer and we know that only 1% of the population actually has cancer, the knowledge of that distribution ought to be included in our calculation. Intuitively you can reason about this as follows: we are trying to make the best predictions possible, making use of any useful data and background information that we can. This prior probability is most certainly relevant and appropriate data that should be included.

3. The Discrete Case

Now that we have walked through the basics of performing a bayesian classification based on continuous inputs (energy usage and temperature difference), where we used a gaussian distribution to measure our likelihoods, we can take a look at how to utilize this classifier based upon discrete inputs. To do so we can look at one of the most famous use cases for a Bayes Classifier, a form of text classification known as spam detection.

To start, let's assume that we are only tracking a single input feature: how many times the word "loan" appears in a given email. Intuitively we may reason that the word "loan" occurs often in spam, and rarely in non-spam emails. A good distribution with which to model this is the binomial distribution. Without digging into too much detail (see my post on the normal distribution), the binomial distribution models the number of successes in an experiment containing $n$ independent trials. In our case, each trial is a word, and a success is that the word is "loan". Mathematically this distribution is modeled as:

$$ f(k \mid n, p) = {n \choose k} p^k (1 - p)^{n-k}$$

Where $f$ is a probability mass function rather than a PDF, $p$ is the probability of success, $1-p$ is the probability of failure, $k$ is the number of successes, and $n-k$ the number of failures (note this applies to events that are independent). Additionally, recall that the binomial coefficient's purpose is to account for the fact that there are multiple ways to order successes and failures, and we want to consider all of those different combinations. Now unlike PDFs, a PMF will return an actual probability value (since by definition it is a function that gives the probability that a discrete random variable is exactly equal to some value, i.e. no integration required). Visually, this can be seen below in the simple example of a die roll:

In [52]:
traces = []
for i in range(1, 7):
    trace = go.Scatter(
        x=[i],
        y=[1/6],
        marker = dict(
            color = 'gold',
            size=10
        ),
        name=f"P({i})"
    )
    traces.append(trace)
    
shapes = []
for i in range(1, 7):
    shapes.append(
        go.layout.Shape(
            type="line",
            x0=i,
            y0=0,
            x1=i,
            y1=1/6,
            line=dict(
                color="gold",
                
            )
        )
    )
    
data = traces
layout = go.Layout(
    showlegend=False,
    width=800,
    height=400,
    title="Probability Mass Function of Die Rolls",
    xaxis=dict(title="Die number"),
    yaxis=dict(title="PMF (probability of roll)", range=[0,0.5]),
    shapes=shapes

)

fig = go.Figure(data=data, layout=layout)

# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))

As a quick example, if we let $n = 100$, $p = 0.08$, and assume that we observed the word "loan" $5$ times, the probability of that occurring is:

$$ f(k=5 \mid n = 100, p = 0.08) = {100 \choose 5} 0.08^5 (1 - 0.08)^{95} = 0.09$$
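We can verify this with scipy's binom.pmf (imported from scipy.stats above):

binom.pmf(5, 100, 0.08)  # => ~0.09, matching the hand calculation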

Now as with the continuous case, we would like to model two output classes; in other words we want to find likelihood distributions that correspond to spam and not spam emails. Recall that in the continuous case, our distributions were found by:

  1. Splitting up the data based on class.
  2. Calculating the mean and variance of each class's data set.
  3. Using the above mean and variance as the parameters of a gaussian distribution. Note that the gaussian was selected by us, the data scientists, as the model that corresponds to the underlying data generating process.
  4. The above gaussians are the distributions corresponding to each class.

Well, in our discrete case the same thing occurs, only now the distribution that models our process (the binomial) is parameterized by a single parameter, $p$, the probability of success (the word being "loan"); $n$ is simply fixed by the number of words in the email. As an example, based on our data we may find that the word "loan" appears at a frequency of $0.08$ in spam emails, and $0.001$ in not spam emails. Our distributions would then be represented as:

$$ P(X \mid Y = \text{spam}) = f(k \mid n, p = 0.08) = {n \choose k} 0.08^k (1 - 0.08)^{n-k}$$$$ P(X \mid Y = \text{not spam}) = f(k \mid n, p = 0.001) = {n \choose k} 0.001^k (1 - 0.001)^{n-k}$$

We can plot these likelihoods for varying values of $k$, with $n = 20$, below:

In [81]:
k = np.arange(0, 20, 1)
n = k.shape[0]

theta_spam = 0.08
f_spam = binom.pmf(k, n, theta_spam)

theta_not_spam = 0.001
f_not_spam = binom.pmf(k, n, theta_not_spam)

trace1 = go.Scatter(
    x=k,
    y=f_spam,
    mode='markers',
    marker = dict(
        color = 'blue',
        size=10
    ),
    name="Spam"
)

trace2 = go.Scatter(
    x=k,
    y=f_not_spam,
    mode='markers',
    marker = dict(
        color = 'crimson',
        size=10
    ),
    name="Not Spam"
)

data = [trace1, trace2]
layout = go.Layout(
    legend=dict(x=0.8, y=0.97),
    width=800,
    height=400,
    title="Binomial Distributions fit to Spam and Not Spam historical data",
    xaxis=dict(title=r"$k$"),
    yaxis=dict(title=r"$f(k \mid n, p) = P(X \mid Y)$")
)

fig = go.Figure(data=data, layout=layout)

# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))

Note that I specifically plotted only the markers so that there is no confusion surrounding the fact that this is indeed a discrete distribution. Just as with the continuous case, we have a pretty clear cut decision procedure that we can apply to classify a given email as spam or not spam. Based on our likelihood distributions corresponding to each class, we can take an input data point (an email, containing a specific number of occurrences of the word "loan"), and see which class yields a higher likelihood of observing that data point. If we look specifically at an email that contained $3$ occurrences of the word "loan", based on our plot we would conclude that it is indeed a spam email.

3.1 Our Interests Lie with the Posterior, not simply the likelihood

However, remember that we are not simply interested in the likelihood! Rather, we want to find the posterior probability that our data point is spam or not spam, given the email:

$$P(\text{spam} \mid \text{email})$$

If we only look at the likelihood, $P(\text{email} \mid \text{spam})$, then we are essentially finding the probability that we would observe the email we did, given that it belonged to the spam class. Clearly, the posterior is a more accurate representation of what we seek. As in the continuous case, Bayes rule allows us to compute it:

$$P(Y \mid X) = \frac{P(X \mid Y) \cdot P(Y)}{P(X)}$$$$P(\text{spam} \mid \text{email}) = \frac{P(\text{email} \mid \text{spam}) \cdot P(\text{spam})}{P(\text{email})}$$

Which we will also want to do for the not spam class:

$$P(\text{not spam} \mid \text{email}) = \frac{P(\text{email} \mid \text{not spam}) \cdot P(\text{not spam})}{P(\text{email})}$$

It is these two posterior probabilities that will be compared when we make our classification. Because we are only interested in the final relative comparison, we do not need to consider $P(X) = P(\text{email})$, since it will be included in the calculation of each posterior and not change the relative relation. Our prior probabilities will simply be the frequency of emails on the whole that are spam, and those that are not spam.

3.2 More than one input feature

In reality, just as in our continuous example, we would want to look at more than one input feature (i.e. more than the word "loan"). To handle this we can utilize the generalization of the binomial distribution, the multinomial distribution. The generalization is rather straightforward; consider that the binomial distribution was really just modeling a sequence of events/trials in which there were only two outcomes. Hence, due to the fact that the individual probabilities of an exhaustive set of outcomes must sum to $1$, we only need a single parameter/probability, the probability of success (in this case $p$):

$$\text{Binomial} \longrightarrow f(k \mid n, p) = {n \choose k} p^k (1 - p)^{n-k}$$

However, note that we could also state that $P(success) = p_1$ and $P(failure) = p_2 = 1 - p_1$, updating our equation to be:

$$\text{Binomial} \longrightarrow f(k \mid n, p_1, p_2) = {n \choose k} p_1^k p_2^{n-k}$$

Additionally it is clear that the exponent terms simply add up to the total number of trials, $n$. So, in a more general case, we have the multinomial distribution:

$$\text{Multinomial} \longrightarrow f(x_1, \dots ,x_k \mid n, p_1, \dots , p_k) = \frac{n!}{x_1! \dots x_k!} p_1^{x_1} \cdot \dots \cdot p_k^{x_k}$$

Where we have $x_i$ representing the number of successes for class $i$, $p_i$ representing the probability of success for class $i$, and the fractional coefficient of factorials simply being the generalization of the binomial coefficient. Note that the multinomial distribution has the following properties:

$$\sum_i p_i = 1$$$$\sum_i x_i = n$$

Let's take a minute to ensure we understand this in the context of our spam detection problem. For every data point (word), we want to know which particular word was observed; in other words, which word was a success. The probability of a word being a success will simply be the frequency with which it occurred in the data. For instance, if we were parameterizing our spam model, we would get a count of all of the words, and then for each word across all the spam emails we would calculate the frequency (probability) with which it occurred. If we happened to have $100,000$ words across all of our spam emails, and we found that cash occurred $2,782$ times, then our probability parameter for observing cash would be $p_{cash} = 0.02782$. This would need to be done for all words, leaving us with a list of parameters, one for each word. The probability parameters in this list would sum to one, since the outcomes are mutually exclusive and exhaustive.

When trying to determine the likelihood of a specific email given the spam class, we count the number of occurrences of each word in the email, and that makes up our $x$ list. For instance, if our email was:

Free cash loan if you act fast! This loan will not last! Act now to get your CASH.

We would see that:

$$x_{cash} = 2, x_{loan} = 2, x_{fast} = 1, \dots$$

And those would be the exponents of the corresponding probability parameter in the multinomial model that was just parameterized from our data.
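As a sketch of this likelihood calculation (using a hypothetical three-entry vocabulary of "cash", "loan", and a catch-all "other" bucket, with made-up spam-class frequencies):

p_spam = [0.02782, 0.015, 0.95718]  # hypothetical P(cash), P(loan), P(other); must sum to 1
x = [2, 2, 14]                      # word counts in the 18-word email above; must sum to n
multinomial.pmf(x, n=18, p=p_spam)  # likelihood of the email under the spam class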

The visualization below gives a sense for what the multinomial model looks like (where $p_{loan} = 0.3$ and $p_{cash}=0.7$):

In [183]:
fig = plt.figure(figsize=(10, 6), dpi=200)       
ax1 = fig.add_subplot(1, 1, 1, projection='3d')

def init():
    loan = np.arange(1, 10, 1)
    cash = np.arange(1, 10, 1)
    loan, cash = np.meshgrid(loan, cash)

    pos = np.empty(loan.shape + (2,))
    pos[:, :, 0] = loan
    pos[:, :, 1] = cash

    ploan = 0.3
    pcash = 0.7

    rv = multinomial(10, p=[ploan, pcash])
    joint = rv.pmf(pos)

    cash_list = [[val, val] for val in cash.flatten()]
    loan_list = [[val, val] for val in loan.flatten()]
    joint_list = [[0, val] for val in joint.flatten()]

    for idx in range(len(cash_list)):
        ax1.plot(
            loan_list[idx], 
            cash_list[idx], 
            joint_list[idx],
            'k-',
            alpha=0.8, 
            linewidth=2, 
            color="black"
        )

    scatter = ax1.scatter(
        loan,
        cash,
        joint,
        c=joint.flatten(),
        s=70,
        cmap="Blues"
    ) 

    ax1.set_xlabel("Loan (word occurrences)", labelpad=9)
    ax1.set_ylabel("Cash (word occurrences)", labelpad=9)
    ax1.set_zlabel("Probability", labelpad=9)

    ax1.set_xticks([2, 4, 6, 8])
    ax1.set_yticks([2, 4, 6, 8])
    ax1.set_zticks([0.05, 0.15, 0.25])

def animate(i):
    ax1.view_init(elev=10, azim=i)
    return fig,
    
plt.tight_layout()
plt.subplots_adjust(wspace=0.3, bottom=0.15, top=0.9)

# For rendering html video in cell
# html_video = HTML(
#     animation.FuncAnimation(
#         fig,
#         animate,
#         init_func=init,
#         frames=360,
#         interval=100,
#     ).to_html5_video()
# )
# display(html_video)
# gif_video = animation.FuncAnimation(
#         fig,
#         animate,
#         init_func=init,
#         frames=360,
#         interval=100,
#     )

# gif_video.save("multinomial_distribution.gif", writer='imagemagick', fps=120)

plt.close()

3.3 Another Simplification

As is often the case in applied mathematics and machine learning, we care not only about the underlying theory and modeling a given situation accurately; we also look to balance that with simplicity and performance. What is a simpler way to model our spam classification scenario? Well, instead of counting the number of times each word appears and modeling it as a binomial/multinomial, we can simply check whether the word occurs at all in a given email. In other words, we don't care how many times it appeared, we just care about whether or not it occurred at all. This type of indicator (boolean) variable has a Bernoulli Distribution:

$$f(k \mid p) = \left\{ \begin{array}{ll} p \; \hspace{47pt} if \; \; k =1 \\ q = 1 - p \hspace{20pt} if \; k = 0\\ \end{array} \right. $$$$f(k \mid p) = p^k(1-p)^{1-k}$$
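For example, with scipy (taking $p = 0.15$ as a hypothetical estimated probability of the word occurring in a class):

bernoulli.pmf(1, 0.15)  # => 0.15, probability the word occurs
bernoulli.pmf(0, 0.15)  # => 0.85, probability it does not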

3.4 Continuous, Discrete, and Boolean Bayes Classifier Summary

To summarize what we have gone over so far:

  • When dealing with continuous inputs we can model the class distributions with a gaussian distribution
  • When dealing with discrete inputs we can model the class distributions with a binomial distribution
  • When dealing with boolean inputs we can model the class distributions with a bernoulli distribution

Conveniently, in Scikit-learn the Naive Bayes classifier comes in exactly these three variants: GaussianNB, MultinomialNB, and BernoulliNB. I want to make clear, however, that we are not limited to using only these three distributions. Imagine a scenario where our inputs are a mix of discrete and continuous variables; we can actually mix these distributions together, and even use distributions that we have not mentioned. How exactly could we do that? Well, at this point we have been using our distributions in order to determine the likelihood of observing the data we did, given a particular class. In our continuous (multivariate) case we had the following PDF:

$$f(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2 \pi)^D |\Sigma|}} exp\Big(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\Big)$$

And in the multinomial case we had:

$$f(x_1, \dots ,x_k \mid n, p_1, \dots , p_k) = \frac{n!}{x_1! \dots x_k!} p_1^{x_1} \cdot \dots \cdot p_k^{x_k}$$

If we had both continuous and discrete input variables, we could model the continuous with the gaussian, and the discrete with the multinomial, and the overall likelihood of a data point with their product:

$$ f(x \mid \mu, \Sigma, n, p_1, \dots , p_k) = \Big[ \overbrace{ \frac{1}{\sqrt{(2 \pi)^D |\Sigma|}} exp\Big(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\Big) }^\text{Gaussian Likelihood}\Big] \cdot \Big[\overbrace{ \frac{n!}{x_1! \dots x_k!} p_1^{x_1} \cdot \dots \cdot p_k^{x_k} }^\text{Multinomial Likelihood}\Big] $$

Where $x$ is a vector containing all of our input variables. For instance, take our energy example. Say that our inputs were usage amount, season, and month. usage_amount will be a continuous variable, modeled by our gaussian, while season is a discrete variable taking on the values 0, 1, 2, 3, corresponding to our four seasons. month will also be discrete, taking on the values 1,...,12. An input vector x could then look like [259.154, 2, 8], which means we are trying to classify a consumer who used 259.154 kWh of energy in the first 10 days, during the summer, in the month of August.

Also, note that we have not yet discussed what naive means in the context of naive bayes classifiers; we have only referred to bayes classifiers in general. Take a moment to think about what naive may mean in our classification problem before moving to the next section.

4. Naive Bayes

At this point we have laid all of the ground work to finally discuss one of the main ideas behind this post, the Naive Bayes classifier. We can introduce this in terms of our spam classification example. Our objective is to determine if an email is spam, and we intuitively know that it is more likely to be spam if it contains words like "cash", "loan", "pills", etc. So, in the case of "money" we want to find distributions for:

$$P(\text{money} \mid \text{spam})$$$$P(\text{money} \mid \text{not spam})$$

Discrete probabilities such as these are just counts, so this is actually very straightforward:

$$P(\text{money} \mid \text{spam}) = \frac{\text{# spam messages containing "money"}}{\text{total # of spam messages}}$$

So, if we have $1000$ spam emails in total, and $150$ of them contain the word "money", then $P(\text{money} \mid \text{spam}) = 0.15$. Note that this is a boolean indicator where we are simply seeing if the word "money" occurs in a given email; we are not counting how many times it occurs. To be clear, the probability $0.15$ that we just found is our parameter $p$ (probability of success) in the bernoulli distribution; the frequency (messages containing the word divided by the total number of messages) is essentially a maximum likelihood estimate. With this taken care of, we would do the same thing for $P(\text{money} \mid \text{not spam})$.

4.1 What makes this Naive?

Now, consider the probability $P(\text{cash} \mid \text{spam})$. Do you think that this is correlated with $P(\text{money} \mid \text{spam})$? There is a very good chance that it is! However, in a Naive Bayes classifier we assume that they are independent:

$$P(\text{cash, money} \mid \text{spam}) = P(\text{cash} \mid \text{spam}) P(\text{money} \mid \text{spam})$$

More generally, we can define our naive assumption as follows:

Naive Assumption: All input features are independent, and we will ignore any correlations between features.

Mathematically written as:

$$P(\text{all words} \mid \text{spam}) = P(\text{word 1} \mid \text{spam}) \cdot P(\text{word 2} \mid \text{spam}) \dots P(\text{word n} \mid \text{spam})$$

4.2 What makes this Bayesian?

Now, as we stated before, we do not simply want to model likelihoods! In other words, we are not just interested in:

$$P(X \mid \text{spam}) \; \; \; vs. \; \; \; P(X \mid \text{not spam}) $$

We are trying to make predictions! So we are really after the posterior probability, $P(\text{spam} \mid X)$. This can be calculated via Bayes rule (as we showed earlier):

$$P(\text{spam} \mid \text{email}) = \frac{P(\text{email} \mid \text{spam}) \cdot P(\text{spam})}{P(\text{email})}$$

In other words, we are now factoring in prior probabilities; the probability of an email being spam and the probability of an email being not spam. For example, if we had $70$ not spam emails and $30$ spam emails, our prior probability of not spam would be $0.70$, and spam $0.30$.

4.3 Modeling $P(X \mid \text{class})$

So, how exactly will we model our likelihoods? For instance, say that we have a document containing a list of words, $x_1, x_2, \dots, x_n$. Well, the joint probability of observing those words is:

$$P(x_1, x_2, \dots, x_n \mid \text{class})$$

And based on our naive independence assumption this is simply equal to:

$$P(x_1, x_2, \dots, x_n \mid \text{class}) = P(x_1 \mid \text{class}) \cdot P(x_2 \mid \text{class}) \dots P(x_n \mid \text{class})$$

Take a moment to pause now and think about what we are missing in the above equation. Recall, when modeling boolean input features, we use the bernoulli distribution:

$$f(x \mid p) = p^x(1-p)^{1-x}$$

Where $x \in \{0, 1\}$. What are we not accounting for in our example above? Put another way:

Are we making use of all of the information that we have at our disposal to make the most accurate prediction?

This is a central theme of machine learning and probabilistic inference; we want to use all relevant information in order to make the most accurate predictions and inferences possible.

So, with that said, we are currently not incorporating words that we don't observe. For instance, if we do not observe the word "pills", then it may (slightly) decrease our likelihood of the email belonging to the spam class. We can include that by defining a vocabulary, $V$, which is simply the set of all the words occurring throughout all emails. So, if we then define $X$ to be the set of words occurring in the email:

$$ P(X \mid \text{class}) = \prod_{w \in X}P(w \mid \text{class}) \prod_{w \in \; V - X} \big( 1 - P(w \mid \text{class}) \big) $$

Where $V - X$ is the complement of $X$ (the set of vocabulary words that do not occur in $X$). We take the product of $1 - P(w \mid \text{class})$ because we are interested in the probability of those words not occurring. It should be clear that this is simply an extension of the bernoulli distribution: namely, the product of the bernoulli distributions for each word in our vocabulary, whether it occurred in our email or not.
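A sketch of this computation (assuming word_probs is a hypothetical dict mapping each vocabulary word to its estimated $P(w \mid \text{class})$, and email_words is the set of words appearing in the email), working in log space to avoid underflow:

def log_likelihood(email_words, word_probs):
    total = 0.0
    for w, p in word_probs.items():
        # occurring words contribute log(p); absent vocabulary words contribute log(1 - p)
        total += np.log(p) if w in email_words else np.log(1 - p)
    return total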

4.4 Similarity to $K$ Nearest Neighbors

For those familiar with KNN, note that conceptually speaking naive bayes is almost the exact opposite type of classifier. In the case of KNN, we are trying to approximate a function that takes in the words in an email and outputs spam or not spam:

$$f(\text{words in email}) \longrightarrow \text{spam / not spam}$$

Naive Bayes, on the other hand, assumes that all data we observe is produced from the actual class. In other words, the class itself is the data generating process. I will dig into this a bit more when we discuss generative vs. discriminative classifiers.

4.5 Naive Bayes Implementation

For this post I am going to create a Naive Bayes implementation based on the MNIST dataset. Now, this data is technically discrete, since each $x$ input is a pixel value ranging over [0, 255]. However, the physical reality is that we view varying colors as essentially continuous, so we can model this via a continuous distribution. We could binarize the pixels and use a distribution defined on $\{0, 1\}$, but instead we will model them using a gaussian distribution. In this example we will specifically be using a multivariate gaussian, as shown earlier.

If you have understood this post up until this point then you are just about all set to move on to the implementation, but I want to take a minute to go over a few more optimizations that we will be using first.

Optimization 1
First, because Naive Bayes treats all input dimensions as independent from one another, we do not actually need to use the full covariance matrix in the parameterization of our multivariate gaussian. Recall that the covariance matrix has the shape (here with three different dimensions):

$$\Sigma = \begin{bmatrix} C_{1,1} & C_{1,2} & C_{1,3}\\ C_{2,1} & C_{2,2} & C_{2,3}\\ C_{3,1} & C_{3,2} & C_{3,3}\\ \end{bmatrix} = \begin{bmatrix} \sigma^2_{1} & C_{1,2} & C_{1,3}\\ C_{2,1} & \sigma^2_{2} & C_{2,3}\\ C_{3,1} & C_{3,2} & \sigma^2_{3}\\ \end{bmatrix} $$

Where $\sigma^2_i$ is the variance of the $i$th dimension, and $C_{i,j}$ is the covariance of the $i$ and $j$th dimensions. They are defined as follows:

$$Cov(x_i, x_i) = \Sigma_{i,i}= \sigma^2_i = Var(x_i) = E \big[ (X_i - \mu_i)(X_i - \mu_i)\big] = E \big[ (X_i - \mu_i)^2\big]$$$$Cov(x_i, x_j) = \Sigma_{i,j} = E\Big[(x_i-\mu_i)(x_j-\mu_j)\Big]$$

Note, the covariance between two attributes is essentially an indication as to whether they change together or in opposite directions (I encourage you to experiment with the equation in order to see why this is the case). So, because of our naive assumption of dimensional independence, our covariance matrix actually has the form:

$$\Sigma = \begin{bmatrix} \sigma^2_{1} & 0 & 0\\ 0 & \sigma^2_{2} & 0\\ 0 & 0 & \sigma^2_{3}\\ \end{bmatrix} $$

Where the off diagonals have a value of $0$ to indicate that each dimension is entirely independent of the others; they do not change together at all. Because of this, we do not actually need to store a $D \times D$ covariance matrix; instead we can store a $D$-size vector (containing the variances along the diagonal; this is referred to as an axis-aligned elliptical covariance). This speeds up our calculations significantly because we do not need to invert the covariance matrix via the traditional method. Recall, in order to invert a diagonal matrix we simply need to replace each element along the diagonal with its reciprocal:

$$\Sigma^{-1} = \begin{bmatrix} \frac{1}{\sigma^2_{1}} & 0 & 0\\ 0 & \frac{1}{\sigma^2_{2}} & 0\\ 0 & 0 & \frac{1}{\sigma^2_{3}}\\ \end{bmatrix} $$

Thankfully, scipy's fantastic multivariate normal API allows us to pass in either a $D \times D$ covariance matrix, or a $D$-sized vector in its place. At this point we are essentially still doing:

$$P(X \mid class) = P(\text{pixel 1} \mid class)P(\text{pixel 2} \mid class) \dots P(\text{pixel 784} \mid class)$$$$P(X \mid class) = N(x_1 \mid \mu_1, \sigma_1^2)N(x_2 \mid \mu_2, \sigma_2^2) \dots N(x_{784} \mid \mu_{784}, \sigma_{784}^2)$$

In other words, we break each dimension off into its own univariate gaussian (parameterized by only a mean and variance, not a covariance) and multiply them all together to get a joint likelihood. Note, we can still use the multivariate function in scipy; by passing in a variance vector instead of the full covariance matrix, an equivalent calculation will occur.
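To make this concrete, here is a minimal sketch (with made-up numbers) verifying that the product of univariate gaussians matches scipy's multivariate gaussian when a variance vector is passed in place of the covariance matrix:

import numpy as np
from scipy.stats import norm, multivariate_normal as mvn

mu = np.array([0.0, 1.0, -1.0])    # hypothetical per-dimension means
var = np.array([1.0, 2.0, 0.5])    # hypothetical per-dimension variances
x = np.array([0.3, 0.9, -1.2])     # a single sample

# Product of independent univariate gaussians...
univariate_product = np.prod(norm.pdf(x, loc=mu, scale=np.sqrt(var)))

# ...equals the multivariate gaussian parameterized by a variance vector
multivariate = mvn.pdf(x, mean=mu, cov=var)

assert np.isclose(univariate_product, multivariate)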

Optimization 2
Our second optimization involves the utilization of the logarithm. Calculating the exponential function is time consuming, as is multiplication, both of which are currently present in our multivariate normal likelihood function:

$$f(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2 \pi)^D |\Sigma|}} exp\Big(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\Big)$$
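For reference, a direct implementation of this density might look like the sketch below (checked against scipy; all values are arbitrary). Note the determinant, inverse, and exponential that make it expensive:

import numpy as np
from scipy.stats import multivariate_normal as mvn

def gaussian_pdf(x, mu, Sigma):
    # Direct translation of the density above
    D = len(mu)
    diff = x - mu
    coefficient = 1.0 / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return coefficient * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

x = np.array([0.2, -0.4])
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])

assert np.isclose(gaussian_pdf(x, mu, Sigma), mvn.pdf(x, mean=mu, cov=Sigma))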

When including our prior, our current decision procedure looks like:

$$\text{Prediction} \longrightarrow argmax_C \Big\{ P(X \mid C) P(C) \Big\} $$

If we take the logarithm of the right side, our prediction will not change (again, based on the monotonic nature of the logarithm), but we will be able to turn the multiplication of the prior and likelihood into addition, and get rid of the exponential in the likelihood:

$$argmax_C \Big\{ log\big( P(X \mid C) \big) + log \big( P(C) \big) \Big\} $$$$ argmax_C \Big\{ log\big( f(x \mid \mu, \Sigma) \big) + log \big( P(C) \big) \Big\} $$$$ argmax_C \Big\{ log\big( \frac{1}{\sqrt{(2 \pi)^D |\Sigma|}} exp\Big(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\Big) \big) + log \big( P(C) \big) \Big\} $$$$ argmax_C \Big\{ log\big( \frac{1}{\sqrt{(2 \pi)^D |\Sigma|}} \big) + log\big( exp\Big(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\Big) \big) + log \big( P(C) \big) \Big\} $$$$\text{Prediction} \longrightarrow argmax_C \Big\{ log\big( \frac{1}{\sqrt{(2 \pi)^D |\Sigma|}} \big) - \frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) + log \big( P(C) \big) \Big\} $$

Luckily, scipy also has a built-in function for the log pdf of a multivariate gaussian that we can make use of.
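A minimal sketch of this in action (the parameters below are made up); the log posterior computed via logpdf matches taking the log after the fact:

import numpy as np
from scipy.stats import multivariate_normal as mvn

mu, var = np.zeros(3), np.ones(3)    # hypothetical class parameters
x = np.array([0.1, -0.2, 0.3])
prior = 0.5

# Log posterior (up to a constant): log likelihood plus log prior
log_posterior = mvn.logpdf(x, mean=mu, cov=var) + np.log(prior)

assert np.isclose(log_posterior, np.log(mvn.pdf(x, mean=mu, cov=var) * prior))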

Optimization 3
One final optimization/implementation detail we must make note of is smoothing. When inverting the covariance matrix, we can sometimes run into what is known as the singular covariance problem; this is the matrix equivalent of dividing by $0$. More can certainly be said on this, but the main idea is that the inverse of a matrix is found via (in a $2$ dimensional case):

$$A^{-1} = \frac{1}{|A|} \begin{bmatrix} d & -b \\ -c & a\\ \end{bmatrix} $$

Where $A$ is just equal to:

$$A = \begin{bmatrix} a & b \\ c & d\\ \end{bmatrix}$$

And $|A|$ is the determinant of $A$. Hence, this singular covariance problem arises when the determinant of $A$ is $0$. This occurs when the matrix transformation $A$ transforms space down onto a line (recall, the determinant is a representation of what a linear transformation does to area in the original space when it is applied. If it reduces area to $0$ by squishing space onto a line, the determinant is $0$). The simplest way to account for this is to add a smoothing parameter to our calculation. Originally our covariance matrix was calculated via:

$$\Sigma = \frac{\big(X - \mu \big)^T \big(X - \mu \big)}{N - 1}$$

Which is essentially the maximum likelihood estimate of $\Sigma$ (strictly speaking, the maximum likelihood estimate divides by $N$, while dividing by $N - 1$ gives the unbiased estimate). We will update this to be a smoothed estimate:

$$\Sigma = \frac{\big(X - \mu \big)^T \big(X - \mu \big)}{N - 1} + \lambda I$$

Where $\lambda$ is a small number like $10^{-3}$, and $I$ is the identity matrix.
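A quick illustration of the problem and the fix, using two perfectly correlated features (which is exactly the situation that produces a singular covariance):

import numpy as np

# Two perfectly correlated columns produce a singular covariance matrix
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])
Sigma = np.cov(X.T)
print(np.linalg.det(Sigma))                # ~0: the matrix cannot be inverted

Sigma_smoothed = Sigma + 1e-3 * np.eye(2)  # add lambda * I
Sigma_inv = np.linalg.inv(Sigma_smoothed)  # now well defined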

Pseudocode

That covers the bulk of the theory that we need to be familiar with in order to fully implement the naive bayes classifier. Before going ahead and implementing it in python, let's take a look at what this may look like in pseudocode.

def fit(X, Y): 
    dict_of_gaussians = {}
    priors = {}
    for c in classes:
        Xc = X[corresponding Y == c]
        mu, var = mean and diagonal covariance of Xc
        dict_of_gaussians[c] = {"mu": mu, "var": var}
        priors[c] = len(Xc) / len(X)

In our fit function, what we essentially are trying to do is calculate the mean and variance for each class. Recall that all we need in order to fully parameterize a gaussian are mean and variance; they are known as sufficient statistics. We then will have a predict function:

def predict(X):
    predictions = []

    for x in X:
        # Reset for each sample; otherwise the max carries over
        max_posterior = -inf
        best_class = None
        for c in classes:
            mu, var = dict_of_gaussians[c]["mu"], dict_of_gaussians[c]["var"]
            posterior = log_pdf(x, mu, var) + log(priors[c])
            if posterior > max_posterior:
                max_posterior = posterior
                best_class = c
        predictions.append(best_class)
    return predictions

Here, we just loop through every sample that we want to predict on. For each sample, we loop through each class, grabbing its mean, variance, and prior. We then use the mean and variance to determine the log likelihood of the sample belonging to that class. This, added to the log of the prior probability of the class, yields the (log) posterior. Finally, we predict the class that has the maximum posterior probability. Our model performance can be seen below (and the code behind it can be seen as well; scroll to the bottom of the page and click "show code", or simply check out the .py file on my github).

In [5]:
import numpy as np
from datetime import datetime
from scipy.stats import norm
from scipy.stats import multivariate_normal as mvn

from Machine_Learning.supervised_learning.utils import get_mnist_data


class NaiveBayes(object):
    def fit(self, X, Y, smoothing=10e-3):  # note: 10e-3 is 1e-2, not 1e-3
        self.gaussians = {}
        self.priors = {}
        class_labels = set(Y)

        for c in class_labels:
            # Loop through all training examples belonging to a particular class
            # Find mean and variance (these parameterize that class's gaussian)
            current_x = X[Y == c]
            self.gaussians[c] = {
                "mean": current_x.mean(axis=0),
                "var": current_x.var(axis=0) + smoothing
            }
            self.priors[c] = float(len(Y[Y == c])) / len(Y)  # Could calculate log prior

    def score(self, X, Y):
        P = self.predict(X)
        return np.mean(P == Y)

    def predict(self, X):
        N, D = X.shape
        K = len(self.gaussians) # Number of classes
        P = np.zeros((N, K)) # (holds probability of each sample, n, belonging to class, k)

        for class_, gaussian in self.gaussians.items():
            mean, var = gaussian["mean"], gaussian["var"]

            # Calculating N different log pdfs at the same time
            P[:, class_] = mvn.logpdf(X, mean=mean, cov=var) + np.log(self.priors[class_])

        return np.argmax(P, axis=1)


if __name__ == "__main__":
    X, Y = get_mnist_data()

    Ntrain = len(Y) // 2
    Xtrain, Ytrain = X[:Ntrain], Y[:Ntrain]
    Xtest, Ytest = X[Ntrain:], Y[Ntrain:]


    nb_model = NaiveBayes()
    t0 = datetime.now()
    nb_model.fit(Xtrain, Ytrain)
    print("Naive Bayes Training time:", (datetime.now() - t0))

    t0 = datetime.now()
    print("Training accuracy with Naive Bayes Model:", round(nb_model.score(Xtrain, Ytrain), 2))
    print("Time to compute trainin accuracy:", (datetime.now() - t0), "Train size:", len(Ytrain))

    t0 = datetime.now()
    print("Test accuracy with Naive Bayes Model:", round(nb_model.score(Xtest, Ytest), 2))
    print("Time to compute test accuracy:", (datetime.now() - t0), "Test size:", len(Ytest))
Reading and Transforming MNIST Data...
Naive Bayes Training time: 0:00:00.250052
Training accuracy with Naive Bayes Model: 0.8
Time to compute training accuracy: 0:00:03.333656 Train size: 21000
Test accuracy with Naive Bayes Model: 0.79
Time to compute test accuracy: 0:00:03.246695 Test size: 21000

5. Bayes Classifier $\longrightarrow$ Non-Naive Bayes

Now, our model performed rather well with a test accuracy of $0.79$, but I am sure that you are wondering how it would perform if we didn't utilize our naive assumption. Remember, this naive assumption was simply that all of our input features are independent. In the context of our image classification, this means that each pixel is entirely independent of the others. Clearly this is not realistic. In reality we can expect that pixels are most definitely dependent; if there is a dark pixel (the edge of a number in MNIST) we can imagine that there is a higher probability of its neighbor being dark as well. In other words:

$$P(\text{pixel 2 is dark} \mid \text{pixel 1 is dark} ) \neq P(\text{pixel 2 is dark}) $$

And our earlier statement no longer holds:

$$P(X \mid C) = \prod_i P(X_i \mid C)$$

It follows that in the case of non-naive bayes our features are not independent. The question then is: how do we model the likelihood, $P(X \mid C)$? In reality this is a very open ended question-we can model it however we want! For instance, we could simply use the full covariance matrix instead of the diagonal covariance matrix, hence eliminating our naive assumption. We could also let $P(X \mid C)$ be a hidden markov model, which I have an entire section dedicated to on this blog (check out the posts pertaining to continuous HMMs if interested). If we wanted to make things even more complex, we could implement a custom Bayes Net, which is what I will be doing in a future article. The idea behind a bayes net is that you get to model exactly how each variable interacts, and with what distribution. We can then test whether our assumptions are correct by comparing the accuracy of our models. The main thing to realize is that we can make this arbitrarily complex, which of course imposes a variety of tradeoffs.

5.1 Multivariate Gaussian with full covariance

As I said, in order to get rid of our naive assumption we will be making use of the full covariance matrix to parameterize our multivariate gaussian. A property of probability is that if two random variables are independent, their covariance is $0$. Mathematically that looks like:

$$X_i \text{ independent of } X_j \longrightarrow cov(X_i, X_j) = 0$$

And we can prove it as follows:

$$cov(X_i, X_j) = E \big[ (X_i - \mu_i)(X_j - \mu_j) \big]$$$$ = E \big[ X_i X_j - X_i \mu_j - X_j \mu_i + \mu_i \mu_j \big]$$$$ = E \big[ X_i X_j \big] - E \big[ X_i \mu_j \big] - E \big[ X_j \mu_i \big] + E \big[ \mu_i \mu_j \big]$$

Since our variables $X_i$ and $X_j$ are independent, we know that $E \big[ X_i X_j \big] = E \big[ X_i \big] E \big[X_j \big]$, and hence:

$$ = E \big[ X_i \big] E \big[X_j \big] - E \big[ X_i \big] \mu_j - E \big[ X_j\big] \mu_i + \mu_i \mu_j$$$$ = \mu_i \mu_j - \mu_i \mu_j - \mu_j \mu_i + \mu_i \mu_j$$$$cov(X_i, X_j) = 0$$

Note that the converse is not true in general; if two variables have $0$ covariance, they are not necessarily independent. The notable exception is the gaussian distribution: if two variables $X_i$ and $X_j$ are jointly gaussian and their covariance is $0$, then $X_i$ and $X_j$ are independent.
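Here is a quick numerical illustration of the first point, where $y$ is completely determined by $x$ and yet uncorrelated with it:

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100000)
y = x ** 2    # y is completely determined by x...

print(np.cov(x, y)[0, 1])    # ...yet their covariance is approximately 0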

So, in our non-naive case we simply will be using the full covariance matrix to parameterize our multivariate gaussian. You may be wondering, why not simply always use the full covariance matrix? Well, in order to calculate the multivariate gaussian likelihood based on the full covariance matrix we will need to find its determinant and inverse, which requires additional computation and will increase our calculation time.

Because scipy has such a friendly API, there really isn't much that we need to change in our implementation. Looking at the interface for scipy's multivariate normal log pdf function:

mvn.logpdf(X, mean=mean, cov=cov)

The cov parameter can be a full covariance matrix, a diagonal variance vector (what we used in the naive case, known as an axis-aligned elliptical gaussian), or even a single scalar (a circular gaussian).
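A quick sketch of the three accepted forms, with arbitrary values:

import numpy as np
from scipy.stats import multivariate_normal as mvn

x, mu = np.zeros(2), np.zeros(2)

print(mvn.logpdf(x, mean=mu, cov=np.array([[1.0, 0.3],
                                           [0.3, 1.0]])))   # full covariance
print(mvn.logpdf(x, mean=mu, cov=np.array([1.0, 1.0])))     # diagonal (axis-aligned)
print(mvn.logpdf(x, mean=mu, cov=1.0))                      # scalar (circular)

Let's give this implementation a shot below: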

In [7]:
import numpy as np
from datetime import datetime
from scipy.stats import multivariate_normal as mvn

from Machine_Learning.supervised_learning.utils import get_mnist_data


class Bayes(object):
    def fit(self, X, Y, smoothing=10e-3):  # note: 10e-3 is 1e-2, not 1e-3
        N, D = X.shape
        self.gaussians = {}
        self.priors = {}
        class_labels = set(Y)

        for c in class_labels:
            # Loop through all training examples belonging to a particular class
            # Find mean and variance
            current_x = X[Y == c]
            self.gaussians[c] = {
                "mean": current_x.mean(axis=0),
                "cov": np.cov(current_x.T) + np.eye(D) * smoothing  # Calculating full covariance
            }
            self.priors[c] = float(len(Y[Y == c])) / len(Y)  # Could calculate log prior

    def score(self, X, Y):
        P = self.predict(X)
        return np.mean(P == Y)

    def predict(self, X):
        N, D = X.shape
        K = len(self.gaussians) # Number of classes
        P = np.zeros((N, K)) # (holds probability of each sample, n, belonging to class, k)

        for class_, gaussian in self.gaussians.items():
            mean, cov = gaussian["mean"], gaussian["cov"]

            # Calculating N different log pdfs at the same time, using full covariance
            P[:, class_] = mvn.logpdf(X, mean=mean, cov=cov) + np.log(self.priors[class_])

        return np.argmax(P, axis=1)


if __name__ == "__main__":
    X, Y = get_mnist_data()

    Ntrain = len(Y) // 2
    Xtrain, Ytrain = X[:Ntrain], Y[:Ntrain]
    Xtest, Ytest = X[Ntrain:], Y[Ntrain:]


    nb_model = Bayes()
    t0 = datetime.now()
    nb_model.fit(Xtrain, Ytrain)
    print("Training time:", (datetime.now() - t0))

    t0 = datetime.now()
    print("Train accuracy:", round(nb_model.score(Xtrain, Ytrain), 3))
    print("Time to compute train accuracy:", (datetime.now() - t0), "Train size:", len(Ytrain))

    t0 = datetime.now()
    print("Test accuracy:", round(nb_model.score(Xtest, Ytest), 3))
    print("Time to compute test accuracy:", (datetime.now() - t0), "Test size:", len(Ytest))
Reading and Transforming MNIST Data...
Training time: 0:00:00.304378
Train accuracy: 0.973
Time to compute train accuracy: 0:00:03.208424 Train size: 21000
Test accuracy: 0.942
Time to compute test accuracy: 0:00:03.200256 Test size: 21000

By changing only two lines of code in order to utilize the full covariance matrix, we were able to achieve significantly better results-over 94% accuracy on the test set! At this point we can clearly see the cost of the naive assumption (and the power of the full Bayes classifier). It is also worth noting that we did not pay a significant speed penalty for the increased performance.

6. Generative vs. Discriminative Models

With that the majority of this article is complete, and if you were just looking for the implementation details of a Bayes Classifier, feel free to move on. However, I want to take a moment to place these classifiers in the context of a more general framework-generative vs. discriminative models.

Recall that for all classifiers that output a probability, the probability that we make our prediction from is the posterior, $P(C \mid X)$. We call classifiers that model this probability directly discriminative classifiers. In the case of logistic regression, its mathematical representation is:

$$z = x_1w_1 + x_2w_2 + \dots + x_nw_n + w_0$$$$P(Y = \text{class }1 \mid X) = P(C \mid X) = \frac{1}{1 + e^{-z}}$$
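As a minimal sketch of this computation (the weights and inputs below are made up, not learned):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical (not learned) weights and bias for a 2-feature problem
w = np.array([0.8, -1.2])
w0 = 0.1
x = np.array([1.5, 0.3])

p_class_1 = sigmoid(x @ w + w0)    # P(C = 1 | X), modeled directly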

Given the data, these classifiers try to learn to discriminate between each class (hence the name). In the case of logistic regression, a hyperplane is used to discriminate between the classes. The general flow of information runs directly from the data to the class: $X \longrightarrow C$.

The data is used to predict the most likely class directly, and few assumptions are made (besides the fact that we are assuming our data is linearly separable). Now, in the case of generative classifiers we have the opposite scenario. Here, as was the case with bayesian classifiers, our main calculation is $P(X \mid C)$. We can still find the posterior via bayes rule, but the bulk of the effort is spent working with $P(X \mid C)$. We start with the class, and then model $X$: $C \longrightarrow X$.

The assumption is that each class has its own structure, and hence its own distribution of $X$. Based on this assumption, we treat each class as a data generating process (again, hence the name). Because the classes have different structures, they generate data in different ways, and we can use this to predict the class of an input data point. Note that it is this assumption about the distribution of the data that can make or break the model. It is a way of bringing in prior knowledge, which can give our model an advantage if our distribution choice is correct; if we are incorrect, our model performance will suffer.
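To make the "data generating process" idea concrete, here is a small sketch (with hypothetical parameters) that samples a class from the prior and then lets that class generate a data point:

import numpy as np

rng = np.random.default_rng(1)

# Each class as its own data generating process (hypothetical parameters)
mu = {0: np.array([0.0, 0.0]), 1: np.array([3.0, 3.0])}
cov = {0: np.eye(2), 1: np.array([[1.0, 0.8],
                                  [0.8, 1.0]])}
prior = [0.7, 0.3]

c = rng.choice([0, 1], p=prior)               # sample a class from the prior
x = rng.multivariate_normal(mu[c], cov[c])    # the class then generates the data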

Now, how do they compare in general? Well, generative classifiers are theoretically satisfying due to being based on the rules of probability. Each input variable is modeled directly, and we have the ability to change our model of $P(X \mid C)$ if it yields poor results (we can change the distribution we originally believed the data was generated from). Because of this, we know how each variable influences our final results. If you or your clients find it very important to be able to explain models in terms of input variables, this is very helpful (often the case in consulting). However, I would be remiss if I did not mention that historically, discriminative models have performed better-just look at the recent incredible successes of deep learning. Neural networks, though, are very hard to explain; the linear combination of two input features may have no physical meaning, even if it contributes to the network's predictive power.

With that said, I will close out this post at the risk of it becoming far too lengthy. Be on the lookout for new content related to bayesian regression and probabilistic graphical models in the coming weeks.

Appendix A: Linear and Quadratic Discriminant Analysis Derivations

One thing that we can do is derive a form of the Bayes Classifier that is even simpler to work with. You will often see this referred to as LDA (Linear Discriminant Analysis) or QDA (Quadratic Discriminant Analysis). Before we get started on the derivation, there are two fundamental assumptions that must be made clear:

  1. The data in a given class is Gaussian. In other words, $P(X \mid C) \approx N(X; \mu, \Sigma)$. Our gaussian can be full covariance (non-naive).
  2. We are dealing with a binary classification.

With that said, let's see how we could come up with a simpler representation of our bayesian classifier! We can refer to our two classes as $0$ and $1$. As before, our decision rule will be to predict the class whose posterior probability is the $argmax$. From Bayes rule we know that each posterior probability can be written as:

$$P(C = 0 \mid X) = \frac{P(X \mid C = 0) P(C = 0)}{P(X)}$$$$P(C = 1 \mid X) = \frac{P(X \mid C = 1) P(C = 1)}{P(X)}$$

Placing these expressions inside of our prediction rule we have:

$$\text{Prediction Rule: } \;\;\; argmax_C \Big \{ \frac{P(X \mid C = 0) P(C = 0)}{P(X)}, \frac{P(X \mid C = 1) P(C = 1)}{P(X)} \Big\}$$

Where the denominator can then be dropped (because the denominator of each expression does not depend on $C$, over which we are taking the $argmax$):

$$\text{Prediction Rule: } \;\;\; argmax_C \Big \{ P(X \mid C = 0) P(C = 0), P(X \mid C = 1) P(C = 1) \Big\}$$

At this point we can substitute our expressions for each probability distribution, namely:

$$P(X \mid C = 0) = \frac{1}{\sqrt{(2 \pi)^D |\Sigma_0|}} exp\Big(-\frac{1}{2} (X - \mu_0)^T \Sigma_0^{-1} (X - \mu_0)\Big)$$

And the equivalent for class $1$. This leaves us with:


$$argmax_C \Big \{ \frac{1}{\sqrt{(2 \pi)^D |\Sigma_0|}} exp\Big(-\frac{1}{2} (X - \mu_0)^T \Sigma_0^{-1} (X - \mu_0)\Big) P(C = 0), \frac{1}{\sqrt{(2 \pi)^D |\Sigma_1|}} exp\Big(-\frac{1}{2} (X - \mu_1)^T \Sigma_1^{-1} (X - \mu_1)\Big) P(C = 1) \Big\}$$

And because our classification is binary, $P(C = 0) + P(C = 1) = 1$, and we can represent the prior probabilities via a single term, $\alpha$. Specifically, we can let $P(C = 1) = \alpha$ and $P(C = 0) = 1 -\alpha$.


$$argmax_C \Big \{ \frac{1}{\sqrt{(2 \pi)^D |\Sigma_0|}} exp\Big(-\frac{1}{2} (X - \mu_0)^T \Sigma_0^{-1} (X - \mu_0)\Big) (1 - \alpha), \frac{1}{\sqrt{(2 \pi)^D |\Sigma_1|}} exp\Big(-\frac{1}{2} (X - \mu_1)^T \Sigma_1^{-1} (X - \mu_1)\Big) \alpha \Big\}$$

To simplify this and get rid of the exponential and some of the multiplication, we can take the logarithm of each expression:


$$argmax_C \Big \{ K_0 - \frac{1}{2} (X - \mu_0)^T \Sigma_0^{-1} (X - \mu_0) + log(1 - \alpha), K_1 -\frac{1}{2} (X - \mu_1)^T \Sigma_1^{-1} (X - \mu_1) + log(\alpha) \Big\}$$

Where we replaced the log of the coefficient of each gaussian with $K_0$ and $K_1$, respectively. At this point, we can take into account the fact that the argmax really just determines which expression yields the greater output and returns that class. Another way we can write this decision rule is that we will predict whichever expression is greater:

$$ K_1 -\frac{1}{2} (X - \mu_1)^T \Sigma_1^{-1} (X - \mu_1) + log(\alpha) > K_0 - \frac{1}{2} (X - \mu_0)^T \Sigma_0^{-1} (X - \mu_0) + log(1 - \alpha) $$

Above, we predict class $1$ if the left hand side is indeed greater than the right; else we predict class $0$. We can then move everything to the left hand side:

$$ K_1 -\frac{1}{2} (X - \mu_1)^T \Sigma_1^{-1} (X - \mu_1) + log(\alpha) - K_0 + \frac{1}{2} (X - \mu_0)^T \Sigma_0^{-1} (X - \mu_0) - log(1 - \alpha) > 0 $$

We can then expand the multiplications as (in the case of the first multiplication):

$$(X - \mu_1)^T \Sigma_1^{-1} (X - \mu_1)$$$$(X^T - \mu_1^T) \Sigma_1^{-1} (X - \mu_1)$$$$(X^T - \mu_1^T) (\Sigma_1^{-1}X - \Sigma_1^{-1}\mu_1)$$$$X^T\Sigma_1^{-1}X - \mu_1^T\Sigma_1^{-1}X - X^T \Sigma_1^{-1} \mu_1 + \mu_1^T\Sigma_1^{-1}\mu_1$$

Doing the same thing for the second multiplication, and substituting in to our equation we see that:


$$ K_1 -\frac{1}{2} \big( X^T\Sigma_1^{-1}X - \mu_1^T\Sigma_1^{-1}X - X^T \Sigma_1^{-1} \mu_1 + \mu_1^T\Sigma_1^{-1}\mu_1 \big) + log(\alpha) - K_0 + \frac{1}{2} \big( X^T\Sigma_0^{-1}X - \mu_0^T\Sigma_0^{-1}X - X^T \Sigma_0^{-1} \mu_0 + \mu_0^T\Sigma_0^{-1}\mu_0 \big) - log(1 - \alpha) > 0 $$

Then, by taking advantage of the following equality:

$$\mu_1^T \Sigma_1^{-1} X = X^T \Sigma_1^{-1} \mu_1$$

Which can quickly be verified from a dimensional perspective (in this case we are only working with a single $X$ vector, as opposed to an $N x D$ matrix $X$):

$$\mu_1^T \Sigma_1^{-1} X = X^T \Sigma_1^{-1} \mu_1$$$$[1\; x \; D][D \; x \; D][D \; x \; 1] = [1\; x \; D][D \; x \; D][D \; x \; 1] $$
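For the skeptical, here is a quick numeric $2x2$ check of this equality (the values are arbitrary; note that the inverse of a symmetric matrix is itself symmetric):

import numpy as np

mu = np.array([1.0, 2.0])
x = np.array([0.5, -1.0])
Sigma_inv = np.linalg.inv(np.array([[2.0, 0.3],
                                    [0.3, 1.0]]))

assert np.isclose(mu @ Sigma_inv @ x, x @ Sigma_inv @ mu)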

With that said, by taking advantage of this equality we arrive at the simplification:


$$K_1 -\frac{1}{2} X^T\Sigma_1^{-1}X + \mu_1^T\Sigma_1^{-1}X -\frac{1}{2} \mu_1^T\Sigma_1^{-1}\mu_1 + log(\alpha) - K_0 +\frac{1}{2} X^T\Sigma_0^{-1}X - \mu_0^T\Sigma_0^{-1}X +\frac{1}{2} \mu_0^T\Sigma_0^{-1}\mu_0 - log(1 - \alpha) > 0 $$

Finally, we can combine terms by their degree in $X$, and see the following:


$$ \overbrace{ X^T \frac{1}{2} \big( \Sigma_0^{-1} - \Sigma_1^{-1} \big)X}^\text{Quadratic Term} + \overbrace{ \big(\mu_1^T \Sigma_1^{-1} - \mu_0^T \Sigma_0^{-1} \big)X }^\text{Linear Term} + \overbrace{ \frac{1}{2} \mu_0^T \Sigma_0^{-1}\mu_0 - \frac{1}{2} \mu_1^T \Sigma_1^{-1}\mu_1 + K_1 - K_0 + log(\alpha) - log(1 - \alpha)}^\text{Constant Term} > 0 $$

What we have ended up with is a multidimensional quadratic equation-this is our prediction rule! We plug in $X$ to the above equation and if it is greater than $0$ we predict class $1$, else we predict class $0$. Because this is a quadratic equation we call this Quadratic Discriminant Analysis.

Now, you are likely wondering why exactly this is better. Well, it comes down to implementation; in the earlier bayesian classifier that we worked with, we needed to actually calculate the likelihood (PDF) using the scipy library; this required taking the matrix determinant, inverse, exponentiation, etc. These operations are computationally expensive! However, in this QDA form we can solve for the parameters in terms of $\mu_0$, $\mu_1$, $\Sigma_0$, $\Sigma_1$, and $\alpha$ once, and then simply plug into the quadratic equation directly, which is computationally much faster.

We can come up with a simpler representation of the above decision rule, namely:

$$X^T A X + w^TX + b > 0$$

Where:

$$A = \frac{1}{2} \big( \Sigma_0^{-1} - \Sigma_1^{-1} \big)$$$$w^T = \big(\mu_1^T \Sigma_1^{-1} - \mu_0^T \Sigma_0^{-1} \big)$$$$b = -\frac{1}{2} \mu_1^T \Sigma_1^{-1}\mu_1 + \frac{1}{2} \mu_0^T \Sigma_0^{-1}\mu_0 + K_1 - K_0 + log(\alpha) - log(1 - \alpha)$$
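Here is a minimal sketch of precomputing these parameters with numpy, assuming $\mu_0$, $\mu_1$, $\Sigma_0$, $\Sigma_1$, and $\alpha$ have already been estimated from the data (the function names are hypothetical, not part of the implementation above):

import numpy as np

def qda_parameters(mu0, Sigma0, mu1, Sigma1, alpha):
    # Precompute A, w, and b once; predictions then need no inversions
    D = len(mu0)
    Sigma0_inv, Sigma1_inv = np.linalg.inv(Sigma0), np.linalg.inv(Sigma1)

    # K_c is the log of each gaussian's normalizing coefficient
    K0 = -0.5 * (D * np.log(2 * np.pi) + np.log(np.linalg.det(Sigma0)))
    K1 = -0.5 * (D * np.log(2 * np.pi) + np.log(np.linalg.det(Sigma1)))

    A = 0.5 * (Sigma0_inv - Sigma1_inv)
    w = Sigma1_inv @ mu1 - Sigma0_inv @ mu0
    b = (-0.5 * mu1 @ Sigma1_inv @ mu1 + 0.5 * mu0 @ Sigma0_inv @ mu0
         + K1 - K0 + np.log(alpha) - np.log(1 - alpha))
    return A, w, b

def qda_predict(x, A, w, b):
    # Predict class 1 when the quadratic expression is positive
    return int(x @ A @ x + w @ x + b > 0)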

At this point any subsequent predictions require no matrix inversion or determinants. Take note of what happens if the covariance of class $1$ is equal to the covariance of class $0$: $A = 0$, and the quadratic term drops out! It is in exactly this situation that we get what is called Linear Discriminant Analysis.

Linear classifiers are particularly interesting because they all find a weight vector and a bias term. The difference is in how they do it-each method comes with its own strengths and weaknesses. In the bayesian case we need to make assumptions about the distributions of the data (see generative models). From a theoretical standpoint, if these assumptions turn out to be true, the bayesian classifier is the optimal classifier$^3$! Intuitively this should make sense; we are encoding/providing additional information that our model can use to its advantage, giving it additional predictive power. This is not so in the discriminative case.

Appendix: References

  1. Introduction to Statistical Learning, Chapter 3
  2. Probability Theory: The Logic of Science, Chapter 2
  3. Introduction to Statistical Learning, Chapter 2.2.3

© 2018 Nathaniel Dake