This post is going to cover Hidden Markov Models, which are used for modeling sequences of data. Sequences appear everywhere: stock prices, language, credit scoring, webpage visits.
Often, we are dealing with sequences in machine learning and don't even realize it; or we may simply ignore the fact that the data came from a sequence. For instance, consider the following sentence:
"Like and cats dogs I"
Clearly, this sentence does not make any sense. This is what happens when you use a model such as bag of words. The fact that it becomes much harder to tell what a sentence means when you take away the time aspect tells you that there is a lot of information carried in the ordering. The original sentence was:
"I like cats and dogs"
This may be relatively easy to decode on your own, but you can imagine that this gets much harder as the sentence gets longer.
We will start by looking at the most basic Markov model, with no hidden portion. These are very useful for modeling sequences, as we will see. We will talk about the mathematical properties of the Markov model and go through a ton of examples so we can see how they are used. Google's PageRank algorithm is based on Markov models. So, despite being an old technique, Markov models are still very useful and relevant today.
We will also talk about how to model language, and how to analyze web visitor data, so you can fix problems like high bounce rate.
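Both of those applications rest on the same simple object: a matrix of transition probabilities between states (web pages, in this case). Here is a minimal sketch with a made-up transition matrix, purely for illustration. Repeatedly applying the matrix gives the long-run fraction of time spent on each page, which is essentially what PageRank computes (plus a damping factor):

```python
import numpy as np

# Hypothetical transition matrix over three pages: each row is a page,
# each entry is the probability of the next page visited (rows sum to 1).
P = np.array([
    [0.1, 0.6, 0.3],   # from page 0
    [0.4, 0.2, 0.4],   # from page 1
    [0.5, 0.3, 0.2],   # from page 2
])

# Start with all visitors on page 0 and repeatedly apply the transition
# matrix; the distribution converges to the stationary distribution.
state = np.array([1.0, 0.0, 0.0])
for _ in range(100):
    state = state @ P

print(state)  # long-run fraction of visits to each page
```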
Next, we will look at the Hidden Markov Model. This will be mathematically more complex, but the first section should prepare you. We will look at the three basic problems in hidden Markov modeling: computing the likelihood of an observed sequence, decoding the most likely sequence of hidden states, and learning the model parameters from data.
We can now discuss where HMMs fit in the spectrum of machine learning techniques. Hidden Markov models are for modeling sequences. If you think of a sequence by itself, it could look like:
$$x(1), x(2), \ldots, x(t), \ldots, x(T)$$
We can see that there is no label there, so HMMs just model the distribution of a sequence; this means that they are unsupervised.
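To make "modeling the distribution of a sequence" concrete, the plain first-order Markov model from the first section factorizes the joint probability of such a sequence as: $$p(x(1), x(2), \ldots, x(T)) = p(x(1)) \prod_{t=2}^{T} p(x(t) \; | \; x(t-1))$$An HMM keeps the same structure, except the Markov assumption is placed on a sequence of hidden states that we never observe directly.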
However, we often see HMMs being used as classifiers as well. For instance, we could train one HMM on male voices and another on female voices. Then, given a new voice sample, we could predict whether the voice is male or female.
How can we do this, given that we only have a model for the probability of the data, $p(X)$? The key idea here is Bayes' rule. What we actually modeled was $p(X \; | \; male)$ and $p(X \; | \; female)$. We know that Bayes' rule helps us reverse the conditional, leaving us with:
$$p(male \; | \; X) \quad \text{and} \quad p(female \; | \; X)$$Now we can find the most probable class, and that becomes our prediction. From Bayes' rule we know that:
$$p(male \; | \; X) = \frac{p(X \; | \; male) \, p(male)}{p(X)}$$$$p(female \; | \; X) = \frac{p(X \; | \; female) \, p(female)}{p(X)}$$And in general: $$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{normalization constant}}$$
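Going back to the voice example, here is a minimal sketch of that comparison in Python. The models and their `log_likelihood` method are assumptions here (any trained sequence model that can score $\log p(X \; | \; class)$ would do); working in log space avoids numerical underflow with long sequences:

```python
import numpy as np

def predict_gender(X, male_model, female_model, p_male=0.5, p_female=0.5):
    """Pick the more probable class for a sequence X using Bayes' rule.

    male_model / female_model: hypothetical trained models exposing an
    assumed log_likelihood(X) method that returns log p(X | class).
    """
    # log p(class | X) = log p(X | class) + log p(class) - log p(X);
    # p(X) is the same for both classes, so it drops out of the comparison.
    male_score = male_model.log_likelihood(X) + np.log(p_male)
    female_score = female_model.log_likelihood(X) + np.log(p_female)
    return "male" if male_score > female_score else "female"
```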
We do not care about the actual value of the posterior, just which one is greater, so the normalization constant $p(X)$, which is the same for every class, can be ignored. It should be noted that we can model the class-conditional $p(X \; | \; C)$ not only with an HMM, but also with Naive Bayes:
$$P(X \; | \; C) = P(x(1,1) \; | \; C) \times P(x(1,2) \; | \; C) \times \cdots \times P(x(T,D) \; | \; C)$$With Naive Bayes, we make the independence assumption: every feature at every time step is independent given the class. So, we take the probability of each individual feature and multiply them all together to get the final $P(X \; | \; C)$, ignoring the ordering entirely. We can even take the idea behind hidden Markov models further and model the data using more general concepts like Bayesian Belief Networks.
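As a minimal sketch of that product, assuming a Gaussian class-conditional density for each feature (an assumption made here just for illustration), we work in log space so the product becomes a sum:

```python
import numpy as np
from scipy.stats import norm

def naive_bayes_log_likelihood(X, means, stds):
    """log P(X | C) under the Naive Bayes independence assumption.

    X:           array of shape (T, D) -- one sequence of T steps, D features.
    means, stds: per-feature Gaussian parameters estimated from class C's
                 training data (the Gaussian form is an assumption here).
    """
    # Every x(t, d) is treated as independent given the class, so the
    # log-probabilities simply add up -- the ordering is ignored entirely.
    return norm.logpdf(X, loc=means, scale=stds).sum()
```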
At their core, HMMs are unsupervised. However, they can easily be used for classification just by training a separate model for each class, and then predicting whichever class gives you the maximum posterior probability.
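Putting it all together, here is a minimal sketch of that recipe, assuming the hmmlearn package with Gaussian emissions; the number of hidden states and the class priors are choices you would make for your own data:

```python
import numpy as np
from hmmlearn import hmm  # assumption: using the hmmlearn package

def train_class_hmms(sequences_by_class, n_states=4):
    """Fit one Gaussian HMM per class on that class's sequences."""
    models = {}
    for c, seqs in sequences_by_class.items():
        X = np.concatenate(seqs)          # stack sequences into one array
        lengths = [len(s) for s in seqs]  # where each sequence ends
        model = hmm.GaussianHMM(n_components=n_states)
        model.fit(X, lengths)
        models[c] = model
    return models

def predict(x, models, priors):
    """Most probable class: argmax over log p(x | C) + log p(C)."""
    scores = {c: m.score(x) + np.log(priors[c]) for c, m in models.items()}
    return max(scores, key=scores.get)
```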