We have now covered the vast majority of the theory behind HMMs. We have seen how they model the probability of a sequence, and how they handle sequences driven by latent states. We also saw how they can be extended to continuous observations by replacing the $B$ emission matrix with a Gaussian mixture model. In this post I want to go over a few real-world applications of HMMs and leave you with a few ideas of what else is possible.
Recall from the introduction that even though the HMM is an unsupervised model (it models a probability distribution), we can use it effectively as a classifier by applying Bayes' rule. There is a name for this type of classifier: a generative model. The name comes from the fact that we model the probability density of each class and treat that density as the process that generated the data we observe.
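To make this concrete, the decision rule of a generative classifier is just Bayes' rule with the class-conditional likelihood plugged in; the HMMClassifier implemented below uses exactly this rule, working with log-likelihoods and log-priors:

$$p(C = k \mid x) = \frac{p(x \mid C = k) \, p(C = k)}{\sum_j p(x \mid C = j) \, p(C = j)} \propto p(x \mid C = k) \, p(C = k)$$

$$\hat{k} = \underset{k}{\text{argmax}} \Big[ \log p(x \mid C = k) + \log p(C = k) \Big]$$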
Compare this to discriminative models, such as logistic regression or deep neural networks. These models give you the probability of the class given the input directly; there is no modeling of the likelihood.
Interestingly, these two types of models sometimes intersect. In my posts on logistic regression I discuss how we usually cannot solve for the weights directly and must use gradient descent. There is an exception to this rule, however: if we assume each class is Gaussian with the same covariance but a different mean, then the weights can be solved for in closed form. And if we know that both classes are Gaussian, and we know their means and covariances, then we know their likelihoods, so the model is generative as well. In this particular setup, generative and discriminative models intersect.
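As a quick sketch of that closed-form solution: for two Gaussian classes with means $\mu_0$ and $\mu_1$ and a shared covariance $\Sigma$, pushing the Gaussian densities through Bayes' rule yields a posterior of exactly the logistic form $p(C = 1 \mid x) = \frac{1}{1 + e^{-(w^T x + b)}}$, with

$$w = \Sigma^{-1} (\mu_1 - \mu_0)$$

$$b = -\frac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2} \mu_0^T \Sigma^{-1} \mu_0 + \log \frac{p(C = 1)}{p(C = 0)}$$

In other words, in this case the "learned" weights are completely determined by the class statistics.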
Now, you may be wondering what the benefits and drawbacks of each model type are. Generative models tend to have a great deal of theory behind them and are very principled. However, discriminative models often simply work better in practice (deep learning being the obvious example). With that said, one of the drawbacks of deep learning is that it is very hard to explain exactly what is going on, or what features are being learned. There is no graphical model that you can point to and say: "here is a strong connection between these two nodes which represents something in physical reality". That is often fine, because what we usually care about is accuracy and functionality, not just simple relationships between input variables. The strength of neural networks is that they model complex relationships; unfortunately, that complexity means they can't be explained in terms that would satisfy a statistician.
Now, let's take a chance to implement an HMM classifier; specifically, we will try to classify poetry as being written by either Robert Frost or Edgar Allan Poe.
import string
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import bernoulli, binom, norm
from sklearn.utils import shuffle
from nltk import pos_tag, word_tokenize
from IPython.core.display import HTML

from hmm.hmm_theano import HMM
from hmm.utils import get_obj_s3
# from hmm.generate_hmm_continuous import get_signals, big_init, simple_init
# from hmm.utils import random_normalized

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
sns.set(style="white", palette="husl")
sns.set_context("talk")
sns.set_style("ticks")
class HMMClassifier():
    def __init__(self):
        pass

    def fit(self, X, Y, V):
        K = len(set(Y))  # Number of classes
        N = len(Y)
        self.models = []
        self.priors = []  # Log priors for Bayes' rule
        for k in range(K):
            # Gather all x's that belong to class k
            thisX = [x for x, y in zip(X, Y) if y == k]
            C = len(thisX)  # Number of elements of class k
            self.priors.append(np.log(C) - np.log(N))  # Log prior = log(C / N)
            hmm = HMM(5)  # Create an HMM with 5 hidden states
            hmm.fit(thisX, V=V, p_cost=0.1, print_period=1, learning_rate=10e-5, max_iter=100)
            self.models.append(hmm)

    def score(self, X, Y):
        N = len(Y)
        correct = 0
        for x, y in zip(X, Y):
            # Posterior (up to a constant) = log-likelihood + log prior, for each class
            posteriors = [hmm.log_likelihood(x) + prior for hmm, prior in zip(self.models, self.priors)]
            p = np.argmax(posteriors)
            if p == y:
                correct += 1
        return float(correct) / N


def get_tags(s):
    """Determines parts-of-speech tags for a given string."""
    word_tag_tuples = pos_tag(word_tokenize(s))
    return [tag for word, tag in word_tag_tuples]


def get_data():
    """Gathers lines of text for each author and determines the POS tags for each."""
    word2idx = {}
    current_idx = 0
    X = []  # Sequences
    Y = []  # Labels
    for file_name, label in zip(("robert_frost.txt", "edgar_allan_poe.txt"), (0, 1)):
        count = 0
        for line in get_obj_s3(file_name).read().decode("utf-8").split("\n"):
            line = line.rstrip()
            if line:
                tokens = get_tags(line)
                if len(tokens) > 1:
                    for token in tokens:
                        if token not in word2idx:
                            word2idx[token] = current_idx
                            current_idx += 1
                    sequence = np.array([word2idx[w] for w in tokens])
                    X.append(sequence)
                    Y.append(label)
                    count += 1
                    if count >= 50:
                        break
    print("Vocabulary: ", word2idx.keys())
    return X, Y, current_idx


def main():
    # We will not be using the words directly because there are so many of them;
    # rather, each sequence is the line's parts-of-speech tags
    X, Y, V = get_data()
    X, Y = shuffle(X, Y)
    N = 20  # Number of test samples
    Xtrain, Ytrain = X[:-N], Y[:-N]
    Xtest, Ytest = X[-N:], Y[-N:]
    model = HMMClassifier()
    model.fit(Xtrain, Ytrain, V)
    print("Score: ", model.score(Xtest, Ytest))


if __name__ == "__main__":
    main()
We are now going to go through an example that deals with parts-of-speech tagging. Below is a preview of what the data looks like:
Confidence NN B-NP
in IN B-NP
the DT B-NP
pound NN I-NP
is VBZ B-VP
widely RB I-VP
expected VBN I-VP
So, each token is on a separate line alongside its tag. We are interested in the first two columns: every word has a corresponding part-of-speech tag.
Now, in general, when faced with a machine learning problem, a very good thing to do is establish a baseline. What is a good baseline that we know can be used for classification? Logistic regression. We can see how this would be applied very easily: we would one-hot encode the words and then perform a softmax over the tags. We will see that we can already do very well with just that (over 90% accuracy). This means that over 90% of the time, a word has the same tag; it is fairly rare for a word to be used as more than one part of speech. For example:
I drank milk this morning.
This company is milking its customers' money.
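As a small illustrative aside (toy sizes, not the real dataset): multiplying a one-hot word vector by the weight matrix is the same thing as selecting the row of the matrix at that word's index, which is the trick the logistic regression code below exploits via self.W[thX].

import numpy as np

V, K = 5, 3                  # toy vocabulary size and number of tags
W = np.random.randn(V, K)    # weight matrix from words to tags
word_idx = 2
one_hot = np.zeros(V)
one_hot[word_idx] = 1.0
# The one-hot matrix product and the row lookup give the same logits
assert np.allclose(one_hot @ W, W[word_idx])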
We will also calculate the F1 score, which is the harmonic mean of precision and recall:
$$\text{Precision} = \frac{\text{True positives}}{\text{True positives + False positives}}$$

$$\text{Recall} = \frac{\text{True positives}}{\text{True positives + False negatives}}$$

$$F1 = \frac{2(precision \cdot recall)}{precision + recall}$$

One question you may want to ask: since these words show up in a sequence, can we use the context to achieve better predictions? The answer is yes, we can! We could use a recurrent neural network, or a Hidden Markov Model. We will use a Hidden Markov Model, since that is the focus of this post.
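Before we do, here is a quick sanity check of the precision, recall, and F1 definitions above using sklearn's metrics (the labels are made up purely for illustration):

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]   # TP = 3, FP = 1, FN = 1

print(precision_score(y_true, y_pred))  # 3 / 4 = 0.75
print(recall_score(y_true, y_pred))     # 3 / 4 = 0.75
print(f1_score(y_true, y_pred))         # 2(0.75 * 0.75) / 1.5 = 0.75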
We can think of the words as the observed sequence, and the tags as the sequence of hidden states. There is one small twist, however. Generally, with HMMs we use an algorithm called expectation-maximization to train the model. We will not need it in this scenario, because we actually know the hidden states during training; in other words, the hidden states are not some abstract hyperparameter, but an actual, observable quantity.
Recall that an HMM is defined by three things: the initial state distribution $\pi$, the state transition matrix $A$, and the observation (emission) matrix $B$.
All of these can be estimated by maximum likelihood directly, just by counting! For example, $\pi$ is simply the frequency with which each tag starts a sentence. We know that $A$ is Markov, so we can count up all of the tag-to-tag transitions and divide each row by its sum. Finally, the observation probabilities depend only on the current state, so $B$ is just the probability of a word given the tag, which can also be computed by counting.
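Here is a minimal sketch of that counting, assuming the words and tags have already been integer-encoded ($M$ tags, vocabulary size $V$); counting_mle is just an illustrative helper name, and the full version at the end of the post adds smoothing so that unseen transitions and words do not receive zero probability.

import numpy as np

def counting_mle(X, Y, M, V):
    """Maximum-likelihood estimates of pi, A, and B obtained purely by counting.
    X: list of word-index sequences; Y: list of tag-index sequences."""
    pi = np.zeros(M)
    A = np.zeros((M, M))
    B = np.zeros((M, V))
    for x, y in zip(X, Y):
        if len(y) == 0:
            continue
        pi[y[0]] += 1                    # which tag starts the sentence
        for t in range(len(y) - 1):
            A[y[t], y[t + 1]] += 1       # tag -> tag transition counts
        for xt, yt in zip(x, y):
            B[yt, xt] += 1               # word counts per tag
    pi /= pi.sum()
    A /= A.sum(axis=1, keepdims=True)    # normalize each row into a distribution
    B /= B.sum(axis=1, keepdims=True)
    return pi, A, B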
So, we are going to solve the parts of speech problem as follows:
import numpy as np
import theano
import theano.tensor as T
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
from sklearn.metrics import f1_score
from sklearn.tree import DecisionTreeClassifier
from hmm.utils import get_obj_s3
class LogisticRegression:
    def __init__(self):
        pass

    def fit(self, X, Y, V=None, K=None, lr=10e-1, mu=0.99, batch_sz=100, epochs=6):
        """Fit our logistic regression model.

        Variables:
            - V: Vocabulary size, aka the number of input features
            - K: The number of classes (POS tags)
            - N: The number of training examples
            - W: Weight matrix from input layer (V) to output layer (K)

        Remember, the W matrix in logistic regression can be thought
        of as a learned representation of the output classes, K. Each
        column (K total) holds a set of weights that map back to each
        individual node in the input layer. These columns can be thought
        of as representations in the vocabulary space of each part-of-speech
        tag. For an individual sample we can see how similar it is to each
        column of W to determine the most probable class.

        Steps:
            - Define all necessary variables
            - Determine p(y | x), where x is a word vector representing the input word
        """
        if V is None:
            V = len(set(X))
        if K is None:
            K = len(set(Y))
        N = len(X)

        # Create our weight matrix, V x K
        W = np.random.randn(V, K) / np.sqrt(V + K)
        b = np.zeros(K)
        self.W = theano.shared(W)
        self.b = theano.shared(b)
        self.params = [self.W, self.b]

        thX = T.ivector("X")
        thY = T.ivector("Y")

        # The general equation for logistic regression is:
        #   p(y | x) = softmax(x.dot(W) + b)
        # Because thX is just an array of word indices (each word being a
        # one-hot vector), x.dot(W) is equivalent to the row lookup self.W[thX]
        py_x = T.nnet.softmax(self.W[thX] + self.b)
        prediction = T.argmax(py_x, axis=1)

        cost = -T.mean(T.log(py_x[T.arange(thY.shape[0]), thY]))
        grads = T.grad(cost, self.params)
        dparams = [theano.shared(p.get_value() * 0) for p in self.params]
        self.cost_predict_op = theano.function(
            inputs=[thX, thY],
            outputs=[cost, prediction],
            allow_input_downcast=True,
        )

        # Gradient descent with momentum
        updates = [
            (p, p + mu*dp - lr*g) for p, dp, g in zip(self.params, dparams, grads)
        ] + [
            (dp, mu*dp - lr*g) for dp, g in zip(dparams, grads)
        ]
        train_op = theano.function(
            inputs=[thX, thY],
            outputs=[cost, prediction],
            updates=updates,
            allow_input_downcast=True
        )

        costs = []
        n_batches = N // batch_sz
        for i in range(epochs):
            X, Y = shuffle(X, Y)
            print("epoch:", i)
            for j in range(n_batches):
                Xbatch = X[j*batch_sz:(j*batch_sz + batch_sz)]
                Ybatch = Y[j*batch_sz:(j*batch_sz + batch_sz)]
                c, p = train_op(Xbatch, Ybatch)
                costs.append(c)
                if j % 200 == 0:
                    print(
                        "i:", i, "j:", j,
                        "n_batches:", n_batches,
                        "cost:", c,
                        "error:", np.mean(p != Ybatch)
                    )
        plt.figure(figsize=(8, 5))
        plt.plot(costs, color="blue")
        plt.xlabel("Iteration Number")
        plt.ylabel("Cost")
        plt.show()

    def score(self, X, Y):
        _, p = self.cost_predict_op(X, Y)
        return np.mean(p == Y)

    def f1_score(self, X, Y):
        _, p = self.cost_predict_op(X, Y)
        return f1_score(Y, p, average=None).mean()
def get_data(split_sequences=False):
    word2idx = {}
    tag2idx = {}
    word_idx = 0
    tag_idx = 0
    Xtrain = []
    Ytrain = []
    currentX = []
    currentY = []
    for line in get_obj_s3("pos_train.txt").read().decode("utf-8").split("\n"):
        line = line.rstrip()
        if line:
            r = line.split()
            word, tag, _ = r
            if word not in word2idx:
                word2idx[word] = word_idx
                word_idx += 1
            currentX.append(word2idx[word])
            if tag not in tag2idx:
                tag2idx[tag] = tag_idx
                tag_idx += 1
            currentY.append(tag2idx[tag])
        elif split_sequences:
            # A blank line marks the end of a sentence
            Xtrain.append(currentX)
            Ytrain.append(currentY)
            currentX = []
            currentY = []
    if not split_sequences:
        Xtrain = currentX
        Ytrain = currentY

    # Load the test data
    Xtest = []
    Ytest = []
    currentX = []
    currentY = []
    for line in get_obj_s3("pos_test.txt").read().decode("utf-8").split("\n"):
        line = line.rstrip()
        if line:
            r = line.split()
            word, tag, _ = r
            if word in word2idx:
                currentX.append(word2idx[word])
            else:
                currentX.append(word_idx)  # use the last index as "unknown word"
            currentY.append(tag2idx[tag])
        elif split_sequences:
            Xtest.append(currentX)
            Ytest.append(currentY)
            currentX = []
            currentY = []
    if not split_sequences:
        Xtest = currentX
        Ytest = currentY
    return Xtrain, Ytrain, Xtest, Ytest, word2idx
def main():
    Xtrain, Ytrain, Xtest, Ytest, word2idx = get_data()

    # Convert to numpy arrays
    Xtrain = np.array(Xtrain)
    Ytrain = np.array(Ytrain)

    N = len(Xtrain)
    V = len(word2idx) + 1  # + 1 for the "unknown word" index
    print("vocabulary size:", V)
    # Optionally convert Xtrain to an indicator (one-hot) matrix:
    # Xtrain_indicator = np.zeros((N, V))
    # Xtrain_indicator[np.arange(N), Xtrain] = 1

    # Decision tree baseline (without the indicator matrix)
    dt = DecisionTreeClassifier()
    dt.fit(Xtrain.reshape(N, 1), Ytrain)
    print("dt train score:", dt.score(Xtrain.reshape(N, 1), Ytrain))
    p = dt.predict(Xtrain.reshape(N, 1))
    print("dt train f1:", f1_score(Ytrain, p, average=None).mean())

    # With the indicator matrix -- too slow!!
    # dt.fit(Xtrain_indicator, Ytrain)
    # print("dt score:", dt.score(Xtrain_indicator, Ytrain))

    # Train and score logistic regression
    model = LogisticRegression()
    model.fit(Xtrain, Ytrain, V=V)
    print("training complete")
    print("lr train score:", model.score(Xtrain, Ytrain))
    print("lr train f1:", model.f1_score(Xtrain, Ytrain))

    Ntest = len(Xtest)
    Xtest = np.array(Xtest)
    Ytest = np.array(Ytest)

    # Decision tree test score
    print("dt test score:", dt.score(Xtest.reshape(Ntest, 1), Ytest))
    p = dt.predict(Xtest.reshape(Ntest, 1))
    print("dt test f1:", f1_score(Ytest, p, average=None).mean())

    # Logistic regression test score
    print("lr test score:", model.score(Xtest, Ytest))
    print("lr test f1:", model.f1_score(Xtest, Ytest))


if __name__ == "__main__":
    main()
The nice thing about hidden Markov models is that if the hidden states have an explicit meaning, we can measure them directly. For instance, if our hidden states represent the part-of-speech tag for a given word, we can observe that tag for all of our training examples. In that case, we can determine $A$ rather easily: count up all of the times we transition from one POS tag to another, and divide by the total number of times we are in the first POS tag to begin with. For example, if we want to know the probability of transitioning from a noun to a verb, we would simply compute:
$$A(noun, verb) = \frac{\text{# of times a noun transitions to a verb}}{\text{total # of times the hidden state is a noun}}$$

where we know that $A$'s probabilistic meaning is:
$$A(noun, verb) = p \big(z(t) = verb \mid z(t-1) = noun\big)$$

This simple counting can be done for all of the different hidden state (POS tag) transitions (note that this is simply maximum likelihood estimation). In order to determine the $B$ observation matrix, we can simply count the number of times we observe a certain word in our vocabulary, given that we are in a particular hidden state (POS tag). For instance, if we wanted to know the probability of observing the word milk given that our tag is a verb, we would compute:

$$B(milk, verb) = \frac{\text{# of times milk is observed while the hidden state is verb}}{\text{total # of times the hidden state is verb}}$$

Again, we know that $B$ can be defined as:
$$B(milk, verb) = p \big(x(t) = milk \mid z(t) = verb\big)$$

One of the main algorithms that we can perform on an HMM is the Viterbi algorithm, which allows us to look at the problem in the reverse direction. Normally, we think of hidden causes as producing the observations that we see. The Viterbi algorithm allows us to ask:
Given a sequence of observations, what is the most likely sequence of hidden states?
So, in the example above, our observations would be a string of words, and the Viterbi algorithm would allow us to determine the most likely sequence of parts of speech. What is important to be aware of here is the API of the Viterbi algorithm:
input -> sequence of observations
output -> corresponding sequence of hidden states
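To make that API concrete, here is a minimal log-space Viterbi sketch in numpy (viterbi is an illustrative stand-in for the kind of computation a method like hmm.get_state_sequence performs; it assumes pi, A, and B contain strictly positive probabilities, as the smoothed counts in the final code below produce):

import numpy as np

def viterbi(x, pi, A, B):
    """Most likely hidden state sequence for observations x, in log space.
    pi: (M,) initial distribution, A: (M, M) transitions, B: (M, V) emissions."""
    T, M = len(x), len(pi)
    delta = np.zeros((T, M))            # best log-probability of any path ending in state j at time t
    psi = np.zeros((T, M), dtype=int)   # back-pointers
    delta[0] = np.log(pi) + np.log(B[:, x[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A)   # (M, M): previous state -> current state
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = np.max(scores, axis=0) + np.log(B[:, x[t]])
    # Backtrack from the best final state
    states = np.zeros(T, dtype=int)
    states[T - 1] = np.argmax(delta[T - 1])
    for t in range(T - 2, -1, -1):
        states[t] = psi[t + 1][states[t + 1]]
    return states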
If we think about what we are actually doing when modeling parts-of-speech tags with an HMM, we come to the conclusion that we are building a probability model of grammar! This is pretty cool if you ask me. Our model will encapsulate principles such as: a verb often follows a noun, while a verb is unlikely to follow another verb:
$$p(verb \mid noun) \rightarrow high$$

For example:
"John ran"
And likewise:
$$p(verb \mid verb) \rightarrow low$$

"walk jump"
Here is how we will use an HMM to do parts-of-speech tagging:
from functools import reduce

import numpy as np
from sklearn.metrics import f1_score

from hmm.discrete_hmm_scaled import HMM
from hmm.baseline_logistic_regression import get_data


def accuracy(T, Y):
    # T: targets, Y: predictions; inputs are lists of lists
    n_correct = 0
    n_total = 0
    for t, y in zip(T, Y):
        n_correct += np.sum(t == y)
        n_total += len(y)
    return float(n_correct) / n_total


def total_f1_score(T, Y):
    # Inputs are lists of lists
    T = np.concatenate(T)
    Y = np.concatenate(Y)
    return f1_score(T, Y, average=None).mean()


def main(smoothing=1e-1):
    # X = words, Y = POS tags
    Xtrain, Ytrain, Xtest, Ytest, word2idx = get_data(split_sequences=True)
    V = len(word2idx) + 1

    # Find the hidden state transition matrix (A) and initial state distribution (pi)
    M = len(set(reduce(lambda x, y: x + y, Ytrain))) + 1
    A = np.ones((M, M)) * smoothing  # Additive smoothing so no transition has zero probability
    pi = np.zeros(M)
    for y in Ytrain:
        # Loop through all hidden state (POS tag) sequences
        if len(y) > 0:
            pi[y[0]] += 1
            for i in range(len(y) - 1):
                A[y[i], y[i+1]] += 1
    # Turn A and pi into probability matrices
    A /= A.sum(axis=1, keepdims=True)
    pi /= pi.sum()

    # Find the observation matrix (B)
    B = np.ones((M, V)) * smoothing
    for x, y in zip(Xtrain, Ytrain):
        for xi, yi in zip(x, y):
            B[yi, xi] += 1
    B /= B.sum(axis=1, keepdims=True)

    hmm = HMM(M)
    hmm.pi = pi
    hmm.A = A
    hmm.B = B

    # Get the most likely hidden state sequence for each sentence
    Ptrain = []
    for x in Xtrain:
        p = hmm.get_state_sequence(x)
        Ptrain.append(p)

    Ptest = []
    for x in Xtest:
        p = hmm.get_state_sequence(x)
        Ptest.append(p)

    # Print results
    print("train accuracy:", accuracy(Ytrain, Ptrain))
    print("test accuracy:", accuracy(Ytest, Ptest))
    print("train f1:", total_f1_score(Ytrain, Ptrain))
    print("test f1:", total_f1_score(Ytest, Ptest))


if __name__ == '__main__':
    main()