
3. Markov Models Example Problems

We will now look at a model that examines our state of healthiness vs. being sick. Keep in mind that this is very much like something you could do in real life: if you wanted to model a certain situation or environment, you could take some data that you have gathered, build a maximum likelihood model on it, and do things like study the properties that emerge from the model, make predictions from the model, or generate the next most likely state.

Let's say we have 2 states: sick and healthy. We know that we spend most of our time in a healthy state, so the probability of transitioning from healthy to sick is very low:

$$p(sick \; | \; healthy) = 0.005$$

Hence, the probability of going from healthy to healthy is:

$$p(healthy \; | \; healthy) = 0.995$$

Now, on the other hand, the probability of going from sick to sick is also quite high. This is because if you just got sick yesterday, then you are very likely to still be sick tomorrow.

$$p(sick \; | \; sick) = 0.8$$

However, the probability of transitioning from sick to healthy should be much higher than the probability of transitioning from healthy to sick, because you generally won't stay sick for as long as you would stay healthy. Note that each row of the transition matrix must sum to 1, so:

$$p(healthy \; | \; sick) = 0.2$$

We have now fully defined our state transition matrix, and we can do some calculations.
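To make this concrete, here is a minimal sketch of the transition matrix in numpy (the state ordering, healthy = 0 and sick = 1, is my own choice):

import numpy as np

# Rows are the current state, columns the next state: 0 = healthy, 1 = sick
A = np.array([[0.995, 0.005],   # healthy -> healthy, healthy -> sick
              [0.2,   0.8]])    # sick -> healthy,    sick -> sick

assert np.allclose(A.sum(axis=1), 1)  # each row is a valid probability distribution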

1.1 Example Calculations

1.1.1 Probability of Staying Healthy

What is the probability of being healthy for 10 days in a row, given that we already start out as healthy? Since the first day is given, this only involves 9 healthy-to-healthy transitions, so that is:

$$p(healthy \; 10 \; days \; in \; a \; row \; | \; healthy \; at \; t=0) = 0.995^9 = 95.6 \%$$

How about the probability of being healthy for 100 days in a row?

$$p(healthy \; 100 \; days \; in \; a \; row \; | \; healthy \; at \; t=0) = 0.995^{99} = 60.9 \%$$
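Continuing the numpy sketch above, we can verify both of these numbers directly:

print(A[0, 0] ** 9)   # ~0.956, healthy 10 days in a row
print(A[0, 0] ** 99)  # ~0.609, healthy 100 days in a row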

2. Expected Number of Continuously Sick Days

We can now look at the expected number of days that you would remain in the same state (e.g. how many days would you expect to stay sick given the model?). This is a bit more difficult than the last problem, but completely doable, only involving the mathematics of infinite sums.

First, we can look at the probability of being in state $i$ and transitioning to state $i$ at the next time step. That is just $A(i,i)$:

$$p \big(s(t)=i \; | \; s(t-1)=i \big) = A(i, i)$$

Now, what is the probability distribution that we actually want to calculate? We want the probability that we stay in state $i$ for $n$ steps, at which point we move to another state. The probability of leaving state $i$ on any given step is:

$$p \big(s(t) \neq i \; | \; s(t-1)=i \big) = 1 - A(i, i)$$

So, the joint probability that we are trying to model is:

$$p\big(s(1)=i, s(2)=i,...,s(n)=i, s(n+1) \neq i\big) = A(i,i)^{n-1}\big(1-A(i,i)\big)$$

In English, this means we take the probability of staying in the same state, $A(i,i)$, raised to the power $n-1$ (it is $n-1$ rather than $n$ because we are given that we start in that state, so there is no transition associated with the first time step), and multiply it by $1 - A(i,i)$, the probability of transitioning out of that state. This leaves us with an expected value for $n$ of:

$$E(n) = \sum_n np(n) = \sum_{n=1}^{\infty} nA(i,i)^{n-1}(1-A(i,i))$$

Note that in the above equation, $p(n)$ is the probability that we see state $i$ for $n-1$ additional steps after starting from $i$, and then see a state that is not $i$. Also, the expected value of $n$ is the sum over all possible values of $n$ of $n$ times $p(n)$.

2.1 Expected $n$

So, we can now expand this function and calculate the two sums separately.

$$E(n) = \sum_{n=1}^{\infty}nA(i,i)^{n-1}(1 - A(i,i)) = \sum_{n=1}^{\infty} nA(i, i)^{n-1} - \sum_{n=1}^{\infty} nA(i,i)^n$$

First Sum
With our first sum, writing $a = A(i,i)$ for brevity, we can say that:

$$S = \sum_{n=1}^{\infty} na^{n-1}$$
$$S = 1 + 2a + 3a^2 + 4a^3+ ...$$

And we can then multiply that sum, $S$, by $a$, to get:

$$aS = a + 2a^2 + 3a^3 + 4a^4+...$$

And then we can subtract $aS$ from $S$:

$$S - aS = S'= 1 + a + a^2 + a^3+...$$

This $S'$ is another infinite sum, but it is one that is much easier to solve!

$$S'= 1 + a + a^2 + a^3+...$$

And then $aS'$ is:

$$aS' = a + a^2 + a^3 + a^4 + ...$$

Which, when we then do $S' - aS'$, we end up with:

$$S' - aS' = 1$$
$$S' = \frac{1}{1 - a}$$

And if we then substitute that value in for $S'$ above:

$$S - aS = S'= 1 + a + a^2 + a^3+... = \frac{1}{1 - a}$$
$$S - aS = \frac{1}{1 - a}$$
$$S = \frac{1}{(1 - a)^2}$$

Second Sum
We can now look at our second sum:

$$S = \sum_{n=1}^{\infty} na^n$$
$$S = 1a + 2a^2 + 3a^3 +...$$
$$aS = 1a^2 + 2a^3 +...$$
$$S - aS = S' = a + a^2 + a^3 + ...$$
$$aS' = a^2 + a^3 + a^4 +...$$
$$S' - aS' = a$$
$$S' = \frac{a}{1 - a}$$

And we can plug back in $S'$ to get:

$$S - aS = \frac{a}{1 - a}$$
$$S = \frac{a}{(1 - a)^2}$$

Combine
We can now combine these two sums as follows:

$$E(n) = \frac{1}{(1 - a)^2} - \frac{a}{(1-a)^2}$$
$$E(n) = \frac{1}{1-a}$$

Calculate Number of Sick Days
So, how do we calculate the expected number of consecutive sick days? With $a = A(sick, sick) = 0.8$, that is just:

$$\frac{1}{1 - 0.8} = 5$$
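As a sanity check, here is a small simulation (a sketch of my own, not part of the original derivation) that samples sick streaks from the model and compares the average streak length to the closed-form answer:

import numpy as np

np.random.seed(0)
a = 0.8  # A(sick, sick)
streaks = []
for _ in range(100000):
    n = 1  # we are given that the streak starts with one sick day
    while np.random.random() < a:  # stay sick with probability a
        n += 1
    streaks.append(n)
print(np.mean(streaks))  # should be close to 1 / (1 - a) = 5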

3. SEO and Bounce Rate Optimization

We are now going to look at SEO and bounce rate optimization. This is a problem that every developer and website owner can relate to. You have a website, and obviously you would like to increase traffic, increase conversions, and avoid a high bounce rate (which could lead to Google assigning your page a low ranking). What would a good way of modeling this data be? Without even looking at any code, we can look at some examples of things that we want to know, and how they relate to Markov models.

3.1 Arrival

First and foremost, how do people arrive on your page? Is it your home page? Your landing page? Well, this is just the very first page of what is hopefully a sequence of pages. So, the Markov analogy here is that this is just the initial state distribution, $\pi$. Once we have our Markov model, the $\pi$ vector will tell us which of our pages a user is most likely to start on.

3.2 Sequences of Pages

What about sequences of pages? Well, if you think people are getting to your landing page, hitting the buy button, checking out, and then closing the browser window, you can test the validity of that assumption by calculating the probability of that sequence. Of course, the probability of any particular sequence is probably going to be much less than 1, because a longer sequence involves more multiplication, and hence smaller final numbers. We do have two alternatives, however:

  • 1) You can compare the probabilities of two different sequences. So, are people going through the entire checkout process? Or is it more probable that they are just bouncing? (A sketch of this comparison appears after this list.)
  • 2) Another option is to just find the transition probabilities themselves. These are conditional probabilities instead of joint probabilities. You want to know: once a user has made it to the landing page, what is the probability of hitting buy? Then, once they have hit buy, what is the probability of completing the checkout?
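Here is a minimal sketch of the first option, assuming the transitions dictionary built in the code below (it maps (current page, next page) pairs to conditional probabilities); the page ids in the example journeys are made up for illustration:

def sequence_probability(pages, transitions):
    """Joint probability of a full user journey, starting from the start state '-1'."""
    p = 1.0
    prev = '-1'  # '-1' marks the start of a sequence in this dataset
    for page in pages:
        p *= transitions.get((prev, page), 0.)  # unseen transitions get probability 0
        prev = page
    return p

# Is a full checkout journey more probable than an immediate bounce?
# (these page ids, and what each page means, are hypothetical)
p_checkout = sequence_probability(['0', '3', '7', 'C'], transitions)
p_bounce = sequence_probability(['0', 'B'], transitions)
print(p_checkout > p_bounce)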

3.3 Bounce Rate

This is hard to measure, unless you are Google and hence have analytics on nearly every page on the web. This is because once a user has left your site, you can no longer run code on their computer or track what they are doing. However, let's pretend that we can determine this information. Once we have done this, we can measure which page has the highest bounce rate. At this point we can manually analyze that page and ask our marketing people, "What is different about this page that makes people not find it useful/want to leave?" We can then address that problem, and hopefully later analysis shows that the fixed page no longer has a high bounce rate. In the Markov model, we can represent a bounce as a null state.

3.4 Data

So, the data we are going to be working with has two columns: last_page_id and next_page_id. This can be interpreted as the current page and the next page. The site has 10 pages with the ids 0-9. We can represent start pages by making the current page -1, and the next page the actual page. We can represent the end of a sequence with two different codes: B (bounce) or C (close). In the case of bounce, the user saw the page and then immediately left. In the case of close, the user saw the page, stayed long enough to potentially find some useful information, and then closed the window. So, you can imagine that our engineer may use time on page as a factor in determining whether it is a bounce or a close.
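For illustration, a few rows of such a file might look like this (these particular rows are invented, not taken from the actual dataset): one user starts on page 8, moves to page 2, and closes the window, while another starts on page 0 and immediately bounces.

-1,8
8,2
2,C
-1,0
0,B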

In [9]:
import numpy as np
import pandas as pd
In [14]:
"""Goal here is to store start page and end page, and the count how many times that happens. After that 
we are going to turn it into a probability distribution. We can divide all transitions that start with specific
start state, by row_sum"""
transitions = {} # getting all specific transitions from start pg to end pg, tallying up # of times each occurs
row_sums = {} # start date as key -> getting number of times each starting pg occurs

# Collect our counts
for line in open('../../../data/site/site_data.csv'):
  s, e = line.rstrip().split(',') # get start and end page 
  transitions[(s, e)] = transitions.get((s, e), 0.) + 1
  row_sums[s] = row_sums.get(s, 0.) + 1 
  
# Normalize the counts so they become real probability distributions 
for k, v in transitions.items():
  s, e = k
  transitions[k] = v / row_sums[s]
  
# Calculate initial state distribution
print('Initial state distribution')
for k, v in transitions.items():
  s, e = k
  if s == '-1': # this means it is the start of the sequence. 
    print (e, v)
      
# Which page has the highest bounce rate?
for k, v in transitions.items():
  s, e = k
  if e == 'B':
    print(f'Bounce rate for {s}: {v}')
Initial state distribution
8 0.10152591025834719
2 0.09507982071813466
5 0.09779926474291183
9 0.10384247368686106
0 0.10298635241980159
6 0.09800070504104345
7 0.09971294757516241
1 0.10348995316513068
4 0.10243239159993957
3 0.09513018079266758
Bounce rate for 1: 0.125939617991374
Bounce rate for 2: 0.12649551345962112
Bounce rate for 8: 0.12529550827423167
Bounce rate for 6: 0.1208153180975911
Bounce rate for 7: 0.12371650388179314
Bounce rate for 3: 0.12743384922616077
Bounce rate for 4: 0.1255756067205974
Bounce rate for 5: 0.12369559684398065
Bounce rate for 0: 0.1279673590504451
Bounce rate for 9: 0.13176232104396302

We can see that the page with id 9 has the highest value in the initial state distribution, so we are most likely to start on that page. We can also see that page id 9 has the highest bounce rate.

4. Build a 2nd-order language model and generate phrases

So, we are now going to work with non-first-order Markov chains for a little bit. In this example we are going to try to create a language model. We will first train a model on some data to determine the distribution of a word given the previous two words. We can then use this model to generate new phrases. Note that another use of this model would be to calculate the probability of a phrase (we will sketch this at the end of the section).

The data that we are going to look at is just a collection of Robert Frost poems: a text file with all of the poems concatenated together. The first thing we are going to want to do is tokenize each line and remove punctuation. In Python 3, it will look similar to this:

def remove_punctuation(s):
    return s.translate(str.maketrans('', '', string.punctuation))

tokens = remove_punctuation(line.rstrip().lower()).split()

Once we have tokenized each line, we want to perform various counts in addition to the second-order model counts. We need to measure the initial distribution of words, or stated another way, the distribution of the first word of a sentence. We also want to know the distribution of the second word of a sentence. Neither of these has two previous words, so they are not second order. We could technically include them in the second-order measurement by using None in place of the previous words, but we won't do that here. We also want to keep track of how each sentence ends (the end-of-sentence distribution will look similar to (w(t-2), w(t-1)) -> END), so we will include a special END token for that too.

When we do this counting, what we first want to do is create an array of all possibilities. So, for example if we had two sentences:

I love dogs
I love cats

Then we could have a dictionary where the key was (I, love) and the value was an array [dogs, cats]. If "I love" was also a standalone sentence, then the value would be [dogs, cats, END]. The function below can help us with this: we first check whether there is already a value for the key, create an empty array if not, and then append the new value to the array.

def add2dict(d, k, v):
    if k not in d:
        d[k] = []
    d[k].append(v)

Once we have collected all of these arrays of possible next words, we need to turn them into probability distributions. For example, the array [cat, cat, dog] would become the dictionary {"cat": 2/3, "dog": 1/3}. Here is a function that can do this:

def list2pdict(ts):
    d = {}
    n = len(ts)
    for t in ts:
        d[t] = d.get(t, 0.) + 1 
    for t, c in d.items():
        d[t] = c / n
    return d

Next, we will need a function that can sample from this dictionary. To do this, we generate a random number between 0 and 1, accumulate the word probabilities one at a time, and return the first word at which the cumulative probability exceeds the random number. Here is a function that can do that:

def sample_word(d):
    p0 = np.random.random()
    cumulative = 0
    for t, p in d.items():
        cumulative += p
        if p0 < cumulative:
            return t
    assert(False) # should never get here

Because all of our distributions are structured as dictionaries, we can use the same function for all of them.

In [1]:
import numpy as np
import string
In [9]:
"""3 dicts. 1st store pdist for the start of a phrase, then a second word dict which stores the distributions
for the 2nd word of a sentence, and then we are going to have a dict for all second order transitions"""
initial = {}
second_word = {}
transitions = {}

def remove_punctuation(s):
  return s.translate(str.maketrans('', '', string.punctuation))

def add2dict(d, k, v): 
  """Parameters: Dictionary, Key, Value"""
  if k not in d:
    d[k] = []
  d[k].append(v)
  
# Loop through file of poems
for line in open('../../../data/poems/robert_frost.txt'):
  tokens = remove_punctuation(line.rstrip().lower()).split() # Get all tokens for specific line we are looping over
  
  T = len(tokens) # Length of sequence
  for i in range(T): # Loop through every token in sequence
    t = tokens[i] 
    if i == 0: # We are looking at first word
      initial[t] = initial.get(t, 0.) + 1 
    else:
      t_1 = tokens[i - 1]
      if i == T - 1: # Looking at last word
        add2dict(transitions, (t_1, t), 'END')
      if i == 1: # second word of sentence, hence only 1 previous word
        add2dict(second_word, t_1, t)
      else: 
        t_2 = tokens[i - 2] # Get second previous word
        add2dict(transitions, (t_2, t_1), t) # add previous and 2nd previous word as key, and current word as val
        
# Normalize the distributions
initial_total = sum(initial.values())
for t, c in initial.items():
  initial[t] = c / initial_total

# Take our list and turn it into a dictionary of probabilities
def list2pdict(ts):
  d = {}
  n = len(ts) # get total number of values
  for t in ts: # look at each token
    d[t] = d.get(t, 0.) + 1
  for t, c in d.items(): # go through dictionary, divide frequency by sum
    d[t] = c / n
  return d

for t_1, ts in second_word.items():
  second_word[t_1] = list2pdict(ts)

for k, ts in transitions.items():
  transitions[k] = list2pdict(ts)
  
def sample_word(d):
  p0 = np.random.random() # Generate random number from 0 to 1 
  cumulative = 0 # cumulative count for all probabilities seen so far
  for t, p in d.items():
    cumulative += p
    if p0 < cumulative:
      return t
  assert(False) # should never hit this
  
"""Function to generate a poem"""
def generate():
  for i in range(4):
    sentence = []
    
    # initial word
    w0 = sample_word(initial)
    sentence.append(w0)
    
    # sample second word
    w1 = sample_word(second_word[w0])
    sentence.append(w1)
    
    # sample second-order transitions, looping until we hit the END token
    while True:
      w2 = sample_word(transitions[(w0, w1)]) # sample next word given previous two words
      if w2 == 'END': 
        break
      sentence.append(w2)
      w0 = w1
      w1 = w2
    print(' '.join(sentence))
    
generate()
another from the childrens house of makebelieve
they dont go with the dead race of the lettered
i never heard of clara robinson
where he can eat off a barrel from the sense of our having been together
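As mentioned at the start of this section, another use of the model is to calculate the probability of a phrase. Here is a minimal sketch of that step (my own addition, using the initial, second_word, and transitions dictionaries trained above); working with log probabilities avoids numerical underflow on longer phrases:

import math

def safe_log(p):
    return math.log(p) if p > 0 else float('-inf')  # unseen events get -inf

def phrase_log_probability(phrase):
    tokens = remove_punctuation(phrase.rstrip().lower()).split()
    logp = safe_log(initial.get(tokens[0], 0.))  # first-word distribution
    if len(tokens) > 1:
        logp += safe_log(second_word.get(tokens[0], {}).get(tokens[1], 0.))  # second word
    for i in range(2, len(tokens)):  # second-order transitions for the rest
        logp += safe_log(transitions.get((tokens[i - 2], tokens[i - 1]), {}).get(tokens[i], 0.))
    return logp

print(phrase_log_probability('i never heard of clara robinson'))

To score a phrase as a complete line, one could also add the probability of the END token after the final two words.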

5. Google's PageRank Algorithm

Markov models were even used in Google's PageRank algorithm. The basic problem we face is:

  • We have $M$ webpages that link to each other, and we would like to assign importance scores $x(1),...,x(M)$
  • All of these scores are greater than or equal to 0
  • So, we want to assign a page rank to all of these pages

How can we go about doing this? Well, we can think of browsing the web as a sequence of pages, where the page you are currently on is the state. Where does the ranking come from? Well, the ranking actually comes from the limiting distribution: in the long run, the proportion of visits that will be spent on each page. Now, if you think "great, that is all I need to know", slow down. How can we actually do this in practice? How do we train the Markov model, and what values do we assign to the state transition matrix? And how can we ensure that the limiting distribution exists and is unique? The key insight was that we can use the link structure of the web to determine the ranking.

The main idea is that a link to a page is like a vote for its importance. So, as a first attempt we could just use a frequency count to measure the votes. Of course, that wouldn't be a valid probability distribution, so we could just divide each row by its sum to make it sum to 1. So we set:

$$A(i, j) = \begin{cases} \frac{1}{n(i)} & \text{if } i \text{ links to } j \\ 0 & \text{otherwise} \end{cases}$$

Here $n(i)$ stands for the total number of outgoing links on page $i$, and you can confirm that each row sums to $\frac{n(i)}{n(i)} = 1$, so this is a valid Markov matrix. Now, we still aren't sure whether the limiting distribution exists and is unique.
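As a quick sketch, here is how one might build such a matrix from a link structure (the adjacency data below is invented purely for illustration):

import numpy as np

M = 4  # number of pages
links = {0: {1, 2}, 1: {2}, 2: {0, 3}, 3: {0}}  # links[i] = pages that page i links to

A = np.zeros((M, M))
for i, outgoing in links.items():
    for j in outgoing:
        A[i, j] = 1. / len(outgoing)  # each of page i's n(i) links gets weight 1/n(i)

print(A.sum(axis=1))  # every row sums to 1, so A is a valid Markov matrix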

5.1 This is already a good start

Let's keep in mind that the above solution already solves a few problems. For instance, let's say you are a spammer and you want to sell 1000 links on your webpage. Well, because the transition matrix must remain a valid probability matrix, the rows must sum to 1, which means that each of your links now only has a strength of $\frac{1}{1000}$. As a smaller example, a page with 4 outgoing links would have this frequency matrix:

                 abc.com    amazon.com    facebook.com    github.com
thespammer.com      1           1              1               1

And then if we transformed that into a probability matrix it would just be each value divided by the total number of links, 4:

                 abc.com    amazon.com    facebook.com    github.com
thespammer.com     0.25        0.25           0.25            0.25

You may then think, "I will just create 1000 pages, and each of them will only have 1 link." Unfortunately, since nobody knows about those 1000 pages you just created, nobody is going to link to them, which means they are impossible to get to. So, in the limiting distribution, those states will have 0 probability because you can never reach them, and their outgoing links are worthless. Remember, the Markov chain's limiting distribution models the long-run proportion of visits to a state. So, if you never visit that state, its probability will be 0.

We still have not ensured that the limiting distribution exists and is unique.

5.2 Perron-Frobenius Theorem

How can we ensure that our model has a unique stationary distribution? This question was settled in the early 1900s by what is now known as the Perron-Frobenius Theorem, which states that:

If our transition matrix is a Markov matrix (meaning that all of the rows sum to 1, and all of the values are strictly positive, i.e. no values are 0), then the stationary distribution exists and is unique.

In fact, we can start in any initial state, and as time approaches infinity we will always end up with the same stationary distribution; therefore, this is also the limiting distribution.

So, how can we satisfy the Perron-Frobenius criterion? Let's return to the idea of smoothing, which we first talked about when discussing how to train a Markov model. The basic idea was that we can make things that were 0 non-zero, so there is always a small possibility that we can get to any state. (This might even be good news for the spammer.) So, we create a uniform transition matrix $U$, an $M \times M$ matrix in which every entry is $\frac{1}{M}$ ($M$ is the number of states). PageRank's solution was to take the matrix we had before and multiply it by 0.85, take the uniform matrix and multiply it by 0.15, and add them together to get the final PageRank matrix:

$$G = 0.85A + 0.15U$$

Now all of the elements are strictly positive, and we can convince ourselves that $G$ is still a valid Markov matrix.
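Continuing the sketch above, we can form $G$ and watch the chain converge: starting from any initial distribution, repeatedly applying $G$ settles on the same stationary distribution (again, this demo is my own, not from the original algorithm's description):

U = np.full((M, M), 1. / M)  # uniform M x M matrix
G = 0.85 * A + 0.15 * U      # every entry of G is now strictly positive

pi = np.full(M, 1. / M)  # any initial state distribution works
for _ in range(100):
    pi = pi @ G  # advance the chain one step

print(pi)  # the PageRank scores: long-run proportion of visits to each page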

