This appendix serves as an accompaniment to Hidden Markov Models with discrete observations. We will go over calculations concerning the probability of an observed sequence, the forward algorithm, the Viterbi algorithm, and the Baum-Welch algorithm.
We will start by restating common variables that will be used throughout our calculations.
We will let the number of hidden states be $M$:
$$\text{Number of hidden states} = M$$And the length of our sequence of observations be $T$:
$$\text{Length of sequence of observations} = T$$We know that the joint distribution containing both our observed symbols and hidden states is:
$$p(x,z)$$Where both $x$ and $z$ are vectors:
$$x = \big[x(1), x(2), ..., x(T)\big]$$$$z = \big[z(1), z(2), ..., z(T)\big]$$And we want to find the distribution for the sequence of observed symbols:
$$p\big(x(1), x(2),...,x(T)\big)$$This is done by marginalizing out $z$.
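Written out explicitly, marginalizing out $z$ means summing the joint distribution over every possible hidden state sequence:$$p\big(x(1), x(2),...,x(T)\big) = \sum_{z(1)=1}^{M} \sum_{z(2)=1}^{M} \cdots \sum_{z(T)=1}^{M} p(x,z)$$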
We have our initial state distribution, $\pi$, the probability of being in a hidden state when the sequence begins:
$$\pi(i) = p\big(z(1) = i\big)$$Then there is our state transition matrix, $A$, which represents the probability of going from state $i$ to state $j$:
$$A(i, j) = \text{probability of going from state i to state j}$$Finally, we have our observation matrix, $B$, which represents the probability of observing symbol $k$ while in state $j$:
$$B(j,k) = \text{probability of observing symbol k while you are in state j}$$We have gone over the equations utilized in determining the probability of a sequence, but we will now solidify that with an actual example. Here is the problem statement:
There is a magician who is flipping two biased coins, coin 1 and coin 2. We are trying to determine the probability of observing the following sequence of coin flips:
$$p\big(HHT \big)$$
$$p\big( x(1)=H, x(2)=H, x(3)=T \big)$$
To start, we know that since the magician has two coins, the number of hidden states, $M$, is 2. We also know that we can either observe a flipped coin being heads or tails, so the number of possible observed symbols is also 2. We are told that the magician really likes coin 1, so the initial state distribution is:$$\pi = \begin{bmatrix} 0.8 & 0.2 \end{bmatrix}$$
He is also very fidgety, and tends to switch between coins very often, so his state transition matrix is:$$A = \begin{bmatrix} 0.1 & 0.9 \\ 0.9 & 0.1 \end{bmatrix}$$Where row $i$ corresponds to the current coin and column $j$ to the next coin.
Finally, coin 1 is biased towards heads and has a 0.7 probability of being heads, and a 0.3 probability of being tails. Coin 2 is biased towards tails, and has a 0.6 probability of being tails, and a 0.4 probability of being heads. Hence, the observation matrix looks like:$$B = \begin{bmatrix} 0.7 & 0.3 \\ 0.4 & 0.6 \end{bmatrix}$$Where the rows correspond to coins 1 and 2, and the columns to heads and tails.
We now have all of the information needed to find the probability of the sequence $H,H,T$. Intuitively, we can think about it as follows: We must first take into account the probability that we start in a specific state, and that from that state we observed heads. We then must account for the probability that from that state we transition to another hidden state and observe heads. And finally, we must take into account that from the second hidden state we transition to another hidden state and observe tails. Let's look at each part separately.
We can write the probability of the initial state and observing heads as:
$$\pi\big(z(1)\big)p\big(x(1)=H|z(1)\big)$$Now, if we repeat this process for time steps 2 and 3, we end up with the following equation:
$$p\big(x,z \big) = \pi\big(z(1)\big)p\big(x(1)=H|z(1)\big) * p\big(z(2)|z(1)\big) * p\big(x(2)=H|z(2)\big) * p\big(z(3)|z(2)\big) * p\big(x(3)=T|z(3)\big)$$Which we can then update by utilizing our matrices $A$ and $B$:
$$p\big(x,z \big) = \pi\big(z(1)\big)B\big(z(1),x(1)=H\big) * A\big(z(1),z(2)\big) * B\big(z(2),x(2)=H\big) * A\big(z(2),z(3)\big) * B\big(z(3),x(3)=T\big)$$Now, at this point we can see that there are multiple values that $z$ can take on, and that in order to find $p(x)$ we must marginalize out $z$, like so:
$$\sum_{z(1) = 1}^{M} \sum_{z(2)=1}^{M} \sum_{z(3)=1}^{M}\pi\big(z(1)\big)B\big(z(1),x(1)=H\big) * A\big(z(1),z(2)\big) * B\big(z(2),x(2)=H\big) * A\big(z(2),z(3)\big) * B\big(z(3),x(3)=T\big)$$We can see that since $M = 2$ and we have $T = 3$ observations, there are going to be $M^T = 2^3 = 8$ separate terms in the summation that need to be evaluated in order to marginalize out $z$. That would be very messy to write out, but we can calculate it in code below:
import numpy as np
# Initial probability distribution, transition matrix, and emission matrix
pi = np.array([0.8, 0.2])
A = np.array([[0.1, 0.9],[0.9, 0.1]])
B = np.array([[0.7, 0.3],[0.4, 0.6]])
x = np.array([0,0,1]) # 0 for heads, 1 for tails, based on numpy indexing
operations_per_iteration = 5
# Function to calculate the probability of the observed sequence for a given z1, z2, z3
def sequence_probability_with_z(z1, z2, z3):
    return pi[z1]*B[z1, x[0]] * A[z1, z2]*B[z2, x[1]] * A[z2, z3]*B[z3, x[2]]
# Initial marginalized sequence probability and number of iterations performed
marginalized_sequence_prob = 0
iterations = 0
# Calculate marginalized sequence probability: p(x1, x2, x3) -> p(H, H, T)
Z = [0,1] # Possible hidden states (coin 1 and coin 2)
for z1 in Z:
    for z2 in Z:
        for z3 in Z:
            iterations += 1
            marginalized_sequence_prob += sequence_probability_with_z(z1, z2, z3)
print('Sequence Probability: ', marginalized_sequence_prob)
print('Total Operations: ', iterations * operations_per_iteration)
And we finally arrive at our sequence probability:
$$p(H,H,T) \approx 0.1187$$Note that in order to achieve this we needed to take the summation of 8 different terms, each consisting of $2T-1 = 5$ multiplications. From this, we know that the time complexity of this algorithm is $O(TM^T)$. Visually, this can be seen clearly below:
By simply counting operations in the above visualization, we can see that there are 5 internal operations that must be performed per term. In our case, $T=3$, and hence:
$$2T-1 = 5$$The summation goes through all of our states at each time step, and since $M=2$, we have:
$$M^T = 2 ^ 3 = 2*2*2 = 8$$Hence, our total number of operations for our example is:
$$(2T-1)M^T = 5 * 8 = 40$$At this point we are ready to move on to an algorithm that will help reduce the exponential complexity that we are dealing with above. This algorithm, the Forward-Backward algorithm, works by defining a variable called $\alpha$, which represents:
The joint probability of seeing the sequence you have observed up until now and being in a specific state at that time.
Mathematically, it is written as:
$$\alpha(t,i) = p \big(x(1),...,x(t),z(t)=i\big)$$We can see that $\alpha$ is indexed by both time and $i$, the state.
The first step of the Forward-Backward algorithm is to calculate the initial value of $\alpha$ at $t=1$:
$$\alpha(1, i) = p \big(x(1), z(1)=i\big)$$$$\alpha(1, i) = p\big(z(1) = i \big) p\big(x(1) \mid z(1)= i\big)$$Where, we know that:
$$\pi_i = p\big(z(1) = i \big)$$And also that:
$$p\big(x(1) \mid z(1)= i\big) = B\big(i, x(1)\big)$$So, we can rewrite our equation for $\alpha(1,i)$ as:
$$\alpha(1,i) = \pi_iB\big(i, x(1)\big)$$Now, in terms of our example, we know that we have $M=2$ states. Hence, we need to find $\alpha$ for each starting state (coin 1 and coin 2):
$$\alpha(1,1) = \alpha(1, \text{coin 1}) = \pi_1B\big(z(1)=1, x(1)\big)$$$$\alpha(1,2) = \alpha(1, \text{coin 2})= \pi_2B\big(z(1)=2, x(1)\big)$$The first thing that I want you to take note of is that the variable $\alpha$ will only be updated for specific values. Currently it only has two values: that for $t=1$ and coin 1, and that for $t=1$ and coin 2. From an implementation standpoint, $\alpha$ will be a 2 dimensional matrix:
alpha = np.zeros((T, M))
The second key point is that the entire goal of this process is to find $p\big(x(1)\big)$. That may not be clear, or you may wonder why we have introduced this new variable $\alpha$. To answer that, let's first write the equation for $p\big(x(1)\big)$:
$$p\big(x(1)\big) = p\big(z(1)=1\big) p\big(x(1) \mid z(1) =1 \big) +...+p\big(z(1)=M\big) p\big(x(1) \mid z(1) =M \big)$$$$p\big(x(1)\big) = \pi_1 B \big(1, x(1) \big) + \pi_2 B \big(2, x(1) \big)+ ... +\pi_M B \big(M, x(1) \big)$$But wait! We can see that this is our definition for $\alpha(1,i)$! If we sum $\alpha$ over the state $i$ when $t=1$, we end up with $p\big(x(1)\big)$. This leads us to the critical intuition:
$\alpha$ is a way to successively keep track of the probability of a sequence.
By creating $\alpha$, we can keep track of the probability of a sequence up to a current point in time, while also taking advantage of certain properties of probability that allow us to reduce the total number of operations we need to perform (aka, we can reduce the time complexity).
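As a quick numerical sanity check (using the $\pi$ and $B$ values from our magician example, and the same 0 = heads encoding as the code in this appendix), we can confirm that summing $\alpha(1, i)$ over the states recovers $p\big(x(1)=H\big)$:

import numpy as np
# pi and B from the magician example above
pi = np.array([0.8, 0.2])
B = np.array([[0.7, 0.3], [0.4, 0.6]])
x1 = 0  # heads, using the 0 = heads encoding from the earlier code
# alpha(1, i) = pi_i * B(i, x(1)) for each state i
alpha_1 = pi * B[:, x1]
# Summing over the states i marginalizes out z(1), leaving p(x(1) = H)
print('p(x(1) = H): ', alpha_1.sum())  # 0.8*0.7 + 0.2*0.4 = 0.64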
At this point, we come to the induction step. We are now trying to find $p\big(x(1), x(2)\big)$. We know that the induction step is an update of $\alpha$ defined as:
$$\alpha(t+1, j) = \sum_{i=1}^M \alpha(t,i) A(i,j)B(j, x(t+1))$$Note that the summation is over $i$, the state at time $t$; if we wanted to find $\alpha(t+1, j)$ for all values that $j$ can take on, we would simply evaluate this equation once for each $j$. With that said, intuitively we can think of the induction step as follows (in relation to our magician example): At $t=1$ the magician is holding one of the coins (we are not sure which) and we observe him flip a heads. We then go to $t=2$, he shuffles the coins behind his back, and then flips one of the coins, where we again observe a heads. What we want to know is: what is the probability of that happening?
In order to determine that we must consider that the magician could transition from coin 1 or 2 at $t=1$, to coin 1 or 2 at $t=2$, where we then must determine the probability of observing $x(2)=heads$. This can be written as:
$$\pi_1A(1,1)B(1, x(2)=heads)+ \pi_2A(2,1)B(1, x(2)=heads)+\pi_1A(1,2)B(2, x(2)=heads)+ \pi_2A(2,2)B(2, x(2)=heads)$$We can include $B$ for $t=1$ as well:$$\pi_1B(1, x(1)=heads)A(1,1)B(1, x(2)=heads)+ \pi_2B(2, x(1)=heads)A(2,1)B(1, x(2)=heads)+\pi_1B(1, x(1)=heads)A(1,2)B(2, x(2)=heads)+ \pi_2B(2, x(1)=heads)A(2,2)B(2, x(2)=heads)$$
Ahah! This is just $\alpha$ at time $t=2$! Note, this is technically the sum of $\alpha$ at $t=2$, over $j$ states. We can represent $\alpha$ at $t=2$ for each individual state below:
$$\alpha(2, j) = \sum_{i=1}^M \alpha(1,i) A(i,j)B\big(j, x(2)\big)$$$$\alpha(2, j=1) = \sum_{i=1}^M \alpha(1,i) A(i,1)B\big(1, x(2)\big)$$$$\alpha(2, j=1) = \pi_1B(1, x(1)=heads)A(1,1)B(1, x(2)=heads)+ \pi_2B(2, x(1)=heads)A(2,1)B(1, x(2)=heads)$$$$\alpha(2, j=2) = \sum_{i=1}^M \alpha(1,i) A(i,2)B\big(2, x(2)\big)$$$$\alpha(2, j=2) = \pi_1B(1, x(1)=heads)A(1,2)B(2, x(2)=heads)+ \pi_2B(2, x(1)=heads)A(2,2)B(2, x(2)=heads)$$Again, in order to find the total probability of the sequence $p\big(x(1), x(2)\big)$, we would want to marginalize out the hidden state, in this case indexed by $j$:
$$p\big(x(1), x(2)\big) = \sum_{j=1..M} \alpha(t=2, j)$$However, we do not need to perform that step just yet, since we still have one more observed symbol to consider.
We finally reach the point where we are trying to calculate $p\big(x(1), x(2), x(3)\big)$. In order to do this we must perform the exact same set of steps as above, calculating $\alpha(t=3, j)$ for each hidden state $j$. The only difference is that now we must take the summation over $\alpha(t=3, j)$, for all $j$. This gives us our final sequence probability:
$$p\big(x(1), x(2), x(3)\big) = \sum_{j=1..M} \alpha(t=3, j)$$
# Initialize M states, T time steps, and alpha forward variable
M = 2
T = 3
alpha = np.zeros((T, M))
# Initial value step
operations_initial = 0
for i in range(M):
    alpha[0,i] = pi[i]*B[i, x[0]]
    operations_initial += 1
# Induction Step
iterations = 0
operations_per_iteration = 2
for t in range(1, T):
    for j in range(M):
        for i in range(M):
            alpha[t, j] += alpha[t-1, i]*A[i,j]*B[j,x[t]]
            iterations += 1
# Termination Step: sum the last row of alpha over the states j
sequence_probability = alpha[-1].sum()
print('Sequence Probability: ', sequence_probability)
print('Total Operations: ', iterations * operations_per_iteration + operations_initial + 1)
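Note that this sequence probability, 0.118656, matches the brute-force marginalization from earlier (up to floating point rounding), as it must; the forward algorithm is simply a cheaper way of organizing the very same summation.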
We can then vectorize the above process as follows; first, let's get rid of the inner loop over $i$:
alpha = np.zeros((T, M))
# Initial value step - vectorized
alpha[0] = pi * B[:, x[0]].reshape(1,2)
# Induction Step - loop over i has been vectorized
for t in range(1, T):
    for j in range(M):
        alpha[t, j] = alpha[t-1, :].dot(A[:,j])*B[j,x[t]]
sequence_probability = alpha[-1].sum()
print('Sequence Probability: ', sequence_probability)
And we can then vectorize out the loop over $j$:
# Initialize alpha forward variable
alpha = np.zeros((T, M))
# Initial Value Step
alpha[0] = pi * B[:, x[0]].reshape(1,2) # Explicitly calling reshape to ensure we have a
                                        # (1 x 2) vector that will be multiplied element
                                        # wise with our pi state distribution
# Induction Step - both loops over i and j have been removed
for t in range(1, T):
    alpha[t] = alpha[t-1].dot(A) * B[:, x[t]]
sequence_probability = alpha[-1].sum()
print('Sequence Probability: ', sequence_probability)
Let's quickly determine our overall number of operations for the forward algorithm in our example. In order to find $\alpha(t=1, 1)$ we need to perform 1 multiplication, and the same goes for $\alpha(t=1, 2)$. So our initial value step requires two operations.
Then, we have our induction step. Here we are trying to find $\alpha(t=2, 1)$ and $\alpha(t=2, 2)$. Each requires 4 operations, so a total of 8.
We then compute $\alpha$ at $t=3$, which will again require 8 operations, plus an additional operation to take the summation of $\alpha$ at $t=3$ over $j$ (the termination step). So that is 9 operations.
We end up with a total of:
$$2 + 8 + 9 = 19 \text{ operations}$$This matches what we found above via code. From a big O standpoint, the initial 2 operations are a constant, as is the final summation. That means that we are dealing with the induction step's complexity of $M^2$ at each time step (except the first); so $T-1$ time steps. Again, the $-1$ is a constant and will be removed, leaving us with a complexity of $O(M^2T)$. Note, following this we would have expected to get $2^2*3 = 12 \text{ operations}$, yet we came up with 19. Why? This is because we were factoring in the number of operations inside of each summation, which is a constant $2$ operations. In other words, no matter how $T$ or $M$ change, it will always be $2$ operations:
$$\alpha(t,i) A(i,j)B(j, x(t+1))$$Where the first is $\alpha*A$ and the second is $(\alpha*A) * B$. Because these $2$ operations are a constant, they are dropped from the computational complexity.
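To make the gap between $O(TM^T)$ and $O(M^2T)$ concrete, here is a small illustrative comparison (not part of the original derivation) of the two operation counts as $T$ grows, with $M$ fixed at 2:

# Compare brute-force vs. forward algorithm operation counts as T grows
M = 2
for T_demo in [3, 5, 10, 20]:
    brute_force_ops = (2*T_demo - 1) * M**T_demo  # O(T M^T)
    forward_ops = (M**2) * T_demo                 # O(M^2 T), ignoring constants
    print('T =', T_demo, '| brute force:', brute_force_ops, '| forward:', forward_ops)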
So far, the question that we have been asking is: Given a sequence of observed symbols, what is the sequence's probability? Another question that we are likely to want to ask is:
What is the sequence of hidden states, given the observation?
This is where the Viterbi algorithm comes in; it calculates the most probable sequence of hidden states, given the observed sequence, under the current model. Now, we will be sure to highlight that the Viterbi algorithm works in a very similar manner to the forward algorithm. Whereas in the forward algorithm, at each induction step we would take the sum over the previous row of $\alpha$ (in effect marginalizing over all states of $z$), we are now going to take the max of the row.
To do this, we will create two new variables:
$$\delta(t,i) = \max_{z(1),...,z(t-1)} \Big \{p\big(z(1),...,z(t-1), z(t)=i, x(1),...,x(t) \big) \Big \}$$And $\psi(t,i)$, which acts as a backpointer, recording which state at time $t-1$ achieved that maximum. As with the forward algorithm, the first step that we will take is to find the initial values of our new variables.
$$\delta(1,i) = \pi_i B\big(i, x(1)\big)$$$$\psi(1,i) = 0$$Note that the initialization of $\delta$ is identical to that of $\alpha$. In our magician example, we would again have $M=2$ states, so we need to find $\delta(1, i = 1)$ and $\delta(1, i = 2)$.
$$\delta(1,i=1) = \pi_1 B\big(1, x(1)\big)$$$$\delta(1,i=2) = \pi_2 B\big(2, x(1)\big)$$$$\psi(1,i = 1) = 0$$$$\psi(1,i = 2) = 0$$Visually this can be seen below:
We can see that there are two paths that can be followed (in yellow and purple). The first represents the probability of starting with coin 1 and then observing heads ($\delta(1,i=1)$), while the second represents the probability of starting with coin 2 and observing heads ($\delta(1,i=2)$). Note that we do not need to take the maximum since there is only one potential option for each state $i$.
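Plugging in the numbers from our example (the same values as the forward algorithm's initialization):$$\delta(1,1) = 0.8 \cdot 0.7 = 0.56$$$$\delta(1,2) = 0.2 \cdot 0.4 = 0.08$$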
We now have the recursion step where we update our variables for all times and states:
$$\delta(t, j) = \max_{1 \leq i \leq M} \big \{ \delta(t-1, i) A(i,j)\big \} B\big(j, x(t) \big)$$$$\psi(t,j) = \operatorname{argmax}_{1 \leq i \leq M} \big \{ \delta(t-1, i) A(i,j)\big \}$$In our case, we would first consider time step $t=2$. Visually, we can see the scenario below:
We can see that for $z(2) = 1$, in other words for the second time step when the state is coin 1, there are two ways we can get there. We can come from coin 1 ($\delta(1,1)$), or from coin 2 ($\delta(1,2)$). These paths are represented by the yellow arrows. The goal of this step is to determine which incoming path (yellow arrow) has a higher probability. In order to determine that we need to:
Take into account the probability of being at the origin of each yellow arrow. Mathematically, that means taking into account $\delta(1,1)$ and $\delta(1,2)$.
Take into account the probability of transitioning from the state $i$, to the state $j$. Mathematically, we will use the transition matrix, $A(i,j)$, for this.
So, we can state our goal as follows:
For each state in the current time step, find the incoming state, $i$, that maximizes $\delta(t-1, i) A(i,j)$.
We can go through this for $t=2$ and state $j = 1$. In order to find $\delta(2,1)$, we can perform the following calculation:
$$\delta(2,1) = max \Big \{ \big(\delta(1, 1) A(1,1)\big), \big(\delta(1, 2) A(2,1)\big) \Big \} B \big(1, x(2) = heads\big)$$And in order to find $\psi(t, j)$:
$$\psi(2, 1) = \operatorname{argmax}_i \Big \{ \big(\delta(1, i=1) A(i=1,1)\big), \big(\delta(1, i=2) A(i=2,1)\big) \Big \}$$A way to intuitively think of $\psi$ is as follows. Let's say for a second that we have $\psi(t=2, j =1)$ and $\psi(t=2, j =2)$. Each records which coin we most likely came from at $t=1$ in order to arrive at coin 1 or coin 2, respectively, at $t=2$.
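Plugging in our example's numbers as a check:$$\delta(2,1) = \max \big\{0.56 \cdot 0.1, \; 0.08 \cdot 0.9 \big\} \cdot 0.7 = 0.072 \cdot 0.7 = 0.0504$$$$\psi(2,1) = 2$$since the larger of the two incoming terms came from coin 2.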
For the best state sequence, in the final best state, we can perform the following calculation:
$$z(T)^* = \operatorname{argmax}_{1 \leq i \leq M} \delta(T, i)$$To determine the rest of the best state sequence, we just need to backtrack using $\psi$. This is done for time $T-1$ all the way down to 1:
$$z(t)^* = \psi (t+1, z(t+1)^*)$$The idea behind this is that we know that taking the argmax over our $\delta$ with respect to $i$ will give us the final state. Once we know that, recall that our variable $\psi$ is designed to keep track of what coin we most likely came from in order to get to the current one. In other words, once we know our last state (by taking the argmax of $\delta$), we can use $\psi$ in order to find what coin we most likely transitioned from to get there. That will be our coin for the previous time step, at which point we can use $\psi$ again! This is what is meant by backtracking.
# Initialize variables to zeros matrices
delta = np.zeros((T, M))
psi = np.zeros((T, M))
# Initial value step
for i in range(M):
    delta[0, i] = pi[i]*B[i, x[0]]
    psi[0,i] = 0
# Recursion Step
for t in range(1, T):
    for j in range(M):
        current = np.array([])
        for i in range(M):
            current = np.append(current, delta[t-1, i]*A[i, j])
        currentMax = current.max()
        currentArgMax = current.argmax()
        delta[t, j] = currentMax * B[j, x[t]]
        psi[t, j] = currentArgMax
print('delta: \n', delta)
print('psi: \n', psi, '\n')
# Back Track
states = np.zeros(T, dtype=np.int32)
states[T-1] = np.argmax(delta[T-1])
for t in range(T-2, -1, -1):
    states[t] = psi[t+1, states[t+1]]
print('Hidden State Sequence (by index): \n', states, '\n')
# Map states to coin 1 and 2
print('Hidden State Sequence (by coin label): ',
      list(map(lambda state: 'coin 1' if state == 0 else 'coin 2', states))
)
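For our particular $H, H, T$ sequence, working the numbers through (or running the code above) gives the path coin 1 → coin 2 → coin 1: the backtrack starts at coin 1 (the argmax of $\delta$ at $t=3$), and $\psi$ then points back to coin 2 at $t=2$ and coin 1 at $t=1$, consistent with our fidgety magician switching coins at nearly every flip.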
Of the techniques we have gone through so far, the Baum-Welch algorithm is by far the most challenging. It is used to help us find our transition matrix, $A$, and emission matrix, $B$, when they are not known. We will need to be very careful and make sure to aggressively question our intuitions about what is actually going on under the hood. It begins by performing the forward-backward algorithm.
We went through the forward algorithm in section 2, but we have not touched on the backward algorithm yet. To jog your memory, the backward algorithm is used to find the backward variable $\beta$, which is defined as:
Given that you are currently in state $i$, what is the probability that the rest of the observed sequence plays out?
Mathematically, this is:$$\beta(t, i) = p\big(x(t+1), ..., x(T) \mid z(t)=i\big)$$It has two main steps, an initialization step:$$\beta(T, i) = 1 $$And an induction step that is carried out back to time $t=1$ (or 0, based on indexing):
$$\beta(t, i) = \sum_{j=1}^M A(i,j)B\big(j, x(t+1)\big) \beta(t+1, j)$$We can begin by calculating our forward variable, $\alpha$, for our magician example:
pi = np.array([0.8, 0.2])
A = np.array([[0.1, 0.9],[0.9, 0.1]])
B = np.array([[0.7, 0.3],[0.4, 0.6]])
x = np.array([0,0,1]) # 0 for heads, 1 for tails, based on numpy indexing
# Initialize M states, T time steps, and alpha/beta variables
M = 2
T = 3
alpha = np.zeros((T, M))
beta = np.zeros((T, M))
# ---- Calculate alpha ----
# Initial value step
operations_initial = 0
for i in range(M):
    alpha[0,i] = pi[i]*B[i, x[0]]
    operations_initial += 1
# Induction Step
iterations = 0
operations_per_iteration = 2
for t in range(1, T):
    for j in range(M):
        for i in range(M):
            alpha[t, j] += alpha[t-1, i]*A[i,j]*B[j,x[t]]
            iterations += 1
And then we can calculate the backward variable $\beta$.
In our scenario regarding the magician, $\beta(t=2, 1)$ can be thought of as: Given the magician is holding coin 1 at time $t=2$, what is the probability of observing the remainder of our sequence (in this case just $tails$ at time $t=3$).
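As a quick hand-check of this interpretation (a side calculation, using the $A$ and $B$ values from our example): given coin 1 at $t=2$, the magician either keeps coin 1 or switches to coin 2, and we then need to observe tails at $t=3$:

# beta(2, coin 1) = A(1,1)B(1,T) + A(1,2)B(2,T), with beta(3, j) = 1
beta_2_coin1 = A[0,0]*B[0,1] + A[0,1]*B[1,1]
print('beta(t=2, coin 1): ', beta_2_coin1)  # 0.1*0.3 + 0.9*0.6 = 0.57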
# ---- Calculate beta ----
# Initial value step: beta at the final time step is 1
for i in range(M):
    beta[T-1,i] = 1
# Induction Step: work backwards from T-2 down to 0
for t in range(T-2, -1, -1):
    for i in range(M):
        for j in range(M):
            beta[t, i] += A[i,j] * B[j, x[t+1]] * beta[t+1, j]
print('Value of Alpha after forward algorithm: \n', alpha, '\n')
print('Value of Beta after Backward algorithm: \n', beta)
Okay, we have now solved for our forward and backward variables, meaning it is time to introduce $\phi$. This is where things can become a bit confusing, so we will tread carefully. $\phi$ is defined as:
Probability of being in the state $i$ at time $t$, transitioning to state $j$ at time $t + 1$, given the observation sequence.
Its exact probability is defined as:
$$\phi(t, i ,j) = p\big(z(t)=i, z(t+1)=j \mid x\big)$$Why would this help us? Well, recall that the goal of our algorithm is to find $A$ and $B$ when they are not known. $\phi$, by definition, has to do with the probability of being in a state and transitioning to another state, given an observation sequence. Intuitively we should have an inkling that this sounds useful in finding $A$ and $B$, whose definitions directly deal with probabilities of transitioning from one state to another, and observing certain symbols in the process.
If we look at the exact probability for a moment, it is important to keep in mind why there is a dependency on $x$. Well, for $z(t)=i$, we can think about our magician example. Let's say that we wanted to know the probability of our magician holding coin 1 at time $t=2$; that is clearly dependent on what led up to time $t=2$! For instance, if at time $t=1$ we were at coin 1, it is less likely to also be there at time $t=2$ (based on the fidgety nature of the magician). So, if at time $t=1$ we observed heads, which we know is more probable from coin 1, we would be less likely to think the magician was holding coin 1 at time $t=2$ (since he most likely switched to coin 2).
Now, mathematically $\phi$ is defined as:
$$\phi(t, i ,j) = \frac{\alpha(t,i)A(i,j)B\big(j, x(t+1)\big)\beta(t+1, j)}{\sum_{i=1}^M \sum_{j=1}^M \alpha(t,i)A(i,j)B\big(j, x(t+1)\big)\beta(t+1, j)}$$It may seem a bit overwhelming to go from the simple English definition from earlier to that mathematical conglomeration, composed of 4 different variables, states, symbols, etc. To help with this, first notice that the numerator and denominator are the same, except for the summations in the denominator. These serve as a normalizer to ensure that we have a valid probability; in fact, summing the numerator over all $i$ and $j$ marginalizes out both hidden states, so the denominator is just $p(x)$, the probability of the observed sequence.
Let's dissect the numerator:
$$\alpha(t,i)A(i,j)B\big(j, x(t+1)\big)\beta(t+1, j)$$There are 4 distinct terms, as seen here:
Now, let's say that we wanted to determine $\phi(t=2, i=1, j=2)$. Well, we can then look to the visualization below, where each term can be seen via the corresponding colored arrows:
The first term, $\alpha(t=2, i=1)$, represents the joint probability of seeing the sequence you have observed up until now and being in a specific state at that time. It is made up of the product of the probabilities along the purple path, plus the product along the blue path. We can clearly see that the purple and blue paths represent the different ways to end up in state 1 (coin 1) at time $t=2$. That takes care of the first portion of our definition of $\phi$: "Probability of being in the state $i$ at time $t$".
The second term, $A(i=1, j=2)$, represents the probability of transitioning from $i=1$ to $j=2$ at time $t=2$, and can be seen in orange. It accounts for the portion of our $\phi$ definition: "transitioning to state $j$".
The third term, $B\big(j=2, x(3)=T\big)$, representing the probability of observing tails at time $t=3$, can be seen in red. It accounts for the portion of our $\phi$ definition: "given the observation sequence".
The fourth term, $\beta(t=3, j=2)$, represents the probability that, if in state $j$ at time $t+1$, the rest of the observed sequence plays out. In this case, $\beta(t=3, j)$ will be 1 for either coin, based on its definition, so it is not seen in the diagram above. However, if our sequence were longer it would be present, and would play the same role on the back end of the calculation that $\alpha$ played on the front end.
I have created a second visualization with the probabilities included, if you would like to work the calculations out by hand:
Let's quickly walk through solving for $\phi(t=2, i=1, j =2)$ in code. First we will perform it without utilizing our alpha variable from earlier (instead just summing the blue and purple paths), and then we will utilize alpha:
# Probability that at time t = 2 we are in state 1 (alpha(2, 1))
probability_of_purple_path = 0.8 * 0.7 * 0.1 * 0.7
probability_of_blue_path = 0.2 * 0.4 * 0.9 * 0.7
probability_of_holding_coin_1_at_time_2 = (probability_of_purple_path +
                                           probability_of_blue_path)
# Probability that we transition to state 2
probability_transition_to_state_2 = A[0,1]
# Probability of observing tails at time t + 1 (t = 3)
probability_observing_tails_t_3 = B[1,1]
# Probability of rest of sequence given that we are at coin 2 at time t = 3
probability_of_remaining_sequence_given_state_2_t_3 = 1
numerator = (probability_of_holding_coin_1_at_time_2 *
             probability_transition_to_state_2 *
             probability_observing_tails_t_3 *
             probability_of_remaining_sequence_given_state_2_t_3)
print('Unnormalized Probability in Numerator (solved manually): ', numerator)
Let's now solve for the numerator using $\alpha$:
i = 0 # Coin 1, 0 based on indexing
j = 1 # Coin 2, 1 based on indexing
t = 1 # t = 2 in our example, but we need it to equal 1 here since 0 was starting index
numerator = (alpha[t,i] * A[i, j] * B[j, x[t+1]] * beta[t+1, j])
print('Unnormalized Probability in Numerator (solved via alpha): ', numerator)
And we can find the denominator which we will use to normalize our above probability:
normalizing_denominator = 0
for i in range(M):
    for j in range(M):
        normalizing_denominator += (alpha[t,i] * A[i, j] * B[j, x[t+1]] * beta[t+1, j])
print('Normalizing denominator: ', normalizing_denominator)
And finally, we can find $\phi$:
phi_t2_i1_j2 = numerator / normalizing_denominator
print('Phi at time t = 2, i = 1, j = 2: ', phi_t2_i1_j2)
Alright, now we just found one entry for $\phi$ at a specific $i$ and $j$ transition, at a certain $t$. The next step is to determine all values of $\phi$ for our given sequence (remember, we have been dealing with a specific sequence of observed symbols thus far: Heads, Heads, Tails).
phi = np.zeros((T, M, M))
for t in range(T-1):
    for i in range(M):
        for j in range(M):
            numerator = (alpha[t,i] * A[i, j] * B[j, x[t+1]] * beta[t+1, j])
            # The normalizer computed above equals p(x) and is the same
            # for every t, so it can be reused here
            phi[t,i,j] = numerator / normalizing_denominator
print('phi based on our single Heads, Heads, Tails sequence: \n', phi)
Great, we have found $\phi$ for all values of $t$, $i$, and $j$. This is a good time to introduce the next new variable: $\gamma$. It is defined as:
$$\gamma(t,i) = \sum_{j=1}^M \phi(t,i,j)$$So, $\gamma$ is just the $\phi$ probability, marginalized over $j$. In other words:
$$\gamma(t,i) = p \big(z(t) = i \mid x\big)$$What is very important to realize is that $\gamma$ is a representation of being in (and hence transitioning out of) state $i$ at each time $t$, while $\phi$ is a representation of the transitions from state $i$ to state $j$ at each time $t$.
Let us solve for $\gamma$ now:
gamma = np.zeros((T, M))
for t in range(T-1):
    for i in range(M):
        for j in range(M):
            gamma[t,i] += phi[t,i,j]
print('gamma based on our single Heads, Heads, Tails sequence: \n', gamma)
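As a sanity check, note that $\gamma$ can also be computed directly from the forward and backward variables via the identity $\gamma(t,i) = \alpha(t,i)\beta(t,i)/p(x)$ (an identity not used in the loop above, but one that follows from the definition of $\phi$):

# gamma via alpha and beta directly; p(x) is the sequence probability
p_x = alpha[-1].sum()
gamma_check = alpha * beta / p_x
print('gamma via alpha*beta/p(x): \n', gamma_check)

For $t=1$ and $t=2$ these entries match the $\gamma$ above; the last row additionally gives $\gamma$ at $t=T$, which the loop above leaves at zero.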
We now have our variables $\phi$ and $\gamma$, and are ready for a big "Ahah!" moment. The key to performing the steps up to this point lies in the fact that when we take the sum of $\phi$ and $\gamma$ over time (marginalize out time), $\phi$ represents the expected number of transitions from state $i$ to $j$:
$$\sum_{t=1}^{T-1}\phi(t,i,j) = E\big( \text{number of transitions from state i to state j}\big)$$And $\gamma$ represents the expected number of transitions from state $i$:
$$\sum_{t=1}^{T-1}\gamma(t,i) = E\big( \text{number of transitions from state i}\big)$$Once we have the expected number of transitions from $i$ to $j$, and the expected number of transitions from $i$, we can use that to find an updated $A$!
So, first let us marginalize our $\phi$ and $\gamma$ over $t$:
# Marginalize phi over t
phi_marginalized = np.zeros((M, M))
for i in range(M):
    for j in range(M):
        for t in range(T-1):
            phi_marginalized[i,j] += phi[t, i, j]
print('phi after marginalizing over t: \n', phi_marginalized)
# Marginalize gamma over t
gamma_marginalized = np.zeros((M))
for i in range(M):
    for t in range(T-1):
        gamma_marginalized[i] += gamma[t, i]
print('gamma after marginalizing over t: \n', gamma_marginalized)
Now, to be totally clear about what the above marginalized matrices represent, let's quickly look at the annotated version below:
$$\text{marginalized }\phi(i, j) = \text{state i}\left\{ \overbrace{ \begin{bmatrix} 0.21 & 0.96\\ 0.708 & 0.112 \end{bmatrix}}^\text{state j} \right. $$$$\text{marginalized }\gamma(i) = \overbrace{ \begin{bmatrix} 1.179 & 0.82 \end{bmatrix}}^\text{state i} $$We can see above that the marginalized $\phi$ has rows corresponding to $i$, and columns to $j$, where the entry $i,j$ represents a transition from state $i$ to state $j$.
For $\gamma$, it is a one dimensional array, where the first entry corresponds to transitioning from state $i = 1$ (coin 1), and the second transitioning from state $i=2$ (coin 2).
What we can do next is perform our update equation for $A$, which is defined as the expected number of transitions from $i$ to $j$, divided by the expected number of transitions from $i$:
$$A(i,j) = \frac{\sum_{t=1}^{T-1} \phi(t,i,j)}{\sum_{t=1}^{T-1}\gamma(t,i)}$$In other words, we want to divide the first row of marginalized $\phi$ (representing transitions from $i = \text{coin 1}$ to each $j$) by the first entry in marginalized $\gamma$ (representing transitions from $i = \text{coin 1}$). We then repeat the same thing for the second row of marginalized $\phi$, dividing it by the second entry of marginalized $\gamma$. This leaves us with an updated $A$:
A_update = phi_marginalized.copy() # Copy so that phi_marginalized is not modified in place
A_update[0, :] = A_update[0, :] / gamma_marginalized[0]
A_update[1, :] = A_update[1, :] / gamma_marginalized[1]
print('Updated Transition matrix A: \n', A_update)
We can be confident that this is a valid distribution, as each row sums to 1. Notice that it is similar to our original matrix (the ground truth $A$ that we started with). However, it is not the same; why is that? Well, this updated $A$ above was learned from one sequence (heads, heads, tails), that only consisted of 3 coin flips. That is a very small amount of data to learn from. The next steps would be to account for many sequences, ideally of slightly greater length, as sketched below.
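As a rough sketch of how those next steps might look (the helper below and its inputs are hypothetical, not code from this appendix), the usual approach is to accumulate the $\phi$ and $\gamma$ statistics across all sequences before dividing:

import numpy as np
# Hypothetical sketch: all_phi and all_gamma stand in for the phi and
# gamma arrays computed per sequence, exactly as done above for the
# single H, H, T sequence.
def update_A_from_sequences(all_phi, all_gamma, M):
    # all_phi: list of (T-1, M, M) phi arrays, one per observed sequence
    # all_gamma: list of (T-1, M) gamma arrays, one per observed sequence
    numerator = np.zeros((M, M))
    denominator = np.zeros(M)
    for phi_n, gamma_n in zip(all_phi, all_gamma):
        numerator += phi_n.sum(axis=0)      # expected i -> j transitions
        denominator += gamma_n.sum(axis=0)  # expected transitions from i
    return numerator / denominator.reshape(M, 1)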
The last step remaining is to complete the update for $B$. $B(i,k)$ is just the sum of the $\gamma$'s for which $x(t) = k$, divided by the sum of all the $\gamma$'s:
$$B(i,k) = \frac{\sum_{t=1}^{T-1} \gamma(t,i) \; \text{if} \; x(t)=k, \; \text{else} \; 0}{\sum_{t=1}^{T-1}\gamma(t,i)}$$In case this is not clear, remember that:
$$\gamma(t,i) = p \big(z(t) = i \mid x\big)$$And $B$ is representing:
$$B(i,k) = p\big(x(t)=k|z(t)=i\big)$$So, since we have $\gamma$, we want to marginalize out the $t$, and incorporate the observed symbol $k$ in order to get an updated $B$. The most effective way to incorporate the observed symbol $k$ is, during the marginalization of $t$ in $\gamma$, to essentially perform a count where the value of $\gamma(t,i)$ is only included if $x(t) = k$ for that $t$.
Remember, $\gamma$ is based on our entire observation sequence, while $B$ is utilized for a specific observation. So, we need a way to marginalize out all of the observations in time, and end up with a specific probability of observing each symbol, given the state. Again, this is the definition of $B$! So, what we are saying is:
At each state $i$ in $\gamma$ (each column of $\gamma$), marginalize out the time; however, only include a specific time entry (a row of the state column we are looking at) if the symbol $k$ that we are currently calculating $B$ for was actually observed at that time $t$.
In our case, where $\gamma$ has been solved to be:
$$\gamma(t, i) = \text{time}\left\{ \overbrace{ \begin{bmatrix} 0.74 & 0.25\\ 0.43 & 0.56 \\ 0 & 0 \end{bmatrix}}^\text{state i} \right. $$If we were solving for $B(i = 1, k = 1)$ (in other words, coin 1 and heads), then the numerator of our update equation would be:
$$\gamma(1,1) + \gamma(2,1) = 0.74 + 0.43 = 1.17$$Both $\gamma$'s are included in the calculation because, for each time $t$ iterated through in the summation in the numerator ($t=1$ and $t=2$), the observation was heads, $k=1$.
However, if we had been solving for $B(i = 1, k = 2)$, we would have been summing 0 for both time steps, because the observation, heads ($k=1$), would not equal the value of the symbol index, $k = 2$:
$$\gamma(1,1) \cdot 0 + \gamma(2,1) \cdot 0 = 0$$This is a way to perform maximum likelihood estimation, in a sense. We are counting the situations where we observe a particular symbol from a particular state, and then dividing by how many times we are in that particular state (to ensure we end up with a valid probability). In other words, we are finding:
$$ \begin{pmatrix} \text{The probability of} \\ \text{observing symbol $k$} \\ \text{given you are in state $i$} \end{pmatrix} = \frac{ \begin{pmatrix} \text{The number of times we were in state $i$,} \\ \text{observed symbol $k$, times the probability} \\ \text{of being in state $i$ at that time} \end{pmatrix} } { \begin{pmatrix} \text{Expected number of times} \\ \text{we are in state $i$} \\ \end{pmatrix} } $$In code we can calculate this as follows:
B_update_numerator = np.zeros((M, len(set(x))))
K = [0,1] # Observations, 0 is heads, 1 is tails
for i in range(M):
    for k in K:
        for t in range(T-1):
            if x[t] == k:
                B_update_numerator[i,k] += gamma[t, i]
            else:
                B_update_numerator[i,k] += 0 # Just adding this in for clarity
print('Non-normalized update of B emission matrix: \n', B_update_numerator)
# Divide by normalizer
B_update = np.zeros((M, len(set(x))))
B_update[0, :] = B_update_numerator[0, :] / gamma_marginalized[0]
B_update[1, :] = B_update_numerator[1, :] / gamma_marginalized[1]
print('Updated Emission matrix B: \n', B_update)
What may seem slightly peculiar at first is the fact that we received an emission matrix where the probability of heads is 100%, and the probability of observing tails is 0%. Surely that must be wrong, considering our original $B$ matrix had non-zero probabilities for both heads and tails? Yet, on closer inspection it makes a lot of sense. Based on our one simple, small training sequence, heads was observed twice, and tails was observed on the last time step, which is not factored into the update of $B$ (we only sum up to $T-1$). Hence, as far as our update equations were concerned, they have never seen a tails emission, and the probability must then be 0. Now, this is of course a shortcoming of our example, but nevertheless we can see how sparsity may come into play if we aren't careful; in other words, zero probabilities being assigned to situations which certainly can occur, due to lack of training data.