Nathaniel Dake Blog

5. Intro to Dynamic Programming and Iterative Policy Evaluation

We are now going to start looking at solutions to MDPs. As we saw in the last section, the centerpiece of the discussion is the Bellman Equation:

$$\text{Bellman Equation} \rightarrow V_\pi(s) = \sum_a \pi(a \mid s) * \sum_{s'}\sum_r p(s',r \mid s,a) \Big \{ r + \gamma V_\pi(s') \Big \}$$

In fact, the Bellman equation can be used directly to solve for the value function. If you look carefully, you will see that this is actually a set of $|S|$ equations with $|S|$ unknowns. Moreover, it is a linear system, meaning it is not too difficult to solve. In addition, a lot of the matrix entries will be zero, since the state transitions will most likely be sparse.
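To make this concrete, here is a minimal sketch of that direct approach on a tiny, made-up 3-state MDP (not the gridworld we will build later), where we assume the policy has already been folded into a transition matrix and reward vector:

import numpy as np

# Illustrative 3-state MDP with the policy already folded in:
# P_pi[s, s'] = sum_a pi(a|s) * p(s'|s, a), and r_pi[s] is the expected
# one-step reward from state s under that policy. The numbers are made up.
P_pi = np.array([[0.0, 0.8, 0.2],
                 [0.1, 0.0, 0.9],
                 [0.0, 0.0, 1.0]])  # state 2 is absorbing
r_pi = np.array([0.2, 0.9, 0.0])
gamma = 0.9

# Bellman equation in matrix form: V = r_pi + gamma * P_pi V
# Rearranged: (I - gamma * P_pi) V = r_pi, which is a standard linear solve.
V = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
print(V)  # value of each state under this fixed policy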

However, this is not the approach we will take. Instead, we will do what is called iterative policy evaluation.

1.1 Iterative Policy Evaluation

What exactly is iterative policy evaluation? Essentially, it means that we apply the Bellman equation again and again, and eventually the value function will converge. We can write down the algorithm in pseudocode as follows:


$ \text{def iterative_policy_evaluation}(\pi)\text{:} \\ \hspace{1cm} \text{initialize V(s) = 0 for all s} \in \text{S} \\ \hspace{1cm} \text{while True:} \\ \hspace{2cm} \Delta = 0 \\ \hspace{2cm} \text{for each s} \in \text{S:} \\ \hspace{3cm} \text{old_v = V(s)} \\ \hspace{3cm} V(s) = \sum_a \pi(a \mid s) * \sum_{s'}\sum_r p(s',r \mid s,a) \Big \{ r + \gamma V(s') \Big \}\\ \hspace{3cm} \Delta = \text{max(} \Delta \text{, |V(s) - old_v|)} \\ \hspace{2cm} \text{if } \Delta \text{ < threshold: break} \\ \hspace{1cm} \text{return V(s)} $


We can see above that the input to iterative policy evaluation is a policy, $\pi$, and the output is the value function for that particular policy. It works as follows:

  • We start by initializing $V(s)$ to 0 for all states.
  • Then, in an infinite loop, we initialize a variable called $\Delta$, which represents the maximum change during the current iteration. $\Delta$ is used to determine when to quit: when it is small enough, we break out of the loop.
  • Then, for every state in the state space, we keep a copy of the old $V(s)$ and use the Bellman equation to update $V(s)$.
  • We set $\Delta$ to be the maximum change for $V(s)$ in that iteration.
  • Once this converges, we return $V(s)$.

The main point of interest in this algorithm is, of course, the part that contains the Bellman equation. Notice how, strictly speaking, the values at iteration $k+1$ depend only on the values at iteration $k$:

$$ V_{k+1}(s) = \sum_a \pi(a \mid s) * \sum_{s'}\sum_r p(s',r \mid s,a) \Big \{ r + \gamma V_{k}(s') \Big \}$$

However, this need not be the case. In fact, we can always just use our most up-to-date values of the value function for any state, updating $V$ in place. This actually ends up converging faster.
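As a rough sketch of the difference, where backup(s, V) is a stand-in for the right-hand side of the Bellman equation (an illustrative interface, not code from the gridworld implementation below):

def sweep_synchronous(V, states, backup):
  """One sweep that only reads last iteration's values: V_k -> V_{k+1}."""
  V_new = dict(V)
  for s in states:
    V_new[s] = backup(s, V)  # every update reads the old V
  return V_new

def sweep_in_place(V, states, backup):
  """One sweep that reuses values already updated earlier in this sweep."""
  for s in states:
    V[s] = backup(s, V)  # later states see the freshly updated values
  return V

The first version is the strict $V_k \rightarrow V_{k+1}$ update; the second is the in-place variant that the code later in this post actually uses.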

1.2 Definitions

A final note: we generally call the act of finding the value function for a given policy the prediction problem. Soon, we will learn an algorithm for finding the optimal policy, which is known as the control problem.

2. Gridworld in Code

In [1]:
import numpy as np
import matplotlib.pyplot as plt

"""Environment"""
class Grid: 
  def __init__(self, width, height, start):
    self.width = width
    self.height = height
    self.i = start[0] # start is a tuple of 2 integers
    self.j = start[1]
    
  def set(self, rewards, actions):
    """actions enumerate all possible actions that can take you to new state.
       actions should be a dict of: (i, j): A (row, col): list of possible actions
       rewards should be a dict of: (i, j): r (row, col): reward
    """
    self.rewards = rewards
    self.actions = actions
    
  def set_state(self, s):
    """Useful for various algorithms we will use. For example, iterative policy evaluation
    requires looping through all the states, and then doing an action to get to the next
    state. In order to know what the next state is, we have to put the agent into that 
    state, do the action, and then determine the next state. We do not automatically have
    a master list of state transitions, we must figure them out by playing the game."""
    self.i = s[0]
    self.j = s[1]
    
  def current_state(self):
    "Returns current (i,j) position of agent."
    return (self.i, self.j)
  
  def is_terminal(self, s):
    """Returns true if s is terminal state, false if not. Easy way to check this is to see
    if the state is in the action dictionary. If you can do an action from the state, then
    you can transition to a different state, and hence your state is not terminal."""
    return s not in self.actions
  
  def move(self, action):
    """Checks if action is in actions dictionary. If not, we are not able to do this move,
    so we simply stay in same position."""
    if action in self.actions[(self.i, self.j)]:
      if action == 'U':
        self.i -= 1
      elif action == 'D':
        self.i += 1
      elif action == 'R':
        self.j += 1
      elif action == 'L':
        self.j -= 1
    # Return reward (if any, default is 0)
    return self.rewards.get((self.i, self.j), 0)
  
  def undo_move(self, action):
    """Pass in the action you just took, and the environment will undo it. 
    Ex -> Just went up, it will move you back down."""
    if action == 'U':
      self.i += 1
    elif action == 'D':
      self.i -= 1
    elif action == 'R':
      self.j -= 1
    elif action == 'L':
      self.j += 1
    # Raise an exception if we arrive somewhere we shouldn't be -> Should never happen
    assert(self.current_state() in self.all_states())
    
  def game_over(self):
    "Returns true if game over, false otherwise. Only need to check if in terminal state."
    return (self.i, self.j) not in self.actions
  
  def all_states(self):
    """We can calculate all of the states simply by enumerating all of the states from 
    which we can take an action (which don't include terminal states), and all of the 
    states that return a reward (which do include terminal states). Since there may be
    some of the same states in both actions and rewards, we cast it to a set. This also
    makes the data structure O(1) for search."""
    return set(self.actions.keys()) | set(self.rewards.keys())
  
def standard_grid():
  """Returns standard grid. This is a grid that has the structure shown in section 4.
  All of the actions are defined such that we can move within the grid, but never off
  of it. We also cannot walk into the wall, nor out of the terminal state. Upper left
  is defined to be (0,0). We define rewards for arriving at each state. The grid looks
  like:
      .  .  .  1
      .  x  . -1
      s  .  .  .
  * x means you can't go there
  * s means start position
  * number means reward at that state
  """
  g = Grid(3, 4, (2, 0))
  rewards = {(0, 3): 1, (1, 3): -1}
  actions = {
    (0, 0): ('D', 'R'),
    (0, 1): ('L', 'R'),
    (0, 2): ('L', 'D', 'R'),
    (1, 0): ('U', 'D'),
    (1, 2): ('U', 'D', 'R'),
    (2, 0): ('U', 'R'),
    (2, 1): ('L', 'R'),
    (2, 2): ('L', 'R', 'U'),
    (2, 3): ('L', 'U'),
  }
  g.set(rewards, actions)
  return g

def negative_grid(step_cost=-0.1):
  """Here we want to penalize each move. This is done to prevent a robot from moving 
  randomly to solve the maze. If you only gave it a reward for solving the maze, then it
  would never learn anything beyond a random strategy. We know that we can incentivize
  the robot to solve the maze more efficiently by giving it a negative reward for each
  step taken. That is what we are doing here: incentivizing the robot to solve the maze
  efficiently, rather than moving randomly until it reaches the goal."""
  g = standard_grid()
  g.rewards.update({
    (0, 0): step_cost,
    (0, 1): step_cost,
    (0, 2): step_cost,
    (1, 0): step_cost,
    (1, 2): step_cost,
    (2, 0): step_cost,
    (2, 1): step_cost,
    (2, 2): step_cost,
    (2, 3): step_cost,
  })
  return g

3. Iterative Policy Evaluation In Code

We are now going to implement iterative policy evaluation in code. To demonstrate how we can find the value function for different policies, we are going to do iterative policy evaluation on two different policies.

The first policy we will look at is a completely random (uniform) policy.

Remember, there are two probability distributions involved in the Bellman equation:

$$\text{Policy probability: }\hspace{1cm}\pi(a \mid s)$$

$$\text{Markov State Transition Probability: }\hspace{1cm}p(s',r \mid s,a)$$

The probability that is relevant for implementing the Bellman equation here is $\pi(a \mid s)$. This is the probability that we take action $a$, given that we are in state $s$. For a uniformly random policy, this probability will be:

$$\frac{1}{\mid A(s) \mid}$$

Where $A(s)$ is the set of all possible actions to take from state $s$. In other words, our probability will be 1 divided by the total number of possible actions from state $s$.

The other probability, $p(s', r \mid s, a)$, is only relevant when the state transitions themselves are random. That is the scenario where you try to move left, but instead end up going right.

The second policy we will look at is a completely deterministic policy.

From the start position you go directly to the goal state (up, up, right, right, right). However, if you are starting from any other state, the policy is to go directly to the losing state. So, we should expect the values on the upper left path to be positive, and the other values to be negative.

Also, as a note/clarification: when performing iterative policy evaluation with random actions, the form of the Bellman equation we implement will differ slightly from the general form. The original Bellman equation starts off as:

$$\text{Bellman Equation} \rightarrow V_\pi(s) = \sum_a \pi(a \mid s) * \sum_{s'}\sum_r p(s',r \mid s,a) \Big \{ r + \gamma V_\pi(s') \Big \}$$

We can drop the $p(s', r \mid s, a)$ term since, as we stated earlier, our state transitions are not currently random. That means our value function looks like:

$$V_\pi(s) = \sum_a \pi(a \mid s) * \Big \{ r + \gamma V_\pi(s') \Big \}$$

Because $\pi(a \mid s)$ is uniform over the actions available in state $s$, it is simply equal to:

$$\pi(a \mid s) = \frac{1}{\mid A(s) \mid}$$

Hence, we can update our value equation to be:

$$V_\pi(s) = \sum_a \frac{1}{\mid A(s) \mid} * \Big \{ r + \gamma V_\pi(s') \Big \}$$

In pseudocode, that will look like:

new_v += p_a * (r + gamma * V[grid.current_state()])

Key Takeaway:

Another key thing to keep in mind is how our value function is actually represented. Generally, when we think of a function, we envision an equation of some sort that maps one domain to another, such as:

$$f(x) = x^2 + 4x + 8 $$
However, a function can be defined more generally as a rule that relates inputs to outputs in a dataset or system. This gives us an alternative way of viewing a function: as the actual set of input/output pairs it defines, or in other words, as a dataset. This is the way our value function is represented. As we converge on a final value function, we are not determining some final equation, but rather a final mapping from our input domain (our 11 states) to our output domain, the value associated with each state.

So remember:

We can view a function equivalently in two ways: either via its mathematical expression, or as a dataset consisting of a complete listing of the function's input/output pairs.
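In code, this "function as a dataset" view is literally a Python dictionary mapping each state, an (i, j) tuple, to a number. The zeros below are just initial values, not the converged result:

# The value "function" is just a lookup table from the 11 states to floats.
V = {
  (0, 0): 0.0, (0, 1): 0.0, (0, 2): 0.0, (0, 3): 0.0,
  (1, 0): 0.0,              (1, 2): 0.0, (1, 3): 0.0,
  (2, 0): 0.0, (2, 1): 0.0, (2, 2): 0.0, (2, 3): 0.0,
}
print(V[(2, 0)])  # "evaluating" the function is just a dictionary lookup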

We can now get to the code!

In [2]:
SMALL_ENOUGH = 1e-3 # Threshold for convergence

"""Functions to help us visualize policies and values."""
def print_values(V, g): 
  """Takes in values dictionary and grid, draws grid, and in each position it prints
  the value. """
  for i in range(g.width):
    print("---------------------------")
    for j in range(g.height):
      v = V.get((i,j), 0)
      if v >= 0:
        print(" %.2f|" % v, end="")
      else:
        print("%.2f|" % v, end="") # -ve sign takes up an extra space
    print("")
    
def print_policy(P, g):
  """Takes in policy and grid, draws grid, and in each position it prints
  the action from the policy. Note, this will only work for deterministic policies, 
  since we can't print more than 1 thing per location."""
  for i in range(g.width):
    print("---------------------------")
    for j in range(g.height):
      a = P.get((i, j), ' ')
      print("  %s  |" % a, end="")
    print("")

if __name__ == '__main__':
  """Iterative Policy Evaluation:
      * Given a Policy -> find its value function V(s)
      * We will do this for both uniform random policy and fixed deterministic policy
      * NOTE: There are 2 sources of randomness
      * p(a|s)      -> deciding what action to take given the state
      * p(s',r|s,a) -> the next state and reward given your action-state pair
      * we are only modeling p(a|s) = uniform"""
  grid = standard_grid()
  
  # States will be positions (i, j). Simpler than tic-tac-toe, because we only have 
  # 1 game piece that can only be at one position at a time. We get the set of all states
  # from the grid, since these will be the keys to the value function dictionary.
  states = grid.all_states()
  
  # ----- Perform Iterative Policy Evaluation for Uniform Random Actions -----
  
  # Initialize all V(s) to be 0
  V = {v: 0 for v in states}
  gamma = 1.0 # Discount factor
  
  # Enter infinite loop, repeat until convergence
  while True:
    biggest_change = 0
    # Loop through all of the states, since we need to find Value function at all states
    for s in states: 
      old_v = V[s] # Keep copy of old V(s), so we can track magnitude of each change
      
      # Loop through all possible actions from this state. V(s) only has value if it is 
      # not a terminal state.
      if s in grid.actions:
        new_v = 0 # we will accumulate the answer
        p_a = 1.0 / len(grid.actions[s]) # Each action has equal probability
        for a in grid.actions[s]:
          # Look ahead action comes into play. Must first set state to s, and then do the
          # action, so that we can determine the next state s', since that is required
          # to use the bellman equation
          grid.set_state(s) 
          r = grid.move(a)
          new_v += p_a * (r + gamma * V[grid.current_state()]) # This is RL portion!
        V[s] = new_v
        biggest_change = max(biggest_change, np.abs(old_v - V[s]))
      
    if biggest_change < SMALL_ENOUGH:
      break
  print("values for uniformly random actions:")
  print_values(V, grid)
  print("\n\n")
    
  # ----- Perform Iterative Policy Evaluation for Fixed Policy -----
  # Print policy so we know what it looks like
  policy = {
    (2, 0): 'U',
    (1, 0): 'U',
    (0, 0): 'R',
    (0, 1): 'R',
    (0, 2): 'R',
    (1, 2): 'R',
    (2, 1): 'R',
    (2, 2): 'R',
    (2, 3): 'U',
  }
  print_policy(policy, grid)
  
  # Reinitialize all V(s) to be 0
  V = {v: 0 for v in states}
  
  # Let's see how V(s) changes as we get further away from the reward
  gamma = 0.9 # discount factor
  
  # Repeat until convergence. Simpler loop, we don't need to loop through any actions, 
  # because there is only one action per state. Because there is only 1 action per state,
  # the probability of that action in that state is 1. 
  while True:
    biggest_change = 0 
    for s in states: 
      old_v = V[s]
      
      # V(s) only has value if it's not a terminal state
      if s in policy:
        a = policy[s]
        grid.set_state(s)
        r = grid.move(a)
        V[s] = r + gamma * V[grid.current_state()]
        biggest_change = max(biggest_change, np.abs(old_v - V[s]))

    if biggest_change < SMALL_ENOUGH:
      break
  print("values for fixed policy:")
  print_values(V, grid)
values for uniformly random actions:
---------------------------
-0.03| 0.09| 0.22| 0.00|
---------------------------
-0.16| 0.00|-0.44| 0.00|
---------------------------
-0.29|-0.41|-0.54|-0.77|



---------------------------
  R  |  R  |  R  |     |
---------------------------
  U  |     |  R  |     |
---------------------------
  U  |  R  |  R  |  U  |
values for fixed policy:
---------------------------
 0.81| 0.90| 1.00| 0.00|
---------------------------
 0.73| 0.00|-1.00| 0.00|
---------------------------
 0.66|-0.81|-0.90|-1.00|

As we look at the results, we can see that most of the values for the uniform random policy are negative. That is because if you are moving randomly, there is a good chance you end up in the losing state. Remember, there are two ways you can enter the losing state, but only one way you can enter the goal state. The most negative value is for the state right underneath the losing state, which makes sense.

Now, for the fixed policy and discount factor 0.9, we see exactly what we expect to see: the further away we get from the terminal state, the more the value decreases, and each time it decreases by exactly 10%.
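Concretely, along the path the fixed policy takes to the goal, the intermediate rewards are all 0, so each state's value is just the next state's value discounted by $\gamma = 0.9$:

$$V(0,2) = 1.00, \quad V(0,1) = 0.9 \times 1.00 = 0.90, \quad V(0,0) = 0.9 \times 0.90 = 0.81, \quad V(1,0) \approx 0.73, \quad V(2,0) \approx 0.66$$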


4. Policy Improvement

We can now start discussing the control problem, the problem of how to find better policies, eventually leading to the optimal policy. What we know so far is how to find the value function given a fixed policy. Let's look at our recursive definition of $Q_\pi(s,a)$ again:

$$Q_{\pi}(s,a) = \sum_{s',r}p(s',r \mid s,a)\big[r + \gamma V_\pi (s')\big]$$

This tells us the value of doing action $a$ while in state $s$, and following the policy $\pi$ thereafter. If we plug in the action chosen by the current policy, we simply recover the state value function:

$$V_\pi(s) = Q_\pi(s, \pi(s)) = \sum_{s',r}p(s',r \mid s,\pi(s))\big[r + \gamma V_\pi (s')\big] $$

Now let's say that we want to change just one of the actions in the policy; can we do this? Of course we can! We have a finite set of actions, so all we need to do is go through each one until we find an action that gives a better $Q$ than $\pi(s)$ does:

$$\text{find } a \in A \;\; \text{s.t.} \;\; Q_\pi(s,a) > Q_\pi(s, \pi(s))$$

This is where doing all of the programming exercises we go through becomes very useful. This looks like a really abstract equation, but all it's saying is:

If the policy is currently to go up, let's look at left, right, and down to see if any of them gives a bigger Q. If one does, let's change our policy for that state to this new action.

Formally speaking, what we are doing when we choose a new action for the state is finding a new policy $\pi'$, that has a bigger value for the state than $\pi$:

$$V_\pi(s) \leq V_{\pi'}(s)$$

All we need to do is pick the action that gives us the maximum $Q$. We can write this in the form where we are using $Q$:

$$\pi'(s) = argmax_a\big(Q_\pi(s,a)\big)$$

Or in the form where we are doing a lookahead search on $V$:

$$\pi'(s) = argmax_a \big(\sum_{s', r} p(s', r \mid s,a) \big[r + \gamma V_\pi(s')\big]\big)$$
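As a sketch of that lookahead in the style of the gridworld code above, assuming deterministic transitions so the sum over $s'$ and $r$ collapses to the single observed outcome (the helper name greedy_action is ours, not part of the code in this post):

def greedy_action(grid, V, s, gamma=0.9, all_actions=('U', 'D', 'L', 'R')):
  """One-step lookahead: return argmax_a [ r + gamma * V(s') ] for state s,
  assuming the environment is deterministic (p(s',r|s,a) is 0 or 1)."""
  best_a = None
  best_value = float('-inf')
  for a in all_actions:
    grid.set_state(s)  # put the agent in state s
    r = grid.move(a)   # take action a, observe reward r and next state s'
    v = r + gamma * V[grid.current_state()]
    if v > best_value:
      best_value = v
      best_a = a
  return best_a

# Hypothetical usage: the improved policy pi' is greedy with respect to V
# pi_prime = {s: greedy_action(grid, V, s) for s in grid.actions}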

4.1 Things to notice

There are a few things about policy improvement that you should notice:

  1. First, notice that it is greedy. We are never considering globally the value function at all states. We are only looking at the current state, and picking the best action based on the value function at that state.

  2. Second, notice how it uses an imperfect version of $V_\pi(s)$. Once we change $\pi$, $V_\pi(s)$ also changes. We will see soon why this is not a problem.

One question that may arise is: how do we know when we are finished trying to change the policy? When we have found the optimal policy, the policy will not change with respect to the value function. In addition, the value function will no longer improve; it will stay constant. Notice how the inequality we had earlier was less than or equal to:

$$V_\pi(s) \leq V_{\pi'}(s)$$

In the case where it is less than, we are still improving. In the case where it is equal, we have found the optimal policy.


5. Policy Iteration

We are now going to discuss the algorithm we are going to use to find the optimal policy. It will solve the problem we encountered in the previous lecture, specifically, when we change the policy the value function also changes and becomes out of date. We will now see how we can rectify that problem.

So, what do we do when we change the policy and the value function becomes out of date? Well, we can simply recalculate the value function. Luckily, we already know how to find $V$ given $\pi$! We have written the code already, and the algorithm is called iterative policy evaluation. At a high level this algorithm is very simple: we just alternate between policy evaluation and policy improvement, and we keep doing this until the policy doesn't change. Note that what we are checking for is the policy converging, not the value function (although the value function will converge anyway, because once the policy stops changing we only need to do one more iteration of policy evaluation).

In pseudocode, policy iteration looks like:

Step 1. Randomly initialize V(s) and policy pi(s)

Step 2. V(s) = iterative_policy_evaluation(pi)

Step 3. Policy Improvement
policy_changed = False
for s in all_states:      # Loop through all states
  old_a = policy(s)
  policy(s) = argmax[a] { sum[s', r] { p(s',r | s,a) [r + gamma*V(s')] } }
  if policy(s) != old_a:
    policy_changed = True
if policy_changed:
  go back to step 2

6. Policy Iteration in Code

In [3]:
SMALL_ENOUGH = 1e-3
GAMMA = 0.9
ALL_POSSIBLE_ACTIONS = ('U', 'D', 'L', 'R')

# Note: This is deterministic -> All p(s',r|s,a) = 1 or 0. In other words, when you try 
# to go up, you go up. The transition probability is always 0 or 1.

if __name__ == '__main__':
  # Negative grid gives reward of -0.1 for every non-terminal state
  # We will use this to see if it encourages finding a shorter path to the goal
  grid = negative_grid()
  
  # Print Rewards
  print("Rewards: ")
  print_values(grid.rewards, grid)
  
  # Create a randomly initialized deterministic policy. We randomly choose one action out
  # of the set of possible actions for every state.
  policy = {key: np.random.choice(ALL_POSSIBLE_ACTIONS) for key in grid.actions.keys()}
  
  # Initial policy
  print("Initial policy:")
  print_policy(policy, grid)
  
  # Initialize V(s) randomly -> terminal states initialized to 0. Note that states here
  # still refers to the set computed in the previous cell; the positions are identical
  # for the negative grid.
  V = {s: np.random.random() if s in grid.actions else 0 for s in states}
  
  # Repeat until convergence -> will break out when policy does not change. Alternates
  # between policy evaluation and policy improvement
  while True:
    
    # ----- Policy evaluation step -----
    # Same as before, but a simpler version, because the actions, next states, and
    # rewards are all deterministic
    while True:
      biggest_change = 0
      for s in states:
        old_v = V[s]
        
        # V(s) only has a value if it's not a terminal state
        if s in policy: 
          a = policy[s]
          grid.set_state(s)
          r = grid.move(a)
          V[s] = r + GAMMA * V[grid.current_state()]
          biggest_change = max(biggest_change, np.abs(old_v - V[s]))
          
      if biggest_change < SMALL_ENOUGH:
        break
    
    # ----- Policy Improvement step -----
    is_policy_converged = True
    
    # Loop through all states
    for s in states:
      if s in policy:
        old_a = policy[s] # Grab existing action and assign to old_a
        new_a = None
        best_value = float('-inf')
        
        # Loop through all possible actions to find the best lookahead value
        for a in ALL_POSSIBLE_ACTIONS:
          grid.set_state(s)
          r = grid.move(a)
          v = r + GAMMA * V[grid.current_state()]
          if v > best_value: 
            best_value = v
            new_a = a # Choose action that gives us best lookahead value
        policy[s] = new_a
        if new_a != old_a:
          is_policy_converged = False # If actions in policy change, we aren't converged
    
    if is_policy_converged:
      break
      
  print("values:")
  print_values(V, grid)
  print("policy:")
  print_policy(policy, grid)
Rewards: 
---------------------------
-0.10|-0.10|-0.10| 1.00|
---------------------------
-0.10| 0.00|-0.10|-1.00|
---------------------------
-0.10|-0.10|-0.10|-0.10|
Initial policy:
---------------------------
  U  |  R  |  D  |     |
---------------------------
  D  |     |  R  |     |
---------------------------
  U  |  U  |  D  |  L  |
values:
---------------------------
 0.62| 0.80| 1.00| 0.00|
---------------------------
 0.46| 0.00| 0.80| 0.00|
---------------------------
 0.31| 0.46| 0.62| 0.46|
policy:
---------------------------
  R  |  R  |  R  |     |
---------------------------
  U  |     |  U  |     |
---------------------------
  U  |  R  |  U  |  L  |

Notice how, if the agent is in the position below the wall, it will choose to go right instead of going left and then up around the top-left corner. This is because each step has a negative reward associated with it, and future rewards are discounted, so the agent wants to get to the goal state as quickly as possible.

Again, we should keep in mind the current form of the Bellman equation we are utilizing:

$$V_\pi(s) = \sum_a \pi(a \mid s) * \Big \{ r + \gamma V_\pi(s') \Big \}$$

where we are missing the transition probability term:

$$p(s',r \mid s,a)$$

We can do this because we have created a deterministic system: if you are at position (0,0) and you try to go right, you will go right. The transition probability is specifically saying:

Given you take the action to move the right and you are at position (0,0), what is the probability you get to state s' and have reward r?

Well, because we have created a deterministic system, the probability is just 1, and we can drop that term.


7. Policy Iteration in Windy Grid World

We are now going to go over a modification of our environment called windy grid world. Recall that we have two probability distributions to deal with:

  1. Probability of doing an action $a$ when in state $s$: $$\pi(a \mid s)$$
  2. State Transition Probability: $$p(s',r \mid s,a)$$

So far our state transition probability has been deterministic! When we do an action $a$ in state $s$ we always end up in state $s'$. This is also tied to the reward, since gridworld gives you a reward based on the state that you landed in. We will now consider the case where this probability distribution is not deterministic. In other words, the probability will have a value between 0 and 1.

7.1 Windy Gridworld

Imagine you are walking along a windy street. You try to walk straight, but the wind pushes you to the side or backwards. This is what happens in windy gridworld. If the agent tries to go up, it will do so with probability 0.5. But it can also go left, down, or right, each with probability 0.5/3.
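The code below handles the wind by weighting all four resulting moves by their probabilities when computing expected values. If you instead wanted to simulate a single noisy step, a sketch might look like this (windy_move is an illustrative helper, not part of the Grid class):

import numpy as np

def windy_move(grid, intended_action, wind_prob=0.5, all_actions=('U', 'D', 'L', 'R')):
  """Sample an actual move: the intended action with probability 0.5,
  otherwise one of the other three actions with probability 0.5/3 each."""
  others = [a for a in all_actions if a != intended_action]
  probs = [wind_prob] + [(1.0 - wind_prob) / len(others)] * len(others)
  actual = np.random.choice([intended_action] + others, p=probs)
  return grid.move(actual)  # returns the reward; grid tracks the new state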

Let's dig into the code!

In [4]:
"""Next state and reward will now have some randomness. The agent will go in the desired
direction with probability 0.5. The agent will go in a random direction with probability
0.5/3"""

if __name__ == '__main__':
  """Change of step_cost. It was -0.1, now it is -1.0. Because the goal only gives +1 
  reward, and the losing state gives -1 reward, the agent is going to want to end the 
  game as soon as possible, even if this means ending up in the losing state. For example,
  if the agent is 3 steps away from the goal, they will get -3 reward before they can even
  reach the goal. But, if they are only 1 step away from the losing state, then they only
  get -1 reward. Remember, the goal of the agent is not to get to what we have defined as 
  the winning or losing states; all it does is try and maximize its reward. """
  grid = negative_grid(step_cost=-1.0)
  
  # print rewards
  print("rewards:")
  print_values(grid.rewards, grid)
  
  # Create a randomly initialized deterministic policy. We randomly choose one action out
  # of the set of possible actions for every state.
  policy = {key: np.random.choice(ALL_POSSIBLE_ACTIONS) for key in grid.actions.keys()}
  
  # Initialize V(s) randomly -> terminal state initialized to 0
  V = {s: np.random.random() if s in grid.actions else 0 for s in states}
  
  # Repeat until convergence - Will break when policy does not change
  while True:
    
    # ----- Policy Evaluation step -----
    while True:
      biggest_change = 0
      for s in states:
        old_v = V[s]
        
        # V(s) only has value if it's not a terminal state
        new_v = 0
        if s in policy:
          for a in ALL_POSSIBLE_ACTIONS: 
            # Adding in state transition probability. Policy itself is deterministic.
            # There are two probabilities we can play with, but we are only playing with the 
            # state transitions this time. 
            if a == policy[s]:             
              p = 0.5
            else:
              p = 0.5 / 3
            grid.set_state(s)
            r = grid.move(a)
            new_v += p*(r + GAMMA * V[grid.current_state()])
          V[s] = new_v
          biggest_change = max(biggest_change, np.abs(old_v - V[s]))
          
      if biggest_change < SMALL_ENOUGH:
        break
        
    # ----- Policy Improvement step -----
    is_policy_converged = True
    for s in states:
      if s in policy:
        old_a = policy[s]
        new_a = None
        best_value = float('-inf')
        
        # loop through all possible actions to find the best current action
        for a in ALL_POSSIBLE_ACTIONS: # chosen action
          v = 0
          for a2 in ALL_POSSIBLE_ACTIONS: # resulting action
            if a == a2:
              p = 0.5
            else:
              p = 0.5/3
            grid.set_state(s)
            r = grid.move(a2)
            v += p*(r + GAMMA * V[grid.current_state()])
          if v > best_value:
            best_value = v
            new_a = a
        policy[s] = new_a
        if new_a != old_a:
          is_policy_converged = False

    if is_policy_converged:
      break
      
  print("values:")
  print_values(V, grid)
  print("policy:")
  print_policy(policy, grid)
  # result: every move is as bad as losing, so lose as quickly as possible
rewards:
---------------------------
-1.00|-1.00|-1.00| 1.00|
---------------------------
-1.00| 0.00|-1.00|-1.00|
---------------------------
-1.00|-1.00|-1.00|-1.00|
values:
---------------------------
-4.52|-2.95|-0.86| 0.00|
---------------------------
-5.57| 0.00|-1.94| 0.00|
---------------------------
-5.76|-4.88|-3.44|-2.17|
policy:
---------------------------
  R  |  R  |  R  |     |
---------------------------
  U  |     |  R  |     |
---------------------------
  R  |  R  |  U  |  U  |

8. Value Iteration

We are now going to talk about an alternative technique for solving the control problem, called value iteration. Recall that the previous technique we just went over was called policy iteration. It is easy to get these confused, because the names look and sound similar, so coding them will definitely help you get used to the variety of solutions we have for MDPs.

One of the disadvantages of policy iteration is that it is an iterative algorithm, and hence we have to wait for it to converge. At the same time the main loop contains two steps, and the first step itself is also an iterative algorithm. This leaves us with one iterative algorithm inside of another. The question arises: is there a more efficient way to go about solving this problem?

Recall that the policy evaluation algorithm ends when $V$ converges. It is reasonable to ask: is there a point before $V$ converges such that the resulting greedy policy wouldn't change? In practice, after only a few iterations of policy evaluation the resulting greedy policy often stops changing. What this tells us is that we don't actually need to wait for policy evaluation to converge when we are doing policy iteration. We can just do a few steps and then break, because the policy improvement step would find the same policy given additional iterations of policy evaluation.
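A minimal sketch of that idea, sometimes called truncated (or modified) policy evaluation; the eval_sweeps cap and the backup(s, V) interface are illustrative assumptions rather than code from this post:

def truncated_policy_evaluation(V, states, backup, eval_sweeps=3, threshold=1e-3):
  """Run at most eval_sweeps in-place sweeps of the Bellman backup, where
  backup(s, V) returns sum_a pi(a|s) * [r + gamma * V(s')]."""
  for _ in range(eval_sweeps):
    biggest_change = 0
    for s in states:
      old_v = V[s]
      V[s] = backup(s, V)
      biggest_change = max(biggest_change, abs(old_v - V[s]))
    if biggest_change < threshold:  # may converge before hitting the cap
      break
  return V

In a policy iteration loop like the one in section 6, this kind of capped sweep would stand in for the inner "while True" evaluation loop.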

8.1 Value Iteration Algorithm

However, the algorithm that we are studying in this lecture takes this one step further. It is called value iteration. What it does is combine the policy evaluation and policy improvement steps into one update:

$$V_{k+1}(s) = max_a \sum_{s'}\sum_r p(s',r \mid s,a) \big\{ r + \gamma V_k(s') \big\}$$

What is interesting about this is that it looks almost exactly like the equations that we have already seen. In fact, if you weren't paying attention it would look identical. So this equation is very similar to the policy evaluation equation, except we are taking the max over all possible actions.

It is also iterative, as we can see via the k index. Similar to policy evaluation, we don't actually need to wait for the entire kth iteration of $V$ to finish before calculating the (k+1)th iteration of $V$. Notice how this does both policy evaluation and policy improvement in one step. This is because policy improvement occurs by taking the argmax of the expression on the right. So, by just taking the max, we are doing the policy evaluation step on the policy we would have chosen had we taken the argmax.

8.2 Value Iteration Pseudocode

We can now write this out in pseudocode:


$ \text{initialize V(s) = 0 for all s}\in\text{S} \\ \text{while True:} \\ \hspace{2cm} \Delta = 0 \\ \hspace{2cm} \text{for each s} \in \text{S:} \\ \hspace{3cm} \text{old_v = V(s)} \\ \hspace{3cm} V(s) = max_a \sum_{s'}\sum_r p(s',r \mid s,a) \big\{ r + \gamma V(s') \big\}\\ \hspace{3cm} \Delta = \text{max(} \Delta \text{, |V(s) - old_v|)} \\ \hspace{2cm} \text{if } \Delta \text{ < threshold: break} \\ \text{for each s} \in \text{S:} \\ \hspace{2cm} \pi(s) = argmax_a \sum_{s'}\sum_r p(s',r \mid s,a) \big\{ r + \gamma V(s') \big\} $



9. Value Iteration in Code

In [5]:
SMALL_ENOUGH = 1e-3
GAMMA = 0.9
ALL_POSSIBLE_ACTIONS = ('U', 'D', 'L', 'R')

# Note: This is deterministic -> All p(s',r|s,a) = 1 or 0. In other words, when you try 
# to go up, you go up. The transition probability is always 0 or 1.

if __name__ == '__main__':
  # Negative grid gives reward of -0.1 for every non-terminal state
  # We will use this to see if it encourages finding a shorter path to the goal
  grid = negative_grid()
  
  # Print Rewards
  print("Rewards: ")
  print_values(grid.rewards, grid)
  
  # Create a randomly initialized deterministic policy. We randomly choose one action out
  # of the set of possible actions for every state.
  policy = {key: np.random.choice(ALL_POSSIBLE_ACTIONS) for key in grid.actions.keys()}
  
  # Initial policy
  print("Initial policy:")
  print_policy(policy, grid)
  
  # Initialize V(s) randomly -> terminal state initialized to 0
  V = {s: np.random.random() if s in grid.actions else 0 for s in states}
  
  # Repeat until convergence. 
  # ---- Value iteration Loop. ----
  # Looks very similar to policy evaluation, except now we are looping through all 
  # actions, and taking the maximum value. We break when the biggest change is below the
  # SMALL_ENOUGH threshold
  while True:
    biggest_change = 0 
    for s in states:
      old_v = V[s]
      
      # V(s) only has value if it's not a terminal state
      if s in policy:
        new_v = float('-inf')
        for a in ALL_POSSIBLE_ACTIONS:
          grid.set_state(s)
          r = grid.move(a)
          v = r + GAMMA * V[grid.current_state()]
          if v > new_v:
            new_v = v
        V[s] = new_v
        biggest_change = max(biggest_change, np.abs(old_v - V[s]))
    
    if biggest_change < SMALL_ENOUGH:
      break 
      
  # Take our (now converged) value function and find the optimal policy. Recall this is
  # just the argmax: for each state we loop through all of the actions and find the one
  # that gives us the best future reward. Note this runs once, after the value iteration
  # loop has finished, matching the pseudocode above.
  for s in policy.keys():
    best_a = None
    best_value = float('-inf')
    # Loop through all possible actions to find the best current action
    for a in ALL_POSSIBLE_ACTIONS:
      grid.set_state(s)
      r = grid.move(a)
      v = r + GAMMA * V[grid.current_state()]
      if v > best_value:
        best_value = v
        best_a = a
    policy[s] = best_a
      
  # our goal here is to verify that we get the same answer as with policy iteration
  print("values:")
  print_values(V, grid)
  print("policy:")
  print_policy(policy, grid)
Rewards: 
---------------------------
-0.10|-0.10|-0.10| 1.00|
---------------------------
-0.10| 0.00|-0.10|-1.00|
---------------------------
-0.10|-0.10|-0.10|-0.10|
Initial policy:
---------------------------
  U  |  R  |  R  |     |
---------------------------
  D  |     |  D  |     |
---------------------------
  L  |  L  |  D  |  L  |
values:
---------------------------
 0.62| 0.80| 1.00| 0.00|
---------------------------
 0.46| 0.00| 0.80| 0.00|
---------------------------
 0.31| 0.46| 0.62| 0.46|
policy:
---------------------------
  R  |  R  |  R  |     |
---------------------------
  U  |     |  U  |     |
---------------------------
  U  |  R  |  U  |  L  |

10. Dynamic Programming Summary

In the last section, we defined the Markov Decision Process. In this section we looked at one method for finding solutions to the MDP: Dynamic Programming. In particular we looked at:

  • Prediction Problem: Find the value function given a policy. The algorithm we looked at to solve this was iterative policy evaluation.
  • Control Problem: Find the optimal policy and optimal value function. We looked at two algorithms that could do this. The first was called policy iteration, and it involved an iterative algorithm inside of another iterative algorithm, which could be inefficient. We then looked at value iteration, which truncates the policy evaluation step and combines policy evaluation and policy improvement into one step.

10.1 Asynchronous Dynamic Programming

Notice how all of the algorithms we have looked at so far have involved looping through the entire set of states. As we have discussed before, the state space in many realistic games can be incredibly large. Thus, even one iteration of value iteration can take a very long time. Also, recall that one way to speed up value iteration is to do the update "in place", meaning that we don't exclusively use the values from the previous iteration to update the values in the current iteration.

We can take that a step further with what is called asynchronous dynamic programming. The modification is simple: instead of looping through the entire state set, we loop through just a few, or even just one, of the states per iteration. We can choose these states randomly, or based on which states are more likely to be visited. You can learn this information by having the agent play a few rounds of the game.
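A rough sketch of one version of this, assuming the same dictionary representation of V and a backup(s, V) callable that computes the max over actions (both are illustrative assumptions, not code from this post):

import numpy as np

def asynchronous_sweep(V, states, backup, n_updates=1):
  """Update only a few randomly chosen states per call, instead of sweeping
  the whole state space. backup(s, V) should return
  max_a sum_{s',r} p(s',r|s,a) [r + gamma * V(s')]."""
  state_list = list(states)
  for _ in range(n_updates):
    s = state_list[np.random.randint(len(state_list))]
    V[s] = backup(s, V)
  return V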

10.2 Generalized Policy Iteration

Although we introduced policy iteration as a dynamic programming algorithm in this section, we will see that the main idea behind policy iteration will appear again and again throughout the course.

This main concept is:

We iteratively perform two steps, policy evaluation and policy improvement, and we alternate between them until we reach convergence. We can only reach convergence once the Bellman equation holds: the value function has converged for the given policy, and the policy is stable under greedy selection with respect to the value function.

One way to visualize this:

Initially the policy and value function can be highly suboptimal. But, by requiring the policy to be greedy with respect to the value function, and the value function to be consistent with the policy, we can converge to the optimal policy and optimal value function.

10.3 Efficiency of Dynamic Programming

Let's consider how much better dynamic programming is than brute force search. Well, let's call the number of states $N$ and the number of possible actions $M$. If we assume that the agent can go from the start state to the goal state in $O(N)$ time, then we want a sequence of actions of length $O(N)$. The number of possible actions is then:

$$\underbrace{M*M*M \cdots *M}_{O(N) \text{ times}} \rightarrow O(M^N)$$

Hence the number of possible action sequences is $O(M^N)$. To solve this by brute force in practice, you would enumerate all of the possible action sequences, do policy evaluation on each of them, and then keep the one that gives you the maximum value function. This is exponential in the number of states, so we can see how DP is a much more efficient solution.
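For the gridworld above, for example, $N = 9$ non-terminal states and $M = 4$ actions give on the order of

$$4^9 = 262144$$

candidate deterministic policies (one action choice per state) to evaluate by brute force, whereas the dynamic programming algorithms above only needed repeated sweeps over those same 9 states.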

10.4 Model-based vs. Model-free

Lastly, the DP solutions require a full model of the environment; in particular, the state transition probabilities $p(s',r \mid s,a)$. In the real world, these may be hard to measure, especially if $|S|$ is large. In the remaining sections we are going to look at methods that don't require such a model of the environment. These are called model-free methods.

You also may have noticed that these iterative methods start from an initial estimate and then make new estimates from other estimates: in policy iteration and value iteration we go back and forth between the value function and the policy, and each value update is based on our current estimates of the successor states' values. Updating an estimate from other estimates like this is called bootstrapping. You will see that the next method we look at, Monte Carlo, does not require bootstrapping, while the method after that, Temporal Difference Learning, does.


© 2018 Nathaniel Dake