Nathaniel Dake Blog

8. Approximation Methods

We are now going to look at approximation methods. Recall that in the last section, we discussed a major disadvantage to all of the methods we have studied so far. That is, they all require us to estimate the value function for each state, and in the case of the action-value function, we have to estimate it for each state and action pair. We learned early on that the state space can grow very large, very quickly. This makes all of the methods we have studied impractical.

  • $V$ - Need to estimate $\mid S \mid$ values
  • $Q$ - Need to estimate $\mid S \mid \times \mid A \mid$ values
  • $\mid S \mid$ and $\mid A \mid$ can be very large

The solution to this is approximation.

1.1 Approximation Theory

Recall from our earlier work concerning deep learning, that neural networks are universal function approximators. That means that given the right architecture, a neural network can approximate any function to an arbitrary degree of accuracy. In practice, they do not perform perfectly, but they do perform very well.

Mathematically, what we are trying to do is first do a feature extraction: so from the state $s$ we can extract a feature vector $x$:

$$x = \varphi (s)$$

Our goal is to then find a function that takes in a feature vector $x$, and a set of parameters $\theta$, that faithfully approximates the value function $V(s)$:

$$\hat{f}(x, \theta) \approx V(s)$$
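As a minimal sketch of this idea, assuming a grid-world style state given as an (i, j) tuple, a feature extractor and a parameterized (here linear) approximator might look like the following; the names phi and f_hat are just for illustration:

import numpy as np

def phi(s):
  # Hypothetical feature extractor: map a state s = (i, j) to a feature vector x
  i, j = s
  return np.array([i, j, i*j, 1.0])   # include a bias term

def f_hat(x, theta):
  # Parameterized approximator: here a simple linear model, theta^T x
  return theta.dot(x)

theta = np.zeros(4)       # parameters to be learned
x = phi((2, 0))           # x = phi(s)
print(f_hat(x, theta))    # approximates V(s) once theta has been learned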

1.2 Linear Approximation

In this section, we are going to focus specifically on linear methods. We will see that function approximation methods require us to use models that are differentiable, so we cannot use something like a decision tree or k-nearest neighbors. In the next set of notebooks (RL with deep learning) we will look at deep learning methods, which are also differentiable. Unlike linear models, they won't require us to do feature engineering beforehand, although we could. Models like convolutional neural networks will allow us to use raw pixel data as the state, and the neural network will do its own automatic feature extraction and selection. However, those are harder to implement and take away from the fundamentals of RL, so we will hold off on them for now. For now, all we need to know are linear regression and gradient descent.

1.3 Section 8 Outline

We are going to proceed with the following outline for this section:

  • We are first going to apply approximation methods to Monte Carlo prediction. That means we will be estimating the value function given a fixed policy, but instead of representing the value function as a dictionary indexed by state, we will use a linear function approximator. Recall that MC methods require us to play entire episodes and calculate the returns before doing any updates. So next we will...
  • Apply approximation methods to TD(0) prediction. Remember, TD(0) takes aspects of both MC sampling and the bootstrap method of DP.
  • After working on the prediction problem, we will move to the control problem, and we will use SARSA for this. But we will of course be replacing $Q$ with a linear function approximator.

1.4 Sanity Checking

One thing to keep in mind in this section is that we can always sanity check our results by comparing to the non-approximated version. We expect our approximation to be close, but not perfect. One obstacle we may encounter is that our algorithm is implemented perfectly, but our model is bad. Remember, linear models are not very expressive, so if we extract a poor set of features, the model won't be able to learn the value function well. In other words, the model will have a large error. To avoid this, we need to proactively think about what features are good for mapping states to values. We will need to put in manual work for feature engineering in order to improve our results.


2. Linear Models for RL

What we are going to do now is apply supervised learning to reinforcement learning. Recall that supervised learning basically amounts to function approximation. We are trying to find a parameterized function that closely approximates the true function. In this case, that true function is the value function that we use to solve MDPs.

Earlier in the course, we talked about the fact that rewards have to be real numbers. Since returns are sums of rewards, they also have to be real numbers. And since values are expected values of returns, they are also real numbers. So, thinking about the supervised learning techniques we have at our disposal (classification and regression), it should be clear that what we want to do here is regression.

2.2 Error

In particular, we want our estimate, $\hat{V}$, to be close to the true $V$. As we know, for all supervised learning methods we need a cost function, and the appropriate cost function for linear regression is squared error. We can represent it as follows:

$$Error = \big[V(s) - \hat{V}(s)\big]^2$$

Now that we have our basic error function, we can replace $V$ with its definition:

$$Error = \big[E[G_t \mid S_t = s] - \hat{V}(s)\big]^2$$

However, since we do not know this expected value, we need to replace it with something else. In particular, we can take what we learned from Monte Carlo, and we can replace it with the sample mean of the actual returns:

$$Error = \big[\frac{1}{N}\sum_{i=1}^N G_{i,s}- \hat{V}(s)\big]^2$$

An alternative way of looking at this is that we treat each state and return pair as a training sample. In this way, we will try to minimize the individual squared differences between $G$ and $\hat{V}$ simultaneously. This will look just like linear regression, as expected:

$$Error = \sum_{i=1}^N \big[G_{i,s} - \hat{V}(s) \big]^2$$

$$Error = \sum_{i=1}^N \big(y_i - \hat{y}_i \big)^2$$
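To make the analogy with linear regression concrete, here is a minimal sketch with hypothetical data: it computes this cost over a batch of (feature vector, return) training samples, assuming the linear model $\hat{V} = \theta^T x$ that is introduced below.

import numpy as np

# Hypothetical training samples: feature vectors x_i (rows of X) and sampled returns G_i
X = np.array([[2.0, 0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0, 1.0]])
G = np.array([0.25, 0.33, 0.43])

theta = np.zeros(4)

# Error = sum over samples of (G_i - V_hat_i)^2, with V_hat_i = theta^T x_i
predictions = X.dot(theta)
error = np.sum((G - predictions)**2)
print(error)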

2.3 Stochastic Gradient Descent

The advantage of representing the error in this way, is that it allows us to do stochastic gradient descent. This is where we take a small step in the direction of the gradient, with respect to the cost of only one sample at a time. This is perfect for our needs, because at every iteration of the game, we only have one sample (state and return pair) to look at.

2.4 Gradient Descent

Recall that in linear regression our function approximator is parameterized by a set of weights (generally referred to as either $w$ or $\theta$). Here, we let $\hat{V}$ be parameterized by $\theta$. In other words, what we are trying to find is the $\theta$ that allows $\hat{V}$ to be the best approximation. To achieve this, we do gradient descent with respect to $\theta$, minimizing the error we derived earlier.

$$\theta = \theta - \alpha \frac{\partial E}{\partial \theta}$$

For clarity, recall that $\frac{\partial E}{\partial \theta}$ is just representing how the squared error changes with respect to a change in $\theta$. And once again, $\alpha$ represents the learning rate. We can take this a step further, and replace the error with the squared difference we just derived earlier:

$$\frac{\partial E}{\partial \theta} = \frac{\partial}{\partial \theta}\big[G_{i,s}- \hat{V}(s, \theta)\big]^2$$

$$\theta = \theta + \alpha \Big( G - \hat{V}(s, \theta)\Big) \frac{\partial \hat{V}(s,\theta)}{\partial \theta}$$

Note: above, the factor of 2 that comes from the exponent when taking the derivative has been dropped (absorbed into the learning rate $\alpha$). Remember, this is stochastic gradient descent, so we are only looking at one sample of $G$ at a time.

2.5 Gradient Descent for Linear Models

Recall that we are only looking at linear models in this class, so $\hat{V}$ is the dot product of the feature vector $x$ and $\theta$.

$$\hat{V}(s, \theta) = \theta^T \varphi(s) = \theta^T x$$

This dot product is just a linear combination, which can be expanded into the more familiar form:

$$\hat{V}(s, \theta) = \theta_1 x_1 + \theta_2 x_2 + ... + \theta_n x_n $$

In other words, the derivative of $\hat{V}$ with respect to $\theta$ is $x$:

$$\frac{\partial \hat{V} (s, \theta)}{\partial \theta} = x$$

So we can formulate our new update rule as follows:

$$\theta = \theta + \alpha \Big( G - \hat{V}(s, \theta)\Big) x$$
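As a minimal sketch with made-up numbers, a single stochastic gradient descent step for the linear model looks like this:

import numpy as np

alpha = 0.01                         # learning rate
theta = np.random.randn(4) / 2       # randomly initialized parameters

# One (state, return) training sample; x is a hypothetical feature vector for state s
x = np.array([1.0, -0.5, 0.0, 1.0])
G = 0.72                             # sampled return observed from state s

# theta <- theta + alpha * (G - V_hat(s, theta)) * x
V_hat = theta.dot(x)
theta += alpha * (G - V_hat) * x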

2.6 Relationship to Monte Carlo

Something interesting happens when we think back to our Monte Carlo methods; in other words, when we were not parameterizing $V$. Instead, $\hat{V}$ itself was the parameter we were trying to find, and we were trying to find it for all states $s$. If $\hat{V}$ itself is the parameter, then we get this update equation:

$$\hat{V}(s) = \hat{V}(s) + \alpha \big(G_s - \hat{V}(s)\big) \frac{\partial \hat{V}(s)}{\partial \hat{V}(s)}$$

$$\hat{V}(s) = \hat{V}(s) + \alpha \big(G_s - \hat{V}(s)\big)$$

But recall that this is the exact same equation we had before for updating the running mean! So we can see that what we were doing before to find $V$ was actually an instance of gradient descent.
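A quick numeric check of this equivalence, using hypothetical returns: with a decaying learning rate $\alpha = 1/N$, the gradient descent update recovers exactly the sample mean of the returns.

import numpy as np

# Hypothetical returns observed from a single state s over several episodes
returns = [1.0, 0.0, 0.5, 0.25]

V = 0.0
for n, G in enumerate(returns, start=1):
  alpha = 1.0 / n             # decaying learning rate
  V = V + alpha * (G - V)     # V(s) <- V(s) + alpha * (G - V(s))

print(V, np.mean(returns))    # both print 0.4375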


3. Feature Engineering

We are now going to look at how feature engineering can be applied in the case of RL. Recall, Neural Networks can in some sense automatically find good nonlinear transformations/features of the raw data. But, we are only looking at linear methods for now, which means we will need to come up with features on our own.

3.1 Mapping s $\rightarrow$ x

One way to think of states is that they are categorical variables. For instance:

  • (0,0) $\rightarrow$ category 1
  • (0,1) $\rightarrow$ category 2
  • and so on...

How do we treat categorical variables? We do so via one-hot encoding. So, how does one-hot encoding work here? Well, we have $\mid S \mid$ states, so the dimensionality of $x$, which we will call $D$, is:

$$D = \mid S \mid$$

This means that we have a long feature vector of size $\mid S \mid$. Then for each of the states, we set one of the values in $x$ to 1:

$$s = (0,0) \rightarrow x = [1,0,0,...]$$

$$s = (0,1) \rightarrow x = [0,1,0,...]$$

The problem with this is that it doesn't allow us to compress the amount of space it takes to store the value function! It requires the same number of parameters as measuring $V(s)$ directly. Remember, compressing the amount of space is the whole reason we are doing this in the first place. If we store $V$ as a dictionary, it will have $\mid S \mid$ keys and $\mid S \mid$ values. Hence, we make no improvement if we use one-hot encoding. In fact, it is equivalent to what we were doing before, since each component of $\theta$ would represent the value of the corresponding state:

$$V(s=0) = \theta^T [1,0,0,...] = \theta_0$$
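Here is a minimal sketch of that equivalence, using a hypothetical state-to-index mapping: each component of $\theta$ ends up being the value estimate of one particular state, so nothing is compressed.

import numpy as np

# Hypothetical enumeration of a few states
state_to_idx = {(0, 0): 0, (0, 1): 1, (0, 2): 2}
D = len(state_to_idx)

def one_hot(s):
  x = np.zeros(D)
  x[state_to_idx[s]] = 1.0
  return x

theta = np.array([0.43, 0.56, 0.72])   # effectively one value estimate per state
# V_hat(s) = theta^T x just reads off the entry for that state
print(one_hot((0, 1)).dot(theta))      # 0.56, i.e. theta[1]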

3.2 One-Hot Encoding

There is, however, one positive aspect to using one-hot encoding for your feature transformation. Suppose your algorithm isn't working, and $\hat{V}$ isn't representing the true $V$ very well. One reason your $\hat{V}$ may not be good is that your features may be bad! So, your code may be fine and free of bugs, but still yield poor results because the features are bad. In that case, you could change your feature transformer to use one-hot encoding, where you predict each $V(s)$ individually. If you do this and it works, that tells you that your features are bad (since one-hot encoding is equivalent to the non-approximation method).

3.3 Alternative to One-Hot Encoding

So, if we know that one-hot encoding is bad, then what is good? Well, in the case of grid world, consider that each (i,j) represents a position in 2-d space. Therefore, it is more like a real number than a category. So, we can represent the $x$ vector as simply (i,j) itself. You may want to scale it so that its mean is 0 and variance is 1. We would consider this feature vector to simply be the raw data, without any feature engineering:

$$(x_1, x_2) = (i,j)$$

So, what is the problem with just setting $x$ to be the location of the agent? Well, remember that our model is linear. Let's say $j$ is fixed; then $V$ can only change linearly with respect to $i$. In other words, $V$ can only be a line with respect to this coordinate, so it is always increasing or always decreasing. This is not very expressive.

3.4 Polynomials

Recall that one easy way to make new features is to create polynomials. In calculus, it is shown that a Taylor expansion with enough polynomial terms can approximate a well-behaved function to arbitrary accuracy. So, if we start with $x_1$ and $x_2$, we can create the terms:

$$x_1^2$$

$$x_2^2$$

$$x_1 x_2$$

We can also create higher order polynomials, but we must be careful of overfitting. This is the method we will use from here on out.
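As a minimal sketch, assuming the state is an (i, j) tuple, a polynomial feature transformer might look like the following (in practice we would also center and scale the features, as the code in the next sections does):

import numpy as np

def poly_features(s):
  # Second-order polynomial features built from the raw coordinates (i, j), plus a bias
  i, j = s
  return np.array([i, j, i*i, j*j, i*j, 1.0])

theta = np.zeros(6)
x = poly_features((2, 1))
print(theta.dot(x))   # V_hat(s) under the linear model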


4. Monte Carlo Prediction with Approximation

We can now start applying approximation methods to the prediction problem and control problem. We are going to start with the prediction problem! We are going to use Monte Carlo estimation but replace the value function dictionary with our linear model of the value function.

4.1 Two Main Steps

Recall that for MC estimation, we have two main steps which we want to do repeatedly.

  1. Play the game and calculate a list of states and returns.
  2. Update $\hat{V}(s)$ using the return as the target and $\hat{V}(s)$ as the prediction:

    $$\hat{V}(s) = \hat{V}(s) + \alpha \big(G_s - \hat{V}(s) \big)$$
    Remember, this is gradient descent, but it is also equivalent to calculating the mean of all the returns for this state.

4.2 With Approximation

Since in this section we are approximating $V(s)$ with a model parameterized by $\theta$, what we want to update is the parameter $\theta$:

  1. Play the game and calculate a list of states and returns.
  2. Update $\theta$ using the return as the target:

    $$\theta = \theta + \alpha \big(G - \hat{V}(s, \theta) \big)x$$
    Here, we continue to do stochastic gradient descent, but with respect to $\theta$ instead of $\hat{V}$.

4.3 Pseudocode

Remember that this is for a fixed policy, since we are focusing on the prediction problem. Let's look at some pseudocode to solidify this idea:


$ \text{def mc_approx_prediction}(\pi)\text{:} \\ \hspace{1cm} \theta =\text{random} \\ \hspace{1cm} \text{for i=1..N:} \\ \hspace{2cm} \text{states_and_returns = play_game} \\ \hspace{2cm} \text{for s, g in states_and_returns:} \\ \hspace{3cm} x = \Phi(s)\\ \hspace{3cm} \theta \leftarrow \theta + \alpha (g - \theta^T x)x \\ $


  • We start by taking in an input policy, $\pi$, and randomly initializing the $\theta$ vector
  • Next, we enter a loop for some number of iterations
  • On each iteration, we play the game, which returns a sequence of states and returns
  • Loop through these states and returns, and apply our update equation to $\theta$, treating our return as the target

5. MC Prediction with Approximation in Code

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from common import standard_grid, negative_grid, print_policy, print_values
from common import random_action, play_game, SMALL_ENOUGH, GAMMA, ALL_POSSIBLE_ACTIONS
In [2]:
# Note: This is policy evaluation, not optimization 

LEARNING_RATE = 0.001

if __name__ == '__main__':
  # use the standard grid again (0 for every step) so that we can compare
  # to iterative policy evaluation
  grid = standard_grid()

  # print rewards
  print("rewards:")
  print_values(grid.rewards, grid)

  # state -> action
  # found by policy_iteration_random on standard_grid
  # MC method won't get exactly this, but should be close
  # values:
  # ---------------------------
  #  0.43|  0.56|  0.72|  0.00|
  # ---------------------------
  #  0.33|  0.00|  0.21|  0.00|
  # ---------------------------
  #  0.25|  0.18|  0.11| -0.17|
  # policy:
  # ---------------------------
  #   R  |   R  |   R  |      |
  # ---------------------------
  #   U  |      |   U  |      |
  # ---------------------------
  #   U  |   L  |   U  |   L  |
  policy = {
    (2, 0): 'U',
    (1, 0): 'U',
    (0, 0): 'R',
    (0, 1): 'R',
    (0, 2): 'R',
    (1, 2): 'U',
    (2, 1): 'L',
    (2, 2): 'U',
    (2, 3): 'L',
  }
  
  # Randomly initialize theta vector. Our model is V_hat = theta.dot(x), where
  # x = [row, column, row*column, 1] , where the 1 is for the bias term
  theta = np.random.randn(4) / 2 
  
  # Define a function that turns the state into a feature vector x. The only nonlinear 
  # feature is the interaction effect between the i and j coordinate (s[0] and s[1])
  def s2x(s):
    return np.array([s[0] - 1, s[1] - 1.5, s[0]*s[1] - 3, 1])
  
  # Repeat until converge - Main Loop
  deltas = []
  t = 1.0
  for it in range(2000):
    if it % 100 == 0:
      t += 0.01
    alpha = LEARNING_RATE/t # Using decaying learning rate
    
    # Generate an episode using pi. Pattern is the same as before. We play the game, 
    # and get a sequence of states and returns. We then loop through the states and 
    # returns, but instead of updating V, we now update the parameter theta, using the 
    # equation we derived earlier for stochastic gradient descent. 
    biggest_change = 0
    states_and_returns = play_game(grid, policy)
    seen_states = set()
    for s, G in states_and_returns:
      # Check if we have already seen s. This is 'first-visit' MC policy evaluation
      if s not in seen_states:
        old_theta = theta.copy()
        x = s2x(s)
        V_hat = theta.dot(x)
        theta += alpha*(G - V_hat)*x
        biggest_change = max(biggest_change, np.abs(old_theta - theta).sum())
        seen_states.add(s)
    deltas.append(biggest_change)
  
  plt.plot(deltas)
  plt.show()
  
  # Obtain predicted values
  V = {}
  states = grid.all_states()
  for s in states:
    if s in grid.actions:
      V[s] = theta.dot(s2x(s))
    else:
      # terminal state or state we can't otherwise get to
      V[s] = 0

  print("values:")
  print_values(V, grid)
  print("policy:")
  print_policy(policy, grid)
    
rewards:
---------------------------
 0.00| 0.00| 0.00| 1.00|
---------------------------
 0.00| 0.00| 0.00|-1.00|
---------------------------
 0.00| 0.00| 0.00| 0.00|
values:
---------------------------
 0.43| 0.59| 0.75| 0.00|
---------------------------
 0.37| 0.00| 0.36| 0.00|
---------------------------
 0.32| 0.15|-0.02|-0.19|
policy:
---------------------------
  R  |  R  |  R  |     |
---------------------------
  U  |     |  U  |     |
---------------------------
  U  |  L  |  U  |  L  |

6. TD(0) Semi-Gradient Prediction

We now ask: can we apply approximation to TD learning, in addition to Monte Carlo? The answer is yes, but there is one caveat. Let's start by looking at the algorithm.

Recall that the main difference between Monte Carlo and Temporal Difference Learning, is that with TD we don't use the actual return. We instead estimate the return based on the reward and the value function for the next state, i.e. instead of using $G$ we use:

$$r + \gamma V(s')$$

If we then look at how this transforms our algorithm:

$$\theta = \theta + \alpha \big(G - \hat{V}(s, \theta) \big)x$$

$$\theta = \theta + \alpha \big(r + \gamma \hat{V}(s', \theta) - \hat{V}(s, \theta) \big)\frac{\partial \hat{V}(s,\theta)}{\partial \theta}$$

$$\theta = \theta + \alpha \big(target - prediction \big)\frac{\partial \, prediction}{\partial \theta}$$

We quickly see that we run into a problem! Our target is not a real target, because it requires a prediction made by the model itself; in particular, $\hat{V}(s',\theta)$. So, in a sense we are using the model's own output as a target for fitting the model parameters, which seems strange. However, it works in practice, so we do it anyway.

We call this a semi-gradient method: since the target we are using is not a true target, the gradient we are computing is not a true gradient.

6.1 TD(0) Semi-Gradient Prediction in Code

Remember, there is one main difference between this and Monte Carlo:

  1. In TD(0), we do not use the returns in the update for $\theta$. Because of this, when we play the game we don't need to bother calculating the returns, since we use the rewards directly.
In [3]:
import numpy as np
import matplotlib.pyplot as plt
from common import standard_grid, negative_grid, print_policy, print_values, ALPHA
from common import random_action, play_game_td, SMALL_ENOUGH, GAMMA, ALL_POSSIBLE_ACTIONS
In [4]:
class Model: 
  def __init__(self):
    self.theta = np.random.randn(4) / 2
    
  def s2x(self, s):
    "Transforms states into feature vector. Same feature transformation as MC."
    return np.array([s[0] - 1, s[1] - 1.5, s[0]*s[1] -3, 1])
  
  def predict(self, s):
    "Takes in state s, transforms into x, and returns dot product between theta and x."
    x = self.s2x(s)
    return self.theta.dot(x)
  
  def grad(self, s):
    "Gradient function, just returns x."
    return self.s2x(s)
  
if __name__ == '__main__':
  # use the standard grid again (0 for every step) so that we can compare
  # to iterative policy evaluation
  grid = standard_grid()

  # print rewards
  print("rewards:")
  print_values(grid.rewards, grid)

  # state -> action
  policy = {
    (2, 0): 'U',
    (1, 0): 'U',
    (0, 0): 'R',
    (0, 1): 'R',
    (0, 2): 'R',
    (1, 2): 'R',
    (2, 1): 'R',
    (2, 2): 'R',
    (2, 3): 'U',
  }
  
  model = Model()
  deltas = []
  
  # Main loop. Repeat until convergence.
  k = 1.0
  for it in range(20000):
    if it % 10 == 0:
      k += 0.01
    alpha = ALPHA / k
    biggest_change = 0
    
    # Play game to get sequence of states and rewards
    states_and_rewards = play_game_td(grid, policy)
    
    # The first (s, r) tuple is the state we start in and 0 (since we don't 
    # get a reward) for simply starting the game. The last (s,r) tuple is the 
    # terminal state and the final reward. The value for the terminal state is
    # by definition 0, so we don't care about updating it. 
    # Loop through states and rewards.
    for t in range(len(states_and_rewards) - 1):
      s, _ = states_and_rewards[t]
      s2, r = states_and_rewards[t + 1]
      
      # We will update V(s) AS we experience the episode
      old_theta = model.theta.copy()
      if grid.is_terminal(s2):
        # Since we know value at terminal state is 0, we don't make prediction if s2 
        # is terminal
        target = r
        
      else:
        target = r + GAMMA * model.predict(s2)
      # Implement update equation for theta
      model.theta += alpha*(target - model.predict(s))*model.grad(s)
      biggest_change = max(biggest_change, np.abs(old_theta - model.theta).sum())
    deltas.append(biggest_change)
  
  plt.plot(deltas)
  plt.show()

  # obtain predicted values
  V = {}
  states = grid.all_states()
  for s in states:
    if s in grid.actions:
      V[s] = model.predict(s)
    else:
      # terminal state or state we can't otherwise get to
      V[s] = 0

  print("values:")
  print_values(V, grid)
  print("policy:")
  print_policy(policy, grid)
rewards:
---------------------------
 0.00| 0.00| 0.00| 1.00|
---------------------------
 0.00| 0.00| 0.00|-1.00|
---------------------------
 0.00| 0.00| 0.00| 0.00|
values:
---------------------------
 0.29| 0.40| 0.51| 0.00|
---------------------------
 0.14| 0.00|-0.08| 0.00|
---------------------------
-0.00|-0.34|-0.67|-1.00|
policy:
---------------------------
  R  |  R  |  R  |     |
---------------------------
  U  |     |  R  |     |
---------------------------
  U  |  R  |  R  |  U  |

7. Semi-Gradient SARSA

We are now going to move on to the control problem (choosing the optimal policy). However, instead of using a tabular $Q$, we are going to create a parameterized model to approximate $Q$. Note, this is going to be more difficult than approximating $V$, since instead of approximating $\mid S \mid$ values, we now have to approximate $\mid S \mid \times \mid A \mid$ values. However, the basic idea is the same. Recall that our target is:

$$target = r +\gamma \hat{Q}(s',a')$$

And our prediction is:

$$prediction = \hat{Q}(s,a)$$

Since $\hat{Q}$ is an approximation and it appears in both the target and the prediction, this is also not a true gradient, so this is also a semi-gradient method:

$$\theta = \theta + \alpha \big [r +\gamma \hat{Q}(s',a') - \hat{Q}(s,a)\big] \frac{\partial \hat{Q}(s,a)}{\partial \theta}$$

7.1 Features

Since $Q$ is indexed by both $s$ and $a$, we need to find a way to combine them into a feature vector $x$:

$$(s,a) \rightarrow x$$

One simple method would be to take what we had before for $V$ ($row, column, row*column$), and just one-hot encode the actions. So, in effect we would have:

$$x = (row, column, row*column, U, D, L, R, 1)$$

However, if you do this, you will find that it is not quite expressive enough. In other words, a linear model given these features can't accurately approximate $Q$.

What does seem to work better is to add squared terms, and have a separate set of features for each action. So, for each action, we now have features that look like:

x = [
  r && U, 
  c && U,
  r*r && U,
  c*c && U,
  r*c && U,
  1 && U,
  r && D,
  ...
]

The 1 in each group represents the one-hot encoded action (it is not really a bias). You may wonder: given so many features, does this still compress $Q$? Well, we have:

$$(6 \text{ per action}) \times (4 \text{ actions}) + 1 \text{ bias} = 25 \text{ features}$$

Whereas our tabular method (storing $Q$ as a table) had:

$$9 \text{ states} \times 4 \text{ actions} = 36 \text{ values}$$

7.2 Semi-Gradient SARSA in Code

We are now going to implement semi-gradient SARSA in code to find the optimal policy (and solve the control problem)!

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from common import standard_grid, negative_grid, print_policy, print_values, ALPHA
from common import random_action_td, play_game_td, SMALL_ENOUGH, GAMMA, ALL_POSSIBLE_ACTIONS
from common import max_dict
In [3]:
# Dictionary to map state-action pairs to indexes, and global index that keeps track 
# of number of state action pairs that we have already seen so that it can assign 
# the next index.
SA2IDX = {}
IDX = 0

class Model:
  def __init__(self):
    "Initializes theta. Dimensionality is now 25."
    self.theta = np.random.randn(25) / np.sqrt(25)
    
  def sa2x(self, s, a):
    # NOTE: using just (r, c, r*c, u, d, l, r, 1) is not expressive enough
    return np.array([
      s[0] - 1              if a == 'U' else 0,
      s[1] - 1.5            if a == 'U' else 0,
      (s[0]*s[1] - 3)/3     if a == 'U' else 0,
      (s[0]*s[0] - 2)/2     if a == 'U' else 0,
      (s[1]*s[1] - 4.5)/4.5 if a == 'U' else 0,
      1                     if a == 'U' else 0,
      s[0] - 1              if a == 'D' else 0,
      s[1] - 1.5            if a == 'D' else 0,
      (s[0]*s[1] - 3)/3     if a == 'D' else 0,
      (s[0]*s[0] - 2)/2     if a == 'D' else 0,
      (s[1]*s[1] - 4.5)/4.5 if a == 'D' else 0,
      1                     if a == 'D' else 0,
      s[0] - 1              if a == 'L' else 0,
      s[1] - 1.5            if a == 'L' else 0,
      (s[0]*s[1] - 3)/3     if a == 'L' else 0,
      (s[0]*s[0] - 2)/2     if a == 'L' else 0,
      (s[1]*s[1] - 4.5)/4.5 if a == 'L' else 0,
      1                     if a == 'L' else 0,
      s[0] - 1              if a == 'R' else 0,
      s[1] - 1.5            if a == 'R' else 0,
      (s[0]*s[1] - 3)/3     if a == 'R' else 0,
      (s[0]*s[0] - 2)/2     if a == 'R' else 0,
      (s[1]*s[1] - 4.5)/4.5 if a == 'R' else 0,
      1                     if a == 'R' else 0,
      1
    ])
  
  def predict(self, s, a):
    "Takes in state s, transforms into x, and returns dot product between theta and x."
    x = self.sa2x(s,a)
    return self.theta.dot(x)
  
  def grad(self, s, a):
    "Gradient function, just returns x."
    return self.sa2x(s, a)

def getQs(model, s):
  """At some point during SARSA, we need to take the argmax over Q, but we can't do that
  if Q isn't represented as a dictionary. So, the job of this function is to turn the 
  predictions into a dictionary given some state s."""
  Qs = {}
  for a in ALL_POSSIBLE_ACTIONS:
    q_sa = model.predict(s,a)
    Qs[a] = q_sa
  return Qs

if __name__ == '__main__':
  grid = negative_grid(step_cost=-0.1)

  # print rewards
  print("rewards:")
  print_values(grid.rewards, grid)

  # no policy initialization, we will derive our policy from most recent Q
  # enumerate all (s,a) pairs, each will have its own weight in our "dumb" model
  # essentially each weight will be a measure of Q(s,a) itself
  states = grid.all_states()
  for s in states:
    SA2IDX[s] = {}
    for a in ALL_POSSIBLE_ACTIONS:
      SA2IDX[s][a] = IDX
      IDX += 1
  
  model = Model()
  
  # Main Loop - repeat until convergence
  t = 1.0
  t2 = 1.0
  deltas = []
  for it in range(20000):
    if it % 100 == 0:
      t += 0.01
      t2 += 0.01
    if it % 1000 == 0:
      print("it:", it)
    alpha = ALPHA / t2
    
    # instead of 'generating' an episode, we will PLAY
    # an episode within this loop
    s = (2, 0) # start state
    grid.set_state(s)

    # get Q(s) so we can choose the first action
    Qs = getQs(model, s)
    
    # Choose the first action epsilon-greedily from the current Q estimates.
    # (The value of a terminal state is 0 by definition, so we never need to update it.)
    a = max_dict(Qs)[0]
    a = random_action_td(a, eps=0.5/t) # epsilon-greedy
    biggest_change = 0
    
    while not grid.game_over():
      r = grid.move(a)
      s2 = grid.current_state()

      # we need the next action as well since Q(s,a) depends on Q(s',a')
      # if s2 not in policy then it's a terminal state, all Q are 0
      old_theta = model.theta.copy()
      if grid.is_terminal(s2):
        model.theta += alpha*(r - model.predict(s, a))*model.grad(s, a)
      else:
        # not terminal
        Qs2 = getQs(model, s2)
        a2 = max_dict(Qs2)[0]
        a2 = random_action_td(a2, eps=0.5/t) # epsilon-greedy

        # we will update Q(s,a) AS we experience the episode
        model.theta += alpha*(r + GAMMA*model.predict(s2, a2) - model.predict(s, a))*model.grad(s, a)
        
        # next state becomes current state
        s = s2
        a = a2

      biggest_change = max(biggest_change, np.abs(model.theta - old_theta).sum())
    deltas.append(biggest_change)

  plt.plot(deltas)
  plt.show()

  # determine the policy from Q*
  # find V* from Q*
  policy = {}
  V = {}
  Q = {}
  for s in grid.actions.keys():
    Qs = getQs(model, s)
    Q[s] = Qs
    a, max_q = max_dict(Qs)
    policy[s] = a
    V[s] = max_q

  print("values:")
  print_values(V, grid)
  print("policy:")
  print_policy(policy, grid)
rewards:
---------------------------
-0.10|-0.10|-0.10| 1.00|
---------------------------
-0.10| 0.00|-0.10|-1.00|
---------------------------
-0.10|-0.10|-0.10|-0.10|
it: 0
it: 1000
it: 2000
it: 3000
it: 4000
it: 5000
it: 6000
it: 7000
it: 8000
it: 9000
it: 10000
it: 11000
it: 12000
it: 13000
it: 14000
it: 15000
it: 16000
it: 17000
it: 18000
it: 19000
values:
---------------------------
 0.63| 0.81| 1.03| 0.00|
---------------------------
 0.45| 0.00| 0.63| 0.00|
---------------------------
 0.27| 0.07| 0.17| 0.59|
policy:
---------------------------
  R  |  R  |  R  |     |
---------------------------
  U  |     |  U  |     |
---------------------------
  U  |  U  |  U  |  U  |

8. Summary

This section was all about approximation methods. We learned earlier in our discussions of RL that the state space of an MDP can quickly grow too large to enumerate. Hence, storing a separate value function entry for every state in the state space can be infeasible. Approximation methods allow us to parameterize the value function, so that we effectively need far fewer parameters than the size of the state space.

Specifically, we saw that differentiable methods were a good fit, since they allow us to use stochastic gradient descent and continue learning in an online fashion, just like we have done throughout the course. Remember, online learning is desirable because episodes can be very long, so it is good if our agent can improve during the episode.

The type of differentiable model that we looked at was linear regression. But, we can just as easily apply this to neural networks and deep learning. Using linear regression, we were forced to engineer our own features, since the raw features we had (the $i$ and $j$ coordinates) did not yield an expressive enough model.

We applied approximation methods to MC prediction, TD(0) prediction (semi-gradient), and the control problem via SARSA (again, semi-gradient). Note that we did not apply approximation methods to Q-Learning. Recall that Q-Learning is an off-policy method, and some sources report that approximation techniques applied to off-policy control problems do not have good convergence characteristics. Remember, the main difference between Q-Learning and SARSA is that in Q-Learning the $Q$ value used in the update does not necessarily correspond to the next action we actually take. See "Deep Q-Learning" for how approximation may be applied there.

8.1 Next steps

It is important to realize that the only requirement for our function approximators is that they be differentiable. Recall that deep learning methods are also differentiable. Libraries like Theano and TensorFlow automatically calculate gradients for us, which makes the update equations easier to implement when we have lots of parameters and complex functions.

A few things that we didn't see, but would be worth going over are:

  • Continuous state-spaces: We know that state spaces are not necessarily discrete. For example, if our state is an image of the environment, that image represents the real world: light intensity measured over a continuous 3-dimensional space, and light intensity itself is a continuous quantity.
  • Continuous action-spaces: Action spaces are not necessarily discrete either. For example, your agent may need to decide how much force to apply to a motor. Force is a continuous variable.
  • Policy Gradient Method: We didn't parameterize $\pi$ at all in this course, but that is a next step, commonly referred to as the policy gradient method.
  • Store input-target pairs: Use previous experience to learn. In other words, replay previous episodes and continue to update the model parameters with gradient descent based on those old episodes (a minimal sketch of this idea is shown below).
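To make the last bullet concrete, here is a minimal sketch of the idea in the linear setting of this section, with hypothetical names and data: we store the (feature vector, target) pairs encountered during episodes and keep taking stochastic gradient descent steps on randomly sampled stored experience.

import numpy as np
import random

replay_buffer = []                 # stores (feature vector, target) pairs from past episodes
theta = np.random.randn(4) / 2     # same kind of linear model parameters as before
alpha = 0.01

def store(x, target):
  replay_buffer.append((x, target))

def replay_step(theta):
  # One extra SGD step on a randomly chosen stored sample
  x, target = random.choice(replay_buffer)
  return theta + alpha * (target - theta.dot(x)) * x

# Hypothetical usage: store experience as it is observed, then keep learning from it
store(np.array([1.0, -0.5, 0.0, 1.0]), 0.72)
theta = replay_step(theta)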

© 2018 Nathaniel Dake