3. Components of an RL System¶

Let's quickly recall what we had discussed earlier concerning the components of an RL system. We talked about the following:

Agent: The thing that is playing the game, that we want to program the RL algorithm into.

Environment: The thing that the agent interacts with; the agents world.

State: Specific configuration of the environment that the agent is sensing. Note, the state only represents that which the agent can sense.

Actions: Things that an agent can do that will affect its state. In Tic-Tac-Toe, that's placing a piece on the board. Performing an actions always brings us to the next state, which also comes with a possible reward.

Rewards: Tells you how good your action was, not whether it was a correct or incorrect action. It does not tell you whether it was the best or worst action, it is just a number. The rewards you have received over the course of your existence doesn't necessarily represent possible rewards you could get in the future. For example, you could search a bad part of state space, hit a local max of 10pts, while the global max was actually 1000 pts. The agent does not know that (but in our case we will, since we designed the game).

1.1 Notation¶

We know that being in a state, $S(t)$, and taking an action $A(t)$, will lead us to the reward $R(t+1)$ and the state $S(t+1)$:

$$S(t), A(t) \rightarrow R(t+1), S(t+1)$$

However, when we drop the time index's, we represent this as the 4-tuple:

$$(s,a,r,s')$$

The $r$ is not given a prime as you would expect, but it is standard notation. So, the prime symbol doesn't strictly mean "at time t + 1"

1.2 New Terms¶

The first new term we want to discuss is Episode. This represents one run of the game. For example, we will start a game of tic tac toe with an empty board, and as soon as one player gets 3 pieces in a row, that's the end of the episode. As you may imagine, our RL agent will learn across many episodes. For example, after playin 1000, 10000, or 100000 episodes, we can possibly have trained an intelligent agent. The number of episodes we will use is a hyper parameter and will depend on the game being played, the number of states, how random the game is, etc.

Playing the game tic-tac-toe is an episodic task because you can play it again and again. This is different from a continuous task which never ends. We will not be looking at continuous tasks in this course.

Now, the next question we ask is: when is the end of an episode? Well, there are certain states in the state space that tell us when the episode is over. These are states from which no more action can be taken. They are referred to as terminal states. For tic-tac-toe these are when one player gets 3 in a row, or when the board is full (a draw).

1.3 Cart-Pole / Inverted Pendulum¶

This problem comes up all the time in RL and control systems. If you search google for inverted pendulum, you will see research papers concerning control systems, and if you search cart-pole you will see all kinds of RL research papers. At the beginning of an episode, the cart is stationary and the pole is perpendicular to the ground. Because the system is unstable, the pole will then begin to fall, and the job of the cart is to move so that the pole does not fall down.

When the pole falls so far that it is impossible to get back up, any angle past a threshold where it is impossible to get the pole back up is a terminal state. Note, because the angle in this example is a real number, is a continuous/infinite space. We will not deal with these.

1.4 Assigning Rewards¶

One difficult problem in reinforcement learning that we will come across is: defining rewards. We, the programmers, can be thought of as coaches to the AI. The reward is something that we give to the agent. So, we can define how we are going to reward the agent, which will drive how it learns.

For example, if we just give it the same reward no matter what it does, then the agent will probably just end up acting randomly, since any action will lead to the same value. You don't want to do that, because it will encourage bad behavior.

A real situation could be seen with a robot trying to solve a maze. If it manages to exit the maze it would receive a reward of 1, else it would receive 0. Most people would think that this seems reasonable, and it is semi-reasonable. However, with this reward structure, the robot may never actually solve the maze. So, if it has only ever received a reward of 0, it may think that it is the best it can do. A better solution would be to give the robot a -1 for every step it takes, and then it would be encouraged to solve the maze as quickly as possible.

1.4.1 Caution¶

One point of caution is to not build your own prior knowledge into the AI. For example, in a game such as chess, an agent should be rewarded only for winning, not taking opponent's pieces, or implementing some strategy that you read about in a chess book. You want to leave the agent free to come up with its own solution. The danger of rewarding the agent for achieving sub goals, is that they may find a novel way to maximize the reward for the subgoals, without actually winning the game. For example, taking all of the opponents chess pieces and then still losing the game.

So to summarize, we can say that:

"The reward signal is your way of telling the agent what you want it to achieve, now how you want it to be achieved."

3. The Value Function¶

Take a moment to consider the following scenario. You have an exam tomorow. You would like to hang out with your friends. You know if you hangout with you friends you will most likely have a dopamine hit and feel happy. On the other hand if you study for your exam you will feel tired and potentially bored. So why study? Well, this is the idea of planning.

In particular, we can describe a value function:

Value Function: We don't just think about immediate rewards, but future rewards too. We want to assign some value to the current state that reflects the future too.

Now, we can think of this in the reverse direction as well. Let's say you receive a reward; getting hired for your dream job. Now, if you look back to you career and things you did in school, what would you attribute your success to? What state of being in your past lead you to get your dream job today? This is refered to as the credit assignment problem:

Credit Assignment Problem: What did you do in the past that led to the reward you are receiving now? In other words, what action gets the credit.

The credit assignment problem shows up in online advertising as well, but the concept is referred to as attribution. The idea is that if a user is shown the same ad 10 different times before they buy, which ad gets the credit?

Now, closely related to the credit assignment problem is the idea of delayed rewards. Note that these are all kind of just different ways of saying the same thing, and the solution is also the same. Delayed rewards is just looking at the problem from the other direction. With credit assignment, we are looking into the past and asking "what action lead to the reward we are getting now?". With delayed rewards, we are asking "How is the action I am taking now related to rewards I may potentially receive in the future?"

The idea of delayed rewards tells us that an AI must have the ability of foresight, or planning.

3.1 Example Scenario¶

Imagine the following: you are in state A, which is the second last state in a game. There are only two possible next states, both of which are terminal states. B gives you a reward of 1, and C gives you a reward of 0. You have a 50% probability of going to either state; perhaps your agent doesn't know which one is best. We can think of the value of state B as 1, and the value of state C as 0. So, what is the value of state A? We can think of it as 0.5, since it is the expected value of your final reward, given that you have a 50% chance of ending up in either state:

$$Value(A) = 0.5*1 + 0.5*0 = 0.5$$

Now, lets say that you are in state A, and state A can only lead to state B, and there is no other possible next state:

If B gives you a reward of 1, then perhaps A's value should also be 1, since the only possible final scenario is to have a final reward of 1, once you reach A:

$$Value(A) = 1 * 1 = 1$$

Thus, the value tells us about the "future goodness" of a state. We can make this a little more formal; we actually call this value, the value function.

3.2 Value Function¶

The value function is a measure of the future rewards that we may get:

V(s) = the value (taking into account the probability of all possible future rewards) of a state

The name value, is not quite ideal, since it is very ambiguous. However, it is part of the RL nomenclature, so we will deal with it.

3.2.1 Rewards vs. Values¶

It is easy to get rewards and values mixed up. The difference is that the value of state, is a measure of the possible future rewards we may get from being in that state. Rewards on the other hand are immediate.

We, therefore, chose actions to take based on values of states, and not on the reward we would get by going to a state! The reward is the main goal, but we can't use the reward to tell us how good a state is, since it doesn't tell us anything about future rewards.

3.2.2 Efficiency¶

One way to think about the value function, is that it is a fast and efficient way to determine how good it is to be in a state, without needing to search the rest of the game tree. You could try to enumerate every possible state transition, and their probabilities of occuring; however, you can guess that this would be a computationally inefficient task. In fact, tic tac toe is easy since it is only a 3x3 board, so the number of states is approximately 3^(3x3) = 19683. However, that will grow exponentially with the size of the board! For example, if you have a connect 4 board, then you get 3^(4x4) = 43 million! As we know, exponential growth is never good, and hence searching the state space is only possible in the simplest of cases. Hence, a value function that can tell you how good you will do in the future, given only your current state, is extremely helpful. This means that $V(s)$ gives an answer instantly, in only $O(1)$ time! The only question is, is it even possible to find a value function...

3.2.3 Value Functions in RL¶

Estimating the value function is a central task in RL, but it should be noted that not all RL algorithms require it. For instance, a genetic algorithm simply spawns offspring that have random genetic mutations, and the ones who survive the longest will go on to spawn offspring in the next generation. So, by pure evolution and natural selection, we can breed better and better agents that get iteratively better at the game! However, this is not the type of algorithm that we are interested in for RL, most of the time.

3.2.4 Value Function: Math¶

So, we can dig into the math and notation surrounding the value function now. The value function takes in one parameter, the state, so we can denote it $V(s)$. It is the expected value of all future rewards, given that the current state is $s$:

$$V(s) = E[all \; future \; rewards \; | \; S(t) = s]$$

In other words, this is the average value of all possible future rewards, given that the current state is $s$.

3.2.5 Finding $V(s)$¶

Let's now look at a generic algorithm that we can use to find the value function. For now, we are going to introduce some unrealistic constraints, to show that the problem can actually be solved. However, later on it will be clear that we don't need to do this. The algorithm we will use is an iterative one, and each time we run the loop we will get closer and closer to the true value function.

So, the first thing that we do is initialize the value function:

For all states where the agent wins, we say the value is 1. $$V(s) = 1 \text{if s = winning state}$$

For all states where the agent loses or it is a draw, we say the value is 0. $$V(s) = 0 \text{if s = losing state}$$

For all other states, we say the value is 0.5. $$V(s) = 0.5 \text{if s = not winning or losing}$$

When we study the "real" algorithms soon, we will not need to focus on such careful initialization. For this game, $V(s)$ can be interpreted as probability of winning after arriving in s (for this game only).

Update Function
The update function looks simple, but there are some hidden details. It kind of looks like gradient descent:

$$V(s) \leftarrow V(s) + \alpha * (V(s') - V(s))$$

We take the value function at a state, $V(s)$, and we update it by adding the learning rate times $V(s')$ minus $V(s)$; here, $s$ is the current state, and $s'$ is the next state.

3.2.6 Detail 1¶

The first detail here is that $s$ represents every state that we encounter in an episode. That means that for each iteration of the loop, we actually need to play the game and remember all of the states that we were in. We then loop through each of the states in the state history, and then update the value using the above equation. Notice that a terminal state value will never be updated, because to update a state you need the next state value, which does not exist if we are in a terminal state.

In Pseudocode it may look like:

for t in range(max_iterations):
  state_history = play_game
  for (s, s') in state_history from end to start:
    V(s) = V(s) + learning_rate * (V(s') - V(s))

3.2.7 Detail 2¶

The next detail we need to think about is how do we actually play the game. Are we just going to take random actions? No! First, think about what taken random actions would lead to; it would result in a game tree having different probabilities than a game tree that was created by taking best actions. Second, remember that we have the value function! The value function tells us how good a state is, based on how good the future states will be. So, all we need to do is perform the action that leads to the next best state.

In pseudocode that may look like:

maxV = 0
maxA = None
for a, s' in possible_next_states:
  if V(s') > maxV:
    maxV = V(s')  // max value 
    maxA = a      // max action that lead to max value
perform action maxA

3.2.8 Problem with this¶

What is the main problem with this strategy? The problem is that the value function isn't accurate. If had the true value function, we wouldn't need to do any of this work in the first place. This problem is one that we have touched on before; the explore-exploit dilemma. In particular, by taking random actions we can learn about new states that we otherwise may never have gone to. By doing so, we can improve our estimate of the value function for those states.

However, in order to win, we need to take the action that yields the maximum value. So, we will be using the epsilon-greedy strategy to be making a tradeoff between exploration and exploitation.

3.2.9 Intuition¶

Let's take a moment to gain some intuition about why this iterative update works. First, you should recognize this iterative update equation from before. It is reminiscent of the low-pass filter and average-value-finding equation we saw earlier, as well as gradient descent. V(s') is like the target value, and we want to move $V(s)$ towards that value. However, with RL there are multiple possible next states, so there are multiple s'. So, by doing this update equation multiple times, we are pulling V(s) in multiple different directions all at once. The idea is that, by playing infinitely many episodes, the proportion of time we spend in s' will approach the true next state probabilities.

3.2.10 Extremely Important Detail¶

One incredibly important detail that is somewhat difficult to discern given only the update formula, is the order in which you have to update the values for any given state. We know that for any particular episode, we are only going to be updating the values for the states that were in that episode. But, the key of this equation is that we are moving $V(s)$ closer to $V(s')$. That mean that we are doing this update under the assumption that $V(s')$ is more accurate than $V(s)$. For the terminal state this is true, since that is always going to be 0 or 1, and will never change. But, if $V(s)$ and $V(s')$ are of the same accuracy, aka they are both not accurate at all, then making one closer to the other won't make anything better. So, the idea is we want to move backwards through the state history, because the current state value is always updated using the next state value. And we want the next state value to be more accurate, so we should update the state values in reverse order. This is also consistent with the terminal state being precisely accurate with a value of 0 or 1 that never changes.

$$V(terminal) = 0 \; or \; 1$$

4. Tic-Tac-Toe Implementation¶

Typically in ML we're implementing procedural algorithms. However, in RL we have multiple objects interacting, and we take an OOP approach.

At a high level, we will have two main objects: the agent and the environment. During any game, there will be two instances of the agent, and they will both interact with the same instance of environment.

In order to make these things interact, we can create a function called play_game(), which accepts the agent for player 1, the agent for player 2, and the environment object. Inside of this function, it will essentially be a single loop. We will alternate between the two players, and at each iteration of the loop we need to do a few things.

We need to make the current player perform an action which updates the environment.
We need to switch who the current player is, so that it alternates.
We need to check if the game is over, since that is when the loop will terminate.

We can write this function now:

def play_game(p1, p2, env, draw=False):
  # Loop until the game is over
  current_player = None
  while not env.game_over():
    # Alternate between players
    current_player = p2 if current_player == p1 else p1
    
    # Draw the board before the user who wants to see it makes a move
    condition1 = draw == 1 and current_player == p1
    condition2 = draw == 2 and current_player == p2
    if draw and (condition1 or condition2):
      env.draw_board()
      
    # Current player makes move
    current_player.take_action(env)
    
    # Update State histories
    state = env.get_state()
    p1.update_state_history(state)
    p2.update_state_history(state)
    
  if draw:
    env.draw_board()

  # Do the value function update
  p1.update(env)
  p2.update(env)

We can see that above we have created a partial API for some instance methods that our objects will need.

The environment will need a game_over() function, that returns a boolean. True if the game is over, false otherwise. You can guess that this function will need to scan the board to check if there is a winner, or if the board is full and it is a draw.
The agent will need a take action function, which accepts as a parameter the environment. This will update the environment, and hence the state.
Another function we need on the agent is an update function, that updates the agents internal estimate of the value function. This is where the update equation that we have talked about is going to go. The update function will have to accept the environment as well, since it will have to query the most current reward from the environment.
Another function that will be very useful for us, but not entirely necessary, is a draw board function. This will display what positions are currently occupied, and by what pieces.

Representing States¶

The next thing that we need to consider is how we will be representing states. We talked about this earlier and know that this is going to be a $O(1)$ lookup. One way to accomplish this will be to use a dictionary, since the keys to a dictionary can be any immutable object. So, we could convert the game board to a 2-d tuple (size 3x3) and store the values inside as tuple elements.

However, we can also do this so that each state maps to a number, and in doing so we can store this as a numpy array.

Mapping state to a number: We talked earlier about how to calculate the upper limit on the number of states earlier. We calculated it to be 3^9 = 19683, which is not too big of an array to store in memory. Notice, this also encodes impossible states, such as an x in all 9 board locations, when in fact you only need 3 in a row to win.

The idea of having 9 locations, and 3 possible values per location should feel similar to the idea of binary numbers. $N$ bits can store $2^N$ different integers, and we can enumerate all of those possibilities by finding all of the permutations of a string of length $N$, where each location can only contain 0 or 1. The equation to convert a binary number to a decimal number is:

$$D = 2^{N-1}b_{N-1} + ... + 2^1 b_1 + 2^0 b_0$$

And in our case it is:

$$D = 3^{N-1}b_{N-1} + ... + 3^1 b_1 + 3^0 b_0$$

In pseudocode we may end up with something like this:

k = 0 
h = 0
for i in range(LENGTH):
  for j in range(LENGTH):
    if self.board[i,j] == 0:
      v = 0 
    elif self.board[i,j] == self.x:
      v = 1
    elif self.board[i,j] == self.o:
      v = 2
    h += (3**k) * v
    k += 1 
return h

Keep in mind the following: The entire reason for creating a the above function/pseudocode is to be able to take what was a 3x3 board (array) and map it to a single number. Note, this number represents the board perfectly and contains all of the information necessary to rebuilt the board.

Initialize the value function¶

We can now talk about how we will be initializing the value function properly. So, the initialization that we need is:

$$\text{1 if s == winning terminal state}$$
$$\text{0 if s == losing or draw terminal state}$$ $$\text{0.5 otherwise}$$

In order to do this, we need to enumerate every possible state, and assign them values. Here is a question to think about; is it better to create a game tree, where you enumerate every possible sequence of moves, or is it better to permute all possible settings of all possible positions on the board, even though some of those represent impossible game states?

The problem with the game tree is that we end up with many redundant states. For instance, consider the case where x goes in the top middle, and then o goes in the center. We can reach this same state if o goes in the center, and then x goes in the top middle.

If we take a second to think about how many choices this will lead to we can see that at the root we have 9 different choices, and then at the layer below we have 8 different choices, and then 7 and so on. Hence the total number of states that we will see is $9!$ which is equivalent to 326,880, and is much greater than $3^9$. So, the algorthms that we want to use to generate the states is definitely generating all permutations.

How do we go about generating these permutations? Well, this is naturally a recursive problem, since a binary number of length $N$ can start with either 0 or 1, and then we can use all of the permutations for a binary number of length $N-1$, and append it to that. In pseudocode that may look like:

def generate_all_binary_numbers(N):
  results = []
  child_results = generate_all_binary_numbers(N-1)
  for prefix in ('0','1'):
    for suffice in child_results:
      new_result = prefix + suffix 
      results.append(new_result)
  return results

Note that above the base case is not shown for simplicity. Okay, now that was just an example for doing permutations in general. We are going to have a recursive function called get_state_hash_and_winner and this is going to return a list of triples. Each triple is going to contain the state as a hash integer, who the winner is if there is one, and whether or not this state represents a finished game. The 3 arguments for this function are the environment, which we really just need for the board matrix, and the i and j coordinates to place the next value, which can be 0, x, or o.

def get_state_hash_and_winner(env, i=0, j=0):
  """Recursive function that will return all possible states (as ints) and who the
  corresponding winner is for those sates (if any). (i,j) refers to the next cell 
  on the board to permute (we need to try -1, 0, 1). Impossible games are ignored;
  i.e. 3x's and 3o's in a row simultaneously."""
  results = []
  for v in (0, env.x, env.o):
    env.board[i, j] = v # if empty board, it should already be 0 
    if j == 2: 
      if i == 2:
        # Base Case: i = 2, j = 2, filling last board location, collect results and return
        state = env.get_state()
        ended = env.game_over(force_recalculate=True)
        winner = env.winner
        results.append((state, winner, ended))
      else: 
        # Recursive Call 1, at end of row: Want to go to next row, first column (j=0, i++)
        results += get_state_hash_and_winner(env, i + 1, 0)
    else:
      # Recursive Call 2, not at end of row: Increment j, leave i the same 
      results += get_state_hash_and_winner(env, i, j + 1)
  return results

We should also recognize that the value function for the two players won't be equivalent. One player has to be x and the other player has to be o. So, if the value function is 1 for x, it is going to be 0 for o, and vice versa. However, in the case of the draw both players do get 0. So, we will have two functions which are almost the reverse of eachother;= initialV_x and initialV_o. Note, these functions could most definitely be composed in a more efficient way, but we will leave them separate for clarity purposes.

def initialV_x(env, state_winner_triples):
  """Initialize state values as follows: 
  * if x wins, V(s) = 1
  * if x loses or draw, V(s) = 0
  * otherwise, V(s) = 0.5"""
  V = np.zeros(env.num_states)
  for state, winner, ended in state_winner_triples:
    if ended:
      if winner == env.x:
        v = 1
      else:
        v = 0
    else: 
      v = 0.5
    V[state] = v
  return V

def initialV_o(env, state_winner_triples):
  """Almost exact opposite of initialV_x: 
  * if o wins, V(s) = 1
  * if o loses or draw, V(s) = 0
  * otherwise, V(s) = 0.5"""
  V = np.zeros(env.num_states)
  for state, winner, ended in state_winner_triples:
    if ended:
      if winner == env.o:
        v = 1
      else:
        v = 0
    else: 
      v = 0.5
    V[state] = v
  return V

Recap¶

So, above, we just went over the following:

In order to initialize V(s) to our desired values, we must enumerate all the states
- There were two ways of doing this: either searching the game tree (9!) or finding board permutations ($3^9$)
We then coded out the recursive function that we are going to use in order to enumerate all of the states.

We are slowly putting the pieces together for our tic tac toe game!

The Environment Class¶

Recall that the environment is the thing that our agents are going to interact with. Lets take a look at the methods that we were going to look at.

__init__(): The constructor, where we initialize important instance variables

is_empty(i, j): returns true if location (i, j) is empty

reward(symbol): returns reward when we pass a symbol representing specific player

get_state(): looks at state of the board and returns a number

game_over(): looks at state of game and returns true or false. Will take a force_recalculate argument. This is because game_over is called multiple times in our script, and if a game has ended already, we don't want to have to do this complicated operation again and again-so we use the fact that it has alreaday ended as a shortcut. But, in our earlier function get_state_hash_and_winner, we recursively filled out each location on the board. This involves placing pieces on the board, taking them off, placing other pieces on, and so on. This means we will be going back and forth between a game over state and a non game over state, so we are not able to use that short cut. force_recalculate allows us to make sure the entire operation happens every time.

draw_board(): Draws tic-tac-toe board

class Environment:
  """Represents a tic-tac-toe game."""
  def __init__(self):
    self.board = np.zeros((LENGTH, LENGTH)) # intialize as 0, our empty symbol
    self.x = -1 # represents an x on the board, player 1
    self.o = 1 # represents an o on the board, player 2 
    self.winner = None
    self.ended = False
    self.num_states = 3**(LENGTH*LENGTH)
    
  def is_empty(self, i, j):
    return self.board[i, j] == 0
  
  def reward(self, sym):
    # no reward until game is over
    if not self.game_over():
      return 0
    return 1 if self.winner == sym else 0
  
  def get_state(self):
    """Returns the current state, represented as an int from 0...|S|-1,
    where S = set of all possible states. |S| = 3^(BOARD SIZE), since 
    each cell can have 3 possible values - empty, x, o - some states are 
    not possible, e.g. all cells are x, but we can ignore that detail.
    This is like finind the integer represented by a base 3 number."""
    k = 0
    h = 0
    for i in range(LENGTH):
      for j in range(LENGTH):
        if self.board[i,j] == 0:
          v = 0
        elif self.board[i,j] == self.x:
          v = 1
        elif self.board[i,j] == self.o:
          v = 2
        h += (3**k) * v
        k += 1
    return h
  
  def game_over(self, force_recalculate=False):
    """Returns true if game over (a player has won or it's a draw), 
    otherwise returns false. Also, sets 'winner' instance variable 
    and 'ended' instance variable."""
    if not force_recalculate and self.ended:
      return self.ended
    
    # -> Check if game over 
    # check rows
    for i in range(LENGTH):
      for player in (self.x, self.o):
        if self.board[i].sum() == player * LENGTH:
          self.winner = player
          self.ended = True
          return True
    
    # check columns
    for j in range(LENGTH):
      for player in (self.x, self.o):
        if self.board[:,j].sum() == player * LENGTH:
          self.winner = player
          self.ended = True
          return True
        
    # check diagonals - use trace (sum of all elements in a matrix along main diagonal)
    for player in (self.x, self.o):
      # top-left -> bottom-right diagonal
      if self.board.trace() == player * LENGTH:
        self.winner = player
        self.ended = True
        return True
      # top-right -> bottom-left diagonal (flip matrix, then trace)
      if np.fliplr(self.board).trace() == player*LENGTH:
        self.winner = player
        self.ended = True
        return True
    
    # check if draw -> are all elements on board simultaneously non-zero
    if np.all((self.board == 0) == False):
      self.winner = None
      self.ended = True
      return True
    
    # game is not over
    self.winner = None
    return False
  
  def draw_board(self):
    for i in range(LENGTH):
      print("-------------")
      for j in range(LENGTH):
        print("  ", end="")
        if self.board[i,j] == self.x:
          print("x ", end="")
        elif self.board[i,j] == self.o:
          print("o ", end="")
        else:
          print("  ", end="")
      print("")
    print("-------------")

The Agent Class¶

The next major class in our tic tac toe game will be the agent. So far, we know that this is the AI, or the thing that contains the AI. This is different from regular machine learning, because in this situation we are not just feeding it data. Instead, our agent needs to interact with the environment. A one-line update for $V(s)$ is a very small part of our agent, and yet it is responsible for 100% of its intelligence.

We can begin by looking at an overview of our instance methods:

__init__():

setV(V): Allows us to initialize the value function. Recall our value function is going to be initialized in a very specific way.

set_symbol(symbol): Allows us to give the agent a symbol that it will use when making move.

set_verbose(b): Prints more information if verbose is set to true.

reset_history(): Need this because we want to keep track of state history for every episode. However, once the episode is finished and we are done learning, we don't need that state history anymore and we can start a new episode.

take_action(env): Takes in the environment as an input. Checks the board for valid moves, and will make moves based on a strategy. Recall, we are using the epsilon greedy strategy for this game.

update_state_history(s): Adds a state to the state history. Updated whenever any player takes a turn.

update(env): Update function. Queries environment for latest reward, which corresponds to the end of an episode, because we are only going to call update at the end of an episode. This is where all of the learning will happen.

class Agent:
  def __init__(self, eps=0.1, alpha=0.5):
    self.eps = eps # probability of choosing random action instead of greedy
    self.alpha = alpha # learning rate
    self.verbose = False
    self.state_history = []
    
  def setV(self, V):
    self.V = V
    
  def set_symbol(self, sym):
    self.sym = sym
    
  def set_verbose(self, v):
    # if true, will print values for each position on the board
    self.verbose = v
    
  def reset_history(self):
    self.state_history = []
    
  def take_action(self, env):
    # choose an action based on epsilon-greedy strategy 
    r = np.random.rand()
    best_state = None
    
    if r < self.eps:
      # take random action 
      if self.verbose: print("Taking a random action.")
      possible_moves = []
      for i in range(LENGTH):
        for j in range(LENGTH):
          if env.is_empty(i, j): # find all empty positions on board
            possible_moves.append((i, j))
      idx = np.random.choice(len(possible_moves)) # select random move from empty positions
      next_move = possible_moves[idx] 
    
    else: 
      # Greedy portion. Choose the best action based on current values of states. 
      # Loop through all possible moves, get their values. Keep track of best value. 
      pos2value = {} # for debugging
      next_move = None
      best_value = -1 
      for i in range(LENGTH):
        for j in range(LENGTH):
          if env.is_empty(i, j):
            # what is the state if we made this move?
            env.board[i, j] = self.sym
            state = env.get_state()
            env.board[i, j] = 0 # don't forget to change it back!
            pos2value[(i,j)] = self.V[state]
            if self.V[state] > best_value:
              best_value = self.V[state]
              best_state = state
              next_move = (i, j)
      
      # if verbose, draw the board w/ the values
      if self.verbose:
        print("Taking a greedy action")
        for i in range(LENGTH):
          print("------------------")
          for j in range(LENGTH):
            if env.is_empty(i, j):
              # print the value
              print(" %.2f|" % pos2value[(i,j)], end="")
            else:
              print("  ", end="")
              if env.board[i,j] == env.x:
                print("x  |", end="")
              elif env.board[i,j] == env.o:
                print("o  |", end="")
              else:
                print("   |", end="")
          print("")
        print("------------------")  
    
    # make the move
    env.board[next_move[0], next_move[1]] = self.sym
    
  def update_state_history(self, s):
    """Cannot put this in take_action, because take_action only happens once every 
    other iteration for each player. State history needs to be updated every iteration.
    s = env.get_state(), don't want to do this twice, so pass it in."""
    self.state_history.append(s)
    
  def update(self, env):
    """Contains the 'AI'. Only want to do this at the end of an episode. We want to 
    BACKTRACK over the states, so that: 
    -> V(prev_state) = V(prev_state) + alpha * (V(next_state) - V(prev_state))
    -> where V(next_state) = reward if it's the most current state 
    NOTE: we ONLY do this at the end of an episode. Also, we can see that the first 
    target is exactly equal to the final reward. But after that the targets are all 
    just Value estimates. The hope is that our value estimates will converge over time."""
    reward = env.reward(self.sym)
    target = reward 
    for prev in reversed(self.state_history):
      value = self.V[prev] + self.alpha * (target - self.V[prev])
      self.V[prev] = value
      target = value
    self.reset_history()

Remaining functionality¶

class Human:
  """Human class, created so that we can play against the AI. We will set verbose to true
  so that we can see the value of each position every time the AI takes turn."""
  def __init__(self):
    pass
  
  def set_symbol(self, sym):
    self.sym = sym
    
  def take_action(self, env):
    while True:
      # Break if we make a legal move
      move = input("Enter coordinates i,j for your next move (i,j=0..2): ")
      i, j = move.split(',')
      i = int(i)
      j = int(j)
      if env.is_empty(i, j):
        env.board[i, j] = self.sym
        break
  def update(self, env):
    "Place holder."
    pass
  
  def update_state_history(self, s):
    "Place holder."
    pass

import numpy as np
import matplotlib.pyplot as plt

LENGTH = 3

if __name__ == '__main__':
  # train the agent
  p1 = Agent()
  p2 = Agent()

  # set initial V for p1 and p2
  env = Environment()
  state_winner_triples = get_state_hash_and_winner(env)


  Vx = initialV_x(env, state_winner_triples)
  p1.setV(Vx)
  Vo = initialV_o(env, state_winner_triples)
  p2.setV(Vo)

  # give each player their symbol
  p1.set_symbol(env.x)
  p2.set_symbol(env.o)

  T = 10000
  for t in range(T):
    if t % 200 == 0:
      print(t)
    play_game(p1, p2, Environment())

  # play human vs. agent
  # do you think the agent learned to play the game well?
  human = Human()
  human.set_symbol(env.o)
  while True:
    p1.set_verbose(True)
    play_game(p1, human, Environment(), draw=2)
    # I made the agent player 1 because I wanted to see if it would
    # select the center as its starting move. If you want the agent
    # to go second you can switch the human and AI.
    answer = input("Play again? [Y/n]: ")
    if answer and answer.lower()[0] == 'n':
      break

Notes on Learning/Value Function¶

At this point you should have a good high level overview of the reinforcement learning process used in order to train our agent to perform optimally at tic tac toe. However, there may still be some haze surrounding the exact process-this is okay, considering there was a lot going on above that didn't even have to do with our agent. The actual AI portion was kept to only a few lines of code, with the rest being the equivalent of a first year CS coding assignment (using object oriented programming in order to build an environment, human, agent, etc).

Let's walk through the diagram below to ensure that we have a complete understanding of how our agent is learning over time.

Above we can see that the process essentially works as follows:

We initialize our value function. We must be aware in this case that our value function is not defined in the "traditional" sense; in other words, we don't technically have a view of its internals. We simply have the final mapping of input (state) to output (value). Mathematically that just looks like: $$V(s) = value$$ We should be aware that this means that our value function will be an array. The index of the array will be the state (recall that our state is represented as a number, determined via a mapping from the board position). We can then access the value of a certain state like so: V[14021] = some value

We then enter the iterative process of training. Above that is set to be 10,000 iterations. In this process, we begin by playing a game, which we refer to as an episode. Here, when it is our agents turn, the value function is used to decide where to place the game piece! In other words, if the board is fully empty, then we have 9 possible options as to where we should place our piece. The agent will check the value for each of those 9 states (using our value function), and then choose the arg max (they all will be initialized to 0.5 since they are not winning or losing states).

However, if we always simply choose the best action we may converge too quickly, and miss exploring certain parts of the value function state space. To prevent this, an epsilon greedy approach is used! In other words, a small percentage of the time, instead of choosing the best action based on the value function, we will choose a random action to take.

At the end of each episode, we will take the state history for that particular episode and use it to update our value function. An example is shown in the diagram above. We see that in this episode, our agent won by placing 3 x's in a row diagonally. Based on how we initialized our value function, the last state (a winning state) has a value of 1: $$V(s) = 1$$ We can then update the value of our 2nd last state, as follows: we know that it originally was initialized to be 0.5 (since it was not a winning or losing state), so $V(s) = 0.5$. We have an $\alpha$ of 0.5, and our previous state (remember, they are reversed in this process) was the winning state-hence $V(s') = 1$. So our update equation looks like: $$V(s) \leftarrow V(s) + \alpha * (V(s') - V(s))$$
$$V(s) \leftarrow 0.5 + 0.5 * (1 - 0.5) = 0.75$$
We will continue doing this for the entire state history associated with this episode. So, for the 3rd last state we have: $$V(s) \leftarrow 0.5 + 0.5 * (0.75 - 0.5) = 0.625$$

We can see that for each episode, we will iteratively update our value function and over time it should approach the true value function. It is also clear why using an explore-exploit approach is beneficial; if we failed to do so, the state spaces that we update first as the only ones that would ever be chosen (since all others would remain at 0.5!)

RL vs. Supervised Learning¶

A common question that comes up is why not use a supervised learning model that maps states to actions directly? Well first and foremost, how would we come up with labels? The state space would grow quickly and become too large to enumerate; tic-tac-toe/connect-4 is probably the limit. Now imagine we are playing a video game with 60 FPS, where there are millions of pixels, and hundreds of values per pixel; the number of states in a game like this is so large it is hard to conceptualize. There is also the fact that labels only tell us if we are right or wrong; a reward is more flexible and gives us information because it tells us how right or how wrong. Rewards tell us how good we are doing. Rewards are delayed, but the value function has this information built in.