We are now going to formalize some of the concepts that we have learned about in reinforcement learning. We have learned about the terms:
- Agent
- Environment
- Action
- State
- Reward
- Episode
This section is about putting these concepts into a formal framework called Markov Decision Processes.
In this section we are going to describe the game that we are going to use for the rest of this course. It is in some ways simpler than tic-tac-toe, but it has features that allow us to explore some of the more interesting properties of RL.
In this game our agent is a robot, and the environment is a grid. The agent is allowed to move in 4 directions: up, down, left, and right. Grid world is generally built in the following way:
- at position (1, 1) there is a wall, so if the robot tries to go there it will bump into the wall.
- (0,3) is a winning state (terminal state with a +1 reward)
- (1, 3) is a losing state (terminal state with a -1 reward)
One thing we will notice about gridworld is that it has a much smaller number of states than tic-tac-toe; there are only 12 positions, 11 states (one for each cell the robot can occupy), and 4 actions; that is a small game! However, there are many concepts to be learned!
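Here is a minimal sketch of how this grid world might be represented in Python. The (row, column) coordinate convention with (0, 0) in the top-left of a 3x4 grid, the reward of 0 for non-terminal moves, the behavior of staying in place when bumping into the wall or the edge, and the helper names are all assumptions made for illustration, not a fixed implementation.
# A minimal sketch of the grid world above. Assumptions for illustration: a 3x4 grid
# of (row, col) positions with (0, 0) in the top-left, a reward of 0 for every
# non-terminal transition, and bumping into the wall (or the edge) leaves the robot in place.
ROWS, COLS = 3, 4
WALL = (1, 1)
TERMINAL = {(0, 3): +1.0,   # winning state
            (1, 3): -1.0}   # losing state
ACTIONS = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}

# The 12 positions minus the wall give the 11 states the robot can occupy.
STATES = [(i, j) for i in range(ROWS) for j in range(COLS) if (i, j) != WALL]

def step(state, action):
    """Deterministic move; invalid moves (off the grid or into the wall) do nothing."""
    di, dj = ACTIONS[action]
    nxt = (state[0] + di, state[1] + dj)
    if nxt == WALL or not (0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS):
        nxt = state
    return nxt, TERMINAL.get(nxt, 0.0)

print(len(STATES))             # 11
print(step((0, 2), 'right'))   # ((0, 3), 1.0): the robot reaches the winning state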
Let us first review the Markov Property in the strict mathematical sense. Suppose we have a sequence:
$$\{x_1, x_2,...,x_t\}$$We can define a conditional probability on $x_t$, given all the previous $x$'s:
$$p\{x_t \mid x_{t-1}, x_{t-2}, ..., x_1\}$$Generally speaking, this can't be simplified. However, if we assume that the Markov property holds, then it can be simplified. The Markov property specifies how many previous $x$'s the current $x$ depends on. So, First-Order Markov means that $x_t$ depends only on $x_{t-1}$:
$$p\{x_t \mid x_{t-1}, x_{t-2}, ..., x_1\} = p \{x_t \mid x_{t-1} \}$$A Second-Order Markov means that $x_t$ only depends on $x_{t-1}$ and $x_{t-2}$:
$$p\{x_t \mid x_{t-1}, x_{t-2}, ..., x_1\} = p \{x_t \mid x_{t-1}, x_{t-2} \}$$For now we will be working with first-order Markov only, and we typically refer to this as the Markov property.
Consider the sentence: "Let's do a simple example". Let's say that you are given:
"Let's do a simple"
In this case it is relatively easy to guess that the next word is "example". Now, suppose all we are given is:
"simple"
It is no longer easy to predict the next word. This can be even harder! For instance, what if we were just given:
"a"
Now it is very difficult to predict the next word, "simple". Well, that is what the Markov assumption is; we can clearly see that it can be quite limiting. However, we can also define the problem so that it is not.
So, what exactly does the Markov property look like in RL? Recall that taking an action $A(t)$ while in state $S(t)$ produces two things: the next state $S(t+1)$ and a reward $R(t+1)$:
$$\{S(t), A(t) \} \rightarrow \{ S(t+1), R(t+1)\}$$What the Markov property says in this case is that $S(t+1)$ and $R(t+1)$ depend only on $A(t)$ and $S(t)$, and not on any $A$ or $S$ before that:
$$p\big(S_{t+1}, R_{t+1} \mid S_t, A_t, S_{t-1}, A_{t-1},...,S_0, A_0\big) = p \big( S_{t+1}, R_{t+1} \mid S_t, A_t\big) $$For convenience, we can also use the shorthand symbols we have mentioned earlier: $s, a, r, s'$:
$$p(s', r \mid s, a) = p(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a)$$So, how is this different from the normal way that we usually write the Markov property? Well, notice that this is a joint distribution on $s'$ and $r$. So, it is telling us the joint distribution of two variables, conditioned on two other variables. This is different from the usual Markov form, where we have one variable on the left, and one variable on the right.
Given the above joint conditional distribution, it is of course just a matter of using the rules of probability to find the marginal and conditional distributions. For example, if we just want to know $s'$ given $s$ and $a$ we can use:
$$p(s' \mid s, a) = \sum_{r \in R}p(s', r \mid s, a )$$And if we just wanted to know $r$ given $s$ and $a$:
$$p(r \mid s,a) = \sum_{s' \in S} p(s', r \mid s, a)$$Also, note that for essentially all cases that we will consider, these probabilities will be deterministic. That means that the reward you get for going to a given state will always be the same, and taking a given action in a given state will always take you to the same next state.
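To make these marginalizations concrete, here is a small sketch: for one fixed $(s, a)$ pair we store $p(s', r \mid s, a)$ as a table indexed by $s'$ and $r$ (the particular numbers are made up for illustration), and summing over one axis recovers each marginal.
import numpy as np

# Hypothetical joint distribution p(s', r | s, a) for one fixed (s, a):
# rows index 3 possible next states s', columns index 2 possible rewards r.
next_states = ['s1', 's2', 's3']
rewards = [0.0, 1.0]
p_joint = np.array([[0.1, 0.2],
                    [0.3, 0.1],
                    [0.2, 0.1]])   # entries sum to 1

p_next_state = p_joint.sum(axis=1)   # p(s' | s, a): sum over rewards
p_reward     = p_joint.sum(axis=0)   # p(r  | s, a): sum over next states

print(p_next_state)   # [0.3 0.4 0.3]
print(p_reward)       # [0.6 0.4]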
Let's look at a recent application of RL to demonstrate that the Markov assumption is not necessarily limiting. DeepMind used the concatenation of the 4 most recent frames to represent the current state when playing Atari games.
We have essentially been looking at MDPs this entire time, but just not referring to them by name. Any RL task with a set of states, actions, and rewards that follows the Markov property is a Markov decision process.
Formally speaking, the MDP is a 5 tuple, made up of:
- Set of states
- Set of actions
- Set of rewards
- State-transition and reward probabilities (as defined jointly earlier)
- Discount factor
There is one final piece needed to complete our puzzle. The other key term in Markov decision process is decision. The way that we make decisions, and choose which actions to take in which states, is called a policy. We generally denote the policy with the symbol $\pi$. Technically, $\pi$ is not part of the MDP itself, but it is part of the solution, along with the value function.
The reason we are just talking about the policy now is because it is somewhat of a weird symbol. We write down $\pi$ as a mathematical symbol, but there is no equation for $\pi$. For example, if $\pi$ is epsilon-greedy, how do we write that as an equation? It is more like an algorithm. The only exception to this is when you want to write down the optimal policy, which can be defined mathematically, in terms of the value function; we will discuss this later. So, for now we can just think of $\pi$ as the shorthand notation for the algorithm that the agent is using to navigate the environment.
Let's look at the state transition probability again:
$$p(s' \mid s, a)$$Recall that we said that this is typically deterministic, but that is not always the case. Why might that be so? Recall that the state is only what is derived from what the agent senses from the environment; it is not the environment itself. The state can be an imperfect representation of the environment, in which case you would expect the state transition to be probabilistic. For example, the state you measure could represent multiple configurations of the environment. As an example of an imperfect representation of the environment, think about blackjack; you may think of the dealer's next card as part of the state. But, as the agent, you can't see the next card, so it is not part of your state. It is part of the environment.
When we think of actions, we typically think of joystick inputs (up/down/left/right/jump) or blackjack moves (hit/stand). However, actions can be very broad as well, such as how to distribute government funding. So, RL can be applied to making political decisions as well.
Sometimes there is a bit of confusion surrounding what constitutes the agent vs. the environment. You are navigating your environment, but what constitutes you? Are you your body? Your body is, more correctly, part of the environment! Your body isn't making decisions or learning; your body has sensors which pass on signals to your brain, but it is your brain and mind that make all decisions and do all learning!
We are now going to formalize the idea of total future reward. This refers to everything from $t+1$ and onward. We call this the return, $G(t)$:
$$G(t) = \sum_{\tau = 1}^\infty R(t + \tau)$$Notice how it does not depend on the current reward, $R(t)$. This is because when we arrive at a state, we receive the reward for that state; there is nothing to predict about it, because it has already happened.
Now, think of a very long task, potentially containing thousands of steps. Your goal is to maximize your total reward. However, is there a difference between getting a reward now, and getting that same reward 10 years from now? Think about finance; we know that \$1000 today is worth more than \$1000 received 10 years from now. Would you rather get \$1000 now, or 10 years from now? Choose today!
This causes us to introduce a discount factor on future rewards. We call the discount factor $\gamma$, and we use a number between 0 and 1 to represent it:
$$G(t) = \sum_{\tau = 0}^ \infty \gamma ^{\tau} R(t + \tau + 1)$$A $\gamma = 1$ means that we don't care how far in the future a reward is; all rewards are weighted equally. A $\gamma = 0$ means that we don't care about future rewards at all, which gives a truly greedy algorithm, since the agent would only try to maximize its immediate reward. Usually we choose something close to 1, such as 0.9. If we have a very short episodic task, it may not be worth discounting at all. An intuitive reason for discounting future rewards is that the further you look into the future, the harder it is to predict. Hence, there is not a lot of sense in counting on something 10 years from now, unless you are sure you can make it happen, and that your circumstances won't change.
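As a quick illustration of the discounted return formula, here is a short sketch that computes $G(t)$ for a finite episode; the reward sequence and the value of $\gamma$ are made up, and the sum simply stops at the final reward.
import numpy as np

gamma = 0.9
# Hypothetical rewards R(t+1), R(t+2), ... collected after time t in a finite episode.
future_rewards = np.array([0.0, 0.0, -0.1, 1.0])

# G(t) = sum_tau gamma^tau * R(t + tau + 1)
discounts = gamma ** np.arange(len(future_rewards))
G = np.sum(discounts * future_rewards)
print(G)   # 0.9**2 * -0.1 + 0.9**3 * 1.0 = 0.648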
You may notice that the sum for the return goes from 0 to $\infty$; this implies that we are looking at a continuing task, when in reality the tasks we have looked at so far (tic-tac-toe) are episodic. This is a mathematical subtlety, but we actually want to write all of our equations in this continuing form; simply put, it makes the math a little easier to work with.
There is a way to merge episodic and continuing tasks so that they are equivalent. The way you do it is this: the episodic task has a terminal state. Pretend that there is a state transition from the terminal state to itself, that always happens with probability 1, and always yields a reward of 0. In this way, the episodic task remains the same, but since it goes on forever, it is technically a continuing task.
We are now going to go over a very intuitive, graphical explanation of the Bellman equation.
The most important concept in understanding the Bellman equation is the expected value. This concept is strange to many people at first; here is why: suppose we have a coin that has heads and tails, where heads is a win, and tails is a loss. Numerically, we encode these as heads = 1 and tails = 0. Now suppose the probability of winning is 60%, i.e. P(win) = 60%. Our expected value is then:
$$0.6 * 1 + 0.4 * 0 = 0.6$$Why is this weird? Because, in this case, the expected value is a value that you can never expect to get! You will never flip a 0.6!
The point of an expected value is that it tells us the mean, or average. We may gather the heights of students in a classroom and find the average height; perhaps no student has exactly that height, but it is still a useful statistic. Similarly, it doesn't matter that the coin flip will never yield 0.6. If we flip the coin 1000 times, we know to expect about 600 wins!
We can define expected value to be:
$$E(X) = \sum_x p(x)x$$Suppose we are playing the game of tic-tac-toe. Realistically, we can't perfectly predict what our opponent is going to do next. They can do any number of things, and sometimes we may end up winning, other times losing. Each time, this will leave us with a different total reward. Again, we can use a tree to represent this idea:
You may also recall that trees are recursive. So, if we take the tree above, and look at one of the root's child nodes, that is also a tree! In other words, the child of any node in the tree is the root of another tree! This idea of recursiveness will become very important later on.
Now, every time we play this game we are going to take a different path down the tree. Each path has a different probability, but we can still have this concept of:
What is the weighted sum of total rewards I get with this tree, with these states and these probabilities.
And so, the value of this state is precisely the average reward I will accumulate in the future, by being in this state now. Note this is not the exact reward, because that can be different every time. We are saying that it is the average, or what we would get on average if we chose that tree many times.
So, lets quickly recap what we know so far:
- At each state $s$, we are going to get a reward $R$.
- The overall return, $G$, is the sum of all the rewards we get.
However, games are random, so we need to be able to answer: "If we are in state $s$, what is the sum of rewards we will get in the future, on average?" To answer this we can say:
$$V(s) = E( G \mid s)$$In English this just means "The value of state $s$ is equal to the expected value of the overall return, $G$, given that we are in state $s$." As a note, anything to the left of the $\mid$ (given) symbol is random, while anything to the right of it is given and treated as fixed. This is called a conditional expectation.
The next concept we discuss is one of the most fundamental concepts in RL. The idea is as follows:
We know that every game is going to consist of a series of states and rewards:
Let's pretend for a moment that everything is deterministic, hence the reward we would get is the only reward we could possibly get at any state. This lets us drop the expected value symbol for now. You may recall, the expected value of a value that is not random, is just the value itself. I.e., the expected value of 3 is 3: $E(3) = 3$.
So, because the value of a state is just the sum of all future rewards (if they are deterministic), we can say that:
$$V(s_1) = r_2 + r_3 + r_4 + ... + r_N$$$$V(s_2) = r_3 + r_4 + ... + r_N$$But, we can now see a special relationship between the values of successive states! In particular, we can plug in the expression for $V(s_2)$ into the expression for $V(s_1)$:
$$V(s_1) = r_2 + V(s_2)$$In other words, this is a recursive equation!
Of course, if you wanted to discount future rewards, that can be done too without any significant changes. We can say that:
$$V(s_1) = r_2 + \gamma r_3 + \gamma^2 r_4 + ...$$$$V(s_2) = r_3 + \gamma r_4 + ... $$$$V(s_1) = r_2 + \gamma V(s_2)$$This recursiveness is going to be a very important theme in this course. We know that this value function is going to be an expected value. And we know from our tic-tac-toe example that our job is to estimate this value. But we can see how it has this recursive structure. The value at this state is an estimate, and it depends on the value at another state, which is also an estimate. So, it is an estimate which itself depends on an estimate. This set of notebooks, in general, is all about how we optimize this array of estimates, which depend on each other, all at the same time.
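Here is a tiny sketch of this recursion in action (the rewards are made up): walking backwards through a deterministic episode, each state's value is its immediate reward plus $\gamma$ times the value of the state that follows it.
gamma = 0.9
# Hypothetical rewards r_2, ..., r_N received along one deterministic episode.
rewards = [0.0, -0.1, 1.0]

# Compute the values backwards: V(s_i) = r_{i+1} + gamma * V(s_{i+1}),
# with the value of the final state taken to be 0.
values = [0.0] * (len(rewards) + 1)
for i in reversed(range(len(rewards))):
    values[i] = rewards[i] + gamma * values[i + 1]

print(values)   # approximately [0.72, 0.8, 1.0, 0.0]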
So, what can we do now that we have this relationship between the value at state $s$, and the value at state $s'$? (Notice that $s_1$ and $s_2$ have been replaced with $s$ and $s'$ just to be more general.) So:
$s$ = current state
$s'$ = next state
We can call the reward $r$, but it can also be denoted as $R(s, s')$ since it is technically the reward we get going from $s$ to $s'$.
$r$ = $R(s, s')$ = reward
In any case, remember that we said that the rewards and state transitions can all be probabilistic. So, in order to denote that, we just put the expected value symbol back in our equation:
$$V(s) = E \Big[ r + \gamma V(s')\Big]$$This is the essence of the Bellman Equation!
Since we can express $V(s)$ as an expected value, we can even expand it out:
$$V(s) = E \Big[ r + \gamma E[r' + \gamma V(s'')]\Big]$$$$V(s) = E \Big[ r + \gamma E\big[r' + \gamma E[r''+...]\big]\Big]$$What we have done above is just expand the recursion. This is mainly done for visual purposes, and we won't actually do this in any of our algorithms.
For simplicity's sake, the conditional was dropped (among other things). We can put that back in easily:
$$V(s) = E \Big[ r + \gamma V(s') \mid s\Big]$$A useful counterpart of the value function $V$, which depends only on $s$, is another value function $Q$. $Q$ depends not only on $s$, but also on the action $a$. We call $V(s)$ the state-value function, and $Q(s,a)$ the action-value function:
V(s) = state-value function
What is my expected future return, now that I am in state $s$?
Q(s,a) = action-value function
What is my expected future return, given that I am now in state $s$ and I take action $a$?
Hence, $Q$ provides a way of incorporating more data into the prediction, and provides more granularity. Since $V$ doesn't depend on $a$, it must somehow take the actions into account; we just don't know how yet. We will discuss this more in a later section!
Earlier, we saw value functions when we discussed tic-tac-toe, but that was rather informal. Everything from here on out pertains to the formal value function.
A value function is determined by a policy and has a state as a parameter. It is equal to the expected value of the return, given you are in state s:
$$V_\pi(s) = E_\pi \big[G(t) \mid S_t = s \big] = E_\pi \big[\sum_{\tau = 0}^\infty \gamma ^ \tau R(t + \tau + 1) \mid S_t = s \big] $$Where recall that we defined the return (total return), as $G(t)$:
$$G(t) = \sum_{\tau = 0}^ \infty \gamma ^{\tau} R(t + \tau + 1)$$
We can now do some basic algebra, noticing that the return is recursive; we can separate $R(t+1)$, $R(t+2)$, and so on.
$$V_\pi(s) = E_\pi \big[R(t + 1) + \gamma \sum_{\tau = 0}^\infty \gamma ^ \tau R(t + \tau + 2) \mid S_t = s \big] $$$$V_\pi(s) = E_\pi \big[R(t + 1) + \gamma G(t+1) \mid S_t = s \big] $$Before we jump to the next portion of this derivation, there is something worth clearing up regarding expected value. Generally, if we want to find the expected value of a random variable $X$, we need two things:
- The different values that X can take on
- The probability associated with X taking on each of those values
This is seen clearly by the definition of expected value:
$$E(X) = \sum_x p(x)x$$We sum over all the different values that $X$ can take on, weighting each by its associated probability.
As a quick example, imagine playing the lottery. You can win the following prizes (prizes in this case represent our random variable):
Prize Values: 1, 5, 10, 20, 50, 100, 1000, 10000, and 100000 dollars.
So, we have the values that our random variable can hold. Now, we can define the associated probabilities:
Prize Probabilities: P(1) = 0.1, P(5) = 0.05, P(10) = 0.01, P(20) = 0.005, P(50) = 0.0001, P(100) = 0.00005, P(1000) = 0.00001, P(10000) = 0.000001, and P(100000) = 0.0000001
Great, we now have everything we need to perform a quick calculation and determine the expected value of the lottery ticket:
$$1*0.1 + 5 * 0.05 + 10 * 0.01 + 20 * 0.005 + 50 * 0.0001 + 100 * 0.00005 + 1000 * 0.00001 + 10000 * 0.000001 + 100000 * 0.0000001 = 0.59$$So, in this case the expected value of our lottery ticket is 0.59 dollars. This means that if you pay 5 dollars for the ticket, over time (by the law of large numbers) you can expect to lose about 4.41 dollars each time you buy the ticket.
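We can double-check this calculation with a couple of lines of code, using the prize values and probabilities listed above:
import numpy as np

prizes = np.array([1, 5, 10, 20, 50, 100, 1000, 10000, 100000], dtype=float)
probs  = np.array([0.1, 0.05, 0.01, 0.005, 0.0001, 0.00005,
                   0.00001, 0.000001, 0.0000001])

# E(X) = sum_x p(x) * x
expected_value = np.dot(prizes, probs)
print(expected_value)   # approximately 0.59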
Okay, with that example out of the way, and hopefully a bit more intuition under our belts, let's talk about the notation of our expected value definition. We see that we simply define it as $E(X)$. However, in the value function equation, the expected value has a subscript referring to $\pi$, specifically: $E_\pi [...$. What exactly does this mean?
We can certainly agree that in our lottery example the final result that we calculated was most definitely dependent upon the probability distribution surrounding the prizes. In other words, we could say that we were calculating the expected value of the lottery ticket, with respect to the prize probability distribution. That means that we could have written it as:
$$E(X) = E_X(X) = E_{P(X)}(X)$$Each notation above simply means that we are trying to determine the expected value of the random variable $X$, with respect to its specific distribution. Now, the notation surrounding expected values and probability can be rather ambiguous, with different authors changing things seemingly haphazardly; however, we should be prepared to interpret the above as identical.
So, as we tie back into the example at hand, we can say that our policy will most definitely have an effect on the reward (since our reward is based on an action, and the action we take is based on the policy). Hence, we want to find the expected value of the total return $G(t)$, with respect to the distribution associated with our policy.
Given the above discussion concerning expected value, let's now tie together many of the things that we have been talking about so far. Because our expected value is over (with respect to) $\pi$, we can express $\pi$ as a probability distribution:
$$\pi = \pi(a \mid s)$$Now given the equation we had left off on:
$$V_\pi(s) = E_\pi \big[R(t + 1) + \gamma \sum_{\tau = 0}^\infty \gamma ^ \tau R(t + \tau + 2) \mid S_t = s \big] $$We can take advantage of the fact that expected values are linear operators, and find each term individually. So, we can first look at the expected value of $R(t+1)$, given $s$:
$$E_\pi \big[R(t + 1) \mid S_t = s \big]$$Now in order to find this, let's first clarify what the above equation is representing:
In English, it is saying: "What is the expected value of the reward $R(t+1)$, given that we are in state $s$?"
Well, to gain intuition before jumping into the derivation, let's think about what actually goes into determining a reward. If we look back at section 2.2 concerning MDPs, we can see that a reward is yielded after an action $a$ is taken while being in a state $s$. This is illustrated below:
Great, so we can clearly see that in order to find the expected value of $R(t+1)$, we will need two things:
- Distribution of policy $\pi$
Our reward is dependent upon the action we take, and the action we take is dependent on our policy. Hence, we need to incorporate $\pi(a \mid s)$ into our solution.
- Probability of r (aka $R(t+1)$) given s and a
By definition, the expected value of a random variable depends on the probability distribution associated with that random variable. In this case, our random variable is $r$, and the distribution of $r$ depends on the state $s$ and action $a$ that we take. This leaves us with:
$$E_{p(r \mid s, a)}\big[ r \big] = \sum_r r * p(r \mid s, a) $$
With those two things in mind, we can derive the following:
$$E_\pi \big[R(t + 1) \mid S_t = s \big] = \sum_a \pi(a \mid s) * \sum_r r * p(r \mid s,a)$$This is elaborated in the following image:
A Note on the traditional expected value and marginalization:
Now, pertaining to the probability $p(r \mid s,a)$ that appears in the traditional expected value:
Recall, from section 2.2, that the MDP framework states that:
$$p\big(S_{t+1}, R_{t+1} \mid S_t, A_t, S_{t-1}, A_{t-1},...,S_0, A_0\big) = p \big( S_{t+1}, R_{t+1} \mid S_t, A_t\big) $$And we can subsequently rewrite it as:
$$p(s', r \mid s, a)$$If we then marginalize out the $s'$, we can find the probability we are seeking, $p(r \mid s,a)$:
$$p(r \mid s,a) = \sum_{s' \in S} p(s', r \mid s, a)$$As a note, this marginalization process works by considering a value of $r$, and then iterating through every single value of $s'$ (all the ways that $r$ could happen, for each $s'$), and summing them up. For example, if our reward could take on a value from $\{1,...,5\}$, and our next state could take on a value from $\{1,...,3\}$, we can find the probability that the reward is equal to 5 as follows:
$$p(r = 5 \mid s,a) = \sum_{s' \in S} p(s', r =5 \mid s,a)$$$$p(r = 5 \mid s,a) = p(s' = 1, r =5 \mid s,a) +p(s' = 2, r =5 \mid s,a)+ p(s' = 3, r =5 \mid s,a)$$And we have subsequently found $p(r \mid s,a)$, marginalizing out $s'$.
A simple example:
Based on the equation that we found earlier, we can look at a small example to see how it works in action. Imagine for a second that we are inside of the right summation, and the $r$ index is currently equal to 5:
$$5 * p(r = 5 \mid s,a)$$But what is the action $a$? This is where the left term comes in; we must include it, and hence set $a$, which is used in the above equation, and then multiply by the specific $\pi(a \mid s)$ probability. This can be thought of as two nested for loops, as sketched below.
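Here is what those two nested loops might look like in code. The policy $\pi(a \mid s)$ and the reward distribution $p(r \mid s, a)$ are stored as small dictionaries with made-up numbers, purely to show the structure of the sum.
# Hypothetical distributions for a single state s, just to illustrate the structure.
pi = {'left': 0.5, 'right': 0.5}                        # pi(a | s)
p_reward = {'left':  {0.0: 0.9, 1.0: 0.1},              # p(r | s, a=left)
            'right': {0.0: 0.4, 1.0: 0.6}}              # p(r | s, a=right)

# E_pi[R(t+1) | S_t = s] = sum_a pi(a|s) * sum_r r * p(r | s, a)
expected_reward = 0.0
for a, pi_a in pi.items():                # outer loop: actions weighted by the policy
    for r, p_r in p_reward[a].items():    # inner loop: rewards weighted by p(r | s, a)
        expected_reward += pi_a * p_r * r

print(expected_reward)   # 0.5*0.1 + 0.5*0.6 = 0.35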
Of course, we can rewrite our above solution in terms of $p(s', r \mid s, a)$:
$$E_\pi \big[R(t + 1) \mid S_t = s \big] = \sum_a \pi(a \mid s) * \sum_{s'}\sum_r r * p(s',r \mid s,a)$$In fact, what we have just determined above is a general expression for expected values, meaning that we can use it to calculate the expected value of anything, given the policy and the current state:
$$E_\pi \big[ANY \mid S_t = s \big] = \sum_a \pi(a \mid s) * \sum_{s'}\sum_r (ANY) * p(s',r \mid s,a)$$This means that we can use the above equation to calculate the second part of the expected value from our original equation:
$$V_\pi(s) = E_\pi \Big[R(t + 1) + \gamma \sum_{\tau = 0}^\infty \gamma ^ \tau R(t + \tau + 2) \mid S_t = s \Big] $$$$V_\pi(s) = \sum_a \pi(a \mid s) * \sum_{s'}\sum_r p(s',r \mid s,a) \Big \{ r + \gamma E_{\pi} \Big[ \sum_{\tau = 0}^\infty \gamma^{\tau} R(t + \tau + 2) \mid S_{t+1} = s' \Big] \Big \}$$Now, going from the first to the second line above, there may be some confusion. It is most likely clear how the left hand side of the inner portion of the curly brackets, $r$, was derived; it is entirely identical to what we had just done earlier. But there was something slightly tricky done on the right hand side; we added a nested expected value directly to the right of the $\gamma$ term:
$$E_{\pi} \Big[ \sum_{\tau = 0}^\infty \gamma^{\tau} R(t + \tau + 2) \mid S_{t+1} = s' \Big]$$So, one expected value was taken away and another added in its place. Why was this done? Well, to see we must again go back to the rules of probability.
Expected Value of an Expected Value
Recall that the expected value of the expected value of a random variable is just the expected value of the random variable: $E\big(E(X)\big) = E(X)$.
This can be done infinitely, and it will not change:
$$E\big(E(E(E(...E(X))))\big) = E(X)$$Therefore, if we have one expected value, $E(A + B)$, we can insert another expected value in there, without breaking any rules of probability:
$$E(A + B) = E(A + E(B))$$Conditional Expectation
We then make use of the conditional expectation rule, which says that the expected value of $X$ given $Y$ is a function of the random variable $Y$ (assuming $Y$ is not fixed). That means we can write:
$$E(X \mid Y) = f(Y)$$
Law of the Unconscious Statistician
We can then use the law of the unconscious statistician to help us even further. The law is used to calculate the expected value of a function, $f(Y)$, when the probability distribution of $Y$ is known, but the distribution of $f(Y)$ is not known directly:
$$E\big(f(Y)\big) = \sum_{y \in Y} f(y) * P(Y = y)$$
Law of Total Expectation
This can then be utilized in the law of total expectation! The law of total expectation shows that:
$$E\big(E(X \mid Y)\big) = E(X)$$
To reach this conclusion, we take our result from the LOTUS:
$$ E\big(E(X \mid Y)\big) = E\big( f(Y) \big) = \sum_{y \in Y} f(y) * P(Y = y)$$And we replace $f(Y)$ with what it was representing all along:
$$E\big(E(X \mid Y)\big) = \sum_{y \in Y} E(X \mid Y) * P(Y = y)$$By definition we know that the inner expected value, $E(X \mid Y)$, is defined as:
$$\sum_{x \in X} x * P(X = x \mid Y)$$So we can substitute that in:
$$E\big(E(X \mid Y)\big) = \sum_{y \in Y} \sum_{x \in X} x * P(X = x \mid Y) * P(Y = y)$$By the definition of conditional probability, we can represent the conditional times marginal distribution as the joint distribution of $X$ and $Y$:
$$E\big(E(X \mid Y)\big) = \sum_{y \in Y} \sum_{x \in X} x * P(X = x, Y =y)$$Assuming the series is finite (or countably infinite), we can switch the summations around:
$$E\big(E(X \mid Y)\big) = \sum_{x \in X} x \sum_{y \in Y} P(X = x, Y =y)$$At which point the summation over $Y$ will reduce our probability to $P(X = x)$:
$$E\big(E(X \mid Y)\big) = \sum_{x \in X} x * P(X = x)$$Which is simply $E(X)$!
$$E\big(E(X \mid Y)\big) = E(X)$$Back to V(S)
Okay, so what does this all mean for our scenario? Let's use our new tools and make a few quick simplifications; we started with the following for $V_\pi(s)$:
$$V_\pi(s) = E_\pi \Big[R(t + 1) + \gamma \sum_{\tau = 0}^\infty \gamma ^ \tau R(t + \tau + 2) \mid S_t = s \Big] $$We will let the left and right terms inside the expectation be represented by $A$ and $B$ respectively:
$$A = R(t + 1) $$$$B = \gamma \sum_{\tau = 0}^\infty \gamma ^ \tau R(t + \tau + 2)$$We can see, via the rule derived earlier, that we can place an expected value around $B$ without breaking any rules:
$$E(A + B) = E(A + E(B))$$Meaning our full equation (keeping $A$ and $B$) is now:
$$E_\pi(A + B \mid S_t = s) = E_\pi\big(A + E_\pi(B \mid S_t = s) \mid S_t = s\big)$$Now, looking at the other rule we derived (LOTUS):
$$E\big(E(X \mid Y)\big) = E(X)$$It can be applied to our above equation as follows; first separate our left and right term because the expected value is a linear operator:
$$ E_\pi\big(A + E_\pi(B \mid S_t = s) \mid S_t = s\big) = E_\pi(A \mid S_t = s) + E_\pi \big( E_\pi(B \mid S_t = s)\big)$$At which point the right hand term has the form of our law of total expectation rule:
$$E\big(E(X \mid Y)\big) \leftrightarrow E_\pi \big(E_\pi(B \mid S_t = s)\big)$$Then, because we know that the conditional term in the above equations will be reduced away (i.e. on the left $Y$ disappears, and on the right $S_t = s$ does), we can take advantage of this and place any conditioning in its place:
$$E_\pi \big(E_\pi(B \mid S_t = s)\big) \rightarrow E_\pi \big(E_\pi(B \mid ANY )\big) \rightarrow E_\pi(B)$$So we pick ANY to be that which corresponds to the value function for $s'$!
$$ E_\pi \big(E_\pi(B \mid ANY )\big) \rightarrow E_\pi \big(E_\pi(B \mid S_{t+1} = s' )\big)$$And if we replace $B$ with its definition:
$$E_\pi \big(E_\pi(\gamma \sum_{\tau = 0}^\infty \gamma ^ \tau R(t + \tau + 2) \mid S_{t+1} = s' )\big)$$And since we know that the expected value of an expected value just the inner expected value, this reduces to:
$$E_\pi \big [\gamma \sum_{\tau = 0}^\infty \gamma ^ \tau R(t + \tau + 2) \mid S_{t+1} = s' \big ] $$Which is just equal to $\gamma V_\pi(s')$!
$$\gamma V_\pi(s') = E_\pi \big [ \gamma \sum_{\tau = 0}^\infty \gamma ^ \tau R(t + \tau + 2) \mid S_{t+1} = s' \big ]$$Bringing it all together:
So, we started with the following:
$$V_\pi(s) = E_\pi \Big[R(t + 1) + \gamma \sum_{\tau = 0}^\infty \gamma ^ \tau R(t + \tau + 2) \mid S_t = s \Big] $$
And then determined a general expression for expected value:
$$V_\pi(s) = \sum_a \pi(a \mid s) * \sum_{s'}\sum_r p(s',r \mid s,a) \Big \{ r + \gamma \sum_{\tau = 0}^\infty \gamma ^ \tau R(t + \tau + 2) \Big \}$$Then, we made some alterations based on allowed manipulations of expected value:
$$V_\pi(s) = \sum_a \pi(a \mid s) * \sum_{s'}\sum_r p(s',r \mid s,a) \Big \{ r + \gamma E_{\pi} \Big[ \sum_{\tau = 0}^\infty \gamma^{\tau} R(t + \tau + 2) \mid S_{t+1} = s' \Big] \Big \}$$And that brings us to the final line in the derivation, in which we replace the nested expectation, which we know is equal to $V_\pi(s')$, with $V_\pi(s')$ itself.
$$V_\pi(s) = \sum_a \pi(a \mid s) * \sum_{s'}\sum_r p(s',r \mid s,a) \Big \{ r + \gamma V_\pi(s') \Big \}$$We can now appreciate this equation in all of its glory! This recursive structure shows that $V$ at a certain state depends on $V$ at the next state. What we care about is the total future reward, which could be infinite, but we don't have to look that far into the future!
We know that $V(s')$ already encapsulates the future by definition, so $V(s')$ is as far as we need to look. In other words, in order to tell the future, all we need to do is look one step ahead!
The key thing to remember here is that this allows us to avoid enumerating all possible future trajectories (which could go on forever) in order to choose an action. In any situation we want to do a global optimization; we are not so much interested in making 5 dollars in the next minute; we want to know how to make 1 billion dollars in a lifetime.
What RL says is that we do not need to consider every future step; as long as you know the value function, then you can always safely look just one step ahead! Everything we need to know about future rewards is already encapsulated in the value function by definition.
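To make the one-step-lookahead idea concrete, here is a hedged sketch of evaluating the right-hand side of the Bellman equation for a single state, given tabular representations of $\pi(a \mid s)$, $p(s', r \mid s, a)$, and a current estimate of $V$. The data structures and names (`bellman_backup`, `transitions`, and the toy numbers) are assumptions for illustration, not a fixed implementation.
def bellman_backup(s, V, pi, transitions, gamma=0.9):
    """One application of the Bellman equation for state s.

    pi[s] is a dict {action: probability}; transitions[(s, a)] is a list of
    (prob, next_state, reward) tuples representing p(s', r | s, a).
    """
    total = 0.0
    for a, pi_sa in pi[s].items():
        for prob, s_next, r in transitions[(s, a)]:
            total += pi_sa * prob * (r + gamma * V.get(s_next, 0.0))
    return total

# Tiny made-up example: from 'start' the policy chooses 'go' with probability 1,
# landing in the terminal state 'end' with reward 1.
pi = {'start': {'go': 1.0}}
transitions = {('start', 'go'): [(1.0, 'end', 1.0)]}
V = {'end': 0.0}
print(bellman_backup('start', V, pi, transitions))   # 1.0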
This recursive equation has a special place in RL; it is the equation on which all of the algorithms we will learn are based. It is known as the Bellman Equation.
It is named after the mathematician Richard Bellman, who also pioneered the technique known as dynamic programming. Dynamic programming is related to recursion, but is efficient because it builds up a solution in a bottom-up fashion.
There is one more value function that we will need to learn about in our discussion of value functions. What we just learned about, $V(s)$, is called the state-value function:
$$ V_\pi(s) = E_\pi \big [ G(t) \mid S_t = s\big ]$$There is another value function, $Q(s,a)$, called the action-value function:
$$Q(s,a) = E_\pi \big[ G(t) \mid S_t =s , A_t = a\big]$$It is called this because the action $a$ is also a parameter of the function. Notice that because $Q$ has two inputs, it requires more space to store. In particular, the amount of space we need to store it is the size of the set of states times the size of the set of actions:
$$\text{Space required: } |S| * |A|$$Let's work through a few simple models that contain a handful of states, which are hence easy to solve by hand. We also want to keep in mind throughout these examples the central idea of "solving for V(s)". That is essentially what these RL notebooks are about:
- We have model called an MDP
- There is a value function, V(s). All we need to do is find it.
It really is that simple! Once the details have been worked out, we are able to take a step back and appreciate the generality of this framework to solve problems.
Before our first example, we should highlight the goal of the value function one more time:
Goal Of Value Function: The value function allows us to answer the following question: "If we are in state s , what is the sum of rewards we will get in the future, on average?"
We will start with the most basic example: We have 2 states, start and end. The probability of going from start to end is 1.
Clearly, every time we play this game it is completely deterministic. We can say that the end state gives us a reward of 1, and the discount factor is $\gamma = 0.9$. The question is, what do we need to find?
Remember, we want to be solving for the value function! Because there are only two states, in particular we need to find $V(start)$ and $V(end)$.
To do this, we must first recall that the value is the sum of all future rewards. At the terminal state there are no longer any future rewards, meaning the value of a terminal state is 0. Hence:
$$V(end) = 0$$It follows that because everything else is deterministic:
$$V(start) = 1$$Why is this? Well, from an intuitive standpoint we can think of this as a probability tree. Since this is the only edge in that tree, the weight of the edge is 1, and the reward is 1, leaving us with one as our answer. This would still hold true when using the Bellman equation:
$$V(start) = V(s = start) = E\big[ R(end) \mid s = start\big]$$$$V(start) = E \big[1 \mid start \big]$$$$V(start) =1 $$Keep in mind that we do not multiply this by $\gamma$, because $\gamma$ only applies to future states. So, the full equation is technically:
$$V(start) = R + \gamma V(end)$$But the second term just resolves to 0.
We can now add a 3rd state:
Note that everything is still deterministic, and every game has just 3 steps: start, middle, and end. The reward again is 1 for arriving at the end state, and 0 at the middle state. $\gamma$ will remain at 0.9.
As we know, our goal is to find $V(start)$, $V(middle)$, and $V(end)$.
Because end is a terminal state, we again know its value is:
$$V(end) = 0$$The middle state takes on the role that start did in example 1, so it is subsequently:
$$V(middle) = 1$$Finally, we can use the Bellman equation to determine the value of the start state:
$$V(start) = R(start, mid) + \gamma V(mid) = 0 + 0.9 *1 = 0.9$$We are now going to slightly tweak example 2, by changing the reward at the middle state to be $R = -0.1$.
Again, we know the values of the end and middle states:
$$V(end) = 0$$$$V(middle) = 1$$Keep in mind, the reason that the value of the middle state did not change is because it is only affected by future rewards. And in the case of the start state:
$$V(start) = R(start, mid) + \gamma V(mid) = -0.1 + 0.9 *1 = 0.8$$In this example, we are going to introduce some randomness:
Our states are $S_1, S_2, S_3, S_4$. $S_4$ is a terminal state, and you always get a reward of 1 when you arrive there. $S_2$ and $S_3$ go to $S_4$ 100% of the time, so this part is deterministic. $S_1$ goes to $S_2$ 50% of the time, and to $S_3$ 50% of the time, hence this portion is random. The reward for $S_4$ is 1, for $S_2$ is -0.2, and for $S_3$ is -0.3.
Before we begin trying to find a solution, it would be beneficial to clear up a potential nuance that was just introduced: what do the probabilities we are working with refer to? When we say "there is a 50% chance of going to $S_2$ and a 50% chance of going to $S_3$", what does that really mean? Because this is reinforcement learning, and there are multiple components in an MDP, this is not just a simple coin flip. We don't flip a coin and go whichever way it tells us to go. We are an intelligent agent, and as such have an internal policy, $\pi(a \mid s)$, that tells us which action to take. So, importantly, we should make clear:
$\pi(a \mid s)$ does not tell us where to go, but rather what action to take.
But, we must recall that this is not the only relevant probability. We also have the probability:
$$p(s', r \mid s, a)$$This just means that by doing some action $a$ in state $s$, the result can be random. So if the action is to flip a coin, of course there can be multiple possible results. We may get heads, or we may get tails. That is one way an action may result in a probabilistic next state.
$p(s', r \mid s, a)$ tells us where we end up, after taking a certain action.
So, when we say that there is a 50% chance of going to $S_2$ and a 50% chance of going to $S_3$, that sounds very intuitive and simple, but it oversimplifies MDPs and results in some ambiguity (even though when you verbalize it, it sounds simpler).
For this example, let's assume that $p(r,s' \mid s,a)$ is deterministic, and the 50% probabilities mentioned earlier, refer to the policy $\pi(a \mid s)$, where the action is simply just to go to the next state. That means that the agent itself decides where to go by random chance, and it is not the result of the environment.
So, as before, our terminal state $S_4$ has a value of 0:
$$V(S_4) = 0$$Because moving from $S_2$ and $S_3$ is deterministic, they both have the same value as well:
$$V(S_2) = 1$$$$V(S_3) = 1$$Finally, $V(S_1)$ must be calculated using expected values:
$$V(S_1) = p(S_2 \mid S_1) \big[R_2 + \gamma V(S_2) \big] + p(S_3 \mid S_1) \big[R_3 + \gamma V(S_3) \big]$$$$V(S_1) = 0.5 * \big[-0.2 + 0.9 *1 \big] + 0.5 * \big[-0.3 + 0.9 *1 \big] = 0.65$$We will now extend the previous example a bit further. In example 4, the expected value was easy to calculate because we knew we would end up in the same terminal state $S_4$. In this example that won't be the case. Note that $S_1$ can go to either $S_2$ or $S_3$, and from $S_2$ and $S_3$ we can end up in either $S_4$ or $S_5$.
So, now we have 2 terminal states, with multiple paths to either one. The only relevant rewards for this example are +1 if we end up in $S_5$, and -1 if we end up in $S_4$.
We know that $S_4$ and $S_5$ are terminal states, hence:
$$V(S_4) = 0$$$$V(S_5) = 0$$Secondly, we can calculate $V(S_2)$ and $V(S_3)$ just from the rewards, because we know the next state must be terminal (so we don't need to use the full Bellman equation).
$$V(S_2) = 0.8 * -1 + 0.2 * 1 = -0.6$$$$V(S_3) = 0.9*1 + 0.1*-1 = 0.8$$Finally, to determine $V(S_1)$ we must consider both $V(S_2)$ and $V(S_3)$. It is still simple, however, because the rewards at those states are 0.
$$V(S_1) = p(S_2 \mid S_1) \big[R_2 + \gamma V(S_2) \big] + p(S_3 \mid S_1) \big[R_3 + \gamma V(S_3) \big]$$$$V(S_1) = 0.5\big[0 + 0.9 * -0.6 \big] + 0.5 \big[0 + 0.9 * 0.8 \big]$$$$V(S_1) = 0.09$$
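As a quick check of the arithmetic above, we can plug the numbers for this example straight into Python:
gamma = 0.9

# Terminal states have value 0.
V_S4, V_S5 = 0.0, 0.0

# S_2 and S_3 lead directly to a terminal state, so only the immediate reward matters.
V_S2 = 0.8 * -1 + 0.2 * 1          # -0.6
V_S3 = 0.9 * 1 + 0.1 * -1          #  0.8

# S_1 goes to S_2 or S_3 with probability 0.5 each; the rewards at those states are 0.
V_S1 = 0.5 * (0 + gamma * V_S2) + 0.5 * (0 + gamma * V_S3)
print(V_S1)   # 0.09 (up to floating point)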
Now, in this example we are going to be looking at a simpler state diagram, but with more complex probabilities. This will be a somewhat more realistic scenario, so we can picture what is happening. It will also allow us to separate policy randomness from state-arrival randomness.
The start state is that you are standing. At this point someone throws a ball at you (think dodgeball). You have two potential actions: you can either decide to "jump" or "duck". The next possible state is either you get hit, or you don't get hit (which means you are safe). We can say that the reward for getting hit is -1, and the reward for not getting hit is 0.
Importantly, we can see why these state diagrams are somewhat oversimplified. In particular, when looking at a tree like this, you may think that the action is to go to the next state. However, that does not include all of the components of an MDP. We can see why that would not work in this scenario. We can't just choose "don't get hit"; we can only choose to "jump" or "duck". This mirrors any real-life scenario. For example, we cannot choose to be rich. We can start a company, and it may fail or it may succeed and make us rich, but we can't decide that. All we can do is take the action of starting the company. The state that we end up in (rich or not rich) is probabilistic.
So, as we continue with our example, we can use the following probabilities:
Action based on policy probabilities:
$$\pi(jump \mid start) = 0.5$$
$$\pi(duck \mid start) = 0.5$$
State transition probabilities:
$$p(hit, reward=-1 \mid jump, start) = 0.8$$
$$p(hit, reward=0 \mid jump, start) = 0$$
$$p(safe, reward=-1 \mid jump, start) = 0$$
$$p(safe, reward=0 \mid jump, start) = 0.2$$
$$p(hit, reward=-1 \mid duck, start) = 0.4$$
$$p(hit, reward=0 \mid duck, start) = 0$$
$$p(safe, reward=-1 \mid duck, start) = 0$$
$$p(safe, reward=0 \mid duck, start) = 0.6$$
Notice that we defined the distribution over the reward. We did not have to do this; however, it shows what the entire probability space looks like. We can plug in any other number for the reward and the probability is zero, per the specifications of the problem (i.e. a reward of 10 would have 0% probability, because that does not exist in this game).
We can of course just marginalize these in order to get rid of the reward completely since it is deterministic:
$$p(hit \mid jump, start) = 0.8$$$$p(safe \mid jump, start) = 0.2$$$$p(hit \mid duck, start) = 0.4$$$$p(safe \mid duck, start) = 0.6$$Now we can solve for the value of each state. As always, the value of the terminal states is 0:
$$V(safe) = 0$$$$V(hit) = 0$$Calculating $V(start)$ is the more challenging portion of this problem. We can start by writing the following:
$$V(start) = \sum_{a \in \{duck, jump\}} \pi(a \mid start) \sum_{s' \in \{safe, hit\}} p(s' \mid start, a) r(s, s')$$The above is still simpler than the full Bellman equation, because we don't need to consider the value of the next state. In a situation like this it is nice to do a bit of a sanity check. When we see a summation inside of another summation, we can think of it as a nested for loop. So, in the outer loop we have 2 things to sum over, and in the inner loop we have 2 things to sum over. We also know that the total number of iterations is the size of the outer loop times the size of the inner loop. So, in total we should have 4 terms in the sum.
For brevity, we are going to stop showing the conditioning on the start state, since that is clear, and we will abbreviate jump with $j$ and duck with $d$. We end up with:
$$V(start) = \sum_{a \in \{d, j\}} \pi(a) \sum_{s' \in \{safe, hit\}} p(s' \mid a) r(s, s')$$$$V(start) = \pi(j)p(safe \mid j)*0 + \pi(j)p(hit \mid j)*-1 + \pi(d)p(safe \mid d)*0 + \pi(d) p(hit \mid d) * -1$$We won't plug in the numbers, since the entire goal of this example was just to set things up, but feel free if you would like to.
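If you do want to plug the numbers in, a few lines of code will do it, using the policy and (marginalized) transition probabilities defined above:
# Policy probabilities and marginalized transition probabilities from above.
pi = {'jump': 0.5, 'duck': 0.5}
p_hit = {'jump': 0.8, 'duck': 0.4}     # p(hit | start, a); p(safe | start, a) = 1 - p_hit
reward = {'hit': -1.0, 'safe': 0.0}

V_start = sum(pi[a] * (p_hit[a] * reward['hit'] + (1 - p_hit[a]) * reward['safe'])
              for a in pi)
print(V_start)   # 0.5*0.8*(-1) + 0.5*0.4*(-1) = -0.6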
For our final example, we are going to get rid of the notion of "just working backwards", and introduce a cycle.
Above, we can see that $S_3$ is a terminal state, and once you reach it the game is over and you receive a reward of 1. $S_1$ and $S_2$ are non-terminal states, and they can both go to each other, so there is no notion of a starting position. Only $S_2$ can go to $S_3$, but $S_1$ can go back to itself.
Let's assume that the only probabilities here come from the actions, and that the actions are simply to go to whatever state is next. As we have discussed, the actions don't have to be choosing what state to go to next, but it will help in simplifying the probabilities that we must consider.
So, the question now arises: how do we solve this problem? Well, since we cannot just work backwards as we did before, our only option is to apply the Bellman equation for each state and see what we end up with.
First we can solve for $V(S_1)$:
$$V(S_1) = p(S_1 \mid S_1)*(R_1 + \gamma V(S_1)) + p(S_2 \mid S_1)*(R_2 + \gamma V(S_2)) $$$$V(S_1) = 0.3*(-0.1 + 0.9*V(S_1)) + 0.7*(-0.1 + 0.9 V(S_2)) $$$$V(S_1) = -0.1 + 0.27V(S_1) + 0.63V(S_2)$$And next for $V(S_2)$: $$V(S_2) = p(S_1 \mid S_2)(R_1 + \gamma V(S_1)) + p(S_3 \mid S_2)(R_3 + \gamma V(S_3))$$ $$V(S_2) = 0.6(-0.1 + 0.9 V(S_1)) + 0.4(1 + 0.9 V(S_3))$$ $$V(S_2) = 0.34 + 0.54V(S_1) + 0.36V(S_3)$$
And $S_3$ is a terminal state, so: $$V(S_3) = 0$$
Now, looking at these 3 equations we may notice something:
$$V(S_1) = -0.1 + 0.27V(S_1) + 0.63V(S_2)$$$$V(S_2) = 0.34 + 0.54V(S_1) + 0.36V(S_3)$$$$V(S_3) = 0$$This is a system of linear equations, just like you have seen in linear algebra! To make this a bit more clear, we can group together like terms:
$$0.1 = -0.73V(S_1) + 0.63V(S_2) + 0*V(S_3)$$$$ -0.34 = 0.54V(S_1) - V(S_2) + 0.36V(S_3)$$$$0 = 0*V(S_1) + 0*V(S_2) + V(S_3)$$Now, we can neatly separate this equation into 3 parts:
$$\begin{bmatrix} -0.73 & 0.63 & 0 \\ 0.54 & -1 & 0.36 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} V(S_1) \\ V(S_2) \\ V(S_3) \end{bmatrix} = \begin{bmatrix} 0.1 \\ -0.34 \\ 0 \end{bmatrix} $$Where we now have a vector containing $V(S_1), V(S_2), V(S_3)$, a 3x3 matrix to multiply it by, and a vector of numbers that just sits by itself. This clearly takes the form:
$$Ax = b$$And we can use numpy to easily solve this:
import numpy as np

# Coefficient matrix and right-hand side from the system above
A = np.array([[-0.73, 0.63, 0], [0.54, -1, 0.36], [0, 0, 1]])
b = np.array([0.1, -0.34, 0])

# Solve Ax = b for x = [V(S_1), V(S_2), V(S_3)]
x = np.linalg.solve(A, b)
print(x)
As seen above, we end up with the following solution:
$$V(S_1) = 0.293$$$$V(S_2) = 0.498$$$$V(S_3) = 0$$We know that solving a linear system of equations is not a scalable solution, so the remainder of these notebooks will focus on scalable algorithms that can solve this modeling problem!
Let's now take a moment to discuss optimal policies and optimal value functions. You will see that they are interdependent and need to be discussed together. This is a very key concept, and we will learn there is a great deal of depth behind it.
We can talk about the relative "goodness" of policies. Let's say we have two policies, $\pi_1$ and $\pi_2$. We can say that $\pi_1$ is better than $\pi_2$ if the expected return of $\pi_1$ is greater than or equal to the expected return of $\pi_2$:
$$\pi_1 \geq \pi_2 \hspace{1cm} \text{if } V_{\pi_1}(s) \geq V_{\pi_2}(s)\hspace{1cm} \forall s \in S$$Because we can talk about the relative goodness of each policy, and one policy can be better than another policy, we also have the notion of the best policy. In RL this is known as the optimal policy. We use the symbol $\pi_*$ to represent it. The optimal policy is the policy for which no other policy has a greater value function, and the corresponding optimal state-value function is defined as:
$$V_{*}(s) = \max_\pi \{ V_\pi (s)\} \hspace{1cm} \forall s \in S $$Note that optimal policies are not necessarily unique, but the optimal value function is. You can imagine how two different policies can lead to the same rewards, and hence to the same value function. But if one value function is greater than another, then the optimal policy is the one that leads to the greater value function.
We can similarly define the optimal action-value function, as the max of the action-value function over all policies:
$$Q_{*}(s,a) = \max_\pi \{ Q_\pi (s,a)\} \hspace{1cm} \forall s \in S, a \in A$$Because the optimal action-value function is with respect to the optimal policy, we can express it in terms of the optimal state-value function:
$$Q_{*}(s,a) = E \big[ R(t+1) + \gamma V_* (S_{t+1}) \mid S_t = s, A_t = a\big]$$The optimal state-value function and action-value function can be related in this way:
$$V_*(s) = \max_a Q_*(s,a)$$That is, the optimal state-value function corresponds to always choosing the best action, so it is the maximum of $Q_*$ over the actions available in that state.
Notice that $Q$ leads to some practical advantages in implementation. If we only have $V(s)$, then we must try all the possible actions $a$ to see which next states they take us to, so we can get the values for those states. But if we have $Q$, then we can directly choose the best overall action $a$ from all actions $A$.
Let's continue manipulating our equation for $V_*(s)$. We call the result the Bellman optimality equation for the state-value function. Notice how it is similar to, but not quite the same as, the Bellman equation:
$$V_*(s) = \max_a E \big[R(t+1) + \gamma V_*(S_{t+1}) \mid S_t = s, A_t = a\big]$$$$V_*(s) = \max_a \sum_{s', r}p(s', r \mid s, a) \big[r + \gamma V_*(s')\big]$$We can do a similar thing to get the Bellman optimality equation for the action-value function:
$$Q_*(s,a) = E \big[R(t+1) + \gamma \max_{a'} Q_*(S_{t+1}, a') \mid S_t =s, A_t =a\big]$$$$Q_*(s,a) = \sum_{s', r}p(s', r \mid s, a) \big[r + \gamma \max_{a'}Q_*(s', a')\big]$$One final thing to mention: after we have the optimal value function, how do we implement the optimal policy? The key point is that the value function already takes into account future rewards, so nothing extra needs to be done in order to optimize the total expected future reward. All we need to do is choose the action that yields the best value for the next state, $V(s')$.
Notice how, for the state-value function $V$, we need to do a one-step lookahead search, because $V$ itself does not depend on $a$. So, implementation-wise, we need to try all of the possible actions $A$, look at which state each one takes us to, $s'$, and choose the one that gives us the largest $V(s')$.
If we have $Q(s,a)$, we can do this lookup directly and simply choose the argmax. What $Q(s,a)$ does, then, is effectively cache the results of all one-step-ahead searches, as sketched below.
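Here is a short sketch of that difference: extracting a greedy action from $Q$ is a direct argmax over stored values, whereas with only $V$ we must enumerate the actions and look one step ahead through the model. The tables and numbers below are made up purely for illustration.
import numpy as np

# Hypothetical tabular Q for one state s: one entry per action.
actions = ['up', 'down', 'left', 'right']
Q_s = np.array([0.1, 0.5, -0.2, 0.3])
best_from_Q = actions[int(np.argmax(Q_s))]        # direct lookup: no model needed

# With only V, we need the model to look one step ahead for each action:
# argmax_a sum_{s', r} p(s', r | s, a) * (r + gamma * V(s')).
gamma = 0.9
V = {'A': 0.0, 'B': 1.0}
transitions = {'up':    [(1.0, 'A', 0.0)],        # (prob, next_state, reward)
               'down':  [(1.0, 'B', 0.0)],
               'left':  [(1.0, 'A', -0.1)],
               'right': [(0.5, 'A', 0.0), (0.5, 'B', 0.0)]}
lookahead = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
             for a, outcomes in transitions.items()}
best_from_V = max(lookahead, key=lookahead.get)

print(best_from_Q, best_from_V)   # 'down' 'down' in this toy example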
This section was purely theoretical, in order to set up all prerequisites we need for later sections. In particular, we:
- Formalized the Markov decision process framework
- Talked about policies
- Returns (total future reward)
- Discounting future rewards with discount rate $\gamma$
- Rigorously defined the value function and discussed two types: State-value function and Action-value function
We also looked at the Bellman equation, which recursively defines the value function in terms of the value function at the next state:
$$V_\pi(s) = \sum_a \pi(a \mid s) \sum_{s'}\sum_r p(s',r \mid s,a) \Big[ r + \gamma V_\pi(s') \Big]$$
We then looked at the notion of optimality, and defined the optimal policy, optimal state value function, and optimal action value function. Through this, we were able to recursively define the optimal value functions in terms of the optimal value functions at the next state.