AI

1. Reinforcement Learning

1. What is Reinforcement Learning
   We can first note that the difference between supervised and unsupervised machine learning is rather small compared to the very large difference between either of those and reinforcement learning. In both the supervised and unsupervised cases, we imagine interacting with the same interface (modeled after scikit-learn), as sketched below:
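   A minimal sketch of that shared interface, assuming a scikit-learn-style fit/predict pair; the toy model below is purely illustrative and not an example from the text:

```python
import numpy as np

class MeanRegressor:
    """Toy supervised model exposing the usual fit/predict interface:
    it simply memorizes the mean of the training labels."""

    def fit(self, X, Y):
        self.mean_ = float(np.mean(Y))   # "training" is just computing a statistic
        return self

    def predict(self, X):
        # an unsupervised model exposes the same shape of interface,
        # e.g. fit(X) without labels, then predict/transform afterwards
        return np.full(len(X), self.mean_)

X = np.random.randn(100, 3)
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.random.randn(100)
model = MeanRegressor().fit(X, Y)
print(model.predict(X[:5]))
```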

2. Multi-Armed Bandit
   Imagine you are at a casino playing slots (hence the term "arm"). The slot machines are bandits because they are taking your money. Not all of the slot machines are equal, and their win rates are generally unknown; each machine may pay out with a different hidden probability, and we would like to find the best one while losing as little money as possible. A minimal simulation of this setup is sketched below.
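   The following is a hedged sketch of the bandit problem using an epsilon-greedy strategy; the win rates, epsilon, and number of plays are made-up values for illustration, not figures from the text:

```python
import numpy as np

np.random.seed(0)
true_win_rates = [0.2, 0.5, 0.75]          # hidden from the player
estimates = np.zeros(len(true_win_rates))  # running estimate of each machine's win rate
pulls = np.zeros(len(true_win_rates))
epsilon = 0.1

for _ in range(10_000):
    if np.random.rand() < epsilon:
        arm = np.random.randint(len(true_win_rates))  # explore: try a random machine
    else:
        arm = int(np.argmax(estimates))               # exploit: play the current best guess
    reward = float(np.random.rand() < true_win_rates[arm])
    pulls[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / pulls[arm]  # incremental sample mean

print(estimates)  # should be close to the true win rates, especially for the best arm
```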

3. Components of an RL System
   Let's quickly recall what we discussed earlier concerning the components of an RL system, such as the agent, the environment, and the states, actions, and rewards that pass between them. A toy interaction loop showing how these pieces fit together is sketched below.
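   As a reference, here is a minimal, hypothetical agent-environment loop; the coin-flip environment and the random policy are stand-ins chosen only to show how the components interact:

```python
import random

class CoinFlipEnv:
    """Toy environment: the state is a step counter, the reward is a fair coin flip,
    and an episode lasts ten steps."""

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        reward = 1.0 if random.random() < 0.5 else 0.0
        done = self.t >= 10
        return self.t, reward, done   # next state, reward, episode-finished flag

def random_policy(state, actions=(0, 1)):
    """The agent: a policy maps a state to an action (here, uniformly at random)."""
    return random.choice(actions)

env = CoinFlipEnv()
state, done, total_reward = env.reset(), False, 0.0
while not done:
    action = random_policy(state)             # agent acts...
    state, reward, done = env.step(action)    # ...environment returns new state and reward
    total_reward += reward
print(total_reward)
```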

4. Markov Decision Processes
   We are now going to formalize some of the concepts that we have learned about in reinforcement learning, such as states, actions, rewards, and returns.
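   For reference, one standard way to write this formalization (the exact symbols here are an assumption and may differ slightly from the notation used later) is as a tuple of states, actions, transition dynamics, rewards, and a discount factor, together with the discounted return the agent tries to maximize:

   $$\mathcal{M} = (\mathcal{S}, \mathcal{A}, p, r, \gamma), \qquad p(s' \mid s, a) = P(S_{t+1} = s' \mid S_t = s, A_t = a)$$

   $$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

   The Markov property is the key assumption: the transition probability depends only on the current state and action, not on the full history.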

5. Intro to Dynamic Programming and Iterative Policy Evaluation
   We are now going to start looking at solutions to MDPs. As we saw in the last section, the centerpiece of the discussion is the Bellman equation, shown below.
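   One common form of the Bellman equation for the state-value function under a policy $\pi$ (the notation here is our choice and may differ slightly from the text's):

   $$V_\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} \sum_{r} p(s', r \mid s, a)\big[r + \gamma V_\pi(s')\big]$$

   Iterative policy evaluation simply turns this equation into an update rule and sweeps over the states until the values stop changing. A minimal sketch on a made-up three-state MDP (the states, dynamics, and policy below are illustrative assumptions):

```python
# transitions[s][a] = list of (probability, next_state, reward); all values are made up
gamma = 0.9
states = [0, 1, 2]
transitions = {
    0: {0: [(1.0, 1, 0.0)], 1: [(1.0, 2, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 2, 2.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},   # state 2 is absorbing
}
policy = {s: {0: 0.5, 1: 0.5} for s in states}     # uniform random policy

V = {s: 0.0 for s in states}
while True:
    delta = 0.0
    for s in states:
        new_v = sum(
            pi_a * p * (r + gamma * V[s2])
            for a, pi_a in policy[s].items()
            for p, s2, r in transitions[s][a]
        )
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < 1e-6:               # stop once the value function has converged
        break
print(V)
```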

6. Monte Carlo Intro
   In this section we are going to discuss another technique for solving MDPs, known as Monte Carlo. In the last section, you may have noticed something a bit odd: we have talked about how RL is all about learning from experience and playing games, yet in none of our dynamic programming algorithms did we actually play the game. We had a full model of the environment, which included all of the state transition probabilities. You may wonder: is it reasonable to assume that we would have that type of information in a real-life environment? For board games, perhaps. But what about self-driving cars?
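   As a reference, the core Monte Carlo idea can be sketched in a few lines: estimate $V(s)$ by averaging the returns actually observed after visiting $s$, using only sampled episodes and no transition model. The random-walk environment and first-visit convention below are illustrative assumptions:

```python
import random
from collections import defaultdict

gamma = 1.0

def play_episode():
    """Random walk on states 0..4, starting at 2; the episode ends at either edge,
    with a reward of 1 only for reaching state 4."""
    s, trajectory = 2, []
    while 0 < s < 4:
        s_next = s + random.choice([-1, 1])
        reward = 1.0 if s_next == 4 else 0.0
        trajectory.append((s, reward))
        s = s_next
    return trajectory

returns = defaultdict(list)
for _ in range(5000):
    trajectory = play_episode()
    G, first_return = 0.0, {}
    for s, r in reversed(trajectory):
        G = r + gamma * G
        first_return[s] = G    # keeps overwriting, so the earliest visit's return survives
    for s, g in first_return.items():
        returns[s].append(g)

V = {s: sum(g) / len(g) for s, g in returns.items()}
print(V)   # for this symmetric walk, V(1), V(2), V(3) approach 0.25, 0.5, 0.75
```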

7. Temporal Difference Learning Introduction
   We are now going to look at a third method for solving MDPs: Temporal Difference (TD) Learning. TD is one of the most important ideas in RL, and we will see how it combines ideas from the first two techniques, Dynamic Programming and Monte Carlo.
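   As a reference, the TD(0) prediction update can be sketched as follows: like DP it bootstraps from the next state's current estimate, and like Monte Carlo it learns from sampled experience rather than a model. The random-walk environment, step size, and episode count are illustrative assumptions:

```python
import random

gamma, alpha = 1.0, 0.1
V = {s: 0.0 for s in range(5)}    # states 0..4, where 0 and 4 are terminal

for _ in range(5000):
    s = 2
    while 0 < s < 4:
        s_next = s + random.choice([-1, 1])
        reward = 1.0 if s_next == 4 else 0.0
        # TD(0): nudge V(s) toward the bootstrapped target r + gamma * V(s')
        V[s] += alpha * (reward + gamma * V[s_next] - V[s])
        s = s_next

print(V)   # interior states drift toward roughly 0.25, 0.5, 0.75
```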

8. Approximation Methods
   We are now going to look at approximation methods. Recall that in the last section, we discussed a major disadvantage of all of the methods we have studied so far: they all require us to estimate the value function for each state, and in the case of the action-value function, for each state and action pair. We learned early on that the state space can grow very large, very quickly, which makes all of the methods we have studied impractical for large problems.
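   As a reference, the basic idea can be sketched with linear function approximation and a semi-gradient TD(0) update: instead of a table entry per state, we learn a weight vector and represent $V(s) \approx w^\top x(s)$. The one-hot feature map and the random-walk environment below are illustrative assumptions; richer features are what let the approach generalize across huge state spaces:

```python
import random
import numpy as np

n_states = 5                      # states 0..4, where 0 and 4 are terminal
gamma, alpha = 1.0, 0.05

def features(s):
    """One-hot feature vector; with these features the method reduces to the tabular case,
    but any fixed-length feature map (tiles, neural nets, ...) plugs in the same way."""
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

w = np.zeros(n_states)

def v(s):
    return w @ features(s)        # V(s) is now a function of the weights, not a table entry

for _ in range(5000):
    s = 2
    while 0 < s < 4:
        s_next = s + random.choice([-1, 1])
        reward = 1.0 if s_next == 4 else 0.0
        target = reward if s_next in (0, 4) else reward + gamma * v(s_next)
        w += alpha * (target - v(s)) * features(s)   # semi-gradient TD(0) step
        s = s_next

print([round(v(s), 2) for s in range(1, 4)])   # roughly 0.25, 0.5, 0.75
```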


© 2018 Nathaniel Dake