We can first note that the difference between supervised and unsupervised machine learning is rather small, especially when contrasted with the much larger gap that separates both of them from reinforcement learning. In the supervised/unsupervised cases, we imagine interacting with essentially the same interface (modeled after scikit-learn):
class SupervisedModel:
    def fit(self, X, Y): ...
    def predict(self, X): ...

class UnsupervisedModel:
    def fit(self, X): ...
    def transform(self, X): ...   # PCA, Autoencoders, RBMs
    # K-Means, GMM, etc. don't really transform data...
The common theme in both cases is that the interface is just training data, which is ultimately just a matrix of numbers. In the case of supervised learning we are then able to make predictions on unseen/future data.
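As a minimal sketch of this shared interface, here is how it looks with scikit-learn's actual estimators (the data below is random and purely illustrative):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA

# Training data is just a matrix of numbers (plus labels in the supervised case)
X = np.random.randn(100, 5)            # 100 samples, 5 features
Y = (X[:, 0] > 0).astype(int)          # binary labels derived from the first feature

# Supervised: fit(X, Y), then predict on unseen data
clf = LogisticRegression()
clf.fit(X, Y)
predictions = clf.predict(np.random.randn(10, 5))

# Unsupervised: fit(X), then transform the data
pca = PCA(n_components=2)
pca.fit(X)
X_reduced = pca.transform(X)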
Reinforcement learning, on the other hand, is different. It is able to guide an agent on how to act in the real world. The interface is much broader than just data (training vectors); it is the entire environment. That environment can be the real world, or it can be a simulated world like a video game. As an example, you could create a reinforcement learning agent to vacuum your house, in which case it would be interfacing with the real world. You could create a robot (another RL agent) that learns to walk; again, it would be interacting with the real world.
There is another big leap when moving from supervised/unsupervised learning to reinforcement learning. In addition to the broader interface, RL algorithms train in a completely different way as well. There are many references to psychology, and indeed RL can be used to model animal behavior. RL algorithms state their objectives in terms of a goal. This is different from supervised learning, where the objective was to get good accuracy or to minimize a cost function. RL algorithms get feedback as the agent interacts with its environment. Feedback signals, also known as rewards, are given to the agent automatically by the environment. This differs greatly from supervised learning, where it can be extremely costly to hand-label data. So, in this way RL is very different from SL:
SL requires a hand-labeled data set. RL learns automatically from signals in the environment.
Phrasing our objective in terms of goals allows us to solve a much wider variety of problems. The goal of AlphaGo is to win at Go. The goal of a video game AI is either to get as far as possible in the game (win the game) or to get the highest score. What is interesting is when you consider animals, specifically humans. Evolutionary biologists such as Richard Dawkins have argued that our genes are selfish, and that all they really "want" to do is make more of themselves. This is very interesting because, just as with AlphaGo, we have found many roundabout and unlikely ways to achieve this. Experts commented that AlphaGo used some surprising and unusual techniques. For example, some people have a desire to be rich and make a lot of money; but what makes you feel that way? Perhaps those with the specific set of genes related to the desire to be rich ended up being more prominent in our gene pool due to natural selection. Perhaps the desire to be rich led to being rich, which led to better healthcare for those types of people, which led to genes maximizing their central goal: to make more of themselves.
In our case, we are not particularly interested in the specifics of whether a person is "desiring money" or "being healthy and strong". For us, it is more interesting that there is just one main objective to maximize, but various novel ways to achieve it. These things are always fluctuating in time. At one point in history, seeking as much sugar as possible would give you energy and help you survive. We still carry that trait since evolution is slow, but in today's world it can actually kill us. Our genes' method of maximizing their reward is mutation and natural selection, which is slow, while an AI's method is reinforcement learning, which is fast.
Of course, you can never sense the entire world at once (even humans don't do this). We have sensors that feed signals from the environment to our brain. These signals don't tell us everything about the room we are in, much less the world. So we necessarily have limited information about our environment, as do robots with a limited number and variety of sensors. The measurements we get from these sensors (e.g. sight, sound, touch) make up a "state". For now we will only look at environments with a small, finite number of states, though it is of course possible to consider environments with an infinite number of states too.
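As a tiny, purely illustrative sketch (the gridworld here is an assumption, not something defined earlier): imagine a robot that can only sense its own (row, column) position on a 3x3 grid. Its state space is then small and finite:

# Hypothetical example: a robot on a 3x3 grid that senses only its own position.
# Each (row, col) pair is one possible state of the environment.
GRID_ROWS, GRID_COLS = 3, 3

states = [(row, col) for row in range(GRID_ROWS) for col in range(GRID_COLS)]
print(len(states))  # 9 possible states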
Now, let's quickly give strong definitions for five important terms that we will see throughout these notebooks.
- Agent: The thing that senses the environment, the thing we're trying to code intelligence/learning into.
- Environment: Real world or simulated world that the agent lives in.
- State: Different configurations of the environment that the agent can sense.
- Reward: This is what differentiates RL from other types of ML algorithms. An agent will try to maximize not only its immediate reward but its future rewards as well. Often, RL algorithms will find novel ways of accomplishing this.
- Actions: Actions are what an agent does in its environment. For example, if you are a 2-D video game character, your actions might be { up, down, left, right, jump }. We will only look at a finite set of actions.
The last three things we just mentioned are often thought of as a triple: state, action, reward. You are in a state, you take an action, and you get a reward. These are referred to as SAR triples.
Timing is also very important in RL. Every time you play a game you get a sequence of states, actions, and rewards. Within this framework you start in a state $S(t)$, you take an action $A(t)$, and you then receive a reward $R(t+1)$. So, the reward you get always results from the state and action $(s, a)$ of the previous step. That action also results in you being in a new state, $S(t+1)$. Another important triple is therefore $(S(t), A(t), S(t+1))$, which can also be denoted as $(s, a, s')$.
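Written out over a whole episode (using the same indexing convention as above), the interaction unrolls as an alternating sequence of states, actions, and rewards:

$$ S(0),\; A(0),\; R(1),\; S(1),\; A(1),\; R(2),\; S(2),\; \dots,\; A(T-1),\; R(T),\; S(T) $$

where each reward $R(t+1)$ follows from the pair $(S(t), A(t))$, and $S(T)$ is the final state when the episode (e.g. one play of the game) ends.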
That is RL in a nutshell. We program an agent to be intelligent, and the agent interacts with its environment by being in a state and taking an action based on that state, which then brings it to another state. The environment gives the agent a reward when it arrives in the next state, which can be positive or negative (but must be a number), and the goal of the agent is to maximize its total reward.
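To make that loop concrete, here is a minimal sketch in Python. The environment and agent are hypothetical placeholders (the names Environment, Agent, reset, step, act, and update are assumptions for illustration, not any particular library's API), but the structure of state, action, reward, next state is exactly the one described above:

import random

# Hypothetical 1-D "walk to the goal" environment, for illustration only.
class Environment:
    def reset(self):
        self.position = 0
        return self.position                       # initial state S(t)

    def step(self, action):                        # action A(t): -1 or +1
        self.position = max(0, self.position + action)
        done = self.position >= 5                  # episode ends at the goal
        reward = 1.0 if done else 0.0              # reward R(t+1) given by the environment
        return self.position, reward, done         # next state S(t+1), reward, done flag

# Hypothetical agent that acts randomly; a real RL agent would learn from (s, a, r, s').
class Agent:
    def act(self, state):
        return random.choice([-1, +1])

    def update(self, s, a, r, s_next):
        pass                                       # learning from the SAR triple would happen here

env, agent = Environment(), Agent()
s = env.reset()
total_reward, done = 0.0, False
while not done:
    a = agent.act(s)                               # take an action based on the current state
    s_next, r, done = env.step(a)                  # environment returns reward and next state
    agent.update(s, a, r, s_next)                  # the (s, a, r, s') transition
    total_reward += r
    s = s_next
print(total_reward)

The agent here never actually learns; the point is only the shape of the interaction loop, which every RL algorithm in these notebooks will share.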