As with the notebooks on Hidden Markov Models, Recurrent Neural Networks are all about learning sequences. But where Markov models are limited by the Markov assumption, recurrent neural networks are not. As a result, they are more expressive and powerful than any of the sequence models we have seen so far.
So, what will these notebooks contain, and how will they build on the previous notebooks surrounding Neural Networks and Hidden Markov Models?
- In the first section, we are going to add time to our neural networks. This will introduce us to the Simple Recurrent Unit, also known as the Elman Unit.
- We will then revisit the XOR problem and extend it into the parity problem. We will demonstrate that regular feedforward neural networks have trouble solving it, while recurrent networks work well, because the key is to treat the input as a sequence.
- Next, we will revisit one of the most popular applications of RNNs: language modeling. In the Markov Models notebooks we generated poetry and discriminated between two different poets using only the sequences of parts-of-speech tags they used. We will extend our language model so that it no longer makes the Markov assumption.
- Another popular application of RNNs is word vectors, or word embeddings. The most common technique for this is called word2vec, but we will go over how RNNs can also be used to create word vectors.
- We will then look at the very popular LSTM (long short-term memory) unit, and the more modern, more efficient GRU (gated recurrent unit), which has been shown to yield comparable performance.
- Finally, we will apply these methods to more practical problems, such as learning a language model from Wikipedia and visualizing the resulting word embeddings.
I will offer a tip that helped me in understanding RNNs: understand the mechanics first, and worry about the "meaning" later. When we talk about LSTMs, we are going to talk about the ability to remember and forget things. Keep in mind that these are just convenient names used by way of analogy. We are not actually building something that remembers or forgets; these are just mathematical formulas. So, focus on the math and let the meaning come naturally to you.
What you most definitely do not want to do is the opposite: try to understand the meaning without understanding the mechanics. When you do that, the result is usually a sensationalist media article or a pop science book. This set of notebooks is the opposite of that; we want to understand, on a technical level, what is happening. Explaining things in layman's terms, or thinking of real-life analogies, is icing on the cake, and only if you understand the technicalities first.
Let's begin by talking about the softmax function. The softmax function is what we use to classify among two or more classes. It is a bit more complicated to work with than the sigmoid (in particular, its derivative is harder to derive), but we will let Theano and TensorFlow take care of that for us. Remember, all we do here is take an array of numbers, exponentiate them, and divide by the sum; that allows us to interpret the output of the softmax as a probability:
$$y_k=\frac{e^{a_k}}{\sum_j e^{a_j}}$$
$$p\big(y=k \mid x \big) =\frac{e^{W_k^T x}}{\sum_j e^{W_j^T x}}$$
where $k$ indexes the classes in the output layer. In other words, our output $y$ is going to be a $(K \times 1)$ vector for a single training example, and an $(N \times K)$ matrix when computed for the entire training set.
In code, it will look like:
import numpy as np

def softmax(a):
    # exponentiate, then normalize each row so it sums to 1
    expA = np.exp(a)
    return expA / expA.sum(axis=1, keepdims=True)
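For example, with a small made-up matrix of activations (one row per sample), each row of the output sums to 1 and can be read as class probabilities:

A = np.array([[1.0, 2.0, 3.0],
              [1.0, 1.0, 1.0]])
print(softmax(A))  # each row sums to 1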
All machine learning models have two main functions: prediction and training. Going in the forward direction is how we do prediction; at the output we get a probability that tells us which answer is most likely. For training, we use gradient descent. That just means we take a cost function (squared error for regression, cross entropy for classification), calculate its derivative with respect to each parameter, and move the parameters slowly in the direction of the negative gradient:
$$W \leftarrow W - \eta \nabla J$$Eventually, we will hit a local minimum and the slope will be 0, so the weights won't change anymore.
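As a minimal sketch (not from the original notebooks), here is what one gradient descent step might look like for a softmax output with a cross-entropy cost, reusing the softmax function defined above. The data, the weight matrix W, and the learning rate are all made up for illustration:

# One gradient descent step for a softmax output with cross-entropy cost.
# X, T, W, and the learning rate are illustrative placeholders.
np.random.seed(0)
N, D, K = 100, 5, 3                                   # samples, input dims, classes
X = np.random.randn(N, D)                             # inputs
T = np.zeros((N, K))
T[np.arange(N), np.random.randint(K, size=N)] = 1     # one-hot targets
W = np.random.randn(D, K)                             # weights
learning_rate = 0.01

Y = softmax(X.dot(W))                                 # forward pass: predicted probabilities
grad = X.T.dot(Y - T)                                 # gradient of cross-entropy w.r.t. W
W = W - learning_rate * grad                          # step in the negative gradient direction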
We have also seen that deep networks can be used to find patterns in data that doesn't have labels (e.g. language). We can learn a probability distribution from the data!
Suppose you have a bunch of states; these can represent words in a sentence, the weather, what page of a website you are on, etc. We can define them as:
$$states = \{ 1,2,...,M\}$$
A Markov model is a model that makes the Markov assumption:
The next state only depends on the previous state:
$$p\big(s(t) \mid s(t-1), s(t-2),...,s(1)\big) = p \big(s(t) \mid s(t-1)\big)$$
For example, if you want to know whether or not it is going to rain tomorrow, you assume that it only depends on today's weather. As another example, consider the following sentence:
"I love dogs and cats."
Let's say we are trying to predict the last word, which we know is "cats." But if all we are given is the word "and," it is impossible to predict "cats" from that information alone. In these notebooks, we will see that our models no longer make the Markov assumption!
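To make this concrete, here is a tiny sketch of a first-order Markov (bigram) model over words; the corpus below is invented purely for illustration. Everything the model knows about the word following "and" is a single row of bigram counts, no matter what came earlier in the sentence:

# A toy first-order Markov (bigram) model; the corpus is made up for illustration.
from collections import defaultdict, Counter

corpus = [
    "i love dogs and cats",
    "i love tea and coffee",
    "dogs and cats play",
]

bigram_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, curr in zip(words[:-1], words[1:]):
        bigram_counts[prev][curr] += 1

# p(next word | "and"): the only context a first-order model can use
total = sum(bigram_counts["and"].values())
for word, count in bigram_counts["and"].items():
    print(word, count / total)  # cats: 2/3, coffee: 1/3; earlier words are ignored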