Nathaniel Dake Blog

4. Advanced RNN Units

We are now going to move into the realm of advanced RNN units. Up until this point, we have been dealing exclusively with the simple recurrent unit. However, implementing rated recurrent units, Gated Recurrent Units, and Long Short-Term Memory units will allow our models to become far more powerful.

1. Rated RNN Units

A rated recurrent unit is simply a straightforward modification to the simple recurrent unit. Recall, a simple recurrent unit has the form:

Which can be observed on a lower level as:

And mathematically, we can define $h(t)$ as:

$$h(t) = f\big(W_x^T x(t) + W_h^T h(t-1)\big)$$

Where $f$ is our chosen activation function. The idea is that we want to weight two things:

  1. $f\big(x(t), h(t-1)\big)$, which is the output that we would have gotten in a simple recurrent unit.
  2. $h(t-1)$, the previous value of the hidden state.

In order to accomplish this weighting, we use a new parameter $z$, known as the rate. In its simplest form it is a vector the same size as the hidden layer, and we perform an element-by-element multiplication along each dimension:

$$h(t) = (1-z) \odot h(t-1) + z \odot f\big( x(t), h(t-1)\big)$$

$z$ is known as the rate. You may notice that this looks very similar to the low-pass filter that we created in the previous post. The idea here is to learn $z$ in a way that lets some values from previous points in time carry greater weight, while others depend mainly on the most recent values. Visually, the rated recurrent unit looks like:

The changes needed to implement this in code are relatively small and straightforward. We simply need to add a new parameter of size $M$ and include the above equation in the recurrence function.
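
To make that concrete, here is a minimal numpy sketch of the update for a single time step (the sizes are hypothetical, and passing the new parameter through a sigmoid is one way to keep each rate between 0 and 1):

import numpy as np

D, M = 4, 3                                        # hypothetical input and hidden sizes
Wx, Wh = np.random.randn(D, M), np.random.randn(M, M)
bh, z = np.zeros(M), np.random.randn(M)            # z is the new parameter of size M
x_t, h_t1 = np.random.randn(D), np.zeros(M)

hhat_t = np.tanh(x_t.dot(Wx) + h_t1.dot(Wh) + bh)  # what the simple recurrent unit would output
rate = 1 / (1 + np.exp(-z))                        # element-wise rate, squashed into (0, 1)
h_t = (1 - rate) * h_t1 + rate * hhat_t            # weighted mix of old state and new output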

1.1.1 Rate Matrix Modifications

There are more options for the rate matrix than simply what was discussed above. For example, we can make it dependent on the input and previous hidden state via a sigmoid and more weight matrices.
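
For example, this is the form used in the implementation below:

$$z_t = \sigma \big(x_t W_{xz} + h_{t-1}W_{hz} + b_z\big)$$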

We can also make it a matrix (size $M \times M$) so that it multiplies against $h$ via matrix multiplication instead of element-wise multiplication. In this case, you can picture the same kind of everything-connected-to-everything scenario that we had for the hidden-to-hidden weights.
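
As a sketch of that variant, with $Z$ an $M \times M$ matrix and $I$ the identity matrix (so the two gates still sum to the identity), the update could be written as:

$$h(t) = (I - Z) \, h(t-1) + Z \, f\big(x(t), h(t-1)\big)$$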

1.1.2 Modifications to Poetry Generation

The next thing that we are going to do is redo our poem generation example with the rated recurrent unit. It will be clear that this is just a small modification of the simple recurrent unit, so the code is almost exactly the same. Note, this new parameter will be trained in exactly the same way as all of the other parameters: via gradient descent. In fact, when we look at more complex models, the training will still remain the same.

To add a bit more complexity to the mix, we are going to try to solve the problem of the lines ending too quickly. Recall that this is because we have many training samples that result in the next word being the end token. Because the end token shows up in every line, it is overrepresented in the training data. To prevent this, we will only train on going all the way to the end token 10% of the time; otherwise we stop at the second-to-last word.

Another change that we will make is that we will not model the initial word distribution anymore. Recall, we used this to prevent every line from starting with the same word, which would happen because a neural network gives us the same prediction every time if we take the argmax. Instead, we will use the START token as the input, and the output will represent a probability distribution. This is valid because we already know that softmax gives us a valid probability distribution. We can then sample from this distribution to generate the next word. This way the sequences that we generate will be more stochastic, and we treat the output as an actual probability distribution rather than a deterministic prediction:

p_w0 = softmax(f(START))
w0 = np.random.choice(V, p=p_w0)
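
As a quick illustration of that sampling step (a toy numpy example with a hypothetical 4-word vocabulary, not part of the model code):

import numpy as np

logits = np.array([2.0, 1.0, 0.5, 0.1])         # stand-in for f(START)
p_w0 = np.exp(logits) / np.exp(logits).sum()    # softmax gives a valid probability distribution
w0 = np.random.choice(len(p_w0), p=p_w0)        # sample a word index instead of taking the argmax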

1.1.3 Rated RNN Unit in Code

In [11]:
import theano
import theano.tensor as T
import numpy as np
import matplotlib.pyplot as plt

from sklearn.utils import shuffle
from rnn_util import init_weight, get_robert_frost


class SimpleRNN:
    def __init__(self, D, M, V):
        self.D = D # dimensionality of word embedding
        self.M = M # hidden layer size
        self.V = V # vocabulary size

    def fit(self, X, learning_rate=10., mu=0.9, reg=0., activation=T.tanh, epochs=500, show_fig=False):
        N = len(X)
        D = self.D
        M = self.M
        V = self.V

        # initial weights
        We = init_weight(V, D)
        Wx = init_weight(D, M)
        Wh = init_weight(M, M)
        bh = np.zeros(M)
        h0 = np.zeros(M)
        # z  = np.ones(M)
        Wxz = init_weight(D, M)
        Whz = init_weight(M, M)
        bz  = np.zeros(M)
        Wo = init_weight(M, V)
        bo = np.zeros(V)

        thX, thY, py_x, prediction = self.set(We, Wx, Wh, bh, h0, Wxz, Whz, bz, Wo, bo, activation)

        lr = T.scalar('lr')

        cost = -T.mean(T.log(py_x[T.arange(thY.shape[0]), thY]))
        grads = T.grad(cost, self.params)
        dparams = [theano.shared(p.get_value()*0) for p in self.params]

        updates = []
        for p, dp, g in zip(self.params, dparams, grads):
            new_dp = mu*dp - lr*g
            updates.append((dp, new_dp))

            new_p = p + new_dp
            updates.append((p, new_p))

        self.predict_op = theano.function(inputs=[thX], outputs=prediction)
        self.train_op = theano.function(
            inputs=[thX, thY, lr],
            outputs=[cost, prediction],
            updates=updates
        )

        costs = []
        for i in range(epochs):
            X = shuffle(X)
            n_correct = 0
            n_total = 0
            cost = 0
            for j in range(N):
                if np.random.random() < 0.1:
                    input_sequence = [0] + X[j]
                    output_sequence = X[j] + [1]
                else:
                    input_sequence = [0] + X[j][:-1]
                    output_sequence = X[j]
                n_total += len(output_sequence)

                # we set 0 to start and 1 to end
                c, p = self.train_op(input_sequence, output_sequence, learning_rate)
                # print "p:", p
                cost += c
                # print "j:", j, "c:", c/len(X[j]+1)
                for pj, xj in zip(p, output_sequence):
                    if pj == xj:
                        n_correct += 1
            if i % 50 == 0:
                print("i:", i, "cost:", cost, "correct rate:", (float(n_correct)/n_total))
            if (i + 1) % 500 == 0:
                learning_rate /= 2
            costs.append(cost)

        if show_fig:
            plt.plot(costs)
            plt.show()
    
    def save(self, filename):
        np.savez(filename, *[p.get_value() for p in self.params])

    @staticmethod
    def load(filename, activation):
        # TODO: would prefer to save activation to file too
        npz = np.load(filename)
        We = npz['arr_0']
        Wx = npz['arr_1']
        Wh = npz['arr_2']
        bh = npz['arr_3']
        h0 = npz['arr_4']
        Wxz = npz['arr_5']
        Whz = npz['arr_6']
        bz = npz['arr_7']
        Wo = npz['arr_8']
        bo = npz['arr_9']
        V, D = We.shape
        _, M = Wx.shape
        rnn = SimpleRNN(D, M, V)
        rnn.set(We, Wx, Wh, bh, h0, Wxz, Whz, bz, Wo, bo, activation)
        return rnn

    def set(self, We, Wx, Wh, bh, h0, Wxz, Whz, bz, Wo, bo, activation):
        self.f = activation

        # redundant - see how you can improve it
        self.We = theano.shared(We)
        self.Wx = theano.shared(Wx)
        self.Wh = theano.shared(Wh)
        self.bh = theano.shared(bh)
        self.h0 = theano.shared(h0)
        self.Wxz = theano.shared(Wxz)
        self.Whz = theano.shared(Whz)
        self.bz = theano.shared(bz)
        self.Wo = theano.shared(Wo)
        self.bo = theano.shared(bo)
        self.params = [self.We, self.Wx, self.Wh, self.bh, self.h0, self.Wxz, self.Whz, self.bz, self.Wo, self.bo]

        thX = T.ivector('X')
        Ei = self.We[thX] # will be a TxD matrix
        thY = T.ivector('Y')

        def recurrence(x_t, h_t1):
            # returns h(t), y(t)
            hhat_t = self.f(x_t.dot(self.Wx) + h_t1.dot(self.Wh) + self.bh)
            z_t = T.nnet.sigmoid(x_t.dot(self.Wxz) + h_t1.dot(self.Whz) + self.bz)
            h_t = (1 - z_t) * h_t1 + z_t * hhat_t
            y_t = T.nnet.softmax(h_t.dot(self.Wo) + self.bo)
            return h_t, y_t

        [h, y], _ = theano.scan(
            fn=recurrence,
            outputs_info=[self.h0, None],
            sequences=Ei,
            n_steps=Ei.shape[0],
        )

        py_x = y[:, 0, :]
        prediction = T.argmax(py_x, axis=1)
        self.predict_op = theano.function(
            inputs=[thX],
            outputs=[py_x, prediction],
            allow_input_downcast=True,
        )
        return thX, thY, py_x, prediction


    def generate(self, word2idx):
        # convert word2idx -> idx2word
        idx2word = {v:k for k,v in word2idx.items()}
        V = len(word2idx)
        n_lines = 0
        X = [ 0 ]
        while n_lines < 4:
            # print "X:", X
            PY_X, _ = self.predict_op(X)
            PY_X = PY_X[-1].flatten()
            P = [ np.random.choice(V, p=PY_X)]
            X = np.concatenate([X, P]) # append to the sequence
            P = P[-1] # just grab the most recent prediction
            if P > 1:
                # it's a real word, not start/end token
                word = idx2word[P]
                print(word, end=" ")
            elif P == 1:
                # end token
                n_lines += 1
                X = [0]
                print('')


def train_poetry():
    sentences, word2idx = get_robert_frost()
    rnn = SimpleRNN(50, 50, len(word2idx))
    rnn.fit(sentences, learning_rate=1e-4, show_fig=True, activation=T.nnet.relu, epochs=2000)
    rnn.save('RRNN_D50_M50_epochs2000_relu.npz')

def generate_poetry():
    sentences, word2idx = get_robert_frost()
    rnn = SimpleRNN.load('RRNN_D50_M50_epochs2000_relu.npz', T.nnet.relu)
    rnn.generate(word2idx)


if __name__ == '__main__':
#     train_poetry()
    generate_poetry()
and give of people a hornyhanded kindness as the arm ground snow snow cellar out 
and piano tell here him was stop in her me 
of then acquaintance out but a house them 
they be leafs a long it didnt held 

2. Gated Recurrent Unit

We are now going to introduce a unit that is more powerful than the Elman unit: the Gated Recurrent Unit (GRU). GRUs were introduced in 2014, while LSTMs (which we will go over next) were introduced in 1997. I have chosen to start with GRUs because they are slightly less complex, and thus a better place to begin. The GRU is very similar to the LSTM, incorporating many of the same concepts but having far fewer parameters, meaning it trains faster for a given hidden layer size.

2.1 GRU Architecture

To start, let's go over the architecture of the GRU. Before we do that, however, I want to quickly recap the previous architectures that we have seen, particularly treating the hidden unit as a black box of sorts. This compartmental point of view will help us in extending the simple unit and rated unit to the GRU.

Feedforward Unit
In the simplest feedforward network, this black box simply contains some nonlinear function like $tanh$ or $relu$:

Simple Recurrent Unit
In a simple recurrent network, we just connect the output of the black box back to itself, with a time delay of one:

Rated Recurrent Unit
With the rated recurrent unit, we add a rating operation between what would have been the output of the simple recurrent network, and the previous output value.

We can think of this new operation as a gate. Since it has to take on a value between 0 and 1, and the other gate takes on 1 minus that value, it is a gate that is choosing between two things: keeping the old value or taking on the new value. The result is that we get a mixture of both.

Gated Recurrent Unit
Finally, we arrive at the gated recurrent unit! The architecture simply requires that we add one more gate:

The above diagram corresponds with the following mathematical equations:

$$r_t = \sigma \big(x_t W_{xr} + h_{t-1}W_{hr} + b_r \big)$$

$$z_t = \sigma \big(x_t W_{xz} + h_{t-1}W_{hz} + b_z\big)$$

$$\hat{h}_t = g \big(x_t W_{xh} + (r_t \odot h_{t-1})W_{hh} + b_h\big)$$

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \hat{h}_t$$

Note: $g$ represents an activation function, and $\odot$ is the symbol for element-wise multiplication. With that said, we are simply seeing more of what we have encountered thus far: weight matrices multiplied by inputs and passed through nonlinear functions and gates.

Again, we see the update gate $z_t$ that we encountered with the rated unit. It still balances how much of the previous hidden value and how much of the new candidate hidden value combine to form the new hidden value.

The new component is $r_t$, the reset gate, which has the exact same functional form as the update gate (all of its weights are the same size). However, its position in the black box is different. The reset gate is multiplied by the previous hidden state value, so it controls how much of the previous hidden state we consider when we create the new candidate hidden value. In other words, it has the ability to reset the hidden value.

For instance, consider the situation where $r_t$ is equal to 0, then we get:

$$\hat{h}_t = g \big(x_t W_{xh} + b_h\big)$$

This would be as if $x_t$ were the beginning of a new sequence. Note that this is not the full picture, since $\hat{h}_t$ is only a candidate for the new $h_t$: $h_t$ will be a combination of $\hat{h}_t$ and $h_{t-1}$, controlled by the update gate $z_t$.

Now, if that all sounds a bit complex, there is a practical way to reason with it. At the most general level, we are simply adding more parameters to our model, and making it more expressive. Adding more parameters allows us to fit more complex patterns.

2.2 GRU in Code

When it comes to implementing a GRU in code, it should be relatively simple if you already understand the simple recurrent unit and the rated recurrent unit. We simply need to add more weights and modify the recurrence. The next step to making our code better is to modularize it. We have mentioned that the GRU can be looked at as a black box. To take advantage of this, we can make the GRU into a class so that it can be abstracted away. By doing this, we can stack GRUs, meaning anywhere that a hidden layer could go, a GRU can go as well. This allows us to think of it as a thing that takes an input and produces an output; the fact that it contains a memory of previous inputs is just an internal detail of the black box.

Some pseudocode is shown below:

class GRU:
    def __init__(Mi, Mo):
        Wxr = random(Mi, Mo)
        # ...

    def recurrence(x_t, h_t1):
        r = sigmoid(x_t.dot(Wxr) + h_t1.dot(Whr) + br)
        # ...
        return (1 - z)*h_t1 + z*hhat

    def output(x):
        return scan(recurrence, x)

And, a full class implementation:

In [1]:
import numpy as np
import theano
import theano.tensor as T

from rnn_util import init_weight


class GRU:
    def __init__(self, Mi, Mo, activation):
        self.Mi = Mi
        self.Mo = Mo
        self.f = activation

        Wxr = init_weight(Mi, Mo) # Input into reset gate
        Whr = init_weight(Mo, Mo) # Hidden into reset gate
        br = np.zeros(Mo)         # Bias reset gate
        Wxz = init_weight(Mi, Mo) # Input to update gate
        Whz = init_weight(Mo, Mo) # Hidden to update gate
        bz = np.zeros(Mo)         # Bias update gate
        Wxh = init_weight(Mi, Mo)
        Whh = init_weight(Mo, Mo)
        bh = np.zeros(Mo)
        h0 = np.zeros(Mo)

        # Create theano variables
        self.Wxr = theano.shared(Wxr)
        self.Whr = theano.shared(Whr)
        self.br = theano.shared(br)
        self.Wxz = theano.shared(Wxz)
        self.Whz = theano.shared(Whz)
        self.bz = theano.shared(bz)
        self.Wxh = theano.shared(Wxh)
        self.Whh = theano.shared(Whh)
        self.bh = theano.shared(bh)
        self.h0 = theano.shared(h0)
        self.params = [self.Wxr, self.Whr, self.br, self.Wxz, self.Whz, self.bz, self.Wxh, self.Whh, self.bh, self.h0]

    def recurrence(self, x_t, h_t1):
        r = T.nnet.sigmoid(x_t.dot(self.Wxr) + h_t1.dot(self.Whr) + self.br) # Reset gate
        z = T.nnet.sigmoid(x_t.dot(self.Wxz) + h_t1.dot(self.Whz) + self.bz) # Update gate
        hhat  = self.f(x_t.dot(self.Wxh) +(r*h_t1).dot(self.Whh) + self.bh)  # Candidate for h
        h = (1 - z)*h_t1 + z*hhat
        return h

    def output(self, x):
        # Output for this unit taking in an input sequence X. X is 2 dimensional: Time x Dimension
        h, _ = theano.scan(
            fn=self.recurrence,
            sequences=x,
            outputs_info=[self.h0],
            n_steps=x.shape[0]
        )
        return h
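
To make the black-box idea concrete, here is a minimal sketch (assuming the GRU class above is in scope; the layer sizes are arbitrary) of stacking two units so that one unit's hidden sequence becomes the next unit's input. The Wikipedia model later in this post builds its hidden layers in exactly this way:

thX = T.matrix('X')                          # a T x D sequence of input vectors
gru1 = GRU(Mi=20, Mo=30, activation=T.tanh)
gru2 = GRU(Mi=30, Mo=30, activation=T.tanh)

Z = gru1.output(thX)                         # T x 30 hidden sequence from the first unit
Z = gru2.output(Z)                           # fed straight into the second unit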

3. Long Short-Term Memory (LSTM)

We are now going to move on from the GRU and start working with the LSTM. LSTM stands for Long Short-Term Memory, and it is a type of recurrent unit that has become very popular in recent years due to its superior performance and its ability to mitigate the vanishing gradient problem. Each recurrent unit that we have dug into has gotten progressively more complex, and each has incorporated concepts from the previous one.

The LSTM is the most complex recurrent unit we will cover. However, that complexity does not involve abstract ideas or new mathematics; rather, the number of components that we are dealing with is higher. Essentially, we will now have three different gates, called the input, output, and forget gate. Additionally, we are going to add yet another internal unit that will exist alongside the hidden state, sometimes referred to as a memory cell. One way to think of this cell is that it takes the place of $\hat{h}$, or the candidate hidden value, when we talked about the GRU.

3.1 LSTM Mathematics

Now, the input gate, $i_t$, and forget gate, $f_t$, should remind you of the rate gate, $z_t$, from the GRU. Before, we were using $z_t$ and $1 - z_t$; however, now we have two separate gates:

Input Gate
The input gate just controls how much of the new value goes into the cell:

$$i_t = \sigma\big(x_t W_{xi} + h_{t-1}W_{hi} + c_{t-1}W_{ci} + b_i\big)$$

Forget Gate
The forget gate controls how much of the previous cell value goes into the current cell value:

$$f_t = \sigma\big( x_t W_{xf} + h_{t-1}W_{hf} + c_{t-1}W_{cf} + b_f\big)$$

Candidate and New Cell Value
The candidate for the new cell value looks a lot like the simple recurrent unit's value. The new cell value combines the previous cell value (scaled by the forget gate) with this candidate (scaled by the input gate):

$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh\big(x_t W_{xc} + h_{t-1}W_{hc} + b_c\big)$$

Output Gate
Finally, the output gate takes into account everything: the input at time $t$, the previous hidden state, and the current cell value:

$$o_t = \sigma \big( x_t W_{xo} + h_{t-1}W_{ho} + c_t W_{co} + b_o\big) $$

New Hidden State
The new hidden state is just the $tanh$ of the cell value, multiplied by the output gate:

$$h_t = o_t \odot \tanh(c_t)$$

3.2 LSTM Parameters

Now, we just introduced a substantial number of new parameters, so it may be useful to try and break down exactly what they are.

| Model Component | Parameters | Depends on |
| --- | --- | --- |
| Input Gate | $W_{xi}$, $W_{hi}$, $W_{ci}$, $b_i$ | $x_t$, $h_{t-1}$, $c_{t-1}$ |
| Forget Gate | $W_{xf}$, $W_{hf}$, $W_{cf}$, $b_f$ | $x_t$, $h_{t-1}$, $c_{t-1}$ |
| Candidate Cell | $W_{xc}$, $W_{hc}$, $b_c$ | $x_t$, $h_{t-1}$ |
| Output Gate | $W_{xo}$, $W_{ho}$, $W_{co}$, $b_o$ | $x_t$, $h_{t-1}$, $c_t$ |

In total, we have 15 new weights and biases for this black box! We have come a long way from our original feedforward net, where we only had 1 weight and 1 bias. Another way that we can think of this is that as we add more parameters, we enable our model to be more expressive.
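
As a quick sanity check on that count, using the shapes from the implementation below (input size $M_i$, hidden/cell size $M_o$): there are 4 input-to-unit matrices of size $M_i \times M_o$, 7 hidden- or cell-to-unit matrices of size $M_o \times M_o$, and 4 bias vectors of length $M_o$, giving

$$4 M_i M_o + 7 M_o^2 + 4 M_o$$

individual weights in total, plus the initial states $c_0$ and $h_0$.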

3.3 LSTM Architecture

Now, from an architecture perspective, an LSTM has the high level form of:

Where $\hat{c}$ is just meant to represent $x_t W_{xc} + h_{t-1}W_{hc} + b_c$ in the candidate equation. Now, for some this diagram may be sufficient; however, I would prefer to have a bit more insight into the inner workings of the LSTM at a lower level. This can be seen below:

And we can highlight the actual cell as follows:

I encourage you to take the equations that we defined earlier and follow them through the architecture diagram above to make sure they make sense. I should point out that, in order to reduce visual noise in the diagram, I didn't include the weight matrices or biases.

3.4 LSTM in Code

We can now implement an LSTM in code:

In [4]:
import numpy as np
import theano
import theano.tensor as T

from rnn_util import init_weight


class LSTM:
    def __init__(self, Mi, Mo, activation):
        self.Mi = Mi
        self.Mo = Mo
        self.f = activation

        Wxi = init_weight(Mi, Mo)
        Whi = init_weight(Mo, Mo)
        Wci = init_weight(Mo, Mo)
        bi = np.zeros(Mo)
        Wxf = init_weight(Mi, Mo)
        Whf = init_weight(Mo, Mo)
        Wcf = init_weight(Mo, Mo)
        bf = np.zeros(Mo)
        Wxc = init_weight(Mi, Mo)
        Whc = init_weight(Mo, Mo)
        bc = np.zeros(Mo)
        Wxo = init_weight(Mi, Mo)
        Who = init_weight(Mo, Mo)
        Wco = init_weight(Mo, Mo)
        bo = np.zeros(Mo)
        c0 = np.zeros(Mo)
        h0 = np.zeros(Mo)

        # Create Theano variables
        self.Wxi = theano.shared(Wxi)
        self.Whi = theano.shared(Whi)
        self.Wci = theano.shared(Wci)
        self.bi = theano.shared(bi)
        self.Wxf = theano.shared(Wxf)
        self.Whf = theano.shared(Whf)
        self.Wcf = theano.shared(Wcf)
        self.bf = theano.shared(bf)
        self.Wxc = theano.shared(Wxc)
        self.Whc = theano.shared(Whc)
        self.bc = theano.shared(bc)
        self.Wxo = theano.shared(Wxo)
        self.Who = theano.shared(Who)
        self.Wco = theano.shared(Wco)
        self.bo = theano.shared(bo)
        self.c0 = theano.shared(c0)
        self.h0 = theano.shared(h0)
        self.params = [
            self.Wxi,
            self.Whi,
            self.Wci,
            self.bi,
            self.Wxf,
            self.Whf,
            self.Wcf,
            self.bf,
            self.Wxc,
            self.Whc,
            self.bc,
            self.Wxo,
            self.Who,
            self.Wco,
            self.bo,
            self.c0,
            self.h0,
        ]

    def recurrence(self, x_t, h_t1, c_t1):
        i_t = T.nnet.sigmoid(x_t.dot(self.Wxi) + h_t1.dot(self.Whi) + c_t1.dot(self.Wci) + self.bi)
        f_t = T.nnet.sigmoid(x_t.dot(self.Wxf) + h_t1.dot(self.Whf) + c_t1.dot(self.Wcf) + self.bf)
        c_t = f_t*c_t1 + i_t*T.tanh(x_t.dot(self.Wxc) + h_t1.dot(self.Whc) + self.bc)
        o_t = T.nnet.sigmoid(x_t.dot(self.Wxo) + h_t1.dot(self.Who) + c_t.dot(self.Wco) + self.bo)
        h_t = o_t*T.tanh(c_t)
        return h_t, c_t

    def output(self, x):
        # input X should be a matrix (2-D)
        # rows index time
        [h, c], _ = theano.scan(
            fn=self.recurrence,
            sequences=x,
            outputs_info=[self.h0, self.c0], # Include c so that recurrence has access to it!
            n_steps=x.shape[0]
        )
        return h

3.5 Learning from Wikipedia

We are now going to take our RNN with LSTM/GRU units and apply it to a set of data from Wikipedia. Our goal is to create a language model, just as we did for poetry. Both datasets simply hold sequences of words, but Wikipedia has a much larger vocabulary and total size (~13 GB). We are going to see if our learned word embeddings end up producing useful word analogies, and we will also visualize them on a scatter plot (after performing dimensionality reduction).

3.5.1 Getting the Data

Overall, this section does not have a ton of new information in terms of theory/RNNs; it focuses more on how this would be applied in a real scenario. As such, I want to take a minute to go over how we actually get the data. It can be found here. Once you have arrived, you can decide whether you want to get the entire dataset (14.7 GB at this point), or if you would rather just deal with a specific set of pages-articles files. I am choosing to work with a subset of the pages-articles files (i.e. not the entire dataset), because even that will require some careful thought to prevent crashing Python by overloading its memory.

3.5.2 Convert data to text files

Next, we are going to need to convert the above data into a text file format. To do this we will use the wp2txt library. We can install it by running the following:

sudo gem install wp2txt

And it can then be run on a single file via:

wp2txt -i <filename>

3.5.3 Processing text for our RNN

Finally, we can talk about how we are going to take our text files and get them into the right format for our neural network. To do this we will use the function get_wikipedia_data:

In [1]:
import os
import operator

# Note: `relative_path` and `my_tokenizer` are assumed to be defined elsewhere in this project.
def get_wikipedia_data(n_files, n_vocab, by_paragraph=False):
    """Converts Wikipedia txt files into correct format for Neural Network

    This function takes in a number of files that is too large to fit into memory if all data is loaded
    at once. 100 or less seems to be ideal. The vocabulary also needs to be limited, since it is a lot
    larger than the poetry dataset. We are going to have ~500,000-1,000,000 words. Note that the output
    target is the next word, so that is 1 million output classes, which is a lot of output classes.
    This makes it hard to get good accuracy, and it will make our output weight very large. To remedy
    this, the vocabulary size will be restricted to n_vocab. This is generally set to ~2000 most
    common words.

    Args:
        n_files: Number of input files taken in
        n_vocab: Vocabulary size
        by_paragraph:

    Returns:
        sentences: list of lists containing sentences mapped to index
        word2idx_small: word2index mapping reduced to size n_vocab
    """
    wiki_relative_path = f'{relative_path}wikipedia/unzipped'
    input_files = [f for f in os.listdir(wiki_relative_path) if f.startswith('enwiki') and f.endswith('txt')]

    # Return Variables
    sentences = []
    word2idx = {'START': 0, 'END': 1}
    idx2word = ['START', 'END']
    current_idx = 2
    word_idx_count = {0: float('inf'), 1: float('inf')}

    print(input_files)
    print(len(input_files))

    if n_files is not None:
        input_files = input_files[:n_files]

    for f in input_files:
        print('Reading: ', f )
        for line in open(f'{wiki_relative_path}/{f}'):
            line = line.strip()
            # Don't count headers, structured data, lists, etc
            if line and line[0] not in ('[', '*', '-', '|', '=', '{', '}'):
                if by_paragraph:
                    sentence_lines = [line]
                else:
                    sentence_lines = line.split('. ')  # split the paragraph into sentences
                for sentence in sentence_lines:
                    tokens = my_tokenizer(sentence)
                    for t in tokens:
                        if t not in word2idx:
                            word2idx[t] = current_idx
                            idx2word.append(t)
                            current_idx += 1
                        idx = word2idx[t]
                        word_idx_count[idx] = word_idx_count.get(idx, 0) + 1
                    sentences_by_idx = [word2idx[t] for t in tokens]
                    sentences.append(sentences_by_idx)

    # Reduce vocabulary size to n_vocab
    sorted_word_idx_count = sorted(word_idx_count.items(), key=operator.itemgetter(1), reverse=True)
    word2idx_small = {}
    new_idx = 0
    idx_new_idx_map = {}
    for idx, count in sorted_word_idx_count[:n_vocab]:
        word = idx2word[idx]
        print(word, count)
        word2idx_small[word] = new_idx
        idx_new_idx_map[idx] = new_idx
        new_idx += 1
    # Let 'unknown' be last token
    word2idx_small['UNKNOWN'] = new_idx
    unknown = new_idx

    assert('START' in word2idx_small)
    assert('END' in word2idx_small)
    assert('king' in word2idx_small)
    assert('queen' in word2idx_small)
    assert('man' in word2idx_small)
    assert('woman' in word2idx_small)

    # Map old idx to new idx
    sentences_small = []
    for sentence in sentences:
        if len(sentence) > 1:
            new_sentence = [idx_new_idx_map[idx] if idx in idx_new_idx_map else unknown for idx in sentence]
            sentences_small.append(new_sentence)

    return sentences_small, word2idx_small

A few things to note about the above function:

  • Because the number of files that we are working with is too large to fit into memory, we must limit the amount of data that is loaded at any one time
  • We must also limit the vocabulary size; keeping roughly the 2000 most common words is appropriate
  • To do this, we keep a count of how many times each word appears, then sort that dictionary by value in reverse so that the highest counts come first. This gives us the indices of the top 2000 words. However, we are still not done. Our word embedding matrix needs to be of size V = 2000, so the word indices must run from 0 to 2000, but the indices we currently have are just 2000 assorted numbers between 0 and roughly 1,000,000. Hence, we need to create a mapping from old index to new index, where the old index is any number from 0 to about 1 million and the new index is a number from 0 to 2000 (see the toy example below this list). This also means we need to create a new word2idx dictionary as well.
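
Here is a hypothetical toy illustration of that remapping step (the specific index values are made up):

idx_new_idx_map = {0: 0, 1: 1, 57023: 2, 912844: 3}   # top words keep new ranks 0..n_vocab-1
unknown = 4                                            # every word outside the top n_vocab maps here
sentence = [0, 57023, 400001, 912844, 1]               # a sentence in old indices
new_sentence = [idx_new_idx_map[idx] if idx in idx_new_idx_map else unknown for idx in sentence]
# new_sentence -> [0, 2, 4, 3, 1]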

3.5.4 Training on the entire dataset

Now, I should note that if we train on the entire dataset there is a good chance that we will run out of memory. To prevent this, we don't want to load all of our data into memory at the same time. One strategy would be to train on each separate text file, opening each one within each epoch. We would, of course, have to keep the word2idx mapping handy so that it remains consistent between files. This would be rather slow.

Another option would be to convert each sentence into a list of word indices before running the neural network, and then save the word2idx mapping to a file as well. That way our input data can be just a bunch of arrays of word indices. This would still need to be spread across multiple files, since it would otherwise take up too much RAM, but we would avoid needing to convert the data each time we open a file.

A final option would be to use a database like MySQL to store the arrays of word indices. This way we could retrieve rows at random, without needing to store them all in memory.

3.5.5 Training RNN with LSTM on Wikipedia data

We will now implement this in code:

In [5]:
import sys
import theano
import theano.tensor as T
import numpy as np
import matplotlib.pyplot as plt
import json

from datetime import datetime
from sklearn.utils import shuffle
from gru import GRU
from lstm import LSTM
from rnn_util import init_weight, get_wikipedia_data


class RNN:
    def __init__(self, D, hidden_layer_sizes, V):
        """RNN class constructor

        Args:
            hidden_layer_sizes: Number of units in each hidden layer
            D: Embedding size
            V: Vocabulary size
        """
        self.hidden_layer_sizes = hidden_layer_sizes
        self.D = D
        self.V = V

    def fit(self, X, learning_rate=1e-5, mu=0.99, epochs=10,
            show_fig=True, activation=T.nnet.relu, RecurrentUnit=GRU, normalize=True):
        """Fit RNN class to data set via unsupervised learning (no Y labels)

        Args:
            X: Input samples (sentences)
            learning_rate: learning rate
            mu: momentum
            epochs: Number of training iterations
            show_fig: Show figure
            activation: non linear activation function
            RecurrentUnit: GRU or LSTM
            normalize: Normalizes word embeddings

        Returns:
        """
        D = self.D
        V = self.V
        N = len(X) # Number training samples (number of sentences)

        We = init_weight(V, D)
        self.hidden_layers = []
        Mi = D
        for Mo in self.hidden_layer_sizes: # Create hidden layers
            ru = RecurrentUnit(Mi, Mo, activation) # Create recurrent unit (from class GRU or LSTM)
            self.hidden_layers.append(ru)
            Mi = Mo

        Wo = init_weight(Mi, V)
        bo = np.zeros(V)

        # Create theano variables
        self.We = theano.shared(We)
        self.Wo = theano.shared(Wo)
        self.bo = theano.shared(bo)
        self.params = [self.Wo, self.bo]

        # For each recurrent unit in hidden_layers, append the params
        for ru in self.hidden_layers:
            self.params += ru.params

        # Create theano inputs and outputs
        thX = T.ivector('X') # Sequence of word indexes
        thY = T.ivector('Y') # Sequence of word indexes, offset for thX by 1

        Z = self.We[thX] # Input sequence
        for ru in self.hidden_layers:
            Z = ru.output(Z)
        py_x = T.nnet.softmax(Z.dot(self.Wo) + self.bo)
        prediction = T.argmax(py_x, axis=1)

        # Let's return py_x too so we can draw a sample instead
        self.predict_op = theano.function(
            inputs=[thX],
            outputs=[py_x, prediction],
            allow_input_downcast=True,
        )

        # Create cost and do gradient descent
        cost = -T.mean(T.log(py_x[T.arange(thY.shape[0]), thY]))
        grads = T.grad(cost, self.params)
        dparams = [theano.shared(p.get_value()*0) for p in self.params]

        # Gradient descent for word embeddings
        dWe = theano.shared(self.We.get_value()*0)
        gWe = T.grad(cost, self.We)
        dWe_update = mu*dWe - learning_rate*gWe
        We_update = self.We + dWe_update
        if normalize:
            We_update /= We_update.norm(2)

        updates = [
            (p, p + mu*dp - learning_rate*g) for p, dp, g in zip(self.params, dparams, grads)
        ] + [
            (dp, mu*dp - learning_rate*g) for dp, g in zip(dparams, grads)
        ] + [
            (self.We, We_update), (dWe, dWe_update)
        ]

        self.train_op = theano.function(
            inputs=[thX, thY],
            outputs=[cost, prediction],
            updates=updates
        )

        # Main training loop
        costs = []
        for i in range(epochs):
            t0 = datetime.now()
            X = shuffle(X)
            n_correct = 0
            n_total = 0
            cost = 0
            for j in range(N): # one sentence at a time
                # 1% of time, will include END token
                if np.random.random() < 0.01 or len(X[j]) <= 1:
                    input_sequence = [0] + X[j]
                    output_sequence = X[j] + [1]
                else:
                    input_sequence = [0] +X[j][:-1]
                    output_sequence = X[j]
                n_total += len(output_sequence)

                # test:
                try:
                    # we set 0 to start and 1 to end
                    c, p = self.train_op(input_sequence, output_sequence)
                except Exception as e:
                    PYX, pred = self.predict_op(input_sequence)
                    print("input_sequence len:", len(input_sequence))
                    print("PYX.shape:",PYX.shape)
                    print("pred.shape:", pred.shape)
                    raise e

                cost += c
                for pj, xj in zip(p, output_sequence):
                    if pj == xj:
                        n_correct += 1
                if j % 200 == 0:
                    # Using sys here helps compact the output.
                    sys.stdout.write("j/N: %d/%d correct rate so far: %f\r" % (j, N, float(n_correct) / n_total))
                    sys.stdout.flush()
            print("i:", i, "cost:", cost, "correct rate:", (float(n_correct) / n_total), "time for epoch:",
                  (datetime.now() - t0))
            costs.append(cost)

        if show_fig:
            plt.plot(costs)
            plt.show()


def train_wikipedia(we_file='word_embeddings.npy', w2i_file='wikipedia_word2idx.json', RecurrentUnit=GRU):
    sentences, word2idx = get_wikipedia_data(n_files=20, n_vocab=2000)
    print("finished retrieving data")
    print("vocab size:", len(word2idx), "number of sentences:", len(sentences))
    rnn = RNN(30, [30], len(word2idx))
    rnn.fit(sentences, learning_rate=1e-5, epochs=10, show_fig=True, activation=T.nnet.relu, RecurrentUnit=RecurrentUnit)

    np.save(we_file, rnn.We.get_value())
    with open(w2i_file, 'w') as f:
        json.dump(word2idx, f)


def find_analogies(w1, w2, w3, we_file='word_embeddings.npy', w2i_file='wikipedia_word2idx.json'):
    We = np.load(we_file)
    with open(w2i_file) as f:
        word2idx = json.load(f)

    king = We[word2idx[w1]]
    man = We[word2idx[w2]]
    woman = We[word2idx[w3]]
    v0 = king - man + woman

    def dist1(a, b):
        return np.linalg.norm(a - b)
    def dist2(a, b):
        return 1 - a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

    for dist, name in [(dist1, 'Euclidean'), (dist2, 'cosine')]:
        min_dist = float('inf')
        best_word = ''
        for word, idx in word2idx.items():
            if word not in (w1, w2, w3):
                v1 = We[idx]
                d = dist(v0, v1)
                if d < min_dist:
                    min_dist = d
                    best_word = word
        print("closest match by", name, "distance:", best_word)
        print(w1, "-", w2, "=", best_word, "-", w3)

if __name__ == '__main__':
    we = 'models/lstm_word_embeddings2.npy'
    w2i = 'models/lstm_wikipedia_word2idx2.json'
#     train_wikipedia(we, w2i, RecurrentUnit=LSTM)
    find_analogies('king', 'man', 'woman', we, w2i)
    find_analogies('france', 'paris', 'london', we, w2i)
    find_analogies('france', 'paris', 'rome', we, w2i)
    find_analogies('paris', 'france', 'italy', we, w2i)
closest match by Euclidean distance: seems
king - man = seems - woman
closest match by cosine distance: looking
king - man = looking - woman
closest match by Euclidean distance: pain
france - paris = pain - london
closest match by cosine distance: authority
france - paris = authority - london
closest match by Euclidean distance: plane
france - paris = plane - rome
closest match by cosine distance: culture
france - paris = culture - rome
closest match by Euclidean distance: score
paris - france = score - italy
closest match by cosine distance: picked
paris - france = picked - italy

We can see that the embeddings that the RNN yielded do not yet seem to have the relationships we would like. However, if we visualize them with a dimensionality reduction technique, we may still be able to find some meaningful structure. Note, the above embeddings are based on only one iteration of training; with 10 iterations they would improve significantly. On my local machine, training one iteration took 90 minutes.

3.5.6 Visualize the word embeddings

Below, I will use the interactive plotting library plotly, in conjunction with dimensionality reduction via TSNE, to plot our word embeddings. We can then tinker with the plot to see what types of embeddings our network learned.

In [2]:
import plotly.plotly as py
import plotly
import plotly.graph_objs as go
from plotly.offline import iplot
# Using cufflinks in offline mode
import cufflinks
cufflinks.go_offline()
# Set the global theme for cufflinks
cufflinks.set_config_file(world_readable=True, theme='pearl', offline=True)
import warnings
warnings.filterwarnings('ignore')
In [3]:
import json
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA, TruncatedSVD

We = np.load('models/gru_nonorm_part1_word_embeddings.npy')
V, D = We.shape
with open('models/gru_nonorm_part1_wikipedia_word2idx.json') as f:
    word2idx = json.load(f)
idx2word = {v:k for k,v in word2idx.items()}

model = TSNE()
Z = model.fit_transform(We)
In [23]:
x = Z[:,0]
y = Z[:,1]
text = [idx2word[i] for i in range(V)]
color = [i for i in range(V)]
trace = go.Scatter(
    x=x,
    y=y,
    text=text,
    mode='markers',
    marker=dict(
        size=12,
        color=color,
        colorscale='Viridis',
        showscale=True
    )
)
data = [trace]

layout=go.Layout(
    title='Word Embeddings with Dimensionality Reduction',
    hovermode='closest',
    xaxis=dict(title='t-SNE Dimension 1'),
    yaxis=dict(title='t-SNE Dimension 2')
)

fig = go.Figure(data=data, layout=layout)

# py.iplot(fig)
div_plot = plotly.offline.plot(
    fig,
    include_plotlyjs=False,
    output_type='div'
)

4. Batch Training

The final thing that we will go over as it relates to RNNs is the concept of batch training. You may have been wondering why it has taken us so long to get to batch training in the first place. We took a while to introduce the idea of batching because the scan function is not trivial; it has many parts and takes a bit to fully sink in. However, we have now worked with it in a variety of settings, so hopefully we are ready to move to the next level of complexity.

To start, let's look at our original recurrence function:

def recurrence(x_t, h_t1):
    # Returns h(t), y(t)
    h_t = self.f(x_t.dot(self.Wx) + h_t1.dot(self.Wh) + self.bh)
    y_t = T.nnet.softmax(h_t.dot(self.Wo) + self.bo)
    return h_t, y_t

Does anything jump out about this function as being inefficient? Well, doing one large matrix multiplication is much faster than doing many small matrix multiplies in a loop. So, computing $x(t)W_x$ while looping through $t$ is slower than just computing $XW_x$ once. Therefore, we can take our original:

$$x(t)W_x$$

Out of the recurrence and replace it with:

$$XW_x$$

The same situation occurs with $y(t)$. Originally we have:

$$y(t) = softmax\big(h(t)W_o + b_o\big)$$

But it can be updated to be:

$$Y = softmax(HW_o + b_o)$$

The only place that we actually need the recurrence is to calculate $h(t)$, since it is dependent on $h(t-1)$.
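
As a quick sanity check (a toy numpy comparison, not part of the theano graph), the hoisted product gives exactly the same values as the per-timestep products:

import numpy as np

T_steps, D, M = 500, 50, 100
X = np.random.randn(T_steps, D)
Wx = np.random.randn(D, M)

looped = np.stack([X[t].dot(Wx) for t in range(T_steps)])  # one small multiply per time step
hoisted = X.dot(Wx)                                        # one large multiply
assert np.allclose(looped, hoisted)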

4.1 Update Recurrence Pseudocode

So, let's look at how the recurrence would look if we pull the multiplication of $X$ by $W_x$ out of the loop, and apply the final transformation that produces the output $Y$ after the loop.

XW = X.dot(Wx)

def recurrence(xw_t, h_t1):
    h_t = self.f(xw_t + h_t1.dot(Wh) + bh)
    return h_t

h, _ = theano.scan(
    fn=recurrence,
    outputs_info=[h0],
    sequences=[XW],
    n_steps=XW.shape[0],
)

y = T.nnet.softmax(h.dot(Wo) + bo)

4.2 Batch Training

Now, that is a step in the right direction; we moved 2 matrix multiplies out of the loop and turned them into 2 big single matrix multiplies. However, we are still doing stochastic gradient descent. How do we move this from stochastic to batch gradient descent? Well, the next step is to start calculating the output for multiple X's at the same time.

Let's think about one possible solution. We could make $X$ a 3-D tensor with shape $N \times T \times D$, where $N$ is the batch size, $T$ is the sequence length, and $D$ is the number of features. However, in that case we would also have to loop through each individual $x$. This would mean a recurrence inside of a recurrence, aka a nested scan (one scan to loop through each sample, one scan to loop through each time point). Recall that scan is almost as bad as a for loop, so we don't want to do that.

Consider another possible solution. Suppose we flatten out the $X$ batch so that instead of $N \times T \times D$ it becomes $(N \cdot T) \times D$. Now it is 2-dimensional, and we can multiply the whole thing by $W_x$ at the same time. So the first stage is done. But can we just pass this into the recurrence? The problem arises when we try to calculate $h(t)$. Now that we have flattened all of $X$ into a 2-D array, the recurrence function thinks it is one big long sequence, even though it is really the concatenation of multiple sequences in the batch. So, if we can find a way to get $h(t)$ to start back at $h(0)$ at the start of each sequence, instead of continuing from $h(t-1)$, then this problem is solved. The question is, can we do this? The answer is that we can, but it requires some knowledge of theano that we currently don't have.

What we need is a way to keep track of where each of the sequences starts and ends. To do this, we can create an indicator array of size $N*T$ (batch size times sequence length). It will be all zeros, except for a one at the start of each sequence. For example:

Xbatch = [[1,2,3],[4,5],[6,7,8]]
Xbatchflat = [1,2,3,4,5,6,7,8]
startIndicator = [1,0,0,1,0,1,0,0]

So, if we are at the start of a sequence the indicator is a one; otherwise it is a zero. Now, how would we actually implement this? We are going to need a bit of additional theano knowledge. The slight problem is that we want to check this indicator sequence to decide what to do:

if is_start:
    h(t) = f(xw_t + h0.dot(Wh) + bh)
else:
    h(t) = f(xw_t + h_t1.dot(Wh) + bh)

This would work fine if we were in numpy, but we are in theano. The issue is the same one that arises if we try to use a for loop in theano: we can't do if <something that does not have a value>. None of these things have values yet! Recall, we are building a theano graph that tells theano what to do once it has real numbers. An if statement that wants to evaluate its condition at the time you write it will not work. So, just as the scan function is the theano equivalent of a for loop, there is a theano equivalent of the if statement.

4.2.1 Theano Switch Function

The new piece of functionality is the switch function. It looks like:

new_value = T.switch(
    condition,
    value_if_true,
    value_if_false
)

We see that it takes in 3 arguments. The first is a conditional expression composed of theano graph nodes, the second is what to assign if the condition is true, and the third is what to assign if the condition is false. So, the non theano equivalent would be:

if condition:
    new_value = value_if_true
else:
    new_value = value_if_false

4.2.2 New Recurrence

Our new recurrence, utilizing the switch function, will look like:

def recurrence(xw_t, is_start, h_t1, h0):
    # if at a boundary, state should be h0
    h_t = T.switch(
        T.eq(is_start, 1),                              # Equivalent of is_start == 1
        self.f(xw_t + h0.dot(self.Wh) + self.bh),
        self.f(xw_t + h_t1.dot(self.Wh) + self.bh)
    )
    return h_t

And here is how we will call the scan function:

thStartPoints = T.ivector() # Indicator array 
...
h, _ = theano.scan(
    fn=recurrence,
    outputs_info=[self.h0],
    sequences=[XW, thStartPoints],
    non_sequences=[self.h0],
    n_steps=XW.shape[0]
)

Nearly everything else in the main code will remain the same! The main change outside of the new recurrence is that we now loop through batches instead. We reshape Xbatch and Ybatch before calling the train function, and we have to create the indicator array of start points:

for i in range(epochs):
    X, Y = shuffle(X, Y)
    n_correct = 0
    cost = 0
    for j in range(n_batches):
        Xbatch = X[j*batch_sz:(j+1)*batch_sz].reshape(sequenceLength*batch_sz, D)
        Ybatch = Y[j*batch_sz:(j+1)*batch_sz].reshape(sequenceLength*batch_sz).astype(np.int32)
        c, p, rout = self.train_op(Xbatch, Ybatch, startPoints)
        ...

In the case of the parity problem, since all of the sequences are the same length, we can just create this indicator array of evenly spaced ones outside of the main training loop. We can then use this same array over and over again:

startPoints = np.zeros(sequenceLength*batch_sz, dtype=np.int32)
for b in range(batch_sz):
    startPoints[b*sequenceLength] = 1

This generally won't be the case if you have sequences of variable length.
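
For variable-length sequences, one hypothetical way to build the indicator array is to concatenate the batch and mark where each sequence begins:

import numpy as np

batch = [[1, 2, 3], [4, 5], [6, 7, 8]]             # three sequences of different lengths
flat = np.concatenate(batch).astype(np.int32)      # -> [1 2 3 4 5 6 7 8]
startPoints = np.zeros(len(flat), dtype=np.int32)
pos = 0
for seq in batch:
    startPoints[pos] = 1                           # mark the start of each sequence
    pos += len(seq)
# startPoints -> [1 0 0 1 0 1 0 0]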

4.2.3 Classification Rate

The final thing to talk about is how we assess the classification rate. This is a little bit of a problem, since the predictions are now a flattened 1-D array. The way we can handle it is to loop over the sequences in the batch and calculate the index of the final prediction of each sequence. If we let $b$ be the loop index, then $b*T$ is the start of a sequence, and $(b+1)*T - 1$ gives us the end of that sequence.

for b in range(batch_sz):
    idx = sequenceLength * (b + 1) - 1
    if p[idx] == Ybatch[idx]:
        n_correct += 1

Here is how it looks when implemented on the parity problem:

In [ ]:
import theano
import theano.tensor as T
import numpy as np
import matplotlib.pyplot as plt

from sklearn.utils import shuffle
from util import init_weight, all_parity_pairs_with_sequence_labels


class SimpleRNN:
    def __init__(self, M):
        self.M = M # hidden layer size

    def fit(self, X, Y, batch_sz=20, learning_rate=1.0, mu=0.99, reg=1.0, activation=T.tanh, epochs=100, show_fig=False):
        D = X[0].shape[1] # X is of size N x T(n) x D
        K = len(set(Y.flatten()))
        N = len(Y)
        M = self.M
        self.f = activation

        # initial weights
        Wx = init_weight(D, M)
        Wh = init_weight(M, M)
        bh = np.zeros(M)
        h0 = np.zeros(M)
        Wo = init_weight(M, K)
        bo = np.zeros(K)

        # make them theano shared
        self.Wx = theano.shared(Wx)
        self.Wh = theano.shared(Wh)
        self.bh = theano.shared(bh)
        self.h0 = theano.shared(h0)
        self.Wo = theano.shared(Wo)
        self.bo = theano.shared(bo)
        self.params = [self.Wx, self.Wh, self.bh, self.h0, self.Wo, self.bo]

        thX = T.fmatrix('X') # will represent multiple batches concatenated
        thY = T.ivector('Y')
        thStartPoints = T.ivector('start_points')

        XW = thX.dot(self.Wx)

        # other solution - loop through all sequences simultaneously
        def recurrence(xw_t, is_start, h_t1, h0):
            # if at a boundary, state should be h0
            h_t = T.switch(
                T.eq(is_start, 1),
                self.f(xw_t + h0.dot(self.Wh) + self.bh),
                self.f(xw_t + h_t1.dot(self.Wh) + self.bh)
            )
            return h_t

        h, _ = theano.scan(
            fn=recurrence,
            outputs_info=[self.h0],
            sequences=[XW, thStartPoints],
            non_sequences=[self.h0],
            n_steps=XW.shape[0],
        )

        # h is of shape (T*batch_sz, M)
        py_x = T.nnet.softmax(h.dot(self.Wo) + self.bo)
        prediction = T.argmax(py_x, axis=1)

        cost = -T.mean(T.log(py_x[T.arange(thY.shape[0]), thY]))
        grads = T.grad(cost, self.params)
        dparams = [theano.shared(p.get_value()*0) for p in self.params]

        updates = [
            (p, p + mu*dp - learning_rate*g) for p, dp, g in zip(self.params, dparams, grads)
        ] + [
            (dp, mu*dp - learning_rate*g) for dp, g in zip(dparams, grads)
        ]

        self.train_op = theano.function(
            inputs=[thX, thY, thStartPoints],
            outputs=[cost, prediction, py_x],
            updates=updates
        )

        costs = []
        n_batches = N // batch_sz
        sequenceLength = X.shape[1]

        startPoints = np.zeros(sequenceLength*batch_sz, dtype=np.int32)
        for b in range(batch_sz):
            startPoints[b*sequenceLength] = 1
        for i in range(epochs):
            X, Y = shuffle(X, Y)
            n_correct = 0
            cost = 0
            for j in range(n_batches):
                Xbatch = X[j*batch_sz:(j+1)*batch_sz].reshape(sequenceLength*batch_sz, D)
                Ybatch = Y[j*batch_sz:(j+1)*batch_sz].reshape(sequenceLength*batch_sz).astype(np.int32)
                c, p, rout = self.train_op(Xbatch, Ybatch, startPoints)
                cost += c
                for b in range(batch_sz):
                    idx = sequenceLength*(b + 1) - 1
                    if p[idx] == Ybatch[idx]:
                        n_correct += 1
            if i % 10 == 0:
                print("shape y:", rout.shape)
                print("i:", i, "cost:", cost, "classification rate:", (float(n_correct)/N))
            if n_correct == N:
                print("i:", i, "cost:", cost, "classification rate:", (float(n_correct)/N))
                break
            costs.append(cost)

        if show_fig:
            plt.plot(costs)
            plt.show()

def parity(B=12, learning_rate=1e-3, epochs=3000):
    X, Y = all_parity_pairs_with_sequence_labels(B)

    rnn = SimpleRNN(4)
    rnn.fit(X, Y,
        batch_sz=10,
        learning_rate=learning_rate,
        epochs=epochs,
        activation=T.nnet.sigmoid,
        show_fig=False
    )

parity()

© 2018 Nathaniel Dake