This section serves as an introduction to many of the topics we will touch on throughout my NLP posts on deep learning applied to natural language processing.
You will notice that a lot of the RNN examples we go over, as well as many examples elsewhere on the web, use word sequences. Why is that?
Let's focus for a moment on one key reason: word order carries meaning. As an example, consider the sentence:
"Dogs love cats and I"
It almost has the correct grammatical structure, but its meaning is most certainly different from that of the original sentence:
"I love dogs and cats"
So, there is a lot of information (in the quantitative sense) that is thrown away when you use bag-of-words. At this point I am assuming that you have gone through my posts on logistic regression and the intro to NLP, which both cover sentiment analysis and utilize bag-of-words. But in case you have not, let me quickly define bag-of-words.
Consider the task of sentiment analysis, where we are trying to determine whether a sentence is positive or negative. A positive sentence may be:
"Wow, today is a great day!"
While a negative sentence may be:
"Ugh, this movie is absolutely terrible."
In order to turn each sentence into an input for the classifier, we first start with a vector of 0's of size $V$ (our vocabulary size), so there is an entry for every individual word:
X=[0,0,0,...,0]
len(X) = V
We keep track of which word goes with which index using a dictionary, word2idx. Now, for every word in the sentence, we will set the corresponding index in the vector to 1, or perhaps some other frequency measure:
X[idx_of_word] = 1
So, there is a nonzero value for every word that appears in the sentence, and everywhere else zero:
X = [0,1,0,0,...,1]
You can see how, given this vector, it wouldn't be easy to determine the correct order of words in the sentence. It isn't completely impossible, if for instance the words are such that there is only one possible ordering, but generally some information is lost.
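As a concrete illustration, here is a minimal sketch of the bag-of-words construction described above. It assumes a word2idx dictionary has already been built from the training corpus; the helper name bag_of_words is just for illustration.

import numpy as np

def bag_of_words(sentence, word2idx):
    V = len(word2idx)
    x = np.zeros(V)
    for token in sentence.lower().split():
        if token in word2idx:
            x[word2idx[token]] = 1   # use += 1 instead for a count-based variant
    return x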
Now, what happens when you have the two similar sentences:
"Today is a good day."
And:
"Today is not a good day."
Well, these lead to nearly the exact same input vector, except that in the second case X[word2idx['not']] = 1. This is actually a known drawback of bag-of-words: it is notoriously bad at handling negation. Now, given what we know about RNNs, you can imagine that they may be good at this because they keep state! For instance, if the RNN saw the word not, it could negate everything that comes after it.
This brings us to a paramount question: how do we treat words in deep learning? The popular method at the moment, which has produced very impressive results, is the use of word embeddings, or word vectors. That means that given a vocabulary of size $V$, we choose a dimensionality $D$ that is much smaller, $D \ll V$, and map each word to a point in the $D$-dimensional space. By training a model to do certain things, like predicting the next word or predicting surrounding words, we get vectors (word embeddings) that can be manipulated via arithmetic to produce analogies such as:
king - man $\approx$ queen - woman
The question now is: how do we use word embeddings with recurrent neural networks? To accomplish this, we simply create an embedding layer in the RNN. The input arrives as a one-hot encoded word, and in the next layer it becomes a $D$-dimensional vector.
This requires the word embedding matrix to be a $V \times D$ matrix, where the $i$th row is the word vector for the $i$th word. For reference, all of the matrix dimensions are below:
$$W_e : V \times D$$
$$W_x : D \times M$$
$$W_h : M \times M$$
$$W_o : M \times K$$
Two questions naturally arise at this point. The first is: how do we find the word embedding matrix $W_e$?
The answer is our old friend, gradient descent. We will also see later, when we cover Word2Vec, that there are some variations on the cross-entropy error function that help us speed up training. The second question is: what do we use as targets?
This is a good question because language models don't necessarily have targets. You can attempt to learn word embeddings on a sentiment analysis task, so your targets could be movie ratings or some kind of movie score. Your targets could also be next word prediction as we discussed before. Again, if we use Word2Vec, the targets will also change based on the particular Word2Vec method we use.
We are now going to go over how you actually can perform calculations that show that:
king - man $\approx$ queen - woman
It is quite simple, but worth going through so that intuitions can start forming about this entire process.
We can start by rewriting the above as:
king - man + woman = ?
Then there are two main steps. The first is to compute the expression vector:
vec(king) = Word_embedding[word2idx["king"]]
v0 = vec(king) - vec(man) + vec(woman)
Here, v0 is just a point in a continuous space that can take on an infinite number of values. The second step is to find the word whose embedding is closest to v0, and return that word. Why do we need to do that? Well, the result of vec(king) - vec(man) + vec(woman) just gives us a vector. There is no direct way to map from vectors to words, since a vector space is continuous and that would require an infinite number of words. So, the idea is that we just find the closest word.
There are various ways of defining distance in the context above. Sometimes, we will simply use Euclidean Distance:
$$\text{Euclidean Distance: } ||a - b||^2$$
It is also common to use the cosine distance:
$$\text{Cosine Distance: } 1 - \frac{a^Tb}{||a|| \; ||b||}$$
In this latter form, only the angle between the vectors matters, because:
$$a^Tb = ||a|| \; ||b|| \cos(\theta)$$
During training we normalize all of the word vectors so that their length is 1, and recall that:
$$\cos(0^\circ) = 1, \; \cos(90^\circ) = 0, \; \cos(180^\circ) = -1$$
When two vectors are closer (the angle between them is smaller), $\cos(\theta)$ is bigger. So, we want our distance to be:
$$\text{Distance} = 1 - \cos(\theta)$$
Since the vectors are normalized, we can also say that all of the word embeddings lie on the unit sphere.
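Both distances are one-liners in numpy; here is a small sketch (the function names are my own):

import numpy as np

def euclidean_distance(a, b):
    return np.sum((a - b)**2)        # squared Euclidean distance, as above

def cosine_distance(a, b):
    return 1 - a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))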
Once we have our distance function, how do we actually find the closest word? The simplest way is to just look at every word in the vocabulary and compute the distance between its vector and our expression vector. We keep track of the smallest distance and then return that word.
min_dist = float('inf')
best_word = ''
for word, idx in word2idx.items():
    v1 = Word_embedding[idx]
    d = dist(v0, v1)          # e.g. Euclidean or cosine distance
    if d < min_dist:
        min_dist = d
        best_word = word
print("The best word is:", best_word)
We may want to leave out the words from the left side of the equation, in this case king, man, and woman. Note that we will not be using this on our upcoming poetry data, since it doesn't have the kind of vocabulary we are looking for. We are more interested in things like nouns when we do word analogies. We want to compare kings and queens, men and women, occupations, etc. We will look more at word analogies later on.
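For a large vocabulary, the same search can be vectorized with numpy. Here is a minimal sketch using cosine distance, assuming the word vectors have been normalized to unit length as described above, and excluding the words that appear in the query:

import numpy as np

def closest_word(v0, Word_embedding, word2idx, exclude=('king', 'man', 'woman')):
    idx2word = {i: w for w, i in word2idx.items()}
    # cosine distance from v0 to every row at once (rows assumed to be unit length)
    distances = 1 - Word_embedding.dot(v0) / np.linalg.norm(v0)
    for w in exclude:
        if w in word2idx:
            distances[word2idx[w]] = np.inf   # ignore the query words themselves
    return idx2word[int(np.argmin(distances))]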
Let's quickly go over one small detail from the upcoming code that may be slightly confusing. We have a word embedding matrix, $W_e$, of size $V \times D$ (V = vocabulary size, D = word vector dimensionality). Note that each row of the word embedding matrix is a word vector. We also have an input sequence of word indexes of length $T$. We would like to get a sequence of word vectors that represents a sentence, which is a $T \times D$ matrix. In other words, we want to take the indices from our input sequence and grab the corresponding word vectors from the embedding matrix.
However, we will need to update the word embeddings via backpropagation, so the $T \times D$ matrix we get after grabbing the word vectors cannot be the input into the neural network. This is because the word embeddings must be part of the neural network so that they can be updated via gradient descent along with the other weights. This means that the input to the neural network will actually just be a list of word indices, and these indices will correspond to however we decided to build our dictionary. This also saves a lot of space, because we can represent each input by a $T \times 1$ vector of integers rather than a $T \times D$ matrix of floats.
From a conceptual standpoint, we are trying to do the following:
def get_word_vectors(input_sequence, We):
    word_vectors = []
    for index in input_sequence:
        word_vector = We[index, :]   # grab row `index` of the embedding matrix
        word_vectors.append(word_vector)
    return word_vectors
Here, we are simply taking each word index from the input sequence, grabbing its corresponding word vector, and adding it to a list of word vectors, which is the output sequence. Mathematically speaking, the way you would get a word vector is by multiplying the one-hot encoded word index vector by the word embedding matrix:
s = one_hot_encode(word_index)   # e.g. [0,0,0,...,1,...,0], shape (1 x V)
x = s.dot(W_e)                   # shape (1 x D)
Above we are taking the dot product of a $1 \times V$ vector with a $V \times D$ matrix, which results in a $1 \times D$ word vector; exactly what we wanted. Visually, this can be seen below. We have a one-hot encoded word, in this case "dog", which has a word index of 147, and our word embedding matrix:
We then perform the dot product of our one hot encoded word with the word embedding matrix:
We can see that because all entries in the one-hot encoded word are zero (besides a 1 at index 147, representing our dog), the result will be exactly the row vector at index 147 of the word embedding matrix. This means that we can skip the computation and just extract the row specified by our index! This is trivial in numpy: W_e[147].
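We can verify this equivalence numerically with a quick sketch (the sizes here are made up):

import numpy as np

V, D = 1000, 50
W_e = np.random.randn(V, D)
s = np.zeros(V)
s[147] = 1                                  # one-hot vector for word index 147
assert np.allclose(s.dot(W_e), W_e[147])    # the dot product is just row 147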
We are now going to dive into an interesting use case of an RNN: poetry generation. This is an unsupervised model, and as we discussed in the previous section, the softmax output will be the probability of the next word, given the previous sequence of words:
$$p\big( w_t \mid w_{t-1}, w_{t-2}, \ldots, w_0\big)$$
For this first iteration, we will also create an initial word distribution, which we will call $\pi$. $\pi$ is simply the probability that a line in our poetry starts with a particular word:
$$\pi = p\big( w_0 \big)$$
We will sample from $\pi$ so that each line can start with a different word. The reason for doing this is so that we can generate different sequences. If we always fed the same start token into our recurrent net, it would always output the same prediction (because neural network output is deterministic); it would then feed those two tokens back into the net, and so on, always producing the same line. Our predictions will take the form:
w0 = rnn.predict(START)
w1 = rnn.predict(START, w0)
w2 = rnn.predict(START, w0, w1)
# and so on...
Remember, the prediction of a neural network is just the argmax of the softmax:
$$w_t = \operatorname{argmax}\Big(\operatorname{softmax} \big(f(w_{t-1}, w_{t-2}, \ldots)\big)\Big)$$
Sampling from the initial word distribution will allow each line to possibly start with a different word, which in turn will allow us to generate different sequences.
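Conceptually, generating one line looks something like the sketch below. This is only a sketch: the real implementation appears in the generate method later on, predict_op stands in for the compiled prediction function, and 0 and 1 are the START and END indexes, as in the code below.

line = [np.random.choice(V, p=pi)]       # sample the first word from pi
while True:
    next_word = predict_op(line)[-1]     # argmax of the softmax at the last time step
    line.append(next_word)               # extend the sequence with the prediction
    if next_word == 1:                   # END token: the line is finished
        break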
Because we are dealing with a language model, we will also need word embeddings. This means that this RNN will be slightly different from the one we built for the parity problem. The first difference is that, in addition to the hidden layer size $M$, it is going to take the dimensionality of the word embeddings, $D$, and the vocabulary size, $V$, since the word embedding matrix $W_e$ needs to be of size $V \times D$.
The second difference is that our fit function will only take in $X$, because there are no targets:
rnn.fit(X)
Within the fit function, however, we will create our own targets. The target at each time step is simply the next word in the sequence:
| Input: | START | $x_0$ | $x_1$ | $x_2$ | $x_3$ |
|---|---|---|---|---|---|
| Target: | $x_0$ | $x_1$ | $x_2$ | $x_3$ | END |
We also need to predict the end of a sequence, or else we would just go on generating an infinite line. To do this, we will make the final target of each sequence the END token. Similarly, we will add a START token at the beginning of each input sequence, and its target will be the first word. To summarize: the input sequence is prepended with the START token, and the target sequence is appended with the END token.
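Concretely, with 0 reserved for START and 1 for END (as in the fit method below), one line of poetry is turned into an input/target pair like this:

sentence = [5, 12, 7]                 # word indexes for one line (made-up values)
input_sequence  = [0] + sentence      # [START, x0, x1, x2]
output_sequence = sentence + [1]      # [x0, x1, x2, END]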
Unlike the parity problem, we want to measure accuracy not simply by the last word, but by every predicted word. Due to that, we will accumulate the number of correct words guessed, and divide it by the total number of words, in order to get the final accuracy:
$$\text{Accuracy} = \frac{\text{words correctly guessed}}{\sum_{\text{sentences}} \big(\text{len(sentence)} + 1\big)}$$
Additionally, because we may want to generate new poetry without having to retrain the model every time, we will want to save our model after it is trained, and also have a way to load said model. The API to do so is shown below:
rnn = SimpleRNN.load(filename)
rnn.save(filename)
Note that the first method (load) is a static method, and the second (save) is an instance method. Because theano functions need to be compiled, we can't just set the weights to saved numpy arrays; we must reinitialize the object with all of the required theano functions in order to make predictions.
Now, the data that we are going to be dealing with is a collection of Robert Frost poems; about 1500 lines. Each line is a separate sequence. We will perform preprocessing that consists of:
- Tokenizing each line: lowercasing it, removing punctuation, and splitting on whitespace
- Assigning each unique token an index in a word2idx map. Indexes start from 0 and increment by 1 (0 and 1 are reserved for the START and END tokens, so real words start at 2)
- Converting each line into a sequence of word indexes using the word2idx map

This is generally the same process that we will follow for building any language model. However, we will see how we can introduce more modifications when we look at more complicated data.
import numpy as np
import string
import theano
import theano.tensor as T
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
def init_weight(Mi, Mo):
return np.random.randn(Mi, Mo) / np.sqrt(Mi + Mo)
def remove_punctuation(s):
return s.translate(str.maketrans('','',string.punctuation))
def get_robert_frost():
word2idx = {'START': 0, 'END': 1}
current_idx = 2
sentences = []
for line in open('../../data/poems/robert_frost.txt'):
line = line.strip()
if line:
tokens = remove_punctuation(line.lower()).split()
sentence = []
for t in tokens:
if t not in word2idx:
word2idx[t] = current_idx
current_idx += 1
idx = word2idx[t]
sentence.append(idx)
sentences.append(sentence)
return sentences, word2idx
class SimpleRNN:
def __init__(self, D, M, V):
self.D = D # Dimensionality of word embedding
self.M = M # Hidden layer size
self.V = V # Vocabulary size
def fit(self, X, learning_rate=10e-1, mu=0.99, reg=1.0, activation=T.tanh, epochs=500, show_fig=False):
N = len(X)
D = self.D
M = self.M
V = self.V
self.f = activation
# Initialize our weights as np matrices and vectors
We = init_weight(V, D)
Wx = init_weight(D, M)
Wh = init_weight(M, M)
bh = np.zeros(M)
h0 = np.zeros(M)
Wo = init_weight(M, V)
bo = np.zeros(V)
# Make weights theano shared. No name is being supplied.
self.We = theano.shared(We)
self.Wx = theano.shared(Wx)
self.Wh = theano.shared(Wh)
self.bh = theano.shared(bh)
self.h0 = theano.shared(h0)
self.Wo = theano.shared(Wo)
self.bo = theano.shared(bo)
# Collect params so gradient descent is easy
self.params = [self.We, self.Wx, self.Wh, self.bh, self.h0, self.Wo, self.bo]
# Define X
thX = T.ivector('X') # A sequence of word indices
Ei = self.We[thX] # Word embedding indexed by thX indices, T x D matrix
thY = T.ivector('Y')
def recurrence(x_t, h_t1):
"""Recurrence function that we define, will be passed into theano scan function. """
# Returns h(t), y(t)
h_t = self.f(x_t.dot(self.Wx) + h_t1.dot(self.Wh) + self.bh)
y_t = T.nnet.softmax(h_t.dot(self.Wo) + self.bo)
return h_t, y_t
# Create scan function -> The scan function allows us to pass in the length of a
# theano variable as the number of times it will loop. As a reminder, the structure
# of scan is as follows:
# - fn: Function to be applied to every element of sequence passed in, in our case
# the function is the recurrence function
# - outputs_info: Initial value of recurring variables
# - sequences: Actual sequence being passed in
# - n_steps: number of things to iterate over, generally len(sequences)
[h, y], _ = theano.scan(
fn=recurrence,
outputs_info=[self.h0, None],
sequences=Ei,
n_steps=Ei.shape[0],
)
# Get output
py_x = y[:, 0, :] # y has shape (T, 1, V); drop the middle dimension of size 1 added by softmax inside scan
prediction = T.argmax(py_x, axis=1)
cost = -T.mean(T.log(py_x[T.arange(thY.shape[0]), thY])) # Standard cross entropy cost
grads = T.grad(cost, self.params) # Calculate all gradients in one step
dparams = [theano.shared(p.get_value() * 0) for p in self.params] # Set momentum params
# Update params with gradient descent with momentum
updates = []
for p, dp, g in zip(self.params, dparams, grads):
new_dp = mu*dp - learning_rate*g
updates.append((dp, new_dp))
new_p = p + new_dp
updates.append((p, new_p))
# Define predict op and train op
self.predict_op = theano.function(inputs=[thX], outputs=prediction)
self.train_op = theano.function(
inputs=[thX, thY],
outputs=[cost, prediction, y],
updates=updates
)
# Enter main training loop
costs = []
n_total = sum((len(sentence) + 1) for sentence in X)
for i in range(epochs):
X = shuffle(X)
n_correct = 0 # Restart number correct and cost to be 0 at the start of each epoch
cost = 0
# Perform stochastic gradient descent
for j in range(N):
input_sequence = [0] + X[j] # [0] for start token
output_sequence = X[j] + [1] # [1] for end token
# we set 0 to START and 1 to END
c, p, y = self.train_op(input_sequence, output_sequence)
cost += c # Accumulate cost
for pj, xj in zip(p, output_sequence): # loop through all predictions
if pj == xj:
n_correct += 1
if i % 50 == 0:
print("i:", i, "cost:", cost, "correct rate:", (float(n_correct)/n_total))
costs.append(cost)
if show_fig:
plt.plot(costs)
plt.show()
def save(self, filename):
np.savez(filename, *[p.get_value() for p in self.params]) # Save multiple arrays at once
# Static load method
@staticmethod
def load(filename, activation):
npz = np.load(filename)
We = npz['arr_0']
Wx = npz['arr_1']
Wh = npz['arr_2']
bh = npz['arr_3']
h0 = npz['arr_4']
Wo = npz['arr_5']
bo = npz['arr_6']
V, D = We.shape
_, M = Wx.shape
rnn = SimpleRNN(D, M, V)
rnn.set(We, Wx, Wh, bh, h0, Wo, bo, activation)
return rnn
def set(self, We, Wx, Wh, bh, h0, Wo, bo, activation):
# Pass in np arrays, turn them into theano variables
self.f = activation
# redundant - see how you can improve it
self.We = theano.shared(We)
self.Wx = theano.shared(Wx)
self.Wh = theano.shared(Wh)
self.bh = theano.shared(bh)
self.h0 = theano.shared(h0)
self.Wo = theano.shared(Wo)
self.bo = theano.shared(bo)
self.params = [self.We, self.Wx, self.Wh, self.bh, self.h0, self.Wo, self.bo]
thX = T.ivector('X')
Ei = self.We[thX] # will be a TxD matrix
thY = T.ivector('Y')
def recurrence(x_t, h_t1):
# returns h(t), y(t)
h_t = self.f(x_t.dot(self.Wx) + h_t1.dot(self.Wh) + self.bh)
y_t = T.nnet.softmax(h_t.dot(self.Wo) + self.bo)
return h_t, y_t
[h, y], _ = theano.scan(
fn=recurrence,
outputs_info=[self.h0, None],
sequences=Ei,
n_steps=Ei.shape[0],
)
py_x = y[:, 0, :]
prediction = T.argmax(py_x, axis=1)
self.predict_op = theano.function(
inputs=[thX],
outputs=prediction,
allow_input_downcast=True,
)
def generate(self, pi, word2idx):
# Generate poetry given the saved model
# convert word2idx -> idx2word
idx2word = {v:k for k,v in word2idx.items()}
V = len(pi)
# generate 4 lines at a time (4 line verses)
n_lines = 0
# Initial word is randomly sampled from V, with sampling distribution pi
X = [ np.random.choice(V, p=pi) ]
print(idx2word[X[0]], end=" ")
while n_lines < 4:
P = self.predict_op(X)[-1] # Predict based on current sequence X
X += [P] # Concact prediction onto sequence X
if P > 1:
# it's a real word, not start/end token (start is 0, end is 1)
word = idx2word[P]
print(word, end=" ")
elif P == 1:
# end token
n_lines += 1
print('')
if n_lines < 4:
X = [ np.random.choice(V, p=pi) ] # reset to start of line
print(idx2word[X[0]], end=" ")
def train_poetry():
sentences, word2idx = get_robert_frost()
print(len(word2idx))
rnn = SimpleRNN(30, 30, len(word2idx))
rnn.fit(sentences, learning_rate=1e-4, show_fig=True, activation=T.nnet.relu, epochs=5)
rnn.save('RNN_D30_M30_epochs2000_relu.npz')
def generate_poetry():
# Can call after training poetry
sentences, word2idx = get_robert_frost()
rnn = SimpleRNN.load('RNN_D30_M30_epochs2000_relu.npz', T.nnet.relu)
# determine initial state distribution for starting sentences
V = len(word2idx)
pi = np.zeros(V)
for sentence in sentences:
pi[sentence[0]] += 1
pi /= pi.sum()
rnn.generate(pi, word2idx)
if __name__ == '__main__':
train_poetry()
generate_poetry()
Something that is very interesting to note about the above solution is that we started with a randomly initialized word embedding and ended up with a word embedding matrix that yields decent next-word predictions! In other words, this process allowed us to take words and convert them into vectors. Even if you have not gone through my posts on Word2Vec yet, this should give a small taste of how it may work!
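As a quick way to poke at what was learned, we can load the saved weights and look at nearest neighbors in the embedding space, reusing the idea from the word-analogy section. This is just a sketch: it assumes the model was saved with the filename used above, and the example word is only meaningful if it actually appears in word2idx.

import numpy as np

npz = np.load('RNN_D30_M30_epochs2000_relu.npz')
We = npz['arr_0']                               # learned V x D embedding matrix
sentences, word2idx = get_robert_frost()
idx2word = {i: w for w, i in word2idx.items()}

def neighbors(word, k=5):
    v0 = We[word2idx[word]]
    dists = np.sum((We - v0)**2, axis=1)        # squared Euclidean distance to every row
    return [idx2word[i] for i in np.argsort(dists)[1:k+1]]   # skip the word itself

print(neighbors('night'))                       # pick any word that appears in the poems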
We are now going to move on to a different type of problem where RNNs can be applied: classifying poetry. Specifically, we are going to try to discriminate between Robert Frost and Edgar Allan Poe poems, given only sequences of parts-of-speech tags.
Parts-of-speech (POS) tags are labels like noun, pronoun, adverb, determiner, etc., that tell us the role of each word in a sentence. So, given a sentence (a sequence of words), we can get a sequence of POS tags of the same length. To do this we will use a library called NLTK.
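As a quick illustration of what NLTK gives us (the exact tags depend on the tagger that ships with your NLTK version, so treat the output as indicative):

from nltk import pos_tag, word_tokenize

print(pos_tag(word_tokenize("I love dogs and cats")))
# e.g. [('I', 'PRP'), ('love', 'VBP'), ('dogs', 'NNS'), ('and', 'CC'), ('cats', 'NNS')]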
Once we have converted our sequences of words into sequences of POS tags, we do the same thing that we did in the previous exercise: we turn each sequence of POS tags into a sequence of indexes that represent those tags. We will add some caching ability, since NLTK's POS tagging and word tokenization are extremely slow, and we will be able to specify the number of desired samples per class. The preprocessing function simply saves the data, targets, and vocabulary size to a numpy blob.
Note that we don't actually need to keep the POS-tag-to-index mapping around, since we don't care what the actual POS tags are; we just need to be able to differentiate them in order to do classification. Also, keep in mind that the vocabulary will be much smaller here, since there aren't that many POS tags.
Now let's look at the actual classifier itself. Again, it is slightly different from what we built before. The key differences are as follows:
- There is no word embedding matrix; the input-to-hidden weights $W_x$ are a $V \times M$ matrix that is indexed directly by the POS-tag index.
- This is a supervised problem, so fit takes both $X$ and the labels $Y$ (0 for Robert Frost, 1 for Edgar Allan Poe), and we hold out a small validation set.
- We only care about the prediction at the final time step, so the cost and the predicted class are computed from the last softmax output of the sequence.
One final note: we will be using a variable learning rate to prevent the cost function from jumping around in the later epochs. To do this, we simply multiply the learning rate by a factor slightly less than one as training progresses:
learning_rate *= 0.9999
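To get a sense of how gentle this schedule is: after $n$ applications of the rule, the learning rate is the original value times $0.9999^n$, so for example:

lr = 1e-6
print(lr * 0.9999**1000)    # ~9.05e-07: roughly a 10% reduction after 1000 updates

With the starting learning rate of 1e-6 used below, the schedule only nudges the rate down slowly.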
import os
import numpy as np
from nltk import pos_tag, word_tokenize
def get_tags(s):
tuples = pos_tag(word_tokenize(s))
return [y for x, y in tuples]
def get_poetry_classifier_data(samples_per_class, loaded_cached=True, save_cached=True):
datafile = 'poetry_classifier.npz'
if loaded_cached and os.path.exists(datafile):
npz = np.load(datafile)
X = npz['arr_0'] # Data
Y = npz['arr_1'] # Targets, 0 or 1
V = int(npz['arr_2']) # Vocabulary size
return X, Y, V
word2idx = {}
current_idx = 0
X = []
Y = []
for fn, label in zip(('../../data/poems/robert_frost.txt', '../../data/poems/edgar_allan_poe.txt'), (0,1 )):
count = 0
for line in open(fn):
line = line.rstrip()
if line:
print(line)
tokens = get_tags(line)
if len(tokens) > 1:
for token in tokens:
if token not in word2idx:
word2idx[token] = current_idx
current_idx += 1
sequence = np.array([word2idx[w] for w in tokens])
X.append(sequence)
Y.append(label)
count += 1
print(count)
if count >= samples_per_class:
break
if save_cached:
np.savez(datafile, X, Y, current_idx)
return X, Y, current_idx
import numpy as np
import theano
import theano.tensor as T
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
from rnn_util import init_weight, get_poetry_classifier_data
class SimpleRNN:
def __init__(self, M, V):
self.M = M
self.V = V
def fit(self, X, Y, learning_rate=10e-1, mu=0.99, reg=1.0, activation=T.tanh, epochs=500, show_fig=False):
M = self.M
V = self.V
K = len(set(Y)) # Number of classes (here 2: Robert Frost vs. Edgar Allan Poe)
print('Number of unique parts of speech tags, V: ', V)
# Create validation set
X, Y = shuffle(X, Y)
Nvalid = 10
Xvalid, Yvalid = X[-Nvalid:], Y[-Nvalid:]
X, Y = X[:-Nvalid], Y[:-Nvalid]
N = len(X)
# Initialize weights, no word embedding here
Wx = init_weight(V, M)
Wh = init_weight(M, M)
bh = np.zeros(M)
h0 = np.zeros(M)
Wo = init_weight(M, K)
bo = np.zeros(K)
# To prevent repetition, going to utilize set function. This will set theano shared's and functions
thX, thY, py_x, prediction = self.set(Wx, Wh, bh, h0, Wo, bo, activation)
cost = -T.mean(T.log(py_x[thY]))
grads = T.grad(cost, self.params)
dparams = [theano.shared(p.get_value() * 0) for p in self.params]
lr = T.scalar('learning_rate') # Symbolic adaptive/variable learning rate
# Update params first with momentum and variable learning rate, then update momentum params
updates = [
(p, p + mu*dp - lr*g) for p, dp, g in zip(self.params, dparams, grads)
] + [
(dp, mu*dp - lr*g) for dp, g in zip(dparams, grads)
]
# Define train op
self.train_op = theano.function(
inputs=[thX, thY, lr],
outputs=[cost, prediction],
updates=updates,
allow_input_downcast=True,
)
# Main Loop
costs = []
for i in range(epochs):
X, Y = shuffle(X, Y)
n_correct = 0
cost = 0
for j in range(N): # Perform stochastic gradient descent
c, p = self.train_op(X[j], Y[j], learning_rate)
cost += c
if p == Y[j]:
n_correct += 1
learning_rate *= 0.9999 # Adaptive learning rate. Update at the end of each iteration.
# Calculate Validation Accuracy
n_correct_valid = 0
for j in range(Nvalid):
p = self.predict_op(Xvalid[j])
if p == Yvalid[j]:
n_correct_valid += 1
if i % 50 == 0:
print('i: ', i, 'cost: ', cost, 'correct rate: ', (float(n_correct)/N), end=' ')
print('Validation correct rate: ', (float(n_correct_valid)/Nvalid))
costs.append(cost)
if show_fig:
plt.plot(costs)
plt.show()
def save(self, filename):
np.savez(filename, *[p.get_value() for p in self.params])
@staticmethod
def load(filename, activation):
# TODO: would prefer to save activation to file too
npz = np.load(filename)
Wx = npz['arr_0']
Wh = npz['arr_1']
bh = npz['arr_2']
h0 = npz['arr_3']
Wo = npz['arr_4']
bo = npz['arr_5']
V, M = Wx.shape
rnn = SimpleRNN(M, V)
rnn.set(Wx, Wh, bh, h0, Wo, bo, activation)
return rnn
def set(self, Wx, Wh, bh, h0, Wo, bo, activation):
self.f = activation
self.Wx = theano.shared(Wx)
self.Wh = theano.shared(Wh)
self.bh = theano.shared(bh)
self.h0 = theano.shared(h0)
self.Wo = theano.shared(Wo)
self.bo = theano.shared(bo)
self.params = [self.Wx, self.Wh, self.bh, self.h0, self.Wo, self.bo]
thX = T.ivector('X')
thY = T.iscalar('Y')
def recurrence(x_t, h_t1):
"""self.Wx is not dotted with x_t in this case because we can improve the efficiency of the
operation by simply selecting the row of Wx that corresponds to the pos tag index represented
by x_t. This will end up being identical to performing the dot product in this case."""
h_t = self.f(self.Wx[x_t] + h_t1.dot(self.Wh) + self.bh)
y_t = T.nnet.softmax(h_t.dot(self.Wo) + self.bo)
return h_t, y_t
[h, y], _ = theano.scan(
fn=recurrence,
outputs_info=[self.h0, None],
sequences=thX,
n_steps=thX.shape[0],
)
py_x = y[-1, 0, :] # Only want last element of sequence -> the final prediction
prediction = T.argmax(py_x)
self.predict_op = theano.function(
inputs=[thX],
outputs=prediction,
allow_input_downcast=True,
)
return thX, thY, py_x, prediction
def train_poetry():
X, Y, V = get_poetry_classifier_data(samples_per_class=500)
rnn = SimpleRNN(30, V)
rnn.fit(X, Y, learning_rate=1e-6, show_fig=True, activation=T.nnet.relu, epochs=1000)
if __name__ == '__main__':
    train_poetry()