In the introduction section we went over some very basic NLP techniques, and more importantly gained an understanding for how to apply basic machine learning models and APIs to the a problem revolving around text data.
One of the things that we went over was the term-document matrix, and how it could be utilized in our process of converting text data to numerical data, so a model can understand it. Well, one of the things that we did not discuss is there is a slight problem with that method, namely its simple counting process. One of the things that tends to happen is that words such as "a", "the", "and", "in", "to", etc, have a high count for ALL documents, no matter what the category is! This is a very large amount of noise that will often overshadow the meaningful words.
These words are known as stopwords and one common technique is to just remove them from the dataset before doing any machine learning.
However, there is another technique that we can utilize: Term Frequency-Inverse Document Frequency, (TF-IDF). We will not go into the full details of TF-IDF, however, the jist is as follows: We know that words that appear in many documents are probably less meaningful. With this in mind, we can weight each vector component (in this case a word) by something related to how many documents that word appears in. So, intuitively speaking, we may do something like:
$$\frac{\text{raw word count}}{\text{document count}}$$So, the numerator tells us how many times does this word appear in this document, and the denominator tells us how many documents does this word appear in, in total. Now, in practice we do some transformations on these, like taking the log count, smoothing, and so on. However, the specific implementation isn't nearly as important for this course as is the general understand behind the process.
One of the most important things to keep in mind during all subsequent posts is that no matter what technique we are using, we are always interested in a matrix of size $(V x D)$, where $V$ is the vocabulary size (the number of total words), and $D$ is the vector dimensionality, which if we are doing something like counting up the total number of times a word appears in a set of books, $D$ is the total number of books.
A final thing to note, we are going to encounter the term Word-Embedding quite a bit. This is just a fancy word for an old and relatively straight forward concept. A word-embedding is just a fancy name for a feature vector that represents a word. In other words, we can take a categorical object-a word in this case-and then map this object to a list of numbers (in other words, a vector). We say that we have embedded this word into a vector space, and that is why we call them word embeddings.
One of the most popular applications of word embeddings is word analogies. This is where the famous king - man = queen - woman comes from.
So, we are now going to focus on two main questions:
First, however, a few examples. We can start with:
King - Queen ~= Prince - Princess
Above on each side, we have a male member of the royal family minus a female member of the royal family, who is from the same generation.
France - Paris ~= Germany - Berlin
Now, on each side we have a country minus a famous city from that country.
Japan - Japanese ~= China - Chinese
Here, we have a country minus the term used to refer to the people of that country.
Brother - Sister ~= Uncle - Aunt
Now, we have a male member of the family, subtracting a close female member of the family.
Walk - Walking ~= Swim - Swimming
Finally, we can see that we were able to learn something about verb tense.
So, how can we actually visualize these analogies? As usual, it is very helpful to think of things geometrically. First, recall that word embedding just means word vector. In other words, if we have a grid, each word is just represented by a dot on the grid. So, what will happen when subtracting one vector from another vector? Well, of course that will just yield the vector between the two vectors. If we can say that the difference of the two vectors on the left is approximately equal to the two vectors on the right, then we know that the two difference vectors are approximately the same.
Now, we know that a vector has two components: a direction and a magnitude. So, when we say that these two vectors are approximately the same, what we are really saying is that their magnitude and direction are very close to one another. This can be visualized below.
With that said, how do we actually we find word analogies? Well, we know that their are four words in every analogy. So, what we can do is take three of these words and try to find the fourth word. So, we have an input of three words, and an output of the fourth word. Notice that we because we are dealing entirely with vectors, we can just rearrange our equation as follows:
$$\text{King - Man = ? - Woman}$$$$\downarrow$$$$\text{King - Man + Woman = ?}$$We know that the $?$ is representing Queen, however we will refer to it as SomeVector for the time being.
So, our job is to find the word that is most closely associated with SomeVector. We will do this utilizing distance. In pseudocode this may look like:
closest_distance = infinity
best_word = None
test_vector = king - man + woman
for word, vector in vocabulary:
    distance = get_distance(test_vector, vector)
    if distance < closest_distance:
        closest_distance = distance
        best_word = wordNote that utilizing a for loop will be very slow, and we will of course want to vectorize this process utilizing numpy.
Now, we did not define get_distance in the above pseudocode. We have a variety of options when deciding how to calculate distance. Sometimes, we will simply use Euclidean Distance:
It is also common to use the cosine distance:
$$\text{Cosine Distance: } cosine\_distance(a, b) = \frac{1 - a^Tb}{||a|| \; ||b||}$$In this later form, since only the angle matters, because:
$$a^Tb = ||a|| \; ||b|| cos(a,b)$$During training we normalize all of the word vectors so that their length is 1:
$$cos(0) = 1, \; cos(90) = 0, \; cos(180) = -1$$When two vectors are closer, $cos(\theta)$ is bigger. So, we want our distance to be:
$$\text{Distance} = 1 - cos(\theta)$$At this point we can say that all of the word embeddings lie on the unit sphere.
One pretty interesting fact about neural word embedding algorithms is that they can find these analogies at all. Once we have covered these algorithms, specifically word2vec and GloVe, it will become clear that these algorithms have no concept of analogies; in other words, what they want to optimize is totally unrelated to word analogies. So, the fact that word analogies suddenly emerge out of the training process is very intruiging. Remember, in all cases we are still dealing with a $VxD$ word embedding matrix. This is the case whether we are just using raw word counts, TF-IDF, or word2vec. Yet, raw word counts and TF-IDF do not give us good analogies. So, the fact that good word analogies emerge from the model and training process of word2vec is a very interesting research area.
We are now going to take a look at some pretrained word vectors found by researchers at Stanford, who also created the GloVe algorithm. We will be studying GloVe shortly, but I want to to give some motivation and intuition before we do. We start with GloVe and not Word2Vec because the vectors are provided in plain text, and with a smaller vocabulary than the pretrained Word2Vec vectors. This has some advantages:
The pretrained word vectors can be found here.
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances
def dist1(a, b):
    return np.linalg.norm(a - b)
def dist2(a, b):
    return 1 - a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))
# Select a distance type
dist, metric = dist2, 'cosine'
# dist, metric = dist1, 'euclidean'
def find_analogies(w1, w2, w3):
    """Slower implementation."""
    for w in (w1, w2, w3):
        if w not in word2vec:
            print(f'{w} not in dictionary.')
            return
    king = word2vec[w1]
    man = word2vec[w2]
    woman = word2vec[w3]
    v0 = king - man + woman
    
    min_dist = float('inf')
    best_word = ''
    for word, v1 in items(word2vec):
        if word not in (w1, w2, w3):
            d = dist(v0, v1)
            if d < min_dist:
                min_dist = d
                best_word = word
    print(w1, '-', w2, '=', best_word, '-', w3)
    
def find_analogies(w1, w2, w3):
    """Faster implementation."""
    for w in (w1, w2, w3):
        if w not in word2vec:
            print("%s not in dictionary" % w)
            return
    king = word2vec[w1]
    man = word2vec[w2]
    woman = word2vec[w3]
    v0 = king - man + woman
    # Utilize scikit-learn pairwise_distances
    distances = pairwise_distances(v0.reshape(1, D), embedding, metric=metric).reshape(V)
    idxs = distances.argsort()[:4]
    for idx in idxs:
        word = idx2word[idx]
        if word not in (w1, w2, w3): 
            best_word = word
            break
    return (f'{w1} - {w2} = {best_word} - {w3}')    
def nearest_neighbors(w, n=5):
    if w not in word2vec:
        print(f'{w} not in dictionary')
        return
    v = word2vec[w]
    distances = pairwise_distances(v.reshape(1, D), embedding, metric=metric).reshape(V)
    idxs = distances.argsort()[1:n+1]
    print(f'Neighbors of: {w}')
    
    for idx in idxs:
        print(f'\t {idx2word[idx]}')
    return ''
Now we can load in the pretrained word vectors:
print('Loading word vectors...')
word2vec = {}
embedding = []
idx2word = []
with open('../../data/nlp/glove.6B/glove.6B.50d.txt', encoding='utf-8') as f:
    # Is just a space-separated text file in the format:
    #  - word vec[0] vec[1] vec[2] ...
    for line in f:
        values = line.split()
        word = values[0]
        vec = np.asarray(values[1:], dtype='float32')
        word2vec[word] = vec
        embedding.append(vec)
        idx2word.append(word)
print('Found %s word vectors.' % len(word2vec))
embedding = np.array(embedding)
V, D = embedding.shape
print(find_analogies('king', 'man', 'woman'))
print(find_analogies('france', 'paris', 'london'))
print(find_analogies('france', 'paris', 'rome'))
print(find_analogies('paris', 'france', 'italy'))
print(find_analogies('france', 'french', 'english'))
print(find_analogies('japan', 'japanese', 'chinese'))
print(find_analogies('japan', 'japanese', 'italian'))
print(find_analogies('japan', 'japanese', 'australian'))
print(find_analogies('december', 'november', 'june'))
print(find_analogies('miami', 'florida', 'texas'))
print(find_analogies('einstein', 'scientist', 'painter'))
print(find_analogies('china', 'rice', 'bread'))
print(find_analogies('man', 'woman', 'she'))
print(find_analogies('man', 'woman', 'aunt'))
print(find_analogies('man', 'woman', 'sister'))
print(find_analogies('man', 'woman', 'wife'))
print(find_analogies('man', 'woman', 'actress'))
print(find_analogies('man', 'woman', 'mother'))
print(find_analogies('heir', 'heiress', 'princess'))
print(find_analogies('nephew', 'niece', 'aunt'))
print(find_analogies('france', 'paris', 'tokyo'))
print(find_analogies('france', 'paris', 'beijing'))
print(find_analogies('february', 'january', 'november'))
print(find_analogies('france', 'paris', 'rome'))
print(find_analogies('paris', 'france', 'italy'))
print(nearest_neighbors('king'))
print(nearest_neighbors('france'))
print(nearest_neighbors('japan'))
print(nearest_neighbors('einstein'))
print(nearest_neighbors('woman'))
print(nearest_neighbors('nephew'))
print(nearest_neighbors('february'))
print(nearest_neighbors('rome'))
So, from the above we can see that for the most part, these vectors make a lot of sense and could prove to be very powerful!
We are now going to see how we could apply pretrained word vectors in order to do text classification. Unlike the remaining walkthroughs in section 2, we are going to use a bag of words model. What this means is that toy dog would be represented in the same way as dog toy, even though we know they have different meanings. In other words, we do not consider the order of the words. A few examples of bag of words that we have seen already are word counting and TF-IDF.
So, how can we apply word vectors (such as what we would get from word2vec and GloVe) to bag-of-words. Well, it is quite simple; we know that we would like to represent a sentence, or document, as a single vector. First, for each word in the sentence, we can use our list of word vectors to get a vector. Then, we can just take the average of all of those word vectors; in other words, we add up all the word vectors, and divide by the number of vectors.
For example, let's say that our sentence is:
"I like eggs."
Our feature vector would then be:
Feature_vector = [vec('I') + vec('like') + vec('eggs')] / 3Mathematically we can write this as:
$$feature = \frac{1}{|sentence|} \sum_{w \in sentence} vec(w)$$At this point we are able to convert each document in our data set into a single feature vector, now what? Well, it is actually very simple! We can take this representation and plug it into any supervised machine learning classifier, such as Naive Bayes, Logistic Regression, Random Forest, Extra Trees (found in Scikit-Learn), etc! This is due to the standard classifcation api that scikit learn provides, namely:
model = Model()
model.fit(X_train,Y_train)
model.score(X_test, Y_test)One thing that you may have realized at this point is that in order for this work, our feature vectors must be good, or else we won't get an acceptable classification accuracy. In order for our feature vectors to be good feature vectors, the individual word vectors that are dependent on must make sense! So, this process of performing classification is actually a method use to evaluate how good a word embedding algorithm is!
There are other methods for evaluating how good a word embedding algorithm is, such as word analogies and similarity. Of course, these evaluations are just a proxy for how well each algorithm will work on your particular problem with your particular dataset.
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from gensim.models import KeyedVectors
train = pd.read_csv('../../data/nlp/r8-train-all-terms.txt', header=None, sep='\t')
test = pd.read_csv('../../data/nlp/r8-test-all-terms.txt', header=None, sep='\t')
train.columns = ['label', 'content']
test.columns = ['label', 'content']
class GloveVectorizer:
    """Glove Vectorizer Fit Transform class"""
    def __init__(self):
        print('Loading word vectors...')
        word2vec = {}
        embedding = []
        idx2word = []
        with open('../../data/nlp/glove.6B/glove.6B.50d.txt', encoding='utf-8') as f:
            # Is just a space-separated text file in the format:
            #  - word vec[0] vec[1] vec[2] ...
            for line in f:
                values = line.split()
                word = values[0]
                vec = np.asarray(values[1:], dtype='float32')
                word2vec[word] = vec
                embedding.append(vec)
                idx2word.append(word)
        print(f'Found {len(word2vec)} word vectors.')
        
        self.word2vec = word2vec
        self.embedding = np.array(embedding)
        self.word2idx = {v:k for k,v in enumerate(idx2word)}
        self.V, self.D = self.embedding.shape
        
    def fit(self, data):
        pass
    
    def transform(self, data):
        X = np.zeros((len(data), self.D)) # initialize data matrix X
        n = 0  # index's data
        emptycount = 0 # how many sentences had words we coudn't find vectors for. 
        for sentence in data: # Loop through each sentence in the data
            tokens = sentence.lower().split()
            vecs = [] # stores all word vectors we encounter for this document (sentence)
            for word in tokens: # Loop through each words
                if word in self.word2vec:  # if word is in vocabularly, append its vector to vecs
                    vec = self.word2vec[word]
                    vecs.append(vec)
            if len(vecs) > 0:            # Check if vecs has any vectors. If yes, assign mean to X[n]
                vecs = np.array(vecs)
                X[n] = vecs.mean(axis=0)
            else:
                emptycount += 1
            n += 1
        print(f'Number of samples with no words found: {emptycount} / {len(data)}')
        return X
    
    def fit_transform(self, data):
        self.fit(data)
        return self.transform(data)        
vectorizer = GloveVectorizer() # Instantiate vectorizer
Xtrain = vectorizer.fit_transform(train.content)
Ytrain = train.label
Xtest = vectorizer.transform(test.content)
Ytest = test.label
# create the model, train it, print scores
model = RandomForestClassifier(n_estimators=200)
model.fit(Xtrain, Ytrain)
print("train score:", model.score(Xtrain, Ytrain))
print("test score:", model.score(Xtest, Ytest))
We can see that our glove vectors ended up performing well in the final Random Forest classification! This is a useful metric to show that our individual word vectors worked well.