This post is about language modeling and its relation to neural networks. We will start with what may very well be the simplest task possible: creating a bigram language model.
To start, in case it is not clear, what exactly is a language model? A language model is a model of the probability of a sequence of words. In English, we refer to a sequence of words as a sentence. So, for example, if we had the sentence:
The quick brown fox jumps over the lazy dog.
A language model will allow us to calculate:
$$p\Big(\text{The quick brown fox jumps over the lazy dog.}\Big)$$The form that the above probability distribution takes is what makes up the model, and typically that is going to involve making some assumptions about the structure of language and how sentences are formed.
I want to take a moment to be very clear here and describe exactly what a model is. A model tries to capture some real-world phenomenon (e.g. language, motion, finances, etc.), and it will never be 100% correct. It always makes simplifying assumptions. The idea is that these assumptions will be correct most of the time, but some of the time they will not be. For example, Newtonian mechanics was shown to be incorrect (based on the work of Einstein), but it still proves to be useful!
A bigram is simply two consecutive words in a sentence. So, from our above example, the bigrams would be:
The quick
quick brown
brown fox
fox jumps
jumps over
over the
the lazy
lazy dog
We could also have trigrams and n-grams, which deal with three and $n$ consecutive words respectively. In terms of the bigram model, we are going to be modeling each bigram as a probability:
$$\text{Bigram model:} \quad p\big(w_t \mid w_{t-1}\big)$$So for example, we could have:
$$p\big(brown \mid quick\big) = 0.5$$$$p\big(the \mid the\big) = 0$$The first statement is saying: the probability of seeing the word brown, given that we just saw the word quick, is 0.5. Now, how do we actually find these probabilities? We just count! So, to find that $p\big(brown \mid quick\big) = 0.5$, we would simply count up how many times $quick \rightarrow brown$ appears in our documents, count up how many times $quick$ appears, and then divide the former by the latter. This gives us the maximum likelihood estimate:
$$p\big(brown \mid quick\big) = \frac{count(quick \rightarrow brown)}{count(quick)}$$
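To make this concrete, here is a minimal sketch of that counting procedure on a tiny made-up corpus (the corpus and variable names below are purely illustrative):
from collections import defaultdict

# A tiny made-up corpus, purely for illustration
toy_corpus = [
    ["the", "quick", "brown", "fox"],
    ["the", "quick", "red", "fox"],
]

bigram_counts = defaultdict(int)
unigram_counts = defaultdict(int)
for sentence in toy_corpus:
    for word in sentence:
        unigram_counts[word] += 1
    for prev, curr in zip(sentence[:-1], sentence[1:]):
        bigram_counts[(prev, curr)] += 1

# p(brown | quick) = count(quick -> brown) / count(quick) = 1 / 2
print(bigram_counts[("quick", "brown")] / unigram_counts["quick"])  # 0.5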
I would like to clarify what I mean when I refer to documents. Generally speaking, we are going to have some training data - a list of example sentences - to create our model. For our purposes, we will mostly be using the Brown corpus that ships with NLTK, but in general documents just refer to a set of files that contain sentences. This is sometimes called a corpus.
Returning to the idea of a language model, recall that we want to know the probability of an entire sentence. How can bigrams help us do that? As per our previous discussion, this is going to involve making some assumptions. Let's look at a simpler example, specifically the sentence:
I like dogs.
Our goal is to find $p\big(I, like, dogs\big)$. Well, we can apply the rules of conditional probability here. As a quick refresher, recall that:
$$p\big(A \mid B\big) = \frac{p \big(A \cap B \big)}{p\big(B\big)}$$$$p \big(A \cap B \big) = p\big(A \mid B\big) p\big(B\big)$$This can of course be extended to 3 variables like so:
$$p \big(A, B, C \big) = p \big(C \mid A, B\big) p\big(A, B\big)$$Which, in the case of our sentence above would leave us with:
$$p \big(I, like, dogs \big) = p \big(dogs \mid I, like\big) p\big(I, like\big)$$We can apply this rule of conditional probability yet again to the joint probability $p\big(I, like\big)$, resulting in:
$$p \big(I, like, dogs \big) = p \big(dogs \mid I, like\big) p\big(like \mid I\big) p \big(I\big)$$This could simply be continued if we had a longer sentence. This process is known as the chain rule of probability. So, why is the above important? Well, if we look at our resulting expression, we can see that the middle term, $p\big(like \mid I\big)$, is exactly a bigram probability.
The two other terms are not bigrams, but that is okay! We can still calculate them using maximum likelihood estimation. For the unigram $p(I)$, this is simply the number of times $I$ appears in the corpus, relative to the total corpus length:
$$p\big(I\big) = \frac{count(I)}{corpus \; length }$$For the trigram $p \big(dogs \mid I, like\big)$, we would need to perform the following counts:
$$p \big(dogs \mid I, like\big) = \frac{count(I, like, dogs)}{count(I, like)}$$We can extend the above logic to sentences of any length. So, if we were dealing with a sentence:
A B C D E
We could model it as:
$$p\big( A, B, C, D, E\big) = p\big(E \mid A, B, C, D\big) p\big( D \mid A, B, C\big) p\big( C \mid A, B\big) p\big( B \mid A \big) p\big( A\big)$$Note, above we are using commas to separate our words, which makes it look like a joint probability. However, we must keep in mind that we are looking at a sequence, and not simply a joint probability. With that said, what we should be taking away from the above equation is that modeling these $n$-grams will at some point become problematic. For example, if we return to our original sentence:
The quick brown fox jumps over the lazy dog.
Perhaps this is the only sentence like this in our corpus! We know that:
The quick brown fox jumps over the lazy cat.
Is a valid and reasonable sentence. However, if it never shows up in our corpus, its maximum likelihood probability is 0. Zero is not an accurate probability in this case, since we know that the sentence makes sense, and that our language model should allow for it.
One simple way to overcome these zero probabilities is to add a small number to each count, instead of performing vanilla maximum-likelihood counting. For instance, if we have a vocabulary size of $V$, our probability would look like:
$$p_{smooth}\big(B \mid A\big) = \frac{count(A \rightarrow B) + 1}{count(A) + V}$$We add $V$ to the denominator to ensure that our probabilities sum to one. This process ensures that even if a phrase does not appear in our corpus, it still has a small probability of occurring.
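As a rough sketch, the smoothed estimate can be computed from the same kind of count dictionaries used in the earlier toy example (the dictionary names are again hypothetical):
def smoothed_bigram_prob(prev_word, word, bigram_counts, unigram_counts, V):
    # Add-one smoothing: (count(prev -> word) + 1) / (count(prev) + V)
    # Unseen bigrams now get a small but non-zero probability
    return (bigram_counts.get((prev_word, word), 0) + 1) / (unigram_counts.get(prev_word, 0) + V)
With the counts from the earlier toy corpus and a vocabulary of size $V = 5$, $p_{smooth}\big(brown \mid quick\big)$ would be $(1 + 1)/(2 + 5) \approx 0.29$ instead of $0.5$.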
Another thing we can do is make the Markov assumption: that whatever word we see now depends only on the word we saw in the previous step. Mathematically, this looks like:
$$p\big( w_t \mid w_{t-1}, w_{t-2}, ... , w_1 \big) = p \big(w_t \mid w_{t-1}\big)$$This is known as a first-order Markov assumption, because it only depends on one previous term. For example, in our previous situation when modeling A B C D E
we ended up with:
$$p\big( A, B, C, D, E\big) = p\big(E \mid A, B, C, D\big) p\big( D \mid A, B, C\big) p\big( C \mid A, B\big) p\big( B \mid A \big) p\big( A\big)$$If we make the Markov assumption, the first term on the right reduces to:
$$p\big(E \mid A, B, C, D\big) = p\big(E \mid D\big)$$The entire sentence would be reduced to:
$$p\big( A, B, C, D, E\big) = p\big(E \mid D\big) p\big( D \mid C\big) p\big( C \mid B\big) p\big( B \mid A \big) p\big( A\big)$$We end up with a probability consisting entirely of bigrams and one unigram! Why is this important? Well, we need to keep in mind that the longer a sentence, the less likely it is to appear in our training data. This is because our training data makes up only a tiny fraction of the entire space of possible sentences. However, very short phrases like bigrams are going to be very common. So, while phrases such as:
The quick brown fox jumps over the lazy cat.
The quick brown fox jumps over the lazy lizard.
Are not likely to appear in our corpus, short phrases such as "lazy cat" and "lazy lizard" most likely do. Hence, it is much easier to estimate the probability of "lazy lizard" than it is to estimate the probability of "The quick brown fox jumps over the lazy lizard." This, in turn, lets us assign a sensible, non-zero probability to the full sentence.
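To make this concrete, under the first-order Markov assumption the probability of our running example factors into a single unigram and a chain of bigrams, each of which is easy to estimate by counting:
$$p\big(\text{The quick brown fox jumps over the lazy dog}\big) \approx p\big(The\big) \, p\big(quick \mid The\big) \, p\big(brown \mid quick\big) \cdots p\big(dog \mid lazy\big)$$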
We are about to create a bigram language model in code using NLTK, but before we do there are a few things to consider. First, we know that probabilities are always between 0 and 1, and that the full joint probability of our bigram model is just the multiplication of each bigram probability in the sentence:
$$p\big(w_1,...,w_T \big) = p\big(w_1\big) \prod_{t=2}^T p \big( w_t \mid w_{t-1}\big)$$We also know that multiplying two numbers less than one together will always yield a smaller number. The result is that if we just keep multiplying probabilities together, we may encounter the underflow problem, which means that we hit the limit of numerical precision that our computer can handle, and it will just round down to 0. The solution to this is to use the log probability instead:
$$\log p\big(w_1,...,w_T \big) = \log p\big(w_1\big) + \sum_{t=2}^T \log p \big( w_t \mid w_{t-1}\big)$$We can use this because the log function is monotonically increasing, so if $A > B$ then $\log(A) > \log(B)$. The other thing that we are going to want to do is normalize by sentence length. Since probabilities are between 0 and 1, log probabilities are always negative. Hence, the longer our sentences, the more negative numbers we are going to add together. This means that if we compare raw log probabilities, there is always going to be a bias towards shorter sentences. Shorter sentences will always have a higher log probability, simply because they have fewer negative numbers to add together. For example:
$$\log p\big( \text{the the the} \big) > \log p\big( \text{A real, but much longer sentence than the one to the left} \big)$$To solve this, we can simply compare the log probabilities divided by the length of the sentence, $T$:
$$\frac{1}{T}\log p\big(w_1,...,w_T \big) = \frac{1}{T} \Big[ \log p\big(w_1\big) + \sum_{t=2}^T \log p \big( w_t \mid w_{t-1}\big) \Big]$$We will start with our imports:
import numpy as np
import operator
from nltk.corpus import brown
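Before loading any data, here is a quick illustration of the underflow problem described above (the value 0.1 and the count of 400 factors are arbitrary; they are simply small enough and numerous enough to underflow a 64-bit float):
p = 0.1
print(p ** 400)         # 0.0 -- the raw product underflows to zero
print(400 * np.log(p))  # roughly -921.03 -- the log-space sum is perfectly representable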
Next, we will write a few functions to load our data:
def get_sentences():
    """Returns the 57,340 sentences of the Brown corpus. Each sentence is a list of individual string tokens."""
    return brown.sents()
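Note that if the Brown corpus is not already on your machine, NLTK will raise a lookup error when brown.sents() is called; in that case you will likely need to download it once first:
import nltk
nltk.download('brown')  # one-time download of the Brown corpus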
def get_sentences_with_word2idx():
    """Converts sentences from word representation to index representation.
    Assigns a unique integer, starting from 0, to every word that appears in the corpus.
    Returns the indexed sentences along with a dictionary mapping every word to its index."""
    sentences = get_sentences()
    indexed_sentences = []
    i = 2
    word2idx = {'START': 0, 'END': 1}
    for sentence in sentences:
        indexed_sentence = []
        for token in sentence:
            token = token.lower()
            if token not in word2idx:
                word2idx[token] = i
                i += 1
            indexed_sentence.append(word2idx[token])
        indexed_sentences.append(indexed_sentence)
    print('Vocabulary size:', i)
    return indexed_sentences, word2idx
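As a quick sanity check (the expected output shown in the comments is only a guess at the exact indexes), we can round-trip the first sentence through the mapping:
indexed_sentences, word2idx = get_sentences_with_word2idx()
idx2word = {v: k for k, v in word2idx.items()}
first = indexed_sentences[0]
print(first)                                 # e.g. [2, 3, 4, ...]
print(' '.join(idx2word[i] for i in first))  # the original sentence, lowercased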
KEEP_WORDS = set([
    'king', 'man', 'queen', 'woman',
    'italy', 'rome', 'france', 'paris',
    'london', 'britain', 'england',
])
def get_sentences_with_word2idx_limit_vocab(n_vocab=2000, keep_words=KEEP_WORDS):
    """Same as get_sentences_with_word2idx, but restricts the vocabulary to the
    n_vocab most common words (plus an UNKNOWN token), always keeping keep_words."""
    sentences = get_sentences()
    indexed_sentences = []
    i = 2
    word2idx = {'START': 0, 'END': 1}
    idx2word = ['START', 'END']
    word_idx_count = {
        0: float('inf'),
        1: float('inf'),
    }
    for sentence in sentences:
        indexed_sentence = []
        for token in sentence:
            token = token.lower()
            if token not in word2idx:
                idx2word.append(token)
                word2idx[token] = i
                i += 1
            # keep track of counts for later sorting
            idx = word2idx[token]
            word_idx_count[idx] = word_idx_count.get(idx, 0) + 1
            indexed_sentence.append(idx)
        indexed_sentences.append(indexed_sentence)

    # ---- Restrict vocab size ----
    # Set all the words that should be kept to infinity so that they are included when
    # we pick the most common words
    for word in keep_words:
        word_idx_count[word2idx[word]] = float('inf')

    # Sort the word-count dictionary by value, in descending order
    sorted_word_idx_count = sorted(word_idx_count.items(), key=operator.itemgetter(1), reverse=True)

    word2idx_small = {}
    new_idx = 0
    idx_new_idx_map = {}
    for idx, count in sorted_word_idx_count[:n_vocab]:
        word = idx2word[idx]
        word2idx_small[word] = new_idx
        idx_new_idx_map[idx] = new_idx
        new_idx += 1

    # let 'UNKNOWN' be the last token
    word2idx_small['UNKNOWN'] = new_idx
    unknown = new_idx

    assert('START' in word2idx_small)
    assert('END' in word2idx_small)
    for word in keep_words:
        assert(word in word2idx_small)

    # map old idx to new idx
    sentences_small = []
    for sentence in indexed_sentences:
        if len(sentence) > 1:
            new_sentence = [idx_new_idx_map[idx] if idx in idx_new_idx_map else unknown
                            for idx in sentence]
            sentences_small.append(new_sentence)
    return sentences_small, word2idx_small
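One thing to note, shown in the quick hypothetical check below, is that the returned vocabulary has n_vocab + 1 entries, because the UNKNOWN token is appended after the n_vocab most common words:
sentences_small, word2idx_small = get_sentences_with_word2idx_limit_vocab(n_vocab=2000)
assert len(word2idx_small) == 2001   # 2000 most common words + 'UNKNOWN'
assert 'START' in word2idx_small and 'END' in word2idx_small and 'UNKNOWN' in word2idx_small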
Now we can start with our language model:
def get_bigram_probs(sentences, V, start_idx, end_idx, smoothing=1):
    # Structure of the bigram probability matrix:
    # (last word, current word) -> probability
    # Initializing every count to `smoothing` implements add-epsilon smoothing
    # (add-one smoothing when smoothing=1)
    bigram_probs = np.ones((V, V)) * smoothing
    for sentence in sentences:
        for i in range(len(sentence)):
            if i == 0:
                # Beginning word: update START -> first word
                bigram_probs[start_idx, sentence[i]] += 1
            else:
                # Middle word: update last word -> current word
                bigram_probs[sentence[i-1], sentence[i]] += 1
            if i == len(sentence) - 1:
                # Final word: also update current word -> END token
                bigram_probs[sentence[i], end_idx] += 1
    # Normalize each row so that it is a valid probability distribution
    bigram_probs /= bigram_probs.sum(axis=1, keepdims=True)
    return bigram_probs
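As a quick sanity check on a tiny hand-made input (the token indexes below are arbitrary), every row of the returned matrix should sum to one, since row i is the conditional distribution over the current word given that the last word was i:
toy_sentences = [[2, 3], [2, 4]]   # two made-up indexed sentences
probs = get_bigram_probs(toy_sentences, V=5, start_idx=0, end_idx=1, smoothing=1)
assert np.allclose(probs.sum(axis=1), 1.0)
print(probs[2, 3], probs[2, 4])    # p(3 | 2) and p(4 | 2), which should be equal here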
if __name__ == '__main__':
    # Load in the data
    # Note: sentences are already converted to sequences of word indexes
    # Note: you can limit the vocab size if you run out of memory
    sentences, word2idx = get_sentences_with_word2idx_limit_vocab(10000)

    # Vocab size
    V = len(word2idx)
    print("Vocab size:", V)

    # Treat beginning of sentence and end of sentence as bigrams:
    # START -> first word
    # last word -> END
    start_idx = word2idx['START']
    end_idx = word2idx['END']

    # A matrix where:
    # - row = last word
    # - col = current word
    # value at [row, col] = p(current word | last word)
    bigram_probs = get_bigram_probs(sentences, V, start_idx, end_idx, smoothing=0.1)

    def get_score(sentence):
        """Return the length-normalized log probability of an indexed sentence."""
        score = 0
        for i in range(len(sentence)):
            if i == 0:
                # Beginning word
                score += np.log(bigram_probs[start_idx, sentence[i]])
            else:
                # Middle word
                score += np.log(bigram_probs[sentence[i-1], sentence[i]])
        # Final word -> END token
        score += np.log(bigram_probs[sentence[-1], end_idx])
        # Normalize by the number of terms (T words plus the END transition)
        return score / (len(sentence) + 1)

    # Map word indexes back to real words - helpful to display sentences
    idx2word = dict((v, k) for k, v in word2idx.items())

    def get_words(sentence):
        return ' '.join(idx2word[i] for i in sentence)

    # When we sample a fake sentence, we want to ensure we never sample the START or END token
    sample_probs = np.ones(V)
    sample_probs[start_idx] = 0
    sample_probs[end_idx] = 0
    sample_probs /= sample_probs.sum()

    # Test our model on real and fake sentences
    while True:
        # real sentence
        real_idx = np.random.choice(len(sentences))
        real = sentences[real_idx]

        # fake sentence: words sampled uniformly from the vocabulary (excluding START/END)
        fake = np.random.choice(V, size=len(real), p=sample_probs)

        print("REAL:", get_words(real), "SCORE:", get_score(real))
        print("FAKE:", get_words(fake), "SCORE:", get_score(fake))

        # input your own sentence
        custom = input("Enter your own sentence:\n")
        custom = custom.lower().split()

        # check that all tokens exist in word2idx (otherwise, we can't get a score)
        bad_sentence = False
        for token in custom:
            if token not in word2idx:
                bad_sentence = True

        if bad_sentence:
            print("Sorry, you entered words that are not in the vocabulary")
        else:
            # convert sentence into a list of indexes
            custom = [word2idx[token] for token in custom]
            print("SCORE:", get_score(custom))

        cont = input("Continue? [Y/n]")
        if cont and cont.lower() == 'n':
            break